AMD Announces Instinct MI200 Accelerator Family: Taking Servers to Exascale and Beyond
by Ryan Smith on November 8, 2021 11:39 AM EST
Posted in: AMD Instinct, Infinity Fabric, CDNA 2
AMD’s CDNA 2 Architecture: Doubling-Down on Double Precision
With the launch of the MI200 series comes the launch of AMD’s CDNA 2 GPU architecture. AMD has been very public on their server GPU roadmaps over the past couple of years (due in part to the Frontier supercomputer win), so we’ve known for a while that CDNA 2 was coming, and that it would be built on a post-7nm node. Still, until today, AMD has remained mum on its architectural specifics and general feature set.
So what’s new with CDNA 2? In short, what we’re looking at is a further refinement of CDNA (1), itself an extension of the GCN-based Vega architecture. GCN’s high ALU density and pure thread-level parallelism approach to processing has always mapped well to GPU compute, so while AMD has gone in other directions for its consumer graphics GPUs, for its server products it has stayed with a tried and true design. But as proven as GCN is, there are still multiple knobs AMD can play with to tune any given version of the architecture for the task at hand, and this is what they’ve done for CDNA 2.
Taking a quick look at an individual CU, we see virtually nothing has changed with respect to organization. CDNA 2 still packs 4 SIMD16 units into a CU, allowing it to process 64 vector operations per clock. The CUs also contain 4 Matrix Cores each, which are responsible for CDNA 2’s matrix/tensor processing capabilities. All of this, in turn, is paired with a scheduler, a shared L1 cache and load/store units, and the local data share.
What AMD’s diagrams don’t show is that the width of each ALU/shader core within a CU has been doubled. CDNA 2’s ALUs are 64-bit throughout, allowing the architecture to process FP64 operations at full speed. This is twice the rate of MI100, where FP64 operations ran at one-half the rate of FP32 operations.
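As a quick back-of-the-envelope check of our own (not an AMD-provided breakdown), the full-rate FP64 claim lines up with the headline numbers: 64 FP64 lanes per CU, each retiring one fused multiply-add (2 FLOPS) per clock, multiplied across MI250X’s 220 CUs at its 1.7GHz peak clock works out to roughly 47.9 TFLOPS of vector FP64.

```cpp
// Back-of-the-envelope FP64 vector peak for MI250X. Inputs are published
// specs (220 CUs, 1.7GHz peak engine clock) plus the per-CU figures
// described above; the arithmetic is ours, not AMD's.
#include <cstdio>

int main() {
    const double cus          = 220;    // CUs across both GCDs (MI250X)
    const double lanes_per_cu = 64;     // 4 x SIMD16, now 64-bit wide
    const double flops_per_op = 2;      // one FMA counts as 2 FLOPS
    const double clock_hz     = 1.7e9;  // peak engine clock

    double tflops = cus * lanes_per_cu * flops_per_op * clock_hz / 1e12;
    std::printf("Peak FP64 vector: %.1f TFLOPS\n", tflops);  // ~47.9
    return 0;
}
```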
| AMD GPU Throughput Rates (FLOPS/clock/CU) | CDNA 2 | CDNA (1) | Vega 20 |
|---|---|---|---|
| Packed FP32 Vector | 256 | N/A | N/A |
This boost to FP64 performance is one of the biggest differences between CDNA 2 and earlier architectures, and it reflects the supercomputer-focused goals of AMD’s design. Even with the rise of machine learning and matrix operations, a lot of data science still requires good ole’ double precision floats processed via SIMDs, and this is exactly what AMD has delivered. AMD has invested heavily in FP64 performance, and as a result MI200 is far, far faster in FP64 vector operations than any other AMD or NVIDIA GPU.
All of this underscores the different directions AMD and NVIDIA have gone; for Ampere, NVIDIA invested almost everything in matrix performance at the expense of significantly improving vector performance. AMD, on the other hand, has gone the other direction: overall throughput per matrix core hasn’t really improved, but vector SIMD performance has.
And it’s not just FP64 performance that AMD’s ALU update affects, either. As a result of the wider ALUs, it’s now possible to take advantage of this hardware with packed instructions. By packing two FP32 values into a single float2 vector, each ALU can process two FP32 FMA/FADD/FMUL instructions per clock cycle. Under ideal circumstances, this doubles CDNA 2’s per-CU vector FP32 performance as compared to CDNA (1), and at the accelerator level works out to over 4x the FP32 throughput of MI100.
With that said, like all packed instruction formats, packed FP32 isn’t free. Developers and libraries need to be coded to take advantage of it; packed operands need to be adjacent and aligned to even registers. For software being written specifically for the architecture (e.g. Frontier), this is easily enough done, but more portable software will need to be updated to take this into account. And it’s for that reason that AMD wisely still advertises its FP32 vector performance at the standard, unpacked rate (47.9 TFLOPS), rather than assuming the use of packed instructions.
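To make the packing requirement a bit more concrete, here is a minimal HIP sketch of our own (not AMD sample code): a SAXPY-style kernel that loads and stores float2 values so each thread carries two adjacent FP32 operands. Whether the compiler actually emits packed v_pk_fma_f32 instructions for the two FMAs depends on the target (gfx90a) and on the operands landing in suitably aligned register pairs, so treat this as an illustration of the data layout rather than a guaranteed 2x.

```cpp
#include <hip/hip_runtime.h>

// y = a*x + y over 2*n2 floats, processed two-at-a-time as float2.
// On CDNA 2, adjacent FP32 FMAs like these are candidates for the new
// packed (v_pk_fma_f32) instructions; the compiler makes the final call.
__global__ void saxpy_packed(int n2, float a,
                             const float2* __restrict__ x,
                             float2* __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        float2 xv = x[i];
        float2 yv = y[i];
        yv.x = fmaf(a, xv.x, yv.x);  // FP32 FMA, lane 0 of the pair
        yv.y = fmaf(a, xv.y, yv.y);  // FP32 FMA, lane 1 of the pair
        y[i] = yv;
    }
}
```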
Meanwhile, AMD has also given their Matrix Cores a very similar face-lift. First introduced in CDNA (1), the Matrix Cores are responsible for AMD’s matrix processing. And for CDNA 2, they’ve been expanded to allow full-speed FP64 matrix operations as well, bringing them up to the same 256 FLOPS/clock/CU rate as FP32 matrix operations, a 4x improvement over the old 64 FLOPS/clock/CU rate. Unlike the vector ALUs, however, there are no packed operations or other special cases to take advantage of, so this change doesn’t improve FP32/FP16 performance, which remains at 256 and 1024 FLOPS/clock/CU respectively.
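In practice, most code reaches the Matrix Cores through libraries rather than hand-written intrinsics. As a rough sketch of our own (with the caveat that header paths and handle setup vary by ROCm release), an FP64 GEMM routed through rocBLAS is all it takes on the application side; whether and how the library dispatches onto CDNA 2’s full-speed FP64 matrix path is up to rocBLAS itself.

```cpp
#include <rocblas/rocblas.h>  // older ROCm releases ship this as <rocblas.h>

// C = alpha*A*B + beta*C in double precision, column-major, on the GPU.
// dA, dB, dC are device pointers sized m*k, k*n, and m*n respectively.
void dgemm_fp64(rocblas_handle handle, int m, int n, int k,
                const double* dA, const double* dB, double* dC)
{
    const double alpha = 1.0;
    const double beta  = 0.0;
    rocblas_dgemm(handle,
                  rocblas_operation_none, rocblas_operation_none,
                  m, n, k,
                  &alpha,
                  dA, m,   // lda = m (column-major A)
                  dB, k,   // ldb = k
                  &beta,
                  dC, m);  // ldc = m
}
```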
AMD has also used this iteration of the CDNA architecture to promote bfloat16 to a full-speed format. Whereas it previously ran at half-speed on CDNA (1), on CDNA 2 it runs at full speed, or 1024 FLOPS/clock/CU.
Moving up a level, the general layout of CDNA 2 GPUs is largely unchanged from CDNA (1). The CUs are still divided into 4 Compute Engine groups, and fed by a quartet of Asynchronous Compute Engines (ACEs). As always, AMD obfuscates some of the finer details, but the basic organization is still CDNA as we know it.
Finally, while CDNA 2 is primarily compute-focused, not every last transistor is dedicated to the task. Like its predecessor, CDNA 2 also includes a new version of AMD’s Video Codec Next (VCN) block, which is responsible for video encode and decode acceleration. Judging from AMD’s comments on the matter, this would seem to be included more for decode acceleration than any kind of video encoding tasks, as fast decode capabilities are needed to unpack photos and videos for use in ML training. This version of the VCN does not support AV1, so the highest format supported for encode/decode is HEVC.
Overall, AMD’s architectural updates for CDNA 2 are relatively modest, but they reflect the kind of changes we’d expect to see for a GPU meant to win a supercomputer contract or three. AMD’s focus on FP64 performance – both vector and matrix – is a good fit for traditional science workloads, and the new packed FP32 instructions should prove useful in those environments as well. Past that, the bulk of AMD’s effort seems to have gone into I/O in order to support the extra 4 IF links, an improvement that should pay off particularly well in the kind of large server clusters and supercomputers AMD is going for.
An EMIB Alternative: Elevated Fanout Bridge 2.5D
Last but not least along our tour of AMD’s Instinct MI200, let’s talk about chip packaging. Besides AMD’s on-die innovations, the MI200 family will also come with an interesting packaging innovation to help connect the GCDs to their respective HBM2E memory stacks. For their new GPUs, AMD is relying on a technology called Elevated Fanout Bridge 2.5D, which takes fanout packaging to the next level by incorporating a small silicon bridge above the substrate itself to connect two dies.
Like all HBM users, AMD has to contend with the memory technology’s unique combination of high speeds and dense trace counts. HBM requires too many traces to be routed through standard substrates, so it needs to be routed through something better: silicon. The end result has been that the first generations of HBM products used chip-sized silicon interposers, which essentially placed both the HBM stacks and the GPU itself on top of another piece of silicon that has the necessary traces etched into it. Silicon interposers work well enough, but they need to be nearly as big as the entire chip, which is expensive, and the through-silicon vias (TSVs) make the whole thing less-than-pleasant to assemble.
HBM Circa 2014: Full-size Silicon Interposers
That in turn has led to more recent efforts to focus on installing a much smaller silicon bridge – ideally, something just big enough to cover the HBM-related contacts on both chips. Intel already has a solution to this, EMIB, which places a small bridge die within the substrate. However, besides EMIB being an Intel technology, it’s not without its own drawbacks, so AMD and its fab partners think they can do one better than that. And that one better thing is Elevated Fanout Bridge 2.5D.
So what makes Elevated Fanout Bridge 2.5D different? In short, EFB builds above the substrate, rather than inside it. In this case, the entire chip pair – the GPU and the HBM stack – is placed on top of a mold with a series of copper pillars in it. The copper pillars allow the coarse-pitched contacts on the chips to make contact with the substrate below in a traditional fashion. Meanwhile, below the high-precision, fine-pitched microbumps used for HBM, a silicon bridge is placed instead. The end result is that by raising the HBM and GPU, EFB creates room to fit the small silicon bridge without digging into the substrate.
Compared to a traditional interposer, such as what was used on the MI100, the benefits are obvious: even with the added steps of using EFB, it still avoids having to use a massive and complex silicon interposer. Meanwhile, compared to bridge-in-substrate solutions like EMIB, AMD claims that EFB is both cheaper and less complex. Since everything takes place above the substrate, no special substrates are required, and the resulting assembly process is much closer to traditional flip-chip packaging. AMD also believes that EFB will prove a more scalable solution since it’s largely a lithographic process – a point that’s particularly salient right now given the ongoing substrate bottleneck in chip production.
Overall, EFB looks very similar (if not identical) to TSMC’s InFO-L packaging technology, which was announced back in 2020 and uses an above-substrate bridge. Given the close working relationship between AMD and TSMC, it’s not clear how much of EFB is really an AMD innovation versus them employing InFO-L. But regardless, a more cost-effective means of implementing HBM is a very important step forward for AMD’s GPU team.
MI200 Availability: Frontier First, Everyone Else in Q1’2022
Wrapping things up, while today is the official launch day for the Instinct MI200 family, the new hardware is in fact already shipping. AMD began the first shipments of MI250X parts for the Frontier supercomputer back in Q3 of this year, and they are continuing to ship them in order to fill the US Department of Energy’s order for tens-of-thousands of GPUs.
Frontier is itself a major success story for AMD, and it’s something AMD is hoping they can use to pivot towards even more success in the server GPU marketplace. With 4 MI250X accelerators in every single node, it’s MI250X that truly underpins the United States’ first exascale computer. And that contract, in turn, has become a major vote of confidence in AMD by the US Department of Energy. As AMD seeks to move into new territory in the server GPU space – and as it seeks to carve out its share of NVIDIA’s lucrative datacenter market – showcasing that their hardware and software stack are ready for wide-scale deployment is critical to winning over customers. It’s an experience AMD has just gone through over the last few years on the CPU side of matters.
The MI200 family, in turn, is AMD’s best chance yet to break into the server GPU market in a big way. At the risk of downplaying AMD’s accomplishments here, reaching parity with NVIDIA in form factors and I/O via OAM and Infinity Fabric 3 is going to go a long way towards making their products feature-competitive with the market leader. And AMD’s performance figures, if they pan out in production systems, will help carry them much of the rest of the way.
Still, AMD’s competition won’t be sitting still. Intel’s Ponte Vecchio – the heart of the rival Aurora supercomputer – looms close. And market leader NVIDIA has their own cards to play, both figuratively and literally. So there’s ample opportunity for AMD to make gains in the server GPU market over the next cycle, but it’s something they will have to work hard at.
In the meantime, AMD won’t be alone in their journey. After the Frontier order is complete, AMD will be able to begin focusing on ODM and OEM orders, with ODM/OEM availability expected in the first quarter of next year. Many of AMD’s regular partners are slated to offer MI200 accelerators in their systems, including Dell, Supermicro, Lenovo, HPE, Gigabyte, and ASUS, all of whom have been major participants in the success of AMD’s EPYC CPUs as well.
Comments
Targon - Monday, November 8, 2021
For RDNA 3, a key to consumer graphics is that you want it to appear to be a single GPU, even across multiple dies. At that point, something like the I/O die in Ryzen will be a front end, with the multiple GPU dies behind it that will be "hidden". The running of shaders and such would then apply things to preferably a single die, but could potentially span multiple dies with some potential performance hits if that actually happened. On the CPU side, you don't have single threads that span multiple cores, but in graphics, single programmed shaders and such will run across multiple GPU cores (not dies), so that is where you get very good scaling with more GPU cores. Obviously, the designers at AMD will understand this stuff a lot better than I do, so they may have ways to avoid potential slowdowns due to things that span multiple dies.
Kevin G - Tuesday, November 9, 2021
Not really, everything is linked by Infinity Fabric, including those in the same package. Putting two into the same package like this is a density play.
One thing worth noting is that a single die still has eight Infinity Fabric links, and a single-die product is an option for AMD to bring to market. This would be a straightforward way to drop down power consumption without altering the underlying platform these plug into.
eastcoast_pete - Monday, November 8, 2021
Thanks Ryan! While I am certainly not in the market for one of those, I wonder if there is a chance of a deep(er) dive into the interconnect used?
Unrelated, the choice of N6 makes sense, as TSMC has pushed its customers to port their N7 designs to N6 for a while now (read: they have capacity and get more chips per wafer vs N7), while N5 is basically hogged/reserved by Apple. Insisting on N5 at TSMC would probably both drastically raise costs and restrict numbers, as AMD would have had to take the leftover capacity Apple isn't using when it becomes available. Going to Samsung (if even possible) wouldn't have gained anything: As we know from one of Andrei's reviews, TSMC N7 is as efficient as Samsung's "5 nm", and TSMC's N6 is, at minimum, not worse than their N7.
mode_13h - Monday, November 8, 2021
Zen 4 will use N5. So, fab availability might not have been the deciding factor, given that these should be a relatively low-volume product.
Ryan Smith - Monday, November 8, 2021
"I wonder if there is a chance of a deep(er) dive into the interconnect used?"
A good chunk of what I already know is in this article. But what else would you like to know?
Kevin G - Tuesday, November 9, 2021
One thing I'd like to know is why it has taken AMD so long to enable CPU-to-GPU links via Infinity Fabric. Vega 20 was the first GPU to have those 25 Gbit links and it appears as if Rome on the CPU side had them as well. The pieces seemed to be there to do much of this on the previous generation of products, just never enabled.
mode_13h - Wednesday, November 10, 2021
Vega 20 used them for an over-the-top connector. And in Mac Pro, they used it for dual-GPU graphics cards (which I think could *also* accommodate an over-the-top bridge?).
I'm unclear how much benefit there'd be to using those links for host communication, instead of just PCIe 4.0 x16.
mode_13h - Monday, November 8, 2021
* yawn *
Wake me when AMD decides to support ROCm on its consumer GPUs.
Not that I care much, anyway, as HIP is basically CUDA behind the thinnest of veneers. AMD at least had a case when they were in the OpenCL camp, but now they've completely given away the API game to Nvidia.
I also wonder how much longer GPUs are going to be relevant for deep learning. AMD should snap up somebody like Tenstorrent or Cerebras, while it still has the chance.
flashmozzg - Monday, November 8, 2021
Did you miss the ROCm 5.0 announcement?
mode_13h - Tuesday, November 9, 2021
I saw that it was announced, but even now I can't find any statement about it supporting consumer GPUs. If you know differently, please provide a link.
Also, last I checked AMD was the only one not yet on OpenCL 3.0. I hope it addresses that, too.