As part of Apple’s spring “Peek Performance” product event this morning, Apple unveiled the fourth and final member of the M1 family of Apple Silicon SoCs, the M1 Ultra. Aimed squarely at desktops – specifically, Apple’s new Mac Studio – the M1 Ultra finds Apple once again upping the ante in terms of SoC performance for both CPU and GPU workloads. And in the process, Apple has thrown the industry a fresh curveball by not just combining two M1 Max dies into a single chip package, but by making the two dies present themselves as a single, monolithic GPU, marking yet another first for the chipmaking industry.

Back when Apple announced the M1 Pro and the ridiculously powerful M1 Max last fall, we figured Apple was done with M1 chips. After all, how would you even top a single 432mm2 chip that’s already pushing the limits of manufacturability on TSMC’s N5 process? As it turns out, Apple can do one better – or perhaps it would be more accurate to say twice better. For the company’s final and ultimate M1 chip design, the M1 Ultra, Apple has bonded two M1 Max dies together onto a single chip, with all of the performance benefits that doubling their hardware would entail.

The net result is a chip that, without a doubt, manages to be one of the most interesting designs I’ve ever seen for a consumer SoC. As we’ll touch upon in our analysis, the M1 Ultra is not quite like any other consumer chip currently on the market. And while the double-die strategy benefits sprawling multi-threaded CPU and GPU workloads far more than it does single-threaded tasks – an area where Apple is already starting to fall behind – in the process Apple is breaking new ground on the GPU front. By enabling the M1 Ultra’s two dies to transparently present themselves as a single GPU, Apple has kicked off a new technology race for placing multi-die GPUs in high-end consumer and workstation hardware.

M1 Max + M1 Max = M1 Ultra

At the heart of the new M1 Ultra is something a bit older: the M1 Max. Specifically, Apple is using two M1 Max dies here, and then bonding them together to form a massive amalgamation of 114B transistors.

As M1 Max itself has been shipping for the last 5 months, the basic architecture of the chip (and its underlying blocks) is at this point a known quantity. M1 Ultra isn’t introducing anything new in terms of end-user features in that respect; instead, the chip is all about scaling up Apple’s M1 architecture one step further by placing a second silicon die on a single chip.

Starting with speeds and feeds, by placing two M1 Max dies on a single package, Apple has doubled the amount of hardware at their disposal in virtually every fashion. This means twice as many CPU cores, twice as many GPU cores, twice as many neural engine cores, twice as many LPDDR5 memory channels, and twice as much I/O for peripherals.

On the CPU front, this means Apple now offers a total of 20 CPU cores, comprising 16 of their performance-focused Firestorm cores and 4 of their efficiency-focused Icestorm cores. Given that M1 Ultra is aimed solely at desktops (unlike M1 Max), the efficiency cores don’t have quite as big of a role to play here, since Apple doesn’t need to conserve energy down to the last joule. Still, as we’ve seen, they’re fairly potent cores on their own, and will help add to the CPU throughput of the chip in heavily threaded scenarios.

As is typical for an Apple product announcement, the company isn’t disclosing clockspeeds here. The desktop-focused nature of the chip means that, if they desire, Apple can push clockspeeds a bit higher than they did on the M1 Max, but they would need to leave their energy efficiency sweet spot to do it.

In practice, I will be surprised if the M1 Ultra CPU cores are clocked much higher than on the M1 Max, which is a mixed blessing for Apple’s CPU performance. For multi-threaded workloads, 16 Firestorm cores will provide enough throughput to top some performance charts. But for single/lightly-threaded workloads, Firestorm has already been outpaced by newer designs such as Intel’s Golden Cove CPU architecture. So don’t expect to see Apple recover the lead for single-threaded performance here; instead it’s all about multi-threaded throughput and, especially, energy efficiency.

Meanwhile, doubling the number of M1 Max dies on the chip means that Apple is able to double the number of memory channels, and thus their overall memory bandwidth. Whereas M1 Max had 16 LPDDR5-6400 channels for a bit over 400GB/second of memory bandwidth, M1 Ultra doubles that to 32 LPDDR5 channels and 800GB/second of memory bandwidth. And as with the M1 Max, this is accomplished by soldering the LPDDR5 chips directly to the chip package, for a total of 8 chips on M1 Ultra.
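
For a quick sanity check on those figures, here’s a minimal back-of-the-envelope sketch. The bus widths are my own assumptions (512 bits of LPDDR5-6400 per M1 Max die, doubled on M1 Ultra), not numbers Apple publishes in these terms; the point is simply that a doubled LPDDR5-6400 interface lands right around the advertised 800GB/second.

```swift
import Foundation

// Back-of-the-envelope check on the bandwidth figures. The bus widths below
// are assumptions (512 bits per M1 Max die, doubled on M1 Ultra), not
// Apple-published specifications.
let transfersPerSecond = 6_400_000_000.0   // LPDDR5-6400: 6400 MT/s
let busWidthBitsMax = 512.0                // assumed per-die interface width
let busWidthBitsUltra = busWidthBitsMax * 2

// bandwidth = transfer rate x bytes per transfer
let maxGBps   = transfersPerSecond * (busWidthBitsMax / 8) / 1e9
let ultraGBps = transfersPerSecond * (busWidthBitsUltra / 8) / 1e9

print(String(format: "M1 Max:   ~%.0f GB/s", maxGBps))    // ~410 GB/s
print(String(format: "M1 Ultra: ~%.0f GB/s", ultraGBps))  // ~819 GB/s
```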

The doubled memory chips also allow Apple to double the total amount of memory available in their hardware. Whereas M1 Max topped out at 64GB, M1 Ultra tops out at 128GB. This is still less memory than can be found on a true high-end workstation (such as a Mac Pro), but it puts Apple ahead of all but the highest-end PC desktops, and should be plenty sufficient for their content creator crowd.

As we saw with the launch of the M1 Max, Apple already provides more bandwidth to their SoCs than the CPU cores alone can consume, so the doubled bandwidth isn’t likely to have much of an impact there beyond ensuring that the CPU cores are just as well fed as they are on the M1 Max. Instead, all of this extra memory bandwidth is meant to keep pace with the growing number of GPU cores.

Which brings us to the most interesting aspect of the M1 Ultra: the GPU. With 32 GPU cores, M1 Max was already setting records for a monolithic, integrated GPU. And now Apple has doubled things to 64 GPU cores on a single chip.

Unlike multi-die/multi-chip CPU configurations, which have been commonplace in workstations for decades, multi-die GPU configurations are a far different beast. The amount of internal bandwidth GPUs consume, which for high-end parts is well over 1TB/second, has always made linking them up technologically prohibitive. As a result, in a traditional multi-GPU system (such as the Mac Pro), each GPU is presented as a separate device to the system, and it’s up to software vendors to find innovative ways to use them together. In practice, this has meant having multiple GPUs work on different tasks, as the lack of bandwidth meant they can’t effectively work together on a single graphics task.
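
To make that distinction concrete, here’s a small sketch of what it looks like from the software side on macOS, using the standard Metal API (nothing specific to today’s announcement). On a traditional multi-GPU system each card enumerates as its own MTLDevice, and the application has to divide work between them itself; if Apple’s claim holds, the M1 Ultra’s two dies should show up in this list as a single device.

```swift
import Metal

// Enumerate every GPU that macOS exposes to applications. On a traditional
// multi-GPU system (such as a Mac Pro with two cards), this returns one
// MTLDevice per card, and it's up to the application to split work between
// them. A transparently combined M1 Ultra should instead appear as one device.
let devices = MTLCopyAllDevices()
for device in devices {
    print(device.name,
          "| unified memory:", device.hasUnifiedMemory,
          "| max working set:", device.recommendedMaxWorkingSetSize, "bytes")
}
```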

But if you could somehow link up multiple GPUs with a ridiculous amount of die-to-die bandwidth – enough to replicate their internal bandwidth – then you might just be able to use them together on a single task. This has made combining multiple GPUs in a transparent fashion something of a holy grail of multi-GPU design. It’s a problem that multiple companies have been working on for over a decade, and it would seem that Apple is charting new ground by being the first company to pull it off.

UltraFusion: Apple’s Take On 2.5D Chip Packaging

The secret ingredient that makes this all possible – and which Apple has been keeping under wraps until today – is that M1 Max has a very high speed interface along one of its edges. An interface that, with the help of a silicon interposer, allows two M1 Max dies to be linked up.

Apple calls this packaging architecture UltraFusion, and it’s the latest example in the industry of 2.5D chip packaging. While the details vary from implementation to implementation, the fundamentals of the technology are the same. In all cases, some kind of silicon interposer is put beneath two chips, and signals between the two chips are routed through the interposer. The ultra-fine manufacturing capabilities of silicon mean that an enormous number of traces can be routed between the two chips – in Apple’s case, over 10,000 – which allows for an ultra-wide, ultra-high bandwidth connection between them.

Officially, Apple only states they’re using a silicon interposer here, which is the generic term for this technology. But, going by Apple’s promotional videos and mockup animations, it looks like they’re using a small, silicon bridge of some sort. Which would make this similar in implementation to Intel’s EMIB technology or Elevated Fanout Bridge (EFB) technology. Both of these are already on the market and have been used for years, so Apple is far from the first vendor to use the technology. But what they’re using it for is quite interesting.

With UltraFusion, Apple is able to offer an incredible 2.5TB/second of bandwidth between the two M1 Max dies. Even if we assume that this is an aggregate figure – adding up both directions at once – that would still mean that they have 1.25TB/second of bandwidth in each direction. All of which is approaching how much internal bandwidth some chips use, and exceeds Apple’s aggregate DRAM bandwidth of 800GB/second.
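
Put in the simplest terms (and again assuming the 2.5TB/second figure is a bidirectional aggregate, which Apple hasn’t clarified):

```swift
// Quick comparison of the stated die-to-die bandwidth against DRAM bandwidth,
// assuming the 2.5TB/s figure is an aggregate of both directions.
let ultraFusionAggregateGBps = 2_500.0
let perDirectionGBps = ultraFusionAggregateGBps / 2   // 1,250 GB/s each way
let dramGBps = 800.0                                  // M1 Ultra LPDDR5 total

print(perDirectionGBps / dramGBps)   // ~1.56: each direction still outpaces DRAM
```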

We’ll go more into this in the obligatory follow-up article, but the important point to take away here is that Apple has become the first vendor to bond two GPUs together with such a massive amount of bandwidth. This is what’s enabling them to take a stab at presenting the two GPUs as a single device to the OS and applications, as it allows them to quickly shuffle data between the GPUs as necessary.

But it should also be noted that there are plenty of details that can make or break the usefulness of this approach. For example, is 2.5TB/second enough, given the high performance of the GPUs? And what is the performance impact of the additional latency in going from GPU to GPU? Just because Apple has doubled the number of GPU cores by gluing them together doesn’t mean Apple has doubled their GPU performance. But at the end of the day, if it works even remotely well, then the implications for GPU designs going forward are going to be immense.

GPU Performance: Exceeding GeForce RTX 3090

Thanks to UltraFusion, Apple has become the first vendor to ship a chip that transparently combines two otherwise separate GPUs. And while we’ll have to wait for reviews to find out just how well this works in the real world, Apple is understandably excited about their accomplishment, and the performance implications thereof.

In particular, the company is touting that the M1 Ultra’s GPU performance exceeds that of NVIDIA’s GeForce RTX 3090, which at the moment is the single fastest video card on the market. And furthermore, that they’re able to do so while consuming a bit over 100 Watts, or 200 Watts less than the RTX 3090.

From a performance standpoint, Apple’s claims look reasonable, assuming their multi-GPU technology works as advertised. For as fast as the RTX 3090 is, it can’t be overstated just how many more transistors Apple is throwing at the matter than NVIDIA is; the GA102 GPU used by NVIDIA has 28.3 billion transistors, while the combined M1 Ultra is 114 billion. Not all of which are being used for graphics on the M1 Ultra, of course, but with so many transistors, Apple doesn’t have to be shy about throwing more silicon at the problem.

The amount of silicon Apple has at their disposal is also one of the keys to their low power consumption. As we’ve already seen with the M1 Max, Apple has built a wide enough GPU that they can keep clockspeeds nice and low on the voltage/frequency curve, which keeps overall power consumption down. The RTX 3090, by contrast, is designed to chase performance with no regard to power consumption, allowing NVIDIA to get great performance out of it, but only by riding high on the voltage/frequency curve. And of course, Apple enjoys a huge manufacturing process advantage here, using TSMC’s N5 process versus Samsung’s 8nm process.
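
The underlying trade-off is simple enough to sketch. Dynamic power scales roughly with capacitance × voltage² × frequency, so a GPU that is twice as wide but run at half the clocks (and a commensurately lower voltage) can deliver the same nominal throughput at noticeably lower power. The operating points below are made-up placeholders to show the shape of the curve, not measurements of either chip:

```swift
// Illustrative only: dynamic power scales roughly with C * V^2 * f. Doubling
// the execution width (more switching capacitance) while halving clocks and
// trimming voltage keeps nominal throughput constant at lower power. These
// are made-up operating points, not measurements of any real GPU.
func relativePower(width: Double, voltage: Double, frequency: Double) -> Double {
    // switching capacitance scales with how much silicon is toggling
    return width * voltage * voltage * frequency
}

let narrowAndFast = relativePower(width: 1.0, voltage: 1.00, frequency: 1.0)
let wideAndSlow   = relativePower(width: 2.0, voltage: 0.80, frequency: 0.5)

print(wideAndSlow / narrowAndFast)   // 0.64x: same nominal throughput at ~64% of the power
```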

Still, given the ground-breaking nature of what Apple is trying to pull off with their transparent multi-GPU design, it has to be emphasized that Apple’s performance claims should be taken with a grain of salt, at least for now. Apple typically doesn’t do things half-baked, but as combining two GPUs in this fashion is yet unproven, a bit of skepticism is healthy here.

First Thoughts

While Apple has telegraphed their intention to scale up their chip designs since the first days of their Apple Silicon-powered Macs, I believe it’s safe to say that the M1 Ultra exceeds most expectations. Having reached the practical limits of how big they can make a single die, Apple has taken the logical next step and started placing multiple dies on a single chip in order to build a workstation-class processor. A step that is necessary, given the constraints, but also a step that is historically more cutting edge than is typical even for Apple.

The net result is that Apple has announced a SoC that has no peer in the industry across multiple levels. Going multi-die/multi-chip in a workstation is a tried and true strategy for CPUs, but to do so with GPUs will potentially put Apple on a level all their own. If their transparent multi-GPU technology works as well as the company claims, then Apple is going to be even farther ahead of their competitors, both in performance and in developing the cutting-edge technologies needed to build such a chip. In that respect, while Apple is trailing the industry a bit with their UltraFusion 2.5D chip packaging technology, what they’re attempting to do with it more than makes up for lost time.

All of which is to say that we’re very eager to see how M1 Ultra performs in the real world. Apple has already set a rather high bar with the M1 Max, and now they’re aiming to exceed it with the M1 Ultra. And if they can deliver on those goals, then they will have twice set a new high point for SoC design in the span of just 6 months. These are exciting times, indeed.

Comments

  • andrewaggb - Wednesday, March 9, 2022 - link

    I like where a lot of the M1 is going. I'm curious how much is better design, how much is just having an enormous transistor budget, how much is better manufacturing, and how much is false marketing.

    For example, comparisons to nvidia products made on samsung's process aren't fair from a power consumption point of view at all. The samsung process is way worse and the 3090 wasn't tuned for power consumption, it's a massive card with a huge heatsink and cooler.

    The massive memory bandwidth and faster integrated graphics is something I've often wondered why AMD doesn't pursue in a laptop/desktop form (like a ps4/ps5) as clearly they know how. It's not clear to me why Apple is doing it first but good for them. Personally I would probably be willing to trade off upgradable ram for better performance if they don't price gouge you for it. For SSD's I can't see any valid reason to make them non-upgradeable as PCIE4/5 isn't holding them way back. To date I've never seen a vendor that doesn't price gouge on RAM and SSD upgrades so I'm not hopeful in this regard.

    I also see some misreporting, eg there's an article on Toms claiming a 3990x threadripper is the same on geekbench but if you look at the linux (not windows) results for the 3990x it's 40% faster. There are also claims of 3090ish performance but then independent testing shows way lower results most of the time. And something like ethereum mining is dreadfully slow on the m1 variants by all accounts, etc.

    At the end of the day I still think it's probably the best laptop chip you can get but I hope we see some more competition.
  • mode_13h - Sunday, March 20, 2022 - link

    > massive memory bandwidth and faster integrated graphics is something I've
    > often wondered why AMD doesn't pursue in a laptop/desktop form

    $$$

    The consoles and this Apple Studio Pro are vertically-integrated, which means they're not stacking margins on top of margins. Furthermore, MS and Sony lose money on the console hardware and make it up in software royalties and subscription services.

    If AMD makes a monster APU, it's going to be expensive to the point where it'd be a niche product and most in that price range would prefer to get just a CPU and dGPU.

    Worse, it would limit you to non-upgradable memory, since you can't get the needed bandwidth with DIMMs.

    > clearly they know how.

    Right. They *do* know how, and haven't seen fit to do it for the past 10 years since they started making console APUs. That's your clearest sign that it hasn't been viable for the mainstream market thus far, because they've been keen to play up their GPU advantage over that entire time.
  • valuearb - Sunday, March 20, 2022 - link

    SSD is replaceable. Macs last forever. It's got a ton of super high speed thunderbolt ports for all the external storage and devices you want. And it's hands down the fastest video and graphics production system you can get under $5k.
  • mode_13h - Monday, March 21, 2022 - link

    It'll be interesting to see how long that LPDDR5 lasts. DRAM does wear out with use and all the heat it's going to be feeling from the CPU won't help!
  • Dolda2000 - Saturday, April 2, 2022 - link

    >SSD is replaceable
    Only with original Apple parts (since it's not a standard NVMe drive), which not only means that you cannot upgrade it with third-party options, but also that you can't take the SSD out of an otherwise potentially faulty system and get the contents out of it in another system.
    >It's got a ton of super high speed thunderbolt ports for all the external storage and devices you want.
    What if I don't want external storage, but just larger internal storage? Or RAIDed internal storage? Or more memory? Or a 32-core GPU with only a 10-core CPU? Or to upgrade my CPU (or the WiFi adapter) three years down the line? And so on and so forth.
    >And its hands down the fastest video and graphics production system you can get under $5k.
    Only because of the media engines, and there's absolutely no reason they couldn't be on something like a PCIe card.
    >Macs last forever
    At this point I hope you're trolling.
  • rcfa - Wednesday, March 30, 2022 - link

    “Repairability” is the biggest cause for necessitating repairs.
    It is much better for the environment to toss one and recycle one in e.g. every 10’000 units than to toss/recycle zero units, but have to repair one in 1’000.
    Apple has the volume to make these statistics, and they optimize globally not based on the emotions of a single affected consumer.
    Reliability of glued-down, soldered-in-place parts is much higher than that of removable, socketed parts.
    While the latter can be repaired, stock keeping, packaging, and shipping of repair parts has a significant environmental footprint, too; and multiplied by the higher failure rate, the environment takes a beating compared to the non-repairable approach.
  • mode_13h - Thursday, March 31, 2022 - link

    There are valid points on either side. Neither approach is going to be right for everyone, and some people just value the expandability of a machine that's designed to be upgradable.

    I have a workstation with a NVMe SSD in a PCIe slot. It has no M.2 slots, but I have enough PCIe lanes that I could add a PCIe card when they came along. I like being able to do that sort of thing. I also upgraded the GPU in this machine, which was a huge benefit, though I also had to swap around PSUs when I did. I was also glad to have that option. Another mid-life upgrade I typically do is to double a machine's RAM. Sometimes, I've even done 2 RAM upgrades, doubling capacity each time.

    At work, I've upgraded machines' CPUs, RAM, and GPUs to fulfill project needs.

    So, as much as you might favor a certain approach, you really can't say it's right for everyone and every purpose.
  • Dolda2000 - Saturday, April 2, 2022 - link

    >Reliability of glued-down, soldered-in-place parts is much higher than that of removable, socketed parts.
    Do you have a citation for that claim? It seems immediately false to my experience. Extremely rarely have I had such components as socketed CPUs, GPUs, RAM or SSDs fail, and even less so in a way that is traceable to them being socketed instead of soldered. Intrinsic failures appear to be much more common with such things as capacitors or other simple components that are merely part of something like a motherboard or PSU, and in reality failures are much more commonly caused by some sort of user error or other external factor (liquid damage, dropping things, power surges, and so on), which would affect a monolithic system just as much as a modular system.
    >It is much better for the environment to toss one and recycle one in e.g. every 10’000 units than to toss/recycle zero units, but have to repair one in 1’000.
    To begin with, I don't believe that's true, unless you have some citation for that as well. It seems obvious to the naked eye that a proper and limited repair has a significantly more limited impact. Changing a bad cap or cleaning up liquid damage is literally nothing compared to recycling and assembling a whole system. I also don't believe that modular systems fail more than an order of magnitude more often than monolithic systems. Nothing in my experience tells me that would be the case. See above, I guess.
    >Apple has the volume to make these statistics, and they optimize globally not based on the emotions of a single affected consumer.
    Is your argument actually just "trust Apple, they're just such good guys, they would never do anything to screw over the customers in order to improve their bottom line, it just looks that way to you because they are optimizing for global variables"?
  • mdriftmeyer - Wednesday, March 9, 2022 - link

    Not that I'm raving about the Studio but you do realize with a small form factor NUC design you would slave a Thunderbolt Hub off the Studio, then slave off a NAS and be done with it.

    I agree the Zen 4 will turn heads on the 5N process this Fall.
  • kwohlt - Monday, March 14, 2022 - link

    Those are certainly all words.
    " Like pay 4K + extra 1-2K for 4TB" Doesn't seem entirely unreasonable. Go to Lenovo's website and configure a P620 workstation and see how you can pay $1K for 2TB of NVME. $1K per TB is how much I expect to pay for enterprise storage solutions anyways.

    "you cannot run any sort of VMWare Hypervisor home server class software OS". Lol who cares? We run our hypervisors on our datacenters on proper servers, not workstations that live on employee's desks.

    "Why not simply get an office decommissioned XEON processor and a good PowerEdge / ProLiant HPE chassis and add HBA SAS expander and blow the roof out of the RAID"
    Buy used, slower hardware instead because it's cheaper?

    "Or say for eg buy the new upcoming Threadripper series which is a HEDT machine and will churn everything you throw at it and will live super long long time and on top you can get a TON of DRAM memory plus the massive I/O bandwidth ?"
    I guess that's always an option, but expect to pay a lot. Lenovo still charges $4500 to upgrade to a W3975WX, and you're still left with significantly less VRAM no matter what discrete card you go with, while consuming multiple times more power, from a much larger device.

    Not sure why you're hating so much - M1 Ultra and Mac Studio are very price competitive and offer incredible perf/watt, and it's not even the highest tier M1 series SoC that's going to be released.
