Arm Cortex-X4: Fastest Arm Core Ever Built (Again)

Diving further into Arm's new CPU core microarchitectures, we'll start with the Cortex-X4, which stands out as the most substantial advancement. Arm has consistently achieved significant double-digit improvements in instructions per cycle (IPC) with each iteration, starting from the original Cortex X1 core, then progressing to the Cortex X2, and continuing with the Cortex-X3 IP introduced last year, and they'll do so again for the Cortex-X4 in 2023 as well. The Cortex-X4 is specifically designed to cater to cutting-edge flagship Android-based smartphones and leading mobile devices that utilize robust Arm IP-based System-on-Chips (SoCs). Representing a subtle yet impactful enhancement over its predecessor, the Cortex-X4 further refines the capabilities of the Cortex-X3 core.

The Cortex-X4 is designed to deliver top-tier compute performance in mobile System-on-Chips (SoCs), particularly tailored to handle demanding workloads like AAA gaming and bursty operations. The Cortex-X4 is Arm's highest-performing core to date, featuring an anticipated core clock speed of 3.4 GHz and an increased L2 cache per core, doubling its capacity to 2 MB compared to last year's 1 MB Cortex-X3 . Despite these enhancements, Arm has managed to maintain a minimal increase in the physical size of the core, with the more complex X4 CPU core coming in at under a 10% die size increase (the additional L2 cache excluded).

As for power efficiency, Arm claims a notable improvement in power savings of approximately 40% compared to previous generations. Don't expect to see too many CPU vendors take advantage of that, since the primary job of the X-series is to run fast, but it goes to show what the X4 can accomplish in conjunction with the latest fab nodes.

Arm Cortex-X4: Front End Reshuffle, Redesigning Instruction Fetching

In terms of architecture, the Cortex-X4 exhibits similarities to its predecessor, the Cortex-X3, with the primary focus being on refining the existing architecture and optimizing efficiency across various core components.

Now while things haven't changed all too much architecturally from the Cortex-X4 to the Cortex-X3, the Cortex-X4 front end has had a reshuffle and a tweak of the instruction fetching block. The aim of Arm has been to keep latencies low while offering peak bandwidth throughout its Cortex-X4 core and within the entire TSC23 core cluster.

With regards to the Cortex-X4's front end, the big architectural change here has been through its dispatch width.  The Cortex-X4 now has a more focused 10-wide dispatch width, up for the 6/8-wide dispatch width of the X3. That said, while the front-end has gotten wider, the effective pipeline length has actually shrunken even so slightly; the branch mispredict penalty is down from 11 cycles to 10.

The other big front-end focus has been on the instruction fetching process itself; Arm has essentially redesigned the entire instruction fetch delivery system to ensure better efficiency throughout the pipeline when compared to the Cortex-X3.

The latest architecture also takes another pass on improving Arm's branch prediction units, further improving their prediction accuracy. Arm isn't saying much about how they accomplished this, though we do know that they've targeted conditional branch accuracy in particular. None of this comes for free, though; Arm was quick to note that the improved predictors were more expensive to implement. Still, Arm believes this is worth it in keeping the beast (Cortex-X4) fed, so to speak.

Shifting to the back-end of the CPU core, Arm has taken a focus on execution bandwidth. Among other changes, Arm has increased the number of ALUs from 6 to 8. Of these, six are simple ALUs for processing single-cycle uOPS. Meanwhile, there are two complex ALUs for processing dual and multi-cycle instructions, Arm has also squeezed another branch unit in, giving the Cortex-X4 a total of 3, up from 2, as well as adding an extra Integer MAC. Meanwhile on the floating point side of matters, the Cortex-X4 also upgrades a pipelined FP divider.

So to some extent, the X4's performance improvements come from a brute force increase throughout the chip, with the chip able to dispatch and retire more instructions in a single clock. The goal for the Cortex-X4 is to offer peak performance on both benchmarks and real-world workloads, as well as an increase in the fetch bandwidth for any instruction set going through the pipeline. The benefits come through latency reductions and instruction fusion benefits for larger instruction footprint workloads.

Increasing the Micro-op Commit Queue (MCQ) capacity – and thus the size of the window for instruction re-ordering – is another refinement in Arm's toolbox for Cortex-X4. As with previous increases in Arm's re-order buffers, the larger queue affords more opportunities to look for instruction re-ordering, to hide memory stalls and otherwise extract more opportunities for the rest of the CPU back-end to get some work done. And with CPU performance continuing to outpace memory bandwidth, the need for larger buffers only grows with each generation.

Finally, at the far back end of the X4 CPU core, Arm has added a fourth address generation unit. Interestingly, this one is just for stores; Arm already had a load-only unit, but opted for a store-only unit rather than converting it to a full mixed LS unit.

The L1 cache subsystem of the Cortex-X4 has also received a lot of work. The L1's translation lookaside buffer (TLB) has been doubled to 96 entries, and there's a new L1 temporal data prefetcher. Finally, Arm has taken steps to reduce the number of L1 data bank conflicts on the X4.

There have also been some changes made to better support the larger L2 cache size of the Cortex-X4 that we previously discussed. The L2 has been moved physically closer to the CPU core for performance reasons, and Arm has been able to expand the L2 size without any resulting increase in latency. So there is less of a trade off here than is often the case for increasing cache sizes.

Cortex-X4: IPC Uplift, Scalable up to 14-Cores, Up to 32 MB L3 Cache

One of the primary benefits of Arm's v9.2 architecture shift is that it offers increased scalability. The TSC23 core cluster now supports up to 14 cores which adds a level of flexibility for SoC vendors to implement into their latest designs. Perhaps one of the biggest changes is support for up to 32 MB of shared L3 cache within the TSC23 core cluster. The levels of L3 cache implemented is of course down to the SoC manufacturer, but the maximum levels that can be offered is 32 MB, which allows increased support for higher-end mobile devices such as tablets and notebooks, where applicable.

The maximum number of cores across the entire TSC23 core cluster stretches to 14 in total, with a mixture of big and little cores, with multiple avenues for SoC vendors to explore to capitalize on things like performance gains and efficiency. All of this flexibility is given to the SoC vendors to design their own variations depending on the level of the device. So a flagship mobile device will leverage different combinations of Cortex-X4, Cortex-A720, and Cortex-A520 depending on multiple factors such as cost, power budget, and expected performance levels.

A bigger core and optimizations of existing processes typically come with a performance benefit. Arm is claiming that, based on its pre-silicon simulation numbers, the Cortex-X4 will deliver a 15% IPC uplift at iso-frequency and iso-bandwidth versus the Cortex-X3 used in last year's flagship Android SoCs. There are a number of factors at play here in delivering that total performance improvement, including front-end optimizations and improvements, as well as a larger L2 per core cache of 2 MB and a larger L1D-TLB, which is a cache designed for recently accessed page transitions.

Arm in 2023: Moving to Armv9.2, 64-Bit Only, New DSU-120 Cortex A720: Middle Core, Big on Efficiency
Comments Locked

52 Comments

View All Comments

  • Kangal - Monday, May 29, 2023 - link

    I also forgot to mention, we've had leaks for more than a year about ARMv9 and their Second-Gen cores. They were promised with a sizeable performance improvement at a reduced power draw.

    Turns out the rumours were not correct. Well sort of. We assumed the advances came just from the architecture but that's not it. We're seeing a modest improvement in the architecture, and the benefits coming from a number of other factors. They're relying on a new DynamIQ setup, more cache, faster memory, all mixing together to have an overall notable improvement. Going 64bit-only in microcode will have unseen benefits too. And the elephant in the room is the jump to TSMC-3NM node shrink, which will likely have frequency increases.

    So comparing the QC 8g1 (Samsung 5nm) to the (TSMC-5nm) QC 8g1+ and QC 8g2 (+5nm-TSMC) and (TSMC-3NM) QC 8g3 will be a mixed bag.
  • Kangal - Monday, May 29, 2023 - link

    A TCS23 (X4+720+520) with 1+5+2 configuration, only yields a +27% performance uplift at the same power, compared to TCS22 (X3+715+510) in 1+3+4 cluster.

    Something is miscalculated there!!!

    They either mean:
    1) TCS23 vs TCS23, with only difference being different configuration
    2) TCS22 vs TCS22, with only difference being different configuration
    3) TCS22 vs TCS23, and they meant PLUS an extra +27% performance on top of the architectural improvements
    4) It's not a typo, and they really did mean you ONLY get +27% uplift total. Which doesn't make sense since they claimed the X4 uses -40% less energy than X3, whilst the A720 uses -20% less energy than A715, and the A520 uses -22% less energy than A530. Logically speaking if you just multiply the efficiency gains by the core quantities you get an impressive figure. Unless you divide that by the total cores, that gives you an average drop by -23% energy, but that's not the total. Unless the engineers or the marketers are utterly incompetent there at ARM, and they meant this -23% figure gets increased to -27% figure (+4% efficiency gain) just based on the cluster configuration difference. That's not a great improvement, it's negligible, and not substantial enough to require a new silicon stamp (which explains MediaTek).

    1x40% + 5x20% + 2x22% = 184% / 8 = 23%
  • Doug_S - Monday, May 29, 2023 - link

    There are so many factors changing like more L2 cache, supporting more L3 cache, faster memory, better processes and they don't tell you anything about what the differences are.

    If they said X3 with x cache, y DRAM on process z was compared to X3 with x cache, y DRAM on process z then you could assume the performance uplift was due to their architecture. But they are turning all those knobs so who knows what improvement comes from the core versus what is around the core and what node it is on.
  • Doug_S - Monday, May 29, 2023 - link

    Ugh I meant X3 compared to X4 of course.
  • Kangal - Tuesday, May 30, 2023 - link

    I got that, but the architecture has been rather since the Cortex-A78. Just like how there was a long period of time since the release of the Cortex-A57 compared to the Cortex-A72. That extra time let ARM make a lot of big architectural improvements. In fact, it was pretty lengthy that we got Custom Cores developed by the likes of Nvidia, Qualcomm, Samsung, all which were vastly superior to the Cortex-A57 and they matched the subsequent release of the Cortex-A72.

    My biggest concern is the mistakes in their slides.
    If you have a 1+3+4 TCS22 design, and you do nothing but change the A510 to A520, you should see an upgrade of 22% per core. So 22% x4 should see a +88% uptick in performance. Now compare that to the mere 27% upgrade they said if you upgraded all the core types (X4 / A720 / A520) and you went with a larger chipset with the 1+5+2 design. Something is clearly amiss.

    Another solution to the riddle is they are using the problematic silicon from Samsung-5nm. Making a comparison between the flawed QC 8g1, against a new chipset using the same node, but upgrading the Core-types and the Cluster-design. Even then it's a bad excuse, because that would mean its barely competing against the 6-month old QC 8g2 (on TSMC node), and we collectively just ignore the existence of the MediaTek Dimensity chipsets.

    I think we will have to wait to hear the announcement of next-gen chips for 2024, in the form of QC 8g3 and MTK D9400. Let's see their claimed battery life improvement, their performance improvement, and deduce the efficiency from there. Look at which silicon they're building upon (TSMC 4nm vs 5nm). And finally look at the in depth reviews from the likes of Anandtech/Andrei, Geekerwan, and Golden Reviewer.
  • iphonebestgamephone - Wednesday, May 31, 2023 - link

    Techtechpotato instead of anandtech/andrei
  • Findecanor - Monday, May 29, 2023 - link

    ARM MTE and PAC are two different things, and I find it really silly to see them touted for use together.
    MTE steals eight pointer bits that would have been used for PAC, and on some implementations the bits for PAC would then be as few as 3.

    You would better pick one or the other, depending on your protection scheme.
  • syxbit - Monday, May 29, 2023 - link

    So Arm are still years behind Apple? And possibly will be slower than Nuvia too?
    I guess this just helps QUalcomm, as smaller companies have no choice. Either use slow off the shelf parts, or pay QCOMM for their superior chip (assuming Nuvia is as good as claimed.).
  • DanNeely - Thursday, June 1, 2023 - link

    Blame some of the mainland China android forks. They've spent years trying to pretend 64 bit only was never going to happen and were nowhere near ready when last years x3 dropped support for 32 bit code.
  • Arnulf - Monday, May 29, 2023 - link

    "up for the 6/8-wide dispatch with of the X3"

    From ... width

Log in

Don't have an account? Sign up now