Throughout the world, if there's one universal constant in the smartphone and mobile device market, it's Arm. Whether it's mobile chip makers basing their SoCs on Arm's fully synthesized CPU cores, or just relying on the Arm ISA and designing their own chips, at the end of the day, Arm underlies virtually all of it. That kind of market saturation and relevance is a testament to all of the hard work that Arm has done in the last few decades getting to this point, but it's also a grave responsibility – for most mobile SoCs, their performance only moves forward as quickly as Arm's own CPU core designs and associated IP do.

Consequently, we've seen Arm settle into a yearly cadence for their client IP, and this year is no exception. Timed to align with this year's Computex trade show in Taiwan, Arm is showing off a new set of Cortex-A and Cortex-X series CPU cores – as well as a new generation of GPU designs – which we'll see carrying the torch for Arm starting later this year and into 2024. These include the flagship Cortex-X4 core, as well as Arm's mid-core Cortex-A720. and the new little-core Cortex-A520.

Arm's latest CPU cores build upon the foundation of Armv9 and their previous Total Compute Solution (TCS21/22) ecosystem. For their 2023 IP, Arm is rolling out a wave of minor microarchitectural improvements through its Cortex line of cores with subtle changes designed to push efficiency and performance throughout, all the while moving entirely to the AArch64 64-bit instruction set. The latest CPU designs from Arm are also designed to align with the ongoing industry-wide drive towards improved security, and while these features aren't strictly end-user facing, it does underscore how Arm's generational improvements are to more than just performance and power efficiency.

In addition to refining its CPU cores, Arm has undertaken a comprehensive upgrade of its DynamIQ Shared Unit core complex block, with the DSU-120. Although the modifications introduced are subtle, they hold substantial significance in terms of improving the efficiency of the fabric holding Arm CPU cores together, along with extending Arm's reach even further in terms of performance scalability with support for up to 14 CPU cores in a single block – a move designed to make Cortex-A/X even better suited for laptops.

With three new CPU cores and a new core complex, there's a lot to cover. So let's dive right in.

Arm TCS23 at a High Level: Pushing Efficiency & Going Pure 64-bit

Expanding on the enhancements introduced in the Armv9.1 architecture last year, Arm is progressing through its scheduled development cycle with the latest Armv9.2 architecture. The primary objective of this cycle is to eliminate support for 32-bit applications and transition to a comprehensive 64-bit platform. Underpinning this transition is Arm's strategic framework, "Total Compute Solutions" (TCS), which revolves around three core principles: compute performance, security, and developer access. This approach forms the foundation for Arm's methodology and guides its efforts in delivering optimal performance, robust security measures, and streamlined developer capabilities.

Arm's focus on phasing out the 32-bit instruction set has been one it has been working towards for several years. For their latest TCS23, they have finally created a fully 64-bit cluster to capitalize on the benefit of a complete 64-bit mobile ecosystem, excising AArch32 (32-bit instruction) support entirely.. So whether it's a big, mid, or little core, for Arm's latest generation of IP there is only AArch64.

Developing a dynamic system-on-a-chip (SoC) that caters to a broad spectrum of mobile devices, ranging from cutting-edge flagship smartphones to entry-level models, necessitates a meticulous and consistent approach to maintaining competitiveness in a rapidly expanding market. In the realm of flagship devices, for instance, Qualcomm's Snapdragon 8 Gen2 SoC stands out, leveraging a cluster of Arm's Cortex-X3, Cortex A715/710, and Cortex-A510 cores. The upcoming iteration of Qualcomm's Snapdragon 8 Gen3 and other SoC manufacturers are poised to harness the power of Arm's TSC23 core cluster and intellectual property to further enhance performance in the subsequent generation of flagship mobile devices.

Arm's latest DynamIQ Shared Unit, DSU-120, offers support for up to 14 CPU cores in a cluster, which opens the door to a significant number of different CPU core combinations. We'll see what SoC vendors have opted for later this year, but one probably configuration is a 1+5+2 (X4+720+520), which is likely a configuration for a high-end smartphone. Compared to a last-generation 1+3+4 cluster (X3+715+510), Arm is claiming an uplift of 27% in compute performance within GeekBench 6 MT and a more considerable uplift of between 33% and 64% in the Speedometer 2.1 benchmark depending on software optimizations implemented.

Focusing more on the approach to 64-bit migration, last year Arm announced their first AArch64-only CPU core, the Cortex-A715. Consequently, last year saw the release of the first 64-bit only products, such as MediaTek's Dimensity 9200 SoC, as well as Google's Pixel 7 – which was 64-bit only as a platform choice rather than an architectural restriction.

That said, actual AArch64 adoption/use within the larger software ecosystem has been slower than expected, primarily due to the Chinese market being slow to make the switch from 32-bit to 64-bit. Google has actually been key with its application storage (Google Play) by requiring its developers to submit 64-bit apps as far back as 2019, while also allowing the use of 32-bit applications on devices without native 64-bit support. Other markets haven't been as quick in doing so, but Arm claims that it is 'nudging' companies such as OPPO, Vivi, and Xiaomi to adopt AArch64 faster, which is believed to have the desired effect.

With the initial Armv9 architecture, Arm made improvements to security through the use of its Memory Tagging Extension (MTE) (Armv8.5), which is a hardware-based implementation that uses Pointer Authentication (PA) extensions to help protect from memory vulnerabilities. Memory-based vulnerabilities have been a consistent threat to hardware-based security for many years, and it is something Arm is continually developing within its IP to help mitigate these types of attacks. For reference, Google's Chromium Project claimed that around 70% of high-severity bugs are from memory.

One of the related security features of the latest Armv9.2 architecture is the introduction of a new QARMA3 Pointer Authentication Code (PAC) algorithm. Arm claims the newer algorithm reduces the CPU overhead of PAC to less than 1%, even on their little cores, giving developers and handset vendors even less of a reason to not enable the security feature. Most of these improvements revolve around hardware integrity and security, with a combination of MTE and native benefits through the 64-bit instruction and architecture, all designed to make devices even more secure going into 2023 and beyond. This fits with Arm's ethos to encourage a full switch to 64-bit over a hybrid 64 and 32-bit marketplace.

Finally, looking at performance, Arm claims that their latest generation CPU and core complex architecture has made solid gains in power efficiency. At iso-performance, Cortex-X4 offers upwards of a 40% reduction in power consumption versus Cortex-X3, while Cortex-A720 and A520 save 20-22% over their respective predecessors. On the DSU-120 hub itself, Arm claims an 18% improvement in power efficiency.

Of course, most of these power savings are going to instead be invested in additional performance. But it goes to show what SoC and handset vendors can aim for in this generation if they focus singularly on power efficiency and battery life.

Arm Cortex X4: Fastest Arm Core Ever Built
Comments Locked

52 Comments

View All Comments

  • Kangal - Monday, May 29, 2023 - link

    I also forgot to mention, we've had leaks for more than a year about ARMv9 and their Second-Gen cores. They were promised with a sizeable performance improvement at a reduced power draw.

    Turns out the rumours were not correct. Well sort of. We assumed the advances came just from the architecture but that's not it. We're seeing a modest improvement in the architecture, and the benefits coming from a number of other factors. They're relying on a new DynamIQ setup, more cache, faster memory, all mixing together to have an overall notable improvement. Going 64bit-only in microcode will have unseen benefits too. And the elephant in the room is the jump to TSMC-3NM node shrink, which will likely have frequency increases.

    So comparing the QC 8g1 (Samsung 5nm) to the (TSMC-5nm) QC 8g1+ and QC 8g2 (+5nm-TSMC) and (TSMC-3NM) QC 8g3 will be a mixed bag.
  • Kangal - Monday, May 29, 2023 - link

    A TCS23 (X4+720+520) with 1+5+2 configuration, only yields a +27% performance uplift at the same power, compared to TCS22 (X3+715+510) in 1+3+4 cluster.

    Something is miscalculated there!!!

    They either mean:
    1) TCS23 vs TCS23, with only difference being different configuration
    2) TCS22 vs TCS22, with only difference being different configuration
    3) TCS22 vs TCS23, and they meant PLUS an extra +27% performance on top of the architectural improvements
    4) It's not a typo, and they really did mean you ONLY get +27% uplift total. Which doesn't make sense since they claimed the X4 uses -40% less energy than X3, whilst the A720 uses -20% less energy than A715, and the A520 uses -22% less energy than A530. Logically speaking if you just multiply the efficiency gains by the core quantities you get an impressive figure. Unless you divide that by the total cores, that gives you an average drop by -23% energy, but that's not the total. Unless the engineers or the marketers are utterly incompetent there at ARM, and they meant this -23% figure gets increased to -27% figure (+4% efficiency gain) just based on the cluster configuration difference. That's not a great improvement, it's negligible, and not substantial enough to require a new silicon stamp (which explains MediaTek).

    1x40% + 5x20% + 2x22% = 184% / 8 = 23%
  • Doug_S - Monday, May 29, 2023 - link

    There are so many factors changing like more L2 cache, supporting more L3 cache, faster memory, better processes and they don't tell you anything about what the differences are.

    If they said X3 with x cache, y DRAM on process z was compared to X3 with x cache, y DRAM on process z then you could assume the performance uplift was due to their architecture. But they are turning all those knobs so who knows what improvement comes from the core versus what is around the core and what node it is on.
  • Doug_S - Monday, May 29, 2023 - link

    Ugh I meant X3 compared to X4 of course.
  • Kangal - Tuesday, May 30, 2023 - link

    I got that, but the architecture has been rather since the Cortex-A78. Just like how there was a long period of time since the release of the Cortex-A57 compared to the Cortex-A72. That extra time let ARM make a lot of big architectural improvements. In fact, it was pretty lengthy that we got Custom Cores developed by the likes of Nvidia, Qualcomm, Samsung, all which were vastly superior to the Cortex-A57 and they matched the subsequent release of the Cortex-A72.

    My biggest concern is the mistakes in their slides.
    If you have a 1+3+4 TCS22 design, and you do nothing but change the A510 to A520, you should see an upgrade of 22% per core. So 22% x4 should see a +88% uptick in performance. Now compare that to the mere 27% upgrade they said if you upgraded all the core types (X4 / A720 / A520) and you went with a larger chipset with the 1+5+2 design. Something is clearly amiss.

    Another solution to the riddle is they are using the problematic silicon from Samsung-5nm. Making a comparison between the flawed QC 8g1, against a new chipset using the same node, but upgrading the Core-types and the Cluster-design. Even then it's a bad excuse, because that would mean its barely competing against the 6-month old QC 8g2 (on TSMC node), and we collectively just ignore the existence of the MediaTek Dimensity chipsets.

    I think we will have to wait to hear the announcement of next-gen chips for 2024, in the form of QC 8g3 and MTK D9400. Let's see their claimed battery life improvement, their performance improvement, and deduce the efficiency from there. Look at which silicon they're building upon (TSMC 4nm vs 5nm). And finally look at the in depth reviews from the likes of Anandtech/Andrei, Geekerwan, and Golden Reviewer.
  • iphonebestgamephone - Wednesday, May 31, 2023 - link

    Techtechpotato instead of anandtech/andrei
  • Findecanor - Monday, May 29, 2023 - link

    ARM MTE and PAC are two different things, and I find it really silly to see them touted for use together.
    MTE steals eight pointer bits that would have been used for PAC, and on some implementations the bits for PAC would then be as few as 3.

    You would better pick one or the other, depending on your protection scheme.
  • syxbit - Monday, May 29, 2023 - link

    So Arm are still years behind Apple? And possibly will be slower than Nuvia too?
    I guess this just helps QUalcomm, as smaller companies have no choice. Either use slow off the shelf parts, or pay QCOMM for their superior chip (assuming Nuvia is as good as claimed.).
  • DanNeely - Thursday, June 1, 2023 - link

    Blame some of the mainland China android forks. They've spent years trying to pretend 64 bit only was never going to happen and were nowhere near ready when last years x3 dropped support for 32 bit code.
  • Arnulf - Monday, May 29, 2023 - link

    "up for the 6/8-wide dispatch with of the X3"

    From ... width

Log in

Don't have an account? Sign up now