The High-Level Zen Overview

AMD is keen to stress that the Zen project had three main goals: core, cache and power. The power aspect of the design was a very aggressive one: not in the sense of aiming for a mobile-first design, but in that efficiency at higher performance levels was key to being competitive again. It is worth noting that AMD did not mention ‘die size’ in any of the three main goals, which is usually a requirement as well. Arguably you can make a massive core design that runs at high performance and low latency, but it comes at the expense of die size, which makes such a design less economical as a product (if AMD had to rely on 500mm2 die designs for consumer parts at 14nm, they would be priced far too high). Nevertheless, power was the main concern, rather than pure performance or function, which have been the typical AMD targets in the past. This shifting of the goal posts was part of the process of creating Zen.

This slide contains a number of features we will come back to later in this piece, but it covers the main topics that fall under those three goals of core, cache and power.

For the core, having bigger and wider everything was to be expected; however, maintaining low latency can be difficult. Features such as the micro-op cache help most instruction streams improve in performance by bypassing parts of potentially long-cycle repetitive operations, while the larger dispatch, larger retire, larger schedulers and better branch prediction mean that higher throughput can be maintained for longer and in the fastest order possible. Add in two threads per core, and keeping the functional units occupied with full queues also improves multi-threaded performance, as in the sketch below.
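
As a rough illustration of that last point, here is a minimal sketch (generic C with pthreads, not AMD code): two independent integer workloads run in parallel, and when the OS schedules both threads onto the two hardware threads of one core, the wider queues and duplicated thread state are what keep the functional units fed.

```c
#include <pthread.h>
#include <stdio.h>

/* Generic two-thread sketch: each thread runs an independent integer
 * workload. Whether the two threads end up sharing one physical core
 * (and so exercising SMT) depends on OS scheduling; pinning them with
 * pthread_setaffinity_np() to sibling logical CPUs would make that
 * deterministic. Build: gcc -O2 -pthread smt_sketch.c */
static void *worker(void *arg) {
    (void)arg;
    volatile unsigned long acc = 0;
    for (unsigned long i = 0; i < 200000000UL; i++)
        acc += i ^ (i >> 3);           /* independent integer work */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    puts("both workloads finished");
    return 0;
}
```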

For the caches, having a faster prefetch and better algorithms ensures that the data is ready in each of the caches when a thread needs it. Aiming for faster caches was AMD’s target, and while they are not disclosing latencies or bandwidth at this time, we are being told that L1/L2 bandwidth is doubled, with L3 bandwidth up to 5x.
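
Claims like that can be sanity-checked with a simple working-set sweep once hardware is available. The sketch below is our own generic C microbenchmark, not AMD's methodology, and the buffer sizes are arbitrary assumptions: it streams over buffers sized to sit roughly in L1, L2, L3 and DRAM and reports effective read bandwidth for each.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Generic working-set sweep: stream-read buffers of increasing size and
 * report effective bandwidth. Sizes are arbitrary and chosen only to
 * straddle typical L1/L2/L3/DRAM capacities. Build: gcc -O2 sweep.c */

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

int main(void) {
    const size_t sizes[] = {16u << 10, 256u << 10, 4u << 20, 64u << 20};
    for (size_t s = 0; s < sizeof(sizes) / sizeof(sizes[0]); s++) {
        size_t bytes = sizes[s];
        size_t n = bytes / sizeof(long);
        long *buf = malloc(bytes);
        if (!buf) return 1;
        memset(buf, 1, bytes);            /* touch pages before timing */

        long sum = 0;
        double t0 = seconds();
        for (int pass = 0; pass < 50; pass++)
            for (size_t i = 0; i < n; i++)
                sum += buf[i];            /* sequential reads */
        double t1 = seconds();

        double gbytes = 50.0 * (double)bytes / 1e9;
        printf("%8zu KiB: %6.2f GB/s (checksum %ld)\n",
               bytes >> 10, gbytes / (t1 - t0), sum);
        free(buf);
    }
    return 0;
}
```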

For power, AMD has taken what it learned with Carrizo and moved it forward. This involves more aggressive monitoring of critical paths around the core, and better control of the frequency and power in various regions of the silicon. Zen will have more clock regions (it seems various parts of the back-end and front-end can be gated as needed), with features that help improve power efficiency, such as the micro-op cache, the Stack Engine (a dedicated low-power address manipulation unit) and move elimination (a low-power method for register adjustment: pointers to registers are adjusted rather than the data going through the high-power scheduler).
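
To make the move elimination idea concrete, here is a minimal toy model in C. This illustrates the general rename-stage technique only, not AMD's actual implementation, and all the names and table sizes are made up: a register-to-register MOV is resolved by updating the rename table so the destination points at the same physical register, and no micro-op ever reaches the scheduler or an execution unit.

```c
#include <stdio.h>

/* Toy model of move elimination at the rename stage (illustration only).
 * Architectural registers are mapped to physical registers; a
 * register-to-register MOV is handled by copying the mapping (a pointer
 * update) instead of issuing a micro-op to an execution unit. */

#define NUM_ARCH_REGS 16
#define NUM_PHYS_REGS 64

static int rename_table[NUM_ARCH_REGS]; /* arch reg -> phys reg index */
static long phys_regs[NUM_PHYS_REGS];   /* physical register file     */

/* "MOV dst, src" handled entirely at rename: no execution unit involved. */
static void rename_stage_mov(int dst_arch, int src_arch) {
    rename_table[dst_arch] = rename_table[src_arch];
}

int main(void) {
    for (int i = 0; i < NUM_ARCH_REGS; i++)
        rename_table[i] = i;        /* initial 1:1 mapping */

    phys_regs[3] = 42;              /* arch reg 3 holds 42 */
    rename_stage_mov(5, 3);         /* "mov r5, r3" eliminated: r5 now
                                       points at the same phys reg */

    printf("arch r5 reads %ld via phys reg %d\n",
           phys_regs[rename_table[5]], rename_table[5]);
    return 0;
}
```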

The Big Core Diagram

We saw this diagram last year, showing some of the bigger features AMD wants to promote:

The improved branch predictor allows for two branches per Branch Target Buffer (BTB) entry, and in the event of tagged instructions the stream will filter through the micro-op cache. On the other side, the decoder can dispatch four instructions per cycle; however, some of those instructions can be fused on the way into the micro-op queue. Fused instructions still come out of the queue as two micro-ops, but they take up less buffer space as a result.
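
As a small worked example of the kind of instruction pair that fusion targets, the trivial C loop below compiles (with a typical x86-64 compiler) to a body whose back edge is a compare followed by a conditional branch; pairs like that are the classic fusion candidates. Exactly which pairs Zen fuses is only described at a high level here, so treat the mapping as illustrative.

```c
#include <stdio.h>

/* Minimal loop whose back edge compiles to a compare-and-branch pair
 * (e.g. "cmp" followed by "jne" on x86-64). Pairs like this are the
 * usual candidates for fusion in the decode/dispatch path described
 * above; whether and how they fuse is microarchitecture-specific. */
int main(void) {
    long sum = 0;
    for (long i = 0; i < 1000000; i++) {
        sum += i;          /* add                                   */
    }                      /* cmp + jne back edge: fusion candidate */
    printf("%ld\n", sum);
    return 0;
}
```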

As mentioned earlier, the INT and FP pipes and schedulers are separated. The INT rename space is 168 registers wide, which feeds into 6x14 scheduling queues, while the FP side employs a 160-entry register file; both the FP and INT sections feed into a 192-entry retire queue. The retire queue can operate at eight instructions per cycle, up from four per cycle in previous AMD microarchitectures.

The load/store units are improved, supporting 72 out-of-order loads, similar to Skylake. We’ll discuss this a bit later. On the FP side there are four pipes (compared to three in previous designs) which support 128-bit FMAC instructions. Two of these can be combined for one 256-bit AVX operation, but anything beyond that has to be scheduled over multiple instructions.
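
To put that in code terms, the snippet below issues a single 256-bit FMA through standard compiler intrinsics (a generic example, not tied to any AMD software): on a design with 128-bit FP pipes, a 256-bit operation like this is what ends up being executed as two 128-bit halves spread across the available pipes.

```c
#include <immintrin.h>
#include <stdio.h>

/* Build with: gcc -O2 -mavx -mfma fma256.c
 * One 256-bit fused multiply-add via intrinsics. On a core with 128-bit
 * FP pipes, an operation this wide is executed as two 128-bit halves
 * scheduled across the pipes, as described above. */
int main(void) {
    __m256 a = _mm256_set1_ps(2.0f);
    __m256 b = _mm256_set1_ps(3.0f);
    __m256 c = _mm256_set1_ps(1.0f);

    __m256 r = _mm256_fmadd_ps(a, b, c);  /* r = a * b + c, 8 floats wide */

    float out[8];
    _mm256_storeu_ps(out, r);
    printf("%f\n", out[0]);               /* prints 7.000000 */
    return 0;
}
```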

Comments

  • Cooe - Sunday, February 28, 2021 - link

    Absolute nonsense. Game code is optimized specifically for the Intel Core pipeline & ESPECIALLY its ring bus interconnect. There's no such thing as "optimizing for x86". Code is either written with the x86 ISA or it's not...
  • FriendlyUser - Thursday, March 2, 2017 - link

    The 1700X with a premium motherboard is cheaper and faster than the 6850K. If you absolutely need the extra PCIe lanes or the 8 DIMM slots, then X99 is better; otherwise you are getting less perf/$.
  • mapesdhs - Thursday, March 2, 2017 - link

    Or a used X79. I'm still rather surprised how close my 3930K/4.8 results are to the test results shown here (CB10/ST = 7935, CB10/MT = 42389, CB11.5/MT = 13.80, CB R15 MT = 1241). People are selling used 3930Ks for as little as 80 UKP now, though finding a decent mbd is a bit more tricky.

    I have an ASUS R5E/6850K setup to test, alongside a used-parts ASUS P9X79-E WS/4960X which cost scarily less than the new X99 setup; it'll be interesting to see how these behave against the KL/BW-E/Ryzen numbers shown here.

    Ian.
  • Aerodrifting - Thursday, March 2, 2017 - link

    "$500 1800x is still too expensive. According to this even a 7700k @ $300 -$350 is still a good choice for gamers."
    The same thing can be said for every Intel extreme platform processor: the $1000 5960X/6900K is still too expensive, the $1600 6950X is too expensive, because the 7700K is better for gaming.
    Then you said "2011-v3 still offers a platform with more PCIe3 lanes and quad memory channel," which directly contradicts what you said earlier about gaming. How do more PCIe3 lanes and quad channel memory improve your FPS when video cards run fine at x8?
    You are too idiotic to even make a coherent argument.
  • lmcd - Thursday, March 2, 2017 - link

    What on earth are you talking about? PCIe3 lanes and quad channel memory are helpful for prosumer workloads. It's not contradictory at all?
  • mapesdhs - Thursday, March 2, 2017 - link

    Yup, quad GPU for After Effects RT3D, and fast RAM makes quite a difference.
  • Notmyusualid - Friday, March 3, 2017 - link

    @mapesdhs:

    Indeed.

    Also, I can actually 'feel' the difference going from dual to quad channel ram performance.

    I checked, and I hadn't correctly seated one of my four 16GB modules...

    Shutdown, reseat, reboot, and it 'felt' faster again.
  • Aerodrifting - Thursday, March 2, 2017 - link

    Learn to read a complete sentence please.
    "nos024" was complaining gaming performance, Then he pulled out extra PCIe3 lanes and quad channel memory to defend X99 platform even though they were also inferior to 7700K in gaming (just like Ryzen). That makes him sound like a completely moron, Because games don't care about those extra PCIe lane or quad channel memory.
  • Notmyusualid - Friday, March 3, 2017 - link

    X99 'inferior'?

    I just popped over to 3DMark11's results page, selected the GPU as the 1080, and I had to scroll down to 199th place (a 7700K clocked to a likely LN2 5.8GHz) to find a system that wasn't triple or quad channel equipped.

    Here: http://www.3dmark.com/search#/?url=/proxycon/ajax/...

    So I guess those lanes don't help us multi-GPU people after all?

    Swallow.
  • Aerodrifting - Saturday, March 4, 2017 - link

    Because the 3DMark11 hall of fame ranking equals real-life gaming performance.

    Are you a moron or just trolling? Everyone knows that when it comes to gaming, a high-frequency i7 (such as the 7700K) beats everything else, including the 8-core Ryzen or the 10-core i7 Extreme 6950X.
