Integer Crunching Power

Each core has two integer execution units (EX0 and EX1) and two AGUs (Address Generation Units). For comparison, the K10 core inside Magny-Cours and Istanbul had three ports, each leading to a “Fully featured ALU + AGU” pair. AMD marketing cleverly drew four pipeline blocks inside the Bulldozer integer core, but those PowerPoint blocks cannot hide the fact that each Bulldozer integer core has fewer execution resources.

In practice, AG0 and AG1 are little more than limited-capability assistants to EX0 and EX1. The Software Optimization Guide for AMD Family 15h Processors lists only a few instructions (page 248 in the January 2012 version) that can be processed by the AG0 and AG1 execution units, and each time the remark "First op to AG0 | AG1, Second to EX0 | EX1" is made. AG0 and AG1 reduce the latency of the CALL and LEA instructions, but the maximum throughput of each integer core inside the Bulldozer module is only two integer instructions per clock cycle. Only when a fused branch (which counts as two instructions) enters EX0 and another integer instruction can enter EX1 do we get a slightly higher throughput of three integer instructions.

So the Bulldozer integer core can execute one fewer integer instruction per cycle (2 vs 3). That doesn’t mean that the Bulldozer integer core is 1/3 slower, however. The integer core of Bulldozer is smaller but also more flexible. The dedicated per-lane 8-entry schedulers are gone, replaced by a single, much larger 40-entry scheduler. This means that Bulldozer should be better at extracting ILP (Instruction Level Parallelism) out of code that has low IPC (Instructions Per Clock).
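To make the "narrower but more flexible" argument concrete, here is a deliberately crude toy model, not a simulation of either real core. It assumes single-cycle latency for every instruction, and it treats K10's three 8-entry lanes as one pooled 24-entry window, which is generous to K10. The function names (cycles_to_execute, two_chains) and the instruction stream are made up for illustration.

```python
def cycles_to_execute(deps, width, window):
    """Count cycles for a greedy scheduler with single-cycle latency.

    deps[i] lists the earlier instruction indices that instruction i waits on.
    Each cycle, up to `width` ready instructions are issued, chosen oldest
    first from the `window` oldest instructions that have not issued yet.
    """
    issue_cycle = [None] * len(deps)   # cycle in which each instruction issued
    pending = list(range(len(deps)))   # not-yet-issued, in program order
    cycle = 0
    while pending:
        cycle += 1
        issued = []
        for i in pending[:window]:
            if len(issued) == width:
                break
            # Ready only if every producer issued in an earlier cycle.
            if all(issue_cycle[d] is not None and issue_cycle[d] < cycle
                   for d in deps[i]):
                issued.append(i)
        for i in issued:
            issue_cycle[i] = cycle
        pending = [i for i in pending if issue_cycle[i] is None]
    return cycle


def two_chains(length):
    """Two dependency chains laid out back to back in program order."""
    deps = []
    for _ in range(2):
        for k in range(length):
            deps.append([] if k == 0 else [len(deps) - 1])
    return deps


stream = two_chains(30)  # hypothetical low-IPC code: parallelism is far apart
print("3-wide, 24-entry window:", cycles_to_execute(stream, width=3, window=24))
print("2-wide, 40-entry window:", cycles_to_execute(stream, width=2, window=40))
```

On this contrived stream the 2-wide core with the 40-entry window needs 30 cycles, while the 3-wide core with the 24-entry window needs 37, because the second chain only becomes visible once it falls inside the window. On fully independent instructions the wider core wins, as expected, so the model only illustrates why issue width alone does not settle the comparison.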

In some integer-intensive applications, the somewhat lower maximum throughput of integer instructions might slow things down. That is the not very useful "it depends" answer, so let's clarify: what kind of applications are we talking about?


84 Comments


  • Aone - Monday, June 4, 2012

    Bulldozer's concept was wrong from the start.
    I have said it a few times; let me explain it here again.

    I'm sure every one of you remembers AMD's own words: "one BD module has 80% of throughput of two independent cores".
    What does this mean in figures?
    Let's take the performance of one core as 1.0 point. Therefore two BD modules would score 3.2 points (2 modules × 1.6), or in other words less than 10% more than 3.0 (the performance of three independent cores).
    Should I remind you that by developing independent cores, AMD wouldn't have wasted resources (engineering, transistors, money and time) on designing and debugging the shared logic? The chip could have been much smaller, since it would have had only 1MB L2 and 2MB L3 per core and no shared logic. And all of those freed resources could have been allocated to the development of a more advanced core.

    You see that packing two cores inside one module was wrong even on the conceptual level. I'm very curious who the main supporter and decision maker of this approach at AMD was.

    AMD must throw away the BD concept and return to standard practice. The only question that remains: does AMD have a long enough TTL to do it?

    BTW, I recommend looking through the SPEC results again. The comparison of the 12c Opteron 62xx with the 12c Opteron 61xx is of special interest. And let's not forget that the Opteron 62xx submissions have a higher frequency, faster memory, a more advanced compiler version and an extended instruction set.
  • TC2 - Monday, June 18, 2012

    I agree 100%!!!
    The BD uA is an "unsuccessful" port from a graphics uA. There are many major drawbacks! Note one, for example: to write optimal software you must adapt an application at the algorithmic level (in the sense of thread specialization)! This is because the two BD cores are not the same! Also, they share the L1 IC, the number of elements is high, ... and there are many other uA weaknesses.
  • evolucion8 - Tuesday, June 17, 2014

    Northwood had a 20-stage pipeline and Prescott had 31 stages, not 39...
  • tipoo - Wednesday, October 8, 2014

    Where is the aftermath?

    "But what about the fourth show stopper? That is probably one of the most interesting ones because it seems to show up (in a lesser degree) in Sandy Bridge too. However, we're not quite ready with our final investigations into this area, so you'll have to wait a bit longer. To be continued...."
