The Architecture

We'll start, logically, at the front end of a Bulldozer module. The fetch and decode logic in each module is shared by both integer cores. Its role is to fetch the next instruction in the thread being executed, decode the x86 instruction into AMD's own internal format, and pass the decoded instruction on to the scheduling hardware for execution.

AMD widened the K8 front end with Bulldozer. Each module is now able to fetch and decode up to four x86 instructions from a single thread in parallel, and each of the four decoders is equally capable. Keep in mind that each Bulldozer module appears as two cores, yet the front end can only pick four instructions to fetch and decode from a single thread at a time. A single Bulldozer module can switch between threads as often as every clock.

Decode hardware isn't very expensive on its own, but duplicating it four times across multiple cores quickly adds up. Although decode width has increased for a single core, multi-core Bulldozer configurations can actually be at a disadvantage compared to previous AMD architectures. Let's look at the table below to understand why:

Front End Comparison
| | AMD Phenom II | AMD FX | Intel Core i7 |
|---|---|---|---|
| Instruction Decode Width | 3-wide | 4-wide | 4-wide |
| Single Core Peak Decode Rate | 3 instructions | 4 instructions | 4 instructions |
| Dual Core Peak Decode Rate | 6 instructions | 4 instructions | 8 instructions |
| Quad Core Peak Decode Rate | 12 instructions | 8 instructions | 16 instructions |
| Six/Eight Core Peak Decode Rate | 18 instructions (6C) | 16 instructions | 24 instructions (6C) |

For a single instruction thread, Bulldozer offers more front end bandwidth than its predecessor. The front end is wider and just as capable so this makes sense. But note what happens when we scale up core count.

Since fetch and decode hardware is shared per module, and AMD counts each module as two cores, given an equivalent number of cores the old Phenom II actually offers a higher peak instruction fetch/decode rate than the FX. The theory is obviously that the situations where you're fetch/decode bound are infrequent enough to justify the sharing of hardware. AMD is correct for the most part. Many instructions can take multiple cycles to decode, and by switching between threads each cycle the pipelined front end hardware can be more efficiently utilized. It's only in unusually bursty situations where the front end can become a limit.
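The arithmetic behind the table can be sketched in a few lines of Python. This is only a back-of-the-envelope model: the function name, its parameters, and the assumption that peak rate is simply decode width multiplied by the number of independent front ends are illustrative, not something AMD or Intel publish in this form.

```python
import math

def peak_decode_rate(cores, decode_width, cores_per_front_end=1):
    """Peak x86 instructions decoded per clock across the whole chip.

    Assumes every front end decodes `decode_width` instructions per clock
    and that one front end is shared by `cores_per_front_end` cores.
    """
    front_ends = math.ceil(cores / cores_per_front_end)
    return front_ends * decode_width

for cores in (1, 2, 4):
    phenom = peak_decode_rate(cores, decode_width=3)
    # Bulldozer: one 4-wide front end per two-core module.
    fx = peak_decode_rate(cores, decode_width=4, cores_per_front_end=2)
    core_i7 = peak_decode_rate(cores, decode_width=4)
    print(f"{cores} core(s): Phenom II {phenom}, FX {fx}, Core i7 {core_i7}")
```

With one core the FX leads the Phenom II, but from two cores onward the shared front end leaves it behind, which is the pattern the table shows.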

Compared to Intel's Core architecture, however, AMD is at a disadvantage here. In the high-end offerings where Intel enables Hyper Threading, AMD has zero advantage, as Intel can weave in instructions from two threads every clock. Against the non-HT enabled Core CPUs, Intel's advantage isn't so clear: Intel maintains a higher instantaneous decode bandwidth per clock, but overall decoder utilization could go down as a result of only being able to fill each fetch queue from a single thread.

After the decoders, AMD enables certain operations to be fused together and treated as a single operation down the rest of the pipeline. This is similar to what Intel calls macro-op fusion, which arrived with the Core 2 generation in 2006 (micro-op fusion, a related but different technique, first appeared in Banias in 2003). Compare + branch, test + branch and some other operations can be fused together after decode in Bulldozer, effectively widening the execution back end of the CPU. This wasn't previously possible in Phenom II and obviously helps increase IPC.

A Decoupled Branch Predictor

AMD didn't disclose too much about the configuration of the branch predictor hardware in Bulldozer, but it is quick to point out one significant improvement: the branch predictor is now significantly decoupled from the processor's front end.

The role of the branch predictor is to intercept branch instructions and predict their target address, rather than allowing for tons of cycles to go by until the branch target is known for sure. Branches are predicted based on historical data. The more data you have, and the better your branch predictors are tuned to your workload, the more accurate your predictions can be. Accurate branch prediction is particularly important in architectures with deep pipelines as a mispredict causes more instructions to be flushed out of the pipe. Bulldozer introduces a significantly deeper pipeline than its predecessor (more on this later), and thus branch prediction improvements are necessary.

In both Phenom II and Bulldozer, branches are predicted in the front end of the pipe alongside the fetch hardware. In Phenom II however, any stall in the fetch pipeline (e.g. fetching an instruction that wasn't in cache) would stop the whole pipeline including future branch predictions. Bulldozer decouples the branch prediction hardware from the fetch pipeline by way of a prediction queue. If there's a stall in the fetch pipeline, Bulldozer's branch prediction hardware is allowed to run ahead and continue making future predictions until the prediction queue is full.
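To make the run-ahead behavior concrete, here is a toy cycle-by-cycle model in Python. It is a sketch under loud assumptions: one prediction per clock, one prediction consumed per clock when fetch is running, and a queue depth of four picked purely for illustration (AMD did not disclose the real figure). The function and parameter names are hypothetical.

```python
from collections import deque

def simulate(fetch_stalled, queue_depth, decoupled):
    """Count branch predictions made over a sequence of clock cycles.

    fetch_stalled: list of bools, True when the fetch pipeline is stalled.
    queue_depth:   capacity of the prediction queue (hypothetical value).
    decoupled:     False models a predictor that stops with fetch,
                   True models one that runs ahead into the queue.
    """
    queue = deque()
    predictions = 0
    for stalled in fetch_stalled:
        # Fetch consumes one queued prediction per cycle while it is running.
        if not stalled and queue:
            queue.popleft()
        # A coupled predictor stops whenever fetch stalls; a decoupled one
        # keeps predicting until the queue is full.
        can_predict = (decoupled or not stalled) and len(queue) < queue_depth
        if can_predict:
            queue.append(predictions)
            predictions += 1
    return predictions

# Two running cycles, a six-cycle fetch stall, then two more running cycles.
cycles = [False] * 2 + [True] * 6 + [False] * 2
print("coupled:  ", simulate(cycles, queue_depth=4, decoupled=False))
print("decoupled:", simulate(cycles, queue_depth=4, decoupled=True))
```

In this model the decoupled predictor gets extra work done during the stall, but only until the queue fills, after which it idles just like the coupled design.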

We'll get to the effectiveness of this approach shortly.

Scheduling and Execution Improvements

As with Sandy Bridge, AMD migrated to a physical register file architecture with Bulldozer. Data is now only stored in one location (the physical register file) and is tracked via pointers back to the PRF as operations make their way through the execution engine. This is a move to save power as copying data around a chip is hardly power efficient.
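The pointer-based tracking can be illustrated with a small Python sketch. This is a toy model, not Bulldozer's actual rename logic: the class name, the eight-entry register file, and the omission of register freeing at retirement are all simplifying assumptions.

```python
class PhysicalRegisterFile:
    """Toy rename scheme: values live in one place, names are pointers."""

    def __init__(self, num_physical):
        self.values = [None] * num_physical     # the only copy of each value
        self.free = list(range(num_physical))   # unallocated physical registers
        self.rename_map = {}                    # architectural name -> index

    def write(self, arch_reg, value):
        # Allocate a fresh physical register and repoint the architectural
        # name at it; no data is copied between structures.
        # (Freeing the previous mapping at retirement is omitted here.)
        index = self.free.pop(0)
        self.values[index] = value
        self.rename_map[arch_reg] = index
        return index

    def read(self, arch_reg):
        # Follow the pointer back into the register file.
        return self.values[self.rename_map[arch_reg]]

prf = PhysicalRegisterFile(num_physical=8)
first = prf.write("rax", 10)
second = prf.write("rax", 20)  # a second writer gets a different physical register
print(first, second, prf.read("rax"))
```

The point of the sketch is that an operation in flight only needs to carry a small index, while the data itself stays put in the register file.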

The buffers and queues that feed into the execution engines of the chip are all larger on Bulldozer than they were on Phenom II. Larger data structures allow for better instruction level parallelism when trying to execute operations out of order. In other words, the issue hardware in Bulldozer is beefier than its predecessor's.

Unfortunately, where AMD took one step forward in issue hardware, it did a bit of a shuffle when it comes to the execution resources themselves. Let's start with the positive: Bulldozer's integer execution cores.

Integer Execution

Each Bulldozer module features two fully independent integer cores. Each core has its own integer scheduler, register file and 16KB L1 data cache. The integer schedulers are both larger than their counterparts in the Phenom II.

The biggest change here is that each integer core now has two AGU/ALU ports instead of the three found in the previous design. AMD claims the third ALU/AGU pair went mostly unused in Phenom II, and as a result it's been removed from Bulldozer.

With larger structures feeding into the integer cores, AMD should have an easier time making use of the integer units than in previous designs. Phenom II could, in theory, execute more integer operations per core; however, AMD claims the architecture was typically bound elsewhere.

The Shared FP Core

A single Bulldozer module has a single shared FP core for use by up to two threads. If there's only a single FP thread available, it is given full access to the FP execution hardware, otherwise the resources are shared between the two threads.

Compared to a quad-core Phenom II, AMD's eight-core (quad-module) FX sees no drop in floating point execution resources. AMD's architecture has always had independent scheduling for integer and floating point instructions, and we see the same number of execution ports between Phenom II cores and FX modules. Just as is the case with the integer cores, the shared FP core in a Bulldozer module has larger scheduling hardware in front of it than the FPU in Phenom II.

The problem is AMD had to increase the functionality of its FPU with the move to Bulldozer. The Phenom II architecture lacks SSE4 and AVX support, both of which were added in Bulldozer. Furthermore, AMD chose Bulldozer as the architecture to include support for fused multiply-add instructions (FMA). Enabling FMA support also increases the relative die area of the FPU. So while the throughput of Bulldozer's FPU hasn't increased over Phenom II, its capabilities have. Unfortunately this means that peak FP throughput running x87/SSE2/3 workloads remains unchanged compared to the previous generation. Bulldozer will only be faster if newer SSE, AVX or FMA instructions are used, or if its clock speed is significantly higher than Phenom II's.

Looking at our Cinebench 11.5 multithreaded workload we see the perfect example of this performance shuffle:

Cinebench 11.5—Multi-Threaded

Despite a 9% higher base clock speed (more if you include turbo core), a 3.6GHz 8-core Bulldozer is only able to outperform a 3.3GHz 6-core Phenom II by less than 2%. Heavily threaded floating point workloads may not see huge gains on Bulldozer compared to their 6-core predecessors.

There's another issue. Bulldozer, at least at launch, won't have to simply outperform its quad-core predecessor. It will need to do better than a six-core Phenom II. In this comparison unfortunately, the Phenom II has the definite throughput advantage. The Phenom II X6 can execute 50% more SSE2/3 and x87 FP instructions than a Bulldozer based FX.
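The 50% figure, and the 9% clock gap quoted for the Cinebench result, both fall out of simple ratios. The Python sketch below assumes, as the text does, that a Phenom II core's FPU and a Bulldozer module's shared FP core have the same per-clock throughput on x87 and SSE2/3 code; the function and variable names are illustrative.

```python
def fp_units(cores, cores_per_fpu):
    """Number of FP cores on a chip, given how many integer cores share each."""
    return cores // cores_per_fpu

phenom_x6 = fp_units(cores=6, cores_per_fpu=1)  # one FPU per core
fx_8core = fp_units(cores=8, cores_per_fpu=2)   # one shared FP core per module

advantage = phenom_x6 / fx_8core - 1  # per-clock FP resource advantage
clock_gain = 3.6 / 3.3 - 1            # FX base clock over Phenom II X6

print(f"Phenom II X6 FP units: {phenom_x6}, FX FP units: {fx_8core}")
print(f"Phenom II X6 peak FP advantage per clock: {advantage:.0%}")
print(f"FX base clock advantage: {clock_gain:.0%}")
```

A clock advantage of roughly 9% cannot by itself offset a 50% deficit in per-clock FP resources, so any FX lead in legacy FP code has to come from elsewhere in the design.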

Since the release of the Phenom II X6, AMD's major advantage has been in heavily threaded workloads, particularly floating point workloads, thanks to the sheer number of resources available per chip. Bulldozer actually takes a step back in this regard, and as a result you will see some of those same workloads perform the same as, if not worse than, on the outgoing Phenom II X6.

Compared to Sandy Bridge, Bulldozer only has two advantages in FP performance: FMA support and higher 128-bit AVX throughput. There's very little code available today that uses AMD's FMA instruction, while the 128-bit AVX advantage is tangible.

Cache Hierarchy and Memory Subsystem

Each integer core features its own dedicated L1 data cache. The shared FP core sends loads/stores through either of the integer cores, similar to how it works in Phenom II, although there are two integer cores to deal with now instead of just one. Bulldozer enables fully out-of-order loads and stores, an improvement over Phenom II that puts it on par with current Intel architectures. The L1 instruction cache is shared by the entire Bulldozer module, as is the L2 cache.

The instruction cache is a large 64KB 2-way set associative cache, similar in size to the Phenom II's L1 instruction cache but obviously shared by more "cores". A four-core Phenom II would have 256KB of total L1 I-Cache, while a four-core Bulldozer will have half of that. The L1 data caches are also significantly smaller than those of Bulldozer's predecessor. While Phenom II offered a 64KB L1 D-Cache per core, Bulldozer only offers 16KB per integer core.
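The L1 totals for a four-core part of each design can be tallied with a short Python sketch. It uses only the per-core and per-module sizes given above; the helper name and its parameters are illustrative.

```python
def total_l1(cores, icache_kb, icache_shared_by, dcache_kb):
    """Return (total L1 I-cache, total L1 D-cache) in KB for a chip.

    icache_shared_by is the number of cores that share one instruction
    cache: 1 for Phenom II, 2 for a Bulldozer module.
    """
    icache_total = (cores // icache_shared_by) * icache_kb
    dcache_total = cores * dcache_kb
    return icache_total, dcache_total

phenom_ii = total_l1(cores=4, icache_kb=64, icache_shared_by=1, dcache_kb=64)
bulldozer = total_l1(cores=4, icache_kb=64, icache_shared_by=2, dcache_kb=16)

print("Four-core Phenom II (I-cache KB, D-cache KB):", phenom_ii)
print("Four-core Bulldozer (I-cache KB, D-cache KB):", bulldozer)
```

The instruction cache total halves, while the data cache total drops to a quarter of what a four-core Phenom II carried.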

The L2 cache, however, is much larger than what we saw in multi-core Phenom II designs. Each Bulldozer module has a private 2MB L2 cache.

There's a single 8MB L3 cache that's shared among all Bulldozer modules on a chip. In its first incarnation, AMD has no plans to offer a desktop part without an L3 cache. However, AMD indicated that the L3 cache was only really useful in server workloads, and we might expect future Bulldozer derivatives (ahem, Trinity?) to forgo the L3 cache entirely as a result.

Cache accesses require more clocks in Bulldozer, due to a combination of size and AMD's desire to make Bulldozer a very high clock speed part...


430 Comments


  • stephenbrooks - Thursday, October 13, 2011 - link

    Not trying to argue with you about the accuracy issue - if "FMA" is defined in a certain way, that's how it's defined and an instruction that rounds differently is a different instruction.

    However, imagine AMD could implement a "Legacy FMA" (LFMA) instruction in their FPU - which would round as if the MUL came first. You could then fuse MUL, ADD pairs into LFMA instructions without producing bugs. Not sure whether the two types of FMA could be done on the same hardware (they are basically different rounding modes) without a large overhead though.

    I don't really understand why there's a big demand for not rounding after the MUL because normally these instructions show up in code like
    for (n=1000;n>0;n--) total+=a[n]*b[n];
    ...and the potential rounding inaccuracy comes in the add stage: there are often lots of adds in sequence, but not normally lots of MULs, and adds suffer more often from the problem of accumulating many small values. Anyway, I know in my code there are lots of instances of doing the "multiply add" operation, and it would be nice to have some sort of CPU acceleration for this.
  • TheDude69 - Wednesday, October 12, 2011 - link

    The Party is over! AMD is no more! They have successfully designed themselves out of the desktop CPU business. I applied for their CEO position and they went with a Moron! You can all kiss low, desktop CPU prices goodbye! Congratulations INTEL you now have a CPU Monopoly!! We can only hope that Nvidia will come through with a Super fast Tegra that will outperform Intel in the netbook arena.

    I don't think AMD can pull a rabbit out of its hat now. Guess I'll cave in and buy an i7.
    I feel like I'm going to puke........................
  • policeman0077 - Wednesday, October 12, 2011 - link

    I am a newbie, have quite a lot of questions after reading the review of bulldozer.

    1. what is heavily thread tasks?does matlab count as heavily thread task?I heard matlab use a lot of FP resource? If so,how can bulldozer beat i5 with only 4 lower efficiency FPs ? Does browsing a lot of website simultaneously count?

    2. if single core cpu A and B have same frequency but different efficiency and work on same task without full load. They will accomplish the task in same time?

    3. so if single core/thread performance is very important, the situation I aforementioned(if is true) totally doesn't show the benefit of a high efficiency core? (didn't consider power consumption.)

    4. Does many application will let a core fully loaded and they won't split the task to another core? What kind of application suit this kind of situation? Any example?

    5. in the case of 2, if another application request cpu's resource, will the core with high efficiency get quicker response?

    6. in the case of 2, consider multi core cpus A and B. if one core of these two cpus are nearly full loaded, at the same time, another application request the source of cpu. And the operate system decide to let this application work on another idle core. Will higher efficiency core response fast?
  • TheDude69 - Wednesday, October 12, 2011 - link

    Don't bother....I have been following this saga since AMD's first CPU and, until today, was an AMD Fan boy. Buy an i7 now before Intel triples the price for this CPU!
  • GustoGuy - Wednesday, October 12, 2011 - link

    I am really surprised that AMD didn't at least match the i7 in a majority of the benchmarks, and what is even more disturbing is that it sometimes performs worse than a Phenom II X4 on some benchmarks. AMD could have tweaked this processor with all the time it took, and it had a stationary target with the i7, so I am perplexed at why they were not able to get it to benchmark at least as well as an entry level i7. Hopefully it will be like the first generation of the Phenom X4, where AMD was able to add an L3 cache and tweak the overclocking abilities so they could release a product that was at least competitive with the Core Duo processors. I like AMD, however they have to be competitive with Intel and cannot afford to give up, in some cases, a 50% deficiency when compared to the i7.
  • Belard - Wednesday, October 12, 2011 - link

    There isn't much AMD can do with this. They are "planning" on having a TICK TICK TICK 10~15% performance increase with yields, higher clocks and tweaks, which what they and intel normally do.

    True, we CANNOT expect AMD to compete directly with Intel. They simply don't have the resources. Not the money, not the talent, not the manufacturing abilities. Perhaps, if they were not HELD DOWN by intel during the AMD32~64 days, they'd have made the much-needed profits to afford a much bigger R&D department. There was a point at the end of the AMD64/X2 dominance in which AMD couldn't make enough CPUs.

    If the 8150 was marketed as it is... a quad core CPU and was across the board, no more than 15% slower than the 2600 at a price of $220 (The 2600 sells at $300) then it would be considered a GOOD buy. But its worse than that in performance and price.

    It takes years to develop a new CHIP. It would take 1-2 years to fix the problems with bulldog, if they could be fixed. But look at it this way, how did intel fix their P4/Netburst problem? Oh yeah, they developed a whole new design!!

    BD is as flawed as the P4. It's very difficult to FIX a HUGE CPU.. and SB is about half as complex and half the size of BD! So what... AMD is going to add even more junk to the design?

    Hence, it costs AMD about twice as much to make such a CPU compared to intel. So do the math. Intel makes more profit per CPU. For AMD to compete, they would need to reduce their price by 25~30% - which means almost NO profit.

    AMD is screwed. They'll really need to work with Llano a lot more... and look at burying Bulldozer with something else.

    If Piledriver does somehow kick butt (there are no indications that it will) - too bad, a large chunk of AMD users would have already moved on to intel. And when Piledriver does finally hit the market, intel will have already released an ever faster CPU.

    Did I say AMD is screwed?
  • Belard - Wednesday, October 12, 2011 - link

    Seriously?

    While I won't call myself an AMD fanboy - as I own both intel and AMD systems... I've been drooling for a Sandy Bridge like AMD part. I buy and sell AMD systems for years for desktop users. In general, I prefer AMD chipsets over intels, I like your GPUs, etc... With the release of BD (Bulldozer) FX chips... the WAIT IS OVER!!!

    My next system will be an Intel... my next customer builds will be intels with 2300~2600k CPUs.

    I *CANNOT* sell my customers a sub-standard part, which is what BD is. Why the hell would I have them spend $250 for a CPU that can't constantly compete with a $150 or 2 year old CPUs?

    I think we know why Rick Bergman left AMD, I don't see him signing off on such a crappy CPU. Seriously, why bother? Llano OLDER Fusion design is more attractive than this insulting FX garbage.

    What AMD has done with the release of these BD/FX chips is created more sales for intel, nothing more. Only a fool would buy an FX 8150... just like the fools who spent $1000 on the Intel EE CPUs (okay, not quite that dumb since these AMDs are 1/4 such prices) These 8core CPUs are actually 4 core, 6 = 3 and 4 is a dual core. An enhanced version of Hyper-threading by Intel 10 years ago.

    There is a SEVERE problem when your "8 core" CPU can't surpass intel's $150 dual core CPUs. Why AMD, why did you take a page out of intel and Nvidia and do the SAME stupid thing? This *IS* your Netburst and FERMI all wrapped in one. A BIG, HOT, EXPENSIVE and SLooooow product that doesn't impress anyone, other than the stupidity of the design. You think WE should wait 2-3 years for you to ramp up speed to 5-6Ghz to say you're competitive with TODAYS intel CPUs? I don't think so.

    After an hour or so of reading this review, here is what happened. 5 sales for desktop builds have just gone to Intel i-whatever-it-is 2500 & 2600s. You make me and others LOOK LIKE FOOLS waiting for Bulldozer or Bulldog to come out and kick some intel butt. You didn't. No, we were NOT expecting you to surpass Sandbridge (SB)... but if your "$250 8150 Best CPU" was at least up against the 2500~2600 in performance, it would be acceptable. But on Newegg - this $280 CPU is slower in most benchmarks to the i5-2400 which is $180. The 8150's power usage and heat is through the roof from the faster i5-2400 which is $100 cheaper and faster in games and most productivity.
    No gamer in their right mind would spend nearly $300 for a CPU that is about 25~40% slower than the similar or cheaper priced Intels. Big deal if they are unlocked... so are the K chips, which would only pull ahead further.
    (We could use a review showing a 5Ghz 8150 vs a 5Ghz 2600K - but I would expect the AMD deficit to remain)

    The heatsinks on SB CPUs are tiny compared to AMD... that means less noise, less heat.

    If a client needs a custom budget computer, I'd go with a $100~130 AMD CPU... that is it. If AMD wants to compete with the CURRENT Sandy Bridge, the 8150 will need to be a sub $200 part (Hey, isn't intel about to drop their prices??) Their BS "4 core" will need to be $100... but we'll need to see how it performs in the real world... to see if its worth that much money.

    This article has over 250 posts in less than 24hrs.... and its the voices of very unhappy AMD users.

    I still can't believe AMD went the P4 route. They spent years trying to CHEAT performance and this is the result? Luckily there is lots of demands for cheap CPUs and ATI GPUs which should keep AMD alive.
  • descendency - Thursday, October 13, 2011 - link

    The thing that bothers me most is the Dirt 3 performance.

    According to an AMD rep at the AMD launch press-conference, games like Dirt 3 would be able to utilize the Bulldozer's "8 cores" to deliver awesome performance. The truth is that it does worse than the 1100T (the one I already own).
  • wolfman3k5 - Thursday, October 13, 2011 - link

    It's finally here! I have been waiting for one of these. It's another Hitler Video, this time it's about Bulldozer. Funny as hell...

    The video can be found here: http://www.youtube.com/watch?v=SArxcnpXStE
  • Artas1984 - Thursday, October 13, 2011 - link

    Hey thanks for notice! I was expecting this already!
