The Bulldozer Review: AMD FX-8150 Tested
by Anand Lal Shimpi on October 12, 2011 1:27 AM EST
We'll start, logically, at the front end of a Bulldozer module. The fetch and decode logic in each module is shared by both integer cores. The role this logic plays is to fetch the next instruction in the thread being executed, decode the x86 instruction into AMD's own internal format, and pass the decoded instruction onto the scheduling hardware for execution.
AMD widened the K8 front end with Bulldozer. Each module is now able to fetch and decode up to four x86 instructions from a single thread in parallel. Each of the four decoders is equally capable. Remember that each Bulldozer module appears as two cores, yet the front end can only pick four instructions to fetch and decode from a single thread at a time. A single Bulldozer module can switch between threads as often as every clock.
Decode hardware isn't very expensive on its own, but duplicating it four times across multiple cores quickly adds up. Although decode width has increased for a single core, multi-core Bulldozer configurations can actually be at a disadvantage compared to previous AMD architectures. Let's look at the table below to understand why:
Front End Comparison
| | AMD Phenom II | AMD FX | Intel Core i7 |
|---|---|---|---|
| Instruction Decode Width | 3-wide | 4-wide | 4-wide |
| Single Core Peak Decode Rate | 3 instructions | 4 instructions | 4 instructions |
| Dual Core Peak Decode Rate | 6 instructions | 4 instructions | 8 instructions |
| Quad Core Peak Decode Rate | 12 instructions | 8 instructions | 16 instructions |
| Six/Eight Core Peak Decode Rate | 18 instructions (6C) | 16 instructions | 24 instructions (6C) |
For a single instruction thread, Bulldozer offers more front end bandwidth than its predecessor. The front end is wider and just as capable so this makes sense. But note what happens when we scale up core count.
Since fetch and decode hardware is shared per module, and AMD counts each module as two cores, given an equivalent number of cores the old Phenom II actually offers a higher peak instruction fetch/decode rate than the FX. The theory is obviously that the situations where you're fetch/decode bound are infrequent enough to justify the sharing of hardware. AMD is correct for the most part. Many instructions can take multiple cycles to decode, and by switching between threads each cycle the pipelined front end hardware can be more efficiently utilized. It's only in unusually bursty situations where the front end can become a limit.
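The arithmetic behind the table can be reproduced with a short sketch. The function and dictionary names here are ours, and the model assumes nothing beyond what the table states: one front end per core on Phenom II and Core i7, and one front end per two-core module on FX.

```python
# Decode width per front end, taken from the comparison table.
DECODE_WIDTH = {"Phenom II": 3, "FX": 4, "Core i7": 4}

# Cores served by one fetch/decode front end. On FX a module's front end
# is shared by two integer cores; the other designs have one per core.
CORES_PER_FRONT_END = {"Phenom II": 1, "FX": 2, "Core i7": 1}


def peak_decode_rate(cpu: str, cores: int) -> int:
    """Peak x86 instructions decoded per clock across `cores` cores."""
    share = CORES_PER_FRONT_END[cpu]
    # Ceiling division: a single active FX core still occupies a whole module.
    front_ends = -(-cores // share)
    return front_ends * DECODE_WIDTH[cpu]


for cores in (1, 2, 4):
    rates = {cpu: peak_decode_rate(cpu, cores) for cpu in DECODE_WIDTH}
    print(f"{cores} core(s): {rates}")
```

These are peak figures only. They say nothing about how often a real workload is actually fetch/decode bound.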
Compared to Intel's Core architecture, however, AMD is at a disadvantage here. In the high-end offerings where Intel enables Hyper-Threading, AMD has zero advantage, as Intel can weave in instructions from two threads every clock. It's only against the non-HT enabled Core CPUs that the picture isn't so clear: Intel maintains a higher instantaneous decode bandwidth per clock, but overall decoder utilization could go down as a result of only being able to fill each fetch queue from a single thread.
After the decoders AMD enables certain operations to be fused together and treated as a single operation down the rest of the pipeline. This is similar to what Intel calls micro-ops fusion, a technology first introduced in its Banias CPU in 2003. Compare + branch, test + branch and some other operations can be fused together after decode in Bulldozer—effectively widening the execution back end of the CPU. This wasn't previously possible in Phenom II and obviously helps increase IPC.
A Decoupled Branch Predictor
AMD didn't disclose too much about the configuration of the branch predictor hardware in Bulldozer, but it is quick to point out one significant improvement: the branch predictor is now significantly decoupled from the processor's front end.
The role of the branch predictor is to intercept branch instructions and predict their target address, rather than allowing for tons of cycles to go by until the branch target is known for sure. Branches are predicted based on historical data. The more data you have, and the better your branch predictors are tuned to your workload, the more accurate your predictions can be. Accurate branch prediction is particularly important in architectures with deep pipelines as a mispredict causes more instructions to be flushed out of the pipe. Bulldozer introduces a significantly deeper pipeline than its predecessor (more on this later), and thus branch prediction improvements are necessary.
In both Phenom II and Bulldozer, branches are predicted in the front end of the pipe alongside the fetch hardware. In Phenom II however, any stall in the fetch pipeline (e.g. fetching an instruction that wasn't in cache) would stop the whole pipeline including future branch predictions. Bulldozer decouples the branch prediction hardware from the fetch pipeline by way of a prediction queue. If there's a stall in the fetch pipeline, Bulldozer's branch prediction hardware is allowed to run ahead and continue making future predictions until the prediction queue is full.
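To make the difference concrete, here is a toy model of a predictor feeding a prediction queue. It is only a sketch: AMD has not disclosed the queue depth, so `queue_capacity` is a made-up parameter, and the one-prediction-per-clock rate is a simplifying assumption.

```python
from collections import deque


def predictions_made(fetch_stalled, queue_capacity, decoupled):
    """Count predictions produced over a sequence of clocks.

    fetch_stalled: one bool per clock, True while the fetch pipeline is stalled.
    queue_capacity: hypothetical prediction queue depth (not disclosed by AMD).
    decoupled: True models a Bulldozer-style predictor, False a coupled one.
    """
    queue = deque()
    made = 0
    for stalled in fetch_stalled:
        # A running fetch pipeline consumes one queued prediction per clock.
        if not stalled and queue:
            queue.popleft()
        # A coupled predictor stops whenever fetch stops; a decoupled one
        # keeps going until the prediction queue is full.
        predictor_active = decoupled or not stalled
        if predictor_active and len(queue) < queue_capacity:
            queue.append("predicted target")
            made += 1
    return made


# One clock of normal fetch, a four-clock stall (e.g. an I-cache miss), then recovery.
clocks = [False, True, True, True, True, False]
print("coupled:  ", predictions_made(clocks, queue_capacity=3, decoupled=False))
print("decoupled:", predictions_made(clocks, queue_capacity=3, decoupled=True))
```

In this model the decoupled predictor gets ahead by however many free queue slots it can fill during the stall, which is why queue depth bounds the benefit.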
We'll get to the effectiveness of this approach shortly.
Scheduling and Execution Improvements
As with Sandy Bridge, AMD migrated to a physical register file architecture with Bulldozer. Data is now only stored in one location (the physical register file) and is tracked via pointers back to the PRF as operations make their way through the execution engine. This is a move to save power as copying data around a chip is hardly power efficient.
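A minimal sketch of the idea follows. It is not AMD's implementation: the class, the eight-entry size and the dictionary-based entries are all invented for illustration. The point is only that the value is written once and everything downstream carries an index.

```python
class PhysicalRegisterFile:
    """Toy PRF: values live in one array; everything else holds indices."""

    def __init__(self, size: int):
        self.values = [0] * size
        self.free = list(range(size))  # slots not currently holding live data

    def allocate(self, value: int) -> int:
        """Store a result once and return the slot index (the 'pointer')."""
        index = self.free.pop(0)
        self.values[index] = value
        return index


prf = PhysicalRegisterFile(size=8)  # size is an arbitrary illustrative choice
rename_map = {}                     # architectural register -> PRF index

# Producing a result stores the data in exactly one place.
rename_map["rax"] = prf.allocate(42)

# Later pipeline structures carry the index rather than a copy of the data.
scheduler_entry = {"op": "add", "src": rename_map["rax"]}
retire_entry = {"dest": rename_map["rax"]}

# Both entries resolve to the same single stored value.
print(prf.values[scheduler_entry["src"]], prf.values[retire_entry["dest"]])
```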
The buffers and queues that feed into the execution engines of the chip are all larger on Bulldozer than they were on Phenom II. Larger data structures allow for better instruction level parallelism when trying to execute operations out of order. In other words, the issue hardware in Bulldozer is beefier than its predecessor's.
Unfortunately, where AMD took one step forward in issue hardware, it did a bit of a shuffle when it comes to the execution resources themselves. Let's start with the positive: Bulldozer's integer execution cores.
Each Bulldozer module features two fully independent integer cores. Each core has its own integer scheduler, register file and 16KB L1 data cache. The integer schedulers are both larger than their counterparts in the Phenom II.
The biggest change here is each integer core now has two ports instead of three. A single integer core features two AGU/ALU ports, compared to three in the previous design. AMD claims the third ALU/AGU pair went mostly unused in Phenom II, and as a result it's been removed from Bulldozer.
With larger structures feeding into the integer cores, AMD should have an easier time making use of the integer units than in previous designs. Phenom II could, in theory, execute more integer operations per core; however, AMD claims the architecture was typically bound elsewhere.
The Shared FP Core
A single Bulldozer module has a single shared FP core for use by up to two threads. If there's only a single FP thread available, it is given full access to the FP execution hardware, otherwise the resources are shared between the two threads.
Compared to a quad-core Phenom II, AMD's eight-core (quad-module) FX sees no drop in floating point execution resources. AMD's architecture has always had independent scheduling for integer and floating point instructions, and we see the same number of execution ports between Phenom II cores and FX modules. Just as is the case with the integer cores, the shared FP core in a Bulldozer module has larger scheduling hardware in front of it than the FPU in Phenom II.
The problem is AMD had to increase the functionality of its FPU with the move to Bulldozer. The Phenom II architecture lacks SSE4 and AVX support, both of which were added in Bulldozer. Furthermore, AMD chose Bulldozer as the architecture to include support for fused multiply-add instructions (FMA). Enabling FMA support also increases the relative die area of the FPU. So while the throughput of Bulldozer's FPU hasn't increased over K8, its capabilities have. Unfortunately this means that peak FP throughput running x87/SSE2/3 workloads remains unchanged compared to the previous generation. Bulldozer will only be faster if newer SSE, AVX or FMA instructions are used, or if its clock speed is significantly higher than Phenom II.
Looking at our Cinebench 11.5 multithreaded workload we see the perfect example of this performance shuffle:
Despite a 9% higher base clock speed (more if you include turbo core), a 3.6GHz 8-core Bulldozer is only able to outperform a 3.3GHz 6-core Phenom II by less than 2%. Heavily threaded floating point workloads may not see huge gains on Bulldozer compared to their 6-core predecessors.
There's another issue. Bulldozer, at least at launch, won't have to simply outperform its quad-core predecessor. It will need to do better than a six-core Phenom II. In this comparison unfortunately, the Phenom II has the definite throughput advantage. The Phenom II X6 can execute 50% more SSE2/3 and x87 FP instructions than a Bulldozer based FX.
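The figures quoted above follow from simple counting. The sketch below assumes, as described earlier, one FP unit per Phenom II core and one shared FP unit per two-core FX module, with equal legacy x87/SSE2/3 throughput per unit. The helper name is ours.

```python
def fp_units(cores: int, cores_per_fp_unit: int) -> int:
    """Number of FP units on a chip given how many cores share each one."""
    return cores // cores_per_fp_unit


phenom_ii_x4 = fp_units(cores=4, cores_per_fp_unit=1)
phenom_ii_x6 = fp_units(cores=6, cores_per_fp_unit=1)
fx_8_core = fp_units(cores=8, cores_per_fp_unit=2)  # four modules

# Against a quad-core Phenom II the FX gives up nothing...
print("FX vs X4 FP units:", fx_8_core, "vs", phenom_ii_x4)

# ...but the X6 has half again as many FP units.
x6_advantage = phenom_ii_x6 / fx_8_core - 1
print(f"X6 FP unit advantage: {x6_advantage:.0%}")

# Base clock advantage of the 3.6GHz FX over the 3.3GHz Phenom II X6.
clock_gain = 3.6 / 3.3 - 1
print(f"FX base clock advantage: {clock_gain:.0%}")
```

This counts peak resources only. Measured results such as the Cinebench score depend on how well each design keeps those units fed.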
Since the release of the Phenom II X6, AMD's major advantage has been in heavily threaded workloads, particularly floating point workloads, thanks to the sheer number of resources available per chip. Bulldozer actually takes a step back in this regard and as a result, you will see some of those same workloads perform the same as, if not worse than, the outgoing Phenom II X6.
Compared to Sandy Bridge, Bulldozer only has two advantages in FP performance: FMA support and higher 128-bit AVX throughput. There's very little code available today that uses AMD's FMA instruction, while the 128-bit AVX advantage is tangible.
Cache Hierarchy and Memory Subsystem
Each integer core features its own dedicated L1 data cache. The shared FP core sends loads/stores through either of the integer cores, similar to how it works in Phenom II, although there are two integer cores to deal with now instead of just one. Bulldozer enables fully out-of-order loads and stores, an improvement over Phenom II that puts it on par with current Intel architectures. The L1 instruction cache is shared by the entire Bulldozer module, as is the L2 cache.
The instruction cache is a large 64KB 2-way set associative cache, similar in size to the Phenom II's L1 cache but obviously shared by more "cores". A four-core Phenom II would have 256KB of total L1 I-Cache, while a four core Bulldozer will have half of that. The L1 data caches are also significantly smaller than Bulldozer's predecessor. While Phenom II offered a 64KB L1 D-Cache per core, Bulldozer only offers 16KB per integer core.
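The L1 totals are easy to tally from the per-core and per-module sizes quoted above. The function names are ours; the sizes are the ones given in the text.

```python
def phenom_ii_l1_kb(cores: int) -> dict:
    """Total L1 capacity in KB: 64KB I-cache and 64KB D-cache per core."""
    return {"L1I": 64 * cores, "L1D": 64 * cores}


def bulldozer_l1_kb(cores: int) -> dict:
    """Total L1 capacity in KB: 64KB I-cache per two-core module,
    16KB D-cache per integer core."""
    modules = cores // 2
    return {"L1I": 64 * modules, "L1D": 16 * cores}


for cores in (4, 8):
    print(f"{cores} cores: Phenom II {phenom_ii_l1_kb(cores)}, "
          f"Bulldozer {bulldozer_l1_kb(cores)}")
```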
The L2 cache is much larger than what we saw in multi-core Phenom II designs however. Each Bulldozer module has a private 2MB L2 cache.
There's a single 8MB L3 cache that's shared among all Bulldozer modules on a chip. In its first incarnation, AMD has no plans to offer a desktop part without an L3 cache. However AMD indicated that the L3 cache was only really useful in server workloads and we might expect future Bulldozer derivatives (ahem, Trinity?) to forgo the L3 cache entirely as a result.
Cache accesses require more clocks in Bulldozer, due to a combination of size and AMD's desire to make Bulldozer a very high clock speed part...
ThaHeretic - Saturday, October 15, 2011
Here's something for a compile test: build the Linux kernel. Something people actually care about.
Loki726 - Monday, October 31, 2011
The Linux kernel is more or less straight C with a little assembly; it is much easier on a compiler frontend and more likely to stress the backend optimizers and code generators.
Chromium is much more representative of a modern C++ codebase. At least, it is more relevant to me.
nyran125 - Saturday, October 15, 2011
What's the point in having 8 cores if it's not even as fast as an Intel 4 core, and you get better performance overall with Intel? Here's the BIG reality: the high end 8 core is not that much cheaper than a 2600K. Like $20-60 MAX. You'd be crazy to buy an 8 core for the same price as an Intel 2600K...
Fiontar - Saturday, October 15, 2011
Well, these numbers are pretty dismal all around. Maybe as the architecture and the process mature, this design will start to shine, but for the first generation, the results are very disappointing.
As someone who is running a Phenom II X6 at a non-turbo core 4.0 Ghz, air cooled, I just don't see why I would want to upgrade. If I got lucky and got a BD overclock to 4.6 Ghz, I might get a single digit % increase in performance over my Phenom II X6, which is not worth the cost or effort.
I guess on the plus side, my Phenom II was a good upgrade investment. Unless I'm tempted to upgrade to an Intel set up in the near future, I think I can expect to get another year or two from my Phenom II before I start to see upgrade options that make sense. (I usually wait to upgrade my CPU until I can expect about a 40% increase in performance over my current system at a reasonable price).
I hope AMD is able to remain competitive with NVidia in the GPU space, because they just aren't making it in the CPU space.
BTW, if the BD can reliably be overclocked to 4.5 GHz+, why are they only selling them at 3.3 GHz? I'm guessing because the added power requirements then make them look bad on power consumption and performance per watt, which seems to be trumping pure performance as a goal for their CPU releases.
Fiontar - Saturday, October 15, 2011
A big thumbs down to Anand for not posting any of the over-clock benchmarks. He ran them, why not include them in the review?
With the BD running at an air cooled 4.5 Ghz, or a water cooled 5.0 Ghz, both a significant boost over the default clock speed, the OC benchmarks are more important to a lot of enthusiasts than the base numbers. In the article you say you ran the benchmarks on the OC part, why didn't you include them in your charts? Or at least some details in the section of the article on the Over-clock? You tell us how high you managed to over-clock the BD and under what conditions, but you gave us zero input on the payoff!
Oscarcharliezulu - Saturday, October 15, 2011
...was going to upgrade my old AM3 system to a BD, just a dev box, but I think a Phenom X6 or 955 will be just fine. Bit sad too.
nhenk--1 - Sunday, October 16, 2011
I think Anand hit the nail on the head mentioning that clock frequency is the major limitation of this chip. AMD even stated that they were targeting a 30% frequency boost. A 30% frequency increase over a 3.2 GHz Phenom II (AM3 launch frequency, I think) would be 4.2 GHz, 17% faster than the 3.6 GHz 8150.
If AMD really did make this chip to scale linearly to frequency increases, and you add 17% performance to any of the benchmarks, BD would roughly match the i7. This was probably the initial intention at AMD. Instead the gigantic die, and limitations of 32nm geometries shot heat and power through the roof, and that extra 17% is simply out of reach.
I am an AMD fan, but at this point we have to accept that we (consumers) are not a priority. AMD has been bleeding share in the server space where margins are high, and where this chip will probably do quite well. We bashed Barcelona at release too (I was still dumb enough to buy one), but it was a relative success in the server market.
AMD needs to secure its spot in the server space if it wants to survive long term. 5 years from now we will all be connecting to iCloud with our ARM powered Macbook Vapor thin client laptops, and a server will do all of the processing for us. I will probably shed a tear when that happens, I like building PCs. Maybe I'll start building my own dedicated servers.
The review looked fair to me, seems like Anand is trying very hard to be objective.
neotiger - Monday, October 17, 2011
"server space where margins are high, and where this chip will probably do quite well."
I don't see how Bulldozer could possibly do well in the server space. Did you see the numbers on power consumption? Yikes.
For servers, power consumption is far more important than it is in the consumer space. And BD draws about TWICE as much power as Sandy Bridge does while performing worse.
BD is going to fail worse in the server space than it will in the consumer space.
silverblue - Monday, October 17, 2011
I'm not sure that I agree.
For a start, you're far more likely to see heavily threaded workloads on servers than in the consumer space. Bulldozer does far better here than with lightly threaded workloads and even the 8150 often exceeds the i7-2600K under such conditions, so the potential is there for it to be a monster in the server space. Secondly, if Interlagos noticeably improves performance over Magny Cours, then the fact that you only need to pop the Interlagos CPU into your G34 system means this should be an upgrade. Finally, power consumption is only really an issue with Bulldozer when you're overclocking. Sure, Zambezi is a hungrier chip, but remember that it's got a hell of a lot more cache and execution hardware under the bonnet. Under the right circumstances, it should crush Thuban, though admittedly we expected more than just "under the right circumstances".
I know very little about servers (obviously), however I am looking forward to Johan's review; it'd be good to see this thing perform to its strengths.
neotiger - Monday, October 17, 2011
First, in the server space BD isn't competing with the i7-2600K. You have to remember that all the current Sandy Bridge i7s waste a big chunk of silicon real estate on the GPU, which is useless in servers. In 3 weeks Intel is releasing the 6 core version of SB, essentially taking the transistors that have been used for the GPU and turning them into 2 extra cores.
Even in highly threaded workloads 8150 performs more or less the same level as i7-2600K. In 3 weeks SB will increase threaded performance by 50% (going from 4 cores to 6). Once again the performance gap between SB and BD will be huge, in both single-threaded and multi-threaded workloads.
Second, BD draws much more power than SB even at stock frequency. This is borne out by the benchmark data in the article.