Cache Improvements

The shared L1 instruction cache grew in size with Steamroller, although AMD isn’t telling us by how much. Bulldozer featured a 2-way 64KB L1 instruction cache, with each “core” using one of the ways. This approach gave Bulldozer less cache per core than previous designs, so the increase here makes a lot of sense. AMD claims the larger L1 can reduce i-cache misses by up to 30%. There’s no word on any possible impact to L1 d-cache sizes.

Although AMD doesn’t like to call it a cache, Steamroller now features a decoded micro-op queue. As x86 instructions are decoded into micro-ops, the address and decoded op are both stored in this queue. Should a fetch come in for an address that appears in the queue, Steamroller’s front end will power down the decode hardware and simply service the fetch request out of the micro-op queue. This is similar in nature to Sandy Bridge’s decoded uop cache, although it is likely smaller. AMD wasn’t willing to disclose how many micro-ops could fit in the queue, other than to say that it’s big enough to get a decent hit rate.
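
The lookup behavior described above can be sketched as a toy model. This is only an illustration of the idea, not AMD's design: the capacity of 8 entries, the FIFO replacement policy, and the names `MicroOpQueue` and `toy_decode` are all assumptions, since AMD has not disclosed the queue's size or organization.

```python
from collections import OrderedDict


class MicroOpQueue:
    """Toy model of a decoded micro-op queue keyed by fetch address."""

    def __init__(self, capacity):
        self.capacity = capacity      # hypothetical; the real size is undisclosed
        self.entries = OrderedDict()  # address -> tuple of decoded micro-ops
        self.hits = 0
        self.decodes = 0

    def fetch(self, address, decode):
        if address in self.entries:
            self.hits += 1            # hit: the decoders could stay powered down
            return self.entries[address]
        self.decodes += 1             # miss: run the decoder and store the result
        uops = decode(address)
        self.entries[address] = uops
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry (assumed FIFO)
        return uops


def toy_decode(address):
    # Stand-in for the x86 decoders; returns a single fake micro-op.
    return (f"uop@{address:#x}",)


queue = MicroOpQueue(capacity=8)
loop_body = [0x1000, 0x1004, 0x1008, 0x100C]
for _ in range(10):
    for addr in loop_body:
        queue.fetch(addr, toy_decode)

print(queue.decodes, queue.hits)  # prints "4 36"
```

In this sketch a four-instruction loop that fits in the queue is decoded once and then served from the queue on every later iteration, which is the case where skipping the decoders would save the most power.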
The L1 to L2 interface has also been improved. Some queues have grown and the logic has been improved.
Finally, on the caching front, Steamroller introduces a dynamically resizable L2 cache. Based on workload and hit rate in the cache, a Steamroller module can choose to resize its L2 cache (powering down the unused slices) in one-quarter increments. AMD believes this is a huge power win for mobile client applications such as video decode (not so much for servers), where the CPU only has to wake up for short periods of time to run minor tasks that don’t have large L2 footprints. The L2 cache accounts for a large chunk of AMD’s core leakage, so shutting half or more of it down can definitely help with battery life. The resized cache is no faster (same access latency); it just consumes less power.
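
As a rough sketch of the quarter-step resizing idea, the snippet below picks the smallest number of L2 quarters that covers a workload's footprint. Everything numeric here is an assumption for illustration: the 2048KB L2 size, the example footprints, and the linear leakage model are not from AMD, and the real policy (which AMD says also considers hit rate) has not been described in detail.

```python
import math

L2_SIZE_KB = 2048  # assumed module L2 size; not stated in the article
QUARTERS = 4       # resizing happens in quarter steps


def active_quarters(footprint_kb):
    """Smallest number of L2 quarters (1-4) that holds the working set."""
    quarter_kb = L2_SIZE_KB / QUARTERS
    needed = math.ceil(footprint_kb / quarter_kb)
    return min(QUARTERS, max(1, needed))


def relative_l2_leakage(quarters_on):
    """Leakage relative to a fully powered L2, assuming linear scaling."""
    return quarters_on / QUARTERS


# Hypothetical footprints, chosen only to exercise each case.
for name, footprint_kb in [("light background task", 300),
                           ("desktop application", 1100),
                           ("server workload", 4096)]:
    q = active_quarters(footprint_kb)
    print(f"{name}: {q}/4 of L2 powered, leakage x{relative_l2_leakage(q):.2f}")
```

Under these assumptions the light task keeps one quarter powered while the server workload keeps the full cache, which lines up with AMD's framing of this as a mobile win rather than a server one.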
Steamroller brings no significant reduction in L2/L3 cache latencies. AMD says it has isolated the reason for the unusually high L3 latency in the Bulldozer architecture, but fixing it isn’t a top priority. Given that most consumers (read: notebooks) will only see L3-less processors (e.g. Llano, Trinity), and many server workloads are less sensitive to latency, AMD’s stance makes sense.

Looking Forward: High Density Libraries

This one falls into the reasons-we-bought-ATI column: future AMD CPU architectures will employ higher levels of design automation and new high density cell libraries, both heavily influenced by AMD’s GPU group. Automated place and route is already commonplace in AMD CPU designs, but AMD is going even further with this approach.
The methodology comes from AMD’s work in designing graphics cores, and we’ve already seen some of it used in AMD’s ‘cat cores (e.g. Bobcat). As an example, AMD demonstrated a 30% reduction in area and power consumption when these new automated procedures with high density libraries were applied to a 32nm Bulldozer FPU.

The power savings come from not having to route clocks and signals as far, while the area savings are a result of the computer-automated transistor placement/routing and higher density gate/logic libraries.
The tradeoff is peak frequency. These heavily automated designs won’t be able to clock as high as the older hand-drawn designs. AMD believes the sacrifice is worth it, however, because in power constrained environments (e.g. a notebook) you won’t hit max frequency regardless, and you’ll instead see a 15-30% energy reduction per operation. AMD equates this with the power savings you’d get from a full process node improvement.
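
To put the 15-30% figure in perspective, a quick back-of-the-envelope calculation converts an energy-per-operation reduction into work done per joule. This is plain arithmetic on AMD's quoted range; the function name `ops_per_joule_gain` is just a label for this sketch.

```python
def ops_per_joule_gain(energy_reduction):
    """Factor by which work per joule rises when energy per op falls."""
    return 1.0 / (1.0 - energy_reduction)


for reduction in (0.15, 0.30):
    gain = ops_per_joule_gain(reduction)
    print(f"{reduction:.0%} less energy per op -> {gain:.2f}x operations per joule")
# prints:
# 15% less energy per op -> 1.18x operations per joule
# 30% less energy per op -> 1.43x operations per joule
```

In other words, within a fixed power budget the same battery charge would cover roughly 1.18x to 1.43x as many operations, assuming the quoted reduction applied uniformly to the workload.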
We won’t see these new libraries and automated designs in Steamroller, but rather in its successor, Excavator, in 2014.

Final Words

Steamroller seems like a good evolutionary improvement to AMD’s Bulldozer and Piledriver architectures. While Piledriver focused more on improving power efficiency, Steamroller should make a bigger impact on performance.
The architecture is still slated to debut in 2013 on GlobalFoundries' 28nm bulk process. The improvements look good on paper, but the real question remains whether or not Steamroller will be enough to go up against Haswell.

  • thehat2k5 - Wednesday, August 29, 2012 - link

    at the very least we are suggesting Radeon 7870 or GTX570. How much are those where you come from? Up here, there is no way i can build you a computer for $469 that we will put our name on and certify it for BF3 at Ultra!
  • Origin64 - Thursday, August 30, 2012 - link

    You need at least 2GB of vram to run BF3 on ultra. And the flops to match it of course. Good luck getting that under 400 bucks.
  • Hardin4188 - Wednesday, August 29, 2012 - link

    Is it ok if I laugh at all seven of your employees?
  • thehat2k5 - Wednesday, August 29, 2012 - link

    it sure is ok, as long as you can link me the parts you are using to make this miracle machine;)
  • Novulux - Wednesday, August 29, 2012 - link

    I built a PC for my younger brother with $250 of parts from Microcenter, and gave him two HD 5770s bought on Ebay for ~$110 for both. Bought an HDD from Newegg for $70. He only plays at 1440x900 though.
  • Spunjji - Thursday, August 30, 2012 - link

    Yes, and presumably not at Ultra settings, unless he likes his textures being swapped in/out of RAM constantly..?
  • CeriseCogburn - Wednesday, August 29, 2012 - link

    Close your doors hat2k5 - you don't know what you're doing.
    Not surprised.
  • hapkiman - Sunday, September 2, 2012 - link

    Not trying to stir the pot, but I get around 60 FPS consistently on BF3 with ALL settings on Ultra. Everything, including having ambient occlusion turned on.

    I have a overclocked XFX Radeon HD 6950, which is not a $500 card.

    I have a i7 3770, and 16GB of 1600MHz RAM, and the game and all maps are loading from an Intel 520 160GB SSD.

    Believe it or not but its the truth. I just finished a game on Back to Karkand map, and I averaged 50-60 FPS, with spikes going well over 60.
  • AssBall - Wednesday, August 29, 2012 - link

    Cool story, Bro.
  • CeriseCogburn - Wednesday, August 29, 2012 - link

    You just can't make up the crap that amd fan boys do, since they are clueless.

    The i3 2100 STOMPS the fx4100 .
