As mentioned earlier, Bulldozer is a brand new architecture from AMD that shares nothing with the Phenom II. So how was it designed? The engineers did not start over by reinventing the 4004; decades of CPU design cannot be thrown away so easily, after all. Instead, AMD started from the general idea of what a modern processor looks like. Simply put, a processor core is composed of instruction fetch and decode stages, floating-point and integer execution units, some cache, and a link to a northbridge that handles memory access and further I/O. Single-core processors have become quite rare, with the exception of low-power platforms such as the entry-level AMD Fusion processors, the Intel Atom or the VIA Nano, so a chip usually contains at least two cores.

This basic concept has been improved over time, but it may have reached its limits and needs some drastic changes to keep marching forward. AMD's focus here was to maximize instructions per watt while offering more cores, which is the best way to increase the overall throughput of a server or cluster. Obviously, the smaller a core is, the more of them fit on a chip. In this field of engineering, there is a principle which says that the most common use case must be favored. Giving every core its own full FPU does not go hand in hand with that principle: according to AMD, floating-point operations account for only 20% of CPU usage compared to 80% for integer operations, and they are much more complex, thus requiring a lot of die space. To save some of it, AMD started off with the idea of sharing one FPU between two cores. However, many computing applications make intensive use of floating-point operations, and newer instruction sets have recently appeared to boost their performance, such as the 256-bit Advanced Vector Extensions (AVX) featured on Intel Sandy Bridge processors. The FPU in Bulldozer supports them as well, but when they are not in use, each core in the module has access to half of the pipelines for 128-bit calculations.

The saved die space has been spent on aggressive features that benefit both cores, notably prefetching. The shared frontend prefetches instructions dynamically, according to the destination addresses of branches stored in the two levels of the branch target buffer, which are 512B and 5KB in size, respectively. For those unfamiliar with branch prediction, what is stored in these buffers is the memory address of the instruction located at the branch destination. A rule of thumb states that a processor spends 90% of its time in 10% of the code; when a branch is taken, it is likely to be taken again very soon, so keeping its previous target at hand instead of waiting for it to be computed can save some precious cycles. The prediction pipeline is free to run as long as its queue (one is dedicated to each thread) is not full. By looking at the Relative Instruction Pointers (RIP), the instruction fetch pipeline can then predict future cache misses. As for the 64KB instruction cache, the two threads compete dynamically for it. There is a slight problem though: in some specific cases, there can be an excessive number of cache invalidations, forcing the instructions to be fetched again. Back in July, some folks at AMD and others, including Linus Torvalds himself, discussed patching the Linux kernel to prevent this, which has not been done so far. Supposedly this would lead to a 3% sacrifice in performance, but rumors posit a higher figure on Windows. This bug doesn't affect the viability of the system like the Intel P67's premature SATA degradation or the AMD TLB bug did, though. Fixing it in the next core revision would of course be better than software workarounds, and would allow for a measurable performance boost. Finally, another big difference from Phenom II is the addition of a fourth instruction decoder, which puts it on par with Sandy Bridge. All of these improvements should help maximize the use of the execution units.
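To make the branch target buffer idea more concrete, here is a minimal Python sketch of a direct-mapped buffer that remembers the last target of each branch. It is a toy model only: the class name, the 16-entry size, the indexing scheme and the `run_loop` helper are all assumptions made for illustration and say nothing about how AMD organizes its two BTB levels.

```python
class BranchTargetBuffer:
    """Toy direct-mapped branch target buffer (illustrative, not AMD's design)."""

    def __init__(self, entries=16):
        # The size is an arbitrary toy value, not a Bulldozer figure.
        self.entries = entries
        # Each slot holds (branch_address, target_address) or None.
        self.slots = [None] * entries

    def _index(self, branch_addr):
        # Hypothetical indexing: low-order part of the branch address.
        return branch_addr % self.entries

    def predict(self, branch_addr):
        """Return the cached target for this branch, or None on a miss."""
        slot = self.slots[self._index(branch_addr)]
        if slot is not None and slot[0] == branch_addr:
            return slot[1]
        return None

    def update(self, branch_addr, target_addr):
        """Record the resolved target once the branch has executed."""
        self.slots[self._index(branch_addr)] = (branch_addr, target_addr)


def run_loop(btb, branch_addr, target_addr, iterations):
    """Count BTB hits for a loop branch taken `iterations` times."""
    hits = 0
    for _ in range(iterations):
        if btb.predict(branch_addr) == target_addr:
            hits += 1
        btb.update(branch_addr, target_addr)
    return hits
```

In this model a loop branch misses only on its first pass, which is the "likely to be taken again very soon" behavior described above. Two branches whose addresses map to the same slot evict each other, one reason real designs add associativity and a larger second level.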


Each ALU has its own thread scheduler. One can see the pipelines for division and multiplication, as well as address generation. The latter serves the fully out-of-order load and store unit, which can handle two 128-bit loads and one 128-bit store per cycle. The load and store queues are 40 and 24 entries long, respectively, and the data cache is 16KB in size. There is also some register renaming going on in there to avoid unnecessary data hazards, or dependencies. For example, if one instruction stores the value of a given register in memory, and the following instruction wants to use the same register to hold a result from the ALU, the result is put in another register instead of waiting for the memory store to complete.
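The renaming example above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the `RenameTable` class, the register names and the number of physical registers are hypothetical, and the sketch never frees physical registers, which real hardware does when instructions retire.

```python
class RenameTable:
    """Toy register rename table mapping architectural to physical registers."""

    def __init__(self, arch_regs, num_physical):
        # Initially architectural register i maps to physical register i.
        self.mapping = {name: i for i, name in enumerate(arch_regs)}
        # Remaining physical registers are free (never recycled in this sketch).
        self.free = list(range(len(arch_regs), num_physical))

    def rename(self, dest, sources):
        """Rename one instruction; returns (physical_dest, physical_sources)."""
        # Sources read the current mapping, before the destination is remapped.
        phys_sources = [self.mapping[s] for s in sources]
        phys_dest = None
        if dest is not None:
            # Every write gets a fresh physical register.
            phys_dest = self.free.pop(0)
            self.mapping[dest] = phys_dest
        return phys_dest, phys_sources


table = RenameTable(["r1", "r2", "r3"], num_physical=8)
# Instruction 1: store r1 to memory (reads r1, writes no register).
store = table.rename(None, ["r1"])
# Instruction 2: r1 = r2 + r3 (writes r1 while the store may still be pending).
add = table.rename("r1", ["r2", "r3"])
```

Here the store keeps reading physical register 0 while the addition writes its result to physical register 3, so the second instruction does not have to wait for the store.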

The FPU, as explained above, is shared between two cores. To allow such a configuration, AMD has adopted a coprocessor arrangement. The unified FP scheduler manages both threads, and when execution is completed, the parent core is notified. Two of the pipelines are Fused Multiply-Accumulate (FMAC) units which, in the four-operand form adopted by AMD, can be described as follows, with the arrow representing a store operation: A ← B + C × D. The upcoming processors from Intel are also going to feature FMA pipes; however, they will use the three-operand, or destructive, form: A ← A + B × C. Obviously, keeping the addend unmodified has its advantages. In the three-operand form A is overwritten, so if its value is needed by other operations, it must be copied to another register before the FMA3 operation, thus adding more instructions. To maintain compatibility, AMD will also support the three-operand form in the next core, dubbed Piledriver. The other two pipelines in the coprocessor are actually integer pipelines, which can also work with 128-bit operands for the SSE instructions. They take care of the operations in the XOP instruction set as well, which, along with FMA4, forms what was originally supposed to be SSE5, first proposed by AMD back in 2007. Once again, the reason for this change is better compatibility with Intel's instruction set. XOP contains integer vector operations such as multiply-accumulate, compare, shift, rotate, permute, and more. So there is a great opportunity for developers to get tremendous boosts in speed with these new SIMD instructions.
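The difference between the two forms can be illustrated with a small Python model of a register file. The function names `fma4`, `fma3` and `mov` are hypothetical helpers, not real instruction mnemonics, and the sketch uses integers because plain Python arithmetic does not reproduce the single rounding step of a true fused operation.

```python
def fma4(regs, dest, addend, mul1, mul2):
    """Four-operand form: dest <- addend + mul1 * mul2; sources are untouched."""
    regs[dest] = regs[addend] + regs[mul1] * regs[mul2]


def fma3(regs, acc, mul1, mul2):
    """Three-operand (destructive) form: acc <- acc + mul1 * mul2."""
    regs[acc] = regs[acc] + regs[mul1] * regs[mul2]


def mov(regs, dest, src):
    """Plain register copy, needed to preserve a value before fma3."""
    regs[dest] = regs[src]


# Goal: compute a = b + c * d while keeping b available afterwards.
regs4 = {"a": 0, "b": 1, "c": 2, "d": 3}
fma4(regs4, "a", "b", "c", "d")       # one instruction

regs3 = {"a": 0, "b": 1, "c": 2, "d": 3}
mov(regs3, "a", "b")                  # extra copy so b survives
fma3(regs3, "a", "c", "d")            # two instructions in total
```

Both sequences leave 7 in `a` and 1 in `b`, but the three-operand version needs the additional copy, which is the extra instruction mentioned above.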

Then there is the 16-way unified L2 cache, 2MB in size. Since it is shared between two of the eight cores, the core on which a given thread is scheduled might affect performance; if two threads of a program are in the same module, they will share their L2, otherwise they have to rely on the slower L3. The Windows scheduler is obviously not aware of such an implementation detail, but the Windows 8 developer preview supposedly shows some benefits thanks to its better scheduler. It is also important to note that, unlike the L1 cache, the L2 is exclusive with regard to its higher sibling, which results in a total of 16MB of data. Additionally, the 8-way L2 translation lookaside buffer, used to convert between virtual and physical memory addresses, has 1024 entries and services both instruction and data requests. Finally, there are data prefetchers which try to predict data use and bring data into the cache before the processor executes the load.
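As a rough illustration of why thread placement matters, the following Python sketch checks whether two cores share an L2 and adds up the exclusive cache capacity. The core numbering (cores 2k and 2k+1 forming module k) is an assumption made for the example, and the 8MB L3 figure is simply what the 16MB total implies for four modules with 2MB of L2 each.

```python
CORES_PER_MODULE = 2
L2_PER_MODULE_MB = 2


def module_of(core_id):
    # Assumption: cores 2k and 2k+1 belong to module k.
    return core_id // CORES_PER_MODULE


def share_l2(core_a, core_b):
    """True if both cores sit in the same module and thus share its L2."""
    return module_of(core_a) == module_of(core_b)


def total_exclusive_cache_mb(modules, l3_mb):
    """With an exclusive L2, L2 and L3 capacities simply add up."""
    return modules * L2_PER_MODULE_MB + l3_mb


same_module = share_l2(0, 1)          # threads can exchange data through the L2
different_module = share_l2(1, 2)     # threads must go through the L3
total_mb = total_exclusive_cache_mb(modules=4, l3_mb=8)
```

Under these assumptions, cores 0 and 1 share an L2 while cores 1 and 2 do not, and the total comes to 4 × 2MB + 8MB = 16MB.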

The integrated northbridge has also been redesigned. After the synchronization between the four modules, the requests are sent to either the L3 cache or the memory and the rest of the system via the HT link. There are also two Advanced Programmable Interrupt Controllers (APIC).

There is also an Application Power Management (APM) module somewhere in there which measures the TDP headroom for Turbo Core 2.0. If there is enough of it, all cores can get a 300MHz increase, significantly boosting performance. This happens when an application uses more than four cores but doesn't load them up to 100%. The major difference from the previous version seen in the Thuban die is the addition of a second Turbo mode. If no more than half of the cores are active, the unused modules can go into the C6 state and allow the others to climb another 300MHz, for a total of 600MHz on the FX-8150. On the FX-8120 model, this ramps up to 900MHz above stock! Again, there is a small hiccup with the current Windows scheduler: if the threads are not running on the right modules, this Turbo mode won't work. Hopefully a patch to the Windows 7 scheduler will arrive soon.
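The two Turbo modes, and the scheduler hiccup, can be modeled with a short Python sketch. The function name, the core numbering (cores 2k and 2k+1 in module k) and the rule that at least half of the modules must be completely idle are assumptions made for illustration; only the 300, 600 and 900MHz offsets come from the text above.

```python
def turbo_boost_mhz(active_cores, total_cores=8, has_tdp_headroom=True,
                    all_core_boost=300, max_boost=600):
    """Return the clock increase over stock, in MHz, for a set of busy cores."""
    if not has_tdp_headroom:
        return 0
    modules = total_cores // 2
    # Assumption: cores 2k and 2k+1 belong to module k.
    busy_modules = {core // 2 for core in active_cores}
    idle_modules = modules - len(busy_modules)
    # Second Turbo mode: at most half the cores active AND (assumed rule)
    # at least half of the modules entirely idle so they can enter C6.
    if len(active_cores) <= total_cores // 2 and idle_modules >= modules // 2:
        return max_boost
    return all_core_boost


packed = turbo_boost_mhz({0, 1, 2, 3})        # four threads on two modules
spread = turbo_boost_mhz({0, 2, 4, 6})        # four threads, one per module
fx8120 = turbo_boost_mhz({0, 1}, max_boost=900)
```

With four threads packed onto two modules the model returns the full 600MHz, while the same four threads spread over all four modules only get 300MHz, because no module can be power gated. That is the scheduling problem described above.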

That C6 state implies power gating the whole unused module. First, after a predetermined period of inactivity, the L2 cache is flushed and the registers' contents are saved. Then, some FETs essentially isolate the module from the ground. To resume, they close the circuit again and the execution context is restored from the save area. There is also some clock gating going on in the modules, which essentially brings the frequency down to zero, but at a more granular level. Some parts of the northbridge can also be power gated when not in use.
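The entry and exit sequence can be sketched as a small Python state model. Everything here is hypothetical (the class, its fields and the dictionary standing in for memory), and the inactivity timer is left out; the sketch only mirrors the order of the steps described above.

```python
class ModulePowerGate:
    """Toy model of a module entering and leaving the C6 state."""

    def __init__(self, registers, l2_lines):
        self.registers = dict(registers)   # live register state
        self.l2_lines = dict(l2_lines)     # contents of the module's L2 cache
        self.saved_context = None          # save area that stays powered
        self.powered = True

    def enter_c6(self, memory):
        # 1. Flush the L2 so no data is lost when power is cut.
        memory.update(self.l2_lines)
        self.l2_lines.clear()
        # 2. Save the registers' contents outside the gated region.
        self.saved_context = dict(self.registers)
        # 3. Power gate: state held inside the module is lost.
        self.registers.clear()
        self.powered = False

    def exit_c6(self):
        # Restore power first, then the execution context.
        self.powered = True
        self.registers = dict(self.saved_context)
        self.saved_context = None


memory = {}
module = ModulePowerGate({"rax": 1}, {0x40: 7})
module.enter_c6(memory)
gated_registers = dict(module.registers)
module.exit_c6()
```

After `enter_c6` the cache line has been written back to `memory` and the module holds no state; after `exit_c6` the registers are back as they were.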

I'm sort of disappointed, but credz to AMD, for the jump in power.
the 4100 kind of looks worth my time, how does it compare to a 955 be?
My guess would be that they are approximately equal, since there is a 400MHz difference. The much improved Turbo Core should help it pull ahead though, and if the application can use the newer AVX, FMA4 or XOP instructions, the 4100 is going to have a big advantage.
I will see if Neoseeker can get its hands on one. Stay tuned!
http://www.guru3d.com/article/amd-fx-8150--8120-6100-and-4100-performance-review/1
I'd really like AMD to go after performance/die space as they do with the graphics card lineup. If they could optimize the x86 cores as well as they do their shader clusters per die space they would at least make a little more money off of their underperforming cpu's instead of trying to be competitive with a much smaller die sandybridge that beats it in performance. Of course I imagine it's at least a little harder to design an x86 core over tacking on more shaders, rops, texture units, and a new UVD every now and then.
The bulldozer CPU design is very intriguing and creative with all the new design ideas that it uses but at the end of the day it comes down to whether it can perform or not, and on this... well, you've seen the benchmarks. Seems we're back to the days of the original phenom x4's, decent enough for most things, uses more power, larger die size, not competitive in the high end.
Here's to hoping they get their asses in gear.
This week we launched the highly anticipated AMD FX series of desktop processors. Based on initial technical reviews, there are some in our community who feel the product performance did not meet their expectations of the AMD FX and the “Bulldozer” architecture. Over the past two days we’ve been listening to you and wanted to help you make sense of the new processors. As you begin to play with the AMD FX CPU processor, I foresee a few things will register:
In our design considerations, AMD focused on applications and environments that we believe our customers use – and which we expect them to use in the future. The architecture focuses on high-frequency and resource sharing to achieve optimal throughput and speed in next generation applications and high-resolution gaming.
Here’s some example scenarios where the AMD FX processor shines:
Playing the Latest Games
A perfect example is Battlefield 3. Take a look at how our test of AMD FX CPU compared to the Core i7 2600K and AMD Phenom™ II X6 1100T processors at full settings:
Map    | Resolution                | AMD FX-8150 | Sandy Bridge i7 2600k | AMD Phenom™ II X6 1100T
MP_011 | 1650x1080x32 max settings | 39.3        | 37.5                  | 36.3
MP_011 | 1920x1200x32 max settings | 33.2        | 31.8                  | 30.6
MP_011 | 2560x1600x32 max settings | 21.4        | 20.4                  | 19.9
Benchmarking done with a single AMD Radeon™ HD 6970 graphics card
Creating in HD
Those users running time intensive tasks are going to want an AMD FX processor for applications like x264, HandBrake, Cinema4D where an eight-core processor will rip right along.
Building for the Future
This is a new architecture. Compilers have recently been updated, and programs have just started exploring the new instructions like XOP and FMA4 (two new instructions first supported by the AMD FX CPU) to speed up many applications, especially when compared to our older generation.
If you are running lightly threaded apps most of the time, then there are plenty of other solutions out there. But if you’re like me and use your desktop for high resolution gaming and want to tackle time intensive tasks with newer multi-threaded applications, the AMD FX processor won’t let you down.
We are a company committed to our customers and we’re constantly listening and working to improve our products. Please let us know what questions you have and we’ll do our best to respond.
Adam Kozak is a product marketing manager at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.