Author: William Henning
Editor: Howard Ha
Publish Date: Monday, October 29th, 2007
Originally Published on Neoseeker (http://www.neoseeker.com)
Article Link: http://www.neoseeker.com/Articles/Hardware/Reviews/qx9650_penryn_review/
Copyright Neo Era Media, Inc. - please do not redistribute or use for commercial purposes.
Months of anticipation and speculation are about to end, as Intel prepares the launch of the first 45nm Core 2 CPUs known as "Penryn". This is the new kid on the block – and while you can’t buy one yet, we are here to tell you about Penryn, and how it performs. We spent several weeks with a Core 2 Extreme QX9650 quad core Yorkfield processor and we're pretty excited to finally report it all to you.
The QX9650 Penryn is built on Intel’s 45nm High-k metal gate silicon technology, which features transistors with reduced current leakage, thus reducing power consumption and allowing for increased clock speeds. More than just a die shrink, Penryn also has a number of interesting enhancements to the extremely successful Core 2 microarchitecture, which we cover in the sections that follow.
Our particular chip, the QX9650, is the second fastest Yorkfield being released in November, with its four cores running at 3.0GHz (1333MHz FSB with a 9x multiplier). The CPU-Z and BIOS shots below give you the details.
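As a quick illustration of how the 3.0GHz figure falls out of the FSB and multiplier, here is a minimal sketch. It assumes the 1333MHz FSB rating is quad-pumped from a base clock of roughly 333MHz; the function name is ours, purely for illustration.

```python
def core_clock_mhz(fsb_effective_mhz: float, multiplier: float) -> float:
    """Estimate the core clock from the quad-pumped FSB rating and the CPU multiplier."""
    # Assumption: the FSB rating counts four transfers per base clock cycle.
    base_clock_mhz = fsb_effective_mhz / 4
    return base_clock_mhz * multiplier


stock = core_clock_mhz(1333, 9)    # the QX9650 at stock settings
reduced = core_clock_mhz(1333, 8)  # the 333x8 setting used later for comparisons
print(f"333x9: about {stock:.0f} MHz")
print(f"333x8: about {reduced:.0f} MHz")
```

The same arithmetic explains the 333x8 configuration used in the clock for clock comparison further on, which works out to roughly 2.67GHz.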
Penryn quad core processors are constructed as a multi-chip module consisting of two dual core 400+ million transistor Penryn dies, each with up to 6MB of L2 cache, and come in a Socket 775 (LGA775) land grid array package.
The 45nm process allows for roughly twice the transistor density in the same die area. Intel also quotes an approximately 30% reduction in transistor switching power and better than 20% greater switching speed, alongside a fivefold reduction in source-drain leakage power and a greater than tenfold reduction in transistor gate oxide leakage.
Changes to Architecture
Intel is pulling no punches with its Penryn core. Some of the changes they've implemented are targeted directly at server platforms, while others are generally useful all round.
L2 Cache changes
The L2 cache has been increased to 6MB per dual core die, a 50% increase, with 24-way associativity. Quad core packages will therefore have a total of 12MB of L2 cache, further reducing the number of cache “misses” that will be encountered by most software.
The “split load” capability will reduce the current penalty for reading a data item where parts of the item are located in two different cache lines.
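To make the cache numbers a little more concrete, here is a rough sketch of the geometry involved. It assumes a 64-byte cache line, and it assumes the previous generation's 4MB L2 was 16-way associative; neither figure is stated in this article, and the helper names are ours.

```python
LINE_BYTES = 64           # assumption: 64-byte cache lines
MB = 1024 * 1024


def cache_sets(size_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    return size_bytes // (ways * line_bytes)


def is_split_load(address: int, size: int, line_bytes: int = LINE_BYTES) -> bool:
    """True when a load of `size` bytes at `address` straddles two cache lines."""
    first_line = address // line_bytes
    last_line = (address + size - 1) // line_bytes
    return first_line != last_line


print(cache_sets(6 * MB, 24))   # Penryn: 6MB, 24-way  -> 4096 sets
print(cache_sets(4 * MB, 16))   # assumed predecessor: 4MB, 16-way -> 4096 sets
print(is_split_load(56, 8))     # bytes 56..63 stay in one line -> False
print(is_split_load(60, 8))     # bytes 60..67 cross into the next line -> True
```

Under these assumptions the number of sets stays the same and the extra capacity comes entirely from additional ways. The second helper shows the kind of access the “split load” improvement targets: an 8 byte read starting at offset 60 touches two lines.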
Radix-16 Divider
By doubling the number of quotient bits computed in each iteration of a division instruction from two bits of quotient to four, integer and floating point divide (and modulus) operations are doubled in speed.
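The effect of retiring four quotient bits per iteration instead of two can be sketched with a toy digit-by-digit divider. This is a simplified model for counting iterations only; real hardware dividers select each quotient digit with lookup logic and redundant number formats, which we do not attempt to model here. The function and parameter names are ours.

```python
def divide(dividend: int, divisor: int, width: int = 32, bits_per_step: int = 2):
    """Unsigned division that produces `bits_per_step` quotient bits per iteration.

    Returns (quotient, remainder, iterations).
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    if width % bits_per_step != 0:
        raise ValueError("width must be a multiple of bits_per_step")

    quotient = 0
    remainder = 0
    iterations = 0
    mask = (1 << bits_per_step) - 1
    # Walk the dividend from the most significant chunk to the least significant.
    for shift in range(width - bits_per_step, -1, -bits_per_step):
        chunk = (dividend >> shift) & mask
        remainder = (remainder << bits_per_step) | chunk
        # Stand-in for the hardware's quotient digit selection.
        digit = remainder // divisor
        remainder -= digit * divisor
        quotient = (quotient << bits_per_step) | digit
        iterations += 1
    return quotient, remainder, iterations


print(divide(1000, 7, bits_per_step=2))  # radix-4:  (142, 6, 16)
print(divide(1000, 7, bits_per_step=4))  # radix-16: (142, 6, 8)
```

Both runs give the same quotient and remainder, but the radix-16 version needs half as many iterations for a 32 bit operand, which is where the claimed doubling of divide speed comes from.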
Virtualization
The improved virtualization entry and exit times – averaging a 25% to 75% gain – will lower the overhead for virtual machines, and as virtualization is becoming common in the server space, it is a welcome optimization. Earlier this year Neoseeker itself replaced 2 of its servers with new virtual servers running a total of 5 VMs, and we know other sites that have gone this route, so virtualization is something all of you are using indirectly every day and is an especially exciting topic for new CPUs.
Store Forwarding & Improved OS synchronization support:
Store forwarding allows a read of a memory location to occur from the write pipe even if a mis-aligned write to main memory has not occurred yet; interrupts can be enabled and disabled faster and locked instructions can also execute faster – this should be a benefit in high I/O interrupt situations such as database servers.
Deep Power Down Technology
A new power management state significantly reduces power consumption during idle periods by adding a deeper “sleep” state that flushes caches, saves internal micro-architecture state and shuts off power to inactive cores and their L2 cache.
Dynamic Acceleration Technology
When one or more cores are inactive, the performance of an active core can be automatically boosted while still remaining within the power envelope of the chip. This basically sounds like automatically overclocking the active cores when other cores are idle. For example, on a quad core system, if two cores were idle during game play, the speed of the two active cores would automatically be boosted.
50+ New SSE4 instructions
According to Intel, these SSE4 additions can lead to dramatic performance gains, so let’s take a closer look at them:
PMULLD, PMULDQ – signed multiplication of packed 32 bit values (PMULLD keeps the low 32 bits of four products, PMULDQ produces two full 64 bit products)
DPPS, DPPD – dot product instruction, used in matrix multiplication, 3D code
BLENDPS, BLENDPD, BLENDVPS, BLENDVPD, PBLENDVB, PBLENDW – conditional copying of fields in packed SSE registers
PMINSB, PMAXSB, PMINUW, PMAXUW, PMINUD, PMAXUD, PMINSD, PMAXSD – min and max operations for packed signed and unsigned bytes, words and dwords
ROUNDPS, ROUNDSS, ROUNDPD, ROUNDSD – rounding of packed single and double precision floating point data
INSERTPS, PINSRB, PINSRD, PINSRQ, EXTRACTPS, PEXTRB, PEXTRD, PEXTRW, PEXTRQ – data insertion/extraction between XMM registers and memory or CPU general purpose registers
PMOVSXBW, PMOVZXBW, PMOVSXBD, PMOVZXBD, PMOVSXBQ, PMOVZXBQ, PMOVSXWD, PMOVZXWD, PMOVSXWQ, PMOVZXWQ, PMOVSXDQ, PMOVZXDQ – convert from packed integer to zero or sign extended integer of a wider type
PTEST – packed test
PCMPEQQ, PCMPGTQ – compare packed qwords
PACKUSDW – convert packed signed DWORDS to packed unsigned WORDS
PCMPESTRI, PCMPESTRM, PCMPISTRI, PCMPISTRM – advanced string comparison instructions
CRC32 – accumulate a CRC-32C (Castagnoli polynomial) checksum
POPCNT – count number of bits set to 1
Ok, you can un-glaze your eyes now.
Counting every variation there is just a bit over fifty new instructions… but really there are only 14 totally unique instructions, with variations based on data type. Still, the new instructions will improve the quality of vector code, string comparisons, CRC calculations and more, so they definitely will help – once compiler support for them arrives, and applications are re-compiled to take advantage of them.
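For readers who would rather see what a few of these instructions compute than read mnemonics, here is a plain Python sketch of the semantics of POPCNT, PMAXSD and DPPS. These are reference models written by us for illustration, not how the hardware is implemented; in particular the DPPS model ignores single precision rounding and simply uses Python floats.

```python
def popcnt(value: int, width: int = 64) -> int:
    """POPCNT: count the bits set to 1 in an unsigned `width`-bit value."""
    return bin(value & ((1 << width) - 1)).count("1")


def pmaxsd(a, b):
    """PMAXSD: element-wise maximum of four packed signed dwords."""
    return [max(x, y) for x, y in zip(a, b)]


def dpps(a, b, imm8: int):
    """DPPS: dot product of four packed floats, controlled by an 8-bit mask.

    Bits 4-7 of imm8 choose which element pairs are multiplied and summed;
    bits 0-3 choose which result lanes receive the sum (the rest become 0.0).
    """
    total = sum((a[i] * b[i] for i in range(4) if imm8 & (1 << (4 + i))), 0.0)
    return [total if imm8 & (1 << i) else 0.0 for i in range(4)]


print(popcnt(0b1011))                                   # 3
print(pmaxsd([1, -5, 7, 0], [2, -9, 3, 0]))             # [2, -5, 7, 0]
print(dpps([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], 0xF1))  # [70.0, 0.0, 0.0, 0.0]
print(dpps([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], 0x7F))  # [38.0, 38.0, 38.0, 38.0]
```

The second DPPS call shows why the instruction is handy for 3D code: masking off the fourth element gives a three component dot product in a single instruction, broadcast to every lane.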
Test Setup
We made the test systems as similar as we could, other than the memory - so it should give us some very interesting results! Since Yorkfield is a Socket 775 chip it can theoretically run on any Socket 775 board that has a BIOS supporting the CPU and a VRM that can handle the required voltages. Intel recommends P35 or X38 chipsets, but there are reports that P965, 975X and NVIDIA 680i chipsets may also work, depending on the manufacturer of the actual board. As always, check before you leap. For our tests we ran the QX9650 on an X38 based Asus P5E3 Deluxe board, one of the best DDR3 overclocking Core 2 boards on the market. Hardware used for testing the motherboards:
Benchmarks Used
For now, here is a listing of the tests performed: Video drivers used were the NVIDIA ForceWare version 93.71 package.
http://download.intel.com/pressroom/kits/events/idffall_2007/BriefingSmith45nm.pdf
http://www.intel.com/technology/architecture-silicon/intel64/45nm-core2_whitepaper.pdf
http://download.intel.com/technology/architecture/new-instructions-paper.pdf
Business Winstone
The stock speed QX9650 tops the chart for Business Winstone; heck, even the underclocked QX9650 beats the X3210 and QX6700!

Content Creation
Again, the QX9650 takes the top spots...

Sandra Tests
PLEASE NOTE: The QX6700 results here are not comparable as they used an earlier version of Sandra; they are shown merely for interest.
The Sandra tests clearly show that the QX9650 Penryn architecture is significantly faster for floating point. Surprisingly, it is a hair slower for integer arithmetic; however, the difference is so small that it is likely due to experimental error.

The memory bandwidth of Penryn is significantly higher than that of previous quad cores, as we can clearly see in the chart below.

WinRAR
Excellent results for the QX9650 here; however for WinRAR, single core, the older QX6700 with 833-4-4-4-12 DDR2 is almost as fast as the QX9650!

WinRAR MT
The QX9650 dominates here.

RightMark Read
The Penryn dominates the RightMark Read benchmark.

RightMark Write
And the QX9650 also takes the top spots in the RightMark Write benchmark.

RightMark Latency
For latency, DDR2 still wins.

RightMark Bandwidth
I am surprised; the X3210 with the smaller L2 cache showed a better bandwidth score!

LAME MP3
Looks like LAME likes the QX9650's larger caches and other improvements...


TMPGEnc
TMPGEnc benefits from Penryn as well.

CineBench
CineBench loves the QX9650.

POV-Ray
POV-Ray also loves the Penryn.

Call of Duty
Oh my. The QX9650 - Penryn - rocks for Call of Duty.

Comanche 4
And for Comanche.

Doom 3
Doom 3 also likes Penryn - the QX9650 took the top place on the chart.

Halo
Same story here.

Jedi Knight
It is becoming very clear that for CPU bound applications, the Penryn will consistently beat previous quad cores.
The QX9650 rocks.

Unreal Tournament 2004
Oh my.

Quake 4
Ok, by this time it should not be a surprise that the Penryn does very well for gaming.
World In Conflict
People have been asking for newer benchmarks - so I am adding World in Conflict, under XP Pro.
These results are the best frame rate from the built in benchmark, with "800x600 very low quality" settings.
Of course, Penryn does very well.
Penryn improvements explored
I wanted to see exactly how much difference the architectural improvements and larger L2 cache would make for Penryn, so I compiled my results in a nice, easy table below. I took the Xeon X3210 and compared its performance in every benchmark I ran against the Yorkfield QX9650 with both processors running at the same clockspeeds with the exact same memory and motherboard configurations. When reading the below table please be aware that for some tests the lower score is better, so you will see "negative" % improvements. I highlighted the tests where Yorkfield/Penryn lost to its predecessor in red so you can differentiate when the %improvement should be read as "Yorkfield lost".
Same platform, both processors running 333x8 1333-8-8-8-24:

| Benchmark | QX9650 | X3210 | % improvement |
| --- | --- | --- | --- |
| Business Winstone | 32.5 | 32.1 | 1.25% |
| Content Creation | 47.6 | 46.7 | 1.93% |
| Sandra - Int | 49183 | 49301 | -0.24% |
| Sandra - Float | 37854 | 33028 | 14.61% |
| Sandra - Int Bandwidth | 7247 | 6880 | 5.33% |
| Sandra - Float Bandwidth | 7226 | 6891 | 4.86% |
| WinRAR | 788 | 765 | 3.01% |
| WinRAR MT | 2136 | 1945 | 9.82% |
| RightMark Read | 8633.26 | 8631.04 | 0.03% |
| RightMark Write | 7108.88 | 6093.75 | 16.66% |
| RightMark Latency | 50.82 | 51.24 | -0.82% |
| RightMark Bandwidth | 5737.56 | 5833.91 | -1.65% |
| LAME MP3 | 399 | 434 | -8.06% |
| LAME MP3 MT | 177 | 190 | -6.84% |
| TMPGEnc | 491 | 550 | -10.73% |
| TMPGEnc MT | 217 | 221 | -1.81% |
| CineBench | 51.7 | 56.3 | -8.17% |
| CineBench MT | 15.9 | 17.5 | -9.14% |
| POV-RAY | 559.73 | 545.41 | 2.63% |
| POV-RAY MT | 2130.44 | 2111.41 | 0.90% |
| Call of Duty | 209.3 | 191.4 | 9.35% |
| Comanche 4 | 123.75 | 123.81 | -0.05% |
| Doom 3 | 252.3 | 235.3 | 7.22% |
| Halo | 255.74 | 246.86 | 3.60% |
| Jedi Knight | 143.3 | 135.9 | 5.45% |
| UT 2004 | 177.77 | 157.33 | 12.99% |
I think the above table shows a pretty astonishing trend. Penryn is FASTER in 23 out of 26 tests against its predecessor, in the same motherboard, with the same memory, with the same settings. In 9 of the tests Yorkfield bested its predecessor by 8% or more (sometimes up to 16% improvement). Those Penryn architectural changes we described are certainly doing something!
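For anyone who wants to check the table, the sketch below recomputes the "% improvement" column as (QX9650 / X3210 - 1) x 100 and tallies the wins. The scores are copied from the table; the set of lower-is-better tests is our assumption (timed encodes and renders, plus latency), chosen because the table itself does not mark them.

```python
# (benchmark, QX9650 score, X3210 score), copied from the table above.
RESULTS = [
    ("Business Winstone", 32.5, 32.1), ("Content Creation", 47.6, 46.7),
    ("Sandra - Int", 49183, 49301), ("Sandra - Float", 37854, 33028),
    ("Sandra - Int Bandwidth", 7247, 6880), ("Sandra - Float Bandwidth", 7226, 6891),
    ("WinRAR", 788, 765), ("WinRAR MT", 2136, 1945),
    ("RightMark Read", 8633.26, 8631.04), ("RightMark Write", 7108.88, 6093.75),
    ("RightMark Latency", 50.82, 51.24), ("RightMark Bandwidth", 5737.56, 5833.91),
    ("LAME MP3", 399, 434), ("LAME MP3 MT", 177, 190),
    ("TMPGEnc", 491, 550), ("TMPGEnc MT", 217, 221),
    ("CineBench", 51.7, 56.3), ("CineBench MT", 15.9, 17.5),
    ("POV-RAY", 559.73, 545.41), ("POV-RAY MT", 2130.44, 2111.41),
    ("Call of Duty", 209.3, 191.4), ("Comanche 4", 123.75, 123.81),
    ("Doom 3", 252.3, 235.3), ("Halo", 255.74, 246.86),
    ("Jedi Knight", 143.3, 135.9), ("UT 2004", 177.77, 157.33),
]

# Assumption: these tests report a time or latency, so a lower score is better.
LOWER_IS_BETTER = {
    "RightMark Latency", "LAME MP3", "LAME MP3 MT",
    "TMPGEnc", "TMPGEnc MT", "CineBench", "CineBench MT",
}


def percent_change(qx9650: float, x3210: float) -> float:
    """The table's '% improvement' column."""
    return (qx9650 / x3210 - 1) * 100


def penryn_gain(name: str, qx9650: float, x3210: float) -> float:
    """Percent change with the sign flipped for lower-is-better tests."""
    change = percent_change(qx9650, x3210)
    return -change if name in LOWER_IS_BETTER else change


gains = {name: penryn_gain(name, qx, x) for name, qx, x in RESULTS}
wins = [name for name, gain in gains.items() if gain > 0]
losses = [name for name in gains if name not in wins]
big_wins = [name for name, gain in gains.items() if gain >= 8]

print(f"{len(wins)} wins out of {len(gains)} tests")
print("Losses:", ", ".join(losses))
print("Wins of 8% or more:", ", ".join(big_wins))
```

With this classification the three losses are the Sandra integer test, RightMark Bandwidth and Comanche 4, all by less than 2%.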
Now that we've seen how Penryn does in stock performance, let's start looking at my favourite part of any CPU review: the extreme overclocking analysis :-).
Overclocking the Yorkfield/QX9650
This is a very exciting thing for me. I've been waiting for Penryn and 45nm Core 2s for a very long time. Why? Because every die shrink is another opportunity to see how much headroom and overclocking bliss one can achieve. I must say that overclocking the QX9650 was a pleasure. Here are some CPU-Z screen captures so you can see how far I got even before you look at the overclocked results :-)
3.6GHz - piece of cake, default voltages.
4.25GHz - had to work for it:
But once I got it stable, it was really sweet!
4.5GHz - booted into Windows, but extremely unstable. With better cooling it should be achievable!
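To put those clock speeds in perspective, this small sketch works out how far each one sits above the 3.0GHz stock rating. The helper name is ours; the clock figures are the ones quoted above.

```python
STOCK_GHZ = 3.0  # rated clock of the QX9650


def overclock_percent(clock_ghz: float, stock_ghz: float = STOCK_GHZ) -> float:
    """Percentage increase of an overclocked speed over the stock speed."""
    return (clock_ghz / stock_ghz - 1) * 100


for clock in (3.6, 4.25, 4.5):
    print(f"{clock}GHz is {overclock_percent(clock):.1f}% over stock")
```

That works out to roughly 20% at default voltage, about 42% at the stable 4.25GHz setting, and 50% for the unstable 4.5GHz boot.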
Overclocked QX9650 Business Winstone
The chart says it all better than I can. The QX9650 overclocks like crazy, and is very very fast for even single threaded business apps.

Overclocked QX9650 Content Creation
The Penryn loves to do multi-media!

Overclocked QX9650 Sandra Tests
JUST LOOK AT THESE SANDRA RESULTS!
Please note, the QX6700 results are not comparable as they were from an older Sandra.

Ok, DDR3 can give very very good memory bandwidth.

Overclocked QX9650 WinRAR
WinRAR loves Penryn too. QX9650 is great for overclockers.

Overclocked QX9650 WinRAR MT
WinRAR MT REALLY loves Penryn and fast DDR3.

Overclocked QX9650 RightMark Read
The Penryn gives us amazing read bandwidth. Over 11GB/sec!

Overclocked QX9650 RightMark Write
Penryn also gives us 9.5GB/sec writing!

Overclocked QX9650 RightMark Latency
No, you are not seeing things. When you crank the MegaHertz high enough, use an X38 chipset based board, and overclock a Penryn, you can get very low latencies (for an Intel system).

Overclocked QX9650 RightMark Bandwidth
The overclocked bandwidth is excellent.

Overclocked QX9650 LAME MP3
The RIAA is going to hate Penryn. It encodes really fast.


Overclocked QX9650 TMPGEnc
No worries, the MPAA will hate Penryn too. It crunches through MPEG encoding of movies.

Overclocked QX9650 Rendering Tests
CineBench
Pixar and Disney will love Penryn. It ray traces like there is no tomorrow.

POV-Ray
Ditto with POV.

Overclocked QX9650 Call of Duty
Oh my. Penryn overclocked just chews through games that are not GPU limited.

Overclocked QX9650 Comanche 4
Yep, on top of the chart again.

Overclocked QX9650 Doom3
That is an amazing Doom 3 result.
Penryn rocks.
Granted, it's low res, low detail.
But come on... 309.2fps???
Fragged.

Overclocked QX9650 Halo
I'm scratching my head as to why the QX6700 is on top - perhaps an optimized driver? Penryn still rocks.

Overclocked QX9650 Jedi Knight
Oh boy. The Penryn numbers speak for themselves.

Overclocked QX9650 Unreal Tournament 2004
Ok. Why are we not surprised that Penryn tops the chart here?

Overclocked QX9650 Quake 4
What can I say? It's fast.
Overclocked QX9650 World In Conflict
Here are the overclocked World in Conflict results under XP Pro.
These results are the best frame rate from the built in benchmark, with "800x600 very low quality" settings.
Penryn does very well indeed.
Power Consumption
Since everyone is interested in saving power, I thought it would be useful to chart the power consumption of the QX9650 vs. the X3210. The chart below shows total system draw as measured by our "Kill A Watt" meter.
The Penryn based system uses up to 21% less power than the Xeon based system at the exact same clockspeed whether under idle or load - an excellent improvement!
Overclocking Final Thoughts
What can I say? The Yorkfield Core 2 Extreme QX9650 quad core processor is simply an amazing overclocker.
I had to go across the lab to collect my socks – because the Penryn definitely blew them off by being the best Core 2 derived overclocking chip I’ve had the pleasure to use to date -- and that’s with four cores in a multi-chip module!
Frankly, I am certain that I have not reached the ceiling of possible performance with this chip; the temperature was still well under control at 4.25GHz. However, given that this is the only Penryn chip we have, I chickened out and did not push Vcore beyond 1.575V during my tests.
I have absolutely no doubt that this chip can go further, and I may try some phase change cooling on it – there is a LOT of headroom on this chip folks!
Why am I so sure?
Because I was able to get to the Windows desktop and run some of the tests at 4.5GHz; it just was not stable enough for my tastes with a 1.575V Vcore.
It ran at 3.6GHz at stock Vcore.
And I thought that the E6750 and the X3210 were good overclockers.
Wow.
Conclusion
I think Intel has another smashing winner on its hands. To sum up our findings, let's consider the following few points. The architectural improvements to the Penryn core yielded a significant increase in performance clock for clock against previous Core 2s, sometimes as high as 16% at the same clockspeeds. Our Penryn powered platform also showed an impressive 21% lower power consumption under idle and load at the same clockspeed as previous quad core processors. And finally, the QX9650 was an amazing overclocker, and this is where it will really win points with enthusiasts.
I still can’t quite believe how nice and cool it ran at a blistering 4.25GHz – when it is rated for 3.0GHz.
There is absolutely no question in my mind that Penryn will be a hit - especially once the lower cost non-extreme parts arrive. In case you were wondering, that's why I also included 333x8 results, as one of the expected clock speeds for less expensive Penryns will be 333x8. Besides, it allowed me to do more comparisons against earlier Core 2 Quads. Intel could have gone ahead with a simple die shrink, but instead they decided to pack in architectural improvements that made a significant real world difference and to introduce new SSE instructions. With improved virtualization and lower power consumption to round out the chip, I'm fairly certain Intel will make even more significant headway in server marketshare as well. This is a superb chip and I'm very much looking forward to the next generation of Core 2s with the Penryn core.
Please do not redistribute or use this article in whole, or in part, for commercial purposes.