Core i7 Architecture - Continued
Nehalem also greatly improves on the memory architecture - massively increasing the bandwidth available to the cores not only from the caches but also from the external main memory. Careful attention was paid to reducing latency in all the levels of the memory hierarchy - L1, L2 and L3 caches as well as the new on-chip triple channel memory controller.
Intel divides Nehalem into two areas - the "Core" area, consisting of a number (currently four) of processor cores with individual L1 and L2 caches, and the "Uncore" area, consisting of a shared L3 cache, an Integrated Memory Controller (currently with three channels), a number of Quick Path Interconnect (QPI) links, and a Power & Clock section.

For servers and desktops in 2008-2009, Intel intends to differentiate its offerings by varying the:
- number of cores
- number of memory channels
- number of QPI links
- size of caches
- type of memory supported
- power management
- integrated graphics
Intel is increasing cache performance by moving to per-core, low latency L1 and L2 caches backed by a single shared L3 cache. The L3 cache is inclusive, that is, anything present in a core's L1 or L2 cache must be present in the L3 cache as well.
This presents some advantages. If the data being requested is not available in the L3 cache, it is guaranteed not to be in the L1/L2 caches of the other cores, so no snoop is needed at all. Intel has also added a "present in core's L2 cache" bit to each cache line in the L3, which significantly reduces cache snooping and cache coherency traffic: only when the L3 marks a line as present in another core must that core be snooped to see if it has modified the line. The relatively small sizes of the L1 and L2 caches allow them to be built with very low latency and help minimize cache coherency checks. They also allow for reducing the size of the L3 cache to as little as 1MB in a four core processor.
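The lookup logic that an inclusive L3 makes possible can be sketched as a toy model. This is only an illustration of the idea, not Intel's implementation: the class `InclusiveL3`, its method names, and the use of a per-line set of core IDs to stand in for the "present in core" bits are all assumptions made for the example.

```python
class InclusiveL3:
    """Toy model of an inclusive L3 with per-line core-presence tracking.

    Hypothetical names throughout; this models the snoop decision only,
    not real hardware state machines, evictions, or timing.
    """

    def __init__(self, n_cores=4):
        self.n_cores = n_cores
        # address -> set of core IDs whose private L1/L2 may hold the line
        self.lines = {}

    def fill(self, addr, core):
        """Record that `core` pulled `addr` into its private caches.

        Inclusion means the line is installed in the L3 at the same time.
        """
        self.lines.setdefault(addr, set()).add(core)

    def cores_to_snoop(self, addr, requester):
        """Return the cores that must be snooped for a request to `addr`.

        Returns None on an L3 miss: because the L3 is inclusive, a miss
        guarantees no core's L1/L2 holds the line, so nothing is snooped.
        """
        if addr not in self.lines:
            return None
        return sorted(self.lines[addr] - {requester})


if __name__ == "__main__":
    l3 = InclusiveL3()
    l3.fill(0x1000, core=2)
    print(l3.cores_to_snoop(0x1000, requester=0))  # [2]
    print(l3.cores_to_snoop(0x2000, requester=0))  # None
```

In this sketch an L3 miss short-circuits all snooping, and an L3 hit snoops only the flagged cores rather than every core in the package.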
Nehalem increases the scalability of multi-processor systems significantly by having the memory controller on the processor: adding a socket also adds another three memory channels that may be populated. Each added processor, with its associated memory channels, therefore increases the total memory bandwidth available to a server.
Currently Nehalem officially supports up to DDR3-1333, but as you will see, we were able to exceed that in our tests. Having the memory controller - with a potential peak bandwidth of 32GB/sec - on the processor allows for memory latencies lower than anything previously seen on Intel platforms. Nehalem will also support both RDIMM and UDIMM memory.
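The 32GB/sec peak figure follows from simple arithmetic: transfer rate times bus width times channel count. A minimal sketch, assuming the standard 64-bit (8-byte) DDR3 channel width; the function name and its parameters are made up for this example.

```python
def peak_bandwidth_gb_s(mega_transfers_per_sec, channels=3,
                        bus_width_bits=64, sockets=1):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    Hypothetical helper: ignores refresh, bank conflicts, and every
    other real-world overhead, so measured numbers will be lower.
    """
    bytes_per_transfer = bus_width_bits // 8
    bytes_per_sec = (mega_transfers_per_sec * 1e6 * bytes_per_transfer
                     * channels * sockets)
    return bytes_per_sec / 1e9


if __name__ == "__main__":
    # DDR3-1333 on three channels: 1333 MT/s * 8 bytes * 3 channels
    print(round(peak_bandwidth_gb_s(1333), 1))             # 32.0
    # A second socket brings three more channels with it
    print(round(peak_bandwidth_gb_s(1333, sockets=2), 1))  # 64.0
```

The `sockets` parameter reflects the scaling argument made earlier: aggregate peak bandwidth grows with each populated socket, although a single core only sees the full local rate for memory attached to its own processor.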
Using the Quick Path Interconnect, Intel adds the NUMA (Non-Uniform Memory Access) capability needed to access memory attached to other processors in the system. Memory local to a processor socket will always be faster to access than memory attached to another processor. However, QPI will make non-local memory available at data rates comparable to, and in some cases faster than, current Intel FSB designs.
The combination of the triple channel memory controller and QPI is likely to erode the advantage AMD currently enjoys in multi-socket servers, giving Intel inroads into the only market where it arguably runs second to AMD.
The improved virtualization support reduces the time cost of entering and leaving a virtual machine. It also reduces the number of VM transitions by implementing Extended Page Tables (EPT), which translate guest physical addresses to host physical addresses in hardware. This removes the #1 cause of having to leave the virtual machine and gives virtual guests full control over their own page tables. A virtual processor ID (VPID) also helps reduce the frequency of TLB entry invalidations.
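The two-stage translation behind extended page tables can be illustrated with a deliberately simplified sketch. It assumes 4KB pages and flattens each multi-level page table into a single dictionary; the function names and the example mappings are hypothetical and chosen only to show the address arithmetic.

```python
PAGE_SIZE = 4096  # 4KB pages, assumed for this sketch


def translate(addr, page_table):
    """Map an address through a flat {page_number: frame_number} table."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in page_table:
        raise LookupError(f"page fault at {addr:#x}")
    return page_table[page] * PAGE_SIZE + offset


def guest_virtual_to_host_physical(gva, guest_page_table, ept):
    """Two-stage walk: the guest's own table first, then the extended table.

    Stage 1 is owned by the guest OS, which can edit it without leaving
    the virtual machine. Stage 2 is owned by the hypervisor.
    """
    gpa = translate(gva, guest_page_table)  # guest virtual -> guest physical
    return translate(gpa, ept)              # guest physical -> host physical


if __name__ == "__main__":
    guest_pt = {0x10: 0x2}  # made-up guest mapping
    ept = {0x2: 0x7F}       # made-up hypervisor mapping
    print(hex(guest_virtual_to_host_physical(0x10123, guest_pt, ept)))  # 0x7f123
```

Because the guest's table and the hypervisor's table are separate, a guest update to `guest_pt` in this model never touches `ept`, which is why such updates no longer need to trap to the hypervisor.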
Intel is also updating its optimizing compilers, and so is Microsoft - the new Visual Studio 2008 will fully support SSE4.2.