A LOOK AT INTEL S NEW NEHALEM ARCHITECTURE: THE BLOOMFIELD AND LYNNFIELD FAMILIES AND THE NEW TURBO BOOST TECHNOLOGY



Similar documents
Overview. CPU Manufacturers. Current Intel and AMD Offerings

Next Generation Intel Microarchitecture Nehalem Paul G. Howard, Ph.D. Chief Scientist, Microway, Inc. Copyright 2009 by Microway, Inc.

Intel Xeon Processor E5-2600

Generations of the computer. processors.

The Bus (PCI and PCI-Express)

AMD PhenomII. Architecture for Multimedia System Prof. Cristina Silvano. Group Member: Nazanin Vahabi Kosar Tayebani

Innovativste XEON Prozessortechnik für Cisco UCS

Accelerating Innovation in the Desktop. Rob Crooke VP and General Manager Business Client Group

How System Settings Impact PCIe SSD Performance

Intel Xeon Processor 5500 Series. An Intelligent Approach to IT Challenges

Making the Move to Quad-Core and Beyond

A Quantum Leap in Enterprise Computing

DDR3 memory technology

Enabling Technologies for Distributed and Cloud Computing

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

OBJECTIVE ANALYSIS WHITE PAPER MATCH FLASH. TO THE PROCESSOR Why Multithreading Requires Parallelized Flash ATCHING

Personal Systems Reference Intel PC Processors

Unit A451: Computer systems and programming. Section 2: Computing Hardware 1/5: Central Processing Unit

Intel Core i3-2310m Processor (3M Cache, 2.10 GHz)

An examination of the dual-core capability of the new HP xw4300 Workstation

Choosing a Computer for Running SLX, P3D, and P5

The Motherboard Chapter #5

CPU Benchmarks Over 600,000 CPUs Benchmarked

Intel Xeon Processor 5600 Series

PCIe Over Cable Provides Greater Performance for Less Cost for High Performance Computing (HPC) Clusters. from One Stop Systems (OSS)

Multi-core and Linux* Kernel

Small Business Upgrades to Reliable, High-Performance Intel Xeon Processor-based Workstations to Satisfy Complex 3D Animation Needs

Desktop Processor Roadmap. Solution Provider Accounts

Chipsets and Memory Architectures

Enabling Technologies for Distributed Computing

Intel Core i Processor (3M Cache, 3.30 GHz)

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Server Bandwidth Scenarios Signposts for 40G/100G Server Connections

PassMark - CPU Mark Multiple CPU Systems - Updated 17th of July 2012

Low Power AMD Athlon 64 and AMD Opteron Processors

Server: Performance Benchmark. Memory channels, frequency and performance

Stovepipes to Clouds. Rick Reid Principal Engineer SGI Federal by SGI Federal. Published by The Aerospace Corporation with permission.

Advanced Micro Devices - AMD

Intel Xeon Processor 3400 Series-based Platforms

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano

SAPPHIRE TOXIC R9 270X 2GB GDDR5 WITH BOOST

How PCI Express Works (by Tracy V. Wilson)

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Virtualization with the Intel Xeon Processor 5500 Series: A Proof of Concept

LOOKING FOR AN AMAZING PROCESSOR. Product Brief 6th Gen Intel Core Processors for Desktops: S-series

Intel X58 Express Chipset

Itanium 2 Platform and Technologies. Alexander Grudinski Business Solution Specialist Intel Corporation

Quad-Core Intel Xeon Processor

Programming Techniques for Supercomputers: Multicore processors. There is no way back Modern multi-/manycore chips Basic Compute Node Architecture

Memory Configuration for Intel Xeon 5500 Series Branded Servers & Workstations

EVALUATING NEW ARCHITECTURAL FEATURES OF THE INTEL(R) XEON(R) 7500 PROCESSOR FOR HPC WORKLOADS

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

Performance Analysis of Dual Core, Core 2 Duo and Core i3 Intel Processor

12 rdzeni w procesorze - nowa generacja platform serwerowych AMD Opteron Maj Krzysztof Łuka krzysztof.luka@amd.com

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server

Discovering Computers Living in a Digital World

HP PCIe IO Accelerator For Proliant Rackmount Servers And BladeSystems

A Powerful solution for next generation Pcs

SUN FIRE X4170, X4270, AND X4275 SERVER ARCHITECTURE Optimizing Performance, Density, and Expandability to Maximize Datacenter Value

Random Access Memory (RAM) Types of RAM. RAM Random Access Memory Jamie Tees SDRAM. Micro-DIMM SO-DIMM

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/ CAE Associates

Table Of Contents. Page 2 of 26. *Other brands and names may be claimed as property of others.

The Foundation for Better Business Intelligence

LABS. Boston Solutions. Future Intel Xeon processor E v3 product families. September Powered by

Legal Notices and Important Information

Oracle Database Reliability, Performance and scalability on Intel Xeon platforms Mitch Shults, Intel Corporation October 2011

PCI Express and Storage. Ron Emerick, Sun Microsystems

MB ASUS P5G41T-M LX2/GB/LPT Codice Produttore: BRV_

Putting it all together: Intel Nehalem.

EDUCATION. PCI Express, InfiniBand and Storage Ron Emerick, Sun Microsystems Paul Millard, Xyratex Corporation

W o r k s h e e t : P r o c e s s o r s

on an system with an infinite number of processors. Calculate the speedup of

White Paper. Intel Sandy Bridge Brings Many Benefits to the PC/104 Form Factor

Intel Desktop Board DQ45CB

Comparing Multi-Core Processors for Server Virtualization

Configuring Memory on the HP Business Desktop dx5150

PCI vs. PCI Express vs. AGP

Home Software Hardware Benchmarks Services Store Support Forums About Us

Fujitsu PRIMERGY BX924 S2 Dual-Socket server

Technical Product Specifications Dell Dimension 2400 Created by: Scott Puckett

Chapter 4 System Unit Components. Discovering Computers Your Interactive Guide to the Digital World

SAPPHIRE VAPOR-X R9 270X 2GB GDDR5 OC WITH BOOST

Central Processing Unit

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

SeaMicro SM Server

Design Cycle for Microprocessors

Intel Desktop Board DG45FC

A+ Guide to Managing and Maintaining Your PC, 7e. Chapter 1 Introducing Hardware

Primergy Blade Server

Developments in Internet Infrastructure

How To Use An Amd Ramfire R7 With A 4Gb Memory Card With A 2Gb Memory Chip With A 3D Graphics Card With An 8Gb Card With 2Gb Graphics Card (With 2D) And A 2D Video Card With

Computer Systems Structure Input/Output

INTEL HIGH-PERFORMANCE CONSUMER DESKTOP MICROPROCESSOR TIMELINE

IBM System x family brochure

Lecture 1: the anatomy of a supercomputer

WHITE PAPER FUJITSU PRIMERGY SERVERS PERFORMANCE REPORT PRIMERGY BX620 S6

Parallel Programming Survey

Intel Itanium Architecture

Intel Cluster Ready Appro Xtreme-X Computers with Mellanox QDR Infiniband

Transcription:

A LOOK AT INTEL S NEW NEHALEM ARCHITECTURE: THE BLOOMFIELD AND LYNNFIELD FAMILIES AND THE NEW TURBO BOOST TECHNOLOGY Asist.univ. Dragos-Paul Pop froind@gmail.com Romanian-American University Faculty of Computer Science for Business Management Keywords: Intel, Nehalem, Bloomfield, Lynnfield, CPU, Turbo Boost, HyperThreading, PCIe, transistor, QPI, core, thread, overclock 1. Introduction Nehalem is the codename for the Intel processor microarchitecture successor to the Core microarchitecture. The first processor released with the Nehalem architecture is the desktop Core i7, which was released in November 2008. Nehalem processors use the same 45 nm manufacturing methods as Penryn. A working system with two Nehalem processors was shown at Intel Developer Forum Fall 2007, and a large number of Nehalem systems were shown at Computex in June 2008. The microarchitecture is named after the Nehalem Native American nation in Oregon. At that stage it was supposed to be the latest evolution of the NetBurst microarchitecture. Since the abandonment of NetBurst, the codename has been recycled and refers to a completely different project, although Nehalem still has some things in common with NetBurst. Nehalem-based microprocessors utilize higher clock speeds and are more energy-efficient than Penryn microprocessors. Hyper-Threading is reintroduced along with an L3 Cache missing from most Core-based microprocessors. Nehalem is the Tock in Intel s Tick Tock model, the Tick being the Penryn. "Tick-Tock" is a model adopted by Intel since 2007 to follow every microarchitectural change with shrinking of the process technology. Every "tick" is a shrinking of process technology of the previous microarchitecture and every "tock" is a new microarchitecture. Every year, there is expected to be one tick or tock. 2. Characteristics The main characteristics of the Nehalem microarchitecture are: 45 nm manufacturing process 4 cores HyperThreading (some models) Intel Turbo Boost Technology A new point-to-point processor interconnect, the Intel QuickPath Interconnect, in high-end models, replacing the legacy front side bus Integrated memory controller supporting two or three memory channels 731 (Bloomfield) and 774 (Lynnfield) million transistors Integration of PCI Express and Direct Media Interface into the processor in mid-range models, replacing the northbridge 3. Families The two processor families based on the Nehalem microarchitecture are Bloomfield (released in November 2008, Core i7 9xx series) and Lynnfield (September 2009, Core i7 8xx series and Core i5 750). The processor form the two families are quite different from each other, but the Lynnfield family can be seen as an evolution of Bloomfield. The main differences between the two are: socket (LGA 1366 for Bloomfield and LGA 1156 for Lynnfield), memory controller (3 channels for Bloomfield and 2

channels for Lynnfield, although the latter supports higher memory speeds than the first), PCI Express channels (36 for Bloomfield and 16 for Lynnfield). Intel's Bloomfield Platform (X58 + LGA-1366): Intel's Lynnfield Platform (P55 + LGA-1156): Processor Clock Speed Cores/ Threads Maxi Single Core Turbo Freq TDP Price Intel Core i7-975 Extreme Intel Core i7 965 Extreme 3.33GHz 4 / 8 3.60GHz 130W $999 3.20GHz 4 / 8 3.46GHz 130W $999 Intel Core i7 940 2.93GHz 4 / 8 3.20GHz 130W $562

Intel Core i7 920 2.66GHz 4 / 8 2.93GHz 130W $284 Intel Core i7 870 2.93GHz 4 / 8 3.60GHz 95W $562 Intel Core i7 860 2.80GHz 4 / 8 3.46GHz 95W $284 Intel Core i5 750 2.66GHz 4 / 4 3.20GHz 95W $196 4. Technologies HyperThreading Intel decided to reintroduce the HyperThreading technology, which means that every core can run two threads at the same time. This translates into 8 virtual processors seen by the operating system. HyperThreading is Intel s proprietary SMT (simultaneous multithreading) technology and it has not been present on an Intel processor since the Pentium 4s in 2006. QuickPath Interconnect The QuickPath Interconnect (QuickPath, QPI) is a point-to-point processor interconnect developed by Intel to compete with HyperTransport (used by AMD). It replaces the Front Side Bus (FSB) for Desktop, Xeon, and Itanium platforms. The QPI is an element of a system architecture that Intel calls the QuickPath architecture that implements what Intel calls QuickPath technology. In its simplest form on a single-processor motherboard, a single QPI is used to connect the processor to the IO Hub. In more complex instances of the architecture, separate QPI link pairs connect one or more processors and one or more IO hubs or routing hubs in a network on the motherboard, allowing all of the components to access other components via the network. As with HyperTransport, the QuickPath Architecture assumes that the processors will have integrated memory controllers, and enables a non-uniform memory architecture (NUMA). Each QPI comprises two 20-lane point-to-point data links, one in each direction (full duplex), with a separate clock pair in each direction, for a total of 42 signals. Each signal is a differential pair, so the total number of pins is 84. The 20 data lanes are divided onto four "quadrants" of 5 lanes each. The basic unit of transfer is the 80-bit "flit," which is transferred in two clock cycles (four 20 bit transfers, two per clock.) The 80-bit "flit" has 8 bits for error detection, 8 bits for "link-layer header," and 64 bits for "data." QPI bandwidths are advertised by computing the transfer of 64 bits (8 bytes) of data every two clock cycles in each direction. Although the initial implementations use single four-quadrant links, the QPI specification permits other implementations. Each quadrant can be used independently. On high-reliability servers, a QPI link can operate in a degraded mode. If one or more of the 20+1 signals fails, the interface will operate using 10+1 or even 5+1 remaining signals, even reassigning the clock to a data signal if the clock fails. The initial Nehalem implementation uses a full four-quadrant interface to achieve 25.6 GB/s, which provides exactly double the theoretical bandwidth of Intel's 1600 MHz FSB used in the X48 chipset.

Turbo Boost In a nutshell, Turbo Boost is overclocking. Actually it is real-time self overclocking done by the processor itself. It is yet another very smart technology that Intel developed to squeeze as much performance from the processor as possible. In order to see how it works we have to understand TDP. The thermal design power (TDP), sometimes called thermal design point, represents the maximum amount of power the cooling system in a computer is required to dissipate. For example, a laptop's CPU cooling system may be designed for a 20 watt TDP, which means that it can dissipate up to 20 watts of heat without exceeding the maximum junction temperature for the computer chip. It can do this using an active cooling method such as a fan or any of the three passive cooling methods, convection, thermal radiation or conduction. Typically, a combination of methods are used. The TDP is typically not the most power the chip could ever draw, such as by a power virus, but rather the maximum power that it would draw when running real applications. This ensures the computer will be able to handle essentially all applications without exceeding its thermal envelope, or requiring a cooling system for the maximum theoretical power, which would cost more and achieve no benefit. So we see that there is so much power a CPU can utilize before it becomes inefficient. The limit is somewhere around 140 W. The question is how to make a CPU run faster if we can t use more power to increase frequency? The obvious solution was to make multi-core CPUs. Here s an illustration of the concept: Single Core Dual Core Quad Core Hex Core TDP Tradeoff If all cores are used then a lower clocked multi-core CPU is faster than a higher clocked single core CPU. But what happens when not all cores are used? And this happens a lot. Well, when let s say just one core is used we can clearly see that the multi core CPU is slower then single core one because it has lower clock speed. This creates a dilemma: do you buy a single core, high clock speed CPU or a multi core lower clock speed one? The dilemma would disappear if the multi core CPU could use the extra power gained from the cores that are not running to overclock itself, making itself faster. That would be a smart CPU and that smart CPU is a Nehalem CPU. But still, there is a problem: even cores that are not in use (idle) do use some amount of power. This limits the degree of self-overclock a CPU can achieve, not to mention that the extra used power is waste and, for mobile devices such as laptops, it s a big concern. The solution to this is to turn off cores when they are not in use and turn them on when needed. Sounds simple, but it s actually not. This is because transistors are used to turn off the cores. Transistors act like light switches, but because their size gets smaller and smaller (45 nm to date) we have power leakage between them even when they are supposed to be off. So, a core that is supposed to be turned off still draws power. Intel s answer to this was the

Power Gate Transistor (PGT). The PGT is a special kind of transistor that has almost zero leakage. This technology helps the Nehalem CPUs use the TDP gained from idle cores to power running cores, effectively overclocking them. This is what Turbo Boost is all about. Turbo Boost was first implemented in Bloomfield, with some results (about 5% gain in performance, depending on the application running) and then it evolved with Lynnfield to about 17% increase in performance. Max Speed Stock 4 Cores Active 3 Cores Active 2 Cores Active 1 Core Active Intel Core i7 870 2.93GHz 3.20GHz 3.20GHz 3.46GHz 3.60GHz Intel Core i7 860 2.80GHz 2.93GHz 2.93GHz 3.33GHz 3.46GHz Intel Core i5 750 2.66GHz 2.80GHz 2.80GHz 3.20GHz 3.20GHz If Intel had Turbo mode back when dual-cores first started shipping we would've never had the whole single vs. dual core debate. If you're running a single thread, this 774M transistor beast will turn off three of its cores and run its single active core at up to 3.6GHz. That's faster than the fastest Core 2 Duo on the market today. The ultimate goal is to always deliver the best performance regardless of how threaded (or not) the workload is. Buying more cores shouldn't get you lower clock speeds, just more flexibility. The top end Lynnfield is like buying a 3.46GHz dual-core processor that can also run well threaded code at 2.93GHz. Take this one step further and imagine what happens when you have a CPU/GPU on the same package or better yet, on the same die. Need more GPU power? Underclock the CPU cores, need more CPU power? Turn off half the GPU cores. It's always available, real-time-configurable processing power. That's the goal and Lynnfield is the first real step in that direction. 5. Conclusion As awesome as it is, Turbo doesn't work 100% of the time, its usefulness varies on a number of factors including the instruction mix of active threads and processor cooling. The actual instructions being executed by each core will determine the amount of current drawn and total TDP of the processor. For example, video encoding uses a lot of SSE instructions which in turn keep the SSE units busy on the chip; the front end remains idle and is clock gated, so power is saved there. The resulting power savings are translated into higher clock frequency. Intel tells us that video encoding should see the maximum improvement of two bins with all four cores active. Floating point code stresses both the front end and back end of the pipe, here we should expect to see only a 133MHz increase from turbo mode if any at all. In short, you can't simply look at whether an app uses one, two or more threads. It's what the app does that matters. There's also the issue of background threads running in the OS. Although your foreground app may only use a single thread, there are usually dozens (if not hundreds) of active threads on your system at any time. Just a few of those being scheduled on sleeping cores will wake them up and limit your max turbo frequency (Windows 7 is allegedly better at not doing this). You can't really control the instruction mix of the apps you run or how well they're threaded, but this last point you can control: cooling. The sort-of trump all feature that you have to respect is Intel's thermal

throttling. If the CPU ever gets too hot, it will automatically reduce its clock speed in order to avoid damaging the processor; this includes a clock speed increase due to turbo mode. 6. Bibliography: http://www.anandtech.com http://www.intel.com http://www.wikipedia.org http://www.tomshardware.com