Why Intel is designing multi-core processors. Geoff Lowney, Intel Fellow, Microprocessor Architecture and Planning. July 2006



Moore's Law at Intel, 1970-2005
[Four trend charts, plotted against introduction year: Feature Size (um) shrinking roughly 0.7x per generation; Transistor Count growing roughly 2.0x per generation; Frequency (MHz) growing roughly 1.5-2.0x per generation; CPU Power (W) growing roughly 1.4x per generation.]
Power trend not sustainable

386 Processor vs. Pentium 4 Processor
386 Processor: May 1986, 16 MHz core, 275,000 transistors at 1.5u, ~1.2 SPECint2000.
Pentium 4 Processor: August 27, 2003, 3.2 GHz core, 55 million transistors at 0.13u, 1249 SPECint2000.
17 years: roughly 200x frequency, 200x transistors, 1000x performance.

Frequency: 200x. Performance: 1000x.
[Two log-scale charts against introduction date (Jan-85 through Jan-05): Clock Frequency, and SPECint2000.]

SPECint2000/MHz (normalized)
[Chart: SPECint2000 per MHz, normalized, against introduction date (Jan-85 through Jan-05) for the 386, 486, 486DX2, P5, P55-MMX, PPro/PII, PIII, and P4.]
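The per-MHz trend can be sanity-checked from the numbers on the 386 vs. Pentium 4 slide; a small sketch (normalizing to the 386 is an assumption about how the chart was scaled):

# Rough reconstruction of the per-MHz comparison using the 386 and
# Pentium 4 figures quoted earlier in the deck.
chips = {
    "386":       {"mhz": 16,   "specint2000": 1.2},
    "Pentium 4": {"mhz": 3200, "specint2000": 1249},
}

base = chips["386"]["specint2000"] / chips["386"]["mhz"]  # normalize to the 386

for name, c in chips.items():
    per_mhz = c["specint2000"] / c["mhz"]
    print(f"{name}: {per_mhz / base:.1f}x SPECint2000/MHz relative to the 386")

# Frequency grew ~200x and per-MHz efficiency ~5x, which together give
# the ~1000x overall performance gain cited on the previous slide.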

[Bar charts across microarchitecture generations (pipelined, superscalar, out-of-order speculative, deep pipeline): relative increase in area vs. performance, and relative increase in power vs. MIPS/W.]
Performance scales roughly with area**0.5. Power efficiency has dropped.

Pushing Frequency
[Chart: performance vs. relative frequency (deeper pipelining) shows diminishing returns. Chart: sub-threshold leakage vs. relative frequency (process technology) increases exponentially.]
We maximized frequency by building deeper pipelines and pushing the process to the limit. Dynamic power increases with frequency; leakage power increases as Vt is reduced.
Diminishing return on performance. Increase in power.

Moore's Law at Intel, 1970-2005
[The same four trend charts as the opening slide: Feature Size, Transistor Count, Frequency, and CPU Power.]
Power trend not sustainable

Reducing power with voltage scaling
Power = Capacitance * Voltage**2 * Frequency
Frequency ~ Voltage in the region of interest, so Power ~ Voltage**3.
A 10% reduction in voltage yields a 10% reduction in frequency, a 30% reduction in power, and less than a 10% reduction in performance.
Rule of thumb, per 1% voltage reduction: Frequency -1%, Power -3%, Performance -0.66%.
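These relationships are easy to play with numerically. A minimal sketch, assuming the slide's models (power proportional to voltage cubed once frequency tracks voltage, and performance falling about 0.66% per 1% of frequency):

def voltage_scale(voltage_reduction):
    """Estimate relative frequency, power, and performance after a
    voltage reduction, using the slide's models: frequency tracks
    voltage, power ~ voltage**3, and performance loses ~0.66% per 1%
    of frequency (an assumption used to match the rule of thumb)."""
    v = 1.0 - voltage_reduction          # relative voltage
    freq = v                             # frequency ~ voltage
    power = v ** 3                       # power ~ C * V**2 * f ~ V**3
    perf = freq ** 0.66                  # diminishing sensitivity to clock
    return freq, power, perf

# The slide's example: a 10% voltage reduction.
f, p, s = voltage_scale(0.10)
print(f"frequency {1-f:.0%} lower, power {1-p:.0%} lower, performance {1-s:.1%} lower")
# -> 10% lower frequency, ~27% lower power (the slide rounds to 30%),
#    ~7% lower performance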

Dual core with voltage scaling
By the rule of thumb, a 15% reduction in voltage yields a 15% frequency reduction, a 45% power reduction, and a 10% performance reduction.
SINGLE CORE: Area = 1, Voltage = 1, Freq = 1, Power = 1, Perf = 1
DUAL CORE: Area = 2, Voltage = 0.85, Freq = 0.85, Power = 1, Perf = ~1.8
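The single-core vs. dual-core table follows from applying that rule of thumb per core and doubling the core count. A sketch under the slide's own simplifications (linear rule of thumb extrapolated to a 15% reduction, perfectly parallel workload):

def scaled_cores(voltage_reduction, n_cores):
    """Apply the slide's linear rule of thumb: each 1% of voltage
    reduction costs ~1% frequency, ~3% power, and ~0.66% performance
    per core. Totals assume a perfectly parallel workload."""
    power_per_core = 1.0 - 3.0 * voltage_reduction
    perf_per_core = 1.0 - 0.66 * voltage_reduction
    return n_cores * power_per_core, n_cores * perf_per_core

single_power, single_perf = scaled_cores(0.00, n_cores=1)
dual_power, dual_perf = scaled_cores(0.15, n_cores=2)
print(f"single core: power {single_power:.2f}, perf {single_perf:.2f}")  # 1.00, 1.00
print(f"dual core:   power {dual_power:.2f}, perf {dual_perf:.2f}")      # 1.10, 1.80
# Two cores at 0.85x voltage give ~1.8x the throughput in roughly the
# same power envelope, matching the slide's single- vs. dual-core table.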

Multiple cores deliver more performance per watt
[Diagram: one big core plus cache vs. four small cores (C1-C4) plus cache in the same die area. A small core has Power = 1/4 and Performance = 1/2 of the big core.]
Many core is more power efficient: Power ~ area, but single-thread performance ~ area**0.5.
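A small sketch of the trade behind this slide, using its stated rules (power proportional to core area, single-thread performance proportional to area**0.5); the 4-to-1 area split is the slide's own example:

def core_metrics(area):
    """Per-core power and single-thread performance from the slide's
    rules: power ~ area, single-thread performance ~ area**0.5."""
    return area, area ** 0.5

big_power, big_perf = core_metrics(area=4.0)      # one big core
small_power, small_perf = core_metrics(area=1.0)  # one of four small cores

print(f"big core:      power {big_power:.1f}, single-thread perf {big_perf:.1f}")
print(f"small core:    power {small_power:.1f}, single-thread perf {small_perf:.1f}")
print(f"4 small cores: power {4*small_power:.1f}, aggregate perf {4*small_perf:.1f}")
# Same total area and power as the big core, but 4.0 vs. 2.0 aggregate
# throughput -- provided the workload can use all four cores.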

Memory Gap
[Chart: peak instructions per DRAM access, 1992-2002, for the Pentium 66 MHz, Pentium Pro 200 MHz, Pentium III, and Pentium 4 2 GHz. The gap between logic and memory keeps growing.]
Reduce DRAM accesses with large caches. Extra benefit: power savings, since cache is lower power than logic.
Tolerate memory latency with multiple threads: multiple cores, Hyper-Threading.

Multi-threading tolerates memory latency
[Timeline diagram: serial execution interleaves compute phases A_i, A_i+1, B_i, B_i+1 with idle memory-wait gaps; multi-threaded execution fills thread A's memory-wait gaps with work from thread B.]
Execute thread B while thread A waits for memory. Multi-core has a similar effect.

Multi-core tolerates memory latency
[Timeline diagram: in serial execution, threads A and B run back-to-back, each stalling on memory; in multi-core execution, A runs on one core and B on another, so their memory stalls overlap.]
Execute threads A and B simultaneously.
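The overlap in these two slides can be mimicked in software, with a sleep standing in for a memory stall; this is a toy illustration of latency hiding, not the hardware mechanism itself:

import threading
import time

def worker(name, iterations=3, stall=0.1):
    """Alternate a compute phase with a simulated memory stall."""
    for _ in range(iterations):
        _ = sum(range(10_000))       # "compute": A_i or B_i
        time.sleep(stall)            # "memory stall": the idle gap

start = time.perf_counter()
worker("A"); worker("B")             # serial: the stalls add up
serial = time.perf_counter() - start

start = time.perf_counter()
threads = [threading.Thread(target=worker, args=(n,)) for n in "AB"]
for t in threads: t.start()
for t in threads: t.join()
overlapped = time.perf_counter() - start

print(f"serial {serial:.2f}s vs. overlapped {overlapped:.2f}s")
# The stalls of A and B overlap, so the threaded run finishes sooner --
# the same idea the hardware exploits with Hyper-Threading or a second core.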

Core Area (with L1 caches) Trend
[Chart: core area in mm^2 against introduction date, 1970-2005, for the 4004, 386, 486, Pentium, Pentium Pro, Pentium II, Pentium 4, McKinley, Prescott, Banias, Yonah, and Merom. Die area per core has been shrinking after peaking with the Pentium Pro.]
Why shrinking? Diminishing returns on performance. Interconnect. Caches. Complexity.

Reliability in the long term
[Three charts: soft-error FIT per chip (logic and memory) degrading ~8% per bit per generation across the 180, 130, 90, 65, 45, 32, 22, and 16 nm nodes; widening Vt (mV) distributions showing extreme device variation; and Ion over time showing time-dependent device degradation.]
In future process generations, soft and hard errors will be more common.

Redundant multi-threading: an architecture for fault detection and recovery
[Diagram: a sphere of replication containing a leading thread and a trailing thread, with input replication on the way in and output comparison on the way out to the memory system.]
Two copies of each architecturally visible thread. Compare results: signal a fault if they differ. Multi-core enables many possible designs for redundant threading.
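A software caricature of the idea (real designs compare retired stores and register state in hardware; here a function is simply run twice and its results compared before being "committed"; the helper names are made up for illustration):

from concurrent.futures import ThreadPoolExecutor

class FaultDetected(Exception):
    pass

def redundant_call(fn, *args):
    """Run fn twice, as a 'leading' and a 'trailing' copy, and only
    commit the result if both copies agree -- a toy analogue of output
    comparison at the edge of the sphere of replication."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        leading = pool.submit(fn, *args)
        trailing = pool.submit(fn, *args)
        a, b = leading.result(), trailing.result()
    if a != b:
        raise FaultDetected(f"copies disagree: {a!r} vs {b!r}")
    return a   # outputs matched: safe to release to the memory system

# Example: both copies see the same replicated input, so results match.
print(redundant_call(sum, [1, 2, 3, 4]))   # -> 10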

Moore's Law at Intel, 1970-2005
[The same four trend charts: Feature Size, Transistor Count, Frequency, and CPU Power.]
Multi-core addresses power, performance, memory, complexity, and reliability.

Moore's Law will provide transistors
Intel process technology capabilities:
High Volume Manufacturing: 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018
Feature Size: 90nm, 65nm, 45nm, 32nm, 22nm, 16nm, 11nm, 8nm
Integration Capacity (Billions of Transistors): 2, 4, 8, 16, 32, 64, 128, 256
[Images: a 50nm transistor for the 90nm process (Source: Intel); influenza virus (Source: CDC).]
Use transistors for multiple cores, caches, and new features.

Multi-Core Processors
[Die photos: Pentium M (single core), Woodcrest (dual core), Clovertown (quad core).]

Future Architecture: More Cores
[Diagram: a many-core die with an on-die interconnect, shared cache, and fixed-function units.]
Open issues:
Cores: how many? what size? homogeneous or heterogeneous?
Power delivery and management
High-bandwidth memory
Reconfigurable cache
Scalable fabric / on-die interconnect: topology, bandwidth
Cache hierarchy: number of levels, sharing, inclusion
Fixed-function units
Scalability

The Importance of Threading
Do nothing: benefits are still visible. Operating systems are ready for multi-processing, background tasks benefit from more compute resources, and virtual machines gain as well.
Parallelize: unlock the potential. Native threads, threaded libraries, compiler-generated threads.

Performance Scaling
Amdahl's Law: Parallel Speedup = 1 / (Serial% + (1 - Serial%)/N)
[Chart: speedup vs. number of cores (up to about 30) for two serial fractions.]
Serial% = 20%: at N = 6 cores, Perf = 3 (= N/2).
Serial% = 6.7%: at N = 16 cores, Perf = 8 (= N/2).
Parallel software is key to multi-core success.
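The two labeled points on the chart drop straight out of the formula; a quick check:

def amdahl_speedup(serial_fraction, n_cores):
    """Amdahl's Law as quoted on the slide:
    speedup = 1 / (serial + (1 - serial) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

print(round(amdahl_speedup(0.20, 6), 2))     # -> 3.0: 20% serial, 6 cores
print(round(amdahl_speedup(0.067, 16), 2))   # -> 7.98: 6.7% serial, 16 cores
print(round(amdahl_speedup(0.067, 1000), 1)) # -> 14.7: the serial part caps speedup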

How does Multi-core Change Parallel Programming?
[Diagram: a traditional SMP (processors P1-P4, each with its own cache, sharing memory over a bus) vs. a CMP (cores C1-C4 with their caches on one die, sharing memory).]
No change in the fundamental programming model. Synchronization and communication costs are greatly reduced, which makes it practical to parallelize more programs. Resources are now shared (caches, memory interface), so optimization choices may be different.

Threading for Multi-Core
Architectural analysis: Intel VTune Analyzers. Introducing threads: Intel C++ Compiler. Debugging: Intel Thread Checker. Performance tuning: Intel Thread Profiler.
Intel has a full set of tools for parallel programming.

The Era of Tera: terabytes of data, teraflops of performance.
[Image collage: scientific simulation (courtesy Tsinghua University HPC Center), interactive learning, text mining, improved medical care, machine vision, immersive 3D entertainment (source: Steven K. Feiner, Columbia University), tele-present meetings, virtual realities (courtesy of the Electronic Visualization Laboratory, Univ. of Illinois at Chicago). "When personal computing finally becomes personal."]
Computationally intensive applications of the future will be highly parallel.

21st Century Computer Architecture (slide from Patterson, 2006)
Old conventional wisdom: since we cannot know future programs, find a set of old programs to evaluate designs of future computers, e.g. SPEC2006. What about parallel codes? Few are available, and they are tied to old models, languages, and architectures.
New approach: design the computers of the future for the numerical methods that will be important in the future. Claim: the key methods for the next decade are the 7 dwarfs (plus a few), so design for them. Representative codes may vary over time, but these numerical methods will be important for more than 10 years.
Patterson, at UC Berkeley, also predicts the importance of parallel applications.

Phillip Colella's Seven Dwarfs (slide from Patterson, 2006)
High-end simulation in the physical sciences = 7 numerical methods:
1. Structured Grids (including locally structured grids, e.g. Adaptive Mesh Refinement)
2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo
If we add 4 for embedded, the list covers all 41 EEMBC benchmarks:
8. Search/Sort
9. Filter
10. Combinational Logic
11. Finite State Machine
Note: data sizes (8 bit to 32 bit) and types (integer, character) differ, but the algorithms are the same. (Slide from "Defining Software Requirements for Scientific Computing", Phillip Colella, 2004.)
Well-defined targets from an algorithmic, software, and architecture standpoint. Note that scientific computing, media processing, machine learning, and statistical computing share many algorithms.

Workload Acceleration
[Chart: simulated speedup of recognition, mining, and synthesis (RMS) workloads against number of cores, up to 64. Workloads include ray tracing, modified levelset, FB_Est, body tracker, Gauss-Seidel, sparse and dense matrix kernels, Kmeans, SVD, SVM classification, Cholesky, and forward/backward solvers. Source: Intel Corporation.]
Group I scales well with increasing core count (48x-57x at 64 cores): examples include ray tracing, body tracking, and physical simulation.
Group II contains the worst-case scaling examples, yet they still scale (19x-33x at 64 cores): examples include the forward and backward solvers.

Tera-Leap to Parallelism: Energy-Efficient Performance
[Chart: energy-efficient performance over time, from single-core chips through instruction-level parallelism, Hyper-Threading, dual core, and quad core, toward tera-scale computing.]
Science fiction becomes a reality: more performance using less energy.

Summary
Technology is driving Intel to build multi-core processors: power, performance, memory latency, complexity, reliability.
Parallel programming is a central issue. Parallel applications will become mainstream.