Why Intel is designing multi-core processors. Geoff Lowney Intel Fellow, Microprocessor Architecture and Planning July, 2006

Size: px

Start display at page:

Download "Why Intel is designing multi-core processors. Geoff Lowney Intel Fellow, Microprocessor Architecture and Planning July, 2006"

Lydia Douglas
7 years ago
Views:

1 Why Intel is designing multi-core processors Geoff Lowney Intel Fellow, Microprocessor Architecture and Planning

2 Moore s Law at Intel Feature Size (um).e+09 Transistor Count 0.79x.E x x.E Frequency (MHz) 0000.E CPU Power (W) x.5x.4x Power trend not sustainable

3 Pentium 4 Processor 386 Processor 3 May MHz core 275,000.5µ transistors ~.2 SPECint Years 200x 200x/x 000x August 27, GHz core 55 Million 0.3µ transistors 249 SPECint2000

4 Frequency: 200x Clock Frequency 00 0 Performance: 000x 0000 Jan-85 Jan-87 Jan-89 Jan-9 Jan-93 Jan-95 Jan-97 Jan-99 Jan-0 Jan-03 Jan-05 Introduction Date 000 SPECint Jan-85 Jan-87 Jan-89 Jan-9 Jan-93 Jan-95 Jan-97 Jan-99 Jan-0 Jan-03 Jan-05 Introduction Date

5 SPECint2000/MHz (normalized) 7 PIII Normalized SPECint2000/MHz P DX2 PPro/II P55-MMX P4 0 Jan-85 Jan-87 Jan-89 Jan-9 Jan-93 Jan-95 Jan-97 Jan-99 Jan-0 Jan-03 Jan-05 Introduction Date 5

6 5 Increase (X) Area X Perf X Performance scales with area**.5 0 Pipelined S-Scalar OOO- Spec Deep Pipe 3 2 Power X Mips/W (%) Power efficiency has dropped Increase (X) 0 - Pipelined S-Scalar OOO- Spec Deep Pipe 6

7 Pushing Frequency Performance Diminishing Return Relative Frequency (Pipelining) Pipeline & Performance Maximized frequency by Deeper pipelines Pushed process to the limit Dynamic power increases with frequency Leakage power increases with reducing Vt Sub-threshold Leakage increases exponentially Relative Frequency Process Technology 7 Diminishing return on performance. Increase in power

8 Moore s Law at Intel Feature Size (um).e+09 Transistor Count 0.79x.E x x.E Frequency (MHz) 2.0x.5x E CPU Power (W).4x Power trend not sustainable

9 Reducing power with voltage scaling Power = Capacitance * Voltage**2 * Frequency Frequency ~ Voltage in region of interest Power ~ Voltage ** 3 0% reduction of voltage yields 0% reduction in frequency 30% reduction in power Less than 0% reduction in performance Rule of Thumb Voltage Frequency Power Performance % % 3% 0.66% 9

10 Dual core with voltage scaling A 5% Reduction In Voltage Yields RULE OF THUMB Frequency Reduction Power Reduction Performance Reduction 5% 45% 0% SINGLE CORE DUAL CORE Area = Voltage = Freq = Power = Perf = Area = 2 Voltage = 0.85 Freq = 0.85 Power = Perf = ~.8 0

11 Multiple cores deliver more performance per watt Cache Big core Power Power = ¼ 4 Performance = /2 3 2 Performance 2 Small core C C3 Cache C2 C Many core is more power efficient Power ~ area Single thread performance ~ area**.5

12 Memory Gap Growing Performance Gap Peak Instructions Per DRAM Access LOGIC GAP MEMORY 00 0 Pentium 66MHz Pentium-Pro 200MHz PentiumIII 00MHz Pentium4 2 GHz Reduce DRAM access with large caches Extra benefit: power savings. Cache is lower power than logic 2 Tolerate memory latency with multiple threads Multiple cores Hyper-threading

13 Multi-threading tolerates memory latency Serial Execution A i Idle A i+ B i Idle B i+ Multi-threaded Execution A i Idle A i+ B i B i+ Execute thread B while thread A waits for memory 3 Multi-core has a similar effect

14 Multi-core tolerates memory latency Serial Execution A i A i+ B i Idle Idle B i+ Multi-core Execution A i Idle A i+ B i Idle B i+ Execute thread A and B simultaneously 4

15 Core Area (with L caches) Trend Die area/core shrinking after peaking with PPro Pentium Processor PPro Core Area - mm^ Pentium II processor Pentium 4 processor McKinley Prescott Banias 4004 Yonah Merom Introduction Date Why shrinking? Diminishing returns on performance. Interconnect. Caches. Complexity.

16 Reliability in the long term Relative ~8% degradation/bit/generation Soft Error FIT/Chip (Logic & Mem) Relative Wider Vt(mV) Extreme device variations Ion Relative Time dependent device degradation 6 Time In future process generations, soft and hard errors will be more common.

17 Redundant multi-threading: an architecture for fault detection and recovery Sphere of Replication Leading Thread Trailing Thread Input Replication Output Comparison Memory System Two copies of each architecturally visible thread Compare results: signal fault if different Multi-core enables many possible designs for redundant 7 threading.

18 Moore s Law at Intel Feature Size (um).e+09 Transistor Count 0.79x.E x x.E Frequency (MHz) 2.0x.5x.E CPU Power (W) 00.4x 8 0 Multi-core addresses power, performance, memory, complexity, reliability

19 Moore s Law will provide transistors Intel process technology capabilities High Volume Manufacturing Feature Size 90nm 65nm 45nm 32nm 22nm 6nm nm 8nm Integration Capacity (Billions of Transistors) nm Transistor for 90nm Process Source: Intel Influenza Virus Source: CDC 9 Use transistors for multiple cores, caches, and new features.

20 Multi-Core Processors CLOVERTOWN WOODCREST Quad Core PENTIUM M Dual Core Single Core 20

21 Future Architecture: More Cores Open issues Cores How many? What size? Homogeneous? Heterogeneous? Power delivery and management High bandwidth memory Reconfigurable cache Scalable fabric On-die Interconnect Topology Bandwidth Core Core Core Core Core Core Core Core Core Core Core Core Core Cache hierarchy Number of levels Sharing Inclusion Core Core Fixed-function units Scalability 2

22 The Importance of Threading Do Nothing: Benefits Still Visible Operating systems ready for multi-processing Background tasks benefit from more compute resources Virtual machines Parallelize: Unlock the Potential Native threads Threaded libraries Compiler generated threads 22

23 Performance Scaling Performance Amdahl s Law: Parallel Speedup = /(Serial% + (-Serial%)/N) Serial% = 20% N = 6, N /2 = 3 6 Cores, Perf = Number of Cores Serial% = 6.7% N = 6, N /2 = 8 6 Cores, Perf = 8 23 Parallel software key to Multi-core success

24 How does Multicore Change Parallel Programming? SMP P cache CMP C cache P2 P3 P4 cache cache cache Memory C2 C3 C4 cache cache cache Memory No change in fundamental programming model Synchronization and communication costs greatly reduced Makes it practical to parallelize more programs Resources now shared Caches Memory interface Optimization choices may be different 24

25 Threading for Multi-Core Architectural Analysis Intel VTune Analyzers Introducing Threads Intel C++ Compiler Debugging Intel Thread Checker Performance Tuning Intel Thread Profiler 25 Intel has a full set of tools for parallel programming

26 The Era Of Tera Terabytes of data. Teraflops of performance. Scientific Simulation Courtesy Tsinghua University HPC Center Interactive learning Text Mining Improved Medical care When personal computing finally becomes personal X Y Machine vision Immersive 3D entertainment Source: Steven K. Feiner, Columbia University. Tele- present meetings Virtual realities Courtesy of the Electronic Visualization Computationally intensive applications of the future will 26 Laboratory, Univ. Illinois at Chicago. Tele-present meetings be highly parallel

27 2 st Century Computer Architecture Slide from Patterson 2006 Old CW: Since cannot know future programs, find set of old programs to evaluate designs of computers for the future E.g., SPEC2006 What about parallel codes? Few available, tied to old models, languages, architectures, New approach: Design computers of future for numerical methods important in future Claim: key methods for next decade are 7 dwarves (+ a few), so design for them! Representative codes may vary over time, but these numerical methods will be important for > 0 years 27 Patterson, UC Berkeley, also predicts importance of parallel applications.

28 Phillip Colella s Seven dwarfs High-end simulation in the physical sciences = 7 numerical methods: Slide from Patterson 2006 tructured Grids (including locally structured grids, e.g. Adaptive Mesh Refinement) nstructured Grids ast Fourier Transform ense Linear Algebra parse Linear Algebra articles onte Carlo If add 4 for embedded, covers all 4 EEMBC benchmarks 8. Search/Sort 9. Filter 0. Combinational logic. Finite State Machine Note: Data sizes (8 bit to 32 bit) and types (integer, character) differ, but algorithms the same Slide from Defining Software Requirements for Scientific Computing, Phillip Colella, 2004 Well-defined targets from algorithmic, software, and architecture standpoint Note scientific computing, media processing, machine 28 learning, statistical computing share many algorithms

29 Workload Acceleration Recognition RMS Workload Speedup (Simulated) 48x - 57x Ray Tracing (Avg.) Mod. Levelset FB_Est Body Tracker Gauss-Seidel Mining Speedup x - 33x Sparse Matrix (Avg.) Dense Matrix (Avg.) Kmeans SVD SVM_classification 6 Cholesky (watson) Synthesis # of Cores Cholesky (mod2) Forward Solver (pds-0) Forward Solver (world) Backward Solver (pds- 0) Backward Solver (watson) 29 Group I Scale well with increasing core count Group II Worst Examples: Ray tracing, body tracking, physical simulation Worst-case scaling examples, yet still scale Examples: Forward & backward solvers Source: Intel Corporation

30 Tera-Leap to Parallelism: ENERGY-EFFICIENT PERFORMANCE Energy Efficient Performance Dual Core Hyper-Threading Instruction level parallelism TIME Tera-Scale Computing Quad-Core Science fiction becomes A REALITY More performance Using less energy Single-core chips 30

31 Summary Technology is driving Intel to build multi-core processors. Power Performance Memory latency Complexity Reliability Parallel programming is a central issue. Parallel applications will become mainstream 3

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide