Why Intel is designing multi-core processors. Geoff Lowney, Intel Fellow, Microprocessor Architecture and Planning. July 2006



Moore's Law at Intel, 1970-2005
[Four trend charts, plotted against introduction year: Feature Size (um) shrinking roughly 0.7x per generation; Transistor Count growing roughly 2.0x per generation; Frequency (MHz) growing roughly 1.5-2.0x per generation; CPU Power (W) growing roughly 1.4x per generation.]
Power trend not sustainable

386 Processor vs. Pentium 4 Processor
386 Processor: May 1986, 16 MHz core, 275,000 transistors at 1.5u, ~1.2 SPECint2000.
Pentium 4 Processor: August 27, 2003, 3.2 GHz core, 55 million transistors at 0.13u, 1249 SPECint2000.
17 years: roughly 200x frequency, 200x transistors, 1000x performance.

Frequency: 200x. Performance: 1000x.
[Two log-scale charts against introduction date (Jan-85 through Jan-05): Clock Frequency, and SPECint2000.]

SPECint2000/MHz (normalized)
[Chart: SPECint2000 per MHz, normalized, against introduction date (Jan-85 through Jan-05) for the 386, 486, 486DX2, P5, P55-MMX, PPro/PII, PIII, and P4.]
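The per-MHz trend can be sanity-checked from the numbers on the 386 vs. Pentium 4 slide; a small sketch (normalizing to the 386 is an assumption about how the chart was scaled):

# Rough reconstruction of the per-MHz comparison using the 386 and
# Pentium 4 figures quoted earlier in the deck.
chips = {
    "386":       {"mhz": 16,   "specint2000": 1.2},
    "Pentium 4": {"mhz": 3200, "specint2000": 1249},
}

base = chips["386"]["specint2000"] / chips["386"]["mhz"]  # normalize to the 386

for name, c in chips.items():
    per_mhz = c["specint2000"] / c["mhz"]
    print(f"{name}: {per_mhz / base:.1f}x SPECint2000/MHz relative to the 386")

# Frequency grew ~200x and per-MHz efficiency ~5x, which together give
# the ~1000x overall performance gain cited on the previous slide.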

[Bar charts across microarchitecture generations (pipelined, superscalar, out-of-order speculative, deep pipeline): relative increase in area vs. performance, and relative increase in power vs. MIPS/W.]
Performance scales roughly with area**0.5. Power efficiency has dropped.

Pushing Frequency
[Chart: performance vs. relative frequency (deeper pipelining) shows diminishing returns. Chart: sub-threshold leakage vs. relative frequency (process technology) increases exponentially.]
We maximized frequency by building deeper pipelines and pushing the process to the limit. Dynamic power increases with frequency; leakage power increases as Vt is reduced.
Diminishing return on performance. Increase in power.

Moore's Law at Intel, 1970-2005
[The same four trend charts as the opening slide: Feature Size, Transistor Count, Frequency, and CPU Power.]
Power trend not sustainable

Reducing power with voltage scaling
Power = Capacitance * Voltage**2 * Frequency
Frequency ~ Voltage in the region of interest, so Power ~ Voltage**3.
A 10% reduction in voltage yields a 10% reduction in frequency, a 30% reduction in power, and less than a 10% reduction in performance.
Rule of thumb, per 1% voltage reduction: Frequency -1%, Power -3%, Performance -0.66%.
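These relationships are easy to play with numerically. A minimal sketch, assuming the slide's models (power proportional to voltage cubed once frequency tracks voltage, and performance falling about 0.66% per 1% of frequency):

def voltage_scale(voltage_reduction):
    """Estimate relative frequency, power, and performance after a
    voltage reduction, using the slide's models: frequency tracks
    voltage, power ~ voltage**3, and performance loses ~0.66% per 1%
    of frequency (an assumption used to match the rule of thumb)."""
    v = 1.0 - voltage_reduction          # relative voltage
    freq = v                             # frequency ~ voltage
    power = v ** 3                       # power ~ C * V**2 * f ~ V**3
    perf = freq ** 0.66                  # diminishing sensitivity to clock
    return freq, power, perf

# The slide's example: a 10% voltage reduction.
f, p, s = voltage_scale(0.10)
print(f"frequency {1-f:.0%} lower, power {1-p:.0%} lower, performance {1-s:.1%} lower")
# -> 10% lower frequency, ~27% lower power (the slide rounds to 30%),
#    ~7% lower performance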

Dual core with voltage scaling
By the rule of thumb, a 15% reduction in voltage yields a 15% frequency reduction, a 45% power reduction, and a 10% performance reduction.
SINGLE CORE: Area = 1, Voltage = 1, Freq = 1, Power = 1, Perf = 1
DUAL CORE: Area = 2, Voltage = 0.85, Freq = 0.85, Power = 1, Perf = ~1.8
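The single-core vs. dual-core table follows from applying that rule of thumb per core and doubling the core count. A sketch under the slide's own simplifications (linear rule of thumb extrapolated to a 15% reduction, perfectly parallel workload):

def scaled_cores(voltage_reduction, n_cores):
    """Apply the slide's linear rule of thumb: each 1% of voltage
    reduction costs ~1% frequency, ~3% power, and ~0.66% performance
    per core. Totals assume a perfectly parallel workload."""
    power_per_core = 1.0 - 3.0 * voltage_reduction
    perf_per_core = 1.0 - 0.66 * voltage_reduction
    return n_cores * power_per_core, n_cores * perf_per_core

single_power, single_perf = scaled_cores(0.00, n_cores=1)
dual_power, dual_perf = scaled_cores(0.15, n_cores=2)
print(f"single core: power {single_power:.2f}, perf {single_perf:.2f}")  # 1.00, 1.00
print(f"dual core:   power {dual_power:.2f}, perf {dual_perf:.2f}")      # 1.10, 1.80
# Two cores at 0.85x voltage give ~1.8x the throughput in roughly the
# same power envelope, matching the slide's single- vs. dual-core table.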

Multiple cores deliver more performance per watt
[Diagram: one big core plus cache vs. four small cores (C1-C4) plus cache in the same die area. A small core has Power = 1/4 and Performance = 1/2 of the big core.]
Many core is more power efficient: Power ~ area, but single-thread performance ~ area**0.5.
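A small sketch of the trade behind this slide, using its stated rules (power proportional to core area, single-thread performance proportional to area**0.5); the 4-to-1 area split is the slide's own example:

def core_metrics(area):
    """Per-core power and single-thread performance from the slide's
    rules: power ~ area, single-thread performance ~ area**0.5."""
    return area, area ** 0.5

big_power, big_perf = core_metrics(area=4.0)      # one big core
small_power, small_perf = core_metrics(area=1.0)  # one of four small cores

print(f"big core:      power {big_power:.1f}, single-thread perf {big_perf:.1f}")
print(f"small core:    power {small_power:.1f}, single-thread perf {small_perf:.1f}")
print(f"4 small cores: power {4*small_power:.1f}, aggregate perf {4*small_perf:.1f}")
# Same total area and power as the big core, but 4.0 vs. 2.0 aggregate
# throughput -- provided the workload can use all four cores.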

Memory Gap
[Chart: peak instructions per DRAM access, 1992-2002, for the Pentium 66 MHz, Pentium Pro 200 MHz, Pentium III, and Pentium 4 2 GHz. The gap between logic and memory keeps growing.]
Reduce DRAM accesses with large caches. Extra benefit: power savings, since cache is lower power than logic.
Tolerate memory latency with multiple threads: multiple cores, Hyper-Threading.

Multi-threading tolerates memory latency
[Timeline diagram: serial execution interleaves compute phases A_i, A_i+1, B_i, B_i+1 with idle memory-wait gaps; multi-threaded execution fills thread A's memory-wait gaps with work from thread B.]
Execute thread B while thread A waits for memory. Multi-core has a similar effect.

Multi-core tolerates memory latency
[Timeline diagram: in serial execution, threads A and B run back-to-back, each stalling on memory; in multi-core execution, A runs on one core and B on another, so their memory stalls overlap.]
Execute threads A and B simultaneously.
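The overlap in these two slides can be mimicked in software, with a sleep standing in for a memory stall; this is a toy illustration of latency hiding, not the hardware mechanism itself:

import threading
import time

def worker(name, iterations=3, stall=0.1):
    """Alternate a compute phase with a simulated memory stall."""
    for _ in range(iterations):
        _ = sum(range(10_000))       # "compute": A_i or B_i
        time.sleep(stall)            # "memory stall": the idle gap

start = time.perf_counter()
worker("A"); worker("B")             # serial: the stalls add up
serial = time.perf_counter() - start

start = time.perf_counter()
threads = [threading.Thread(target=worker, args=(n,)) for n in "AB"]
for t in threads: t.start()
for t in threads: t.join()
overlapped = time.perf_counter() - start

print(f"serial {serial:.2f}s vs. overlapped {overlapped:.2f}s")
# The stalls of A and B overlap, so the threaded run finishes sooner --
# the same idea the hardware exploits with Hyper-Threading or a second core.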

Core Area (with L1 caches) Trend
[Chart: core area in mm^2 against introduction date, 1970-2005, for the 4004, 386, 486, Pentium, Pentium Pro, Pentium II, Pentium 4, McKinley, Prescott, Banias, Yonah, and Merom. Die area per core has been shrinking after peaking with the Pentium Pro.]
Why shrinking? Diminishing returns on performance. Interconnect. Caches. Complexity.

Reliability in the long term
[Three charts: soft-error FIT per chip (logic and memory) degrading ~8% per bit per generation across the 180, 130, 90, 65, 45, 32, 22, and 16 nm nodes; widening Vt (mV) distributions showing extreme device variation; and Ion over time showing time-dependent device degradation.]
In future process generations, soft and hard errors will be more common.

Redundant multi-threading: an architecture for fault detection and recovery
[Diagram: a sphere of replication containing a leading thread and a trailing thread, with input replication on the way in and output comparison on the way out to the memory system.]
Two copies of each architecturally visible thread. Compare results: signal a fault if they differ. Multi-core enables many possible designs for redundant threading.
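A software caricature of the idea (real designs compare retired stores and register state in hardware; here a function is simply run twice and its results compared before being "committed"; the helper names are made up for illustration):

from concurrent.futures import ThreadPoolExecutor

class FaultDetected(Exception):
    pass

def redundant_call(fn, *args):
    """Run fn twice, as a 'leading' and a 'trailing' copy, and only
    commit the result if both copies agree -- a toy analogue of output
    comparison at the edge of the sphere of replication."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        leading = pool.submit(fn, *args)
        trailing = pool.submit(fn, *args)
        a, b = leading.result(), trailing.result()
    if a != b:
        raise FaultDetected(f"copies disagree: {a!r} vs {b!r}")
    return a   # outputs matched: safe to release to the memory system

# Example: both copies see the same replicated input, so results match.
print(redundant_call(sum, [1, 2, 3, 4]))   # -> 10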

Moore's Law at Intel, 1970-2005
[The same four trend charts: Feature Size, Transistor Count, Frequency, and CPU Power.]
Multi-core addresses power, performance, memory, complexity, and reliability.

Moore's Law will provide transistors
Intel process technology capabilities:
High Volume Manufacturing: 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018
Feature Size: 90nm, 65nm, 45nm, 32nm, 22nm, 16nm, 11nm, 8nm
Integration Capacity (Billions of Transistors): 2, 4, 8, 16, 32, 64, 128, 256
[Images: a 50nm transistor for the 90nm process (Source: Intel); influenza virus (Source: CDC).]
Use transistors for multiple cores, caches, and new features.

Multi-Core Processors
[Die photos: Pentium M (single core), Woodcrest (dual core), Clovertown (quad core).]

Future Architecture: More Cores
[Diagram: a many-core die with an on-die interconnect, shared cache, and fixed-function units.]
Open issues:
Cores: how many? what size? homogeneous or heterogeneous?
Power delivery and management
High-bandwidth memory
Reconfigurable cache
Scalable fabric / on-die interconnect: topology, bandwidth
Cache hierarchy: number of levels, sharing, inclusion
Fixed-function units
Scalability

The Importance of Threading
Do nothing: benefits are still visible. Operating systems are ready for multi-processing, background tasks benefit from more compute resources, and virtual machines gain as well.
Parallelize: unlock the potential. Native threads, threaded libraries, compiler-generated threads.

Performance Scaling
Amdahl's Law: Parallel Speedup = 1 / (Serial% + (1 - Serial%)/N)
[Chart: speedup vs. number of cores (up to about 30) for two serial fractions.]
Serial% = 20%: at N = 6 cores, Perf = 3 (= N/2).
Serial% = 6.7%: at N = 16 cores, Perf = 8 (= N/2).
Parallel software is key to multi-core success.
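The two labeled points on the chart drop straight out of the formula; a quick check:

def amdahl_speedup(serial_fraction, n_cores):
    """Amdahl's Law as quoted on the slide:
    speedup = 1 / (serial + (1 - serial) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

print(round(amdahl_speedup(0.20, 6), 2))     # -> 3.0: 20% serial, 6 cores
print(round(amdahl_speedup(0.067, 16), 2))   # -> 7.98: 6.7% serial, 16 cores
print(round(amdahl_speedup(0.067, 1000), 1)) # -> 14.7: the serial part caps speedup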

How does Multi-core Change Parallel Programming?
[Diagram: a traditional SMP (processors P1-P4, each with its own cache, sharing memory over a bus) vs. a CMP (cores C1-C4 with their caches on one die, sharing memory).]
No change in the fundamental programming model. Synchronization and communication costs are greatly reduced, which makes it practical to parallelize more programs. Resources are now shared (caches, memory interface), so optimization choices may be different.

Threading for Multi-Core
Architectural analysis: Intel VTune Analyzers. Introducing threads: Intel C++ Compiler. Debugging: Intel Thread Checker. Performance tuning: Intel Thread Profiler.
Intel has a full set of tools for parallel programming.

The Era of Tera: terabytes of data, teraflops of performance.
[Image collage: scientific simulation (courtesy Tsinghua University HPC Center), interactive learning, text mining, improved medical care, machine vision, immersive 3D entertainment (source: Steven K. Feiner, Columbia University), tele-present meetings, virtual realities (courtesy of the Electronic Visualization Laboratory, Univ. of Illinois at Chicago). "When personal computing finally becomes personal."]
Computationally intensive applications of the future will be highly parallel.

21st Century Computer Architecture (slide from Patterson, 2006)
Old conventional wisdom: since we cannot know future programs, find a set of old programs to evaluate designs of future computers, e.g. SPEC2006. What about parallel codes? Few are available, and they are tied to old models, languages, and architectures.
New approach: design the computers of the future for the numerical methods that will be important in the future. Claim: the key methods for the next decade are the 7 dwarfs (plus a few), so design for them. Representative codes may vary over time, but these numerical methods will be important for more than 10 years.
Patterson, at UC Berkeley, also predicts the importance of parallel applications.

Phillip Colella's Seven Dwarfs (slide from Patterson, 2006)
High-end simulation in the physical sciences = 7 numerical methods:
1. Structured Grids (including locally structured grids, e.g. Adaptive Mesh Refinement)
2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo
If we add 4 for embedded, the list covers all 41 EEMBC benchmarks:
8. Search/Sort
9. Filter
10. Combinational Logic
11. Finite State Machine
Note: data sizes (8 bit to 32 bit) and types (integer, character) differ, but the algorithms are the same. (Slide from "Defining Software Requirements for Scientific Computing", Phillip Colella, 2004.)
Well-defined targets from an algorithmic, software, and architecture standpoint. Note that scientific computing, media processing, machine learning, and statistical computing share many algorithms.

Workload Acceleration
[Chart: simulated speedup of recognition, mining, and synthesis (RMS) workloads against number of cores, up to 64. Workloads include ray tracing, modified levelset, FB_Est, body tracker, Gauss-Seidel, sparse and dense matrix kernels, Kmeans, SVD, SVM classification, Cholesky, and forward/backward solvers. Source: Intel Corporation.]
Group I scales well with increasing core count (48x-57x at 64 cores): examples include ray tracing, body tracking, and physical simulation.
Group II contains the worst-case scaling examples, yet they still scale (19x-33x at 64 cores): examples include the forward and backward solvers.

Tera-Leap to Parallelism: Energy-Efficient Performance
[Chart: energy-efficient performance over time, from single-core chips through instruction-level parallelism, Hyper-Threading, dual core, and quad core, toward tera-scale computing.]
Science fiction becomes a reality: more performance using less energy.

Summary
Technology is driving Intel to build multi-core processors: power, performance, memory latency, complexity, reliability.
Parallel programming is a central issue. Parallel applications will become mainstream.