Introduction to Multi-Core

Baskaran Ganesan (Baskaran.ganesan@intel.com)
Sr. Design Engineer, Digital Enterprise Group, Intel Corporation
Foundation for Advancement of Education and Research (FAER)

Topics
1. CPU (semiconductor) history (Session 1)
   a. Moore's Law
   b. Transistor scaling
   c. Scaling limitations & impact
   d. What then? Dual core
   e. The new era
2. Architecture (Session 2)
   a. Core architecture: core basics, platform architecture, core architecture
   b. Multi-core architecture
   c. Multi-core challenges
   d. Closing notes

Moore's Law

Moore's law at work
[Diagram labels: transistor size, transistor count, CPU arch technology, manufacturing technology, compute power, SW/IT eco-system, volume market, CPU cost]

Historical Driving Forces
[Two charts, 1970 to 2020, log scale: shrinking geometry (feature size in um) and increased frequency (MHz)]
- 1971: 4004 processor, 2,300 transistors
- 1978: 8088 processor, IBM PC
- 1986: i386 processor, 32-bit
- 1993: Pentium processor, 3.1M transistors
- 2005: Montecito, 1.7B transistors
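
As a rough check on these milestones, the sketch below estimates the average transistor-count doubling period implied by the two endpoints on the slide (the 4004 in 1971 and Montecito in 2005). The function name is illustrative, and steady exponential growth between the endpoints is an assumption.

```python
import math

def doubling_period_years(year0, count0, year1, count1):
    """Average years per doubling, assuming steady exponential growth."""
    doublings = math.log2(count1 / count0)
    return (year1 - year0) / doublings

# Endpoints taken from the slide: 2,300 transistors (1971), 1.7 billion (2005).
period = doubling_period_years(1971, 2_300, 2005, 1.7e9)
print(f"{period:.2f} years per doubling")
```

With these inputs the estimate comes out between 1.7 and 1.8 years per doubling.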

Scale Factors (loosely defined)
- Voltage scale-factor: rate at which the transistor voltage decreases with respect to a change in transistor dimensions
- Frequency scale-factor: rate at which the transistor frequency increases with respect to a change in transistor dimensions
- Cost scale-factor: rate at which the per-transistor cost decreases with respect to a change in transistor dimensions
- Count scale-factor: rate at which the transistor count increases with respect to a change in transistor dimensions

Scaling: More data

The Act of Balancing
Delivered Performance = Instructions Per Cycle (IPC) * Frequency
Goal is higher performance and lower power
Power ∝ C_dynamic * V * V * Frequency
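
The two relations on this slide can be tried out numerically. The sketch below is only an illustration: the function names and the sample operating point are assumptions, and power is in relative units because the slide gives a proportionality rather than an equation.

```python
def delivered_performance(ipc, frequency_hz):
    """Instructions per second = IPC * frequency."""
    return ipc * frequency_hz

def relative_dynamic_power(c_dynamic, voltage, frequency_hz):
    """Proportional to C_dynamic * V^2 * f (arbitrary units)."""
    return c_dynamic * voltage ** 2 * frequency_hz

# Hypothetical baseline: IPC of 2 at 3.0 GHz and 1.2 V.
base_perf = delivered_performance(ipc=2.0, frequency_hz=3.0e9)
base_power = relative_dynamic_power(1.0, 1.2, 3.0e9)

# Raising frequency by 10% at the same voltage raises both by 10%.
faster_perf = delivered_performance(ipc=2.0, frequency_hz=3.3e9) / base_perf
faster_power = relative_dynamic_power(1.0, 1.2, 3.3e9) / base_power

# Lowering voltage by 10% at the same frequency cuts power quadratically.
lower_v_power = relative_dynamic_power(1.0, 1.08, 3.0e9) / base_power

print(f"{faster_perf:.2f}x perf, {faster_power:.2f}x power, {lower_v_power:.2f}x power at -10% V")
```

The quadratic voltage term is why the later slides concentrate on lowering voltage rather than on frequency alone.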

Scaling at its best (17 years)
- 386 Processor, May 1986: 16 MHz core, 275,000 transistors, 1.5µ, ~1.2 SPECint2000
- Pentium 4 Processor, August 27, 2003: 3.2 GHz core, 55 million transistors, 0.13µ, 1249 SPECint2000
- Change: 200x frequency, 200x transistor count, 11x smaller feature size, 1000x SPECint2000
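
The quoted ratios can be recomputed from the two data points on the slide. This is a small arithmetic check; the dictionary keys are illustrative names, and the values are copied from the slide.

```python
i386 = {"frequency_hz": 16e6, "transistors": 275_000,
        "feature_um": 1.5, "specint2000": 1.2}
pentium4 = {"frequency_hz": 3.2e9, "transistors": 55_000_000,
            "feature_um": 0.13, "specint2000": 1249}

ratios = {
    "frequency": pentium4["frequency_hz"] / i386["frequency_hz"],
    "transistors": pentium4["transistors"] / i386["transistors"],
    # Feature size shrinks, so the ratio is taken old over new.
    "feature_size": i386["feature_um"] / pentium4["feature_um"],
    "specint2000": pentium4["specint2000"] / i386["specint2000"],
}

for name, value in ratios.items():
    print(f"{name}: {value:.1f}x")
```

The SPECint2000 ratio works out slightly above the rounded 1000x on the slide, and the feature-size ratio slightly above 11x.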

Architectural Innovations
- Serial, sequential execution
- Overlapped execution (pipelining)
- Multi-stage, deep pipelining
- Control-speculative execution
- Data-speculative execution
- Super-scalar execution
- Out-of-order execution
- Vector computing
- Addressing extensions
- Application specific instructions
- Multi-level on-chip caching
- Memory disambiguation
- Register renaming
- Score-boarding
- Hardware data prefetching
Many decades of computer architecture focused on Instruction-Level Parallelism (ILP) enhancement

The Challenges: Power Limitations
Diminishing voltage scaling
[Chart: supply voltage (V, log scale, 0.1 to 10) versus year (1990 to 2009) for process generations 0.7um, 0.5um, 0.35um, 0.25um, 0.18um, 0.13um, 90nm, 65nm, 45nm, 30nm; annotations: "~30%", "slowing"]
Power = Capacitance x Voltage^2 x Frequency
also Power ~ Voltage^3
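
The step from the first relation to the second is not shown on the slide. One common way to get there, stated here as an assumption rather than as the author's derivation, is that the achievable clock frequency scales roughly linearly with supply voltage:

```latex
P = C\,V^{2} f, \qquad f \propto V \;\Longrightarrow\; P \propto C\,V^{3}
```

Under that assumption, a 10% cut in voltage (and with it frequency) leaves about $0.9^{3} \approx 0.73$ of the original power.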

Heat Dissipation
[Chart: power density (W/cm2, log scale, 1 to 10,000) versus year ('70 to '10) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486 and Pentium processors, with a projection; reference levels: hot plate, nuclear reactor, rocket nozzle, Sun's surface]

What then?
[Bar chart: max frequency: 1.00x performance, 1.00x power]

Over-clocking
[Bar chart: over-clocked (+20%): 1.13x performance, 1.73x power; max frequency: 1.00x]

Under-clocking
[Bar chart: over-clocked (+20%): 1.13x performance, 1.73x power; max frequency: 1.00x; under-clocked (-20%): 0.87x performance, 0.51x power]

Multi-Core Energy-Efficient Performance
[Bar chart, relative single-core frequency and Vcc: over-clocked (+20%): 1.13x performance, 1.73x power; max frequency: 1.00x performance, 1.00x power; dual-core (-20%): 1.73x performance, 1.02x power]
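
The power figures on the clocking slides are consistent with power growing roughly with the cube of the relative frequency when supply voltage is scaled along with it. The sketch below checks that. The cubic model is an assumption (the slides only give the resulting numbers), the performance factors are copied from the slides, and the dual-core estimate assumes both cores are kept fully busy.

```python
def relative_power(freq_scale, cores=1):
    """Relative power, assuming power ~ frequency^3 per core."""
    return cores * freq_scale ** 3

# Single-core performance at each relative frequency, as read off the slides.
PERF_AT_SCALE = {1.2: 1.13, 1.0: 1.00, 0.8: 0.87}

over_power = relative_power(1.2)             # over-clocked by 20%
under_power = relative_power(0.8)            # under-clocked by 20%
dual_power = relative_power(0.8, cores=2)    # two under-clocked cores
dual_perf = 2 * PERF_AT_SCALE[0.8]           # perfect two-way scaling assumed

print(round(over_power, 2), round(under_power, 2),
      round(dual_power, 2), round(dual_perf, 2))
```

The model reproduces the 1.73x, 0.51x and 1.02x power figures. The dual-core performance estimate of 1.74x is close to the 1.73x on the slide.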

Dual core with voltage scaling
Rule of thumb: a 15% reduction in voltage yields
- Frequency reduction: 15%
- Power reduction: 45%
- Performance reduction: 10%
Single core: Area = 1, Voltage = 1, Freq = 1, Power = 1, Perf = 1
Dual core: Area = 2, Voltage = 0.85, Freq = 0.85, Power = 1, Perf = ~1.8
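
The dual-core column follows from applying the rule of thumb to each of two cores. A minimal sketch, assuming the workload splits evenly across both cores (the constant names are illustrative):

```python
POWER_REDUCTION = 0.45   # per core, from the rule of thumb
PERF_REDUCTION = 0.10    # per core, from the rule of thumb
CORES = 2

dual_core_power = CORES * (1 - POWER_REDUCTION)
dual_core_perf = CORES * (1 - PERF_REDUCTION)

print(f"power = {dual_core_power:.2f}, perf = {dual_core_perf:.2f}")
```

The computed power of 1.1 is what the slide rounds to "Power = 1", and the computed performance matches the "~1.8" figure.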

Intel: Dual & Quad Cores

A New Era
- The old: performance equals frequency; unconstrained power; voltage scaling
- The new: performance equals IPC; multi-core; power efficiency; microarchitecture advancements

Trade-off equations
- Power is costly; transistors are relatively cheap
- Frequency alone is not important; efficiency IS
- Performance-per-watt is critical; per-core performance is less so
- Computation is relatively easy; memory accesses are NOT

Q & A

Typical PC Architecture

Processor Resources
- Caches: L0, L1, L2, etc. (different levels of caches)
- General purpose registers (for SW programming)
- Segment registers & TLB (for memory management)
- FP registers, XMM registers
- System flags
- Control and data registers, debug registers, MSRs
- Many more

CMP/SMP/HT
- CMP: Chip Multi-Processing; refers to multiple physical core engines that have unique resources
  Unique: L0/L1 cache, TLBs, instruction pointer, GP regs
  Shared: L2 cache
- SMP: refers to multiple threads that share all resources (time-muxed)
  Shared: L0/L1/L2 caches, TLBs
  Unique: instruction pointer, GP regs
- Hyper-Threading: refers to multiple threads that share more resources (L0/L1 cache, for example); may or may not be part of a CMP core
- SW threading: application (SW) level threading of processes on one or more physical core engines

Core Architecture (Prescott)

Core Architecture (Xeon Dual Core)

Multi-core platform (Freescale: embedded)

Multi-Core platform (RMI-XLR: embedded)

Tilera 64 core CPU

Tilera Platform

Tera-scale Computing
[Chart: performance (KIPS, MIPS, GIPS, TIPS; IPS = instructions per second) versus dataset size (kilobytes, megabytes, gigabytes, terabytes); regions: text, multimedia, 3D & video; single core, multi-core, tera-scale]
RMS applications (recognition, mining, synthesis): entertainment, learning & travel, personal media creation and management, health

Intel Polaris (80-core)

Multi-Core: what next?

Connecting multiple cores

Platform Architecture (multi-core)
External I/F

Multi-core: Architectural Challenges
- Instruction-level parallelism vs. thread-level parallelism tradeoffs and balance
- Shared resource management (functional units, caches, TLB, BTB)
- Multi-threading vs. multi-core tradeoffs
- On- and off-chip bandwidth requirements
- Latency (execution, cache, and memory) reduction
- Memory coherence/consistency (for high-speed on-die cache hierarchies)
- Multiple domains (and crossings) in clocking, voltage, reset, ...
- Partitioning resources (between threads/cores)
- Fault tolerance (at device, storage, execution, core level), aka reliability
- On-die interconnect (optimized along latency, bandwidth, modularity, power, ...)
- Integration (of system components and/or fixed-function devices)

Multi-core: Design Challenges
- Design complexity, productivity
  Tools and methods advance, but at a slower rate than Moore's Law
  Replicating cores improves productivity
- Visibility for test & debug
  Pin bandwidth per transistor continues to decline
  Shrinking dimensions, increasing speeds
  Increased test time adding to cost
- Power
  Power delivery: di/dt of amps per nanosecond
  Thermals: overall power and thermal density

Multi-core: Eco-system challenges
- Underlying software assumptions on resource sharing
- Lack of standard mechanisms to share resource-sharing info between HW and OS
- Lack of resource-sharing-aware SW: compilers, schedulers, configuration/management (power!), etc.
- Legacy SW architectural requirements left on multi-core CPUs
- Compatibility requirements
- Many more unknowns (to the CPU design world)

Multi-core: Software Challenges
- Scalability of O/S data structures and policies: synchronization and locking, scheduling, process management, data structure sizing and management limitations, threading granularity and primitives
- Memory hierarchy awareness: impact of coherency policy, efficiency of data sharing and process migration effects, SW visibility to high-speed on-die interconnect, SW control of cache hierarchy, NUCA awareness
- High-bandwidth I/O support: lightweight interrupts, data movement and transformation engines, I/O affinity
Algorithms, programming languages, compilers, operating systems, architectures and libraries are not ready for 100s of CPUs per chip

More than the cores

Closing notes
- Single and multi-core architectures presented
- Multi-core CPU is the next-generation CPU architecture
- 2Core and Intel Quad-Core designs plenty on market already
- Many more are on their way
- Several old paradigms ineffective; several new problems to be addressed
- Chip-level multiprocessing and large caches can exploit Moore's Law
- Thread/core count in future microprocessor systems to increase
- Eco-system immature/non-existent
- Numerous domains in arch/design awaiting research & innovation, and here is where you come in!
Multi-core architecture and design ready for research, development and innovation!

Acknowledgements
- Gautam Doshi [Principal Engineer, Digital Enterprise Group]
- Ajay Bhatt [Intel Fellow, Digital Enterprise Group]
- Dileep Bhandarkar [Architect, Digital Enterprise Group]
- Sunit Tyagi [Sr. Principal Engineer, Digital Enterprise Group]
- and countless foil-wares

Resources
- Intel Tech/Research: http://www.intel.com/technology/index.htm
- Energy Efficient Performance: http://www.intel.com/technology/eep/index.htm
- Intel Core Microarchitecture: http://www.intel.com/technology/architecture/coremicro/
- Dual-core processor: http://www.intel.com/technology/computing/dual-core/index.htm
- Multi/Many Core: http://www.intel.com/multi-core/index.htm
- Intel Platforms: http://www.intel.com/platforms/index.htm
- Threading: http://www3.intel.com/cd/ids/developer/asmo-na/eng/dc/threading/index.htm

Q & A

Backup: Core uarch

Intel Core Microarchitecture
Low power, high performance, scalable
- Intel Wide Dynamic Execution
- Intel Intelligent Power Capability
- Intel Advanced Smart Cache
- Intel Smart Memory Access
- Intel Advanced Digital Media Boost
65nm products:
- Server optimized: (Xeon) Woodcrest
- Desktop optimized: (Core2 Duo) Conroe
- Mobile optimized: (Core2 Duo) Merom
*Graphics not representative of actual die photo or relative size

Intel Intelligent Power Capability
- Process: 65nm, strained silicon, low-K dielectric, more metal layers
- Coarse grained: aggressive clock gating, Enhanced SpeedStep
- Ultra fine grained: low VCC arrays, blocks controlled via sleep transistors
- Transistor: low leakage transistors, sleep transistors
Advantage (energy):
- Mobile-level power management
- Energy efficient performance
*Graphics not representative of actual die photo or relative size

Intel Wide Dynamic Execution
Each core:
- Efficient 14-stage pipeline
- Deeper buffers
- 4 wide: decode to execute
- 4 wide: micro-op execute
- Micro and macro fusion
- Enhanced ALUs
[Diagram: Core 1 and Core 2, each with instruction fetch and pre-decode, instruction queue, decode, rename/alloc, retirement unit (reorder buffer), schedulers, execute]
Advantage (perf, energy):
- 33% wider execution over previous gen
- Comprehensive advancements enabled in each core

Intel Wide Dynamic Execution: Micro and Macro Fusion
Macro fusion example: CMP+JMP in 1 clock
[Diagram: without macro fusion, instructions 1, 2 and 3 are decoded into internal instructions 1, 2 and 3, which execute and complete separately; with macro fusion, instructions 2 and 3 are decoded into one combined internal instruction, which executes as a single operation; a ucode ROM sits alongside the decoders]
Advantage (perf, energy):
- Instruction load reduced ~15%**
- Micro-ops reduced ~10%**
*Graphics not representative of actual die photo or relative size
**Workload dependent
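
The idea behind the macro fusion example can be sketched as a toy decoder pass. This is an illustration only, not a model of the actual hardware: the function name, the string-based instruction format and the rule "a `cmp` directly followed by a mnemonic starting with `j`" are all assumptions made for the example.

```python
def macro_fuse(instructions):
    """Return decoded ops, fusing each 'cmp' that is directly followed by a jump."""
    decoded = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        following = instructions[i + 1] if i + 1 < len(instructions) else None
        if current.startswith("cmp") and following is not None and following.startswith("j"):
            # Two architectural instructions become one internal operation.
            decoded.append(f"{current} + {following}")
            i += 2
        else:
            decoded.append(current)
            i += 1
    return decoded

# Hypothetical three-instruction sequence, as in the slide's diagram.
program = ["mov eax, [mem1]", "cmp eax, ebx", "jne target"]
ops = macro_fuse(program)
print(len(program), "instructions ->", len(ops), "internal ops:", ops)
```

Here three instructions decode into two internal operations, which is the effect the diagram shows for instructions 2 and 3.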

Intel Advanced Smart Cache
Dynamic L2 cache usage
[Diagram: Core microarchitecture: Core 1 and Core 2, each with its own L1 cache, share one L2 that is dynamically and bi-directionally available (decreased traffic); independent L2: each core has its own L2, not shareable (increased traffic)]
Advantage (perf, energy):
- Higher cache hit rate
- Reduced bus traffic
- Lower latency to data
*Graphics not representative of actual die photo or relative size

Intel Smart Memory Access
Hardware-based memory disambiguation
[Diagram: instruction stream INST 1 STORE [X], INST 2 LOAD [Y]. Core microarchitecture (out of order): a hardware memory disambiguation predictor lets the Inst. 2 load occur before the Inst. 1 store. Other (in order): the Inst. 2 load stalls and must wait for the Inst. 1 store to complete]
Advantage (perf, energy):
- Higher utilization of pipeline
- Masks latency to data access
- Higher performance

Intel Advanced Digital Media Boost
Single-cycle SSE in each core; fusion support
[Diagram: 128-bit SSE operation (SSE/SSE2/SSE3) on source X4, X3, X2, X1 and dest Y4, Y3, Y2, Y1. Core µarch: clock cycle 1 computes X4opY4, X3opY3, X2opY2, X1opY1. Previous: clock cycle 1 computes X2opY2, X1opY1; clock cycle 2 computes X4opY4, X3opY3]
Advantage (perf, energy):
- Increased performance
- 128-bit single cycle in each core
- Improved energy efficiency
*Graphics not representative of actual die photo or relative size

Backup: Next Gen Technologies

Traditional Operating Systems (Time-mux)

What is Virtualization?
[Diagram: without VMs, apps run on one operating system on the physical host hardware; with VMs, a new layer of software, the VM monitor (VMM), sits between the physical host hardware (processors, memory, graphics, network, storage, keyboard/mouse) and guest OS 0, guest OS 1, ..., each running its own apps in VM 0, VM 1, ...]
- Without VMs: single OS owns all hardware resources
- With VMs: multiple OSes share hardware resources
Virtualization enables multiple operating systems to run on the same platform

Types of Virtualization
- Hosted: VMM launched from within an OS, e.g., VMplayer, WSX, GSX, Virtual PC, Virtual Server. Cheap but lower performance
- Hypervisor: a bootable layer on BIOS
  Thick: embeds all the drivers, e.g., ESX
  Thin: has a service VM, e.g., Xen derivatives
- Virtual appliances: dedicated virtual machines, e.g., MojoPC

Intel Virtualization Technology (VT)
[Diagram: apps on multiple OSes, on a virtual machine monitor, on processors with Intel Virtualization Technology]
Intel VT
- First to market with native virtualization support
- Broadest HW and SW ecosystem support
- 1st VT-based SW solutions, and others
Core Microarchitecture based systems
- Significant increase in performance and improved VT performance across all segments
- Mobile: Intel Core 2 Duo Mobile Processor for Intel Centrino Duo Mobile Technology
- Desktop: Intel Core 2 Duo Desktop Processor E6000 sequence
- Server: Dual and Quad Core Intel Xeon Processor 5000 series
Get more done on every server; get more capabilities on client

Trusted Execution Technology

LT Hardware Ingredients
LT = CPU + Chipset + TPM + Protected I/O
- CPU extensions: enable domain separation; set policy for protected memory
- Protected graphics: trusted channel between graphics and trusted SW; integrated or third-party discrete graphics
- Protected keyboard & mouse: trusted channel between keyboard/mouse and trusted SW
- Protected memory management: enforces access policy to protected memory
- TPM (Trusted Platform Module v1.2): protects keys, digital certificates & attestation credentials; provides platform authentication
[Diagram blocks: Intel CPU, Intel (G)MCH, ICH, USB, LPC, RAM, TPM; legend: LT-specific enhancement]

Backup: Misc

Moore's Law Moving Forward
(Earlier columns are actual; the final columns are forecast.)
Production            1995    1997    1999    2001    2003    2005    2007    2009    2011
Generation            0.35um  0.25um  0.18um  130nm   90nm    65nm    45nm    35nm    22nm
Gate length           0.35um  0.20um  0.13um  <70nm   <50nm   <35nm   <35nm   <35nm   <22nm
Wafer size (mm)       200     200     200     300     300     300     300     300?    300?
Integration capacity  <100M   100M    200M    500M    1B      >1B     >2B     >4B     >8B
"Another decade is probably straightforward... There is certainly no end to creativity." - Gordon Moore, speaking of extending Moore's Law at ISSCC, Feb 2003

Multi-Core Power Efficiency
[Diagram: bar charts of power (scale 1 to 4) and performance for one big core with cache versus four small cores (C1 to C4) with cache]
- Small core relative to big core: power = 1/4, performance = 1/2
- Many-core is more power efficient
- Power ~ area
- Single-thread performance ~ area**.5
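
The two scaling relations on this slide (the square-root relation is often referred to as Pollack's rule) can be turned into a short comparison. The sketch below is an illustration in relative units; the function names are hypothetical, and the many-core throughput assumes a workload that parallelizes perfectly across the four cores.

```python
import math

def relative_power(area):
    """Power ~ area."""
    return area

def single_thread_performance(area):
    """Single-thread performance ~ area ** 0.5."""
    return math.sqrt(area)

BIG_CORE_AREA = 4.0     # one big core
SMALL_CORE_AREA = 1.0   # each of four small cores

big_power = relative_power(BIG_CORE_AREA)
big_perf = single_thread_performance(BIG_CORE_AREA)

small_power = relative_power(SMALL_CORE_AREA)
small_perf = single_thread_performance(SMALL_CORE_AREA)

# Four small cores occupy the same area (and power) as the big core.
many_core_power = 4 * small_power
many_core_throughput = 4 * small_perf   # assumes perfectly parallel work

print(f"big core:    power {big_power}, performance {big_perf}")
print(f"4 small:     power {many_core_power}, throughput {many_core_throughput}")
```

For the same area and power, the four small cores offer twice the aggregate throughput of the big core, while each thread runs at half the big core's speed.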

Multi-Core and Memory Gap
Growing performance gap
[Chart: peak instructions per DRAM access (0 to 700), 1992 to 2002, showing a widening gap between logic and memory; processors: Pentium 66MHz, Pentium-Pro 200MHz, PentiumIII 1100MHz, Pentium4 2 GHz]
- Reduce DRAM access with large caches
  Extra benefit: power savings. Cache is lower power than logic
- Tolerate memory latency with multiple threads
  Multiple cores
  Hyper-threading

Multi-threading tolerates memory latency
[Timeline: serial execution: A_i, idle, A_i+1, then B_i, idle, B_i+1. Multi-threaded execution: thread A runs A_i, idle, A_i+1; thread B runs B_i and B_i+1, with B_i filling A's idle slot]
- Execute thread B while thread A waits for memory
- Multi-core has a similar effect

Multi-core tolerates memory latency
[Timeline: serial execution: A_i, idle, A_i+1, then B_i, idle, B_i+1. Multi-core execution: A_i, idle, A_i+1 on one core while B_i, idle, B_i+1 run on another]
- Execute thread A and B simultaneously
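
The serial, multi-threaded and multi-core timelines can be compared with a small timing model. This is a sketch under simplifying assumptions: each thread is a list of (compute, memory wait) pairs in arbitrary time units, switching threads is free, and the burst lengths below are made up for illustration.

```python
def serial_time(threads):
    """One core, threads run back to back; the core idles during every wait."""
    return sum(compute + wait for bursts in threads for compute, wait in bursts)

def multithreaded_time(threads):
    """One core that switches to another ready thread while one waits on memory."""
    ready_at = [0] * len(threads)     # time at which each thread can run again
    next_burst = [0] * len(threads)   # index of each thread's next burst
    now = 0
    finish = 0
    while True:
        pending = [i for i, bursts in enumerate(threads) if next_burst[i] < len(bursts)]
        if not pending:
            return finish
        i = min(pending, key=lambda k: ready_at[k])   # earliest-ready thread first
        now = max(now, ready_at[i])                   # core idles if nothing is ready
        compute, wait = threads[i][next_burst[i]]
        now += compute
        ready_at[i] = now + wait
        next_burst[i] += 1
        finish = max(finish, ready_at[i])

def multicore_time(threads):
    """One core per thread; total time is that of the slowest thread."""
    return max(sum(compute + wait for compute, wait in bursts) for bursts in threads)

# Hypothetical threads: A_i and B_i each compute for 2 units, then wait 3 on memory.
thread_a = [(2, 3), (2, 0)]
thread_b = [(2, 3), (2, 0)]
threads = [thread_a, thread_b]

print(serial_time(threads), multithreaded_time(threads), multicore_time(threads))
```

With these numbers the serial schedule takes 14 units, the multi-threaded one 9 and the multi-core one 7, matching the qualitative picture on the slides.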

How does Multicore Change Parallel Programming?
[Diagram: SMP: processors P1 to P4, each with its own cache, sharing memory. CMP: cores C1 to C4 with their caches on one chip, sharing memory]
- No change in fundamental programming model
- Synchronization and communication costs greatly reduced
  Makes it practical to parallelize more programs
- Resources now shared
  Caches
  Memory interface
  Optimization choices may be different

Art of the Possible
- Billion transistors realized in 65nm Si process
- Multi-billion transistors possible in future Si process
- Large die sizes can be built: 400 to 600 square millimeters
What can fit on a single die?
For 65nm (rough est.): 30 mm2 per proc., 15 mm2 per MB
Die size (core + cache only) in mm2:
              2 cores   4 cores   8 cores
16 MB cache   300       360       480
32 MB cache   540       600       720
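
The die-size table follows directly from the two per-unit estimates on the slide. A minimal sketch that recomputes it (the function and constant names are illustrative):

```python
MM2_PER_CORE = 30       # rough 65nm estimate from the slide
MM2_PER_MB_CACHE = 15   # rough 65nm estimate from the slide

def die_size_mm2(cores, cache_mb):
    """Core + cache area only, in square millimeters."""
    return cores * MM2_PER_CORE + cache_mb * MM2_PER_MB_CACHE

table = {cache_mb: [die_size_mm2(cores, cache_mb) for cores in (2, 4, 8)]
         for cache_mb in (16, 32)}

for cache_mb, sizes in table.items():
    print(f"{cache_mb} MB cache: {sizes}")
```

All six entries land at or below 720 mm2; the 32 MB, 8-core case exceeds the 400 to 600 mm2 range quoted for buildable large dies.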

Quad Cores here a quarter ago already!

Multi-Core