High Performance Processor Architecture André Seznec IRISA/INRIA ALF project-team 1
2 Moore's «Law»: the number of transistors on a microprocessor chip doubles every 18 months. 1972: 2,000 transistors (Intel 4004); 1979: 30,000 transistors (Intel 8086); 1989: 1 M transistors (Intel 80486); 1999: 130 M transistors (HP PA-8500); 2005: 1.7 billion transistors (Intel Itanium Montecito). Processor performance doubles every 18 months. 1989: Intel 80486, 16 MHz (< 1 inst/cycle); 1993: Intel Pentium, 66 MHz x 2 inst/cycle; 1995: Intel PentiumPro, 150 MHz x 3 inst/cycle; 06/2000: Intel Pentium III, 1 GHz x 3 inst/cycle; 09/2002: Intel Pentium 4, 2.8 GHz x 3 inst/cycle; 09/2005: Intel Pentium 4, dual core, 3.2 GHz x 3 inst/cycle x 2 processors
3 Not just the IC technology: VLSI brings the transistors and the frequency; microarchitecture and code generation/optimization bring the effective performance
4 The hardware/software interface, from software to hardware: high-level language -> compiler/code generation -> Instruction Set Architecture (ISA) -> microarchitecture -> transistors
5 Instruction Set Architecture (ISA). The hardware/software interface: the compiler translates programs into instructions; the hardware executes instructions. Examples: Intel x86 (1979), still your PC's ISA; MIPS, SPARC (mid-80s); Alpha, PowerPC (90s). ISAs evolve by successive add-ons: 16 bits to 32 bits, new multimedia instructions, etc. Introducing a new ISA requires good reasons: new application domains, new constraints, no legacy code
6 Microarchitecture: a macroscopic vision of the hardware organization. Neither at the transistor level nor at the gate level, but understanding the processor organization at the functional-unit level
7 What is microarchitecture about? Memory access time is 100 ns. Program semantics are sequential. But modern processors can execute 4 instructions every 0.25 ns. How can we achieve that?
High performance processors everywhere 8 General-purpose processors (i.e. no special target application domain): servers, desktops, laptops, PDAs. Embedded processors: set-top boxes, cell phones, automotive, ...: special-purpose processors, or processors derived from a general-purpose processor
9 Performance needs. Performance: reduce the response time. Scientific applications: treat larger problems. Databases. Signal processing, multimedia. Historically, over the last 50 years, improving performance for today's applications has fostered new, even more demanding applications
10 How to improve performance? At every level of the stack: algorithm: use a better algorithm; language/compiler: optimize code; instruction set (ISA): improve the ISA; microarchitecture: a more efficient microarchitecture; transistors: new technology
11 How transistors are used: evolution. In the 70s: enriching the ISA, increasing functionality to decrease the instruction count. In the 80s: caches and registers, decreasing external accesses; ISAs from 8 to 16 to 32 bits. In the 90s: instruction parallelism; more instructions, lots of control, lots of speculation; more caches. In the 2000s: more and more thread parallelism, core parallelism
12 A few technological facts (2005). Frequency: 1-3.8 GHz. An ALU operation: 1 cycle. A floating-point operation: 3 cycles. Read/write of a register: 2-3 cycles, often a critical path. Read/write of the L1 cache: 1-3 cycles, depending on many implementation choices
A few technological parameters (2005) 13 Integration technology: 90 nm, 65 nm. 20-30 million transistors of logic; caches/predictors: up to one billion transistors. 20-75 watts: 75 watts is a limit for cooling at reasonable hardware cost; 20 watts is a limit for reasonable laptop power consumption. 400-800 pins; 939 pins on the dual-core Athlon
14 The architect's challenge: 400 mm2 of silicon, 2-3 technology generations ahead. What will you use for performance? Pipelining, instruction-level parallelism, speculative execution, memory hierarchy, thread parallelism
Up to now, what was microarchitecture about? 15 Memory access time is 100 ns. Program semantics are sequential. An instruction's life (fetch, decode, ..., execute, ..., memory access, ...) is 10-20 ns. How can we use the transistors to achieve the highest possible performance? So far, up to 4 instructions every 0.3 ns
The architect's toolbox for uniprocessor performance 16 Pipelining, instruction-level parallelism, speculative execution, memory hierarchy
Pipelining 17
18 Pipelining: just slice the instruction life into equal stages and launch concurrent execution (time runs to the right):
I0: IF DC EX M CT
I1:    IF DC EX M CT
I2:       IF DC EX M CT
I3:          IF DC EX M CT
19 Principle. The execution of an instruction is naturally decomposed into successive logical phases. Instructions can be issued sequentially, without waiting for the completion of the previous one.
20 Some pipeline examples. MIPS R3000: a classic 5-stage pipeline. MIPS R4000: a deeper (8-stage) pipeline, a very deep pipeline for its time to achieve high frequency. Pentium 4: 20 stages minimum. Pentium 4 Extreme Edition: 31 stages minimum
21 Pipelining: the limits. Currently, 1 cycle = 12-15 gate delays: approximately the delay of a 64-bit addition. Coming soon? 6-8 gate delays. On the Pentium 4, the ALU is sequenced at double frequency: about a 16-bit add delay
22 Caution with long-latency operations. Integer: multiplication, 5-10 cycles; division, 20-50 cycles. Floating point: addition, 2-5 cycles; multiplication, 2-6 cycles; division, 10-50 cycles
23 Dealing with long instructions. Use a specific pipeline to execute floating-point operations, e.g. a 3-stage execution pipeline. Stay longer in a single stage: integer multiply and divide
24 The sequential-semantics issue on a pipeline. There exist situations where sequencing an instruction every cycle would not allow correct execution. Structural hazards: distinct instructions compete for a single hardware resource. Data hazards: J follows I, and instruction J accesses an operand that has not yet been accessed by instruction I. Control hazards: I is a branch, but its target and direction are not known until a few cycles into the pipeline
25 Enforcing the sequential semantics. Hardware management: first detect the hazard, then avoid its effect by delaying the instruction until the hazard is resolved. (Pipeline diagram: instructions I0-I2 proceed through IF DC EX M CT, with the dependent instruction held back until the hazard is resolved.)
26 Read After Write: memory load delay. An instruction that uses the result of a load must wait until the loaded value is available.
27 And how code reordering may help: a = b + c; d = e + f;
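As an illustration (a sketch only, reusing the load/add/store notation of the ILP example on slide 41; register names are arbitrary). In the naive order, each addition has to wait for the value loaded just before it:
Ld @ B, R1 ; Ld @ C, R2 ; (wait) R3 <- R1 + R2 ; St @ A, R3 ; Ld @ E, R4 ; Ld @ F, R5 ; (wait) R6 <- R4 + R5 ; St @ D, R6
After reordering, the loads for the second statement fill the waiting cycles of the first addition, and no cycle is lost:
Ld @ B, R1 ; Ld @ C, R2 ; Ld @ E, R4 ; Ld @ F, R5 ; R3 <- R1 + R2 ; St @ A, R3 ; R6 <- R4 + R5 ; St @ D, R6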
28 Control hazard. The current instruction is a branch (conditional or not): which instruction is next? The number of cycles lost on branches may be a major issue
29 Control hazards. 15-30% of instructions are branches. Targets and directions are known very late in the pipeline, not before: cycle 7 on the DEC 21264, cycle 11 on the Intel Pentium III, cycle 18 on the Intel Pentium 4. X instructions are issued per cycle: we just cannot afford to lose these cycles!
Branch prediction / next-instruction prediction 30 Predict the address of the next instruction (block) from the PC at fetch time; the prediction is corrected on a misprediction, once the branch resolves later in the pipeline (IF DE EX MEM WB)
Dynamic branch prediction: 31 just repeat the past. Keep a history of what happened in the past, and guess that the same behavior will occur next time: this essentially assumes that the behavior of the application tends to be repetitive. Implementation: hardware storage tables, read at the same time as the instruction cache. What must be predicted: Is there a branch? What is its type? The target of a PC-relative branch. The direction of a conditional branch. The target of an indirect branch. The target of a procedure return
32 Predicting the direction of a branch. It is more important to correctly predict the direction of a conditional branch than to correctly predict its target: the PC-relative target address is known/computed at decode time, while the effective direction is only computed at execution time
33 Predicting the same direction as last time (direction prediction). Example: for (i=0;i<1000;i++) { for (j=0;j<n;j++) { loop body } } The inner-loop branch outcome stream is 1 1 1 ... 1 0, repeated: 2 mispredictions per execution of the inner loop, on the first and the last iterations
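A minimal sketch of this last-outcome scheme (the table size and the simple PC indexing are illustrative assumptions, not any real processor's design):

#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 4096                 /* illustrative table size */
static bool last_outcome[ENTRIES];   /* one bit per entry: last direction seen */

/* Predict the same direction as the last time this (possibly aliased) branch was seen. */
bool last_time_predict(uint32_t pc) {
    return last_outcome[pc % ENTRIES];
}

/* Once the branch resolves, record its actual direction. */
void last_time_update(uint32_t pc, bool taken) {
    last_outcome[pc % ENTRIES] = taken;
}

On the nested loop above, the inner branch's bit still says "not taken" when the inner loop restarts, and still says "taken" on the exit iteration, hence the two mispredictions per execution of the inner loop.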
34 Exploiting more of the past: inter-correlations. B1: if (cond1 AND cond2); B2: if (cond1).
cond1 cond2 | cond1 AND cond2
  T     T   |       T
  T     N   |       N
  N     T   |       N
  N     N   |       N
Using information on B1 to predict B2: if cond1 AND cond2 was true (p=1/4), predict cond1 true: 100% correct; if cond1 AND cond2 was false (p=3/4), predict cond1 false: 66% correct
35 Exploiting the past: auto-correlation. Example: for (i=0; i<100; i++) for (j=0;j<4;j++) loop body. The inner-loop branch outcome stream is 1 1 1 0 1 1 1 0 1 1 1 0 ... When the last 3 iterations were taken, predict not taken; otherwise predict taken: 100% correct
36 General principle of branch prediction: prediction = F(PC, global history, local history). Information on the branch is used to read hardware tables that deliver the prediction
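As an illustration of F(PC, global history), here is a sketch of a gshare-like table of 2-bit saturating counters (the table size, the XOR hash and the simple history register are illustrative assumptions; this is not the predictor of any processor discussed in these slides):

#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 4096
static uint8_t counters[ENTRIES];   /* 2-bit saturating counters, values 0..3 */
static uint32_t ghist;              /* global history: recent branch outcomes, one bit each */

bool gshare_predict(uint32_t pc) {
    uint32_t idx = (pc ^ ghist) % ENTRIES;   /* F(PC, global history) selects an entry */
    return counters[idx] >= 2;               /* 2 or 3 means "predict taken" */
}

void gshare_update(uint32_t pc, bool taken) {
    uint32_t idx = (pc ^ ghist) % ENTRIES;
    if (taken && counters[idx] < 3) counters[idx]++;       /* strengthen toward taken */
    if (!taken && counters[idx] > 0) counters[idx]--;      /* strengthen toward not taken */
    ghist = (ghist << 1) | (taken ? 1u : 0u);               /* shift the outcome into the history */
}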
Alpha EV8 predictor (2Bc-gskew, derived from e-gskew) 37 352 Kbits, maximum history length 21/35; the EV8 was cancelled in 2001
Current state-of-the-art (Dec 2006) 38 TAGE: geometric history lengths, 256 Kbits, 3.314 misp/KI. (Diagram: a tagless base predictor backed by several tagged tables, indexed with hashes of the PC and increasingly long global histories h[0:l1], h[0:l2], h[0:l3]; each tagged entry holds a counter (ctr), a tag and a useful bit (u), and the tag matches (=?) select the final prediction.)
ILP: Instruction level parallelism 39
Executing instructions in parallel: superscalar and VLIW processors 40 Until 1991, achieving 1 instruction per cycle through pipelining was the goal. Pipelining reached its limits: multiplying stages does not lead to higher performance. Silicon area was available: parallelism is the natural way forward. ILP: executing several instructions per cycle. Different approaches depending on who is in charge of the control: the compiler/software: VLIW (Very Long Instruction Word); the hardware: superscalar
Instruction-level parallelism: what is ILP? 41 A = B+C; D = E+F; translates into 8 instructions:
1. Ld @ C, R1 (A)
2. Ld @ B, R2 (A)
3. R3 <- R1 + R2 (B)
4. St @ A, R3 (C)
5. Ld @ E, R4 (A)
6. Ld @ F, R5 (A)
7. R6 <- R4 + R5 (B)
8. St @ D, R6 (C)
(A), (B), (C): three groups of independent instructions; each group can be executed in parallel
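With enough functional units, a possible schedule (assuming, purely for illustration, that loads, additions and stores each occupy one issue cycle) is: cycle 1: instructions 1, 2, 5 and 6 (the four loads); cycle 2: instructions 3 and 7 (the two additions); cycle 3: instructions 4 and 8 (the two stores). The eight instructions complete in three cycles instead of eight.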
42 VLIW: Very Long Instruction Word. Each instruction explicitly controls the whole processor. The compiler/code scheduler is in charge of all functional units and manages all hazards: resource: decides whether two candidates compete for the same resource; data: ensures that data dependencies will be respected; control: ?!?
43 VLIW architecture. (Diagram: a control unit drives several functional units (FUs) connected to a register bank and a memory interface.) Example long instruction word: igtr r6 r5 -> r127, uimm 2 -> r126, iadd r0 r6 -> r36, ld32d (16) r4 -> r33, nop;
44 VLIW (Very Long Instruction Word). The control unit issues a single long instruction word per cycle. Each long instruction simultaneously launches several independent sub-instructions. The compiler guarantees that the sub-instructions are independent and that the instruction is independent of all in-flight instructions. There is no hardware to enforce dependencies
VLIW architecture: often used for embedded applications 45 Binary compatibility is a nightmare: it necessitates keeping the same pipeline structure. Very effective on regular codes with loops and very little control, but poor performance on general-purpose applications (too many branches). Cost-effective hardware implementation: no control logic, so less silicon area, reduced power consumption, reduced design and test delays
46 Superscalar processors. The hardware is in charge of the control: the semantics are sequential, and the hardware enforces these semantics. The hardware enforces dependencies. Binary compatibility with previous-generation processors: all general-purpose processors since 1993 are superscalar
47 Superscalar: what are the problems? Is there instruction parallelism? On general-purpose applications: 2-8 instructions per cycle; on some applications it could be 1000s. How to recognize parallelism? Enforcing data dependencies. Issuing in parallel: fetching instructions in parallel, decoding in parallel, reading operands in parallel, predicting branches very far ahead
48 In-order execution. (Pipeline diagram: several instructions per cycle proceed through IF DC EX M CT, strictly in program order.)
49 Out-of-order execution. To optimize resource usage: execute as soon as operands are valid. (Pipeline diagram: one instruction waits after DC and again before CT (IF DC wait EX M wait CT), while younger independent instructions proceed through IF DC EX M CT.)
50 Out-of-order execution. Instructions are executed out of order: if instruction A is blocked because its operands are not yet available but instruction B has its operands available, then B can be executed!! This generates a lot of hardware complexity!!
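A toy sketch of the "execute as soon as operands are valid" rule (the window size, the fields and the tiny example are invented for illustration; real issue logic is far more complex):

#include <stdbool.h>
#include <stdio.h>

#define WINDOW 8

struct entry {
    bool valid;        /* slot holds a not-yet-executed instruction */
    bool src1_ready;   /* first source operand available */
    bool src2_ready;   /* second source operand available */
    int  seq;          /* program-order number, kept only for printing */
};

static struct entry window[WINDOW];

/* One cycle of issue: any instruction whose operands are ready may execute,
   regardless of program order. */
static void issue_one_cycle(void) {
    for (int i = 0; i < WINDOW; i++) {
        struct entry *e = &window[i];
        if (e->valid && e->src1_ready && e->src2_ready) {
            printf("issuing instruction %d\n", e->seq);
            e->valid = false;   /* a real core would also wake up the dependent instructions */
        }
    }
}

int main(void) {
    /* instruction 0 (A) still waits for an operand; instruction 1 (B) is ready and goes first */
    window[0] = (struct entry){ true, true, false, 0 };
    window[1] = (struct entry){ true, true, true,  1 };
    issue_one_cycle();
    return 0;
}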
Speculative execution on OOO processors 51 10-15% of instructions are branches; on the Pentium 4, direction and target are known at cycle 31!! Predict and execute speculatively, validate at execution time. State-of-the-art predictors: 2-3 mispredictions per 1000 instructions. Also predicted: memory (in)dependencies, (to a limited extent) data values
Out-of-order execution: 52 just be able to «undo»: branch mispredictions, memory dependency mispredictions, interruptions, exceptions. Validate (commit) instructions in order; do not do anything definitive out of order
The memory hierarchy 53
54 Memory components. Most transistors in a computer system are memory transistors. Main memory: usually DRAM; 1 Gbyte is standard in PCs (2005); long access time: 150 ns = 500 cycles = 2000 instructions. On-chip single-ported memory: caches, predictors, ... On-chip multiported memory: register files, L1 caches, ...
55 Memory hierarchy. Memory is either huge but slow, or small but fast: the smaller, the faster. Memory hierarchy goal: provide the illusion that the whole memory is fast. Principle: exploit the temporal and spatial locality properties of most applications
56 Locality properties. On most applications, the following properties apply. Temporal locality: a data/instruction word that has just been accessed is likely to be reaccessed in the near future. Spatial locality: the data/instruction words located close (in the address space) to a data/instruction word that has just been accessed are likely to be accessed in the near future.
57 A few examples of locality. Temporal locality: loop indices, loop invariants, ...; instructions: loops, ...; the 90%/10% rule of thumb: a program spends 90% of its execution time in 10% of the static code (often much more in much less). Spatial locality: arrays of data, data structures; instructions: the next instruction after a non-branch instruction is always executed
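A classic illustration of both properties (the array size is arbitrary): summing a C array row by row touches consecutive addresses (spatial locality) and reuses the loop variables on every iteration (temporal locality); summing it column by column defeats the spatial locality and typically misses far more often.

#include <stdio.h>

#define N 1024
static double a[N][N];

/* Row-major traversal: consecutive addresses, good spatial locality. */
double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: a stride of N doubles between accesses, poor spatial locality. */
double sum_column_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_column_major());
    return 0;
}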
58 Cache memory. A cache is a small memory whose content is an image of a subset of main memory. A memory reference is 1) presented to the cache; 2) on a miss, presented to the next level of the memory hierarchy (2nd-level cache or main memory)
59 Cache. The tag identifies the memory block; the cache line holds a copy of the block in the cache memory. On Load &A: if the address of A's block sits in the tag array, then the block is present in the cache
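A sketch of this tag check for a direct-mapped cache (the 64-byte line and 512 sets are arbitrary illustrative parameters):

#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE 64          /* bytes per cache line */
#define NUM_SETS  512         /* number of lines in this direct-mapped cache */

struct cache_line {
    bool     valid;
    uint64_t tag;              /* identifies which memory block is held here */
    uint8_t  data[LINE_SIZE];  /* copy of that memory block */
};

static struct cache_line cache[NUM_SETS];

/* Returns true on a hit; on a miss the request would go to the next level of the hierarchy. */
bool cache_lookup(uint64_t address) {
    uint64_t block = address / LINE_SIZE;   /* drop the offset within the line */
    uint64_t set   = block % NUM_SETS;      /* index bits select one line */
    uint64_t tag   = block / NUM_SETS;      /* the remaining bits form the tag */
    return cache[set].valid && cache[set].tag == tag;
}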
Memory hierarchy behavior 60 may dictate performance. Example: 4 instructions/cycle, 1 data memory access per cycle, 10-cycle penalty for accessing the 2nd-level cache, 300 cycles round trip to memory, 2% miss rate on instructions, 4% miss rate on data, 1 L1 miss out of 4 also missing in the L2. To execute 400 instructions: 1320 cycles!!
61 Block size. Long blocks: exploit spatial locality, but load useless words when spatial locality is poor. Short blocks: misses on contiguous blocks. Experimentally: 16-64 bytes for small L1 caches (8-32 Kbytes), 64-128 bytes for large caches (256 Kbytes-4 Mbytes)
62 Cache hierarchy. A cache hierarchy has become the standard. L1: small (<= 64 Kbytes), short access time (1-3 cycles), separate instruction and data caches. L2: longer access time (7-15 cycles), 512 Kbytes-2 Mbytes, unified. Coming, L3: 2-8 Mbytes (20-30 cycles), unified, shared on a multiprocessor
Cache misses do not stop a processor (completely) 63 On an L1 cache miss, the request is sent to the L2 cache, but sequencing and execution continue. On an L2 hit, the latency is simply a few cycles. On an L2 miss, the latency is hundreds of cycles: execution stops after a while. Out-of-order execution allows several L2 cache misses to be initiated (and serviced in a pipelined mode) at the same time: the latency is partially hidden
64 Prefetching. To avoid misses, one can try to anticipate them and load the (future) missing blocks into the cache in advance. Many techniques: sequential prefetching: prefetch the next sequential blocks; stride prefetching: recognize a stride pattern and prefetch the blocks in that pattern. Hardware and software methods are available. Many complex issues: latency, pollution, ...
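On the software side, a small example using GCC/Clang's __builtin_prefetch to request a block a few iterations ahead (the prefetch distance of 16 is an arbitrary illustrative choice and would need tuning to the actual miss latency):

/* Sum an array while prefetching ahead of the current position (GCC/Clang). */
double sum_with_prefetch(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16]);   /* hint: bring this block into the cache early */
        s += a[i];
    }
    return s;
}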
The execution time of a short instruction sequence is a complex function! 65 (Diagram: each instruction's fate depends on the branch predictor (correct/mispredict), the ITLB and I-cache (hit/miss), the execution core, the DTLB and D-cache (hit/miss), and the L2 cache (hit/miss).)
66 Code generation issues. First, avoid data misses (300-cycle miss penalties): data layout, loop reorganization, loop blocking. Instruction generation: minimize the instruction count (e.g. common subexpression elimination), schedule instructions to expose ILP, avoid hard-to-predict branches
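A sketch of loop blocking (tiling) on a matrix transpose: the loops are reorganized so that a BxB tile of each array stays in the cache while it is being used (N and B are illustrative, with N a multiple of B):

#define N 1024
#define B 32    /* tile size, chosen so that two BxB tiles fit comfortably in the L1 cache */

/* Blocked transpose: both arrays are touched tile by tile, which improves locality. */
void transpose_blocked(double t[N][N], double a[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    t[j][i] = a[i][j];
}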
On-chip thread-level parallelism 67
68 One billion transistors now!! The ultimate 16-32-way superscalar uniprocessor seems unreachable: just not enough ILP; more than quadratic complexity in a few key (power-hungry) components (register file, bypass network, issue logic); to avoid temperature hot spots, very long intra-CPU communications would be needed. On-chip thread parallelism appears as the only viable solution: shared-memory multiprocessor, i.e. chip multiprocessor; simultaneous multithreading; heterogeneous multiprocessing; vector processing
69 The Chip Multiprocessor. Put a shared-memory multiprocessor on a single die: duplicate the processor, its L1 caches, maybe its L2; keep the caches coherent; share the last level of the memory hierarchy (maybe); share the external interface (to memory and system)
General-purpose Chip MultiProcessor (CMP): why it did not (really) appear before 2003 70 Until 2003 there was a better (economic) use for the transistors: single-process performance is what matters most; more complex superscalar implementations; more cache space: bring the L2 cache on-chip, enlarge the L2 cache, include an L3 cache (now). Diminishing returns!! Now: CMP is the only option!!
Simultaneous Multithreading (SMT): parallel processing on a single processor 71 Functional units are underused on superscalar processors. SMT: share the functional units of a superscalar processor between several processes. Advantages: a single process can use all the resources; dynamic sharing of all structures on parallel/multiprocess workloads
72 (Diagram: issue slots over time (processor cycles) on a superscalar processor vs. an SMT processor; unutilized slots vs. slots filled by threads 1-5.)
The programmer's view of a CMP/SMT! 73
74 Why is CMP/SMT the new frontier? (Most) applications were sequential; hardware WILL be parallel: tens or hundreds of SMT cores in your PC or PDA 10 years from now (maybe). Option 1: applications will have to be adapted to parallelism. Option 2: parallel hardware will have to run sequential applications efficiently. Option 3: invent new tradeoffs. There is no current standard
Embedded processing and on-chip parallelism (1) 75 ILP has been exploited for many years: DSPs do a multiply-add, 1 or 2 loads and the loop control in a single (long) cycle. Caches were implemented on embedded processors more than 10 years ago. VLIW on embedded processors was introduced in 1997-98. In-order superscalar processors were developed for the embedded market 15 years ago (Intel i960)
Embedded processing and on-chip parallelism (2): thread parallelism 76 Heterogeneous multicores are the trend. IBM Cell processor (2005): one PowerPC (master) + 8 special-purpose processors (slaves). Philips Nexperia: a RISC MIPS microprocessor + a VLIW Trimedia processor. ST Nomadik: an ARM + x VLIWs
What about a vector microprocessor for scientific computing? 77 Vector parallelism is well understood! But it is a niche segment. Caches are not vector friendly. Not so small a niche: 2 B$! Never heard about L2 caches, prefetching, blocking? GPGPU!! GPU boards have become vector processing units
78 Structure of future multicores? (Diagram: many μP + private cache ($) pairs sharing an L3 cache.)
79 Hierarchical organization? (Diagram: pairs of μP + L1 cache ($) share an L2 cache; several such L2 clusters share an L3 cache.)
80 An example of sharing. (Diagram: two cores, each with an IL1 cache and instruction fetch, an integer μP, an FP μP and a DL1 cache, share a hardware accelerator and the L2 cache.)
81 A possible basic brick. (Diagram: four cores, each with an I$, an integer μP, an FP μP and a D$, sharing an L2 cache.)
82 (Diagram: four such 4-core bricks, each with its own L2 cache, around a shared L3 cache, with memory, network and system interfaces.)
Only limited available thread parallelism? 83 Focus on uniprocessor architecture: find the correct tradeoff between complexity and performance; power and temperature issues. Vector extensions? Contiguous vectors (à la SSE)? Strided vectors in the L2 caches (Tarantula-like)?
84 Another possible basic brick. (Diagram: an ultimate out-of-order superscalar core together with simpler cores, each with I$, integer μP, FP μP and D$, sharing an L2 cache.)
85 (Diagram: four such bricks, each built around an ultimate out-of-order core with its I$, D$ and L2 cache, around a shared L3 cache, with memory, network and system interfaces.)
Some undeveloped issues: the power consumption issue 86 Power consumption needs to be limited: laptop: battery life; desktop: above 75 W, heat is hard to extract at low cost; embedded: battery life, environment. Revisit the old performance concept: maximum performance within a fixed power budget; power-aware architecture design; the need to limit frequency
Some undeveloped issues: the performance predictability issue 87 Modern microprocessors have unpredictable/unstable performance. The average user wants stable performance and cannot tolerate performance variations by orders of magnitude when varying simple parameters. Real-time systems want to guarantee response times.
Some undeveloped issues: the temperature issue 88 Temperature is not uniform on the chip (hotspots). Raising the temperature above a threshold has devastating effects: faulty behavior, transient or permanent; component aging. Solutions: gating (stopping!!), clock scaling, or task migration