High Performance Processor Architecture. André Seznec IRISA/INRIA ALF project-team
1 High Performance Processor Architecture André Seznec IRISA/INRIA ALF project-team 1
2 2 Moore's «Law»: the number of transistors on a microprocessor chip doubles every 18 months. 1972: 2,000 transistors (Intel 4004); 1979: 29,000 transistors (Intel 8086); 1989: 1 M transistors (Intel 80486); 1999: 130 M transistors (HP PA-8500); 2005: 1.7 billion transistors (Intel Itanium Montecito). Processor performance doubles every 18 months. 1989: Intel 80486, 25 MHz (< 1 inst/cycle); 1993: Intel Pentium, 66 MHz x 2 inst/cycle; 1995: Intel PentiumPro, 150 MHz x 3 inst/cycle; 06/2000: Intel Pentium III, 1 GHz x 3 inst/cycle; 09/2002: Intel Pentium 4, 2.8 GHz x 3 inst/cycle; 09/2005: Intel Pentium 4, dual core, 3.2 GHz x 3 inst/cycle x 2 processors
3 3 Not just the IC technology: VLSI brings the transistors and the frequency; microarchitecture and code generation/optimization bring the effective performance.
4 4 The hardware/software interface: software — high-level language → compiler/code generation → Instruction Set Architecture (ISA) → hardware — micro-architecture → transistor.
5 5 Instruction Set Architecture (ISA) — the hardware/software interface: the compiler translates programs into instructions; the hardware executes instructions. Examples: Intel x86 (1979): still your PC's ISA; MIPS, SPARC (mid-80s); Alpha, PowerPC ('90s). ISAs evolve by successive add-ons: 16 bits to 32 bits, new multimedia instructions, etc. Introduction of a new ISA requires good reasons: new application domains, new constraints, no legacy code.
6 6 Microarchitecture: a macroscopic vision of the hardware organization — neither at the transistor level nor at the gate level, but understanding the processor organization at the functional-unit level.
7 7 What is microarchitecture about? Memory access time is 100 ns. Program semantics are sequential. But modern processors can execute 4 instructions every 0.25 ns. How can we achieve that?
8 High-performance processors everywhere. General-purpose processors (i.e. no special target application domain): servers, desktops, laptops, PDAs. Embedded processors: set-top boxes, cell phones, automotive, ... — special-purpose processors or processors derived from a general-purpose design.
9 9 Performance needs. Performance: reduce the response time. Scientific applications: treat larger problems. Databases. Signal processing. Multimedia. Historically, over the last 50 years, improving performance for today's applications has fostered new, even more demanding applications.
10 10 How to improve performance? Use a better algorithm (language, compiler level); optimise the code; improve the instruction set (ISA); a more efficient micro-architecture; new (transistor) technology.
11 11 How transistors are used: evolution. In the '70s: enriching the ISA — increasing functionality to decrease the instruction count. In the '80s: caches and registers — decreasing external accesses; ISAs from 8 to 16 to 32 bits. In the '90s: instruction parallelism — more instructions, lots of control, lots of speculation, more caches. In the 2000s: more and more thread parallelism, core parallelism.
12 12 A few technological facts (2005). Frequency: 2-4 GHz. An ALU operation: 1 cycle. A floating-point operation: 3 cycles. Read/write of a register file: 2-3 cycles — often a critical path... Read/write of the L1 cache: 1-3 cycles, depending on many implementation choices.
13 A few technological parameters (2005). Integration technology: 90 nm → 65 nm. Tens of millions of transistors of logic; caches/predictors: up to one billion transistors. Power: 75 watts is a limit for cooling at reasonable hardware cost; 20 watts is a limit for reasonable laptop power consumption. Pins: 939 pins on the dual-core Athlon.
14 14 The architect's challenge: 400 mm² of silicon, 2-3 technology generations ahead. What will you use for performance? Pipelining, instruction-level parallelism, speculative execution, memory hierarchy, thread parallelism.
15 Up to now, what was microarchitecture about? Memory access time is 100 ns. Program semantics are sequential. An instruction's life (fetch, decode, ..., execute, ..., memory access, ...) spans a few ns. How can we use the transistors to achieve the highest performance possible? So far: up to 4 instructions every 0.3 ns.
16 The architect tool box for uniprocessor performance 16 Pipelining Instruction Level Parallelism Speculative execution Memory hierarchy
17 Pipelining 17
18 18 Pipelining Just slice the instruction life in equal stages and launch concurrent execution: time I0 I1 I2 IF DC EX M CT IF DC EX M CT IF DC EX M CT IF DC EX M CT
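The slicing idea above can be checked with a toy timing model. This is a sketch under the usual idealized assumptions (no hazards, one instruction entering the pipeline per cycle); the function names are mine:

```python
# Toy timing model for an ideal k-stage pipeline (assumes no hazards
# and one new instruction entering the pipeline per cycle).

def unpipelined_cycles(n_instructions, k_stages):
    # Without pipelining, each instruction occupies the datapath for k cycles.
    return n_instructions * k_stages

def pipelined_cycles(n_instructions, k_stages):
    # k cycles to fill the pipeline, then one instruction completes per cycle.
    return k_stages + (n_instructions - 1)

n, k = 1000, 5  # e.g. a classic 5-stage IF/DC/EX/M/CT pipeline
print(unpipelined_cycles(n, k))  # 5000
print(pipelined_cycles(n, k))    # 1004: speedup approaches k for large n
```

For long instruction streams the speedup tends to the number of stages, which is why deeper pipelines looked attractive until the hazards discussed below got in the way.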
19 19 Principle: the execution of an instruction is naturally decomposed into successive logical phases. Instructions can be issued sequentially, without waiting for the completion of the previous one.
20 20 Some pipeline examples. MIPS R3000: 5 stages. MIPS R4000: 8 stages — a very deep pipeline to achieve high frequency. Pentium 4: 20 stages minimum. Pentium 4 Extreme Edition: 31 stages minimum.
21 21 Pipelining: the limits. Currently: 1 cycle = a handful of gate delays — approximately a 64-bit addition delay. Coming soon?: 6-8 gate delays. On the Pentium 4, the ALU is sequenced at double frequency: one fast cycle is about a 16-bit add delay.
22 22 Beware of long-latency operations. Integer: multiplication: 5-10 cycles; division: tens of cycles. Floating point: addition: 2-5 cycles; multiplication: 2-6 cycles; division: tens of cycles.
23 23 Dealing with long instructions. Use a specific pipeline to execute floating-point operations, e.g. a 3-stage execution pipeline. Stay longer in a single stage: integer multiply and divide.
24 24 The sequential-semantics issue on a pipeline. There exist situations where issuing an instruction every cycle would not allow correct execution. Structural hazards: distinct instructions compete for a single hardware resource. Data hazards: J follows I, and instruction J accesses an operand that has not yet been accessed by instruction I. Control hazards: I is a branch, but its target and direction are not known until a few cycles later in the pipeline.
25 25 Enforcing the sequential semantics. Hardware management: first detect the hazard, then avoid its effect by delaying the instruction until the hazard is resolved. time I0 I1 I2 IF DC EX M CT IF DC EX M CT IF DC EX M CT IF DC EX M CT
26 26 Read After Write (RAW) hazard: e.g. the memory load delay — the loaded value is not available to the immediately following instruction.
27 27 And how code reordering may help: a = b + c; d = e + f
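A small model of why reordering the slide's two statements helps. The one-cycle load-use delay is my assumption (the slides do not fix a number): an add stalls if it consumes a value loaded on the immediately preceding cycle.

```python
# Count load-use stalls in a schedule, assuming a 1-cycle load-use
# delay: an "add" stalls one cycle when a source register was loaded
# on the immediately preceding cycle.

def count_stalls(schedule):
    stalls = 0
    loaded_last_cycle = set()
    for op, dst, srcs in schedule:
        if op == "add" and loaded_last_cycle & set(srcs):
            stalls += 1
        loaded_last_cycle = {dst} if op == "load" else set()
    return stalls

# Naive schedule for "a = b + c; d = e + f": each add follows its loads.
naive = [("load", "r1", []), ("load", "r2", []),
         ("add", "r3", ["r1", "r2"]),
         ("load", "r4", []), ("load", "r5", []),
         ("add", "r6", ["r4", "r5"])]

# Reordered: all loads first, so every loaded value is ready when used.
reordered = [("load", "r1", []), ("load", "r2", []),
             ("load", "r4", []), ("load", "r5", []),
             ("add", "r3", ["r1", "r2"]),
             ("add", "r6", ["r4", "r5"])]

print(count_stalls(naive), count_stalls(reordered))  # 2 0
```

The compiler performs exactly this kind of scheduling: moving independent loads into the delay slots of earlier loads.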
28 28 Control hazard. The current instruction is a branch (conditional or not): which instruction is next? The number of cycles lost on branches may be a major issue.
29 29 Control hazards. 15-30% of instructions are branches. Targets and directions are known very late in the pipeline — not before: cycle 7 on the DEC Alpha, cycle 11 on the Intel Pentium III, cycle 18 on the Intel Pentium 4. Several instructions are issued per cycle: we just cannot afford to lose these cycles!
30 Branch prediction / next-instruction prediction. [Diagram: the PC is predicted ahead of the IF DE EX MEM WB pipeline, with a correction on misprediction.] Prediction of the address of the next instruction (block).
31 Dynamic branch prediction: just repeat the past. Keep a history of what happened in the past, and guess that the same behavior will occur next time: this essentially assumes that the behavior of the application tends to be repetitive. Implementation: hardware storage tables read at the same time as the instruction cache. What must be predicted: Is there a branch? Which is its type? The target of a PC-relative branch. The direction of a conditional branch. The target of an indirect branch. The target of a procedure return.
32 32 Predicting the direction of a branch. It is more important to correctly predict the direction than the target of a conditional branch: the PC-relative target address is computed at decode time, but the effective direction is only computed at execution time.
33 33 Predicting the same direction as last time. for (i=0;i<1000;i++) { for (j=0;j<n;j++) { loop body } } → 2 mispredictions per inner loop, on the first and the last iterations.
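The two mispredictions per inner loop can be reproduced with a tiny simulation of a "same as last time" (1-bit) predictor; the initial prediction value is my assumption:

```python
# "Same direction as last time" (1-bit) prediction on the slide's
# nested loop: the inner-loop branch is taken n times, then not-taken
# once, for each outer iteration.

def last_time_mispredicts(n_inner, n_outer, init_taken=True):
    prediction = init_taken
    mispredicts = 0
    for _ in range(n_outer):
        for taken in [True] * n_inner + [False]:  # taken...taken, not-taken
            if prediction != taken:
                mispredicts += 1
            prediction = taken                    # remember the last outcome
    return mispredicts

# ~2 mispredictions per inner-loop instance (first and last iterations):
print(last_time_mispredicts(10, 1000))  # 1999
```

The very first instance only mispredicts once here because the predictor starts out predicting taken; every later instance pays for both the loop entry and the loop exit.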
34 34 Exploiting more of the past: inter-correlations. B1: if (cond1 and cond2) ... B2: if (cond1). The possible (cond1, cond2) outcomes are TT, TN, NT, NN, so cond1 AND cond2 is T, N, N, N. Using information on B1 to predict B2: if cond1 AND cond2 is true (p = 1/4), predict cond1 true — 100% correct; if cond1 AND cond2 is false (p = 3/4), predict cond1 false — 66% correct (cond1 is false in 2 of the 3 remaining cases).
35 35 Exploiting the past: auto-correlation. for (i=0; i<100; i++) for (j=0;j<4;j++) loop body. When the last 3 iterations are taken, predict not-taken; otherwise predict taken: 100% correct.
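The 100%-correct claim is easy to verify by simulating the slide's rule on the T,T,T,N outcome pattern of the 4-iteration inner loop (the initial history contents are my assumption):

```python
# The slide's auto-correlation rule: when the last 3 outcomes were
# all taken, predict not-taken; otherwise predict taken. The inner
# loop runs 4 iterations, so the outcome pattern is T,T,T,N.

from collections import deque

def pattern_mispredicts(n_repeats):
    history = deque([False, False, False], maxlen=3)  # assumed initial history
    mispredicts = 0
    for _ in range(n_repeats):
        for taken in (True, True, True, False):
            prediction = not all(history)  # all-taken history -> predict not-taken
            if prediction != taken:
                mispredicts += 1
            history.append(taken)
    return mispredicts

print(pattern_mispredicts(100))  # 0: 100% correct on this pattern
```

This is the idea behind local-history (per-branch) predictors: the recent outcomes of the same branch index the prediction.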
36 36 General principle of branch prediction: information on the branch → read tables → prediction = F(PC, global history, local history).
37 Alpha EV8 predictor (derived from e-gskew / 2Bc-gskew): 352 Kbits, history lengths up to 21/35. The EV8 was cancelled in 2001.
38 Current state of the art (Dec 2006): TAGE, 256 Kbits, geometric history lengths. [Diagram: a tagless base predictor plus several tagged tables indexed by hash(PC, h[0:l1]), hash(PC, h[0:l2]), hash(PC, h[0:l3]); each tagged entry holds a counter (ctr), a tag, and a useful bit (u); tag matches (=?) select the prediction.]
39 ILP: Instruction level parallelism 39
40 Executing instructions in parallel: superscalar and VLIW processors. Until 1991, achieving 1 instruction per cycle through pipelining was the goal. Pipelining reached its limits: multiplying stages does not lead to higher performance. Silicon area was available: parallelism is the natural way forward. ILP: executing several instructions per cycle. Different approaches depending on who is in charge of the control: the compiler/software — VLIW (Very Long Instruction Word); the hardware — superscalar.
41 Instruction-level parallelism: what is ILP? A = B + C; D = E + F; → 8 instructions: 1. load B → R1 (A); 2. load C → R2 (A); 3. R3 ← R1 + R2 (B); 4. store R3 → A (C); 5. load E → R4 (A); 6. load F → R5 (A); 7. R6 ← R4 + R5 (B); 8. store R6 → D (C). (A), (B), (C): three groups of independent instructions. Each group can be executed in parallel.
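The (A)/(B)/(C) grouping falls out of a simple dependence analysis. A sketch (instruction names and the greedy "earliest group" placement are mine):

```python
# Greedy dependence-based grouping of the slide's 8 instructions: an
# instruction goes into the earliest group in which all its source
# registers have already been produced.

instrs = [  # (name, destination register, source registers)
    ("ld_B", "r1", []), ("ld_C", "r2", []),
    ("add1", "r3", ["r1", "r2"]), ("st_A", None, ["r3"]),
    ("ld_E", "r4", []), ("ld_F", "r5", []),
    ("add2", "r6", ["r4", "r5"]), ("st_D", None, ["r6"]),
]

produced_in = {}  # register -> group in which it is produced
groups = {}
for name, dst, srcs in instrs:
    group = max((produced_in[s] + 1 for s in srcs), default=0)
    groups.setdefault(group, []).append(name)
    if dst:
        produced_in[dst] = group

print(groups)
# {0: ['ld_B', 'ld_C', 'ld_E', 'ld_F'], 1: ['add1', 'add2'], 2: ['st_A', 'st_D']}
```

Group 0 is the four loads (A), group 1 the two adds (B), group 2 the two stores (C) — the groups the slide identifies as executable in parallel.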
42 42 VLIW: Very Long Instruction Word. Each instruction explicitly controls the whole processor. The compiler/code scheduler is in charge of all functional units and manages all hazards: resource — decides whether two candidates compete for a resource; data — ensures that data dependencies will be respected; control — ?!?
43 43 VLIW architecture. [Diagram: a control unit driving a register bank, several functional units (FU), and the memory interface.] Example long instruction word: igtr r6 r5 -> r127, uimm 2 -> r126, iadd r0 r6 -> r36, ld32d (16) r4 -> r33, nop;
44 44 VLIW (Very Long Instruction Word). The control unit issues a single long instruction word per cycle. Each long instruction launches simultaneously several independent instructions. The compiler guarantees that the sub-instructions are independent, and that the instruction is independent of all in-flight instructions. There is no hardware to enforce dependencies.
45 VLIW architecture: often used for embedded applications. Binary compatibility is a nightmare: it necessitates keeping the same pipeline structure. Very effective on regular codes with loops and very little control, but poor performance on general-purpose applications (too many branches). Cost-effective hardware implementation: no control logic — less silicon area, reduced power consumption, reduced design and test delays.
46 46 Superscalar processors. The hardware is in charge of the control: the semantics are sequential, and the hardware enforces these semantics. The hardware enforces dependencies. Binary compatibility with previous-generation processors: all general-purpose processors since 1993 have been superscalar.
47 47 Superscalar: what are the problems? Is there instruction parallelism? On general-purpose applications: 2-8 instructions per cycle; on some applications it could be 1000s. How to recognize parallelism? Enforcing data dependencies. Issuing in parallel: fetching instructions in parallel, decoding in parallel, reading operands in parallel, predicting branches very far ahead.
48 48 In-order execution IF DC EX M CT IF DC EX M CT IF DC EX M CT IF DC EX M CT IF DC EX M CT IF DC EX M CT IF DC EX M CT IF DC EX M CT
49 49 out-of-order execution To optimize resource usage: Executes as soon as operands are valid IF DC wait EX M wait CT IF DC EX M CT IF DC EX M CT IF DC EX M CT IF DC EX M CT IF DC EX M CT
50 50 Out-of-order execution. Instructions are executed out of order: if instruction A is blocked by the absence of its operands but instruction B has its operands available, then B can be executed!! This generates a lot of hardware complexity!!
51 Speculative execution on OOO processors. 15-30% of instructions are branches; on the Pentium 4, direction and target are known at cycle 31!! Predict, execute speculatively, and validate at execution time. State-of-the-art predictors: 2-3 mispredictions per 1000 instructions. Also predicted: memory (in)dependencies, and (to a limited extent) data values.
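A back-of-envelope calculation shows why the 2-3 mispredictions per 1000 instructions matter so much on a deep pipeline. The CPI model and the "weak predictor" rate are my illustrative assumptions; the 31-cycle resolution and 3-wide issue come from the slides:

```python
# Misprediction cost model: base CPI of 1/3 (3 instructions/cycle)
# plus a pipeline-flush penalty for every mispredicted branch.

def cpi(base_cpi, mispredicts_per_inst, penalty_cycles):
    return base_cpi + mispredicts_per_inst * penalty_cycles

good = cpi(1 / 3, 0.002, 31)  # state of the art: 2 mispredicts / 1000 inst
poor = cpi(1 / 3, 0.05, 31)   # a weak predictor: 50 mispredicts / 1000 inst

print(round(good, 3), round(poor, 3))  # 0.395 1.883
```

Even at 2 per 1000, mispredictions add roughly 20% to the CPI; at 50 per 1000 the machine is almost 5x slower, which is why predictor accuracy drives deep-pipeline performance.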
52 Out-of-order execution: just be able to «undo» — branch mispredictions, memory-dependency mispredictions, interruptions, exceptions. Validate (commit) instructions in order: nothing is done irrevocably out of order.
53 The memory hierarchy 53
54 54 Memory components. Most transistors in a computer system are memory transistors. Main memory: usually DRAM; 1 Gbyte is standard in PCs (2005); long access time — 150 ns ≈ 500 cycles ≈ 2000 instructions. On-chip single-ported memory: caches, predictors, ... On-chip multi-ported memory: register files, L1 cache, ...
55 55 Memory hierarchy. Memory is either huge but slow, or small but fast: the smaller, the faster. Memory hierarchy goal: provide the illusion that the whole memory is fast. Principle: exploit the temporal and spatial locality properties of most applications.
56 56 Locality properties. On most applications, the following properties apply. Temporal locality: a data/instruction word that has just been accessed is likely to be re-accessed in the near future. Spatial locality: the data/instruction words located close (in the address space) to a word that has just been accessed are likely to be accessed in the near future.
57 57 A few examples of locality. Temporal locality: loop indices, loop invariants, ...; instructions: loops, ... The 90%/10% rule of thumb: a program spends 90% of its execution time on 10% of the static code (often much more on much less). Spatial locality: arrays of data, data structures; instructions: the next instruction after a non-branch instruction is always executed.
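Spatial locality in arrays can be made concrete with a toy block model. The parameters are my illustrative assumptions: 64-byte blocks, 8-byte elements, a row-major N x N array, and a "cache" that only remembers the last block touched:

```python
# Spatial locality in array traversal: 8 consecutive elements share a
# 64-byte block, so traversal order decides how often a new block is
# touched.

BLOCK, ELEM, N = 64, 8, 256

def traversal_misses(order):
    last_block = None
    n_misses = 0
    for i, j in order:
        block = ((i * N + j) * ELEM) // BLOCK
        if block != last_block:
            n_misses += 1      # crossing into a new block
        last_block = block
    return n_misses

row_major = [(i, j) for i in range(N) for j in range(N)]
col_major = [(i, j) for j in range(N) for i in range(N)]
print(traversal_misses(row_major))  # 8192: one new block per 8 elements
print(traversal_misses(col_major))  # 65536: one new block per element
```

Row-major traversal of a row-major array enjoys full spatial locality; column-major traversal touches a new block on every access, which is what the loop-reorganization advice later in the deck is about.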
58 58 Cache memory. A cache is a small memory whose content is an image of a subset of the main memory. A reference to memory is (1) presented to the cache; (2) on a miss, the request is presented to the next level in the memory hierarchy (second-level cache or main memory).
59 59 Cache organization. [Diagram: tag array and cache-line memory; a load of &A checks the tag array.] The tag identifies the memory block: if the address of the block is found in the tag array, then the block is present in the cache.
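A minimal sketch of the tag check the slide depicts, for a direct-mapped cache. The address split into offset / index / tag follows the slide's idea; the sizes are my illustrative choices (64-byte blocks, 256 sets, i.e. a 16 KB cache):

```python
# Direct-mapped cache lookup: split the address into offset, index,
# and tag; a hit requires the tag stored at that index to match.

OFFSET_BITS = 6   # 64-byte blocks
INDEX_BITS = 8    # 256 sets

tags = [None] * (1 << INDEX_BITS)

def access(addr):
    """True on a hit; on a miss, fill the line (the request would go
    to the next level of the memory hierarchy)."""
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    if tags[index] == tag:
        return True
    tags[index] = tag
    return False

print(access(0x1234))  # False: cold miss
print(access(0x1234))  # True: now cached
print(access(0x1200))  # True: same 64-byte block as 0x1234
```

The third access hits without ever having been referenced before: that is spatial locality being captured by the block.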
60 Memory hierarchy behavior may dictate performance. Example: 4 instructions/cycle, 1 data memory access per cycle, a 10-cycle penalty for accessing the second-level cache, a 300-cycle round trip to memory; 2% misses on instructions, 4% misses on data, 1 reference out of 4 missing in the L2. To execute 400 instructions: 1320 cycles!!
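The slide's cycle count can be approximated with a small model. This is only one plausible reading of the parameters — I assume 100 data accesses (one per base cycle); the slide quotes 1320 cycles, so it presumably counts data references somewhat differently, but the order of magnitude comes out the same:

```python
# Rough cycle model for the 400-instruction example (assumptions
# noted above: 100 data accesses, misses charged additively).
insts = 400
base_cycles = insts / 4          # 4 instructions per cycle
i_misses = 0.02 * insts          # 2% instruction miss rate
d_misses = 0.04 * base_cycles    # 4% of ~100 data accesses
l1_misses = i_misses + d_misses
l2_misses = l1_misses / 4        # 1 reference out of 4 misses in L2
total = base_cycles + 10 * l1_misses + 300 * l2_misses
print(total)  # 1120.0 under these assumptions
```

Either way, the memory hierarchy turns an ideal 100 cycles into well over a thousand — more than 10x — which is the slide's point.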
61 61 Block size. Long blocks exploit spatial locality, but load useless words when spatial locality is poor. Short blocks cause misses on contiguous blocks. Experimentally: 32-64 bytes for small L1 caches (8-32 Kbytes); 64-128 bytes for large caches (256 Kbytes-4 Mbytes).
62 62 Cache hierarchy. A cache hierarchy has become standard. L1: small (<= 64 Kbytes), short access time (1-3 cycles), separate instruction and data caches. L2: longer access time (7-15 cycles), 512 Kbytes-2 Mbytes, unified. Coming: L3, 2-8 Mbytes (20-30 cycles), unified, shared on a multiprocessor.
63 Cache misses do not (completely) stop a processor. On an L1 cache miss, the request is sent to the L2 cache, but sequencing and execution continue. On an L2 hit, the latency is only a few cycles. On an L2 miss, the latency is hundreds of cycles: execution stops after a while. Out-of-order execution allows the processor to initiate several L2 cache misses at the same time (serviced in a pipelined mode): latency is partially hidden.
64 64 Prefetching. To avoid misses, one can try to anticipate them and load the (future) missing blocks into the cache in advance. Many techniques: sequential prefetching — prefetch the next sequential blocks; stride prefetching — recognize a stride pattern and prefetch the blocks in that pattern. Both hardware and software methods are available. Many complex issues: latency, pollution, ...
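Sequential prefetching can be sketched in a few lines. This is a toy degree-1 next-line prefetcher with an unbounded cache (my simplifications); it shows the best case, a fully sequential stream:

```python
# Next-line (sequential, degree-1) prefetcher: on a miss to block b,
# also bring in block b+1. On a fully sequential stream this removes
# every other miss.

def miss_count(block_stream, prefetch):
    cached = set()
    n_misses = 0
    for b in block_stream:
        if b not in cached:
            n_misses += 1
            cached.add(b)
            if prefetch:
                cached.add(b + 1)  # sequential prefetch of the next block
    return n_misses

stream = list(range(1000))  # a purely sequential block stream
print(miss_count(stream, prefetch=False))  # 1000
print(miss_count(stream, prefetch=True))   # 500
```

Real prefetchers must also prefetch far enough ahead to hide the latency and avoid polluting the cache on irregular streams — the "complex issues" the slide mentions.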
65 The execution time of a short instruction sequence is a complex function! [Diagram: the branch predictor (correct/mispredict), ITLB and I-cache (hit/miss), the execution core, DTLB and D-cache (hit/miss), and the L2 cache (hit/miss) all contribute.]
66 66 Code generation issues. First, avoid data misses (300-cycle miss penalties): data layout, loop reorganization, loop blocking. Instruction generation: minimize the instruction count (e.g. common-subexpression elimination); schedule instructions to expose ILP; avoid hard-to-predict branches.
67 On-chip thread-level parallelism
68 68 One billion transistors now!! The ultimate superscalar uniprocessor seems unreachable: there is just not enough ILP, and a few key (power-hungry) components (register file, bypass network, issue logic) have more than quadratic complexity. Avoiding temperature hot spots would require very long intra-CPU communications. On-chip thread parallelism appears to be the only viable solution: shared-memory multiprocessing, i.e. the chip multiprocessor; simultaneous multithreading; heterogeneous multiprocessing; vector processing.
69 69 The chip multiprocessor. Put a shared-memory multiprocessor on a single die: duplicate the processor, its L1 caches, maybe its L2; keep the caches coherent; share the last level of the memory hierarchy (maybe); share the external interface (to memory and the system).
70 General-purpose chip multiprocessors (CMP): why they did not (really) appear before. Until 2003, there were better (economic) uses for transistors — single-process performance was the most important: more complex superscalar implementations; more cache space (bring the L2 cache on-chip, enlarge the L2 cache, include an L3 cache). Now the returns are diminishing: the CMP is the only option!!
71 Simultaneous multithreading (SMT): parallel processing on a single processor. The functional units are underused on superscalar processors. SMT: share the functional units of a superscalar processor between several processes. Advantages: a single process can use all the resources; dynamic sharing of all structures on parallel/multiprocess workloads.
72 72 [Figure: issue slots over time (processor cycles) on a superscalar processor vs an SMT processor; on the superscalar, many slots go unutilized, while the SMT fills them with instructions from threads 1-5.]
73 The programmer view of a CMP/SMT! 73
74 74 Why is CMP/SMT the new frontier? (Most) applications were sequential; hardware WILL be parallel — tens or hundreds of SMT cores in your PC or PDA 10 years from now (maybe). Option 1: applications will have to be adapted to parallelism. Option 2: parallel hardware will have to run sequential applications efficiently. Option 3: invent new tradeoffs. There is no current standard.
75 Embedded processing and on-chip parallelism (1). ILP has been exploited for many years. DSPs: a multiply-add, 1 or 2 loads, and loop control in a single (long) cycle. Caches have been implemented on embedded processors for more than 10 years. VLIW was introduced on embedded processors in the mid-1990s. In-order superscalar processors were developed for the embedded market 15 years ago (Intel i960).
76 Embedded processing and on-chip parallelism (2): thread parallelism. Heterogeneous multicores are the trend. IBM Cell processor (2005): one PowerPC (master) + 8 special-purpose processors (slaves). Philips Nexperia: a MIPS RISC microprocessor + a VLIW TriMedia processor. ST Nomadik: an ARM + several VLIW cores.
77 What about a vector microprocessor for scientific computing? Vector parallelism is well understood! A niche segment, but not so small: $2B! Caches are not vector-friendly — never heard of L2 caches, prefetching, blocking? GPGPU!! GPU boards have become vector processing units.
78 78 Structure of future multicores? [Figure: many cores, each a µP with its private cache, sharing an L3 cache.]
79 Hierarchical organization? [Figure: pairs of µPs with private caches share an L2 cache; the L2 caches share an L3 cache.]
80 80 An example of sharing. [Figure: two clusters, each with an integer µP and an FP µP sharing an instruction-fetch unit, an IL1 cache, a DL1 cache, and a hardware accelerator; both clusters share the L2 cache.]
81 81 A possible basic brick. [Figure: four cores, each an integer µP + FP µP with private I$ and D$, sharing an L2 cache.]
82 82 [Figure: four such bricks — each four cores sharing an L2 cache — connected to a shared L3 cache, a memory interface, a network interface, and a system interface.]
83 Only limited available thread parallelism? Focus on uniprocessor architecture: find the correct tradeoff between complexity and performance; power and temperature issues. Vector extensions? Contiguous vectors (à la SSE)? Strided vectors in the L2 caches (Tarantula-like)?
84 84 Another possible basic brick. [Figure: one «ultimate» out-of-order superscalar core and two simple cores (µP + FP µP with I$ and D$) sharing an L2 cache.]
85 [Figure: four such bricks, each an «ultimate» out-of-order core with its I$, D$, and L2 cache, connected to the shared L3 cache, memory interface, network interface, and system interface.]
86 Some undeveloped issues: the power consumption issue. Power consumption must be limited. Laptop: battery life. Desktop: above 75 W, heat is hard to extract at low cost. Embedded: battery life, environment. Revisit the old performance concept: maximum performance within a fixed power budget; power-aware architecture design; the need to limit frequency.
87 Some undeveloped issues: the performance predictability issue. Modern microprocessors have unpredictable/unstable performance. The average user wants stable performance and cannot tolerate performance varying by orders of magnitude when simple parameters change. Real-time systems want to guarantee response times.
88 Some undeveloped issues: the temperature issue. Temperature is not uniform across the chip (hotspots). Raising the temperature above a threshold has devastating effects: faulty behavior (transient or permanent), component aging. Solutions: gating (stopping!!), clock scaling, or task migration.
ELE 356 Computer Engineering II. Section 1 Foundations Class 6 Architecture
ELE 356 Computer Engineering II Section 1 Foundations Class 6 Architecture History ENIAC Video 2 tj History Mechanical Devices Abacus 3 tj History Mechanical Devices The Antikythera Mechanism Oldest known
BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA
BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA AGENDA INTRO TO BEAGLEBONE BLACK HARDWARE & SPECS CORTEX-A8 ARMV7 PROCESSOR PROS & CONS VS RASPBERRY PI WHEN TO USE BEAGLEBONE BLACK Single
ADVANCED COMPUTER ARCHITECTURE
ADVANCED COMPUTER ARCHITECTURE Marco Ferretti Tel. Ufficio: 0382 985365 E-mail: [email protected] Web: www.unipv.it/mferretti, eecs.unipv.it 1 Course syllabus and motivations This course covers the
Multithreading Lin Gao cs9244 report, 2006
Multithreading Lin Gao cs9244 report, 2006 2 Contents 1 Introduction 5 2 Multithreading Technology 7 2.1 Fine-grained multithreading (FGMT)............. 8 2.2 Coarse-grained multithreading (CGMT)............
Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University [email protected].
Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University [email protected] Review Computers in mid 50 s Hardware was expensive
Chapter 2 Logic Gates and Introduction to Computer Architecture
Chapter 2 Logic Gates and Introduction to Computer Architecture 2.1 Introduction The basic components of an Integrated Circuit (IC) is logic gates which made of transistors, in digital system there are
Week 1 out-of-class notes, discussions and sample problems
Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types
IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 IBM CELL. Politecnico di Milano Como Campus
Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 CELL INTRODUCTION 2 1 CELL SYNERGY Cell is not a collection of different processors, but a synergistic whole Operation paradigms,
Introduction to RISC Processor. ni logic Pvt. Ltd., Pune
Introduction to RISC Processor ni logic Pvt. Ltd., Pune AGENDA What is RISC & its History What is meant by RISC Architecture of MIPS-R4000 Processor Difference Between RISC and CISC Pros and Cons of RISC
Thread level parallelism
Thread level parallelism ILP is used in straight line code or loops Cache miss (off-chip cache and main memory) is unlikely to be hidden using ILP. Thread level parallelism is used instead. Thread: process
Computer Architecture
Cache Memory Gábor Horváth 2016. április 27. Budapest associate professor BUTE Dept. Of Networked Systems and Services [email protected] It is the memory,... The memory is a serious bottleneck of Neumann
Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager
Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor Travis Lanier Senior Product Manager 1 Cortex-A15: Next Generation Leadership Cortex-A class multi-processor
Computer Architectures
Computer Architectures 2. Instruction Set Architectures 2015. február 12. Budapest Gábor Horváth associate professor BUTE Dept. of Networked Systems and Services [email protected] 2 Instruction set architectures
CS352H: Computer Systems Architecture
CS352H: Computer Systems Architecture Topic 9: MIPS Pipeline - Hazards October 1, 2009 University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell Data Hazards in ALU Instructions
Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2
Lecture Handout Computer Architecture Lecture No. 2 Reading Material Vincent P. Heuring&Harry F. Jordan Chapter 2,Chapter3 Computer Systems Design and Architecture 2.1, 2.2, 3.2 Summary 1) A taxonomy of
"JAGUAR AMD s Next Generation Low Power x86 Core. Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012
"JAGUAR AMD s Next Generation Low Power x86 Core Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012 TWO X86 CORES TUNED FOR TARGET MARKETS Mainstream Client and Server Markets Bulldozer
Learning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design
Learning Outcomes Simple CPU Operation and Buses Dr Eddie Edwards [email protected] At the end of this lecture you will Understand how a CPU might be put together Be able to name the basic components
Quiz for Chapter 1 Computer Abstractions and Technology 3.10
Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,
what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?
Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the
Mobile Processors: Future Trends
Mobile Processors: Future Trends Mário André Pinto Ferreira de Araújo Departamento de Informática, Universidade do Minho 4710-057 Braga, Portugal [email protected] Abstract. Mobile devices, such as handhelds,
Chapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides
Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to:
55 Topic 3 Computer Performance Contents 3.1 Introduction...................................... 56 3.2 Measuring performance............................... 56 3.2.1 Clock Speed.................................
Performance evaluation
Performance evaluation Arquitecturas Avanzadas de Computadores - 2547021 Departamento de Ingeniería Electrónica y de Telecomunicaciones Facultad de Ingeniería 2015-1 Bibliography and evaluation Bibliography
Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.
This Unit CIS 501: Computer Architecture Unit 4: Performance & Benchmarking Metrics Latency and throughput Speedup Averaging CPU Performance Performance Pitfalls Slides'developed'by'Milo'Mar0n'&'Amir'Roth'at'the'University'of'Pennsylvania'
Pentium vs. Power PC Computer Architecture and PCI Bus Interface
Pentium vs. Power PC Computer Architecture and PCI Bus Interface CSE 3322 1 Pentium vs. Power PC Computer Architecture and PCI Bus Interface Nowadays, there are two major types of microprocessors in the
1. Memory technology & Hierarchy
1. Memory technology & Hierarchy RAM types Advances in Computer Architecture Andy D. Pimentel Memory wall Memory wall = divergence between CPU and RAM speed We can increase bandwidth by introducing concurrency
Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes
basic pipeline: single, in-order issue first extension: multiple issue (superscalar) second extension: scheduling instructions for more ILP option #1: dynamic scheduling (by the hardware) option #2: static
Instruction Set Architecture (ISA)
Instruction Set Architecture (ISA) * Instruction set architecture of a machine fills the semantic gap between the user and the machine. * ISA serves as the starting point for the design of a new machine
The Central Processing Unit:
The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Objectives Identify the components of the central processing unit and how they work together and interact with memory Describe how
Architectures and Platforms
Hardware/Software Codesign Arch&Platf. - 1 Architectures and Platforms 1. Architecture Selection: The Basic Trade-Offs 2. General Purpose vs. Application-Specific Processors 3. Processor Specialisation
Parallelism and Cloud Computing
Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication
Chapter 2 Parallel Computer Architecture
Chapter 2 Parallel Computer Architecture The possibility for a parallel execution of computations strongly depends on the architecture of the execution platform. This chapter gives an overview of the general
PROBLEMS #20,R0,R1 #$3A,R2,R4
506 CHAPTER 8 PIPELINING (Corrisponde al cap. 11 - Introduzione al pipelining) PROBLEMS 8.1 Consider the following sequence of instructions Mul And #20,R0,R1 #3,R2,R3 #$3A,R2,R4 R0,R2,R5 In all instructions,
Next Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip
Outline Modeling, simulation and optimization of Multi-Processor SoCs (MPSoCs) Università of Verona Dipartimento di Informatica MPSoCs: Multi-Processor Systems on Chip A simulation platform for a MPSoC
Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek
Instruction Set Architecture or How to talk to computers if you aren t in Star Trek The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture
High Performance Computing in the Multi-core Area
High Performance Computing in the Multi-core Area Arndt Bode Technische Universität München Technology Trends for Petascale Computing Architectures: Multicore Accelerators Special Purpose Reconfigurable
SPARC64 VIIIfx: CPU for the K computer
SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,
CPU Organization and Assembly Language
COS 140 Foundations of Computer Science School of Computing and Information Science University of Maine October 2, 2015 Outline 1 2 3 4 5 6 7 8 Homework and announcements Reading: Chapter 12 Homework:
COMPUTER HARDWARE. Input- Output and Communication Memory Systems
COMPUTER HARDWARE Input- Output and Communication Memory Systems Computer I/O I/O devices commonly found in Computer systems Keyboards Displays Printers Magnetic Drives Compact disk read only memory (CD-ROM)
picojava TM : A Hardware Implementation of the Java Virtual Machine
picojava TM : A Hardware Implementation of the Java Virtual Machine Marc Tremblay and Michael O Connor Sun Microelectronics Slide 1 The Java picojava Synergy Java s origins lie in improving the consumer
! Metrics! Latency and throughput. ! Reporting performance! Benchmarking and averaging. ! CPU performance equation & performance trends
This Unit CIS 501 Computer Architecture! Metrics! Latency and throughput! Reporting performance! Benchmarking and averaging Unit 2: Performance! CPU performance equation & performance trends CIS 501 (Martin/Roth):
Intel Pentium 4 Processor on 90nm Technology
Intel Pentium 4 Processor on 90nm Technology Ronak Singhal August 24, 2004 Hot Chips 16 1 1 Agenda Netburst Microarchitecture Review Microarchitecture Features Hyper-Threading Technology SSE3 Intel Extended
路 論 Chapter 15 System-Level Physical Design
Introduction to VLSI Circuits and Systems 路 論 Chapter 15 System-Level Physical Design Dept. of Electronic Engineering National Chin-Yi University of Technology Fall 2007 Outline Clocked Flip-flops CMOS
on an system with an infinite number of processors. Calculate the speedup of
1. Amdahl s law Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30 Speedup2 = 20 Speedup3 = 10 Only one enhancement is usable at a time. a) If enhancements
Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com
CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com Modern GPU
The Microarchitecture of Superscalar Processors
The Microarchitecture of Superscalar Processors James E. Smith Department of Electrical and Computer Engineering 1415 Johnson Drive Madison, WI 53706 ph: (608)-265-5737 fax:(608)-262-1267 email: [email protected]
GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
Computer Organization and Architecture. Characteristics of Memory Systems. Chapter 4 Cache Memory. Location CPU Registers and control unit memory
Computer Organization and Architecture Chapter 4 Cache Memory Characteristics of Memory Systems Note: Appendix 4A will not be covered in class, but the material is interesting reading and may be used in
~ Greetings from WSU CAPPLab ~
~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)
