# Instruction Level Parallelism Part I - Introduction


1 Course on: Advanced Computer Architectures. Instruction Level Parallelism, Part I: Introduction. Prof. Cristina Silvano, Politecnico di Milano

2 Outline of Part I: Introduction to ILP; Dependences and Hazards; Dynamic scheduling vs. Static scheduling; Superscalar vs. VLIW

3 Getting higher performance... In a pipelined machine, the actual CPI is derived as: CPI_pipeline = CPI_ideal + Structural Stalls + Data Hazard Stalls + Control Stalls + Memory Stalls. Reducing any right-hand term moves CPI_pipeline toward CPI_ideal (and increases Instructions Per Clock: IPC = 1 / CPI). Best case: the maximum throughput is to complete 1 instruction per clock: IPC_ideal = 1; CPI_ideal = 1.
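The CPI decomposition above can be sketched as a small calculation. This is an illustrative sketch, not part of the original slides; the stall contributions used below are made-up example values.

```python
def pipeline_cpi(cpi_ideal, structural, data, control, memory):
    """CPI_pipeline = CPI_ideal plus the per-instruction stall contributions."""
    return cpi_ideal + structural + data + control + memory

# Assumed example: 0.05 structural, 0.20 data, 0.10 control, 0.15 memory
# stall cycles per instruction on top of an ideal CPI of 1.
cpi = pipeline_cpi(1.0, 0.05, 0.20, 0.10, 0.15)  # 1.5 cycles per instruction
ipc = 1.0 / cpi                                  # IPC = 1 / CPI
```

Removing any one stall term moves `cpi` back toward the ideal value of 1.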

4 Sequential vs. Pipelined Execution [Figure: timing diagram comparing sequential execution, where each instruction takes the full 10 ns before the next starts, with pipelined execution, where a new instruction enters the pipeline (IF ID EX MEM WB) every 2 ns. IPC_ideal = 1; CPI_ideal = 1.]

5 Summary of Pipelining Basics Hazards limit performance: Structural: need more HW resources. Data: need forwarding and compiler scheduling. Control: early evaluation, branch delay slot, static and dynamic branch prediction. Increasing the length of the pipe (superpipelining) increases the impact of hazards. Pipelining helps instruction throughput, not latency.

6 Summary: Types of Data Hazards Consider executing a sequence of instructions of the form rk ← (ri) op (rj):

Data dependence (Read-after-Write, RAW hazard):
r3 ← (r1) op (r2)
r5 ← (r3) op (r4)

Anti-dependence (Write-after-Read, WAR hazard):
r3 ← (r1) op (r2)
r1 ← (r4) op (r5)

Output dependence (Write-after-Write, WAW hazard):
r3 ← (r1) op (r2)
r3 ← (r6) op (r7)
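The three hazard types can be detected mechanically by comparing each instruction's source and destination registers against every earlier instruction in the sequence. The following is an illustrative sketch (not from the slides); instructions are modeled simply as a destination register plus a tuple of source registers.

```python
def classify_hazards(instrs):
    """instrs: list of (dest, srcs) pairs in program order.
    Returns (i, j, kind) for every hazard between instruction i
    and a later instruction j."""
    hazards = []
    for i, (di, si) in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            dj, sj = instrs[j]
            if di in sj:         # j reads what i writes
                hazards.append((i, j, "RAW"))
            if dj in si:         # j writes what i reads
                hazards.append((i, j, "WAR"))
            if dj == di:         # j writes what i writes
                hazards.append((i, j, "WAW"))
    return hazards

# The three two-instruction sequences from the slide:
raw = classify_hazards([("r3", ("r1", "r2")), ("r5", ("r3", "r4"))])
war = classify_hazards([("r3", ("r1", "r2")), ("r1", ("r4", "r5"))])
waw = classify_hazards([("r3", ("r1", "r2")), ("r3", ("r6", "r7"))])
```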

7 Complex Pipelining [Figure: pipeline with IF, ID, Issue, and WB stages feeding an ALU, a memory unit, and floating-point units (Fadd, Fmul, Fdiv), with separate GPR and FPR register files.] Issues to handle: long-latency or partially pipelined floating-point units; multiple functional and memory units; memory systems with variable access time; precise exceptions.

8 Complex In-Order Pipeline [Figure: in-order pipeline with an integer path (X1, X2, X3), a data-memory access in X2, floating-point paths (Fadd, pipelined Fmul, unpipelined FDiv), a commit point, and a shared writeback stage W.] Writeback is delayed so that all operations have the same latency to the W stage. Write ports are never oversubscribed (one instruction in and one instruction out every cycle). Instructions commit in order, which simplifies the implementation of precise exceptions.

9 Some Basic Concepts and Definitions To reach higher performance (for a given technology), more parallelism must be extracted from the program: in other words, multiple issue. Dependences must be detected and resolved, and instructions must be re-ordered (scheduled) so as to achieve the highest parallelism of instruction execution compatible with the available resources.

10 Definition of Instruction Level Parallelism ILP = exploiting the potential overlap of execution among unrelated instructions. Overlapping is possible whenever there are: no structural hazards; no RAW, WAR, or WAW hazards; no control hazards.

11 ILP: Dual-Issue Pipelined Execution [Figure: timing diagram of a dual-issue pipeline in which pairs of instructions (I1/I2, I3/I4, ..., I9/I10) enter the pipeline together every 2 ns. IPC_ideal = 2; CPI_ideal = 0.5.]

12 2-Issue MIPS Pipeline Architecture [Figure: MIPS datapath extended with a second sign-extend unit and a second ALU for address calculation, so that two instructions can be issued per clock.] Two instructions issued per clock: one ALU or branch instruction, plus one load/store instruction.

13 Getting higher performance... In a multiple-issue pipelined machine, the ideal CPI would be CPI_ideal < 1. For example, in a 2-issue processor the best case (maximum throughput) is to complete 2 instructions per clock: IPC_ideal = 2; CPI_ideal = 0.5.
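The effect of issue width on the best-case execution time can be sketched with a simple cycle count. This is an illustrative model (not from the slides): it assumes a classic 5-stage pipeline whose fill latency is depth − 1 cycles, after which one full issue group completes every cycle.

```python
import math

def min_cycles(n_instructions, issue_width, pipeline_depth=5):
    """Best-case cycles to run n instructions on a w-issue pipeline:
    (depth - 1) cycles to fill, then one group of w instructions per cycle."""
    return (pipeline_depth - 1) + math.ceil(n_instructions / issue_width)

# 10 instructions: single-issue needs 14 cycles, dual-issue needs 9.
single = min_cycles(10, 1)
dual = min_cycles(10, 2)
```

For large instruction counts the fill latency becomes negligible and throughput approaches `issue_width` instructions per cycle, i.e. CPI_ideal = 1 / issue_width.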

14 Dependences Determining the dependences among instructions is critical to defining the amount of parallelism existing in a program. If two instructions are dependent, they cannot execute in parallel: they must be executed in order, or only partially overlapped. There are three different types of dependences: Data Dependences (or True Data Dependences), Name Dependences, and Control Dependences.

15 Name Dependences A name dependence occurs when two instructions use the same register or memory location (called a name), but there is no flow of data between the instructions associated with that name. There are two types of name dependences between an instruction i that precedes an instruction j in program order: Antidependence: j writes a register or memory location that i reads (it can generate a WAR hazard). The original instruction ordering must be preserved to ensure that i reads the correct value. Output dependence: i and j write the same register or memory location (it can generate a WAW hazard). The original instruction ordering must be preserved to ensure that the value finally written corresponds to j.

16 Name Dependences Name dependences are not true data dependences, since there is no value (no data flow) being transmitted between the instructions. If the name (register number or memory location) used in the instructions could be changed, the instructions would not conflict. Dependences through memory locations are more difficult to detect (the memory disambiguation problem), since two addresses may refer to the same location yet look different. Register renaming is easier, and can be done either statically by the compiler or dynamically by the hardware.

17 Data Dependences and Hazards A data or name dependence can potentially generate a data hazard (RAW, WAW, or WAR), but the actual hazard and the number of stalls needed to eliminate it are a property of the pipeline. RAW hazards correspond to true data dependences; WAW hazards correspond to output dependences; WAR hazards correspond to antidependences. Dependences are a property of the program, while hazards are a property of the pipeline.

18 Control Dependences A control dependence determines the ordering of instructions and is preserved by two properties: executing instructions in program order, to ensure that an instruction that occurs before a branch is executed before the branch; and detecting control hazards, to ensure that an instruction that is control dependent on a branch is not executed until the branch direction is known. Although preserving control dependence is a simple way to preserve program order, control dependence is not in itself the critical property that must be preserved.

19 Program Properties Two properties are critical to program correctness (and are normally preserved by maintaining both data and control dependences): 1. Data flow: the actual flow of data values among the instructions that produce results and the instructions that consume them. 2. Exception behavior: any change in the ordering of instruction execution must not change how exceptions are raised in the program.

20 Instruction Level Parallelism Two strategies to support ILP: Dynamic scheduling: depend on the hardware to locate the parallelism. Static scheduling: rely on software (the compiler) to identify the potential parallelism. Hardware-intensive approaches dominate the desktop and server markets.

21 Basic Assumptions We consider single-issue processors. The instruction fetch stage precedes the issue stage, and may fetch either into an instruction register (IR) or into a queue of pending instructions; instructions are then issued from the IR or from the queue. The execution stage may require multiple cycles, depending on the operation type.

22 Key Idea: Dynamic Scheduling Problem: data dependences that cannot be hidden with bypassing or forwarding cause hardware stalls of the pipeline. Solution: allow the instructions behind a stall to proceed. The hardware rearranges the instruction execution dynamically to reduce stalls, enabling out-of-order execution and completion (commit). First implemented in the CDC 6600 (1963).

23 Example 1
DIVD F0,F2,F4
ADDD F10,F0,F8 # RAW on F0
SUBD F12,F8,F14
RAW hazard: ADDD stalls on F0 (waiting for DIVD to commit). Without dynamic scheduling, SUBD would stall as well, even though it is not data dependent on anything in the pipeline. Basic idea: enable SUBD to proceed (out-of-order execution).
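The rule that lets SUBD slip past the stalled ADDD can be sketched as a tiny readiness check (an illustrative toy, not the slides' hardware): an instruction may begin executing once no earlier, still-incomplete instruction writes one of its source registers.

```python
def can_proceed(idx, instrs, completed):
    """instrs: list of (dest, srcs) in program order; completed: set of
    indices already finished. True if instruction idx has no outstanding
    RAW dependence on an earlier, incomplete instruction."""
    _, srcs = instrs[idx]
    for j in range(idx):
        if j in completed:
            continue
        dj, _ = instrs[j]
        if dj in srcs:          # earlier in-flight instruction writes a source
            return False
    return True

prog = [("F0",  ("F2", "F4")),    # DIVD F0,F2,F4  (long latency)
        ("F10", ("F0", "F8")),    # ADDD F10,F0,F8 (RAW on F0)
        ("F12", ("F8", "F14"))]   # SUBD F12,F8,F14 (independent)

# While DIVD is still executing (nothing completed yet):
# ADDD must wait, but SUBD may proceed out of order.
```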

24 Dynamic Scheduling The hardware reorders the instruction execution to reduce pipeline stalls while maintaining data flow and exception behavior. Main advantages (pros): it enables handling some cases where dependences are unknown at compile time; it reduces the complexity required of the compiler; it allows compiled code to run efficiently on a different pipeline (code portability). These advantages are gained at the cost of (cons): a significant increase in hardware complexity; increased power consumption; possibly imprecise exceptions.

25 Dynamic Scheduling In a simple pipeline, hazards due to data dependences that cannot be hidden by forwarding stall the pipeline: no new instructions are fetched or issued. With dynamic scheduling, the hardware reorders the instruction execution so as to reduce stalls, while maintaining data flow and exception behaviour. Typical example: the superscalar processor.

26 Dynamic Scheduling (2) Basically: instructions are fetched and issued in program order (in-order issue); execution begins as soon as operands are available, possibly out of order (note: this is possible even with pipelined scalar architectures). Out-of-order execution introduces the possibility of WAR and WAW data hazards. Out-of-order execution implies out-of-order completion, unless there is a re-order buffer to obtain in-order completion.

27 Static Scheduling Compilers can use sophisticated algorithms for code scheduling to exploit ILP (Instruction Level Parallelism). The size of a basic block (a straight-line code sequence with no branches in, except to the entry, and no branches out, except at the exit) is usually quite small, so the amount of parallelism available within a basic block is quite small. Example: for typical MIPS programs the average branch frequency is between 15% and 25%, i.e. from 4 to 7 instructions execute between a pair of branches.
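The "4 to 7 instructions" figure follows directly from the branch frequency: if a fraction f of instructions are branches, then on average one branch appears every 1/f instructions. A quick check of the slide's numbers:

```python
def avg_run_between_branches(branch_freq):
    """Average number of instructions per branch, given the fraction of
    instructions that are branches."""
    return 1.0 / branch_freq

# 25% branches -> 1/0.25 = 4 instructions per branch;
# 15% branches -> 1/0.15 = 6.67 instructions per branch.
low = avg_run_between_branches(0.25)
high = avg_run_between_branches(0.15)
```

This matches the slide's "from 4 to 7 instructions between a pair of branches".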

28 Static Scheduling Data dependences can further limit the amount of ILP we can exploit within a basic block to much less than the average basic block size. To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks (i.e. across branches, as in trace scheduling).

29 Static Scheduling Static detection and resolution of dependences (static scheduling) is accomplished by the compiler: dependences are avoided by code reordering. The output of the compiler is code reordered into dependency-free code. Typical example: VLIW (Very Long Instruction Word) processors, which expect dependency-free code generated by the compiler.

30 Main Limits of Static Scheduling Unpredictable branches Variable memory latency (unpredictable cache misses) Code size explosion Compiler complexity

31 Summary of Instruction Level Parallelism Two strategies to support ILP: Dynamic Scheduling: depend on the hardware to locate parallelism Static Scheduling: rely on the compiler for identifying potential parallelism Hardware intensive approaches dominate desktop and server markets 31

32-36 Several Steps Towards Exploiting More ILP [Figures: a sequence of diagrams stepping from single-issue in-order execution, to single-issue out-of-order execution, to multiple-issue out-of-order execution.]

37 Superscalar Processor [Figure.] Advanced Computer Architectures, Laura Pozzi & Cristina Silvano

38-40 Dynamic Scheduler [Figures.]

41 Dynamic Scheduling is Expensive! It requires a large amount of logic, with significant area cost: the PowerPC 750 instruction sequencer alone is approximately 70% of the area of all the execution units (integer units + load/store units + FP unit)! The cycle time is limited by the scheduling logic (the dispatcher and its associated dependency-checking logic). Design verification is extremely complex: very complex, irregular logic.

42 Issue Width is Limited in Practice The issue width is the number of instructions that can be issued (and fetched, decoded, etc.) in a single cycle by a multiple-issue (also called ILP) processor. When the superscalar approach was invented, 2-issue and soon afterwards 4-issue processors were created (i.e. 4 instructions executed in a single cycle, ideal CPI = 1/4).

43 Issue Width is Limited in Practice Today the maximum (and rare) issue width is 6; nothing wider exists. The widths of current processors range from single-issue (ARM11, UltraSPARC-T1), through 2-issue (UltraSPARC-T2/T3, Cortex-A8 & A9, Atom, Bobcat), to 3-issue (Pentium-Pro/II/III/M, Athlon, Pentium-4, Athlon 64/Phenom, Cortex-A15), 4-issue (UltraSPARC-III/IV, PowerPC G4e, Core 2, Core i, Core i*2, Bulldozer), 5-issue (PowerPC G5), or even 6-issue (Itanium, but it is a VLIW). Why stop there? Because it is too hard to decide which 8, or 16, instructions can execute every cycle (too many combinations!): the decision takes too long to compute, so the frequency of the processor would have to be decreased.
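One way to see why wide issue is so hard: the cross-checking logic grows roughly quadratically with the issue width, since every later instruction's sources must be compared against every earlier instruction's destination within the same issue group. The count below is an illustrative back-of-the-envelope model (RAW checks only, two sources per instruction), not a gate-level figure from the slides.

```python
def raw_comparators(width, srcs_per_instr=2):
    """Source-vs-destination comparators needed to cross-check one issue
    group of `width` instructions for RAW dependences: each of the
    width*(width-1)/2 ordered pairs needs srcs_per_instr comparisons."""
    return srcs_per_instr * width * (width - 1) // 2

# width 2 -> 2, width 4 -> 12, width 8 -> 56 comparators:
# the logic roughly quadruples each time the issue width doubles.
counts = [raw_comparators(w) for w in (2, 4, 8, 16)]
```

WAR/WAW checks and operand bypassing add further cross-connections on top of this, which is why 8- or 16-wide issue would force the clock frequency down.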

44 Summary of Superscalar and Dynamic Scheduling Main advantage: very high performance; the ideal CPI is very low: CPI_ideal = 1 / issue-width. Disadvantages: very expensive logic to decide the dependences and independences, i.e. to decide which instructions can be issued every clock cycle; it does not scale: it is almost impractical to make the issue width greater than 4 (we would have to slow down the clock).


47 Superscalar vs. VLIW Scheduling means deciding when and where to execute an instruction, i.e. in which cycle and in which functional unit. For a superscalar processor this is decided at run time, by custom logic in hardware. For a VLIW processor it is decided at compile time, by the compiler, and therefore by a software program. VLIW is good for embedded processors: simpler HW design (no dynamic scheduler), smaller area, and cheaper.

48 Challenges for VLIW Compiler technology: the compiler needs to find a lot of parallelism in order to keep the multiple functional units of the processor busy. Binary incompatibility: a consequence of the greater exposure of the microarchitecture (i.e. of the implementation choices) to the compiler in the generated code.

49 Advantages of SW vs. HW Scheduling [Figure.]

50 Current Superscalar & VLIW Processors Dynamically scheduled superscalar processors are the commercial state of the art for general-purpose computing: current implementations of Intel Core i, Alpha, PowerPC, MIPS, Sparc, etc. are all superscalar. VLIW processors are primarily successful as embedded media processors for consumer electronic devices: the TriMedia media processors by NXP (formerly Philips Semiconductors); the C6000 DSP family by Texas Instruments; the STMicroelectronics ST200 family; the SHARC DSP by Analog Devices. Itanium 2 is the only general-purpose VLIW, a hybrid VLIW (EPIC, Explicitly Parallel Instruction Computing).


### Analysis (III) Low Power Design. Kai Huang

Analysis (III) Low Power Design Kai Huang Chinese new year: 1.3 billion urban exodus 1/28/2014 Kai.Huang@tum 2 The interactive map, which is updated hourly The thicker, brighter lines are the busiest routes.

### <Insert Picture Here> T4: A Highly Threaded Server-on-a-Chip with Native Support for Heterogeneous Computing

T4: A Highly Threaded Server-on-a-Chip with Native Support for Heterogeneous Computing Robert Golla Senior Hardware Architect Paul Jordan Senior Principal Hardware Engineer Oracle

### Computer Architectures

Computer Architectures 2. Instruction Set Architectures 2015. február 12. Budapest Gábor Horváth associate professor BUTE Dept. of Networked Systems and Services ghorvath@hit.bme.hu 2 Instruction set architectures

### Introduction to RISC Processor. ni logic Pvt. Ltd., Pune

Introduction to RISC Processor ni logic Pvt. Ltd., Pune AGENDA What is RISC & its History What is meant by RISC Architecture of MIPS-R4000 Processor Difference Between RISC and CISC Pros and Cons of RISC

### IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 IBM CELL. Politecnico di Milano Como Campus

Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 CELL INTRODUCTION 2 1 CELL SYNERGY Cell is not a collection of different processors, but a synergistic whole Operation paradigms,

### Instruction Set Architecture (ISA)

Instruction Set Architecture (ISA) * Instruction set architecture of a machine fills the semantic gap between the user and the machine. * ISA serves as the starting point for the design of a new machine

### Boosting Beyond Static Scheduling in a Superscalar Processor

Boosting Beyond Static Scheduling in a Superscalar Processor Michael D. Smith, Monica S. Lam, and Mark A. Horowitz Computer Systems Laboratory Stanford University, Stanford CA 94305-4070 May 1990 1 Introduction

### Coming Challenges in Microarchitecture and Architecture

Coming Challenges in Microarchitecture and Architecture RONNY RONEN, SENIOR MEMBER, IEEE, AVI MENDELSON, MEMBER, IEEE, KONRAD LAI, SHIH-LIEN LU, MEMBER, IEEE, FRED POLLACK, AND JOHN P. SHEN, FELLOW, IEEE

### Giving credit where credit is due

CSCE 230J Computer Organization Processor Architecture VI: Wrap-Up Dr. Steve Goddard goddard@cse.unl.edu http://cse.unl.edu/~goddard/courses/csce230j Giving credit where credit is due ost of slides for

### Design of Pipelined MIPS Processor. Sept. 24 & 26, 1997

Design of Pipelined MIPS Processor Sept. 24 & 26, 1997 Topics Instruction processing Principles of pipelining Inserting pipe registers Data Hazards Control Hazards Exceptions MIPS architecture subset R-type

### Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

This Unit CIS 501: Computer Architecture Unit 4: Performance & Benchmarking Metrics Latency and throughput Speedup Averaging CPU Performance Performance Pitfalls Slides'developed'by'Milo'Mar0n'&'Amir'Roth'at'the'University'of'Pennsylvania'

### on an system with an infinite number of processors. Calculate the speedup of

1. Amdahl s law Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30 Speedup2 = 20 Speedup3 = 10 Only one enhancement is usable at a time. a) If enhancements

### Annotation to the assignments and the solution sheet. Note the following points

Computer rchitecture 2 / dvanced Computer rchitecture Seite: 1 nnotation to the assignments and the solution sheet This is a multiple choice examination, that means: Solution approaches are not assessed

### CISC, RISC, and DSP Microprocessors

CISC, RISC, and DSP Microprocessors Douglas L. Jones ECE 497 Spring 2000 4/6/00 CISC, RISC, and DSP D.L. Jones 1 Outline Microprocessors circa 1984 RISC vs. CISC Microprocessors circa 1999 Perspective:

### VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU

VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU Martin Straka Doctoral Degree Programme (1), FIT BUT E-mail: strakam@fit.vutbr.cz Supervised by: Zdeněk Kotásek E-mail: kotasek@fit.vutbr.cz

### Who Cares about Memory Hierarchy?

Cache Design Who Cares about Memory Hierarchy? Processor vs Memory Performance CPU-DRAM Gap 1980: no cache in microprocessor; 1995 2-level cache Memory Cache cpu cache memory Memory Locality Memory hierarchies

### OC By Arsene Fansi T. POLIMI 2008 1

IBM POWER 6 MICROPROCESSOR OC By Arsene Fansi T. POLIMI 2008 1 WHAT S IBM POWER 6 MICROPOCESSOR The IBM POWER6 microprocessor powers the new IBM i-series* and p-series* systems. It s based on IBM POWER5

### Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to:

55 Topic 3 Computer Performance Contents 3.1 Introduction...................................... 56 3.2 Measuring performance............................... 56 3.2.1 Clock Speed.................................

### Energy-Efficient, High-Performance Heterogeneous Core Design

Energy-Efficient, High-Performance Heterogeneous Core Design Raj Parihar Core Design Session, MICRO - 2012 Advanced Computer Architecture Lab, UofR, Rochester April 18, 2013 Raj Parihar Energy-Efficient,

### Central Processing Unit (CPU)

Central Processing Unit (CPU) CPU is the heart and brain It interprets and executes machine level instructions Controls data transfer from/to Main Memory (MM) and CPU Detects any errors In the following

### Multithreading Lin Gao cs9244 report, 2006

Multithreading Lin Gao cs9244 report, 2006 2 Contents 1 Introduction 5 2 Multithreading Technology 7 2.1 Fine-grained multithreading (FGMT)............. 8 2.2 Coarse-grained multithreading (CGMT)............

### Instruction Set Architecture (ISA) Design. Classification Categories

Instruction Set Architecture (ISA) Design Overview» Classify Instruction set architectures» Look at how applications use ISAs» Examine a modern RISC ISA (DLX)» Measurement of ISA usage in real computers

### Measuring Cache Performance

Measuring Cache Performance Components of CPU time Program execution cycles Includes cache hit time Memory stall cycles Mainly from cache misses With simplifying assumptions: Memory stall cycles = = Memory

### Driving force. What future software needs. Potential research topics

Improving Software Robustness and Efficiency Driving force Processor core clock speed reach practical limit ~4GHz (power issue) Percentage of sustainable # of active transistors decrease; Increase in #

### ARM Microprocessor and ARM-Based Microcontrollers

ARM Microprocessor and ARM-Based Microcontrollers Nguatem William 24th May 2006 A Microcontroller-Based Embedded System Roadmap 1 Introduction ARM ARM Basics 2 ARM Extensions Thumb Jazelle NEON & DSP Enhancement

### SPARC64 X: Fujitsu s New Generation 16 Core Processor for the next generation UNIX servers

X: Fujitsu s New Generation 16 Processor for the next generation UNIX servers August 29, 2012 Takumi Maruyama Processor Development Division Enterprise Server Business Unit Fujitsu Limited All Rights Reserved,Copyright

### Reduced Instruction Set Computer vs Complex Instruction Set Computers

RISC vs CISC Reduced Instruction Set Computer vs Complex Instruction Set Computers for a given benchmark the performance of a particular computer: P = 1 I C 1 S where P = time to execute I = number of