Instruction Level Parallelism


1 Instruction Level Parallelism Pipelining achieves Instruction Level Parallelism (ILP) Multiple instructions in parallel But, problems with pipeline hazards CPI = Ideal CPI + stalls/instruction Stalls = Structural + Data (RAW/WAW/WAR) + Control How to reduce stalls? That is, how to increase ILP?
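As a rough numeric illustration (the stall counts here are assumed, not measured): with an ideal CPI of 1 and, say, 0.05 structural, 0.25 data and 0.15 control stalls per instruction, CPI = 1 + 0.45 = 1.45, so the pipeline runs about 45% slower than ideal; every stall removed moves CPI back toward 1.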

2 Techniques for Improving ILP Loop unrolling Basic pipeline scheduling Dynamic scheduling, scoreboarding, register renaming Dynamic memory disambiguation Dynamic branch prediction Multiple instruction issue per cycle Software and hardware techniques

3 Loop-Level Parallelism Basic block: straight-line code w/o branches Fraction of branches: 0.15 ILP is limited! Average basic-block size is 6-7 instructions And, these may be dependent Hence, look for parallelism beyond a basic block Loop-level parallelism is a simple example of this

4 Loop-Level Parallelism: An Example Consider the loop: for(int i = 1000; i >= 1; i = i-1) { x[i] = x[i] + C; // FP } Each iteration of the loop is independent of other iterations Loop-level parallelism To convert it into ILP: Loop unrolling (static, dynamic) Vector instructions
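As a source-level sketch of what loop unrolling does (my own illustration, not from the slides; the DLX version follows on the next slides), assuming x is the 1-based array and C the scalar from the loop above, and that the trip count is a multiple of 4:

    /* Sketch: the loop above, unrolled by 4 at the source level. */
    void add_scalar_unrolled(double *x, double C) {
        for (int i = 1000; i >= 1; i = i - 4) {
            x[i]     = x[i]     + C;   /* four independent element updates:  */
            x[i - 1] = x[i - 1] + C;   /* no data dependences between them,  */
            x[i - 2] = x[i - 2] + C;   /* so the compiler or hardware can    */
            x[i - 3] = x[i - 3] + C;   /* schedule them in parallel          */
        }
    }

The loop overhead (decrement, test, branch) is now paid once per four elements instead of once per element.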

5 The Loop, in DLX In DLX, the loop looks like: Loop: LD F0, 0(R1) // F0 is array element ADDD F4, F0, F2 // F2 has the scalar 'C' SD 0(R1), F4 // Store result SUBI R1, R1, 8 // For next iteration BNEZ R1, Loop // More iterations? Assume: R1 is the initial address, F2 has the scalar value 'C', lowest address in array is '8'

6 How Many Cycles per Loop? CC1 Loop: LD F0, 0(R1) CC2 stall CC3 ADDD F4, F0, F2 CC4 stall CC5 stall CC6 SD 0(R1), F4 CC7 SUBI R1, R1, 8 CC8 stall CC9 BNEZ R1, Loop CC10 stall

7 Reducing Stalls by Scheduling CC1 Loop: LD F0, 0(R1) CC2 SUBI R1, R1, 8 CC3 ADDD F4, F0, F2 CC4 stall CC5 BNEZ R1, Loop CC6 SD 8(R1), F4 Realizing that SUBI and SD can be swapped is non-trivial! Overhead versus actual work: 3 cycles of work, 3 cycles of overhead

8 Unrolling the Loop Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 // No SUBI, BNEZ LD F6, -8(R1) // Note diff FP reg, new offset ADDD F8, F6, F2 SD -8(R1), F8 LD F10, -16(R1) // Note diff FP reg, new offset ADDD F12, F10, F2 SD -16(R1), F12 LD F14, -24(R1) // Note diff FP reg, new offset ADDD F16, F14, F2 SD -24(R1), F16 SUBI R1, R1, 32 BNEZ R1, Loop

9 How Many Cycles per Loop? Loop: LD F0, 0(R1) // 1 stall ADDD F4, F0, F2 // 2 stalls SD 0(R1), F4 LD F6, -8(R1) // 1 stall ADDD F8, F6, F2 // 2 stalls SD -8(R1), F8 LD F10, -16(R1) // 1 stall ADDD F12, F10, F2 // 2 stalls SD -16(R1), F12 LD F14, -24(R1) // 1 stall ADDD F16, F14, F2 // 2 stalls SD -24(R1), F16 SUBI R1, R1, 32 // 1 stall BNEZ R1, Loop // 1 stall (branch delay) 28 cycles per unrolled loop == 7 cycles per original loop

10 Scheduling the Unrolled Loop Loop: LD F0, 0(R1) LD F6, -8(R1) LD F10, -16(R1) LD F14, -24(R1) ADDD F4, F0, F2 ADDD F8, F6, F2 ADDD F12, F10, F2 ADDD F16, F14, F2 SD 0(R1), F4 SD -8(R1), F8 SUBI R1, R1, 32 SD 16(R1), F12 // 16-32 = -16 BNEZ R1, Loop SD 8(R1), F16 // 8-32 = -24, in branch delay slot 14 cycles per unrolled loop == 3.5 cycles per original loop

11 Observations and Requirements Gain from scheduling is even higher for unrolled loop! More parallelism is exposed on unrolling Need to know that 1000 is a multiple of 4 Requirements: Determine that loop can be unrolled Use different registers to avoid conflicts Determine that SD can be moved after SUBI, and find the offset adjustment Understand dependences
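The requirement that the trip count (1000) be a multiple of 4 can be relaxed with a cleanup loop; a hedged C sketch, again my own illustration rather than the slides':

    /* Sketch: unroll by 4 for an arbitrary trip count n; a scalar cleanup
       loop handles the at-most-3 leftover iterations. */
    void add_scalar_any_n(double *x, double C, int n) {
        int i = n;
        for (; i >= 4; i = i - 4) {      /* unrolled-by-4 main loop */
            x[i]     += C;
            x[i - 1] += C;
            x[i - 2] += C;
            x[i - 3] += C;
        }
        for (; i >= 1; i = i - 1)        /* cleanup: 0 to 3 iterations */
            x[i] += C;
    }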

12 Dependences Dependent instructions ==> cannot be in parallel Three kinds of dependences: Data dependence (RAW) Name dependence (WAW and WAR) Control dependence

13 Dependences (continued) Dependences are properties of programs Stalls are properties of the pipeline Two possibilities: Maintain dependence, but avoid stalls Eliminate dependence by code transformation

14 Data Dependence Data dependence represents data flow from one instruction to another One instruction uses the result of another Take transitive closure In our example: Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 SUBI R1, R1, 8 Note: dependence through memory is hard to detect: 100(R4) and 80(R6) may be the same address; 20(R1) and 20(R1) may be different at different times

15 Name Dependence Two instructions use the same register/memory (name), but there is no flow of data Anti-dependence: WAR hazard Output dependence: WAW hazard Can do register renaming statically, or dynamically

16 Name Dependence in our Example Before renaming: Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 LD F0, -8(R1) ADDD F4, F0, F2 SD -8(R1), F4 LD F0, -16(R1) ADDD F4, F0, F2 SD -16(R1), F4 LD F0, -24(R1) ADDD F4, F0, F2 SD -24(R1), F4 SUBI R1, R1, 32 After register renaming: Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 LD F6, -8(R1) ADDD F8, F6, F2 SD -8(R1), F8 LD F10, -16(R1) ADDD F12, F10, F2 SD -16(R1), F12 LD F14, -24(R1) ADDD F16, F14, F2 SD -24(R1), F16 SUBI R1, R1, 32

17 Control Dependence An example: T1; if p1 { S1; } Statement S1 is control-dependent on p1, but T1 is not What this means for execution S1 cannot be moved before p1 T1 cannot be moved after p1

18 Control Dependence in our Example With the intermediate branches kept: Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 SUBI R1, R1, 8 BEQZ R1, exit LD F6, 0(R1) ADDD F8, F6, F2 SD 0(R1), F8 SUBI R1, R1, 8 BEQZ R1, exit // Two more such... SUBI R1, R1, 8 BNEZ R1, Loop After removing the intermediate branches (unrolled): Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 LD F6, -8(R1) ADDD F8, F6, F2 SD -8(R1), F8 LD F10, -16(R1) ADDD F12, F10, F2 SD -16(R1), F12 LD F14, -24(R1) ADDD F16, F14, F2 SD -24(R1), F16 SUBI R1, R1, 32

19 Handling Control Dependence Control dependence need not be maintained We need to maintain: Exception behaviour: do not cause new exceptions Data flow: ensure the right data item is used Speculation and conditional instructions are techniques to get around control dependence

20 Loop Unrolling: a Relook Our example: for(int i = 1000; i >= 1; i = i-1) { x[i] = x[i] + C; // FP } Consider: for(int i = 1000; i >= 1; i = i-1) { A[i-1] = A[i] + C[i]; // S1 B[i-1] = B[i] + A[i-1]; // S2 } S2 is dependent on S1 S1 is dependent on its previous iteration; same case with S2 Loop-carried dependence ==> loop iterations have to be in-order

21 Removing Loop-Carried Dependence Another example: for(int i = 1000; i >= 1; i = i-1) { A[i] = A[i] + B[i]; // S1 B[i-1] = C[i] + D[i]; // S2 } S1 depends on the prior iteration of S2 Can be removed (no cyclic dependence): A[1000] = A[1000] + B[1000]; for(int i = 1000; i >= 2; i = i-1) { B[i-1] = C[i] + D[i]; // S2 A[i-1] = A[i-1] + B[i-1]; // S1 } B[0] = C[1] + D[1];

22 Static vs. Dynamic Scheduling Static scheduling: limitations Dependences may not be known at compile time Even if known, compiler becomes complex Compiler has to have knowledge of pipeline Dynamic scheduling Handle dynamic dependences Simpler compiler Efficient even if code compiled for a different pipeline

23 Dynamic Scheduling For now, we will focus on overcoming data hazards The idea: DIVD F0, F2, F4 ADDD F10, F0, F8 SUBD F12, F8, F14 SUBD can proceed without waiting for DIVD

24 CDC 6600: A Case Study IF stage: fetch instructions onto a queue ID stage is split into two stages: Issue: decode and check for structural hazards Read operands: check for data hazards Execution may begin, and may complete, out-of-order Complications in exception handling Ignore for now What is the logic for data hazard checks?

25 The CDC Scoreboard Out-of-order completion ==> WAR and WAW hazards possible Scoreboard: a data-structure for all hazard detection in the presence of out-of-order execution/completion All instructions consult the scoreboard to detect hazards

26 The Scoreboard Solution Three components: Stages of the pipeline: Issue (ID1), Read-operands (ID2), EX, WB Data structure (in hardware) Logic for hazard detection, stalling

27 Scoreboard Control & the Pipeline Stages Issue (ID1): decode, check if functional unit is free, and if a previous instruction has the same destination register No such hazard ==> scoreboard issues to the appropriate functional unit Note: structural/WAW hazards prevented by stalling here Note: stall here ==> IF queue will grow Read operands (ID2): Operand is available if no earlier instruction is going to write it, or if the register is being written currently RAW hazards are resolved here

28 Scoreboard Control & the Pipeline Stages (continued) Execute (EX): Functional units perform execution Scoreboard is notified on completion Write-Back (WB): Check for WAR hazards Stall on detection Write-back otherwise

29 Some Remarks WAW causes stall in ID1, WAR causes stall in WB No forwarding logic Output written as soon as it is available (and no WAR hazard) Structural hazard possible in register read/write CDC has 16 functional units, and 4 buses

30 The Scoreboard Data-Structures Instruction status Functional unit status Register result status Randy Katz's CS252 slides... (Lecture 10, Spring 1996) Scoreboard pipeline control A detailed example
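A minimal C sketch of the scoreboard bookkeeping and the issue/read checks (my own illustration with textbook-style field names, not the CDC 6600 hardware or the referenced slides):

    #include <stdbool.h>

    #define NUM_FU   16   /* the CDC 6600 had 16 functional units */
    #define NUM_REGS 32

    /* Functional-unit status: Fi = destination register, Fj/Fk = sources,
       Qj/Qk = producing functional unit (or -1), Rj/Rk = source ready? */
    typedef struct {
        bool busy;
        int  op;
        int  Fi, Fj, Fk;
        int  Qj, Qk;
        bool Rj, Rk;
    } FUStatus;

    FUStatus fu[NUM_FU];
    int reg_result[NUM_REGS];  /* FU that will write each register; initialise all to -1 = none */

    /* Issue (ID1): stall on a structural hazard (unit busy) or a WAW hazard
       (some unit already has this destination register). */
    bool can_issue(int unit, int dest) {
        return !fu[unit].busy && reg_result[dest] == -1;
    }

    /* Read operands (ID2): wait until both sources are ready (RAW). */
    bool can_read_operands(int unit) {
        return fu[unit].Rj && fu[unit].Rk;
    }

The WAR check before write-back (slide 28) would be a similar scan over fu[]: hold the write while an earlier instruction still lists the destination register as a not-yet-read source.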

31 Limitations of the Scoreboard Speedup of 1.7 for (compiled) FORTRAN, speedup of 2.5 for hand-coded assembly The scoreboard exploits ILP only within a basic block! Some hazards still cause stalls: Structural WAR, WAW

32 Dynamic Scheduling Better than static scheduling Scoreboarding: Used by the CDC 6600 Useful only within basic block WAW and WAR stalls Tomasulo algorithm: Used in IBM 360/91 for the FP unit Main additional feature: register renaming to avoid WAR and WAW stalls

33 Register Renaming: Basic Idea Compiler maps memory --> registers statically Register renaming maps registers --> virtual registers in hardware, dynamically Should keep track of this mapping Make sure to read the current value Num. virtual registers > Num. ISA registers usually Virtual registers are known as reservation stations in the IBM 360/91

34 Tomasulo: Main Architectural Features Reservation stations: fetch and buffer operand as soon as it is available Load/store buffers: have the address (and data for store) to be loaded/stored Distributed hazard detection and execution control Common Data Bus (CDB): results passed from where generated to where needed Note: IBM 360/91 also had reg-mem instns.

35 The Tomasulo Architecture [Block diagram: the FP operation queue (fed from the instruction unit) and the load buffers (fed from memory) supply the FP registers and the reservation stations of the FP ADD/SUB and FP MUL/DIV units over the operand buses; store buffers send data to memory; all results are broadcast on the Common Data Bus.]

36 Pipeline Stages Issue: Wait for free Reservation Station (RS) or load/store buffer, and place instruction there Rename registers in the process (WAR and WAW handled here) Execute (EX): Monitor CDB for required operand Checks for RAW hazard in this process Write Result (WB): Write to CDB Picked up by any RS, store buffer, or register

37 Register Renaming In RS, operands referred to by a tag (if operand not already in a register) The tag refers to the RS (which contains the instruction) which will produce the required operand Thus each RS acts as a virtual register

38 The Data Structure Three parts, like in the scoreboard: Instruction status Reservation stations, Load/Store buffers, Register file Register status: which unit is going to produce the register value This is the register --> virtual register mapping

39 Components of RS, Reg. File, Load/Store Buffers Each RS has: Op: the operation (+, -, x, /) Vj, Vk: the operands (if available) Qj, Qk: the RS tag producing Vj/Vk (0 if Vj/Vk known) Busy: is RS busy? Each reg. in reg. file and store buffer has: Qi: tag of RS whose result should go to the reg. or the mem. locn. (blank ==> no such active RS) Load and store buffers have: Busy field, store buffer has value V to be stored

40 Maintaining the Data Structure Issue: Wait until: RS or buffer empty Updates: Qj, Qk, Vj, Vk, Busy of RS/buffer; Maintain register mapping (register status) Execute: Wait until: Qj=0 and Qk=0 (operands available) Write result: CDB result picked up by RS (update Qj, Qk, Vj, Vk), store buffers (update Qi, V), register file (update register status) Update Busy of the RS which finished
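A hedged C sketch of the issue-time renaming and the CDB broadcast described above (field names follow slide 39; sizes and code are my own illustration, not the IBM 360/91 design; load/store buffers omitted):

    #define NUM_RS   8
    #define NUM_REGS 32

    /* Reservation station: a tag of 0 means "value known"; tags 1..NUM_RS
       name the RS that will produce the value on the CDB. */
    typedef struct {
        int    busy;
        int    op;           /* +, -, x, / */
        double Vj, Vk;       /* operand values, valid when Qj/Qk == 0 */
        int    Qj, Qk;       /* producing RS tags */
    } RS;

    RS     rs[NUM_RS + 1];   /* index 0 unused so that tag 0 means "ready" */
    int    reg_q[NUM_REGS];  /* Qi: RS tag that will write the register, 0 if none */
    double reg_v[NUM_REGS];

    /* Issue: rename sources to tags or values and claim the destination
       (this renaming is what removes WAR and WAW stalls). */
    void issue(int r, int op, int src1, int src2, int dest) {
        rs[r].busy = 1;  rs[r].op = op;
        if (reg_q[src1]) rs[r].Qj = reg_q[src1]; else { rs[r].Qj = 0; rs[r].Vj = reg_v[src1]; }
        if (reg_q[src2]) rs[r].Qk = reg_q[src2]; else { rs[r].Qk = 0; rs[r].Vk = reg_v[src2]; }
        reg_q[dest] = r;
    }

    /* Write result: broadcast (tag, value) on the CDB; every waiting RS and
       register picks it up. */
    void write_result(int r, double value) {
        for (int i = 1; i <= NUM_RS; i++) {
            if (rs[i].Qj == r) { rs[i].Vj = value; rs[i].Qj = 0; }
            if (rs[i].Qk == r) { rs[i].Vk = value; rs[i].Qk = 0; }
        }
        for (int d = 0; d < NUM_REGS; d++)
            if (reg_q[d] == r) { reg_v[d] = value; reg_q[d] = 0; }
        rs[r].busy = 0;
    }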

41 Some Examples Randy Katz's CS252 slides... (Lecture 11, Spring 1996) Dynamic loop unrolling example from text

42 Dynamic Loop Unrolling Loop: LD F0, 0(R1) // F0 is array element ADDD F4, F0, F2 // F2 has the scalar 'C' SD 0(R1), F4 // Store result SUBI R1, R1, 8 // For next iteration BNEZ R1, Loop // More iterations? Assume branch predicted to be taken Denote: load buffers as L1, L2..., ADDD RSs as A1, A2... First loop: F0 --> L1, F4 --> A1 Second loop: F0 --> L2, F4 --> A2

43 Summary Remarks Memory disambiguation required Drawbacks of Tomasulo: Large amount of hardware Complex control logic CDB is performance bottleneck But: Required if designing for an old ISA Multiple issue ==> register renaming and dynamic scheduling required Next class: branch prediction

44 Dealing with Control Hazards Software techniques: Branch delay slots Software branch prediction Canceling or nullifying branches Misprediction rates can be high Worse if multiple issue per cycle Hence, hardware/dynamic branch prediction

45 Branch Prediction Buffer PC --> Taken/Not-Taken (T/NT) mapping Can use just the last few bits of PC Prediction may be that of some other branch Ok since correctness is not affected Shortcoming of this prediction scheme: Branch mispredicted twice for each execution of a loop Bad if loop is small for(int i = 0; i < 10; i++) { x[i] = x[i] + C; }

46 Two-Bit Predictor Have to mispredict twice before changing prediction Built-in hysteresis General case is an n-bit predictor 0 to (2^n)-1 saturating counter, incremented on taken, decremented on not-taken 0 to (2^[n-1])-1 predict as not-taken 2^[n-1] to (2^n)-1 predict as taken Experimental studies: 2-bit as good as n-bit
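A small C sketch of a 2-bit predictor (my illustration; the convention that the upper half of the counter range predicts taken follows the slide above):

    #define PHT_ENTRIES 4096

    unsigned char pht[PHT_ENTRIES];          /* each entry is a 2-bit counter, 0..3 */

    /* States 0,1 predict not-taken; states 2,3 predict taken. */
    int predict_taken(unsigned int pc) {
        return pht[pc % PHT_ENTRIES] >= 2;   /* index with low-order PC bits */
    }

    /* Move one step per outcome and saturate, so a single anomalous outcome
       does not flip the prediction (hysteresis). */
    void update_predictor(unsigned int pc, int taken) {
        unsigned char *c = &pht[pc % PHT_ENTRIES];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
    }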

47 Implementing Branch Prediction Buffers Small cache accessed along with the instruction in IF Or, additional 2 bits in the instruction cache Note: branch prediction buffer not useful for the DLX pipeline Branch target not known earlier than branch condition

48 Prediction Performance [Chart: misprediction rates (0% to 18%) for a 4096-entry 2-bit prediction buffer on the SPEC89 benchmarks (nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li), measured on the IBM Power architecture.]

49 Improving Branch Prediction Two ways: increase buffer size, improve accuracy [Chart: misprediction rates on the SPEC89 benchmarks for a 4096-entry buffer versus an infinite-entry buffer.]

50 Improving Prediction Accuracy Predict branches based on outcomes of recent other branches if(aa == 2) { aa = 0; } if(bb == 2) { bb = 0; } if(aa == bb) { // Do something } ==> use a correlating, or two-level, predictor

51 Two-Level Predictor There are effectively two predictors for each branch, depending on whether the previous branch was T/NT:
Prediction bits | Prediction if last branch NT | Prediction if last branch T
NT/NT | NT | NT
NT/T | NT | T
T/NT | T | NT
T/T | T | T

52 Two-Level Predictor (continued) Last predictor was a (1,1) predictor One bit each of history, and prediction General case is (m,n) predictor m bits of history, n bits of prediction How to implement? Have an m-bit shift register
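A hedged C sketch of a (2,2) predictor: a 2-bit global history shift register selects one of four 2-bit counters per branch entry (my illustration of the mechanism, not a particular machine's design):

    #define BR_ENTRIES 1024          /* 1K entries x 2^2 counters x 2 bits = 8K bits (see slide 53) */

    unsigned char pred[BR_ENTRIES][4];   /* four 2-bit counters per branch entry */
    unsigned int  ghr;                   /* global history; only the low 2 bits are used */

    int predict_taken_2level(unsigned int pc) {
        return pred[pc % BR_ENTRIES][ghr & 3] >= 2;
    }

    void update_2level(unsigned int pc, int taken) {
        unsigned char *c = &pred[pc % BR_ENTRIES][ghr & 3];
        if (taken  && *c < 3) (*c)++;          /* same saturating update as the 2-bit predictor */
        if (!taken && *c > 0) (*c)--;
        ghr = (ghr << 1) | (taken ? 1u : 0u);  /* shift this outcome into the history */
    }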

53 Cost of Two-Level Predictor Number of bits required: Num. branch entries x 2^m x n How many bits in 4096 (0,2) predictor? 8K How many branch entries for an 8K (2,2) predictor? 1K
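Checking the two answers: 4096 entries x 2^0 histories x 2 bits = 8192 bits = 8K bits; spending the same 8K bits on a (2,2) predictor gives 8192 / (2^2 x 2) = 1024 = 1K branch entries.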

54 Performance of (2,2) Predictor [Chart: misprediction rates (0% to 20%) on the SPEC89 benchmarks for three predictors: a 4096-entry (0,2) predictor, an infinite-entry (0,2) predictor, and a 1K-entry (2,2) predictor.]

55 Branch Target Buffer Branch prediction buffer is not useful for DLX Need to know target address by the end of IF Store branch target address also Branch target buffer, or cache Access branch target buffer in IF cycle Hit ==> predicted branch target known at the end of IF We also need to know if the branch is predicted T/NT

56 Branch Target Buffer (continued) Lookup based on PC gives the predicted target No entry found ==> Target = PC+4 Exact match of PC is important Since we are predicting even before knowing that it is a branch instruction Hardware is similar to a cache Need to store predicted PC only for taken predictions

57 Steps in Using a Target Buffer [Flowchart spanning IF, ID, EX: In IF, access the instruction cache and the target buffer. Entry found? Yes ==> use the predicted PC; if the branch is taken as predicted, the prediction was correct, proceed; otherwise it was mispredicted: restart the fetch and delete the buffer entry. Entry not found? If the instruction turns out to be a taken branch, make a new target buffer entry; otherwise normal execution.]
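A hedged C sketch of the lookup and update policy in the flowchart above (entry layout, sizes and names are my assumptions, not a specific design):

    #define BTB_ENTRIES 1024

    /* Store the full PC as the tag (exact match matters, since we predict before
       even knowing the instruction is a branch) and keep entries only for
       branches predicted taken. */
    typedef struct { unsigned int tag, target; int valid; } BTBEntry;

    BTBEntry btb[BTB_ENTRIES];

    /* IF: return the predicted next PC. */
    unsigned int next_pc(unsigned int pc) {
        BTBEntry *e = &btb[pc % BTB_ENTRIES];
        if (e->valid && e->tag == pc) return e->target;   /* hit ==> predicted taken */
        return pc + 4;                                    /* miss ==> fall through   */
    }

    /* EX: once the branch resolves, fix up the buffer. */
    void resolve_branch(unsigned int pc, int taken, unsigned int target) {
        BTBEntry *e = &btb[pc % BTB_ENTRIES];
        int hit = e->valid && e->tag == pc;
        if (hit && !taken)                    /* mispredicted: delete the entry   */
            e->valid = 0;                     /* (fetch restart not modelled here) */
        if (!hit && taken) { e->valid = 1; e->tag = pc; e->target = target; }
    }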

58 Penalties in Branch Prediction
Buffer hit? | Branch taken? | Penalty
Yes | Yes | 0
Yes | No | 2
No | - | 2
Given a prediction accuracy of p, a buffer hit-rate of h, and a taken-branch frequency of f, what is the branch penalty? h x (1-p) x 2 + (1-h) x f x 2
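For example, with assumed (not measured) values p = 0.9, h = 0.9 and f = 0.6, the penalty is 0.9 x 0.1 x 2 + 0.1 x 0.6 x 2 = 0.18 + 0.12 = 0.3 cycles per branch.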

59 Storing Target Instructions Directly store instructions instead of target address Target buffer access is now allowed to take longer Or, branch folding can be achieved Replace fetched instruction with that found in the target buffer entry Zero cycle unconditional branch; may be conditional as well

60 Multiple-Issue Processors Pipelining can achieve CPI close to 1 Mechanisms for handling hazards Static or dynamic scheduling Static or dynamic branch handling Increase in transistor counts (Moore's Law) Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions Per Cycle) Two approaches: superscalar and VLIW
