Instruction Level Parallelism and Its ECE 154B

Size: px
Start display at page:

Download "Instruction Level Parallelism and Its ECE 154B"

Transcription

1 Instruction Level Parallelism and Its Exploitation (part I) ECE 154B Dmitri Strukov

2 Introduction Pipelining become universal technique in 1985 Overlaps execution of instructions Exploits Instruction Level Parallelism Beyond this, there are two main approaches: Hardware based dynamic approaches Used in server and desktop processors Not used as extensively in PMP processors Compiler based static approaches Not as successful outside of scientific applications

3 Instruction Level Parallelism When exploiting instruction level parallelism, goal is p g p,g to minimize CPI Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

4 ILP techniques

5 Data Dependence How much parallelisms exist is a property of code Exploiting the most of it is a goal of ILP Challenges: Dependencies and Hazards Data dependency Name dependency Control dependency Dependency could result in data hazard

6 Data Dependence Data dependency Instruction ti j is data dt dependent d on instruction ti i if Instruction i produces a result that may be used by instruction j Instruction j is data dependent on instruction k and instruction k is data dependent on instruction I Dependent instructions cannot be executed simultaneously Pipeline organization determines if dependence is detected and if it causes a stall Data dependence conveys: Possibility of a hazard Order in which resultsmust be calculated Upper bound on exploitable instruction level parallelism Dependencies that flow through memory locations are difficult to detect

7 Name Dependence Two instructions use the same name but no flow of information Not a true data dependence, but is a problem when reordering instructions Antidependence: instruction j writes a register or memory location that instruction ireads Initial ordering (ibefore j) must be preserved Output dependence: instruction iand instruction j write the same register or memory location Ordering must be preserved To resolve, use renaming techniques

8 Control Dependency Control Dependence Ordering of instruction iwith respect to a branch instruction ti Instruction control dependent on a branch cannot be moved before the branch so that its execution is no longer controller by the branch An instruction not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch

9 Examples Example 1: DADDU R1,R2,R3 BEQZ R4,L DSUBU R1,R1,R6 L: OR R7,R1,R8 OR instruction dependent on DADDU and DSUBU Example 2: DADDU R1,R2,R3 BEQZ R12,skip DSUBU R4,R5,R6 DADDU R5,R4,R9 skip: OR R7,R8,R9 Assume R4 isn t used after skip Possible to move DSUBU bf before the branch

10 Data Hazards Data Hazards Read after write (RAW) occur with true data dependency add ; nand Write after write (WAW) output dependency in pipelines pp that write in more than one pipe pp stage or with out or order execution add ; nand Write after read (WAR) antidependency add ; nand can occur with reoredering

11 Compiler Techniques for Exposing ILP Pipeline scheduling Separate dependent instruction from the source instruction by the pipeline latency of the source instruction Example: for (i=999; i>=0; i=i 1) x[i] = x[i] + s;

12 Pipeline Stalls Loop: L.D F0,0(R1) stall ADD.D F4,F0,F2 stall stall S.D F4,0(R1) DADDUI R1,R1,# 8 stall (assume integer load latency is 1) BNE R1,R2,Loop

13 Pipeline Scheduling Scheduled code: Loop: L.D F0,0(R1) DADDUI R1,R1,# 8 ADD.D F4,F0,F2 stall tll stall S.D F4,8(R1) BNE R1,R2,LoopR2

14 Loop unrolling Parallelism with basic block is limited Typical size of basic block = 3 6 instructions Must optimize across branches Loop Level Parallelism Unroll loop statically or dynamically Use SIMD (vector processors and GPUs)

15 Loop Unrolling Loop unrolling Unroll by a factor of 4 (assume # elements is divisible by 4) Loop: Eliminate unnecessary instructions L.D F0,0(R1) ADD.D F4,F0,F2 S.D F4,0(R1) ;drop DADDUI & BNE L.D F6, 8(R1) ADD.D F8,F6,F2 S.D DF8, 8(R1) ;drop DADDUI & BNE L.D F10, 16(R1) ADD.D F12,F10,F2 S.D F12, 16(R1) ;drop DADDUI & BNE LDF14 L.D F14, 24(R1) ADD.D F16,F14,F2 S.D F16, 24(R1) DADDUI R1,R1,# 32 BNE R1,R2,LoopR2 note: number of live registers vs. original loop

16 Loop Unrolling/Pipeline Scheduling Pipeline schedule the unrolled loop: Loop: L.D F0,0(R1) L.D F6, 8(R1) LDF10 L.D F10, 16(R1) 16(R1) L.D F14, 24(R1) ADD.D F4,F0,F2 ADD.D F8,F6,F2 ADD.D F12,F10,F2 ADD.D F16,F14,F2 S.D F4,0(R1) S.D F8, 8(R1) DADDUI R1,R1,# 32 S.D F12,16(R1) S.D F16,8(R1) BNE R1,R2,Loop R2

17 Strip Mining Unknown number of loop iterations? Number of iterations = n Goal: make k copies of the loop body Generate pair of loops: First executes n mod k times Second executes n / k times Strip mining

18 Challenges to Loop Unrolling Limited benefit for large k (Amdahl's law) Code size Register pressure

19 Out of Order Order Execution: Scoreboarding

20 A More Realistic Pipeline

21 Out of Order Execution Variable latencies make out of order execution desirable How do we prevent WAR and WAW hazards? How do we deal with variable latency? Forwarding for RAW hazards harder. Instruction add mul div add sub IF ID EX M WB IF ID E1 E2 E3 E4 E5 E6 E7 M WB IF ID x x x x x x E1 E2 E3 E4 IF ID EX M WB IF ID EX M WB

22 Scoreboard: a Bookkeeping Technique Out of order execution divides ID stage: 1. Issue decode instructions, check for structural hazards 2. Read operands wait until no data hazards, then read operands Scoreboards date to CDC6600 in 1963 Instructions execute whenever not dependent on previous instructions and no hazards. CDC 6600: In order issue, out of order execution, outof order commit (or completion) No forwarding! Imprecise interrupt/exception model

23 Scoreboard Architecture (CDC 6600) Registe ers FP Mult FP Mult FP Divide FP Add Integer Units Func ctional SCOREBOARD Memory Fetch/Issue

24 Scoreboard Implications Out of order completion => WAR, WAW hazards? Solutions for WAR: Stallwriteback until all prior uses of the destination register have been read Read registers only during Read Operands stage Solution for WAW: Detect hazard and stall issue of new instruction until other instruction completes Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units Scoreboard keeps track of dependencies between instructions that have already issued Scoreboard replaces ID, EX, M and WB with 4 stages

25 Four Stages of Scoreboard Control Issue decodeinstructions & check for structural hazards (ID1) Instructions issued in program order (for hazard checking) Don t issue if structural hazard Don t issue if instruction is output dependent on any previously issued but uncompleted instruction (no WAW hazards) Read operands wait until no data hazards, then read operands (ID2) All real dependencies (RAW hazards) resolved in this stage, since we wait for instructions to write back data. No forwarding of data in this model!

26 Four Stages of Scoreboard Control Execution operate p on operands (EX) The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution. This includes memory ops. Write result finish execution (WB) Stalluntil no WAR hazards withprevious instructions: Example: DIV ADD NAND CDC 6600 scoreboard would stall NAND until ADD reads operands

27 Three Parts of the Scoreboard Instruction status: Which state the instruction is in Functional unit status: Indicates the state of the functional unit (FU) nine fields for each functional unit Busy: Indicates whether the unit is busy or not Op: Operation to perform in the unit (e.g., + or ) Fi: Destination register number Fj,Fk: Source register numbers Qj,Qk: Functional units producing source registers Fj, Fk Rj,Rk: Flags indicating when Fj, Fk are ready Register result status Indicates which functional unit will write each register, if one exists Blank when no pending instructions will write that register

28 Instruction status: Scoreboard Example Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 FU

29 Detailed Scoreboard Pipeline Control Instruction status Issue Read operands Execution complete Write result Wait until Not busy (FU) and not result(d) Rj and Rk Functional unit done f((fj(f) Fi(FU) or Rj(f)=No) & (Fk(f) Fi(FU) or Rk( f )=No)) Bookkeeping Busy(FU) yes; Op(FU) op; Fi(FU) `D ; Fj(FU) `S1 ; Fk(FU) `S2 ; Qj Result( S1 ); Qk Result(`S2 ); Rj not Qj; Rk not Qk; Result( D ) FU; Rj No; Rk No f(if Qj(f)=FU then Rj(f) Yes); f(if Qk(f)=FU then Rj(f) Yes); Result(Fi(FU)) ( 0; Busy(FU) No

30 Instruction status: Scoreboard Example: Cycle 1 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Yes Mult1 No Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 1 FU Integer

31 Instruction status: Scoreboard Example: Cycle 2 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Yes Mult1 No Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 2 FU Integer Issue 2nd LD?

32 Instruction status: Scoreboard Example: Cycle 3 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 No Mult1 No Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 3 FU Integer Issue MULT?

33 Instruction status: Scoreboard Example: Cycle 4 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 4 FU

34 Instruction status: Scoreboard Example: Cycle 5 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R3 5 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Yes Mult1 No Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 5 FU Integer

35 Instruction status: Scoreboard Example: Cycle 6 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R3 5 6 MULTD F0 F2 F4 6 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Yes Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 6 FU Mult1 Integer

36 Instruction status: Scoreboard Example: Cycle 7 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 No Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add Yes Sub F8 F6 F2 Integer Yes No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 7 FU Mult1 Integer Add Read multiply operands?

37 Scoreboard Example: Cycle 8 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No Add Yes Sub F8 F6 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 8 FU Mult1 Add Divide

38 Instruction status: Scoreboard Example: Cycle 9 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Note Integer No 10 Mult1 Yes Mult F0 F2 F4 Yes Yes Remaining Mult2 No 2Add Yes Sub F8 F6 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 9 FU Mult1 Add Divide Read operands for MULT & SUB? Issue ADDD?

39 Scoreboard Example: Cycle 10 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 9 Mult1 Yes Mult F0 F2 F4 No No Mult2 No 1Add Yes Sub F8 F6 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 10 FU Mult1 Add Divide

40 Instruction status: Scoreboard Example: Cycle 11 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 8 Mult1 Yes Mult F0 F2 F4 No No Mult2 No 0Add Yes Sub F8 F6 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 11 FU Mult1 Add Divide

41 Scoreboard Example: Cycle 12 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 7 Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 12 FU Mult1 Divide Read operands for DIVD?

42 Scoreboard Example: Cycle 13 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F2 13 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 6 Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 13 FU Mult1 Add Divide

43 Scoreboard Example: Cycle 14 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 5 Mult1 Yes Mult F0 F2 F4 No No Mult2 No 2 Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 14 FU Mult1 Add Divide

44 Scoreboard Example: Cycle 15 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 4 Mult1 Yes Mult F0 F2 F4 No No Mult2 No 1 Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 15 FU Mult1 Add Divide

45 Scoreboard Example: Cycle 16 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 3 Mult1 Yes Mult F0 F2 F4 No No Mult2 No 0 Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 16 FU Mult1 Add Divide

46 Scoreboard Example: Cycle 17 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F WAR Hazard! Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 2 Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 17 FU Mult1 Add Divide Why not write result of ADD???

47 Scoreboard Example: Cycle 18 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 1 Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 18 FU Mult1 Add Divide

48 Scoreboard Example: Cycle 19 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 0 Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 19 FU Mult1 Add Divide

49 Scoreboard Example: Cycle 20 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Yes Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 20 FU Add Divide

50 Scoreboard Example: Cycle 21 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Yes Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 21 FU Add Divide WAR Hazard is now gone...

51 Scoreboard Example: Cycle 22 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No 39 Divide Yes Div F10 F0 F6 No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 22 FU Divide

52

53 Scoreboard Example: Cycle 61 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No 0 Divide Yes Div F10 F0 F6 No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 61 FU Divide

54 Scoreboard Example: Cycle 62 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 62 FU

55 Scoreboard Example: Cycle 62 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 62 FU In-order issue; out-of-order execute & commit

56 CDC 6600 Scoreboard Historical context: Speedup 1.7 from compiler; 2.5 by hand BUT slow memory (no cache) limits benefit Limitations of 6600 scoreboard: No forwarding hardware Limited to instructions in basic block (small window) Small number of functional units (structural hazards), especially integer/load store units Do not issue on structural hazards Wait for WAR hazards Prevent WAW hazards Precise interrupts?

57 Acknowledgements Some of the slides contain material developed and copyrighted by Sally A. McKee (Cornell University) and instructor material for the textbook 57

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Course on: Advanced Computer Architectures INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Prof. Cristina Silvano Politecnico di Milano cristina.silvano@polimi.it Prof. Silvano, Politecnico di Milano

More information

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches: Multiple-Issue Processors Pipelining can achieve CPI close to 1 Mechanisms for handling hazards Static or dynamic scheduling Static or dynamic branch handling Increase in transistor counts (Moore s Law):

More information

Module: Software Instruction Scheduling Part I

Module: Software Instruction Scheduling Part I Module: Software Instruction Scheduling Part I Sudhakar Yalamanchili, Georgia Institute of Technology Reading for this Module Loop Unrolling and Instruction Scheduling Section 2.2 Dependence Analysis Section

More information

WAR: Write After Read

WAR: Write After Read WAR: Write After Read write-after-read (WAR) = artificial (name) dependence add R1, R2, R3 sub R2, R4, R1 or R1, R6, R3 problem: add could use wrong value for R2 can t happen in vanilla pipeline (reads

More information

More on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction

More information

Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes

Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes basic pipeline: single, in-order issue first extension: multiple issue (superscalar) second extension: scheduling instructions for more ILP option #1: dynamic scheduling (by the hardware) option #2: static

More information

Pipelining Review and Its Limitations

Pipelining Review and Its Limitations Pipelining Review and Its Limitations Yuri Baida yuri.baida@gmail.com yuriy.v.baida@intel.com October 16, 2010 Moscow Institute of Physics and Technology Agenda Review Instruction set architecture Basic

More information

VLIW Processors. VLIW Processors

VLIW Processors. VLIW Processors 1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW

More information

Computer Architecture TDTS10

Computer Architecture TDTS10 why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers

More information

Lecture: Pipelining Extensions. Topics: control hazards, multi-cycle instructions, pipelining equations

Lecture: Pipelining Extensions. Topics: control hazards, multi-cycle instructions, pipelining equations Lecture: Pipelining Extensions Topics: control hazards, multi-cycle instructions, pipelining equations 1 Problem 6 Show the instruction occupying each stage in each cycle (with bypassing) if I1 is R1+R2

More information

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern: Pipelining HW Q. Can a MIPS SW instruction executing in a simple 5-stage pipelined implementation have a data dependency hazard of any type resulting in a nop bubble? If so, show an example; if not, prove

More information

Superscalar Processors. Superscalar Processors: Branch Prediction Dynamic Scheduling. Superscalar: A Sequential Architecture. Superscalar Terminology

Superscalar Processors. Superscalar Processors: Branch Prediction Dynamic Scheduling. Superscalar: A Sequential Architecture. Superscalar Terminology Superscalar Processors Superscalar Processors: Branch Prediction Dynamic Scheduling Superscalar: A Sequential Architecture Superscalar processor is a representative ILP implementation of a sequential architecture

More information

Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1

Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1 Pipeline Hazards Structure hazard Data hazard Pipeline hazard: the major hurdle A hazard is a condition that prevents an instruction in the pipe from executing its next scheduled pipe stage Taxonomy of

More information

PROBLEMS #20,R0,R1 #$3A,R2,R4

PROBLEMS #20,R0,R1 #$3A,R2,R4 506 CHAPTER 8 PIPELINING (Corrisponde al cap. 11 - Introduzione al pipelining) PROBLEMS 8.1 Consider the following sequence of instructions Mul And #20,R0,R1 #3,R2,R3 #$3A,R2,R4 R0,R2,R5 In all instructions,

More information

CPU Performance Equation

CPU Performance Equation CPU Performance Equation C T I T ime for task = C T I =Average # Cycles per instruction =Time per cycle =Instructions per task Pipelining e.g. 3-5 pipeline steps (ARM, SA, R3000) Attempt to get C down

More information

IA-64 Application Developer s Architecture Guide

IA-64 Application Developer s Architecture Guide IA-64 Application Developer s Architecture Guide The IA-64 architecture was designed to overcome the performance limitations of today s architectures and provide maximum headroom for the future. To achieve

More information

Software Pipelining. Y.N. Srikant. NPTEL Course on Compiler Design. Department of Computer Science Indian Institute of Science Bangalore 560 012

Software Pipelining. Y.N. Srikant. NPTEL Course on Compiler Design. Department of Computer Science Indian Institute of Science Bangalore 560 012 Department of Computer Science Indian Institute of Science Bangalore 560 2 NPTEL Course on Compiler Design Introduction to Overlaps execution of instructions from multiple iterations of a loop Executes

More information

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers CS 4 Introduction to Compilers ndrew Myers Cornell University dministration Prelim tomorrow evening No class Wednesday P due in days Optional reading: Muchnick 7 Lecture : Instruction scheduling pr 0 Modern

More information

Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats

Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats Execution Cycle Pipelining CSE 410, Spring 2005 Computer Systems http://www.cs.washington.edu/410 1. Instruction Fetch 2. Instruction Decode 3. Execute 4. Memory 5. Write Back IF and ID Stages 1. Instruction

More information

Instruction Set Design

Instruction Set Design Instruction Set Design Instruction Set Architecture: to what purpose? ISA provides the level of abstraction between the software and the hardware One of the most important abstraction in CS It s narrow,

More information

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING 2013/2014 1 st Semester Sample Exam January 2014 Duration: 2h00 - No extra material allowed. This includes notes, scratch paper, calculator, etc.

More information

Week 1 out-of-class notes, discussions and sample problems

Week 1 out-of-class notes, discussions and sample problems Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types

More information

The Microarchitecture of Superscalar Processors

The Microarchitecture of Superscalar Processors The Microarchitecture of Superscalar Processors James E. Smith Department of Electrical and Computer Engineering 1415 Johnson Drive Madison, WI 53706 ph: (608)-265-5737 fax:(608)-262-1267 email: jes@ece.wisc.edu

More information

CS521 CSE IITG 11/23/2012

CS521 CSE IITG 11/23/2012 CS521 CSE TG 11/23/2012 A Sahu 1 Degree of overlap Serial, Overlapped, d, Super pipelined/superscalar Depth Shallow, Deep Structure Linear, Non linear Scheduling of operations Static, Dynamic A Sahu slide

More information

CS352H: Computer Systems Architecture

CS352H: Computer Systems Architecture CS352H: Computer Systems Architecture Topic 9: MIPS Pipeline - Hazards October 1, 2009 University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell Data Hazards in ALU Instructions

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

EE282 Computer Architecture and Organization Midterm Exam February 13, 2001. (Total Time = 120 minutes, Total Points = 100)

EE282 Computer Architecture and Organization Midterm Exam February 13, 2001. (Total Time = 120 minutes, Total Points = 100) EE282 Computer Architecture and Organization Midterm Exam February 13, 2001 (Total Time = 120 minutes, Total Points = 100) Name: (please print) Wolfe - Solution In recognition of and in the spirit of the

More information

A Lab Course on Computer Architecture

A Lab Course on Computer Architecture A Lab Course on Computer Architecture Pedro López José Duato Depto. de Informática de Sistemas y Computadores Facultad de Informática Universidad Politécnica de Valencia Camino de Vera s/n, 46071 - Valencia,

More information

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Quiz for Chapter 1 Computer Abstractions and Technology 3.10 Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,

More information

Instruction scheduling

Instruction scheduling Instruction ordering Instruction scheduling Advanced Compiler Construction Michel Schinz 2015 05 21 When a compiler emits the instructions corresponding to a program, it imposes a total order on them.

More information

Midterm I SOLUTIONS March 21 st, 2012 CS252 Graduate Computer Architecture

Midterm I SOLUTIONS March 21 st, 2012 CS252 Graduate Computer Architecture University of California, Berkeley College of Engineering Computer Science Division EECS Spring 2012 John Kubiatowicz Midterm I SOLUTIONS March 21 st, 2012 CS252 Graduate Computer Architecture Your Name:

More information

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of CS:APP Chapter 4 Computer Architecture Wrap-Up William J. Taffe Plymouth State University using the slides of Randal E. Bryant Carnegie Mellon University Overview Wrap-Up of PIPE Design Performance analysis

More information

SOC architecture and design

SOC architecture and design SOC architecture and design system-on-chip (SOC) processors: become components in a system SOC covers many topics processor: pipelined, superscalar, VLIW, array, vector storage: cache, embedded and external

More information

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 05: Array Processors

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 05: Array Processors Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors Lesson 05: Array Processors Objective To learn how the array processes in multiple pipelines 2 Array Processor

More information

Design of Pipelined MIPS Processor. Sept. 24 & 26, 1997

Design of Pipelined MIPS Processor. Sept. 24 & 26, 1997 Design of Pipelined MIPS Processor Sept. 24 & 26, 1997 Topics Instruction processing Principles of pipelining Inserting pipe registers Data Hazards Control Hazards Exceptions MIPS architecture subset R-type

More information

Pipeline Hazards. Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by Arvind and Krste Asanovic

Pipeline Hazards. Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by Arvind and Krste Asanovic 1 Pipeline Hazards Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by and Krste Asanovic 6.823 L6-2 Technology Assumptions A small amount of very fast memory

More information

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA AGENDA INTRO TO BEAGLEBONE BLACK HARDWARE & SPECS CORTEX-A8 ARMV7 PROCESSOR PROS & CONS VS RASPBERRY PI WHEN TO USE BEAGLEBONE BLACK Single

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering. EEC180B Lab 7: MISP Processor Design Spring 1995

UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering. EEC180B Lab 7: MISP Processor Design Spring 1995 UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering EEC180B Lab 7: MISP Processor Design Spring 1995 Objective: In this lab, you will complete the design of the MISP processor,

More information

Solutions. Solution 4.1. 4.1.1 The values of the signals are as follows:

Solutions. Solution 4.1. 4.1.1 The values of the signals are as follows: 4 Solutions Solution 4.1 4.1.1 The values of the signals are as follows: RegWrite MemRead ALUMux MemWrite ALUOp RegMux Branch a. 1 0 0 (Reg) 0 Add 1 (ALU) 0 b. 1 1 1 (Imm) 0 Add 1 (Mem) 0 ALUMux is the

More information

Software Pipelining by Modulo Scheduling. Philip Sweany University of North Texas

Software Pipelining by Modulo Scheduling. Philip Sweany University of North Texas Software Pipelining by Modulo Scheduling Philip Sweany University of North Texas Overview Opportunities for Loop Optimization Software Pipelining Modulo Scheduling Resource and Dependence Constraints Scheduling

More information

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek Instruction Set Architecture or How to talk to computers if you aren t in Star Trek The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture

More information

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27 Logistics Week 1: Wednesday, Jan 27 Because of overcrowding, we will be changing to a new room on Monday (Snee 1120). Accounts on the class cluster (crocus.csuglab.cornell.edu) will be available next week.

More information

Giving credit where credit is due

Giving credit where credit is due CSCE 230J Computer Organization Processor Architecture VI: Wrap-Up Dr. Steve Goddard goddard@cse.unl.edu http://cse.unl.edu/~goddard/courses/csce230j Giving credit where credit is due ost of slides for

More information

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Chapter 12: Multiprocessor Architectures Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Objective Be familiar with basic multiprocessor architectures and be able to

More information

Performance evaluation

Performance evaluation Performance evaluation Arquitecturas Avanzadas de Computadores - 2547021 Departamento de Ingeniería Electrónica y de Telecomunicaciones Facultad de Ingeniería 2015-1 Bibliography and evaluation Bibliography

More information

Thread level parallelism

Thread level parallelism Thread level parallelism ILP is used in straight line code or loops Cache miss (off-chip cache and main memory) is unlikely to be hidden using ILP. Thread level parallelism is used instead. Thread: process

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

Course on Advanced Computer Architectures

Course on Advanced Computer Architectures Course on Advanced Computer Architectures Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, September 3rd, 2015 Prof. C. Silvano EX1A ( 2 points) EX1B ( 2

More information

Computer Organization and Components

Computer Organization and Components Computer Organization and Components IS5, fall 25 Lecture : Pipelined Processors ssociate Professor, KTH Royal Institute of Technology ssistant Research ngineer, University of California, Berkeley Slides

More information

GPUs for Scientific Computing

GPUs for Scientific Computing GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

More information

18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two

18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,

More information

Annotation to the assignments and the solution sheet. Note the following points

Annotation to the assignments and the solution sheet. Note the following points Computer rchitecture 2 / dvanced Computer rchitecture Seite: 1 nnotation to the assignments and the solution sheet This is a multiple choice examination, that means: Solution approaches are not assessed

More information

CHAPTER 7: The CPU and Memory

CHAPTER 7: The CPU and Memory CHAPTER 7: The CPU and Memory The Architecture of Computer Hardware, Systems Software & Networking: An Information Technology Approach 4th Edition, Irv Englander John Wiley and Sons 2010 PowerPoint slides

More information

EECS 583 Class 11 Instruction Scheduling Software Pipelining Intro

EECS 583 Class 11 Instruction Scheduling Software Pipelining Intro EECS 58 Class Instruction Scheduling Software Pipelining Intro University of Michigan October 8, 04 Announcements & Reading Material Reminder: HW Class project proposals» Signup sheet available next Weds

More information

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs. This Unit CIS 501: Computer Architecture Unit 4: Performance & Benchmarking Metrics Latency and throughput Speedup Averaging CPU Performance Performance Pitfalls Slides'developed'by'Milo'Mar0n'&'Amir'Roth'at'the'University'of'Pennsylvania'

More information

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000. ILP Execution

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000. ILP Execution EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000 Lecture #11: Wednesday, 3 May 2000 Lecturer: Ben Serebrin Scribe: Dean Liu ILP Execution

More information

Software Programmable DSP Platform Analysis Episode 7, Monday 19 March 2007, Ingredients. Software Pipelining. Data Dependence. Resource Constraints

Software Programmable DSP Platform Analysis Episode 7, Monday 19 March 2007, Ingredients. Software Pipelining. Data Dependence. Resource Constraints Software Programmable DSP Platform Analysis Episode 7, Monday 19 March 7, Ingredients Software Pipelining Data & Resource Constraints Resource Constraints in C67x Loop Scheduling Without Resource Bounds

More information

Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield.

Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield. Technical Report UCAM-CL-TR-707 ISSN 1476-2986 Number 707 Computer Laboratory Complexity-effective superscalar embedded processors using instruction-level distributed processing Ian Caulfield December

More information

AN IMPLEMENTATION OF SWING MODULO SCHEDULING WITH EXTENSIONS FOR SUPERBLOCKS TANYA M. LATTNER

AN IMPLEMENTATION OF SWING MODULO SCHEDULING WITH EXTENSIONS FOR SUPERBLOCKS TANYA M. LATTNER AN IMPLEMENTATION OF SWING MODULO SCHEDULING WITH EXTENSIONS FOR SUPERBLOCKS BY TANYA M. LATTNER B.S., University of Portland, 2000 THESIS Submitted in partial fulfillment of the requirements for the degree

More information

PROBLEMS. which was discussed in Section 1.6.3.

PROBLEMS. which was discussed in Section 1.6.3. 22 CHAPTER 1 BASIC STRUCTURE OF COMPUTERS (Corrisponde al cap. 1 - Introduzione al calcolatore) PROBLEMS 1.1 List the steps needed to execute the machine instruction LOCA,R0 in terms of transfers between

More information

Chapter 5 Instructor's Manual

Chapter 5 Instructor's Manual The Essentials of Computer Organization and Architecture Linda Null and Julia Lobur Jones and Bartlett Publishers, 2003 Chapter 5 Instructor's Manual Chapter Objectives Chapter 5, A Closer Look at Instruction

More information

Software implementation of Post-Quantum Cryptography

Software implementation of Post-Quantum Cryptography Software implementation of Post-Quantum Cryptography Peter Schwabe Radboud University Nijmegen, The Netherlands October 20, 2013 ASCrypto 2013, Florianópolis, Brazil Part I Optimizing cryptographic software

More information

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),

More information

LSN 2 Computer Processors

LSN 2 Computer Processors LSN 2 Computer Processors Department of Engineering Technology LSN 2 Computer Processors Microprocessors Design Instruction set Processor organization Processor performance Bandwidth Clock speed LSN 2

More information

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX Overview CISC Developments Over Twenty Years Classic CISC design: Digital VAX VAXÕs RISC successor: PRISM/Alpha IntelÕs ubiquitous 80x86 architecture Ð 8086 through the Pentium Pro (P6) RJS 2/3/97 Philosophy

More information

GPU Architecture. An OpenCL Programmer s Introduction. Lee Howes November 3, 2010

GPU Architecture. An OpenCL Programmer s Introduction. Lee Howes November 3, 2010 GPU Architecture An OpenCL Programmer s Introduction Lee Howes November 3, 2010 The aim of this webinar To provide a general background to modern GPU architectures To place the AMD GPU designs in context:

More information

UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS

UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS Structure Page Nos. 2.0 Introduction 27 2.1 Objectives 27 2.2 Types of Classification 28 2.3 Flynn s Classification 28 2.3.1 Instruction Cycle 2.3.2 Instruction

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

EE361: Digital Computer Organization Course Syllabus

EE361: Digital Computer Organization Course Syllabus EE361: Digital Computer Organization Course Syllabus Dr. Mohammad H. Awedh Spring 2014 Course Objectives Simply, a computer is a set of components (Processor, Memory and Storage, Input/Output Devices)

More information

StrongARM** SA-110 Microprocessor Instruction Timing

StrongARM** SA-110 Microprocessor Instruction Timing StrongARM** SA-110 Microprocessor Instruction Timing Application Note September 1998 Order Number: 278194-001 Information in this document is provided in connection with Intel products. No license, express

More information

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle? Lecture 3: Evaluating Computer Architectures Announcements - Reminder: Homework 1 due Thursday 2/2 Last Time technology back ground Computer elements Circuits and timing Virtuous cycle of the past and

More information

Computer Architecture Syllabus of Qualifying Examination

Computer Architecture Syllabus of Qualifying Examination Computer Architecture Syllabus of Qualifying Examination PhD in Engineering with a focus in Computer Science Reference course: CS 5200 Computer Architecture, College of EAS, UCCS Created by Prof. Xiaobo

More information

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 IBM CELL. Politecnico di Milano Como Campus

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 IBM CELL. Politecnico di Milano Como Campus Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 CELL INTRODUCTION 2 1 CELL SYNERGY Cell is not a collection of different processors, but a synergistic whole Operation paradigms,

More information

Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

More information

Introduction to GPU Architecture

Introduction to GPU Architecture Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three

More information

High Performance Processor Architecture. André Seznec IRISA/INRIA ALF project-team

High Performance Processor Architecture. André Seznec IRISA/INRIA ALF project-team High Performance Processor Architecture André Seznec IRISA/INRIA ALF project-team 1 2 Moore s «Law» Nb of transistors on a micro processor chip doubles every 18 months 1972: 2000 transistors (Intel 4004)

More information

Software Pipelining - Modulo Scheduling

Software Pipelining - Modulo Scheduling EECS 583 Class 12 Software Pipelining - Modulo Scheduling University of Michigan October 15, 2014 Announcements + Reading Material HW 2 Due this Thursday Today s class reading» Iterative Modulo Scheduling:

More information

Using Graphics and Animation to Visualize Instruction Pipelining and its Hazards

Using Graphics and Animation to Visualize Instruction Pipelining and its Hazards Using Graphics and Animation to Visualize Instruction Pipelining and its Hazards Per Stenström, Håkan Nilsson, and Jonas Skeppstedt Department of Computer Engineering, Lund University P.O. Box 118, S-221

More information

FLIX: Fast Relief for Performance-Hungry Embedded Applications

FLIX: Fast Relief for Performance-Hungry Embedded Applications FLIX: Fast Relief for Performance-Hungry Embedded Applications Tensilica Inc. February 25 25 Tensilica, Inc. 25 Tensilica, Inc. ii Contents FLIX: Fast Relief for Performance-Hungry Embedded Applications...

More information

Performance monitoring of the software frameworks for LHC experiments

Performance monitoring of the software frameworks for LHC experiments Performance monitoring of the software frameworks for LHC experiments William A. Romero R. wil-rome@uniandes.edu.co J.M. Dana Jose.Dana@cern.ch First EELA-2 Conference Bogotá, COL OUTLINE Introduction

More information

Instruction Level Parallelism I: Pipelining

Instruction Level Parallelism I: Pipelining Instruction Level Parallelism I: Pipelining Readings: H&P Appendix A Instruction Level Parallelism I: Pipelining 1 This Unit: Pipelining Application OS Compiler Firmware CPU I/O Memory Digital Circuits

More information

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune Introduction to RISC Processor ni logic Pvt. Ltd., Pune AGENDA What is RISC & its History What is meant by RISC Architecture of MIPS-R4000 Processor Difference Between RISC and CISC Pros and Cons of RISC

More information

An Introduction to the ARM 7 Architecture

An Introduction to the ARM 7 Architecture An Introduction to the ARM 7 Architecture Trevor Martin CEng, MIEE Technical Director This article gives an overview of the ARM 7 architecture and a description of its major features for a developer new

More information

High-Performance Modular Multiplication on the Cell Processor

High-Performance Modular Multiplication on the Cell Processor High-Performance Modular Multiplication on the Cell Processor Joppe W. Bos Laboratory for Cryptologic Algorithms EPFL, Lausanne, Switzerland joppe.bos@epfl.ch 1 / 19 Outline Motivation and previous work

More information

Addressing The problem. When & Where do we encounter Data? The concept of addressing data' in computations. The implications for our machine design(s)

Addressing The problem. When & Where do we encounter Data? The concept of addressing data' in computations. The implications for our machine design(s) Addressing The problem Objectives:- When & Where do we encounter Data? The concept of addressing data' in computations The implications for our machine design(s) Introducing the stack-machine concept Slide

More information

Instruction Set Architecture

Instruction Set Architecture Instruction Set Architecture Consider x := y+z. (x, y, z are memory variables) 1-address instructions 2-address instructions LOAD y (r :=y) ADD y,z (y := y+z) ADD z (r:=r+z) MOVE x,y (x := y) STORE x (x:=r)

More information

SPARC64 VIIIfx: CPU for the K computer

SPARC64 VIIIfx: CPU for the K computer SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS

More information

Processor Architectures

Processor Architectures ECPE 170 Jeff Shafer University of the Pacific Processor Architectures 2 Schedule Exam 3 Tuesday, December 6 th Caches Virtual Memory Input / Output OperaKng Systems Compilers & Assemblers Processor Architecture

More information

Design Cycle for Microprocessors

Design Cycle for Microprocessors Cycle for Microprocessors Raúl Martínez Intel Barcelona Research Center Cursos de Verano 2010 UCLM Intel Corporation, 2010 Agenda Introduction plan Architecture Microarchitecture Logic Silicon ramp Types

More information

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University cwliu@twins.ee.nctu.edu.

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University cwliu@twins.ee.nctu.edu. Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University cwliu@twins.ee.nctu.edu.tw Review Computers in mid 50 s Hardware was expensive

More information

Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately.

Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately. Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately. Hardware Solution Evolution of Computer Architectures Micro-Scopic View Clock Rate Limits Have Been Reached

More information

Chapter 4 Lecture 5 The Microarchitecture Level Integer JAVA Virtual Machine

Chapter 4 Lecture 5 The Microarchitecture Level Integer JAVA Virtual Machine Chapter 4 Lecture 5 The Microarchitecture Level Integer JAVA Virtual Machine This is a limited version of a hardware implementation to execute the JAVA programming language. 1 of 23 Structured Computer

More information

Architecture of Hitachi SR-8000

Architecture of Hitachi SR-8000 Architecture of Hitachi SR-8000 University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Slide 1 Most of the slides from Hitachi Slide 2 the problem modern computer are data

More information

Central Processing Unit (CPU)

Central Processing Unit (CPU) Central Processing Unit (CPU) CPU is the heart and brain It interprets and executes machine level instructions Controls data transfer from/to Main Memory (MM) and CPU Detects any errors In the following

More information

Using Power to Improve C Programming Education

Using Power to Improve C Programming Education Using Power to Improve C Programming Education Jonas Skeppstedt Department of Computer Science Lund University Lund, Sweden jonas.skeppstedt@cs.lth.se jonasskeppstedt.net jonasskeppstedt.net jonas.skeppstedt@cs.lth.se

More information

Let s put together a Manual Processor

Let s put together a Manual Processor Lecture 14 Let s put together a Manual Processor Hardware Lecture 14 Slide 1 The processor Inside every computer there is at least one processor which can take an instruction, some operands and produce

More information

OAMulator. Online One Address Machine emulator and OAMPL compiler. http://myspiders.biz.uiowa.edu/~fil/oam/

OAMulator. Online One Address Machine emulator and OAMPL compiler. http://myspiders.biz.uiowa.edu/~fil/oam/ OAMulator Online One Address Machine emulator and OAMPL compiler http://myspiders.biz.uiowa.edu/~fil/oam/ OAMulator educational goals OAM emulator concepts Von Neumann architecture Registers, ALU, controller

More information

CMSC 611: Advanced Computer Architecture

CMSC 611: Advanced Computer Architecture CMSC 611: Advanced Computer Architecture Parallel Computation Most slides adapted from David Patterson. Some from Mohomed Younis Parallel Computers Definition: A parallel computer is a collection of processing

More information