Instruction-Level Parallelism and Its Exploitation (part I and II) ECE 154B Dmitri Strukov

Transcription

1 Instruction-Level Parallelism and Its Exploitation (part I and II) ECE 154B Dmitri Strukov

2 Introduction Pipelining become universal technique in 1985 Overlaps execution of instructions Exploits Instruction Level Parallelism Beyond this, there are two main approaches: Hardware-based dynamic approaches Used in server and desktop processors Not used as extensively in PMP processors Compiler-based static approaches Not as successful outside of scientific applications

3 Instruction-Level Parallelism When exploiting instruction-level parallelism, goal is to minimize CPI Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

4 ILP techniques

5 Data Dependence How much parallelisms exist is a property of code Exploiting the most of it is a goal of ILP Challenges: Dependencies and Hazards Data dependency Name dependency Control dependency Dependency could result in data hazard

6 Data Dependence Data dependency Instruction j is data dependent on instruction i if Instruction i produces a result that may be used by instruction j Instruction j is data dependent on instruction k and instruction k is data dependent on instruction I Dependent instructions cannot be executed simultaneously Pipeline organization determines if dependence is detected and if it causes a stall Data dependence conveys: Possibility of a hazard Order in which results must be calculated Upper bound on exploitable instruction level parallelism Dependencies that flow through memory locations are difficult to detect

7 Name Dependence Two instructions use the same name but no flow of information Not a true data dependence, but is a problem when reordering instructions Antidependence: instruction j writes a register or memory location that instruction i reads Initial ordering (i before j) must be preserved Output dependence: instruction i and instruction j write the same register or memory location Ordering must be preserved To resolve, use renaming techniques

8 Data Hazards Data Hazards Read after write (RAW) occur with true data dependency add ; nand Write after write (WAW) output dependency in pipelines that write in more than one pipe stage or with out-ororder execution add ; nand Write after read (WAR) antidependency add ; nand can occur with reoredering

9 Compiler Techniques for Exposing ILP Pipeline scheduling Separate dependent instruction from the source instruction by the pipeline latency of the source instruction Example: for (i=999; i>=0; i=i-1) x[i] = x[i] + s;

10 A More Realistic Pipeline

11 Pipeline Stalls Loop: L.D F0,0(R1) stall ADD.D F4,F0,F2 stall stall S.D F4,0(R1) DADDUI R1,R1,#-8 stall (assume integer load latency is 1) BNE R1,R2,Loop

12 Scheduled code: Loop: L.D F0,0(R1) DADDUI R1,R1,#-8 ADD.D F4,F0,F2 stall stall S.D F4,8(R1) BNE R1,R2,Loop Pipeline Scheduling

13 Loop unrolling Parallelism with basic block is limited Typical size of basic block = 3-6 instructions Must optimize across branches Loop-Level Parallelism Unroll loop statically or dynamically Use SIMD (vector processors and GPUs)

14 Loop Unrolling Loop unrolling Unroll by a factor of 4 (assume # elements is divisible by 4) Loop: Eliminate unnecessary instructions L.D F0,0(R1) ADD.D F4,F0,F2 S.D F4,0(R1) ;drop DADDUI & BNE L.D F6,-8(R1) ADD.D F8,F6,F2 S.D F8,-8(R1) ;drop DADDUI & BNE L.D F10,-16(R1) ADD.D F12,F10,F2 S.D F12,-16(R1) ;drop DADDUI & BNE L.D F14,-24(R1) ADD.D F16,F14,F2 S.D F16,-24(R1) DADDUI R1,R1,#-32 BNE R1,R2,Loop note: number of live registers vs. original loop

15 Loop Unrolling/Pipeline Scheduling Pipeline schedule the unrolled loop: Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) ADD.D F4,F0,F2 ADD.D F8,F6,F2 ADD.D F12,F10,F2 ADD.D F16,F14,F2 S.D F4,0(R1) S.D F8,-8(R1) DADDUI R1,R1,#-32 S.D F12,16(R1) S.D F16,8(R1) BNE R1,R2,Loop

16 Strip Mining Unknown number of loop iterations? Number of iterations = n Goal: make k copies of the loop body Generate pair of loops: First executes n mod k times Second executes n / k times Strip mining

17 Challenges to Loop Unrolling Limited benefit for large k (Amdahl's law) Code size Register pressure

18 Out-of-Order Execution: Scoreboarding

19 Out-of-Order Execution Variable latencies make out-of-order execution desirable How do we prevent WAR and WAW hazards? How do we deal with variable latency? Forwarding for RAW hazards harder. Instruction add mul div add sub IF ID EX M WB IF ID E1 E2 E3 E4 E5 E6 E7 M WB IF ID x x x x x x E1 E2 E3 E4 IF ID EX M WB IF ID EX M WB

20 Scoreboard: a Bookkeeping Technique Out-of-order execution divides ID stage: 1. Issue decode instructions, check for structural hazards 2. Read operands wait until no data hazards, then read operands Scoreboards date to CDC6600 in 1963 Instructions execute whenever not dependent on previous instructions and no hazards. CDC 6600: In order issue, out-of-order execution, outof-order commit (or completion) No forwarding! Imprecise interrupt/exception model

21 Registers Functional Units Scoreboard Architecture (CDC 6600) FP Mult FP Mult FP Divide FP Add Integer SCOREBOARD Memory Fetch/Issue

22 Scoreboard Implications Out-of-order completion => WAR, WAW hazards? Solutions for WAR: Stall writeback until all prior uses of the destination register have been read Read registers only during Read Operands stage Solution for WAW: Detect hazard and stall issue of new instruction until other instruction completes Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units Scoreboard keeps track of dependencies between instructions that have already issued Scoreboard replaces ID, EX, M and WB with 4 stages

23 Four Stages of Scoreboard Control Issue decode instructions & check for structural hazards (ID1) Instructions issued in program order (for hazard checking) Don t issue if structural hazard Don t issue if instruction is output dependent on any previously issued but uncompleted instruction (no WAW hazards) Read operands wait until no data hazards, then read operands (ID2) All real dependencies (RAW hazards) resolved in this stage, since we wait for instructions to write back data. No forwarding of data in this model!

24 Four Stages of Scoreboard Control Execution operate on operands (EX) The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution. This includes memory ops. Write result finish execution (WB) Stall until no WAR hazards with previous instructions: Example: DIV ADD NAND CDC 6600 scoreboard would stall NAND until ADD reads operands

25 Three Parts of the Scoreboard Instruction status: Which state the instruction is in Functional unit status: Indicates the state of the functional unit (FU) nine fields for each functional unit Busy: Indicates whether the unit is busy or not Op: Operation to perform in the unit (e.g., + or ) Fi: Destination register number Fj,Fk: Source-register numbers Qj,Qk: Functional units producing source registers Fj, Fk Rj,Rk: Flags indicating when Fj, Fk are ready Register result status Indicates which functional unit will write each register, if one exists Blank when no pending instructions will write that register

26 Instruction status: Scoreboard Example Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 FU

27 Detailed Scoreboard Pipeline Control Instruction status Issue Read operands Execution complete Write result Wait until Not busy (FU) and not result(d) Rj and Rk Functional unit done f((fj(f) Fi(FU) or Rj(f)=No) & (Fk(f) Fi(FU) or Rk( f )=No)) Bookkeeping Busy(FU) yes; Op(FU) op; Fi(FU) `D ; Fj(FU) `S1 ; Fk(FU) `S2 ; Qj Result( S1 ); Qk Result(`S2 ); Rj not Qj; Rk not Qk; Result( D ) FU; Rj No; Rk No f(if Qj(f)=FU then Rj(f) Yes); f(if Qk(f)=FU then Rj(f) Yes); Result(Fi(FU)) 0; Busy(FU) No

28 Instruction status: Scoreboard Example: Cycle 1 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Yes Mult1 No Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 1 FU Integer

29 Instruction status: Scoreboard Example: Cycle 2 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 2 LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Yes Mult1 No Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 2 FU Integer Issue 2nd LD?

30 Instruction status: Scoreboard Example: Cycle 3 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 No Mult1 No Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 3 FU Integer Issue MULT?

31 Instruction status: Scoreboard Example: Cycle 4 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 4 FU

32 Instruction status: Scoreboard Example: Cycle 5 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R3 5 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Yes Mult1 No Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 5 FU Integer

33 Instruction status: Scoreboard Example: Cycle 6 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R3 5 6 MULTD F0 F2 F4 6 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Yes Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 6 FU Mult1 Integer

34 Instruction status: Scoreboard Example: Cycle 7 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 No Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add Yes Sub F8 F6 F2 Integer Yes No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 7 FU Mult1 Integer Add Read multiply operands?

35 Scoreboard Example: Cycle 8 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No Add Yes Sub F8 F6 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 8 FU Mult1 Add Divide

36 Instruction status: Scoreboard Example: Cycle 9 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Note Remaining Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 10 Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No 2 Add Yes Sub F8 F6 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 9 FU Mult1 Add Divide Read operands for MULT & SUB? Issue ADDD?

37 Scoreboard Example: Cycle 10 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 9 Mult1 Yes Mult F0 F2 F4 No No Mult2 No 1 Add Yes Sub F8 F6 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 10 FU Mult1 Add Divide

38 Instruction status: Scoreboard Example: Cycle 11 Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 8 Mult1 Yes Mult F0 F2 F4 No No Mult2 No 0 Add Yes Sub F8 F6 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 11 FU Mult1 Add Divide

39 Scoreboard Example: Cycle 12 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 7 Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 12 FU Mult1 Divide Read operands for DIVD?

40 Scoreboard Example: Cycle 13 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F2 13 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 6 Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 13 FU Mult1 Add Divide

41 Scoreboard Example: Cycle 14 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 5 Mult1 Yes Mult F0 F2 F4 No No Mult2 No 2 Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 14 FU Mult1 Add Divide

42 Scoreboard Example: Cycle 15 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 4 Mult1 Yes Mult F0 F2 F4 No No Mult2 No 1 Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 15 FU Mult1 Add Divide

43 Scoreboard Example: Cycle 16 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 3 Mult1 Yes Mult F0 F2 F4 No No Mult2 No 0 Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 16 FU Mult1 Add Divide

44 Scoreboard Example: Cycle 17 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F WAR Hazard! Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 2 Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 17 FU Mult1 Add Divide Why not write result of ADD???

45 Scoreboard Example: Cycle 18 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 1 Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 18 FU Mult1 Add Divide

46 Scoreboard Example: Cycle 19 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 0 Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 19 FU Mult1 Add Divide

47 Scoreboard Example: Cycle 20 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Yes Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 20 FU Add Divide

48 Scoreboard Example: Cycle 21 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Yes Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 21 FU Add Divide WAR Hazard is now gone...

49 Scoreboard Example: Cycle 22 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No 39 Divide Yes Div F10 F0 F6 No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 22 FU Divide

50

51 Scoreboard Example: Cycle 61 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No 0 Divide Yes Div F10 F0 F6 No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 61 FU Divide

52 Scoreboard Example: Cycle 62 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 62 FU

53 Scoreboard Example: Cycle 62 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 62 FU In-order issue; out-of-order execute & commit

54 CDC 6600 Scoreboard Historical context: Speedup 1.7 from compiler; 2.5 by hand BUT slow memory (no cache) limits benefit Limitations of 6600 scoreboard: No forwarding hardware Limited to instructions in basic block (small window) Small number of functional units (structural hazards), especially integer/load store units Do not issue on structural hazards Wait for WAR hazards Prevent WAW hazards Precise interrupts?

55 Instruction-Level Parallelism and Its Exploitation (Part II) ECE 154B Dmitri Strukov

56 ILP techniques not covered last week this week next week

57 Scoreboard Technique Review Allow for out of order execution by processing several instructions simultaneously In-order issue / out-of-order ex/completion Pipe stages: issue, read registers, ex, write back Book keeping to resolve RAW, WAW, and RAW Resolve RAW at read registers stage Stall issue for WAW, stall completion (WB) for WAR

58 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 FU Instruction status Issue Read operands Execution complete Write result Wait until Not busy (FU) and not result(d) Rj and Rk Functional unit done f((fj(f) Fi(FU) or Rj(f)=No) & (Fk(f) Fi(FU) or Rk( f )=No)) Bookkeeping Busy(FU) yes; Op(FU) op; Fi(FU) `D ; Fj(FU) `S1 ; Fk(FU) `S2 ; Qj Result( S1 ); Qk Result(`S2 ); Rj not Qj; Rk not Qk; Result( D ) FU; Rj No; Rk No f(if Qj(f)=FU then Rj(f) Yes); f(if Qk(f)=FU then Rj(f) Yes); Result(Fi(FU)) 0; Busy(FU) No

59 time Fi Fj Fk Qj Qk Rj Rk Instr A: r2 r4/r5 Instr B: r6 r3*r2 Instr C: r3 r7+r8 Not completed Not completed Completed Instruction status Issue Read operands Execution complete Write result Wait until Not busy (FU) and not result(d) Rj and Rk Functional unit done f((fj(f) Fi(FU) or Rj(f)=No) & (Fk(f) Fi(FU) or Rk( f )=No)) Bookkeeping Busy(FU) yes; Op(FU) op; Fi(FU) `D ; Fj(FU) `S1 ; Fk(FU) `S2 ; Qj Result( S1 ); Qk Result(`S2 ); Rj not Qj; Rk not Qk; Result( D ) FU; Rj No; Rk No f(if Qj(f)=FU then Rj(f) Yes); f(if Qk(f)=FU then Rj(f) Yes); Result(Fi(FU)) 0; Busy(FU) No

60 time Fi Fj Fk Qj Qk Rj Rk Instr A: r2 r4/r5 Instr B: r6 r3*r2 Instr C: r3 r7+r8 r2 r3 r2 B No Not completed Not completed C@t+2 r3 Completed Instruction status Issue Read operands Execution complete Write result Wait until Not busy (FU) and not result(d) Rj and Rk Functional unit done f((fj(f) Fi(FU) or Rj(f)=No) & (Fk(f) Fi(FU) or Rk( f )=No)) Bookkeeping Busy(FU) yes; Op(FU) op; Fi(FU) `D ; Fj(FU) `S1 ; Fk(FU) `S2 ; Qj Result( S1 ); Qk Result(`S2 ); Rj not Qj; Rk not Qk; Result( D ) FU; Rj No; Rk No f(if Qj(f)=FU then Rj(f) Yes); f(if Qk(f)=FU then Rj(f) Yes); Result(Fi(FU)) 0; Busy(FU) No

61 Another Dynamic Algorithm: Tomasulo s Algorithm For IBM 360/91 about 3 years after CDC 6600 (1966) Goal: High Performance without special compilers Differences between IBM 360 & CDC 6600 ISA IBM has only 2 register specifiers/instr vs. 3 in CDC 6600 IBM has 4 FP registers vs. 8 in CDC 6600 IBM has memory-register ops Small number of floating point registers prevented interesting compiler scheduling of operations This led Tomasulo to try to figure out how to get more effective registers renaming in hardware! Why Study? The descendants of this have flourished! Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604,

62 Tomasulo vs. Scoreboard Control & buffers distributed with Function Units (FU) vs. centralized in scoreboard FU buffers called reservation stations ; have pending operands Registers in instructions replaced by values or pointers to reservation stations(rs); called register renaming avoids WAR, WAW hazards More reservation stations than registers, so can do optimizations compilers can t Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs Load and Stores treated as FUs with RSs as well

63 Big picture on ILP for now 1) Trying in improve performance by exploiting ILP (overlapping as much as possible execution of instructions) 2) So far, issue one instruction per cycle and overlap exaction of many by pipelining - We will broaden it later to allow issue of multiple instructions per cycle (superscalar processors) 3) Problem with pipelining hazards (WAW, RAW, WAR), and resulting stalls, which are especially painful with relatively large memory (less of a problem earlier, more a problem now consequence of memory wall problem) and FP latencies (was problem earlier, less of a problem now) 4) Let s reorder instructions (without affective program correctness) can align (overlap) better instructions to reduce hazards - Scoreboard allows reordering but deals somewhat efficiently only with RAW but not with WAW and WAR - WAW and RAW improved in Tomasulo via register renaming (essentially storing register value instead of register label in scoreboard). Also more efficient due to forwarding (common data bus) - However, both scoreboard and Tomasulo are limited in reordering given typical small number of instructions in basic blocks (i.e. cannot reorder across branches)

64 Tomasulo Organization + register status

65 Reservation Station Components Op: Operation to perform in the unit (e.g., + or ) Vj, Vk: Value of source operands Store buffers have V fields, results to be stored Qj, Qk: Reservation stations producing source registers (value to be written) No ready flags as in Scoreboard; Qj,Qk=0 => ready Store buffers only have Qi for RS producing result Busy: Indicates reservation station or FU is busy Register result status Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.

66 Three Stages of Tomasulo Algorithm 1. Issue get instruction from FP Op Queue If reservation station free (no structural hazard), control issues instr & sends operands (renames registers) 2. Execute operate on operands (EX) When both operands ready then execute; if not ready, watch Common Data Bus for result 3. Write result finish execution (WB) Write on Common Data Bus to all awaiting units; mark reservation station available Normal data bus: data + destination ( go to bus) Common data bus: data + source ( come from bus) 64 bits of data + 4 bits of Functional Unit source address Write if matches expected Functional Unit (produces result) Does the broadcast

67 Tomasulo Example Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 Load1 No LD F2 45+ R3 Load2 No MULTD F0 F2 F4 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 0 FU

68 Tomasulo Example Cycle 1 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 Load1 Yes 34+R2 LD F2 45+ R3 Load2 No MULTD F0 F2 F4 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 1 FU Load1

69 Tomasulo Example Cycle 2 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 Load1 Yes 34+R2 LD F2 45+ R3 2 Load2 Yes 45+R3 MULTD F0 F2 F4 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 2 FU Load2 Load1 Note: Unlike 6600, can have multiple loads outstanding (This was not an inherent limitation of scoreboarding)

70 Tomasulo Example Cycle 3 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R2 1 3 Load1 Yes 34+R2 LD F2 45+ R3 2 Load2 Yes 45+R3 MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 Yes MULTD R(F4) Load2 Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 3 FU Mult1 Load2 Load1 Note: registers names are removed ( renamed ) in Reservation Stations; MULT issued vs. scoreboard Load1 completing; what is waiting for Load1?

71 Tomasulo Example Cycle 4 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R3 2 4 Load2 Yes 45+R3 MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 DIVD F10 F0 F6 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 Yes SUBD M(A1) Load2 Add2 No Add3 No Mult1 Yes MULTD R(F4) Load2 Mult2 No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 4 FU Mult1 Load2 M(A1) Add1 Load2 completing; what is waiting for Load2?

72 Tomasulo Example Cycle 5 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 DIVD F10 F0 F6 5 ADDD F6 F8 F2 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk 2 Add1 Yes SUBD M(A1) M(A2) Add2 No Add3 No 10 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 5 FU Mult1 M(A2) M(A1) Add1 Mult2

73 Tomasulo Example Cycle 6 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk 1 Add1 Yes SUBD M(A1) M(A2) Add2 Yes ADDD M(A2) Add1 Add3 No 9 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 6 FU Mult1 M(A2) Add2 Add1 Mult2 Issue ADDD here?

74 Tomasulo Example Cycle 7 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F2 4 7 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk 0 Add1 Yes SUBD M(A1) M(A2) Add2 Yes ADDD M(A2) Add1 Add3 No 8 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 7 FU Mult1 M(A2) Add2 Add1 Mult2 Add1 completing; what is waiting for it?

75 Tomasulo Example Cycle 8 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No 2 Add2 Yes ADDD (M-M) M(A2) Add3 No 7 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 8 FU Mult1 M(A2) Add2 (M-M) Mult2

76 Tomasulo Example Cycle 9 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No 1 Add2 Yes ADDD (M-M) M(A2) Add3 No 6 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 9 FU Mult1 M(A2) Add2 (M-M) Mult2

77 Tomasulo Example Cycle 10 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No 0 Add2 Yes ADDD (M-M) M(A2) Add3 No 5 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 10 FU Mult1 M(A2) Add2 (M-M) Mult2 Add2 completing; what is waiting for it?

78 Tomasulo Example Cycle 11 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 4 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 11 FU Mult1 M(A2) (M-M+M(M-M) Mult2 Write result of ADDD here? All quick instructions complete in this cycle!

79 Tomasulo Example Cycle 12 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F4 3 Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 3 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 12 FU Mult1 M(A2) (M-M+M(M-M) Mult2

82 Tomasulo Example Cycle 15 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No 0 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 15 FU Mult1 M(A2) (M-M+M(M-M) Mult2

83 Tomasulo Example Cycle 16 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 40 Mult2 Yes DIVD M*F4 M(A1) Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 16 FU M*F4 M(A2) (M-M+M(M-M) Mult2

84 Faster than light computation (skip a couple of cycles)

85 Tomasulo Example Cycle 55 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F Load3 No SUBD F8 F6 F DIVD F10 F0 F6 5 ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 1 Mult2 Yes DIVD M*F4 M(A1) Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 55 FU M*F4 M(A2) (M-M+M(M-M) Mult2

86 Tomasulo Example Cycle 56 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F Load3 No SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 0 Mult2 Yes DIVD M*F4 M(A1) Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 56 FU M*F4 M(A2) (M-M+M(M-M) Mult2 Mult2 is completing; what is waiting for it?

87 Tomasulo Example Cycle 57 Instruction status: Exec Write Instruction j k Issue Comp Result Busy Address LD F6 34+ R Load1 No LD F2 45+ R Load2 No MULTD F0 F2 F Load3 No SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No Mult2 Yes DIVD M*F4 M(A1) Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 56 FU M*F4 M(A2) (M-M+M(M-M) Result Once again: In-order issue, out-of-order execution and completion

88 Compare to Scoreboard Cycle 62 Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue ComplResult LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Why take longer on scoreboard/6600? Not efficient WAR and WAW Structural hazards Lack of forwarding

89 Tomasulo vs. Scoreboard (IBM 360/91 vs. CDC 6600) Pipelined Functional Units Multiple Functional Units (6 load, 3 store, 3+, 2x/ ) (1 load/store, 1+, 2x, 1 ) window size: 14 instructions 5 instructions No issue on structural hazard same WAR: renaming avoids stall completion WAW: renaming avoids stall issue Broadcast results from FU Write/read registers Control: reservation stations central scoreboard

90 Tomasulo Drawbacks Complexity delays of 360/91, MIPS 10000, IBM 620? Many associative stores (CDB) at high speed Performance limited by Common Data Bus Each CDB must go to multiple functional units high capacitance, high wiring density Number of functional units that can complete per cycle limited to one! (Multiple CDBs more FU logic for parallel assoc stores) Non-precise interrupts!

91 Summary on Tomasulo Reservations stations: implicit register renaming to larger set of registers + buffering source operands Prevents registers as bottleneck Avoids WAR, WAW hazards of Scoreboard Allows loop unrolling in HW (when branch can be quickly resolved) Helps cache misses as well Lasting Contributions Dynamic scheduling Register renaming Load/store disambiguation 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

92 Summary on ILP for now 1) Trying in improve performance by exploiting ILP (overlapping as much as possible execution of instructions) 2) So far, issue one instruction per cycle and overlap exaction of many by pipelining - We will broaden it later to allow issue of multiple instructions per cycle (superscalar processors) 3) Problem with pipelining hazards (WAW, RAW, WAR), and resulting stalls, which are especially painful with relatively large memory (less of a problem earlier, more a problem now consequence of memory wall phenomena) and FP latencies (was problem earlier, less of a problem now) 4) Let s reorder instructions (without affective program correctness) can align (overlap) better instructions to reduce hazards - Scoreboard allows reordering but deals somewhat efficiently only with RAW but not with WAW and WAR - WAW and RAW improved in Tomasulo via register renaming (essentially storing register value instead of register label in scoreboard). Also more efficient due to forwarding (common data bus) - However, both scoreboard and Tomasulo are limited in reordering given typical small number of instructions in basic blocks (i.e. cannot reorder across branches) 5) Solution let s execute speculatively instructions to have more instructions available for reordering - Need mechanism to recover from wrong execution.

93 Exploiting More ILP Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP (will look at branch prediction implementations next week) One way to overcome control dependencies is with speculation Make a guess and execute program as if our guess is correct Need mechanisms to handle the case when the speculation is incorrect Can do some speculation in the compiler Reordered / duplicated instructions around branches

94 Hardware-Based Speculation Extends the idea of dynamic scheduling with three key ideas: 1. Dynamic branch prediction 2. Speculation to allow the execution of instructions before control dependencies are resolved 3. Dynamic scheduling to deal with scheduling different combinations of basic blocks What we saw earlier was within a basic block Modern processors started using speculation around the introduction of the PowerPC 603, Intel Pentium II and extend Tomasulo s approach to support speculation

95 Speculating with Tomasulo Separate execution from completion Allow instructions to execute speculatively but do not let instructions update registers or memory until they are no longer speculative Instruction Commit After an instruction is no longer speculative it is allowed to make register and memory updates Allow instructions to execute and complete out of order but force them to commit in order Add a hardware buffer, called the reorder buffer (ROB), with registers to hold the result of an instruction between completion and commit Acts as a FIFO queue in order issued

96 Original Tomasulo Architecture

97 Tomasulo and Reorder Buffer Sits between Execution and Register File Source of operands In this case integrated with Store buffer Reservation stations use ROB slot as a tag Instructions commit at head of ROB FIFO queue Easy to undo speculated instructions on mispredicted branches or on exceptions

98 ROB Data Structure Instruction Type Field Indicates whether the instruction is a branch, store, or register operation Destination Field Register number for loads, ALU ops, or memory address for stores Value Field Holds the value of the instruction result until instruction commits Ready Field Indicates if instruction has completed execution and the value is ready

99 Instruction Execution 1. Issue: Get an instruction from the Instruction Queue If the reservation station and the ROB has a free slot (no structural hazard), issue the instruction to the reservation station and the ROB, send operands to the reservation station if available in the register file or the ROB. The allocated ROB slot number is sent to the reservation station to use as a tag when placing data on the CDB. 2. Execution: Operate on operands (EX) When both operands ready then execute; if not ready, watch CDB for result 3. Write result: Finish execution (WB) Write on CDB to all awaiting units and to the ROB using the tag; mark reservation station available 4. Commit: Update register or memory with the ROB result When an instruction reaches the head of the ROB and results are present, update the register with the result or store to memory and remove the instruction from the ROB If an incorrectly predicted branch reaches the head of the ROB, flush the ROB, and restart at the correct successor of the branch Blue text = Change from Tomasulo

100 FP Op Queue Tomasulo With ROB State ROB7 ROB6 Newest Reorder Buffer ROB5 ROB4 ROB3 ROB2 F0 LD F0, 10(R2) I ROB1 Oldest Dest Registers Dest To Memory From Memory FP adders Reservation Stations FP multipliers Dest 1 10+R2

101 FP Op Queue Tomasulo With ROB State ROB7 ROB6 Newest Reorder Buffer ROB5 ROB4 ROB3 ROB2 F0 LD F0, 10(R2) E ROB1 Oldest Dest Registers Dest To Memory From Memory FP adders Reservation Stations FP multipliers Dest 1 10+R2

102 FP Op Queue Tomasulo With ROB State ROB7 ROB6 Newest Reorder Buffer ROB5 ROB4 ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Oldest Dest Registers 2 ADDD R(F4),1 Dest To Memory From Memory FP adders Reservation Stations FP multipliers Dest 1 10+R2

103 FP Op Queue Tomasulo With ROB State ROB7 ROB6 Newest Reorder Buffer ROB5 ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Oldest Dest Registers 2 ADDD R(F4),1 Dest 3 MULD 2,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers Dest 1 10+R2

104 FP Op Queue Reorder Buffer Tomasulo With ROB State ROB7 F0 ADDD F0,F4,F6 I ROB6 F4 LD F4,0(R3) E ROB5 -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Newest Oldest Dest Registers 2 ADDD R(F4),1 6 ADDD 5,R(F6) FP adders Reservation Stations Dest 3 MULD 2,R(F6) FP multipliers Dest To Memory From Memory 1 10+R2 5 0+R3

105 FP Op Queue Reorder Buffer Tomasulo With ROB State [R3] ROB5 ST F4, 0(R3) I ROB7 F0 ADDD F0,F4,F6 I ROB6 F4 LD F4,0(R3) E ROB5 -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Newest Oldest Dest Registers 2 ADDD R(F4),1 6 ADDD 5,R(F6) FP adders Reservation Stations Dest 3 MULD 2,R(F6) FP multipliers Dest To Memory From Memory 1 10+R2 5 0+R3

106 FP Op Queue Reorder Buffer Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 ADDD F0,F4,F6 I ROB6 F4 V1 LD F4,0(R3) W ROB5 -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Newest Oldest Dest Registers 2 ADDD R(F4),1 6 ADDD V1,R(F6) FP adders Reservation Stations Dest 3 MULD 2,R(F6) FP multipliers Dest To Memory From Memory 1 10+R2

107 FP Op Queue Reorder Buffer Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 ADDD F0,F4,F6 E ROB6 F4 V1 LD F4,0(R3) W ROB5 -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Newest Oldest Dest Registers 2 ADDD R(F4),1 6 ADDD V1,R(F6) FP adders Reservation Stations Dest 3 MULD 2,R(F6) FP multipliers Dest To Memory From Memory 1 10+R2

108 FP Op Queue Reorder Buffer Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 LD F0, 10(R2) E ROB1 Newest Oldest Dest Registers 2 ADDD R(F4),1 Dest 3 MULD 2,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers Dest 1 10+R2

109 FP Op Queue Reorder Buffer Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 -- BNE F0, 0, L I ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 I ROB2 F0 V3 LD F0, 10(R2) W ROB1 Newest Oldest Dest Registers 2 ADDD R(F4),V3 Dest 3 MULD 2,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers Dest

110 FP Op Queue Reorder Buffer Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 -- BNE F0, 0, L E ROB4 F2 MULD F2,F10,F6 I ROB3 F10 ADDD F10,F4,F0 E ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Dest Registers 2 ADDD R(F4),V3 F0=V3 Dest 3 MULD 2,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers Dest

111 FP Op Queue Reorder Buffer Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 -- BNE F0, 0, L W ROB4 F2 MULD F2,F10,F6 I ROB3 F10 V4 ADDD F10,F4,F0 W ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Dest Registers F0=V3 Dest 3 MULD V4,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers Dest

112 FP Op Queue Reorder Buffer Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 -- BNE F0, 0, L W ROB4 F2 MULD F2,F10,F6 E ROB3 F10 V4 ADDD F10,F4,F0 C ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Dest Registers F0=V3 F10=V4 Dest 3 MULD V4,R(F6) To Memory From Memory FP adders Reservation Stations FP multipliers Dest

113 FP Op Queue Reorder Buffer Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 -- BNE F0, 0, L W ROB4 F2 V5 MULD F2,F10,F6 W ROB3 F10 V4 ADDD F10,F4,F0 C ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Dest Registers F0=V3 F10=V4 Dest To Memory From Memory FP adders Reservation Stations FP multipliers Dest

114 FP Op Queue Reorder Buffer Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 -- BNE F0, 0, L W ROB4 F2 V5 MULD F2,F10,F6 C ROB3 F10 V4 ADDD F10,F4,F0 C ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Dest Registers F0=V3 F10=V4 F2=V5 Dest To Memory From Memory FP adders Reservation Stations FP multipliers Dest

115 FP Op Queue Reorder Buffer Tomasulo With ROB State [R3] V1 ST F4, 0(R3) W ROB7 F0 V2 ADDD F0,F4,F6 W ROB6 F4 V1 LD F4,0(R3) W ROB5 -- BNE F0, 0, L C ROB4 F2 V5 MULD F2,F10,F6 C ROB3 F10 V4 ADDD F10,F4,F0 C ROB2 F0 V3 LD F0, 10(R2) C ROB1 Newest Oldest Dest Registers F0=V3 F10=V4 F2=V5 Dest To Memory From Memory FP adders Reservation Stations FP multipliers Dest

116 Avoiding Memory Hazards A store only updates memory when it reaches the head of the ROB Otherwise WAW and WAR hazards are possible By waiting to reach the head memory is updated in order and no earlier loads or stores can still be pending If a load accesses a memory location written to by an earlier store then it cannot perform the memory access until the store has written the data Prevents RAW hazard through memory

117 Reorder Buffer Implementation In practice Try to recover as early as possible after a branch is mispredicted rather than wait until branch reaches the head Performance in speculative processors more sensitive to branch prediction Higher cost of misprediction Exceptions (will look into that next week) Don t recognize the exception until it is ready to commit Could try to handle exceptions as they arise and earlier branches resolved, but more challenging

118 Acknowledgements Some of the slides contain material developed and copyrighted by Sally A. McKee and K. Mock (Cornell University) and instructor material for the textbook 119