Instruction Level Parallelism Part I

Transcription

1 Course on: Advanced Computer Architectures Instruction Level Parallelism Part I Prof. Cristina Silvano Politecnico di Milano cristina.silvano@polimi.it 1

2 Outline of Part I Introduction to ILP Dependences and Hazards Dynamic scheduling vs Static scheduling Scoreboard Technique 2

3 Getting higher performance... In a pipelined machine, actual CPI is derived as: CPI pipeline = CPI ideal + Structural Stalls + Data Hazard Stalls + Control Stalls + Memory Stalls Reduction of any right-hand term reduces CPI pipeline to CPI ideal (increase Instructions Per Clock: IPC = 1 / CPI) Best case: the max throughput would be to complete 1 Instructions Per Clock: IPC ideal = 1; CPI ideal = 1-3-

4 Sequential vs. Pipelining Execution I1 I2 IF ID EX MEM WB IF ID EX MEM WB 10 ns 10 ns I1 IF ID EX MEM WB Time I2 2 ns IF ID EX MEM WB I3 2 ns IF ID EX MEM WB I4 2 ns IF ID EX MEM WB I5 2 ns IF ID EX MEM WB -4-

5 Definition of Instruction Level Parallelism ILP = Exploit potential overlap of execution among unrelated instructions Overlapping possible whenever: No Structural Hazards No RAW, WAR of WAW Hazards No Control Hazards 5

6 Some basic concepts and definitions To reach higher performance (for a given technology) more parallelism must be extracted from the program. In other words...multiple-issue Dependences must be detected and solved, and instructions must be re-ordered (scheduled) so as to achieve highest parallelism of instruction execution compatible with available resources. -6-

7 ILP: Multiple Issue Pipelining Execution I1 I 1 IF ID EX MEM WB Time I2 I 2 I3 IF 2 ns ID IF EX ID MEM EX WB MEM WB I4 IF ID EX MEM WB I5 2 ns IF ID EX MEM WB I6 IF ID EX MEM WB I7 2 ns IF ID EX MEM WB I8 IF ID EX MEM WB I9 I10 2 ns IF IF ID ID EX EX MEM MEM WB WB -7-

8 2-issue MIPS Pipeline Architecture M u x 4 M u x ALU PC Instruction memory Registers M u x Write data Data memory Sign extend Sign extend ALU Address M u x 2-instructions issued per clock: 1 ALU or BR instruction 1 load/store instruction 8

9 Getting higher performance... In a multiple-issue pipelined machine, the ideal CPI would be CPI ideal < 1 If we consider for example 2-issue processor, best case: max throughput would be to complete 2 Instructions Per Clock: IPC ideal = 2; CPI ideal =

10 Dependences Determining dependences among instructions is critical to defining the amount of parallelism existing in a program. If two instructions are dependent, they cannot execute in parallel: they must be executed in order or only partially overlapped. Three different types of dependences: Data Dependences (or True Data Dependences) Name Dependences Control Dependences -10-

11 Name Dependences Name dependence occurs when 2 instructions use the same register or memory location (called name), but there is no flow of data between the instructions associated with that name. Two types of name dependences between an instruction i that precedes instruction j in program order: Antidependence: when j writes a register or memory location that instruction i reads (WAR). The original instructions ordering must be preserved to ensure that i reads the correct value. Output Dependence: when i and j write the same register or memory location (WAW). The original instructions ordering must be preserved to ensure that the value finally written corresponds to j. -11-

12 Name Dependences Name dependences are not true data dependences, since there is no value (no data flow) being transmitted between instructions. If the name (register number or memory location) used in the instructions could be changed, the instructions do not conflict. Dependences trough memory locations are more difficult to detect ( memory disambiguation problem), since two addresses may refer to the same location but can look different. Register renaming can be more easily done. Renaming can be done either statically by the compiler or dynamically by the hardware. -12-

13 Data Dependences and Hazards A data/name dependence can potentially generate a data hazard (RAW, WAW, or WAR), but the actual hazard and the number of stalls to eliminate the hazards are a property of the pipeline. RAW hazards correspond to true data dependences. WAW hazards correspond to output dependences WAR hazards correspond to antidependences. Dependences are a property of the program, while hazards are a property of the pipeline. -13-

14 Control Dependences A control dependence determines the ordering of instructions and it is preserved by two properties: Instructions execution in program order to ensure that an instruction that occurs before a branch is executed before the branch. Detection of control hazards to ensure that an instruction (that is control dependent on a branch) is not executed until the branch direction is known. Although preserving control dependence is a simple way to preserve program order, control dependence is not the critical property that must be preserved. -14-

15 Program Properties Two properties are critical to program correctness (and normally preserved by maintaining both data and control dependences): 1. Data flow: Actual flow of data values among instructions that produces the correct results and consumes them. 2. Exception behavior: Preserving exception behavior means that any changes in the ordering of instruction execution must not change how exceptions are raised in the program. -15-

16 Instruction Level Parallelism Two strategies to support ILP: Dynamic Scheduling: Depend on the hardware to locate parallelism Static Scheduling: Rely on software for identifying potential parallelism Hardware intensive approaches dominate desktop and server markets -16-

17 Dynamic Scheduling The hardware reorder the instruction execution to reduce pipeline stalls while maintaining data flow and exception behavior. Main advantages (PROs): It enables handling some cases where dependences are unknown at compile time It simplifies the compiler complexity It allows compiled code to run efficiently on a different pipeline (code portability). Those advantages are gained at a cost of (CONs): a significant increase in hardware complexity, power consumption. -17-

18 Static Scheduling Compilers can use sophisticated algorithms for code scheduling to exploit ILP (Instruction Level Parallelism). However the size of a basic block a straight-line code sequence with no branches in except to the entry and no branches out except at the exit is usually quite small and the amount of parallelism available within a basic block is quite small. Example: For typical MIPS programs the average branch frequency is between 15% and 25% from 4 to 7 instructions execute between a pair of branches. -18-

19 Static Scheduling Data dependence can further limit the amount of ILP we can exploit within a basic block to much less than the average basic block size. To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks (i.e. across branches such as in trace scheduling). -19-

20 Static Scheduling Static detection and resolution of dependences ( static scheduling): accomplished by the compiler dependences are avoided by code reordering. Output of the compiler: reordered into dependency-free code. Typical example: VLIW (Very Long Instruction Word) processors expect dependency-free code generated by the compiler -20-

21 Dynamic Scheduling Simple pipeline: hazards due to data dependences that cannot be hidden by forwarding stall the pipeline no new instructions are fetched nor issued. Dynamic scheduling: Hardware reorder instructions execution so as to reduce stalls, maintaining data flow and exception behaviour. Typical Example: Superscalar Processor -21-

22 Dynamic Scheduling (2) Basically: Instructions are fetched and issued in program order (in-order-issue) Execution begins as soon as operands are available possibly, out of order execution note: possible even with pipelined scalar architectures. Out-of order execution introduces possibility of WAR, WAW data hazards. Out-of order execution implies out of order completion unless there is a re-order buffer to get inorder completion -22-

23 Summary of Instruction Level Parallelism Two strategies to support ILP: Dynamic Scheduling: depend on the hardware to locate parallelism Static Scheduling: rely on the compiler for identifying potential parallelism Hardware intensive approaches dominate desktop and server markets 23

24 Summary of Pipelining Basics Hazards limit performance: Structural: Need more HW resources Data: Need forwarding, compiler scheduling Control: Early evaluation, Branch Delay Slot, Static and Dynamic Branch Prediction Increasing length of pipe (superpipelining) increases impact of hazards Pipelining helps instruction throughput, not latency 24

25 Review: Summary of Pipelining Basics Interrupts, Instruction Set, FP makes pipelining harder Compilers reduce cost of data and control hazards Load delay slots Branch delay slots Branch prediction Today: Longer pipelines Better branch prediction, more instruction parallelism? 25

26 Basic Assumptions We consider single-issue processors The Instruction Fetch stage precedes the Issue Stage and may fetch either into an Instruction Register or into a queue of pending instructions Instructions are then issued from the IR or from the queue Execution stage may require multiple cycles, depending on the operation type. 26

27 Key Idea: Dynamic Scheduling Problem: Data dependences that cannot be hidden with bypassing or forwarding cause hardware stalls of the pipeline Solution: Allow instructions behind a stall to proceed Hw rearranges dynamically the instruction execution to reduce stalls Enables out-of-order execution and completion (commit) First implemented in CDC 6600 (1963). 27

28 Dynamic Scheduling Advantages: Enables handling cases of dependence unknown at compile time Simplifies compiler Allows code compiled for one pipeline to run efficiently on a different pipeline (code portability) Disadvantages: Significant increase in hw complexity Could generate imprecise exceptions 28

29 Example 1 DIVD F0,F2,F4 ADDD F10,F0,F8 # RAW F0 SUBD F12,F8,F14 RAW Hazard: ADDD stalls for F0 (waiting that DIVD commits). SUBD would stall even if not data dependent on anything in the pipeline without dynamic scheduling. BASIC IDEA: to enable SUBD to proceed (out-of-order execution) 29

30 Example 2: Analysis of dependences and hazards LD F6, 34(R2) LD F2, 45(R3) MULTD F0, F2, F4 # RAW F2 SUBD F8, F6, F2 # RAW F2,RAW F6 DIVD F10, F0, F6 # RAW F0,RAW F6 ADDD F6, F8, F2 # WAR F6,RAW F8,RAW F2 30

31 Scoreboard Dynamic Scheduling Algorithm 31

32 Scoreboard basic scheme Out-of-order execution divides ID stage: 1.Issue Decode instructions, check for structural hazards 2.Read operands (RR) Wait until no data hazards, then read operands Instructions execute whenever not dependent on previous instructions and no hazards Scoreboard allows instructions to execute whenever 1 & 2 hold, not waiting for prior instructions 32

33 Scoreboard basic scheme We distinguish when an instruction begins execution and it completes execution: between the two times, the instruction is in execution. We assume the pipeline allows multiple instructions in execution at the same time that requires multiple functional units, pipelined functional units or both. CDC 6600: In order issue, out of order execution, out of order completion (commit) No forwarding! Imprecise interrupt/exception model for now! 33

34 Functional Units Scoreboard Architecture Memory 34

35 Scoreboard Pipeline Scoreboard replaces ID, EX, WB stages with 4 stages ID stage split in two parts: Issue (decode and check structural hazard) Read Operands (wait until no data hazards) Scoreboard allows instructions without dependencies to execute In-order issue BUT out-of-order read-operands out-of-order execution and completion All instructions pass through the issue stage in-order, but they can be stalled or bypass each other in the read operand stage and thus enter execution out-of-order and with different latencies, which implies out-of-order completion. 35

36 Scoreboard Implications Out-of-order completion WAR and WAW hazards can occur Solutions for WAR: Stall write back until registers have been read. Read registers only during Read Operands stage. 36

37 Scoreboard Implications Solution for WAW: Detect hazard and stall issue of new instruction until the other instruction completes No register renaming Need to have multiple instructions in execution phase Multiple execution units or pipelined execution units Scoreboard keeps track of dependencies and state of operations 37

38 Scoreboard Scheme Hazard detection and resolution is centralized in the scoreboard: Every instruction goes through the Scoreboard, where a record of data dependences is constructed The Scoreboard then determines when the instruction can read its operand and begin execution If the scoreboard decides the instruction cannot execute immediately, it monitors every change and decides when the instruction can execute. The scoreboard controls when the instruction can write its result into destination register 38

39 Exception handling Problem with out-of order completion Must preserve exception behavior as in-order execution Solution: ensure that no instruction can generate an exception until the processor knows that the instruction raising the exception will be executed 39

40 Imprecise exceptions An exception is imprecise if the processor state when an exception is raised does not look exactly as if the instructions were executed in-order. Imprecise exceptions can occur because: The pipeline may have already completed instructions that are later in program order than the instruction causing the exception The pipeline may have not yet completed some instructions that are earlier in program order than the instruction causing the exception Imprecise exception make it difficult to restart execution after handling 40

41 Four Stages of Scoreboard Control 1. Issue Decode instruction and check for structural hazards & WAW hazards Instructions issued in program order (for hazard checking) If a functional unit for the instruction is free and no other active instruction has the same destination register (no WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural hazard or a WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared. 41

42 Four Stages of Scoreboard Control Note that when the issue stage stalls, it causes the buffer between Instruction fetch and issue to fill: If the buffer has a single entry: IF stalls If the buffer is a queue of multiple instruction: IF stalls when the queue fills 42

43 Four Stages of Scoreboard Control 2. Read Operands Wait until no data hazards, then read operands A source operand is available if: - No earlier issued active instruction will write it or - A functional unit is writing its value in a register When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. RAW hazards are resolved dynamically in this step, and instructions may be sent into execution out of order. No forwarding of data in this model 43

44 Four Stages of Scoreboard Control 3.Execution The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution. FUs are characterized by: -Variable latency (the effective time used to complete one operation). - Initiation interval (the number of cycles that must elapse between issuing two operations to the same functional unit). - Load/Store latency depends on data cache HIT/MISS 44

45 Four Stages of Scoreboard Control 4. Write result Check for WAR hazards and finish execution Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If none, it writes results. If WAR, then it stalls the completing instruction. 45

46 WAR/WAW Example DIVD F0,F2,F4 ADDD F6,F0,F8 SUBD F8,F8,F14 MULD F6,F10,F8 RAW F0 WAR F8 WAW F6 The scoreboard would: Stall SUBD in the WB stage, waiting for ADDD reads F0 and F8 and Stall MULD in the issue stage until ADDD writes F6. Can be solved through register renaming 46

47 Scoreboard Structure 1. Instruction status 2. Functional Unit status Indicates the state of the functional unit (FU): Busy Indicates whether the unit is busy or not Op - The operation to perform in the unit (+,-, etc.) Fi - Destination register Fj, Fk Source register numbers Qj, Qk Functional units producing source registers Fj, Fk Rj, Rk Flags indicating when Fj, Fk are ready. Flags are set to NO after operands are read. 3. Register result status. Indicates which functional unit will write each register. Blank if no pending instructions will write that register. 47

48 Detailed Scoreboard Pipeline Control Instruction status Issue Read operands Execution complete Write result Wait until Not busy (FU) and not result(d) Rj and Rk Functional unit done f((fj( f ) Fi(FU) or Rj( f )=No) & (Fk( f ) Fi(FU) or Rk( f )=No)) Bookkeeping Busy(FU) yes; Op(FU) op; Fi(FU) `D ; Fj(FU) `S1 ; Fk(FU) `S2 ; Qj Result( S1 ); Qk Result(`S2 ); Rj not Qj; Rk not Qk; Result( D ) FU; Rj No; Rk No f(if Qj(f)=FU then Rj(f) Yes); f(if Qk(f)=FU then Rj(f) Yes); Result(Fi(FU)) 0; Busy(FU) No 48

49 Scoreboard Example Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 FU 49

50 Scoreboard Example: Cycle 1 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R2 1 LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Yes Mult1 No Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 1 FU Integer 50

51 Scoreboard Example Cycle 2 Instruction status Read ExecutioWrite Instruction j k Issue operand completeresult LD F6 34+ R2 1 2 LD F2 45+ R3 MULT F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest S1 S2 FU for j FU for k Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Yes Mult1 No Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12... F30 2 FU Integer Issue 2nd load? Integer Pipeline Full Cannot exec 2 nd Load due to structural hazard on Integer Unit Issue stalls 51

52 Scoreboard Example Cycle 3 Instruction status Read ExecutioWrite Instruction j k Issue operand completeresult LD F6 34+ R LD F2 45+ R3 MULT F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status dest S1 S2 FU for j FU for k Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F6 R2 Yes Mult1 No Mult2 No Add No Divide No Register result status Clock F0 F2 F4 F6 F8 F10 F12... F30 3 FU Integer Issue stalls Load execution complete in one clock cycle (data cache hit) 52

53 Scoreboard Example: Cycle 4 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R3 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 4 FU Integer Issue stalls Write F6 53

54 Scoreboard Example: Cycle 5 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R3 5 MULTD F0 F2 F4 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Yes Mult1 No Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 5 FU Integer The 2 nd load is issued 54

55 Scoreboard Example: Cycle 6 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R3 5 6 MULTD F0 F2 F4 6 SUBD F8 F6 F2 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Yes Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 6 FU Mult1 Integer MULT is issued but has to wait for F2 from LOAD (RAW Hazard on F2) 55

56 Scoreboard Example: Cycle 7 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 Yes No Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add Yes Sub F8 F6 F2 Integer Yes No Divide No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 7 FU Mult1 Integer Add Read multiply operands? Now SUBD can be issued to ADD Functional Unit (then SUBD has to wait for RAW F2 from load) 56

57 Scoreboard Example: Cycle 8a (First half of clock cycle) Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes Load F2 R3 No Mult1 Yes Mult F0 F2 F4 Integer No Yes Mult2 No Add Yes Sub F8 F6 F2 Integer Yes No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 8 FU Mult1 Integer Add Divide DIVD is issued but there is another RAW hazard (F0) from MULTD -> DIVD has to wait for F0 57

58 Scoreboard Example: Cycle 8b (Second half of clock cycle) Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 SUBD F8 F6 F2 7 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No Add Yes Sub F8 F6 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 8 FU Mult1 Add Divide Load completes (Writes F2), and operands for MULT an SUBD are ready 58

59 Scoreboard Example: Cycle 9 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Note Remaining Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 10 Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No 2Add Yes Sub F8 F6 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 9 FU Mult1 Add Divide Read operands for MULTD & SUBD by multiple-port Register File Issue ADDD? No for structural hazard on ADD Functional Unit MULTD and SUBD are sent in execution in parallel: Latency of 10 cycles for MULTD and 2 cycles for SUBD 59

60 Scoreboard Example: Cycle 10 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F2 7 9 DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 9Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No 1Add Yes Sub F8 F6 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 10 FU Mult1 Add Divide 60

61 Scoreboard Example: Cycle 11 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 8Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No 0Add Yes Sub F8 F6 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 11 FU Mult1 Add Divide SUBD ends execution 61

62 Scoreboard Example: Cycle 12 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F2 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 7Mult1 Yes Mult F0 F2 F4 Yes Yes Mult2 No Add No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 12 FU Mult1 Divide SUBD writes result in F8 62

63 Scoreboard Example: Cycle 13 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F2 13 Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 6Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 13 FU Mult1 Add Divide ADDD can be issued DIVD still waits for operand F0 from MULTD 63

64 Scoreboard Example: Cycle 14 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 5Mult1 Yes Mult F0 F2 F4 No No Mult2 No 2 Add Yes Add F6 F8 F2 Yes Yes Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 14 FU Mult1 Add Divide ADDD reads operands (out-of-order read operands: ADDD reads operands before DIVD) 64

65 Scoreboard Example: Cycle 15 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 4Mult1 Yes Mult F0 F2 F4 No No Mult2 No 1 Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 15 FU Mult1 Add Divide ADDD starts execution 65

66 Scoreboard Example: Cycle 16 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 3Mult1 Yes Mult F0 F2 F4 No No Mult2 No 0 Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 16 FU Mult1 Add Divide ADDD ends execution 66

67 Scoreboard Example: Cycle 17 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F WAR F6 Hazard! Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 2Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 17 FU Mult1 Add Divide Why not write result of ADDD??? WAR must be detected before write result of ADDD in F6 DIVD must first read F6 (before ADDD write F6), but DIVD cannot read operands until MULTD writes F0 67

68 Scoreboard Example: Cycle 18 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F4 6 9 SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 1Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 18 FU Mult1 Add Divide 68

69 Scoreboard Example: Cycle 19 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No 0Mult1 Yes Mult F0 F2 F4 No No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Mult1 No Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 19 FU Mult1 Add Divide MULTD ends execution 69

70 Scoreboard Example: Cycle 20 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F6 8 ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Yes Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 20 FU Add Divide MULTD writes in F0 70

71 Scoreboard Example: Cycle 21 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk 40 Integer No Mult1 No Mult2 No Add Yes Add F6 F8 F2 No No Divide Yes Div F10 F0 F6 Yes Yes Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 21 FU Add Divide DIVD can read operands WAR Hazard is now gone... 71

72 Scoreboard Example: Cycle 22 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No 39 Divide Yes Div F10 F0 F6 No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 22 FU Divide DIVD has read its operands in previous cycle ADDD can now write the result in F6 72

73 (skipping some cycles ) 73

74 Scoreboard Example: Cycle 61 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 No Mult2 No Add No 0 Divide Yes Div F10 F0 F6 No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 61 FU Divide DIVD ends execution 74

75 Scoreboard Example: Cycle 62 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 62 FU DIVD writes in F10 75

76 Review: Scoreboard Example: Cycle 62 Instruction status: Read Exec Write Instruction j k Issue Oper Comp Result LD F6 34+ R LD F2 45+ R MULTD F0 F2 F SUBD F8 F6 F DIVD F10 F0 F ADDD F6 F8 F Functional unit status: dest S1 S2 FU FU Fj? Fk? Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult1 Mult2 Add Divide No No No No No Register result status: Clock F0 F2 F4 F6 F8 F10 F12... F30 62 FU In-order issue; out-of-order execute & commit 76

77 CDC 6600 Scoreboard Speedup of 2.5 w.r.t. no dynamic scheduling Speedup 1.7 by reorganizing instructions from compiler; BUT slow memory (no cache) limits benefit Limitations of 6600 scoreboard: No forwarding hardware Limited to instructions in basic block (small window) Small number of functional units (structural hazards), especially integer/load store units Do not issue on structural hazards Wait for WAR hazards Prevent WAW hazards 77

78 Summary Instruction Level Parallelism (ILP) in SW or HW Loop level parallelism is easiest to see SW parallelism dependencies defined for program, hazards if HW cannot resolve SW dependencies/compiler sophistication determine if compiler can unroll loops Memory dependencies hardest to determine HW exploiting ILP Works when can t know dependence at run time Code for one machine runs well on another Key idea of Scoreboard: Allow instructions behind stall to proceed (Decode Issue Instruction & Read Operands) Enables out-of-order execution => out-of-order completion ID stage checked both structural and WAW hazards on destination operands. 78

79 End of Part I References: Appendix A and Chapter 2 of the text book: J. Hennessey, D. Patterson, Computer Architecture: a quantitative approach 4 th edition, Morgan-Kaufmann Publishers. 79