Computer Architecture Lecture 4: Pipelining (Appendix C) Chih Wei Liu 劉志尉 National Chiao Tung University

Transcription

1 Computer Architecture Lecture 4: Pipelining (Appendix C) Chih Wei Liu 劉志尉 National Chiao Tung University

2 Why Pipeline? T I 2 I 1 I 2 I 1 Throughput = 1/T I 4 I 3 I 2 I 1 I 4 I 3 I 2 I 1 I 4 I 3 I 2 I 1 I 4 I 3 I 2 I 1 I 4 I 3 I 2 I 1 CA-Lec4 cwliu@twins.ee.nctu.edu.tw 2 Throughput = 4/T

3 Visualizing Pipelining Figure C.3, Page C 9 Time (clock cycles) I n s t r. Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 O r d e r CA-Lec4 cwliu@twins.ee.nctu.edu.tw 3

4 Designing a Pipelined Processor Start with ISA Examine the datapath and control diagram Starting with single or multi cycle datapath? Single or multi cycle control? Partition datapath into steps Insert pipeline registers between successive steps Associate resources with steps Ensure that flows do not conflict, or figure out how to resolve Assert control in appropriate stage CA-Lec4 cwliu@twins.ee.nctu.edu.tw 4

5 Example: 5 Steps of MIPS Datapath Instruction Fetch Instr. Decode. Fetch Execute Addr. Calc Memory Access Write Back Next PC Address 4 Adder Memory Inst Next SEQ PC RS1 RS2 RD File MUX MUX Zero? MUX Data Memory L M D MUX IR <= mem[pc]; PC <= PC + 4 Imm Sign Extend [IR rd ] <= [IR rs ] op IRop [IR rt ] WB Data CA-Lec4 cwliu@twins.ee.nctu.edu.tw 5

6 5 Steps of MIPS Datapath Instruction Fetch Instr. Decode. Fetch Execute Addr. Calc Memory Access Write Back Next PC 4 Adder Next SEQ PC RS1 Next SEQ PC Zero? MUX Address Memory IR <= mem[pc]; PC <= PC + 4 A <= [IR rs ]; B <= [IR rt ] IF/ID RS2 Imm File Sign Extend ID/EX MUX MUX EX/MEM RD RD RD Data Memory MEM/WB MUX WB Data rslt <= A op IRop B WB <= rslt [IR rd ] <= WB CA-Lec4 cwliu@twins.ee.nctu.edu.tw 6

7 The 5 Step of the Lw Instruction Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Load /Dec Exec Mem Wr Lw instruction can be implemented in 5 clock cycles : Instruction Fetch Fetch the instruction from the Instruction Memory /Dec: isters Fetch and Instruction Decode Exec: Execution and calculate the memory address Mem: Read the data from the Data Memory Wr: Write the data back to the register file Branch requires? cycles, Store requires? cycles, others require? cycles CA-Lec4 cwliu@twins.ee.nctu.edu.tw 7

8 The 4 Step of R type Instruction Cycle 1 Cycle 2 Cycle 3 Cycle 4 R-type /Dec Exec Wr : Instruction Fetch Fetch the instruction from the Instruction Memory Update PC /Dec: isters Fetch and Instruction Decode Exec: operates on the two register operands Wr: Write the output back to the register file does not access data memory CA-Lec4 cwliu@twins.ee.nctu.edu.tw 8

9 Important Observation Each functional unit can only be used once per instruction: Load uses ister File s Write Port during its 5th step Load /Dec Exec Mem Wr This s what caused the problem R type uses ister File s Write Port during its 4th step R-type /Dec Exec Wr A structural hazard will be happened!! CA-Lec4 cwliu@twins.ee.nctu.edu.tw 9

10 Pipelining the R type and Load Instructions Clock Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 R-type /Dec Exec Wr Ops! We have a problem! R-type /Dec Exec Wr Load /Dec Exec Mem Wr R-type /Dec Exec Wr R-type /Dec Exec Wr Structural hazard: Two instructions try to write to the register file at the same time! Only one write port CA-Lec4 cwliu@twins.ee.nctu.edu.tw 10

11 Sol 1: Insert Bubble into the Pipeline Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock R-type /Dec Exec Wr Load /Dec Exec Mem Wr R-type /Dec Exec Wr R-type /Dec Pipeline Exec Wr R-type Bubble /Dec Exec Wr /Dec Exec Insert a bubble into the pipeline to prevent 2 writes at the same cycle The control logic can be complex. Lose instruction fetch and issue opportunity. CA-Lec4 cwliu@twins.ee.nctu.edu.tw 11

12 Sol 2: Delay R type s Write by One Cycle 5 step R type instructions: Mem step for R type inst. is a NOOP : nothing is being done R-type /Dec Exec Mem Wr Clock Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 R-type /Dec Exec Mem Wr R-type /Dec Exec Mem Wr Load /Dec Exec Mem Wr R-type A unified pipeline architecture for each type of instructions!! /Dec Exec Mem Wr R-type /Dec Exec Mem Wr CA-Lec4 cwliu@twins.ee.nctu.edu.tw 12

13 Similarly, for Store and Branch Instructions Cycle 1 Cycle 2 Cycle 3 Cycle 4 Store /Dec Exec Mem Wr Cycle 1 Cycle 2 Cycle 3 Cycle 4 Beq /Dec Exec Mem Wr CA-Lec4 cwliu@twins.ee.nctu.edu.tw 13

14 Data Stationary Control Pass control signals along just like the data Main control generates control signals during ID WB Instruction Control M WB EX M WB IF/ID ID/EX EX/MEM MEM/WB CA-Lec4 cwliu@twins.ee.nctu.edu.tw 14

15 Use of Data Stationary Control The Main Control generates the control signals during /Dec Control signals for Exec (ExtOp, Src,...) are used 1 cycle later Control signals for Mem (MemWr Branch) are used 2 cycles later Control signals for Wr (Memto MemWr) are used 3 cycles later /Dec Exec Mem Wr ExtOp ExtOp Src Src IF/ID ister Main Control Op Dst MemWr Branch Memto ID/Ex ister Op Dst MemWr Branch Memto Ex/Mem ister MemWr Branch Memto Mem/Wr ister Memto Wr Wr Wr Wr CA-Lec4 cwliu@twins.ee.nctu.edu.tw 15

16 Pipeline Summary A pipeline is like an hooked assembly line. Pipelining, in general, is not visible to the programmer (vs ILP) Pipelining doesn t help latency of single task, it helps throughput of entire workload Pipeline rate limited by slowest pipeline stage Multiple tasks operating simultaneously using different resources Potential speedup = Number pipe stages, if perfectly balanced stage. Unbalanced lengths of pipe stages reduces speedup Time to fill pipeline and time to drain it reduces speedup Pipeline hazard CA-Lec4 cwliu@twins.ee.nctu.edu.tw 16

17 Pipelining is not quite that easy! Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away) Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock) Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps). CA-Lec4 17

18 One Memory Port/Structural Hazards Time (clock cycles) Figure A.4, Page A 14 Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 I n s t r. O r d e r Load Instr 1 Instr 2 Instr 3 Instr 4 CA-Lec4 cwliu@twins.ee.nctu.edu.tw 18

19 One Memory Port/Structural Hazards Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 I n s t r. O r d e r Load Instr 1 Instr 2 Stall Instr 3 Bubble Bubble Bubble Bubble Bubble What will the bubble impact? CA-Lec4 cwliu@twins.ee.nctu.edu.tw 19

20 Speed Up Equations for Pipelining Speedup Average instruction time unpipelined Average instruction time pipelined CPI CPI unpipelined pipelined Clock cycle unpipelined Clock cycle pipelined CPI pipelined Ideal CPI Average Stall cycles per Inst Speedup Ideal CPI Pipeline depth Ideal CPI Pipeline stall CPI Cycle Time Cycle Time unpipelined pipelined for balanced pipelining For simple RISC pipeline, CPI = 1: Pipeline depth Speedup 1 Pipeline stall CPI Cycle Time Cycle Time unpipelined pipelined CA-Lec4 cwliu@twins.ee.nctu.edu.tw 20

21 Example: Dual port vs. Single port Machine A: Dual ported memory ( Harvard Architecture ) Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate Ideal CPI = 1 for both Suppose that Loads are 40% of instructions executed SpeedUp A = Pipeline Depth/(1 + 0) x (clock unpipe /clock pipe ) = Pipeline Depth SpeedUp B = Pipeline Depth/( x 1) x (clock unpipe /(clock unpipe / 1.05) = (Pipeline Depth/1.4) x 1.05 = 0.75 x Pipeline Depth SpeedUp A / SpeedUp B = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33 Machine A is 1.33 times faster CA-Lec4 cwliu@twins.ee.nctu.edu.tw 21

22 Data Hazard Problem on r1 Time (clock cycles) Dependencies backwards in time are hazards IF ID/RF EX MEM WB I n s t r. add r1,r2,r3 sub r4,r1,r3 O r d e r and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 CA-Lec4 cwliu@twins.ee.nctu.edu.tw 22

23 Types of Data Hazards RAW (read after write): true data dependence Get wrong propagation result WAR (write after read): anti dependence Get wrong operand WAW (write after write): output dependence Leave wrong result CA-Lec4 23

24 Name Dependence Data Hazards Write After Read (WAR) Instr J writes operand before Instr I reads it I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7 Called an anti dependence by compiler writers. This results from reuse of the name r1. Can t happen in MIPS 5 stage pipeline because: All instructions take 5 stages. Reads are always in stage 2 and Writes are always in stage 5 Can happen in OOO processor (discussed in chapter 3). CA-Lec4 cwliu@twins.ee.nctu.edu.tw 24

25 Name Dependence Data Hazards Write After Write (WAW) Instr J writes operand before Instr I writes it. I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 Called an output dependence by compiler writers This also results from the reuse of name r1. Can t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Writes are always in stage 5 Will see WAR and WAW in more complicated pipes and OOO processor CA-Lec4 cwliu@twins.ee.nctu.edu.tw 25

26 Forwarding to Avoid Data Hazard I n s t r. add r1,r2,r3 sub r4,r1,r3 Time (clock cycles) Forward result from one stage to another Define read/write properly O r d e r and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 CA-Lec4 cwliu@twins.ee.nctu.edu.tw 26

27 HW Change for Forwarding Additional hardware is required. NextPC isters ID/EX mux mux EX/MEM Data Memory MEM/WR Immediate mux What circuit detects and resolves this hazard? CA-Lec4 27

28 Forwarding to Avoid LW SW Data Hazard Time (clock cycles) I n s t r. add r1,r2,r3 lw r4, 0(r1) O r d e r sw r4,12(r1) or r8,r6,r9 xor r10,r9,r11 CA-Lec4 cwliu@twins.ee.nctu.edu.tw 28

29 Data Hazard Even with Forwarding Time (clock cycles) I n s t r. lw r1, 0(r2) sub r4,r1,r6 O r d e r and r6,r1,r7 or r8,r1,r9 Must delay/stall instruction dependent on loads: Load stall cycles CA-Lec4 cwliu@twins.ee.nctu.edu.tw 29

30 Data Hazard Even with Forwarding Time (clock cycles) I n s t r. lw r1, 0(r2) sub r4,r1,r6 Bubble O r d e r and r6,r1,r7 or r8,r1,r9 Bubble Bubble CA-Lec4 cwliu@twins.ee.nctu.edu.tw 30

31 Software Scheduling to Avoid Load Hazards Try producing fast code for a = b + c; d = e f; assuming a, b, c, d,e, and f in memory. Slow code: LW Rb,b Fast code: LW Rb,b LW Rc,c LW Rc,c ADD Ra,Rb,Rc LW Re,e SW a,ra ADD Ra,Rb,Rc LW Re,e LW Rf,f SW a,ra LW Rf,f SUB Rd,Re,Rf SUB Rd,Re,Rf SW d,rd SW d,rd Compiler optimizes for performance. Hardware checks for safety. CA-Lec4 cwliu@twins.ee.nctu.edu.tw 31

32 Control Hazard on Branches 10: beq r1,r3,36 14: and r2,r3,r5 18: or r6,r1,r7 22: add r8,r1,r9 36: xor r10,r1,r11 What do you do with the 3 instructions in between? The simplest solution is to stall the pipeline as soon as a branch instruction is detected CA-Lec4 cwliu@twins.ee.nctu.edu.tw 32

33 Branch Stall Impact If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9! Two part solution: Determine branch taken or not sooner, AND Compute taken branch address earlier MIPS branch tests if register = 0 or 0 MIPS Solution: Move Zero test to ID/RF stage Adder to calculate new PC in ID/RF stage 1 clock cycle penalty for branch versus 3 CA-Lec4 cwliu@twins.ee.nctu.edu.tw 33

34 Pipelined MIPS Datapath Instruction Fetch Instr. Decode. Fetch Execute Addr. Calc Memory Access Write Back Next PC 4 Adder Next SEQ PC RS1 Adder MUX Zero? Address Memory IF/ID RS2 File ID/EX MUX EX/MEM Data Memory MEM/WB MUX Imm Sign Extend RD RD RD WB Data Interplay of instruction set design and cycle time. CA-Lec4 cwliu@twins.ee.nctu.edu.tw 34

35 Four Branch Hazard Alternatives #1: Stall until branch direction is clear #2: Predict Branch Not Taken Execute successor instructions in sequence Squash instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% MIPS branches not taken on average PC+4 already calculated, so use it to get next instruction #3: Predict Branch Taken 53% MIPS branches taken on average But haven t calculated branch target address in MIPS MIPS still incurs 1 cycle branch penalty Other machines: branch target known before outcome What happens when hit not taken branch? CA-Lec4 cwliu@twins.ee.nctu.edu.tw 35

36 Four Branch Hazard Alternatives #4: Delayed Branch make the stall cycle useful Define branch to take place AFTER a following instruction branch instruction sequential successor 1 sequential successor 2... sequential successor n branch target if taken Branch delay of length n These insts. are executed!! 1 slot delay allows proper decision and branch target address in 5 stage pipeline MIPS uses this CA-Lec4 cwliu@twins.ee.nctu.edu.tw 36

37 Scheduling Branch Delay Slots A. From before branch B. From branch target C. From fall through add $1,$2,$3 if $2=0 then delay slot sub $4,$5,$6 add $1,$2,$3 if $1=0 then delay slot add $1,$2,$3 if $1=0 then delay slot sub $4,$5,$6 becomes becomes becomes add $1,$2,$3 if $2=0 then if $1=0 then add $1,$2,$3 add $1,$2,$3 if $1=0 then sub $4,$5,$6 sub $4,$5,$6 A is the best choice, fills delay slot & reduces instruction count (IC) In B, the sub instruction may need to be copied, increasing IC In B and C, must be okay to execute sub when branch fails CA-Lec4 cwliu@twins.ee.nctu.edu.tw 37

38 Delay Branch Scheduling Schemes and Their Requirements Scheduling Strategy From before Requirements Branch must not depend on the rescheduled instructions Improve Performance When? Always From target From fall through Must be OK to execute rescheduled instructions if branch is not taken. May need to duplicate instructions Must be OK to execute instructions if branch is taken When branch is taken. May enlarge program if instructions are duplicated When branch is not taken. CA-Lec4 38

39 Delayed Branch Compiler effectiveness for single branch delay slot: Fills about 60% of branch delay slots About 80% of instructions executed in branch delay slots useful in computation About 50% (60% x 80%) of slots usefully filled Delayed Branch downside: As processor go to deeper pipelines and multiple issue, the branch delay grows and need more than one delay slots Delayed branch (a static way) has lost popularity compared to more expensive but more flexible dynamic approaches Growth in available transistors has made dynamic approaches relatively cheaper CA-Lec4 cwliu@twins.ee.nctu.edu.tw 39

40 Performance of Branch Schemes Example For MIPS R4K, a deeper pipeline, it takes at least 3 pipeline stages before the branchtarget address is known and an additional cycle before the branch condition is evaluated. Scheme Penalty Uncond Penalty untaken Penalty taken Stall Predicted taken Predicted not taken 2 0 3

41 Evaluating Branch Alternatives Pipeline speedup = Pipeline depth 1 +Branch frequency Branch penalty Assume: 4% unconditional branch, 6% conditional branch untaken, 10% conditional branch taken Scheduling Branch CPI speedup v. speedup v. scheme penalty unpipelined stall Stall pipeline 2x0.04+3x n/ Predict not taken 2x0.04+3x0.06+2x n/ Predict taken 2x0.04+0x0.06+3x n/ Delayed branch 0.5x( ) 1.10 n/ CA-Lec4 cwliu@twins.ee.nctu.edu.tw 41

42 Problems with Pipelining Exception: An unusual event happens to an instruction during its execution Examples: divide by zero, undefined opcode Interrupt: Hardware signal to switch the processor to a new instruction stream Example: a sound card interrupts when it needs more audio output samples (an audio click happens if it is left waiting) Problem: It must appear that the exception or interrupt must appear between 2 instructions (I i and I i+1 ) The effect of all instructions up to and including I i is totalling complete No effect of any instruction after I i can take place The interrupt (exception) handler either aborts program or restarts at instruction I i+1 CA-Lec4 cwliu@twins.ee.nctu.edu.tw 42

43 Example: Device Interrupt (Say, arrival of network message) External Interrupt add subi slli lw lw add sw r1,r2,r3 r4,r1,#4 r4,r4,#2 Hiccup(!) r2,0(r4) r3,4(r4) r2,r2,r3 8(r4),r2 Raise priority Reenable All Ints Save registers lw r1,20(r0) lw r2,0(r1) addi r3,r0,#5 sw 0(r1),r3 Restore registers Clear current Int Disable All Ints Restore priority RTE Interrupt Handler CA-Lec4 43

44 Alternative: Polling (again, for arrival of network message) External Interrupt no_mess: Disable Network Intr subi r4,r1,#4 slli r4,r4,#2 lw r2,0(r4) lw r3,4(r4) add r2,r2,r3 sw 8(r4),r2 lw r1,12(r0) beq r1,no_mess lw r1,20(r0) lw r2,0(r1) addi r3,r0,#5 sw 0(r1),r3 Clear Network Intr Polling Point (check device register) Handler CA-Lec4 44

45 Polling is faster/slower than Interrupts. Polling is faster than interrupts because Compiler knows which registers in use at polling point. Hence, do not need to save and restore registers (or not as many). Other interrupt overhead avoided (pipeline flush, trap priorities, etc). Polling is slower than interrupts because Overhead of polling instructions is incurred regardless of whether or not handler is run. This could add to inner loop delay. Device may have to wait for service for a long time. When to use one or the other? Multi axis tradeoff Frequent/regular events good for polling, as long as device can be controlled at user level. Interrupts good for infrequent/irregular events Interrupts good for ensuring regular/predictable service of events. CA-Lec4 cwliu@twins.ee.nctu.edu.tw 45

46 Trap/Interrupt Classifications Traps: relevant to the current process Faults, arithmetic traps, and synchronous traps Invoke software on behalf of the currently executing process Interrupts: caused by asynchronous, outside events I/O devices requiring service (DISK, network) Clock interrupts (real time scheduling) Machine Checks: caused by serious hardware failure Not always restartable Indicate that bad things have happened. Non recoverable ECC error Machine room fire Power outage CA-Lec4 46

47 A related classification: Synchronous vs. Asynchronous Synchronous: means related to the instruction stream, i.e. during the execution of an instruction Must stop an instruction that is currently executing Page fault on load or store instruction Arithmetic exception Software Trap Instructions Asynchronous: means unrelated to the instruction stream, i.e. caused by an outside event. Does not have to disrupt instructions that are already executing Interrupts are asynchronous Machine checks are asynchronous SemiSynchronous (or high availability interrupts): Caused by external event but may have to disrupt current instructions in order to guarantee service CA-Lec4 47

48 Interrupt Controller Timer Interrupt Mask Priority Encoder IntID Interrupt CPU Int Disable Network Software Interrupt Control NMI Interrupts invoked with interrupt lines from devices Interrupt controller chooses interrupt request to honor Mask enables/disables interrupts Priority encoder picks highest enabled interrupt Software Interrupt Set/Cleared by Software Interrupt identity specified with ID line CPU can disable all interrupts with internal flag Non maskable interrupt line (NMI) can t be disabled CA-Lec4 cwliu@twins.ee.nctu.edu.tw 48

49 Precise Interrupts/Exceptions An interrupt or exception is considered precise if there is a single instruction (or interrupt point) for which: All instructions before that have committed their state No following instructions (including the interrupting instruction) have modified any state. This means that you can restart execution at the interrupt point and get the right answer Implicit in our previous example of a device interrupt: Interrupt point is at first lw instruction External Interrupt add subi slli lw lw add sw r1,r2,r3 r4,r1,#4 r4,r4,#2 r2,0(r4) r3,4(r4) r2,r2,r3 8(r4),r2 Int handler CA-Lec4 cwliu@twins.ee.nctu.edu.tw 49

50 Why are precise interrupts desirable? Many types of interrupts/exceptions need to be restartable, which is easier to figure out what actually happened: e.g. TLB faults. Need to fix translation, then restart load/store e.g. IEEE gradual underflow, illegal operation, etc Restartability doesn t require preciseness. However, preciseness makes it a lot easier to restart. Simplify the task of the operating system a lot Less state needs to be saved away if unloading process. Quick to restart (making for fast interrupts) CA-Lec4 cwliu@twins.ee.nctu.edu.tw 50

51 Precise Exceptions in simple 5 stage pipeline: Exceptions/interrupts may occur at different stages in pipeline : Arithmetic exceptions occur in execution stage TLB faults can occur in instruction fetch or memory stage What about interrupts? try to interrupt the pipeline as little as possible All of this solved by tagging instructions in pipeline as cause exception or not and wait until end of memory stage to flag exception Interrupts become marked NOPs (like bubbles) that are placed into pipeline instead of an instruction. Assume that interrupt condition persists in case NOP flushed Clever instruction fetch might start fetching instructions from interrupt vector, but this is complicated by need for supervisor mode switch, saving of one or more PCs, etc CA-Lec4 cwliu@twins.ee.nctu.edu.tw 51

52 Another Exception Problem Data TLB IFetch Dcd Exec Mem WB Time Bad Inst Inst TLB fault Overflow Program Flow IFetch Dcd Exec Mem WB Use pipeline to sort this out! Pass exception status along with instruction. Keep track of PCs for every instruction in pipeline. Don t act on exception until it reache WB stage IFetch Dcd Exec Mem WB Handle interrupts through faulting noop in IF stage When instruction reaches WB stage: Save PC EPC, Interrupt vector addr PC Turn all instructions in earlier stages into noops! IFetch Dcd Exec Mem WB CA-Lec4 cwliu@twins.ee.nctu.edu.tw 52

53 Precise Exceptions in Static Pipelines Key observation: architected state only change in memory and register write stages. CA-Lec4 53