Pipeline Hazards. Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by Arvind and Krste Asanovic


1 Pipeline Hazards. Arvind, Computer Science and Artificial Intelligence Laboratory, M.I.T. Based on the material prepared by Arvind and Krste Asanovic

2 6.823 L6-2 Technology Assumptions
- A small amount of very fast memory (caches) backed up by a large, slower memory
- Fast ALU (at least for integers)
- Multiported register files (slower!)
These make the following timing assumption valid: t_IM ≈ t_RF ≈ t_ALU ≈ t_DM ≈ t_RW
A 5-stage pipelined Harvard architecture will be the focus of our detailed design.

3 5-Stage Pipelined Execution
[Figure: 5-stage datapath (PC, +0x4 adder, instruction memory, GPRs, immediate extender, ALU, data memory) divided into I-Fetch (IF), Decode/Reg. Fetch (ID), Execute (EX), Memory (MA), and Write-Back (WB)]
time          t0   t1   t2   t3   t4   t5   t6   t7   ....
instruction1  IF1  ID1  EX1  MA1  WB1
instruction2       IF2  ID2  EX2  MA2  WB2
instruction3            IF3  ID3  EX3  MA3  WB3
instruction4                 IF4  ID4  EX4  MA4  WB4
instruction5                      IF5  ID5  EX5  MA5  WB5
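
The staggered timing on this slide can be generated mechanically. The following is a minimal sketch, assuming a hazard-free pipeline in which every instruction advances one stage per cycle; the names `ideal_schedule`, `total_cycles`, and `render` are illustrative and do not come from the lecture.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]

def ideal_schedule(num_instructions):
    """Map each instruction number (1-based) to {cycle: stage}, assuming no hazards."""
    schedule = {}
    for i in range(num_instructions):
        # Instruction i+1 enters IF in cycle i and advances one stage per cycle.
        schedule[i + 1] = {i + k: stage for k, stage in enumerate(STAGES)}
    return schedule

def total_cycles(num_instructions):
    """Cycles until the last instruction leaves WB (pipeline fill plus one per instruction)."""
    if num_instructions == 0:
        return 0
    return num_instructions + len(STAGES) - 1

def render(num_instructions):
    """Return the timing diagram as text, one row per instruction."""
    width = total_cycles(num_instructions)
    rows = []
    for instr, cycles in ideal_schedule(num_instructions).items():
        cells = [f"{cycles[t]}{instr}" if t in cycles else "" for t in range(width)]
        rows.append(f"instruction{instr} " + " ".join(f"{c:<4}" for c in cells).rstrip())
    return "\n".join(rows)

if __name__ == "__main__":
    print(render(5))
```

With five instructions the sketch needs nine cycles in total: four to fill the pipeline, then one per instruction.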

4 5-Stage Pipelined Execution: Resource Usage Diagram
[Figure: the same 5-stage datapath, with the stages IF, ID, EX, MA, WB treated as resources]
time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   I4   I5
EX              I1   I2   I3   I4   I5
MA                   I1   I2   I3   I4   I5
WB                        I1   I2   I3   I4   I5

5 Pipelined Execution: ALU Instructions
[Figure: pipelined datapath with PC, instruction memory, GPRs, immediate extender, ALU (inputs A, B; output Y), data memory, and pipeline registers MD1, MD2, R]
Not quite correct! We need an Instruction Register (IR) for each stage.

6 IRs and Control Points
[Figure: the pipelined datapath with an IR for each stage]
Are control points connected properly?
- ALU instructions
- Load/Store instructions
- Write back

7 Pipelined MIPS Datapath without Jumps
[Figure: datapath with stages F, D, E, M, W and control signals RegWrite, RegDst, ExtSel, BSrc, OpSel, MemWrite, WBSrc]

8 How Instructions Can Interact with Each Other in a Pipeline
- An instruction in the pipeline may need a resource being used by another instruction in the pipeline: structural hazard
- An instruction may produce data that is needed by a later instruction: data hazard
- In the extreme case, an instruction may determine the next instruction to be executed: control hazard (branches, interrupts, ...)

9 Data Hazards
[Figure: pipelined datapath executing the two instructions below back to back]
    r1 ← r0 + 10
    r4 ← r1 + 17
The second instruction reads r1 in the decode stage before the first has written it back, so the r1 it reads is stale. Oops!

10 Resolving Data Hazards
- Freeze earlier pipeline stages until the data becomes available: interlocks
- If data is available somewhere in the datapath, provide a bypass to get it to the right stage
- Speculate about the hazard resolution and kill the instruction later if the speculation is wrong

11 Feedback to Resolve Hazards
[Figure: stages 1 to 4 in sequence, with feedback signals FB1 to FB4 running back to earlier stages]
- Detect a hazard and provide feedback to previous stages to stall or kill instructions
- Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur)

12 Interlocks to Resolve Data Hazards
[Figure: pipelined datapath with a stall condition that holds the PC and the decode stage and inserts a nop into the execute stage]
    r1 ← r0 + 10
    r4 ← r1 + 17

13 Stalled Stages and Pipeline Bubbles
time                  t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← (r0) + 10   IF1  ID1  EX1  MA1  WB1
(I2) r4 ← (r1) + 17        IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                            IF3  IF3  IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                                IF4  ID4  EX4  MA4  WB4
(I5)                                                     IF5  ID5  EX5  MA5  WB5
The repeated ID2 and IF3 entries are the stalled stages.
Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I3   I3   I3   I4   I5
ID         I1   I2   I2   I2   I2   I3   I4   I5
EX              I1   nop  nop  nop  I2   I3   I4   I5
MA                   I1   nop  nop  nop  I2   I3   I4   I5
WB                        I1   nop  nop  nop  I2   I3   I4   I5
nop ⇒ pipeline bubble
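
The stall pattern on this slide can be reproduced with a small cycle-by-cycle model. This is a sketch under stated assumptions, not the lecture's design: there is no bypassing, an instruction in decode stalls while any older instruction in EX, MA, or WB writes one of its source registers, and r0 is never treated as a destination. The function name `simulate_interlock` and the `(dest, sources)` program encoding are hypothetical.

```python
def simulate_interlock(program):
    """program: list of (dest_reg_or_None, tuple_of_source_regs).

    Returns one list per cycle giving the instruction index (or None for an
    empty slot or bubble) held in [IF, ID, EX, MA, WB].
    """
    pipe = [None] * 5
    next_instr = 0
    trace = []
    while True:
        # Fetch a new instruction whenever the IF slot is free.
        if pipe[0] is None and next_instr < len(program):
            pipe[0] = next_instr
            next_instr += 1
        if all(slot is None for slot in pipe):
            break
        trace.append(list(pipe))

        # Stall if the decode-stage instruction reads a register that an
        # older, uncommitted instruction (EX, MA, WB) will write.
        stall = False
        if pipe[1] is not None:
            sources = program[pipe[1]][1]
            for older in pipe[2:]:
                if older is not None:
                    dest = program[older][0]
                    if dest is not None and dest != 0 and dest in sources:
                        stall = True

        # Advance the back end; on a stall, IF and ID hold and EX gets a bubble.
        pipe[4] = pipe[3]
        pipe[3] = pipe[2]
        if stall:
            pipe[2] = None
        else:
            pipe[2] = pipe[1]
            pipe[1] = pipe[0]
            pipe[0] = None
    return trace

if __name__ == "__main__":
    # I1: r1 <- (r0) + 10, I2: r4 <- (r1) + 17, then three independent instructions.
    prog = [(1, (0,)), (4, (1,)), (5, (6,)), (7, (8,)), (9, (10,))]
    for cycle, slots in enumerate(simulate_interlock(prog)):
        print(f"t{cycle}", slots)
```

In this model the second instruction occupies decode for four cycles and three bubbles enter the execute stage, matching the diagram.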

14 Interlock Control Logic
[Figure: pipelined datapath with a stall control block C_stall that compares ws from the later stages against rs and rt of the decode-stage instruction]
Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions.

15 Interlock Control Logic (ignoring jumps & branches)
[Figure: C_stall takes rs, rt, re1, re2 from the decode stage (via C_re) and ws, we from each later stage (via C_dest)]
Should we always stall if the rs field matches some rd?
- not every instruction writes a register ⇒ we
- not every instruction reads a register ⇒ re

16 Source & Destination Registers
R-type: op rs rt rd func
I-type: op rs rt immediate16
J-type: op immediate26
instruction  semantics                        source(s)  destination
ALU          rd ← (rs) func (rt)              rs, rt     rd
ALUi         rt ← (rs) op imm                 rs         rt
LW           rt ← M[(rs) + imm]               rs         rt
SW           M[(rs) + imm] ← (rt)             rs, rt
BZ           cond (rs)                        rs
               true:  PC ← (PC) + imm
               false: PC ← (PC) + 4
J            PC ← (PC) + imm
JAL          r31 ← (PC), PC ← (PC) + imm                 31
JR           PC ← (rs)                        rs
JALR         r31 ← (PC), PC ← (rs)            rs         31

17 Deriving the Stall Signal
C_dest:
  ws = Case opcode
         ALU            ⇒ rd
         ALUi, LW       ⇒ rt
         JAL, JALR      ⇒ R31
  we = Case opcode
         ALU, ALUi, LW  ⇒ (ws ≠ 0)
         JAL, JALR      ⇒ on
         ...            ⇒ off
C_re:
  re1 = Case opcode
          ALU, ALUi, LW, SW, BZ, JR, JALR  ⇒ on
          J, JAL                           ⇒ off
  re2 = Case opcode
          ALU, SW  ⇒ on
          ...      ⇒ off
C_stall:
  stall = ((rs_D = ws_E).we_E + (rs_D = ws_M).we_M + (rs_D = ws_W).we_W).re1_D
        + ((rt_D = ws_E).we_E + (rt_D = ws_M).we_M + (rt_D = ws_W).we_W).re2_D
This is not the full story!
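
The case tables on this slide translate almost directly into code. The sketch below assumes the write-enable signal is called `we` (by analogy with the read enables `re1` and `re2`) and uses a hypothetical `Instr` record with the opcode classes named on the slides; it illustrates the equations and is not the lecture's implementation.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    opcode: str      # "ALU", "ALUi", "LW", "SW", "BZ", "J", "JAL", "JR", "JALR", "NOP"
    rs: int = 0
    rt: int = 0
    rd: int = 0

def ws(i):
    """Destination register specifier, or None if no register is written."""
    if i.opcode == "ALU":
        return i.rd
    if i.opcode in ("ALUi", "LW"):
        return i.rt
    if i.opcode in ("JAL", "JALR"):
        return 31
    return None

def we(i):
    """Write enable: writes to r0 are ignored for ALU, ALUi, and LW."""
    if i.opcode in ("ALU", "ALUi", "LW"):
        return ws(i) != 0
    return i.opcode in ("JAL", "JALR")

def re1(i):
    """The instruction reads the register named by rs."""
    return i.opcode in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")

def re2(i):
    """The instruction reads the register named by rt."""
    return i.opcode in ("ALU", "SW")

def stall(d, e, m, w):
    """Stall decode if an older instruction in E, M, or W writes a register d reads."""
    older = (e, m, w)
    rs_hit = any(d.rs == ws(o) and we(o) for o in older)
    rt_hit = any(d.rt == ws(o) and we(o) for o in older)
    return (rs_hit and re1(d)) or (rt_hit and re2(d))
```

For example, `stall(Instr("ALUi", rs=1, rt=4), Instr("ALUi", rs=0, rt=1), Instr("NOP"), Instr("NOP"))` evaluates to `True`, because the decode-stage instruction reads r1 while the execute-stage instruction has yet to write it.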

18 Hazards due to Loads & Stores
[Figure: pipelined datapath with stall condition, executing the sequence below]
    M[(r1)+7] ← (r2)
    r4 ← M[(r3)+5]
Is there any possible data hazard in this instruction sequence? What if (r1)+7 = (r3)+5?

19 Load & Store Hazards
    M[(r1)+7] ← (r2)
    r4 ← M[(r3)+5]
(r1)+7 = (r3)+5 ⇒ data hazard
However, the hazard is avoided because our memory system completes writes in one cycle!
Load/Store hazards, even when they do exist, are often resolved in the memory system itself. More on this later in the course.
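
Unlike register hazards, this hazard cannot be detected by comparing register specifiers in the decode stage, because it depends on the register values. A minimal sketch of the address comparison, with hypothetical names `effective_address` and `store_load_alias` and a plain dictionary standing in for the register file:

```python
def effective_address(regs, base, offset):
    """Address computed by a load or store: contents of the base register plus the offset."""
    return regs[base] + offset

def store_load_alias(regs, store, load):
    """store and load are (base_register, offset) pairs; True if they touch the same address."""
    return effective_address(regs, *store) == effective_address(regs, *load)

if __name__ == "__main__":
    regs = {1: 10, 3: 12}   # illustrative register contents
    # M[(r1)+7] <- (r2) followed by r4 <- M[(r3)+5]
    print(store_load_alias(regs, store=(1, 7), load=(3, 5)))
```

With these illustrative values both instructions address location 17, so the load depends on the store even though the two instructions name different base registers.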

20 Five-minute break to stretch your legs

21 Complications due to Jumps
[Figure: fetch and decode stages with the PCSrc mux (pc+4 / jabs / rind / br), the stall signal, and a Jump? detector on the decode-stage IR; PC = 104]
    I1  096  ADD
    I2  100  J 200
    I3  104  ADD   (kill)
    I4  304  ADD
Note: fetching the next instruction before decode is speculation.
A jump instruction kills (not stalls) the following instruction. How?

22 Pipelining Jumps
[Figure: as on the previous slide, with a mux (IRSrc_D) in front of the decode-stage IR that selects between the fetched instruction and a nop]
To kill a fetched instruction, insert a mux before IR.
  IRSrc_D = Case opcode_D
              J, JAL  ⇒ nop
              ...     ⇒ IM
Any interaction between stall and jump?

23 Jump Pipeline Diagrams
time              t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) 096: ADD     IF1  ID1  EX1  MA1  WB1
(I2) 100: J 200        IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD               IF3  nop  nop  nop  nop
(I4) 304: ADD                    IF4  ID4  EX4  MA4  WB4
Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I4   I5
ID         I1   I2   nop  I4   I5
EX              I1   I2   nop  I4   I5
MA                   I1   I2   nop  I4   I5
WB                        I1   I2   nop  I4   I5
nop ⇒ pipeline bubble

24 Pipelining Conditional Branches
[Figure: fetch, decode, and execute stages with the PCSrc mux (pc+4 / jabs / rind / br), the stall signal, a BEQZ? detector on the execute-stage IR, and a zero? test on ALU input A; instruction sequence ADD, BEQZ (at address 100), ADD, ADD; PC = 104]
Branch condition is not known until the execute stage. What action should be taken in the decode stage?

25 Pipelining Conditional Branches (continued)
[Figure: as on the previous slide, with PC = 108 and the stall signal marked with a question mark]
If the branch is taken:
- kill the two following instructions
- the instruction at the decode stage is not valid ⇒ stall signal is not valid

26 Pipelining Conditional Branches (continued)
[Figure: as on the previous slide, with a second nop mux (IRSrc_E) in front of the execute-stage IR in addition to IRSrc_D; PC = 108]
If the branch is taken:
- kill the two following instructions
- the instruction at the decode stage is not valid ⇒ stall signal is not valid

27 New Stall Signal
stall = ( ((rs_D = ws_E).we_E + (rs_D = ws_M).we_M + (rs_D = ws_W).we_W).re1_D
        + ((rt_D = ws_E).we_E + (rt_D = ws_M).we_M + (rt_D = ws_W).we_W).re2_D )
        . !((opcode_E = BEQZ).z + (opcode_E = BNEZ).!z)
Don't stall if the branch is taken. Why? The instruction at the decode stage is invalid.
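
The extra factor in this equation simply masks the register-comparison result whenever a taken branch is in the execute stage. A minimal sketch, with hypothetical function names and the register-comparison part passed in as a precomputed boolean `raw_stall`:

```python
def branch_taken(opcode_e, z):
    """True if the execute-stage instruction is a conditional branch that is taken.

    z is the result of the zero test on the branch's source register.
    """
    return (opcode_e == "BEQZ" and z) or (opcode_e == "BNEZ" and not z)

def new_stall(raw_stall, opcode_e, z):
    """Suppress the stall when a taken branch is about to kill the decode-stage instruction."""
    return raw_stall and not branch_taken(opcode_e, z)
```

If the decode-stage instruction is going to be killed anyway, stalling on its register dependences would only hold up the fetch of the branch target.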

28 Control Equations for PC and IR Muxes
PCSrc = Case opcode_E
          BEQZ.z, BNEZ.!z  ⇒ br
          ...              ⇒ Case opcode_D
                               J, JAL    ⇒ jabs
                               JR, JALR  ⇒ rind
                               ...       ⇒ pc+4
IRSrc_D = Case opcode_E
            BEQZ.z, BNEZ.!z  ⇒ nop
            ...              ⇒ Case opcode_D
                                 J, JAL, JR, JALR  ⇒ nop
                                 ...               ⇒ IM
IRSrc_E = Case opcode_E
            BEQZ.z, BNEZ.!z  ⇒ nop
            ...              ⇒ stall.nop + !stall.IR_D
Give priority to the older instruction, i.e., execute stage instruction over decode stage instruction.
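
The nested case statements encode a priority order: the execute-stage branch is examined first and the decode-stage jump second. The sketch below spells that out; the function names and the string return values are illustrative, and the mux names `IRSrc_D` and `IRSrc_E` are assumed from the datapath figures.

```python
def branch_taken(opcode_e, z):
    return (opcode_e == "BEQZ" and z) or (opcode_e == "BNEZ" and not z)

def pc_src(opcode_e, opcode_d, z):
    """Select the next PC; the older (execute-stage) instruction wins."""
    if branch_taken(opcode_e, z):
        return "br"
    if opcode_d in ("J", "JAL"):
        return "jabs"
    if opcode_d in ("JR", "JALR"):
        return "rind"
    return "pc+4"

def ir_src_d(opcode_e, opcode_d, z):
    """Select what enters the decode-stage IR: the fetched instruction (IM) or a nop."""
    if branch_taken(opcode_e, z):
        return "nop"
    if opcode_d in ("J", "JAL", "JR", "JALR"):
        return "nop"
    return "IM"

def ir_src_e(opcode_e, z, stall):
    """Select what enters the execute-stage IR: the decode-stage instruction or a nop."""
    if branch_taken(opcode_e, z):
        return "nop"
    return "nop" if stall else "IR_D"
```

For instance, with a taken BEQZ in execute and a J in decode, `pc_src("BEQZ", "J", True)` returns `"br"`: the jump was fetched speculatively and is itself killed.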

29 Branch Pipeline Diagrams (resolved in execute stage)
time                 t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) 096: ADD        IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQZ 200        IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                  IF3  ID3  nop  nop  nop
(I4) 108:                           IF4  nop  nop  nop  nop
(I5) 304: ADD                            IF5  ID5  EX5  MA5  WB5
Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   nop  I5
EX              I1   I2   nop  nop  I5
MA                   I1   I2   nop  nop  I5
WB                        I1   I2   nop  nop  I5
nop ⇒ pipeline bubble

30 Reducing Branch Penalty (resolve in decode stage)
One pipeline bubble can be removed if an extra comparator is used in the Decode stage.
[Figure: fetch and decode stages with the PCSrc mux (pc+4 / jabs / rind / br) and a zero detect on the register file output]
Pipeline diagram now same as for jumps.

31 Branch Delay Slots (expose control hazard to software)
- Change the ISA semantics so that the instruction that follows a jump or branch is always executed
- This gives the compiler the flexibility to put in a useful instruction where normally a pipeline bubble would have resulted
    ADD
    BEQZ   (at address 100)
    ADD    (delay slot instruction: executed regardless of branch outcome)
    ADD
Other techniques include branch prediction, which can dramatically reduce the branch penalty... to come later

32 Bypassing
time                  t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← (r0) + 10   IF1  ID1  EX1  MA1  WB1
(I2) r4 ← (r1) + 17        IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                            IF3  IF3  IF3  IF3  ID3  EX3  MA3
(I4)                                                IF4  ID4  EX4
(I5)                                                     IF5  ID5
The repeated ID2 and IF3 entries are the stalled stages.
Each stall or kill introduces a bubble in the pipeline ⇒ CPI > 1
A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input:
time                  t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← (r0) + 10   IF1  ID1  EX1  MA1  WB1
(I2) r4 ← (r1) + 17        IF2  ID2  EX2  MA2  WB2
(I3)                            IF3  ID3  EX3  MA3  WB3
(I4)                                 IF4  ID4  EX4  MA4  WB4
(I5)                                      IF5  ID5  EX5  MA5  WB5
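
The cost of the bubbles in the two diagrams can be put into numbers. A minimal sketch, assuming a 5-stage pipeline and counting the three bubbles that the interlock inserts for this sequence; `total_cycles` and `steady_state_cpi` are illustrative names.

```python
PIPELINE_DEPTH = 5

def total_cycles(num_instructions, bubbles):
    """Cycles to drain the pipeline: fill time, one cycle per instruction, plus bubbles."""
    return num_instructions + PIPELINE_DEPTH - 1 + bubbles

def steady_state_cpi(num_instructions, bubbles):
    """Cycles per instruction, ignoring the one-time pipeline fill."""
    return (num_instructions + bubbles) / num_instructions

if __name__ == "__main__":
    print(total_cycles(5, 3), steady_state_cpi(5, 3))   # interlocked: 3 bubbles
    print(total_cycles(5, 0), steady_state_cpi(5, 0))   # bypassed: no bubbles
```

For the five-instruction sequence this gives 12 cycles with the interlock and 9 with the bypass, a steady-state CPI of 1.6 against 1.0.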

33 Adding a Bypass
[Figure: pipelined datapath with a mux (ASrc) on ALU input A that selects between the register file output and the ALU output of the previous instruction]
When does this bypass help?
(I1)               (I2)             bypass helps?
r1 ← r0 + 10       r4 ← r1 + 17     yes
r1 ← M[r0 + 10]    r4 ← r1 + ...    no
JAL 500            r4 ← r31 + ...   no

34 The Bypass Signal: Deriving It from the Stall Signal
stall = ( ((rs_D = ws_E).we_E + (rs_D = ws_M).we_M + (rs_D = ws_W).we_W).re1_D
        + ((rt_D = ws_E).we_E + (rt_D = ws_M).we_M + (rt_D = ws_W).we_W).re2_D )
ws = Case opcode
       ALU            ⇒ rd
       ALUi, LW       ⇒ rt
       JAL, JALR      ⇒ R31
we = Case opcode
       ALU, ALUi, LW  ⇒ (ws ≠ 0)
       JAL, JALR      ⇒ on
       ...            ⇒ off
ASrc = (rs_D = ws_E).we_E.re1_D
Is this correct? No, because only ALU and ALUi instructions can benefit from this bypass.
Split we_E into two components: we-bypass, we-stall

35 Bypass and Stall Signals
Split we_E into two components: we-bypass, we-stall
we-bypass_E = Case opcode_E
                ALU, ALUi  ⇒ (ws ≠ 0)
                ...        ⇒ off
we-stall_E = Case opcode_E
               LW         ⇒ (ws ≠ 0)
               JAL, JALR  ⇒ on
               ...        ⇒ off
ASrc = (rs_D = ws_E).we-bypass_E.re1_D
stall = ((rs_D = ws_E).we-stall_E + (rs_D = ws_M).we_M + (rs_D = ws_W).we_W).re1_D
      + ((rt_D = ws_E).we_E + (rt_D = ws_M).we_M + (rt_D = ws_W).we_W).re2_D
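
The split of the execute-stage write enable can be sketched as two predicates on the execute-stage instruction. The names `we_bypass`, `we_stall`, and `asrc` and the `Instr` record are assumptions made for illustration; only the A-input bypass described on this slide is modeled.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    opcode: str      # "ALU", "ALUi", "LW", "SW", "BZ", "JAL", "JALR", "NOP", ...
    rs: int = 0
    rt: int = 0
    rd: int = 0

def ws(i):
    """Destination register specifier, or None if no register is written."""
    if i.opcode == "ALU":
        return i.rd
    if i.opcode in ("ALUi", "LW"):
        return i.rt
    if i.opcode in ("JAL", "JALR"):
        return 31
    return None

def we_bypass(e):
    # ALU and ALUi results appear at the ALU output, so they can be forwarded.
    return e.opcode in ("ALU", "ALUi") and ws(e) != 0

def we_stall(e):
    # Per the slide, LW, JAL, and JALR cannot use this bypass and still force a stall.
    if e.opcode == "LW":
        return ws(e) != 0
    return e.opcode in ("JAL", "JALR")

def re1(d):
    return d.opcode in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")

def asrc(d, e):
    """True when ALU input A should take the bypassed value rather than the register read."""
    return d.rs == ws(e) and we_bypass(e) and re1(d)
```

With a consumer `Instr("ALUi", rs=1, rt=4)`, an ALUi that writes r1 in the execute stage selects the bypass, while a LW that writes r1 does not and must be handled by the stall term.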

36 Fully Bypassed Datapath
[Figure: pipelined datapath with bypass muxes ASrc and BSrc on both ALU inputs, fed from the later stages, plus a path carrying the PC for JAL, ...]
Is there still a need for the stall signal?
stall = (rs_D = ws_E).(opcode_E = LW_E).(ws_E ≠ 0).re1_D
      + (rt_D = ws_E).(opcode_E = LW_E).(ws_E ≠ 0).re2_D
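
With full bypassing, the only remaining stall is the load-use case. A minimal sketch of that equation, using a hypothetical `Instr` record and taking the destination of a LW to be its rt field as in the register table earlier:

```python
from dataclasses import dataclass

@dataclass
class Instr:
    opcode: str
    rs: int = 0
    rt: int = 0

READS_RS = {"ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR"}   # re1 on
READS_RT = {"ALU", "SW"}                                       # re2 on

def load_use_stall(d, e):
    """Stall decode only when the execute-stage instruction is a load whose
    destination (rt, not r0) is a source of the decode-stage instruction."""
    if e.opcode != "LW" or e.rt == 0:
        return False
    uses_rs = d.rs == e.rt and d.opcode in READS_RS
    uses_rt = d.rt == e.rt and d.opcode in READS_RT
    return uses_rs or uses_rt
```

An ALUi in decode whose rt matches the load's destination does not stall, because rt is its destination rather than a source.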

37 Why an Instruction May Not Be Dispatched Every Cycle (CPI > 1)
- Full bypassing may be too expensive to implement
  - typically all frequently used paths are provided
  - some infrequently used bypass paths may increase cycle time and counteract the benefit of reducing CPI
- Loads have two-cycle latency
  - the instruction after a load cannot use the load result
  - MIPS-I ISA defined load delay slots, a software-visible pipeline hazard (compiler schedules an independent instruction or inserts a NOP to avoid the hazard). Removed in MIPS-II.
- Conditional branches may cause bubbles
  - kill following instruction(s) if no delay slots
Machines with software-visible delay slots may execute a significant number of NOP instructions inserted by the compiler.
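
These effects can be combined into a rough CPI estimate. The sketch below assumes one bubble per load-use dependence (fully bypassed datapath) and two bubbles per taken branch (resolved in the execute stage, no delay slots); the function name and the fractions in the example are hypothetical, not measurements from the lecture.

```python
def estimated_cpi(load_use_fraction, taken_branch_fraction,
                  load_use_bubbles=1, branch_bubbles=2):
    """Base CPI of 1 plus the average number of bubbles per instruction.

    load_use_fraction: fraction of instructions that immediately use a load result.
    taken_branch_fraction: fraction of instructions that are taken branches.
    """
    return (1.0
            + load_use_fraction * load_use_bubbles
            + taken_branch_fraction * branch_bubbles)

if __name__ == "__main__":
    # Hypothetical instruction mix, chosen only to illustrate the arithmetic.
    print(estimated_cpi(load_use_fraction=0.25, taken_branch_fraction=0.125))
```

Resolving branches in the decode stage corresponds to calling the function with `branch_bubbles=1`.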

38 Thank you!