Pipelining: Its Natural!

Transcription

1 Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes, Dryer takes 40 minutes, Folder takes 20 minutes T a s k O r d e r 6 P Time Pipelining doesn t help latency of single task, it helps throughput of entire workload Pipeline rate limited by slowest pipeline stage ultiple tasks operating simultaneously Potential speedup = Number pipe stages nbalanced lengths of pipe stages reduces speedup Time to fill pipeline and time to drain it reduces speedup (1) Computer Pipelines IPS desirable features: all instructions same length, registers located in same place in instruction format, operands only in loads or stores We will first review a nonpipelined IPS architecture Review you should know about: multiplexors, register files, AL s arithmetic Vs logical shifts program counter (PC), status word (PS) and instruction register (IR), emory ress registers (AR) and data registers (DR) combinational Vs sequential circuits (2) Page 1

2 Implementing the IPS architecture Arithmetic/logic instructions 1) fetch instruction 2) read registers 3) compute (use AL) 4) write to register emory instructions 1) fetch instruction 2) read registers 3) compute ress (use AL) 4) write/read to/from ) write to register (for load) Branch instructions 1) fetch instruction 2) read registers 3) compute branch ress (use AL) 4) evaluate branch condition ) update the PC (condition satisfied) Increment the PC (3) fetch IR em[pc] NPC PC Add N P C ress (cache) Data out PC (cache) IR ress Data in data (cache) Data out (4) Page 2

3 Register read To decoder (control) 4 Add N P C 6 Read ress 1 Read ress 2 Write ress Write data Register file Read data 1 Read data 2 PC IR Register file A B A Regs[IR 610 ] B Regs[IR 111 ] Imm (IR 16 ) 48 ## IR Sign extend 64 Imm () se AL and evaluate branch condition 4 Add N P C Zero? cond PC IR Register file A B = AL AL out 16 Sign extend 64 Imm (the multiplexors settings and AL function depend on the opcode) (6) Page 3

4 Page 4 (7) se Add PC 4 N P C IR Register file A B Sign extend 16 Imm AL AL out cond Zero? data LD 64 (8) Write to register Add PC 4 N P C IR Register file A B Sign extend 16 Imm AL AL out cond Zero? data LD 64

5 The control signals PC L2 4 Add L3 N P C IR 16 L4 Register file Sign extend 64 A B Imm Zero? 2 1 AL OP cond AL out L1 data L LD 3 (9) The control signals 1, 2, 3 = select multiplexor input (one bit each) L1 L2 L3 L4 L OP =0 =1 = set if the instruction is a branch (one bit) = loads the PC (one bit) = read the instruction (one bit) = read/write register (two bits) 00 = noop, 01 = read, 10 = write = read/write the data (two bits) 00 = noop, 01 = read, 10 = write = AL control (?? Bits) 00 = noop, 11 =, 10 = others (10) Page

6 Pipelining IPS IF/ID ID/E E/E E/WB 4 Add Zero? PC Register file AL data 16 Sign extend 64 (11) IF/ID ID/E E/E E/WB 4 Add Zero? PC Register file AL data 16 Sign extend 64 (12) Page 6

7 Visualizing pipelining CC1 CC2 CC3 CC4 CC CC6 Clock cycles I Reg D Reg I Reg D Reg I Reg D Reg order I Reg D Reg Pipelining puts itional demands on bandwidth, (13) Visualizing pipelining CC1 CC2 CC3 CC4 CC CC6 Clock cycles lw sw I Reg Alu em WB I Reg Alu em WB I Reg Alu em WB I Reg Alu em WB lw lw sw lw sw lw Clock cycles sw sw lw (14) Page 7

8 Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: HW cannot support a combination of instructions Data hazards: depends on result of prior instruction still in the pipeline Control hazards: Pipelining of branches & other instructions stall the pipeline until the hazard bubbles in the pipeline (1) Structural Hazards (assuming a single ) CC1 CC2 CC3 CC4 CC CC6 Clock cycles Load I Reg D Reg I Reg D Reg I Reg D Reg I Reg D Reg (16) Page 8

9 Structural Hazards (assuming a single ) CC1 CC2 CC3 CC4 CC CC6 Clock cycles Load I Reg D Reg I Reg D Reg I Reg D Reg I Reg D Reg (17) Structural hazard in register file CC1 CC2 CC3 CC4 CC CC6 Clock cycles Load I Reg D Reg I Reg D Reg I Reg D Reg I Reg D Reg (18) Page 9

10 Speed p Equation for Pipelining CPI pipelined = Ideal CPI + stall cycles per instruction Speedup = CPI unpipelined CPI pipelined x Clock Cycle unpipelined Clock Cycle pipelined Example: achine A: pipelined (with some depth) and dual ported achine B: pipelined (same depth as A), but single ported, and a 10 times faster clock rate Ideal CPI = 1 for both, and loads are 40% of instructions executed Speedp A = Pipeline Depth Speedp B = (Pipeline Depth/14) x 10 = 07 x Pipeline Depth Speedp A / Speedp B = 133 (19) Registertoregister data Hazards Add R1, R2, R3 I Reg D Reg Dependence: produces R1 consumed by following instructions Sub R4, R1, R4 I Reg D Reg And R, R1, R I Reg D Reg Add R6, R1, R6 I Reg D Reg Add R7, R1, R7 I Reg D Reg (20) Page 10

11 Three types of Data Dependence Instr i is later in the pipeline than Instr j j depends on i for operand R 1 Read After Write (RAW) Instr j tries to read operand before Instr i writes it 2 Write After Read (WAR) Instr j tries to write operand before Instr i reads it Gets wrong operand Can t happen in IPS stage pipeline (why?) 3 Write After Write (WAW) Instr j tries to write operand before Instr i writes it Leaves wrong result Can t happen in IPS stage pipeline (why?) Will see WAR and WAW in later more complicated pipes (21) Forwarding to Avoid Data Hazard I Add R1, R2, R3 Reg D Reg Sub R4, R1, R4 I Reg D Reg And R, R1, R I Reg D Reg Add R6, R1, R6 I Reg D Reg Add R7, R1, R7 I Reg D Reg (22) Page 11

12 HW Change for Forwarding IF/ID ID/E E/E E/WB PC Register file AL data (23) Forwarding Control Forwarding Paths s t t d t d d A u x u x 0 0 AL data B ID/E E/E E/WB (2) Page 12

13 Detection & Activation for Forwarding Control Src Type Detection Condition Input Action Priority RR E/Erd == ID/Ers ux A 1 1 RR E/Erd == ID/Ert ux B 1 1 RR E/WBrd == ID/Ers ux A 2 2 RR E/WBrd == ID/Ert ux B 2 2 Imm E/Ert == ID/Ers ux A 1 1 Imm E/Ert == ID/Ert ux B 1 1 Imm E/WBrt == ID/Ers ux A 2 2 Imm E/WBrt == ID/Ert ux B 2 2 Ld E/WBrt == ID/Ers ux A 3 2 Ld E/WBrt == ID/Ers ux B 3 2 Src Type = Producer instruction opcode type Action = ux setting; if no match, then ux selection is 0 Priority = Which detection condition takes precedence (note, multiple can match) (26) Data Hazard Even with Forwarding lw r1, 0(r2) I Reg D Reg sub r4,r1,r6 I Reg D Reg and r6,r1,r7 I Reg D Reg or r8,r1,r9 I Reg D Reg Cannot travel back in time Need to stall (interlock) the pipe if: The instruction in ID/E is a lw The instruction in IF/ID uses register Rs1 and/or Rs2 Rd for the instruction in ID/E = Rs1 or Rs2 How can we interlock (stall) the pipeline? (27) Page 13

14 Interlock on Loadse Detect a load followed by a dependent use Src Type Dst Type Condition Ld RegisterRegister ID/Ert == IF/IDrs Ld RegisterRegister ID/Ert == IF/IDrt Ld Immediate ID/Ert == IF/IDrs Insert the bubble on detection Disable write (no updates): PC, ID/IF register Clear the ID/E register (inserting a nop) (28) Software Scheduling to Avoid Load Hazards Try producing fast code for a = b + c; d = e f; assuming a, b, c, d,e, and f in Slow code: LW LW ADD SW LW LW SB SW d,rd Rb,b Rc,c Ra,Rb,Rc a,ra Re,e Rf,f Rd,Re,Rf Fast code: LW Rb,b LW Rc,c LW Re,e ADD Ra,Rb,Rc LW Rf,f SW a,ra SB Rd,Re,Rf SW d,rd (29) Page 14

15 Control Hazard on Branches IF/ID ID/E E/E E/WB 4 Add Zero? PC Register file AL data 16 Sign extend 64 (30) Control Hazard on Branches I Reg Alu em WB beqz beqz sw beqz sub sw beqz Clock cycles sub sw sub sw beqz sub sw sub (31) Page 1

16 Branch Stall Impact If CPI = 1, 10% branch, Stall 3 cycles => new CPI = 13 Two part solution: 1 Determine branch taken or not sooner, AND 2 Compute taken branch ress earlier IPS branch tests if two registers are equal IPS solution ove Zero test to ID/E stage Adder to calculate branch target ress in ID/E stage 1 clock cycle penalty for branch versus 3 (32) Early determination of Branch condition and target IF/ID ID/E E/E E/WB 4 Add Add Zero? PC Register file = AL data 16 Sign extend 32 (33) Page 16

17 Four Branch Hazard Alternatives #1: Stall until branch direction is clear #2: Predict Branch Not Taken Execute successor instructions in sequence Squash instructions in pipeline if branch actually taken (how?) Advantage of late pipeline state update 47% IPS branches not taken on average PC+4 already calculated, so use it to get next instruction #3: Predict Branch Taken 3% IPS branches taken on average But haven t calculated branch target ress in IPS» IPS still incurs 1 cycle branch penalty» Other machines: branch target known before outcome (34) Four Branch Hazard Alternatives #4: Delayed Branch Define branch to take place AFTER a following instruction branch instruction sequential successor 1 sequential successor 2 sequential successor n branch target if taken Branch delay of length n 1 slot delay allows proper decision and branch target ress in stage pipeline Delay Slot (DSI) IPS uses this (3) Page 17

18 Example of DSI top:!lw!lw!!r1,0(r2)!!r3,4(r2)!!!r1,r1,r3!!sw!r1,0(r2)!!i!r2,r2,4!!i!r,r,1!!bne!r,r0,top!!delay slot instruction!!!!! (36) Example of DSI top:!lw!lw!r1,0(r2)!!r3,4(r2)!!!r1,r1,r3!!sw!r1,0(r2)!!i!r2,r2,4!!i!r,r,1!!bne!r,r0,top!!nop!!!!! Dependence chain (37) Page 18

19 Example of DSI top:!lw!lw!!r1,0(r2)!!r3,4(r2)!!!r1,r1,r3!!sw!r1,0(r2)!!i!r2,r2,4!!i!r,r,1!!bne!r,r0,top!!nop!!!!! Dependence chain (38) Example of DSI top:!lw!lw!!r1,0(r2)!!r3,4(r2)!!!r1,r1,r3!!sw!r1,0(r2)!!i!r2,r2,4!!i!r,r,1!!bne!r,r0,top!!nop!!!!! Dependence chain (39) Page 19

20 Example of DSI top:!lw!lw!!!r1,0(r2)!!r3,4(r2)!!!r1,r1,r3!!sw!r1,0(r2)!!i!r,r,1!!bne!r,r0,top!!i!r2,r2,4!!!!! Dependence chain (40) Example of DSI top:!lw!lw!!r1,0(r2)!!r3,4(r2)!!!r1,r1,r3!!sw!r1,0(r2)!!i!r2,r2,4!!bne!r2,r0,top!!nop!!!!! Can we move the i? (41) Page 20

21 Where do we get instructions to fill branch delay slot? Before branch instruction From the target ress: should repeat instruction for correctness From fall through: correct if R7 is dead after the branch (42) ulticycle pipelines Ex Int IF ID FP mult FP mult FP FP FP mult em WB Div Assume 4stage, pipelined, FP 7stage, pipelined, FP multiply 2 stage, nonpipelined divide unit Latency of an instruction, I, in a pipeline, P, is the number of bubbles that has to exist in P if the instruction following I wants to use the result of I (44) Page 21

22 Hazards: Two divide instructions will stall the pipe (structural hazards) ay have more than one register write in one cycle (why?) increase number of ports, or stall the pipeline (interlock) ay have WAW hazard (why?) Outoforder completion causes problems with exceptions, The long pipes causes more RAW hazards (why?) To deal with structural Hazards: Stall a conflicting instruction at the ID stage se a shift register to keep track of the utilization of a stage that may suffer from structural hazard (ex Input ports of registers) Stall a conflicting instruction when entering the em stage may give priority to longer instructions to reduce RAW hazards (4) Examples LD F4, 0(R2) uld F0, F4, F6 IF ID IF E ID E WB E WB AddD F2, F0, F8 SD F2, 0(R2) IF ID IF A1 ID A2 E A3 A4 E WB E Stall due to RAW hazards instruction LD F0,F4,F6 IF ID em WB IF ID E em WB IF ID E em WB AddD F2,F4,F6 IF ID A1 A2 A3 A4 em WB IF ID E em WB IF ID E em WB LD F2, 0(R2) IF ID Ex em WB Structural hazards (46) Page 22

23 To deal with WAW: WAW occurs only if a useless instruction is executed If there is a use in between the writes, then a RAW will stall the pipe (if no forwarding is used in this case forwarding is not possible) ay detect hazard and hold the second instruction, ay detect hazard and prevent the first instruction from writing ay detect hazard and replace the first instruction with a noop Can all hazards be detected at the ID stage? To maintain precise exceptions: ay ignore the problem, or buffer the results and enforce the order of writes, or let the trap handling routine enforce the preciseness (software approach), or delay the issue (stall the pipe) to enforce inorder completion (47) set design and pipelining Variable instruction length and execution time leads to imbalance among stages, complicate hazard detection and precise exceptions Caches have similar effects (imbalance pipes) may freeze the entire pipeline on a cache miss Complex ressing modes may change register values may require multiple access self modifying instructions causes pipeline problems Implicitly set condition codes complicates pipeline control hazards (48) Page 23

24 The IPS R bit instruction set (IPS3 ISA) Decomposes access into stages (superpipelining) ses a deeper pipeline IF first half of instruction fetch IS second half of instruction fetch RF instruction decode and register fetch (and check cache tag) E execution: effective ress calculation, AL operation, branch target computation, branch condition evaluation DF first half of data fetch DS second half of data fetch TC check cache tag (to determine if it is a hit) WB write back (49) Forwarding from output is required to instructions that are 3 or 4 cycles later (sooner instructions have to stall) (0) Page 24

25 The IPS R4000 pipeline Branch delay is 3 cycles The architecture supports a singlecycle delayed branch The floatingpoint pipeline: Three units: er ( stages), divider(36 nonpipelined stages) and multiplier (9 stages) Eight hardware units are shared among stages A ( mantissa), D (divide), E(exception test), and N (first and second stages of multiplier), R (rounding), S (shift), (unpack) (1) Page 2