Pipelining: An Assembly Line
EEC 170 Computer Architecture, FQ 2005
Courtesy of Prof. Kent Wilken and Prof. John Owens

Processor Designs So Far
- We have the single-cycle design, with low CPI but high CC.
- We have the multicycle design, with low CC but high CPI.
- We want the best of both: low CC and low CPI.
- This is achieved using pipelining.

Pipelining Is Natural! Laundry Example
- Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
- Washer takes 30 minutes; dryer takes 40 minutes; folder takes 20 minutes.

Laundry Timing
- How long does laundry take with a single-cycle design (wash/dry/fold is one clock)? What is the clock cycle time? CPI? How many clocks?
- How long does laundry take with a multiple-cycle design (the longest of wash/dry/fold is one clock)? What is the CC? CPI? How many clocks?

Sequential Laundry
[timeline chart: tasks A-D in order, 6 PM to midnight]
- Sequential laundry takes 6 hours for 4 loads.
- If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start Work ASAP
[timeline chart: tasks A-D overlapped, 6 PM to midnight]
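The laundry arithmetic above can be checked with a short sketch. The stage times (30/40/20 minutes) and the 4 loads are the classic values implied by the 6-hour and 3.5-hour totals in the slides; the function names are my own.

```python
# Laundry timing, sequential vs. pipelined.
# Assumed stage times: wash 30 min, dry 40 min, fold 20 min; 4 loads.

def sequential_time(loads, stage_times):
    # Each load runs all stages to completion before the next load starts.
    return loads * sum(stage_times)

def pipelined_time(loads, stage_times):
    # The first load passes through every stage; each later load is
    # delayed only by the slowest stage (the pipeline's bottleneck).
    return sum(stage_times) + (loads - 1) * max(stage_times)

stages = [30, 40, 20]                 # wash, dry, fold (minutes)
print(sequential_time(4, stages))     # 360 minutes = 6 hours
print(pipelined_time(4, stages))      # 210 minutes = 3.5 hours
```

This matches the slides: sequential laundry takes 6 hours, pipelined laundry 3.5 hours.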
Laundry Timing
- How long does laundry take with a pipelined design (the longest of wash/dry/fold is one clock)? What is the CC? How many clocks?

Pipelined Laundry: Start Work ASAP
[timeline chart: tasks A-D overlapped, 6 PM to midnight]
- Pipelined laundry takes 3.5 hours for 4 loads.

Pipelining Lessons
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
- The pipeline rate is limited by the slowest pipeline stage.
- Multiple tasks operate simultaneously using different resources.
- Potential speedup = number of pipe stages.
- Unbalanced lengths of pipe stages reduce speedup.
- Time to fill the pipeline and time to drain it reduce speedup.

Sandwich Bar Analogy
- Pipelining: multiple people going through the sandwich bar at the same time.
- If you're in front of the pickles, but you don't need the pickles: stall for dependences.
- Compare: a car assembly line.

Digital System Efficiency
- A synchronous digital system doing too much work between clocks can be inefficient, because the logic can be static for much of the clock period.
[diagram: logic blocks between clocked registers]
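Two of the lessons above, "potential speedup = number of pipe stages" and "unbalanced stages reduce speedup", can be seen numerically. This sketch assumes the 30/40/20-minute laundry stages; as the workload grows, the speedup approaches the ratio of total work to the slowest stage, not the stage count.

```python
# Pipeline speedup vs. workload size for unbalanced stage times.

def speedup(loads, stage_times):
    sequential = loads * sum(stage_times)
    pipelined = sum(stage_times) + (loads - 1) * max(stage_times)
    return sequential / pipelined

for n in (4, 100, 10_000):
    print(n, round(speedup(n, [30, 40, 20]), 3))

# The limit is sum/max = 90/40 = 2.25, below the 3x that an ideal,
# perfectly balanced three-stage pipeline would give.
```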
Digital System Efficiency (cont.)
[diagrams: the same logic, clocked; much of the logic sits idle during each long cycle]

Pipeline Efficiency
- Efficiency can be improved by:
  I. subdividing the logic into multiple stages and shortening the CC.
[diagram: Logic 1, Logic 2, Logic 3 separated by pipeline registers]

Single-Instruction Pipelined Efficiency
- Executing a single instruction is even less efficient. Why?
[diagram: a single instruction occupies only one of the three stages at a time]
Pipeline Efficiency (cont.)
- Efficiency can be improved by:
  I. subdividing the logic into multiple stages and shortening the CC, and
  II. overlapping operation.
[diagram: three instructions in flight, one per stage]

Pipeline Depth
- If three pipeline stages are better than non-pipelined, are four stages better than three?
- What is the limit to the number of stages? What is the CC at that limit?

Why Pipeline?
- Suppose we execute 100 instructions. How long on each architecture?
  - Single-cycle machine: 4.5 ns/cycle, CPI = 1.
  - Multicycle machine: 1.0 ns/cycle, CPI = 4.1.
  - Ideal pipelined machine: 1.0 ns/cycle, CPI = 1 (but remember the fill cost!).

Why Pipeline?
- Single-cycle machine: 4.5 ns/cycle x 1 CPI x 100 inst = 450 ns.
- Multicycle machine: 1.0 ns/cycle x 4.1 CPI (due to instruction mix) x 100 inst = 410 ns.
- Ideal pipelined machine: 1.0 ns/cycle x (1 CPI x 100 inst + 4-cycle fill) = 104 ns.
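The fill-cost term in the pipelined total is a fixed cost, so its relative overhead shrinks as the program grows. A minimal sketch, assuming the 4-cycle fill of a 5-stage, 1-CPI pipeline as in the slides:

```python
# Relative overhead of filling the pipeline, for an ideal 1-CPI machine.
# Assumption: 4 fill cycles (5-stage pipeline), as the slides use.

FILL_CYCLES = 4

def overhead_percent(n_instructions):
    # At 1 CPI, useful work takes n_instructions cycles.
    return 100.0 * FILL_CYCLES / n_instructions

for n in (100, 1_000, 1_000_000):
    print(n, overhead_percent(n))   # 4.0, 0.4, 0.0004 (%)
```

The fixed fill cost is why pipelining pays off for long instruction streams.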
Fill Costs
- Suppose we pipeline different numbers of instructions. What is the overhead of pipelining?
- Ideal pipelined machine: 1.0 ns/cycle, CPI = 1, 100 instructions.
- 1.0 ns/cycle, CPI = 1, 1,000 instructions.
- 1.0 ns/cycle, CPI = 1, 1,000,000 instructions.

Fill Costs
Overhead for the ideal pipelined machine:
- 100 instructions: 100 ns runtime + 4 ns fill overhead: 4% overhead.
- 1,000 instructions: 1,000 ns runtime + 4 ns overhead: 0.4% overhead.
- 1,000,000 instructions: 1,000,000 ns runtime + 4 ns overhead: 0.0004% overhead.

Pipelined Multiplier
- We can take the Wallace-tree multiplier we developed in Chapter 3 and create a two-stage pipeline:
  1. Reduce the partial-product terms.
  2. Do the final 2n-2-bit addition using a carry-lookahead adder.
- The two operations are roughly balanced; both take O(log n) time.

Wallace-Tree Multiplier: 2-Stage Pipeline
[diagram: multiplicand x7..x0 and multiplier y7..y0 feed the partial-product reduction stage; a pipeline register separates it from the final 14-bit adder that produces the result]
Pipelined Processor Design
- We'll take the single-cycle design and transform it into a pipelined design.
- What we formerly called a phase of an instruction's execution will now become a pipeline stage.

Single-Cycle Design
[datapath diagram: PC, instruction memory, register file, ALU, data memory, sign-extend 16 -> 32]

Single-Cycle Design Phases
- Fetch, Decode, Execution, Memory, Writeback.

Pipelined Design
[datapath diagram with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB inserted between the stages]
Load Word, Stage by Stage
[datapath diagrams highlighting the hardware active in each stage]
- Execution: Load Word
- Memory Access: Load Word
- Writeback: Load Word

Store Word, Stage by Stage
- Execution: Store Word
- Memory Access: Store Word
- Writeback: Store Word
R-type, Stage by Stage
[datapath diagrams highlighting the hardware active in each stage]
- Execution: R-type
- Memory Access: R-type
- Writeback: R-type

Branch, Stage by Stage
- Execution: Branch
- Memory Access: Branch
- Writeback: Branch
Overlapping Execution
- Parts of different instructions execute at each pipeline stage during each cycle.
[pipeline diagram, CC1 through CC9:]
  lw $1, 100($0)
  lw $2, 200($0)
  lw $3, 300($0)
  lw $4, 400($0)
  lw $5, 500($0)

Control Lines
[single-cycle datapath with control signals: RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite]

Pipelined Control
[pipelined datapath: control signals are generated in ID and carried along through the ID/EX, EX/MEM, and MEM/WB pipeline registers to the stages that use them]
Overlapping Execution Lines Src Branch lw $, ($) CC CC CC SLL E/E E/ lw $, ($) P C Src em emto isters lw $3, 3($) lw $, ($) Op em lw $5, 5($) Dst nit Pipelined Src E/E E/ E P C isters SLL Src Branch em emto Src P C Src Op Dst Branch em em emto Op em Dst E/E E/ P C Src Pipeline with Pipelined isters E SLL Src E/E Branch em E/ emto Hazards hazards can occur when:. s I and I are both in the pipeline. I follows I 3. I produces a that is used by I he problem: because I has not posted its to the file, I may read an obsolete value. Op em Dst
Data Hazards
- $2 is not written by the sub until the writeback stage, but the following instructions depend on $2 much sooner:
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
- sw requires $2 in CC6, and $2 is written to the register file in CC5, so there is no problem.
- add reads $2 in the second half of CC5, after $2 is written in the first half of CC5, so there is no problem.
- or requires $2 in CC4, but $2 is not written to the file until CC5: hazard.
- and requires $2 in CC3, but $2 is not written to the file until CC5: hazard.
- However, $2 is available from an internal pipeline latch (the EX/MEM register) as soon as the sub finishes its Execution stage, in time for both the and and the or.
Forwarding
- Most data hazards can be resolved by forwarding: sending the internal pipeline result from I1 to the stage where it is needed by I2, which follows it in the pipeline.
- The result is still posted to the register file when I1 reaches the writeback stage.
- Forwarding requires hardware modifications to the datapath.

Datapath without Forwarding
[diagram: the register file feeds the ALU directly]

Datapath with Forwarding
- Forwarding paths carry results from the Memory stage and the Writeback stage back to the Execution stage.
- New muxes select each ALU operand from the register file or from the newest result via forwarding, controlled by ForwardA and ForwardB.
[diagram: the forwarding unit compares EX/MEM.RegisterRd and MEM/WB.RegisterRd against ID/EX.RegisterRs and ID/EX.RegisterRt]

Forwarding Control
Logic used to decide whether to forward from the Memory stage:

  if (EX/MEM.RegWrite                              # must be writing a register
      and (EX/MEM.RegisterRd != 0)                 # $0 is always up to date
      and (EX/MEM.RegisterRd == ID/EX.RegisterRs)) # destination matches EX-stage source
    ForwardA = 10

  if (EX/MEM.RegWrite
      and (EX/MEM.RegisterRd != 0)
      and (EX/MEM.RegisterRd == ID/EX.RegisterRt))
    ForwardB = 10

Logic used to decide whether to forward from the writeback stage:

  if (MEM/WB.RegWrite                              # must be writing a register
      and (MEM/WB.RegisterRd != 0)                 # $0 is always up to date
      and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0)
               and (EX/MEM.RegisterRd == ID/EX.RegisterRs))  # Memory-stage forward is newer, has priority
      and (MEM/WB.RegisterRd == ID/EX.RegisterRs))
    ForwardA = 01

  if (MEM/WB.RegWrite
      and (MEM/WB.RegisterRd != 0)
      and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0)
               and (EX/MEM.RegisterRd == ID/EX.RegisterRt))
      and (MEM/WB.RegisterRd == ID/EX.RegisterRt))
    ForwardB = 01
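The forwarding conditions above can be sketched as an executable function. This is a simplified model for one ALU operand (the Rt path is symmetric); the '10'/'01'/'00' mux encodings follow the textbook convention, and the argument names mirror the pipeline-register fields on the slides.

```python
# Forwarding-unit sketch for one ALU source operand (Rs).
# Returns the 2-bit mux select: '00' = register file,
# '10' = forward from EX/MEM, '01' = forward from MEM/WB.

def forward(ex_mem_regwrite, ex_mem_rd,
            mem_wb_regwrite, mem_wb_rd,
            id_ex_rs):
    # EX hazard: the newest result wins, so check EX/MEM first.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return '10'
    # MEM hazard: only taken when the EX/MEM case did not apply.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return '01'
    return '00'

# sub $2,... sits in EX/MEM while and ...,$2,... is in EX:
print(forward(True, 2, False, 0, 2))   # '10'
```

Note how the priority test encodes the "newest result wins" rule: if the same register is in both EX/MEM and MEM/WB, the EX/MEM copy is forwarded.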
Complete Datapath with Forwarding
[diagram: the forwarding unit reads ID/EX.RegisterRs, ID/EX.RegisterRt, EX/MEM.RegisterRd, and MEM/WB.RegisterRd, and drives the ALU operand muxes]

Forwarding Example, Cycle by Cycle
[diagrams stepping the sequence sub $2,$1,$3; and $12,$2,$5; or $13,$6,$2; add $14,$2,$2; sw $15,100($2) through the pipeline, showing the forwarding-unit comparisons in each clock cycle]

Load Hazards
- For loads, most hazards can be resolved with forwarding, but some cannot:
  lw  $2, 20($1)
  and $4, $2, $5
  or  $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7
- The lw does not have its data until the end of the Memory stage, one cycle too late to forward to the and that immediately follows.

Load Hazard Solution #1
- Have the compiler insert a NOP to space the instructions further apart:
  lw  $2, 20($1)
  nop
  and $4, $2, $5
  or  $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7
Load Hazards
[pipeline diagram: the nop resolves the load hazard]

Load Hazard Solution #2
- Best solution: the compiler reorders (schedules) instructions to resolve the hazard.
- slt is independent of every instruction except the lw, so it can be moved up to resolve the hazard without a nop:
  lw  $2, 20($1)
  slt $1, $6, $7
  and $4, $2, $5
  or  $8, $2, $6
  add $9, $4, $2

Load Hazard Solution #3
- Have the pipeline detect the hazard and automatically inject a nop.
- This reduces program size and reduces cache misses.
- The lw and the instructions before it advance in the next CC; the following instructions stall, and a nop is injected into EX.

Detecting a Load Hazard
- We need to add a new hazard-detection unit. A load hazard occurs under this condition:

  if (ID/EX.MemRead                                # lw in stage 3
      and ((ID/EX.RegisterRt == IF/ID.RegisterRs)
        or (ID/EX.RegisterRt == IF/ID.RegisterRt)))  # stage-2 Rs or Rt uses the lw result
    stall the pipeline

- This logic is correct but not precise (it creates too many stalls). Why?

Hazard Detection Unit
- Detection occurs in the ID pipeline stage.
[diagram: the hazard-detection unit checks ID/EX.MemRead and ID/EX.RegisterRt against IF/ID.RegisterRs and IF/ID.RegisterRt]

Injecting a NOP (Pipeline Bubble)
- Set the instruction's control bits to all zeros; stall IF and ID.
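The stall condition above is small enough to model directly. A sketch, with argument names mirroring the pipeline-register fields (the function name is my own):

```python
# Load-use hazard detection, as on the slide: stall when the instruction
# in EX is a load (ID/EX.MemRead) and its destination (rt) matches either
# source register of the instruction in ID.

def must_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

# lw $2, 20($1) in EX while and $4, $2, $5 is in ID: stall one cycle.
print(must_stall(True, 2, 2, 5))    # True
# slt $1, $6, $7 does not read $2: no stall.
print(must_stall(True, 2, 6, 7))    # False
```

This also makes the slide's "correct but not precise" remark concrete: the check fires even when the ID instruction does not actually read the matching field (for example, when its rt is a destination rather than a source), producing unnecessary stalls.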
Refined NOP Injection
- Setting all control bits to zero is correct but overkill. What is the minimum set of control bits that must be set to zero?

Early Branch Resolution
- The branch is resolved in the ID stage to reduce the penalty.
[diagram: a dedicated adder computes the branch target (PC + 4 plus the shifted offset) and an equality comparator tests the two registers in ID]

Delayed Branches
- Further reduce the branch penalty by delaying the effect of the branch one cycle; i.e., always execute the instruction after the branch (the branch delay slot), regardless of the outcome.
[diagram: beq r7, r0 with a NOP in its delay slot; taken path with probability p, not-taken path with probability 1 - p]

Delayed Branches, Cont.
- The delay-slot instruction must be independent of the branch: "safe".
- An independent instruction can be moved into the delay slot from (a) above the branch, (b) the taken path, or (c) the not-taken path.
- (a) From above: best, because the instruction is always useful.
  [example: add r6, x, y moved from above the beq into its delay slot]
- (b) From the taken path: lacking statistics for the specific branch, taken is more likely (e.g., 2/3). Less valuable than (a) because the fraction of useful work is p rather than 1.
- (c) From the not-taken path: the useful-work fraction is 1 - p.

Branch Target Copying
- If the branch-target instruction is reached via other program paths, it must be copied into the delay slot, not moved, and the branch retargeted to the instruction after the former target.
- The dynamic instruction count decreases, although the static count does not.

Branches in the Delay Slot
- The delay slot following a jump can almost always be filled by moving or copying the jump target.
- A branch generally cannot fill a delay slot: a JAL in the slot results in a return to the wrong address!
Profiled Delayed Branch
- Where a safe instruction can be moved from either the taken or the not-taken path, fill the delay slot based on profile statistics; this maximizes useful work. E.g., for p = 0.38, fill from the not-taken path.

More Branch Optimizations
- A jump is eliminated if it is the only instruction left in its block after the move, so two instructions can be eliminated. E.g., for small p:
  [example: the not-taken block's instruction moves into the delay slot and the trailing jump disappears]
- Better still is to reverse the sense of the branch:
  [example: beqz becomes bnez, the fall-through instruction fills the delay slot, and the jump disappears]

Delayed Branches and Exceptions
- Delayed branches must save the PCs of the next two instructions on an exception, rather than one.
[pipeline diagram: a branch followed by a load in its delay slot; the load takes a page fault, and both PCs must be saved]

Delayed Branching Performance
- For typical integer codes, a useful instruction is executed in the delay slot about 70% of the time, so the branch cost is:
  (cycles lost per branch = 0.3) x (fraction of branches = 0.25) ~= 8% performance loss.
- A significant reduction from 75%, but we still want a smaller branch cost.
- NOTE: if you consider slides 6 & 7 as the baseline, then the CPI for a branch instruction is 4 in the baseline implementation and the 75% performance loss is correct. This is a bit different than what we discussed in class.

Early Branch Resolution
[repeat of the earlier slide: the branch is resolved in the ID stage to reduce the penalty]

Branch Prediction
Use a "magic box" to:
- determine that the instruction being fetched is a branch,
- predict the branch outcome (taken / not taken), and
- produce the target address,
all while the instruction is being fetched!
Branch Prediction
- Resolve the branch in the IF stage if the prediction is correct.
[diagram: the predictor supplies Branch_Taken and the predicted target address to the next-PC mux]

Simple Branch Prediction Table: Branch Target Buffer (BTB)
- The predictor module is a 2^n-entry table indexed by the n lower PC address bits. Its fields are:
  1. the 32 - n upper address bits (the tag),
  2. an IsBranch bit,
  3. the target address (32 bits), and
  4. a prediction bit.
- The table was filled during past executions of the branch.

Branch Target Buffer
- Because the BTB is small, it is faster than instruction memory.
[diagram: the upper PC bits are compared against the stored tag; Hit AND IsBranch AND Predict yields Branch_Taken, along with the predicted target address]

BTB in Context
[datapath diagram: the BTB sits beside instruction memory and drives the next-PC mux]

Two-Bit Branch Predictor
- Branch history using four states (two bits) avoids the loop problem:
  strong not-taken, weak not-taken, weak taken, strong taken.
- Implemented as a state machine.
[state diagram: T transitions move toward strong taken; NT transitions move toward strong not-taken]
- Can think of this predictor as a saturating counter: +1 for taken, -1 for not taken.
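The four-state machine above is exactly a 2-bit saturating counter. A small sketch (the class name and the 0..3 state encoding are my own, with 0 = strong not-taken and 3 = strong taken):

```python
# Two-bit saturating-counter branch predictor:
# states 0..3 = strong not-taken, weak not-taken, weak taken, strong taken.

class TwoBitPredictor:
    def __init__(self, state=1):          # initial state is a design choice
        self.state = state

    def predict(self):
        return self.state >= 2            # predict taken in states 2 and 3

    def update(self, taken):
        # Saturating counter: +1 for taken, -1 for not taken.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor(state=3)              # strong taken, e.g. a loop branch
p.update(False)                           # one loop exit (mispredicted)...
print(p.predict())                        # True: still predicts taken
```

This is the "avoids the loop problem" property: after the single not-taken loop exit, the predictor only weakens to weak taken, so the next loop entry is still predicted correctly, unlike a 1-bit predictor, which would mispredict twice per loop.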
Branch Target Buffer, Complete
- Because the BTB is small, it is faster than instruction memory.
[diagram: the BTB entry now stores the two-bit predictor state; Hit and the state produce Branch_Taken and the predicted target address]

Final BTB in Context
[datapath diagram: the two-bit state from the BTB feeds the next-PC logic]

Confirming Branch Prediction
- The prediction must be confirmed in the ID stage, and the branch-history state updated.
[diagram: the ID-stage comparator and target adder check the prediction]
- Must consider the pipeline action for the following cases:
  1. No hit, no branch: do nothing.
  2. No hit, branch: a. nullify the IF instruction; b. branch to the computed target address.
  3. Hit, condition = prediction: do nothing.
  4. Hit, condition = taken, prediction = not taken: a. nullify the IF instruction; b. branch to the computed target address.
  5. Hit, condition = not taken, prediction = taken: a. nullify the IF instruction; b. branch to the ID stage's PC + 4.

Next PC Using Branch Prediction
- Using branch prediction, we now have four choices for the next PC: IF_PC + 4, the predicted target address, ID_PC + 4, and the computed target address.
[diagram: four-way next-PC mux]

BTB Updates
- Must consider BTB updates for the following cases:
  1. No hit, no branch: do nothing.
  2. No hit, branch: create a new BTB entry; set the prediction to the weak state for the observed outcome.
  3. Hit, condition = taken: ++prediction.
  4. Hit, condition = not taken: --prediction.
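The four BTB-update cases above can be sketched as one function. This is a behavioral model only, assuming a dict keyed by PC and the 0..3 two-bit state encoding from the predictor slide; the names and the "weak initial state" encoding (1 = weak not-taken, 2 = weak taken) are my own.

```python
# BTB update sketch covering the four cases on the slide.
# Each entry holds a target address and a 2-bit prediction state (0..3).

def update_btb(btb, pc, is_branch, taken, target):
    entry = btb.get(pc)
    if entry is None:
        if is_branch:                       # case 2: allocate a new entry,
            state = 2 if taken else 1       # prediction set to the weak state
            btb[pc] = {'target': target, 'state': state}
        # case 1 (no hit, no branch): do nothing
    else:
        if taken:                           # case 3: ++prediction, saturating
            entry['state'] = min(3, entry['state'] + 1)
        else:                               # case 4: --prediction, saturating
            entry['state'] = max(0, entry['state'] - 1)

btb = {}
update_btb(btb, 0x40, True, True, 0x100)    # first execution: weak taken
print(btb[0x40]['state'])                   # 2
update_btb(btb, 0x40, True, True, 0x100)    # taken again: strong taken
print(btb[0x40]['state'])                   # 3
```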