EECE476. Lecture 24: Advanced Topic. Register Renaming & Parallelism (no text)



1 EECE476 Lecture 24: Advanced Topic Register Renaming & Parallelism (no text)

2 Performance Summary
  Performance: ExecutionTime = InstrCount * CPI * ClockPeriod
  Performance-enhancing steps so far:
    Single-cycle CPU: large clock period, CPI = 1
    Multi-cycle CPU: small clock period, CPI = 1 to 5 (varies with instruction)
    Pipelined CPU: small clock period, CPI = 1 (ideally)
  Pipeline == parallelism
    5 instructions simultaneously in flight, all in different stages
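The execution-time formula can be tried out with a small sketch. The instruction count, CPI values and clock periods below are hypothetical numbers chosen only for illustration; they are not taken from the lecture.

```python
def execution_time(instr_count, cpi, clock_period_ns):
    """ExecutionTime = InstrCount * CPI * ClockPeriod (result in nanoseconds)."""
    return instr_count * cpi * clock_period_ns

# Hypothetical numbers, for illustration only.
INSTRS = 1_000_000
single_cycle = execution_time(INSTRS, cpi=1.0, clock_period_ns=10.0)  # long clock
multi_cycle = execution_time(INSTRS, cpi=4.0, clock_period_ns=2.0)    # short clock, CPI > 1
pipelined = execution_time(INSTRS, cpi=1.0, clock_period_ns=2.0)      # short clock, ideal CPI

print(single_cycle, multi_cycle, pipelined)
```

Under these assumed values the pipelined design wins because it keeps the short clock period of the multi-cycle CPU while returning to CPI = 1.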

3 Performance Enhancing Steps
  Pipeline problems
    Hazards interfere with parallelism and introduce stalls
    CPI = 1 to 5 again (closer to 1)
    Forwarding reduces data hazards
    Branch Prediction + Branch Target Buffers reduce control hazards
  Memory problems
    Slow memory: stall on load/store
    CPI >> 1 (tens to hundreds!)
    Caches give the illusion of fast memory most of the time
    Eliminates most of these stalls
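To see why slow memory pushes CPI into the tens and why a cache pulls it back, here is a minimal sketch. It assumes the usual textbook model (effective CPI = base CPI + memory references per instruction * miss rate * miss penalty), which the slide does not state explicitly; every number is hypothetical.

```python
def effective_cpi(base_cpi, mem_refs_per_instr, miss_rate, miss_penalty_cycles):
    """Base CPI plus the average memory stall cycles per instruction."""
    return base_cpi + mem_refs_per_instr * miss_rate * miss_penalty_cycles

# Hypothetical workload: 0.3 loads/stores per instruction, 100-cycle main memory.
# "No cache" is modelled as every access paying the full penalty (miss rate 1.0).
no_cache = effective_cpi(1.0, 0.3, miss_rate=1.0, miss_penalty_cycles=100)
with_cache = effective_cpi(1.0, 0.3, miss_rate=0.02, miss_penalty_cycles=100)

print(f"no cache: CPI = {no_cache:.1f}, with cache: CPI = {with_cache:.1f}")
```

With these assumed values the CPI drops from roughly 31 to roughly 1.6 once 98% of accesses hit in the cache.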

4 Performance Enhancing Steps
  Superscalar
    More parallelism: 2 instructions/cycle, CPI = 0.5 (ideal)
    Complex Forwarding Logic, Hazard Detection Logic
    Small clock period increase
  Superpipelined
    More parallelism: approx 2x longer pipeline, CPI = 1 (ideal)
    Clock period cut in half
  Superscalar + Superpipelined
    More parallelism: 2x * 2x = 4x! Best of both worlds.
  To sustain performance, all need:
    Branch prediction
    Branch target buffer
    Caches

5 Performance Enhancing Steps
  Dynamic Scheduling (DS)
    Execute program instructions out-of-order
    Later instructions do not wait for earlier slow/stalled instructions
    DS + BranchPredict + BTB = Speculative Execution
  New pipeline steps:
    1. Fetch instructions (In-Order)
    2. Issue into InstructionQueue (IQ) (In-Order)
    3. Dispatch from IQ to FunctionalUnits (FU, eg, ALU, Load/Store Unit) (Out-of-Order)
    4. Execute in FUs
    5. Commit results (writeback) (In-Order)
  New hazards to consider when we alter instruction order:
    WAW
    WAR
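The in-order issue, out-of-order execute, in-order commit flow can be illustrated with a toy model. This is only a sketch under strong assumptions that are not in the lecture: unlimited functional units, only RAW dependences tracked (as if WAR and WAW had already been removed), and hypothetical latencies. The function name `schedule` and the instruction tuples are invented for the example.

```python
def schedule(program):
    """program: list of (name, dest, srcs, latency).
    Returns (finish, commit): the cycle each instruction finishes and commits."""
    finish, commit = [], []
    last_writer = {}  # register -> index of its most recent producer
    for i, (name, dest, srcs, latency) in enumerate(program):
        # Dispatch as soon as every producer has finished (RAW dependences only).
        ready = max((finish[last_writer[s]] for s in srcs if s in last_writer), default=0)
        finish.append(ready + latency)
        # Commit in order: never before the previous instruction has committed.
        commit.append(max(finish[i], commit[i - 1] if i else 0))
        last_writer[dest] = i
    return finish, commit

program = [
    ("LW",  "$1", ("$9",),      10),  # slow load, eg a cache miss (hypothetical latency)
    ("ADD", "$2", ("$1", "$1"),  1),  # must wait for the load
    ("SUB", "$3", ("$4", "$5"),  1),  # independent: can run ahead
]
finish, commit = schedule(program)
exec_order = [program[i][0] for i in sorted(range(len(program)), key=finish.__getitem__)]
print(exec_order, commit)
```

In this model SUB finishes before the stalled LW and ADD, yet its commit cycle is held back until the older instructions have committed.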

6 New Hazards
  WAW: Write-after-write
    Program order: first write, second write
    Second write may execute before first write
    Second write must stick, even if executed earlier
  WAR: Write-after-read
    Program order: first read, second write
    Read uses old register value
    Write stores new register value
    Write may execute before read
    Read must get old value, even if write executed earlier
  Q: How can we do the second write first and still get correctness?
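A WAR violation is easy to reproduce in a few lines. The sketch below executes a read of $8 and a later write of $8 in both orders on a plain register dictionary; the two instructions and the initial register values are hypothetical and chosen only to make the difference visible.

```python
def add(regs):
    # ADD $6,$1,$8 : reads $8 (first in program order)
    regs["$6"] = regs["$1"] + regs["$8"]

def sub(regs):
    # SUB $8,$10,$14 : writes $8 (second in program order)
    regs["$8"] = regs["$10"] - regs["$14"]

def run(order):
    # Hypothetical initial register contents.
    regs = {"$1": 1, "$6": 0, "$8": 5, "$10": 7, "$14": 3}
    for op in order:
        op(regs)
    return regs

in_order = run([add, sub])    # program order: ADD sees the old $8
reordered = run([sub, add])   # write executed first: ADD sees the new $8

print(in_order["$6"], reordered["$6"])
```

The final value of $8 is the same either way, but $6 differs, because the reordered ADD read the value that SUB had already overwritten.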

7 Parallelism
  Q: How can we do the second write first and still get correctness?
    Trying to do something earlier
    Really trying to find more parallelism
    Do multiple things at once (earlier)
  3 different solutions
    Scoreboarding
    Tomasulo's algorithm
    Register renaming
  All 3 solutions are similar in various ways
    Register renaming is simplest
    Most common solution today

8 Register Renaming
  Examine the hazards:
    OR  $1,$2,$3
    ADD $6,$1,$8     // $1 RAW with OR
    SW  $6, 0($4)    // $6 RAW with ADD
    SUB $8,$10,$14   // $8 WAR with ADD
    MUL $6,$10,$8    // $6 WAW with ADD, $6 WAR with SW, $8 RAW with SUB
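The hazard annotations on this slide can be reproduced mechanically. The sketch below compares every earlier/later instruction pair; the `(op, dest, srcs)` tuple encoding and the function name `find_hazards` are assumptions made for this example. It is deliberately simplified: it reports every matching pair and ignores intervening writes to the same register.

```python
def find_hazards(program):
    """program: list of (op, dest, srcs); dest is None for stores."""
    hazards = []
    for i, (op_i, dest_i, srcs_i) in enumerate(program):
        for op_j, dest_j, srcs_j in program[i + 1:]:
            if dest_i is not None and dest_i in srcs_j:
                hazards.append(("RAW", dest_i, op_i, op_j))  # later reads earlier result
            if dest_j is not None and dest_j in srcs_i:
                hazards.append(("WAR", dest_j, op_i, op_j))  # later overwrites earlier source
            if dest_i is not None and dest_i == dest_j:
                hazards.append(("WAW", dest_i, op_i, op_j))  # both write the same register
    return hazards

program = [
    ("OR",  "$1", ("$2", "$3")),
    ("ADD", "$6", ("$1", "$8")),
    ("SW",  None, ("$6", "$4")),   # SW $6, 0($4): reads $6 and $4, writes no register
    ("SUB", "$8", ("$10", "$14")),
    ("MUL", "$6", ("$10", "$8")),
]
for kind, reg, earlier, later in find_hazards(program):
    print(f"{kind} on {reg}: {later} after {earlier}")
```

For this five-instruction sequence the sketch finds three RAW hazards, two WAR hazards and one WAW hazard, matching the comments on the slide.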

9 Register Renaming
  RAW eliminated with forwarding
    OR  $1,$2,$3
    ADD $6,$1,$8
    SW  $6, 0($4)
    SUB $8,$10,$14   // $8 WAR with ADD
    MUL $6,$10,$8    // $6 WAW with ADD, $6 WAR with SW
  Notice:
    SUB defines new value for $8, old $8 is destroyed
    No need to really use $8, could have used any new register
    Compiler ran out of registers, so re-used $8
    This limits parallelism!

10 Register Renaming
  RAW eliminated with forwarding
    OR  $1,$2,$3
    ADD $6,$1,$8
    SW  $6, 0($4)
    SUB $8,$10,$14   // $8 WAR with ADD
    MUL $6,$10,$8    // $6 WAW with ADD, $6 WAR with SW
  Notice:
    MUL defines new value for $6, old $6 from ADD destroyed
    No need to really use $6, could have used any new register
    Compiler ran out of registers, so re-used $6
    This limits parallelism!

11 Register Renaming
  Trick: rename registers
    $8 in SUB will be renamed T0
    $6 in MUL will be renamed T1
    OR  $1,$2,$3
    ADD $6,$1,$8
    SW  $6, 0($4)
    SUB T0,$10,$14   // T0 used to be WAR with ADD
    MUL T1,$10,T0    // T1 used to be WAW with ADD, WAR with SW
  No more WAR, WAW hazards!
    Can do SUB, MUL before OR, ADD, SW
  Need more registers
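The renaming trick can be sketched as a single pass over the instruction sequence. This is an illustrative model, not the hardware mechanism: it gives a destination a fresh name only when that register was read or written earlier (that is, when a WAR or WAW would exist), and later readers pick up the new name. The tuple encoding and the function name `rename` are assumptions made for this example.

```python
import itertools

def rename(program):
    """program: list of (op, dest, srcs); dest is None for stores."""
    mapping = {}       # architectural name -> current (possibly renamed) name
    touched = set()    # architectural registers read or written so far
    fresh = (f"T{n}" for n in itertools.count())
    out = []
    for op, dest, srcs in program:
        new_srcs = tuple(mapping.get(s, s) for s in srcs)  # sources use current names
        touched.update(srcs)
        new_dest = dest
        if dest is not None:
            if dest in touched:               # WAR or WAW: allocate a new name
                mapping[dest] = next(fresh)
            new_dest = mapping.get(dest, dest)
            touched.add(dest)
        out.append((op, new_dest, new_srcs))
    return out

program = [
    ("OR",  "$1", ("$2", "$3")),
    ("ADD", "$6", ("$1", "$8")),
    ("SW",  None, ("$6", "$4")),
    ("SUB", "$8", ("$10", "$14")),
    ("MUL", "$6", ("$10", "$8")),
]
for instr in rename(program):
    print(instr)
```

On the slide's sequence this yields SUB writing T0 and MUL writing T1 and reading T0, leaving only the true RAW dependence between SUB and MUL.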

12 Renaming Registers
  Problem
    WAR, WAW only due to limited # regs in CPU
    CPU must have finite # regs
    MIPS ISA uses 5 bits to specify a register
    New instruction sets for bigger register files?
  Solution
    Design ISA for finite registers (eg, 32), called architectural registers
    Build CPU with more registers (eg, 64), called physical registers
    CPU maps architectural registers to physical registers
    Remember the page table?
  Register renaming is born!

13 Register Renaming
  Register renaming
    CPU supplies new physical registers when WAW, WAR detected
    Renames architectural register with new physical register number
    Instructions later in program order use the new physical register number
    Instructions earlier in program order, still in the pipeline, use the old physical register number
  Mechanism: rename buffer
    Small memory acting as lookup table
    Input: architectural register, output: physical register

14 Register Renaming
  Renaming scheme
    Keep list of available physical registers
      Eg, store in a FIFO
    WAW, WAR allocates new physical register
      Always done on a write
      Eg, ADD $3, $4, $6: $3 is written, T9 allocated
      Store mapping into rename buffer
    Normal execution continues
      Future instructions: map arch-to-phy register numbers
  Problem: FIFO runs out of physical registers
  Solution: reclaim previously used physical reg numbers
    When to reclaim?
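The scheme above (a FIFO of free physical registers plus a lookup table, with a new register allocated on every write) can be sketched as follows. The class name `RenameBuffer`, its methods, and the convention that an unmapped architectural register maps to itself are assumptions made for this example; the free list is seeded with T9 and T10 only to mirror the slide.

```python
from collections import deque

class RenameBuffer:
    def __init__(self, free_regs):
        self.free = deque(free_regs)  # FIFO of available physical registers
        self.table = {}               # architectural -> physical mapping

    def lookup(self, arch):
        # Assumption: an unmapped architectural register maps to itself.
        return self.table.get(arch, arch)

    def rename(self, dest, srcs):
        # Sources are translated first, so they see the mapping that
        # existed before this instruction's own write.
        phys_srcs = [self.lookup(s) for s in srcs]
        if not self.free:
            raise RuntimeError("out of physical registers: stall until one is reclaimed")
        phys_dest = self.free.popleft()
        self.table[dest] = phys_dest
        return phys_dest, phys_srcs

rb = RenameBuffer(["T9", "T10"])
print(rb.rename("$3", ["$4", "$6"]))  # ADD $3,$4,$6: $3 now lives in T9
print(rb.rename("$5", ["$3", "$3"]))  # a later reader of $3 is pointed at T9
```

A third write would raise the error here, which is the "FIFO runs out" problem the slide ends on.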

15 Register Renaming
  Reclaiming physical registers
    Start with: ADD $3, $4, $6
      $3 is written, T9 allocated
      In future: use T9 every time $3 appears
    But not forever! Eg, encounter future instruction: SUB $3, $11, $14
      $3 is to be written again, so later instructions no longer need T9
      Out-of-order: many instructions may still be using T9, so it cannot be reclaimed yet
      Now allocate T10 for $3 and future instructions
      Existing instructions complete using T9
    When the last instruction finishes with T9, we can reclaim it
  When? Instrs. finish or commit
    In-order commit guarantees instructions using T9 are done
    When a write-to-$3 commits, eg SUB above, reclaim by putting T9 in FIFO
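The reclaim rule (free the previous physical register when the next write to the same architectural register commits) can be sketched like this. The class `Renamer` and its method names are hypothetical, and the sketch models only destination writes and in-order commit; it does not handle an empty free list.

```python
from collections import deque

class Renamer:
    def __init__(self, free_regs):
        self.free = deque(free_regs)  # FIFO of available physical registers
        self.table = {}               # architectural -> current physical register
        self.in_flight = deque()      # (dest, previous physical reg), in program order

    def rename_write(self, dest):
        prev = self.table.get(dest)   # None if dest had no earlier mapping
        new = self.free.popleft()
        self.table[dest] = new
        self.in_flight.append((dest, prev))
        return new

    def commit_oldest(self):
        # In-order commit: every instruction that could still read `prev`
        # is older than this write, so it has already committed.
        dest, prev = self.in_flight.popleft()
        if prev is not None:
            self.free.append(prev)
        return prev

r = Renamer(["T9", "T10"])
r.rename_write("$3")        # ADD $3,$4,$6   -> T9
r.rename_write("$3")        # SUB $3,$11,$14 -> T10
r.commit_oldest()           # ADD commits: nothing to reclaim
print(r.commit_oldest())    # SUB commits: T9 goes back on the FIFO
```

Remembering the previous mapping alongside each in-flight write is what lets the commit step know which register to return to the FIFO.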

16 Register Renaming
  More physical registers: eg 64 available
  Can start WAR, WAW instructions earlier
  Finds more parallelism!
  Better performance on CPUs with high parallelism, eg superscalar+superpipelined

17 More Performance?
  Pipelining, Superscalar, Dynamic Scheduling
    Performance obtained from parallelism
  Branch Prediction, Register Renaming
    Help extract parallelism from a program
  Called Instruction-Level Parallelism, or ILP
  Q: Are there limits to ILP?
    Most programs written with sequential operation in mind
    Now executing many operations in parallel
    How much parallelism?

18 Research Studies on Limits of ILP
  Study                        ILP
  Weiss and Smith [1984]       instr/cycle
  Sohi and Vajapeyam [1987]    1.81
  Tjaden and Flynn [1970]      1.86
  Tjaden and Flynn [1973]      1.96
  Uht [1986]                   2.00
  Smith et al. [1989]          2.00
  Jouppi and Wall [1988]       2.40
  Johnson [1991]               2.50
  Acosta et al. [1986]         2.79
  Wedig [1982]                 3.00
  Butler et al. [1991]         5.8
  Melvin and Patt [1991]       6
  Wall [1991]                  7
  Kuck et al. [1972]           8
  Riseman and Foster [1972]    51
  Nicolau and Fisher [1984]    instrs in parallel!!!
