Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

Kelly Mason
10 years ago

Inf3 Computer Architecture

1 Multiple-Issue Processors
• Pipelining can achieve CPI close to 1
  - Mechanisms for handling hazards
  - Static or dynamic scheduling
  - Static or dynamic branch handling
• Increase in transistor counts (Moore's Law):
  - More complex hardware mechanisms
  - Or, more replicated functional units (more parallelism)
• Solution: start more than one instruction in the same clock cycle
  - CPI < 1 (or IPC > 1, Instructions per Cycle)
• Two approaches:
  - Superscalar: instructions are chosen dynamically by the hardware
  - VLIW (Very Long Instruction Word): instructions are chosen statically by the compiler (and assembled in a single long "instruction")
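
To make the CPI and IPC relationship concrete, here is a minimal sketch in Python. The function names and the cycle counts are illustrative assumptions, not figures from the slides.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle: higher is better."""
    return instructions / cycles


def cpi(instructions: int, cycles: int) -> float:
    """Cycles per instruction: the reciprocal of IPC."""
    return cycles / instructions


# Hypothetical run: 1000 instructions retired in 400 cycles.
# A single-issue pipeline cannot do this, since it needs at least 1000 cycles.
example_ipc = ipc(1000, 400)
example_cpi = cpi(1000, 400)
print(f"IPC = {example_ipc}, CPI = {example_cpi}")
```

A multiple-issue machine is the only way to reach a CPI below 1, because a single-issue pipeline starts at most one instruction per cycle.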

2 Superscalar Processors
• Hardware attempts to issue up to n instructions on every cycle, where n is called the issue width of the processor; the processor is said to have n issue slots and to be an n-issue processor
• Instructions issued must respect data dependences
• In some cycles not all issue slots can be used
• Extra hardware is needed to detect more combinations of dependences and hazards and to provide more bypasses
• Branches?
  - With branch prediction, we can predict branches and fetch instructions
  - Can we execute such predicted instructions?
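
The issue rule described on this slide can be sketched as a small Python model. This is a simplified illustration under stated assumptions: issue is in program order, every instruction has one destination register and unit latency, and an instruction cannot issue in the same cycle as an earlier instruction that writes one of its sources or its destination. The names `Instr`, `issue_groups` and the sample program are hypothetical.

```python
from typing import List, Tuple

# (destination register, source registers); a hypothetical encoding
Instr = Tuple[str, Tuple[str, ...]]


def issue_groups(program: List[Instr], width: int) -> List[List[int]]:
    """Group instruction indices into per-cycle issue groups of at most `width`."""
    groups = []
    i = 0
    while i < len(program):
        group = []
        written = set()  # registers written by instructions already in this group
        while i < len(program) and len(group) < width:
            dest, srcs = program[i]
            # Stop filling this cycle at the first dependent instruction
            if written & set(srcs) or dest in written:
                break
            group.append(i)
            written.add(dest)
            i += 1
        groups.append(group)
    return groups


program = [
    ("f1", ("r1",)),       # 0: ld  f1, 4(r1)
    ("f3", ("f1", "f2")),  # 1: mul f3, f1, f2  (needs f1 from 0)
    ("f4", ("f5", "f3")),  # 2: add f4, f5, f3  (needs f3 from 1)
    ("r4", ("r2", "r5")),  # 3: sub r4, r2, r5  (independent)
]
groups = issue_groups(program, width=2)
print(groups)  # [[0], [1], [2, 3]]
```

With a 2-issue machine the dependence chain leaves one slot empty in the first two cycles, which is the "not all issue slots can be used" case.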

3 Speculative Execution
• Speculative execution: execute control-dependent instructions even when we are not sure if they should be executed
• Hardware undo, in case of a misprediction
• Multi-issue + branch prediction + Tomasulo
• Implemented in most current processors
• Key idea: execute out of order but commit in order

4 Remember Control Hazards in 5-stage pipeline?
• When a branch is executed, PC is not affected until the branch instruction reaches the MEM stage.
• By this time 3 instructions have been fetched from the fall-through path.

                          c1  c2  c3  c4  c5  c6  c7  c8  c9  c10
BEQZ R1, label            IF  Reg ALU Mem Reg
SUB  R4, R2, R5               IF  Reg ALU Mem Reg
AND  R6, R2, R7                   IF  Reg ALU Mem Reg
OR   R8, R2, R9                       IF  Reg ALU Mem Reg
  :
label: XOR R10, R1, R11                   IF  Reg ALU Mem Reg

Kill the instructions in EX, DEC and IF (SUB, AND, OR) as they move forwards.
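
The cost of this hazard can be estimated with a short calculation. The sketch below assumes a base CPI of 1, a 3-cycle penalty for every taken branch (the three killed fall-through instructions), and a hypothetical workload mix; none of the percentages come from the slides.

```python
def effective_cpi(branch_fraction: float, taken_fraction: float,
                  penalty_cycles: int = 3, base_cpi: float = 1.0) -> float:
    """CPI of a pipeline that keeps fetching the fall-through path and kills
    `penalty_cycles` instructions whenever a branch turns out to be taken."""
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles


# Hypothetical workload: 20% of instructions are branches, 60% of them taken.
cpi = effective_cpi(branch_fraction=0.20, taken_fraction=0.60)
print(f"effective CPI = {cpi:.2f}")
```

Under these assumed numbers the pipeline loses roughly a third of its throughput to killed instructions, which is why branch prediction matters even before multiple issue is considered.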

5 Tomasulo with Hardware Speculation
• Issue:
  - Get instruction from queue
  - Issue if an RS is free and an ROB entry is also free
  - Stall if no RS or no free ROB entry
  - Instructions now tagged with ROB entry number, not RS.Id
• Execute:
  - Same as before: monitor CDB and start instruction when operands are available
• Write Result:
  - Broadcast result with ROB identifier
  - ROB captures result to commit later
  - Store operations are saved in the ROB until store data is available and store instruction is committed
• Commit:
  - If branch, check prediction and squash following instructions if incorrect
  - If store, send data and address to memory unit and perform write action
  - Else, update register with new value and release ROB entry
[Figure: block diagram. The instruction fetch unit feeds the Instruction Queue (st f4, 8(r2); add f4, f5, f3; mul f3, f1, f2; ld f1, 4(r1)), which feeds the Re-Order Buffer, FP registers, address units, load buffers, reservation stations, FP adders, FP multipliers and the memory unit.]
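
The commit stage listed above can be sketched as a toy reorder buffer in Python. This is a simplified model, not the slides' hardware: reservation stations and the CDB are omitted, `RobEntry` and `ReorderBuffer` are hypothetical names, and `commit` is assumed to retire as many completed head entries as it can per call.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RobEntry:
    tag: int
    kind: str                   # "alu", "store" or "branch"
    dest: Optional[str]         # register name, or address for a store
    value: Optional[float] = None
    done: bool = False
    mispredicted: bool = False


class ReorderBuffer:
    def __init__(self, size: int):
        self.size = size
        self.entries = deque()
        self.next_tag = 0
        self.registers: Dict[str, float] = {}
        self.memory: Dict[str, float] = {}

    def issue(self, kind: str, dest: Optional[str] = None) -> Optional[int]:
        """Allocate an entry; return its tag, or None to signal a stall."""
        if len(self.entries) == self.size:
            return None
        entry = RobEntry(self.next_tag, kind, dest)
        self.next_tag += 1
        self.entries.append(entry)
        return entry.tag

    def write_result(self, tag: int, value: Optional[float] = None,
                     mispredicted: bool = False) -> None:
        """Capture a broadcast result; architectural state is not touched yet."""
        for entry in self.entries:
            if entry.tag == tag:
                entry.value, entry.done = value, True
                entry.mispredicted = mispredicted
                return

    def commit(self) -> List[int]:
        """Retire completed entries from the head, strictly in program order."""
        committed = []
        while self.entries and self.entries[0].done:
            head = self.entries.popleft()
            committed.append(head.tag)
            if head.kind == "branch":
                if head.mispredicted:
                    self.entries.clear()  # squash everything after the branch
                    break
            elif head.kind == "store":
                self.memory[head.dest] = head.value
            else:
                self.registers[head.dest] = head.value
        return committed


rob = ReorderBuffer(size=4)
ld = rob.issue("alu", "f1")
mul = rob.issue("alu", "f3")
add = rob.issue("alu", "f4")
rob.write_result(add, 7.0)      # finishes first, out of order
first = rob.commit()            # nothing retires: the head (ld) is not done
rob.write_result(ld, 1.0)
rob.write_result(mul, 2.0)
second = rob.commit()           # all three retire in program order
print(first, second, rob.registers)
```

The registers and memory change only inside `commit`, so a squash after a mispredicted branch needs no undo of architectural state.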

6 VLIW Processors
• Compiler chooses and packs independent instructions into a single long instruction word or bundle
• No need for hardware to check the instructions for dependences
• Not all portions of the long instruction word will be used in every cycle
• Compiler must be able to expose a lot of parallelism in the instruction flow
• Example:

Memory op 1       Memory op 2       fp op 1          fp op 2          int op
ld f18,-32(r1)    ld f22,-40(r1)    addd f4,f0,f2    addd f8,f6,f2
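
The compiler's packing job can be illustrated with a greedy sketch in Python. It assumes the five-slot bundle format from the example (two memory ops, two FP ops, one integer op), unit latency, and in-order packing; real VLIW compilers use far more aggressive scheduling. `SLOTS`, `pack_bundles` and the operation tuples are hypothetical names.

```python
from typing import List, Optional, Tuple

SLOTS = ("mem", "mem", "fp", "fp", "int")  # bundle layout from the example

# (slot kind, assembly text, destination register, source registers)
Op = Tuple[str, str, str, Tuple[str, ...]]


def pack_bundles(ops: List[Op]) -> List[List[Optional[str]]]:
    """Pack ops in program order; start a new bundle when no slot of the
    right kind is free or the op depends on one already in the bundle."""
    bundles = []
    current: List[Optional[str]] = [None] * len(SLOTS)
    written = set()
    for kind, text, dest, srcs in ops:
        free = [i for i, s in enumerate(SLOTS) if s == kind and current[i] is None]
        if not free or written & set(srcs) or dest in written:
            bundles.append(current)
            current = [None] * len(SLOTS)
            written = set()
            free = [i for i, s in enumerate(SLOTS) if s == kind]
        current[free[0]] = text
        written.add(dest)
    if any(slot is not None for slot in current):
        bundles.append(current)
    return bundles


ops = [
    ("mem", "ld f18,-32(r1)", "f18", ("r1",)),
    ("mem", "ld f22,-40(r1)", "f22", ("r1",)),
    ("fp", "addd f4,f0,f2", "f4", ("f0", "f2")),
    ("fp", "addd f8,f6,f2", "f8", ("f6", "f2")),
]
bundles = pack_bundles(ops)
for bundle in bundles:
    print(bundle)
```

The four operations from the slide fit in one bundle with the integer slot left as `None`, which corresponds to an empty issue slot in the long instruction word.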

7 Superscalar vs. VLIW Processors
+ Superscalar can handle dynamic events like cache misses, unpredictable memory dependences, branches, etc.
+ Superscalar can exploit old binaries from previous implementations
- Superscalar complexity limits issue width to 4 or 8
+ VLIW requires much simpler hardware implementation
+ VLIW implementations can have wider issue
- VLIW requires more complex compiler support
- VLIW cannot use old binaries when pipeline implementation changes
- VLIW code size increases because of empty issue slots

8 What are the practical Limitations to ILP?
• Limitations on max issue width and instruction window size
• Effects of realistic branch prediction
• The effect of limited numbers of registers (ROB size)
• The effects of cache performance
[Figure: instruction issues per cycle for gcc, espresso, li, fpppp, doduc and tomcatv]
H&P, fig 3.1 (p.157) shows the available ILP in a perfect processor, with none of the above constraints. This is shown for 6 of the SPEC92 benchmarks; the first 3 are integer, the final 3 are floating point benchmarks. These levels of ILP are impossible to achieve in practice, due to the limitations on window size, issue width, branch prediction accuracy and cache performance.

Instruction issues per cycle (H&P, fig 3.1): gcc 55, espresso 63, li 18, fpppp 75, doduc 119, tomcatv 150

9 Effect of Instruction Window
[Figure: instructions per clock for gcc, espresso, li, fpppp, doduc and tomcatv at different instruction window sizes, up to an infinite window]

10 Branch Prediction Limits ILP
• We've looked at various branch prediction schemes
• Here we see the accuracy achieved by 3 standard schemes
• Profile-based, 2-bit counter and Tournament predictors are in use today
[Figure: branch prediction accuracy of the Profile-based, 2-bit counter and Tournament predictors for gcc, espresso, li, fpppp, doduc and tomcatv]
H&P, fig 3.2 (p.159) shows the branch prediction accuracy for three types of branch prediction scheme.

11 Effect of Branch Prediction

12 Effect of Limited ROB

13 Limits to Multiple-issue
• Inherent limitation of ILP in most programs:
  - In practice we would need N independent instructions to keep a W-issue processor busy, where N = W * pipeline depth
  - Data and control dependences significantly limit amount of ILP
• Complexity of the hardware based on issue width:
  - Number of functional units increases linearly: OK
  - Number of ports for register file increases linearly: bad
  - Number of ports for memory increases linearly: bad
  - Number of dependence tests increases quadratically: bad
  - Bypass/forwarding logic and wires increases quadratically: bad
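
The linear and quadratic growth rates can be made concrete with a short Python sketch. It takes N as issue width times the number of pipeline stages (an assumption about what the slide intends), and counts dependence tests by assuming each instruction has two source registers that must be compared against the destination of every earlier instruction issued in the same cycle. Both function names and the 5-stage depth are illustrative.

```python
def independent_instructions_needed(issue_width: int, pipeline_depth: int) -> int:
    """Instructions in flight when every slot of every stage is occupied."""
    return issue_width * pipeline_depth


def dependence_checks(issue_width: int, sources_per_instr: int = 2) -> int:
    """Register comparisons needed within one issue group.

    Each of the W*(W-1)/2 ordered pairs of instructions needs one comparison
    per source register of the later instruction.
    """
    pairs = issue_width * (issue_width - 1) // 2
    return sources_per_instr * pairs


# Assumed 5-stage pipeline; the widths are chosen only for illustration.
for width in (1, 2, 4, 8):
    needed = independent_instructions_needed(width, pipeline_depth=5)
    checks = dependence_checks(width)
    print(f"W={width}: N={needed}, dependence checks per cycle={checks}")
```

Doubling the width from 4 to 8 doubles the number of instructions needed in flight but more than quadruples the comparisons in this model, from 12 to 56.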

14 Summary of Factors Limiting ILP in Real Programs
• Compared with an ideal processor:
  - Finite number of registers (introduces WAW and WAR stalls)
  - Imperfect branch prediction (pipeline flushes)
  - Limited issue width
  - Instruction fetch delays (cache misses)
  - Limited instruction window
• Implications for future performance growth?
  - Single processor has inherent limits
  - To use future silicon area, need to go to multiple cores
