
Multiple-Issue Processors

- Pipelining can achieve CPI close to 1
  - Mechanisms for handling hazards
  - Static or dynamic scheduling
  - Static or dynamic branch handling
- Increase in transistor counts (Moore's Law):
  - More complex hardware mechanisms
  - Or, more replicated functional units (more parallelism)
- Solution: start more than one instruction in the same clock cycle
  - CPI < 1 (or IPC > 1, Instructions per Cycle)
- Two approaches:
  - Superscalar: instructions are chosen dynamically by the hardware
  - VLIW (Very Long Instruction Word): instructions are chosen statically by the compiler (and assembled in a single long instruction)

Inf3 Computer Architecture - 2011-2012

Superscalar Processors

- Hardware attempts to issue up to n instructions on every cycle, where n is called the issue width of the processor; the processor is said to have n issue slots and to be an n-issue processor
- Instructions issued must respect data dependences
  - In some cycles not all issue slots can be used
- Extra hardware is needed to detect more combinations of dependences and hazards and to provide more bypasses
- Branches?
  - With branch prediction, we can predict branches and fetch instructions
  - Can we execute such predicted instructions?
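The point that some issue slots go unused can be illustrated with a minimal Python sketch. It is a simplified model, not a description of any real issue logic: the `Instr` type, the `issue_packets` function and the example register names are all hypothetical, issue is assumed to be in order, and every result is assumed to be available in the cycle after issue.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    dest: Optional[str]      # register written (None if nothing is written)
    srcs: Tuple[str, ...]    # registers read


def issue_packets(program: List[Instr], width: int) -> List[List[Instr]]:
    """Group an in-order instruction stream into per-cycle issue packets.

    Assumption: an instruction cannot issue in the same cycle as an earlier
    instruction whose result it reads (RAW) or overwrites (WAW).
    """
    if width < 1:
        raise ValueError("issue width must be at least 1")
    packets = []
    i = 0
    while i < len(program):
        packet, written = [], set()
        while i < len(program) and len(packet) < width:
            ins = program[i]
            raw = any(s in written for s in ins.srcs)
            waw = ins.dest is not None and ins.dest in written
            if raw or waw:
                break  # remaining slots in this cycle stay empty
            packet.append(ins)
            if ins.dest is not None:
                written.add(ins.dest)
            i += 1
        packets.append(packet)
    return packets


program = [
    Instr("r1", ("r2", "r3")),
    Instr("r4", ("r5", "r6")),
    Instr("r7", ("r1", "r4")),   # depends on the two instructions above
    Instr("r8", ("r9",)),
]
print([len(p) for p in issue_packets(program, width=4)])  # [2, 2]
```

Even with four issue slots, the dependence of the third instruction on the first two limits each cycle to two issued instructions in this example.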

Speculative Execution

- Speculative execution: execute control-dependent instructions even when we are not sure whether they should be executed
- Hardware undo in case of a misprediction
- Multi-issue + branch prediction + Tomasulo
- Implemented in most current processors
- Key idea: execute out of order but commit in order

Remember Control Hazards in the 5-Stage Pipeline?

When a branch is executed, the PC is not affected until the branch instruction reaches the MEM stage. By this time 3 instructions have been fetched from the fall-through path. Kill the instructions in EX, DEC and IF as they move forwards.

Instruction               c1   c2   c3   c4   c5   c6   c7   c8   c9   c10
BEQZ R1, label            IF   Reg  ALU  Mem  Reg
SUB R4, R2, R5                 IF   Reg  ALU  Mem  Reg   (killed)
AND R6, R2, R7                      IF   Reg  ALU  Mem  Reg   (killed)
OR R8, R2, R9                            IF   Reg  ALU  Mem  Reg   (killed)
  :
label: XOR R10, R1, R11                       IF   Reg  ALU  Mem  Reg

Tomasulo with Hardware Speculation

- Issue:
  - Get instruction from queue
  - Issue if an RS is free and an ROB entry is also free
  - Stall if no RS or no ROB entry is free
  - Instructions are now tagged with the ROB entry number, not the RS id
- Execute:
  - Same as before: monitor the CDB and start the instruction when its operands are available
- Write Result:
  - Broadcast the result with the ROB identifier
  - ROB captures the result to commit later
  - Store operations are saved in the ROB until the store data is available and the store instruction is committed
- Commit:
  - If branch, check the prediction and squash the following instructions if incorrect
  - If store, send data and address to the memory unit and perform the write action
  - Else, update the register with the new value and release the ROB entry

[Figure: Tomasulo organisation with a reorder buffer. Labelled components: instruction fetch unit, instruction queue (holding st f4, 8(r2); add f4, f5, f3; mul f3, f1, f2; ld f1, 4(r1)), reorder buffer, address unit, load buffers, reservation stations, FP adders, FP multipliers, FP registers, memory unit.]
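The commit rules above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the lecture's hardware design: the class names `ROBEntry` and `ReorderBuffer`, the three instruction kinds, and the one-commit-per-call policy are all hypothetical, and reservation stations and the CDB are left out entirely.

```python
from collections import deque


class ROBEntry:
    def __init__(self, kind, dest=None):
        self.kind = kind            # "alu", "store" or "branch" (assumed kinds)
        self.dest = dest            # register name, or memory address for a store
        self.value = None
        self.ready = False          # set once the result has been written
        self.mispredicted = False   # only meaningful for branches


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()      # head of the deque is the oldest instruction

    def issue(self, kind, dest=None):
        """Allocate an entry in program order; return None to signal a stall."""
        if len(self.entries) >= self.size:
            return None
        entry = ROBEntry(kind, dest)
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        """Capture a result; this may happen out of program order."""
        entry.value = value
        entry.mispredicted = mispredicted
        entry.ready = True

    def commit(self, regs, memory):
        """Retire the oldest instruction if it is ready; return True if retired."""
        if not self.entries or not self.entries[0].ready:
            return False
        entry = self.entries.popleft()
        if entry.kind == "branch":
            if entry.mispredicted:
                self.entries.clear()            # squash all younger instructions
        elif entry.kind == "store":
            memory[entry.dest] = entry.value    # memory changes only at commit
        else:
            regs[entry.dest] = entry.value      # registers change only at commit
        return True
```

A short usage example: an instruction after a mispredicted branch finishes early, but its result never reaches the register file.

```python
rob, regs, memory = ReorderBuffer(4), {}, {}
a = rob.issue("alu", "f1")
b = rob.issue("branch")
c = rob.issue("alu", "f3")
rob.write_result(c, 9.0)                  # completes out of order
print(rob.commit(regs, memory))           # False: the head (a) is not ready yet
rob.write_result(a, 1.5)
rob.write_result(b, mispredicted=True)
while rob.commit(regs, memory):
    pass
print(regs)                               # {'f1': 1.5}
```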

VLIW Processors

- Compiler chooses and packs independent instructions into a single long instruction word, or bundle
- No need for hardware to check the instructions for dependences
- Not all portions of the long instruction word will be used in every cycle
- Compiler must be able to expose a lot of parallelism in the instruction flow
- Example:

Memory op 1      Memory op 2      FP op 1          FP op 2          Int op
ld f18,-32(r1)   ld f22,-40(r1)   addd f4,f0,f2    addd f8,f6,f2    (empty)
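How a compiler might fill such a bundle can be sketched as a greedy packer. This is a minimal illustration, not a real VLIW scheduler: the `SLOTS` layout copies the five-slot example above, while the `pack_bundles` function, the tuple format for operations and the final `addd f12,f4,f2` operation are hypothetical additions. The sketch keeps operations in program order and ignores operation latencies.

```python
SLOTS = ("mem", "mem", "fp", "fp", "int")   # slot layout taken from the example


def pack_bundles(ops):
    """Greedily pack (unit, text, dest, srcs) operations into bundles.

    An operation starts a new bundle when no slot of its unit type is free,
    or when it reads or overwrites a register written in the current bundle.
    Unused slots are left as None (an empty issue slot).
    """
    for unit, _, _, _ in ops:
        if unit not in SLOTS:
            raise ValueError(f"no slot for unit type {unit!r}")
    bundles = []
    current, written = [None] * len(SLOTS), set()
    for unit, text, dest, srcs in ops:
        free = [k for k, s in enumerate(SLOTS) if s == unit and current[k] is None]
        depends = (dest is not None and dest in written) or any(s in written for s in srcs)
        if not free or depends:
            bundles.append(current)
            current, written = [None] * len(SLOTS), set()
            free = [k for k, s in enumerate(SLOTS) if s == unit]
        current[free[0]] = text
        if dest is not None:
            written.add(dest)
    if any(current):
        bundles.append(current)
    return bundles


ops = [
    ("mem", "ld f18,-32(r1)", "f18", ("r1",)),
    ("mem", "ld f22,-40(r1)", "f22", ("r1",)),
    ("fp", "addd f4,f0,f2", "f4", ("f0", "f2")),
    ("fp", "addd f8,f6,f2", "f8", ("f6", "f2")),
    ("fp", "addd f12,f4,f2", "f12", ("f4", "f2")),   # hypothetical dependent op
]
for bundle in pack_bundles(ops):
    print(bundle)
```

The first four operations fill one bundle with the integer slot empty; the hypothetical fifth operation depends on f4 and ends up alone in a second bundle, which shows how empty slots increase code size.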

Superscalar vs. VLIW Processors

+ Superscalar can handle dynamic events like cache misses, unpredictable memory dependences, branches, etc.
+ Superscalar can exploit old binaries from previous implementations
- Superscalar complexity limits issue width to 4 or 8

+ VLIW requires a much simpler hardware implementation
+ VLIW implementations can have wider issue
- VLIW requires more complex compiler support
- VLIW cannot use old binaries when the pipeline implementation changes
- VLIW code size increases because of empty issue slots

What Are the Practical Limitations to ILP?

- Limitations on max issue width and instruction window size
- Effects of realistic branch prediction
- The effect of a limited number of registers (ROB size)
- The effects of cache performance

Instruction issues per cycle on a perfect processor:

Benchmark   Issues per cycle
gcc         55
espresso    63
li          18
fpppp       75
doduc       119
tomcatv     150

H&P, fig 3.1 (p.157) shows the available ILP in a perfect processor, with none of the above constraints. This is shown for 6 of the SPEC92 benchmarks: the first 3 are integer benchmarks, the final 3 are floating point benchmarks. These levels of ILP are impossible to achieve in practice, due to the limitations on window size, issue width, branch prediction accuracy and cache performance.

Effect of Instruction Window

[Figure: bar chart of instructions per clock for gcc, espresso, li, fpppp, doduc and tomcatv, with one bar per instruction window size: Infinite, 2048, 512, 128 and 32.]

Branch Prediction Limits ILP

- We've looked at various branch prediction schemes
- Here we see the accuracy achieved by 3 standard schemes
- Profile-based, 2-bit counter and tournament predictors are in use today

H&P, fig 3.2 (p.159) shows the branch prediction accuracy for three types of dynamic branch prediction scheme.

[Figure: bar chart of branch prediction accuracy (axis from 0 to 100) for gcc, espresso, li, fpppp, doduc and tomcatv under the profile-based, 2-bit counter and tournament schemes.]

Effect of Branch Prediction

Effect of Limited ROB

Limits to Multiple-Issue

- Inherent limitation of ILP in most programs:
  - In practice we would need N independent instructions to keep a W-issue processor busy, where N = W * pipeline depth
  - Data and control dependences significantly limit the amount of ILP
- Complexity of the hardware based on issue width:
  - Number of functional units increases linearly (OK)
  - Number of ports for the register file increases linearly (bad)
  - Number of ports for memory increases linearly (bad)
  - Number of dependence tests increases quadratically (bad)
  - Bypass/forwarding logic and wires increase quadratically (bad)
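The scaling claims above can be made concrete with a small back-of-the-envelope sketch. Both functions are hypothetical helpers, not formulas from the lecture: the first reads the pipeline factor in N as the number of pipeline stages, and the second assumes two source operands per instruction and counts only the comparisons between each instruction's sources and the destinations of earlier instructions issued in the same cycle.

```python
def instructions_in_flight(width: int, depth: int) -> int:
    """Independent instructions needed to keep a width-issue pipeline full,
    assuming `depth` pipeline stages (N = W * depth)."""
    return width * depth


def dependence_checks(width: int, srcs_per_instr: int = 2) -> int:
    """Source-versus-earlier-destination comparisons within one issue group.

    Instruction k (0-based) in the group compares each of its sources with
    the destinations of the k earlier instructions, so the total is
    srcs_per_instr * width * (width - 1) / 2, which grows quadratically.
    """
    return srcs_per_instr * width * (width - 1) // 2


for w in (2, 4, 8):
    print(w, instructions_in_flight(w, depth=5), dependence_checks(w))
```

Under these assumptions, doubling the issue width from 4 to 8 doubles the number of instructions needed in flight (20 to 40 for a 5-stage pipeline) but raises the number of comparisons from 12 to 56.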

Summary of Factors Limiting ILP in Real Programs

- Compared with an ideal processor:
  - Finite number of registers (introduces WAW and WAR stalls)
  - Imperfect branch prediction (pipeline flushes)
  - Limited issue width
  - Instruction fetch delays (cache misses)
  - Limited instruction window
- Implications for future performance growth?
  - A single processor has inherent limits
  - To use future silicon area, need to go to multiple cores