EEL601: COMPUTER ARCHITECTURE. Turbo Majumder Department of Electrical Engineering Indian Institute of Technology Delhi


2 INSTRUCTION LEVEL PARALLELISM Reference: Chapter 3 and Appendix C of Computer Architecture: A Quantitative Approach, Fifth Edition by J. L. Hennessy and D. A. Patterson

3 What is ILP?
  Exploit the parallelism inherent among instructions.
  - Easy: exploit parallelism within a basic block (linear code, no branches within).
  - We don't have too many lines of such code, so we have to look for something else.
  Two approaches:
  - Static: relies on the compiler to exploit parallelism at compile-time, e.g., ARM Cortex-A8, Intel Itanium
  - Dynamic: relies on hardware to discover and exploit parallelism at run-time, e.g., Intel Core series, ARM Cortex-A9
  Pipeline CPI = ideal pipeline CPI + structural hazard stalls + data hazard stalls + control stalls
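The CPI relation above can be evaluated directly. The following is a minimal sketch; the function name and the stall rates in the example are hypothetical values chosen only for illustration, not figures from the lecture.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_stalls, control_stalls):
    """Pipeline CPI = ideal CPI + average stall cycles per instruction."""
    return ideal_cpi + structural_stalls + data_stalls + control_stalls


# Hypothetical stall rates (stall cycles per instruction), for illustration only.
example_cpi = pipeline_cpi(1.0, 0.05, 0.30, 0.15)
print(f"Pipeline CPI = {example_cpi:.2f}")
```

Every ILP technique in this lecture attacks one or more of the stall terms; none of them changes the ideal CPI itself.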

4 ILP Techniques
  Technique                                             Reduces
  A. Forwarding and bypassing                           Potential data hazard stalls
  B. Delayed branches with simple branch scheduling     Control hazard stalls
  C. Scoreboarding                                      Data hazard stalls from true dependences
  D. Loop unrolling                                     Control hazard stalls
  E. Branch prediction                                  Control stalls
  F. Dynamic scheduling with renaming                   Stalls from data hazards (RAW), output dependences (WAW) and antidependences (WAR)
  G. Hardware speculation                               Data hazard and control hazard stalls

5 Basic Dynamic Scheduling
  Scoreboarding (from CDC 6600)
  Recall: the MIPS pipeline (IF-ID-EX-MEM-WB) checks for structural and data hazards during ID.
  What's new here? We allow an instruction to begin execution even though its predecessor is stalled, subject to certain conditions.
  What conditions? The MIPS Instruction Decode stage is split into two stages:
  - Issue: decode; wait till no structural hazard exists.
  - Read Operands: wait till no data hazard exists.
  Result: in-order issue, out-of-order execution, out-of-order completion

6 Preliminaries
  Output dependences (possible WAW hazard) and antidependences (possible WAR hazard) now matter [unlike MIPS].
  Example
    DIV.D F0, F2, F4
    ADD.D F6, F0, F8
    SUB.D F8, F8, F10
    SUB.D F8, F12, F14
  Goal is to maintain an IPC of 1 (no structural hazards): issue independent instructions.
  Multiple instructions executing simultaneously need multiple functional units:
  2 multipliers, 1 adder, 1 division unit, 1 integer unit (load/store, integer operation, branch)
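To make the three dependence types concrete, here is a small sketch that classifies the possible hazards between pairs of instructions in the example above. The `Instr` tuple and the `hazards` helper are illustrative names, not part of the lecture; the check is pairwise and conservative (it ignores intervening writes).

```python
from collections import namedtuple

Instr = namedtuple("Instr", ["op", "dest", "srcs"])


def hazards(earlier, later):
    """Return the hazard types `later` may have with respect to `earlier`."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")  # true dependence: later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")  # antidependence: later overwrites a source of earlier
    if later.dest == earlier.dest:
        found.add("WAW")  # output dependence: both write the same register
    return found


program = [
    Instr("DIV.D", "F0", ("F2", "F4")),
    Instr("ADD.D", "F6", ("F0", "F8")),
    Instr("SUB.D", "F8", ("F8", "F10")),
    Instr("SUB.D", "F8", ("F12", "F14")),
]

for i, earlier in enumerate(program):
    for j in range(i + 1, len(program)):
        found = hazards(earlier, program[j])
        if found:
            print(f"{earlier.op} (#{i}) -> {program[j].op} (#{j}): {sorted(found)}")
```

On this example the sketch reports a RAW hazard on F0 between DIV.D and ADD.D, WAR hazards on F8 between ADD.D and each SUB.D, and both WAR and WAW on F8 between the two SUB.D instructions.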

7 Method
  Every instruction has to go through the scoreboard.
  - Central monitor of all functional units for data dependences
  - If a dependence is found, determines when the instruction can be issued
  - Also handles output dependences and antidependences
  - Controls write-back
  Steps (for FP):
  Issue
  - FU free: no structural hazard
  - Result register is not the destination of any active instruction: no WAW hazard
  - Else, stall
  - IF buffer full? IF stalls

8 Method (continued)
  Steps (continued)
  Read operands
  - Source operands available? Resolves RAW hazards.
  Execution
  - Possibly many cycles, e.g. MUL.D, DIV.D
  - FU done?
  Write result
  - Is the destination of this FU a source of some previous instruction that has not yet read it? Hold on: this prevents a WAR hazard.
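The stage conditions of the scoreboard method can be sketched as three predicates over the FU status fields. This is a simplified illustration, assuming the field meanings used in the FU status tables (Rj/Rk true means the operand is ready and not yet read); the class and function names are hypothetical, and the example state mirrors the slide "Just before MUL.D goes to write result".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FU:
    busy: bool = False
    fi: Optional[str] = None  # destination register
    fj: Optional[str] = None  # first source register
    fk: Optional[str] = None  # second source register
    rj: bool = False          # True: Fj is ready and has not been read yet
    rk: bool = False          # True: Fk is ready and has not been read yet


def can_issue(fu, dest, register_result):
    # Issue: FU free (no structural hazard) and no pending write to dest (no WAW).
    return not fu.busy and register_result.get(dest) is None


def can_read_operands(fu):
    # Read operands: both sources available (RAW resolved).
    return fu.rj and fu.rk


def can_write_result(name, fus):
    # Write result: stall while another FU still has to read the old value
    # of our destination register (WAR).
    me = fus[name]
    for other_name, other in fus.items():
        if other_name == name or not other.busy:
            continue
        if (other.fj == me.fi and other.rj) or (other.fk == me.fi and other.rk):
            return False
    return True


fus = {
    "Mult1": FU(True, "F0", "F2", "F4", False, False),
    "Add": FU(True, "F6", "F8", "F2", False, False),
    "Divide": FU(True, "F10", "F0", "F6", False, True),
}
print(can_write_result("Mult1", fus))  # True: nobody is waiting to read the old F0
print(can_write_result("Add", fus))    # False: DIV.D has not yet read F6
```

In this state ADD.D has finished executing but must hold its result, because DIV.D is still waiting for F0 and has not read the old value of F6.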

9 Scoreboard Example
    L.D   F6, 34(R2)    ; 1 EX cycle
    L.D   F2, 45(R3)
    MUL.D F0, F2, F4    ; 20 EX cycles
    SUB.D F8, F6, F2    ; 2 EX cycles
    DIV.D F10, F0, F6   ; 40 EX cycles
    ADD.D F6, F8, F2    ; 2 EX cycles
  Instruction status: {Issue, Read operands, Execution, Write result}
  FU status: {Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk}
  Register result status: Fn ∈ {FU name, Φ}, n = 0, 2, 4, ...

10 Scoreboard Example
  Each instruction has issued or is pending issue.

  Instruction status (columns: Issue, Read operands, Execution complete, Write result)
    L.D   F6, 34(R2)
    L.D   F2, 45(R3)
    MUL.D F0, F2, F4
    SUB.D F8, F6, F2
    DIV.D F10, F0, F6
    ADD.D F6, F8, F2

  FU status
    FU name  Busy  Op     Fi   Fj  Fk  Qj       Qk       Rj  Rk
    Integer  Y     L.D    F2   R3  -   -        -        N   -
    Mult1    Y     MUL.D  F0   F2  F4  Integer  -        N   Y
    Mult2    N
    Add      Y     SUB.D  F8   F6  F2  -        Integer  Y   N
    Divide   Y     DIV.D  F10  F0  F6  Mult1    -        N   Y

  Register result status
    Register  F0     F2       F4  F6  F8   F10     F12  ...  F30
    FU        Mult1  Integer  -   -   Add  Divide  -         -

11 Scoreboard Example
  Just before MUL.D goes to write result

  Instruction status (columns: Issue, Read operands, Execution complete, Write result)
    L.D   F6, 34(R2)
    L.D   F2, 45(R3)
    MUL.D F0, F2, F4
    SUB.D F8, F6, F2
    DIV.D F10, F0, F6
    ADD.D F6, F8, F2

  FU status
    FU name  Busy  Op     Fi   Fj  Fk  Qj     Qk  Rj  Rk
    Integer  N
    Mult1    Y     MUL.D  F0   F2  F4  -      -   N   N
    Mult2    N
    Add      Y     ADD.D  F6   F8  F2  -      -   N   N
    Divide   Y     DIV.D  F10  F0  F6  Mult1  -   N   Y

  Register result status
    Register  F0     F2  F4  F6   F8  F10     F12  ...  F30
    FU        Mult1  -   -   Add  -   Divide  -         -

12 Scoreboard Example
  Just before DIV.D goes to write result

  Instruction status (columns: Issue, Read operands, Execution complete, Write result)
    L.D   F6, 34(R2)
    L.D   F2, 45(R3)
    MUL.D F0, F2, F4
    SUB.D F8, F6, F2
    DIV.D F10, F0, F6
    ADD.D F6, F8, F2

  FU status
    FU name  Busy  Op     Fi   Fj  Fk  Qj  Qk  Rj  Rk
    Integer  N
    Mult1    N
    Mult2    N
    Add      N
    Divide   Y     DIV.D  F10  F0  F6  -   -   N   N

  Register result status
    Register  F0  F2  F4  F6  F8  F10     F12  ...  F30
    FU        -   -   -   -   -   Divide  -         -

13 Scoreboard Performance
  Limited by:
  - Amount of parallelism available, in code that has already been optimised by the compiler
  - Number of scoreboard entries: size of the look-ahead window
  - Number and types of FUs: more FUs?
  - Antidependences and output dependences: WAW and WAR stalls
    These get worse with dynamic scheduling, where multiple iterations of a loop overlap.

14 Costs-Benefits of Dynamic Scheduling
  Benefits
  - Makes compiled code independent of pipeline structure: the compiler needs no knowledge of the microarchitecture.
  - Dependences unknown at compile-time, such as those through memory references, are handled.
  - Unpredictable run-time delays, such as cache misses, are handled by executing independent instructions out-of-order.
  Costs
  - Increased hardware complexity
  - More difficult to preserve exception behaviour
  - Possible imprecise exceptions

15 Tomasulo's Algorithm: Concepts
  Tracks when operands are available
  Register renaming: minimises WAW and WAR hazards
  Example:
    DIV.D F0, F2, F4
    ADD.D F6, F0, F8
    S.D   F6, 0(R1)
    SUB.D F8, F10, F14
    MUL.D F6, F10, F8
  Antidependence: ADD.D reads F8, which SUB.D later writes.
  Also, name/output dependence with F6 (ADD.D and MUL.D both write F6).
  After register renaming (S and T are temporary registers):
    DIV.D F0, F2, F4
    ADD.D S, F0, F8
    S.D   S, 0(R1)
    SUB.D T, F10, F14
    MUL.D F6, F10, T
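The renaming idea can be sketched in a few lines. This is not how the slide chose its names (it uses S and T only where needed); the sketch below assumes the simpler policy of giving every destination a fresh name `P0`, `P1`, ..., which are hypothetical physical register names introduced for illustration.

```python
def rename(program):
    """Give every destination a fresh name; sources follow the latest mapping."""
    mapping = {}   # architectural register -> current renamed register
    renamed = []
    counter = 0
    for op, dest, srcs in program:
        # Sources are read under the mapping in force *before* this instruction.
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = f"P{counter}"
            counter += 1
            mapping[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# The example from the slide; S.D has no register destination.
program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D", None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]

for op, dest, srcs in rename(program):
    print(op, dest, srcs)
```

After renaming, SUB.D writes a register that ADD.D never reads, and ADD.D and MUL.D write different registers, so only the true (RAW) dependences remain.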

16 Register Renaming
  Implemented using reservation stations (RS)
  RS: {RSID, Busy, Op, Vj, Vk, Qj, Qk, A}
  - An RS fetches and buffers an operand as soon as it becomes available (not necessarily involving the register file).
  - Pending instructions designate the RS that will provide their input.
  - Result values are broadcast on a result bus, called the common data bus (CDB).
  - When successive writes to a register overlap, only the last output updates the register file.
  - As instructions are issued, the register specifiers are renamed with RSIDs.
  - There may be more reservation stations than registers.
  - Load and store buffers act as RS for memory operations.

17 Tomasulo's Algorithm: Schematic

18 Steps of Tomasulo's Algorithm
  Issue
  - Get the next instruction from the FIFO instruction queue.
  - If an RS is free and operand values are available, issue the instruction to the RS with those values.
  - If operand values are not available, keep track of the FUs producing them.
  - If no RS is available, stall the instruction.
  Execute
  - When an operand becomes available, store it in any reservation station(s) awaiting it.
  - When all operands are ready, start execution: no RAW hazards.
  - Multiple instructions may be ready to execute.
  - Loads and stores are maintained in program order through effective address calculation.
  - No instruction is allowed to initiate execution until all branches that precede it in program order have completed.
  Write result
  - Write the result on the CDB, and from there into reservation stations and store buffers.
  - Stores must wait until both the effective address and the value are available.
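The issue and write-result steps above can be sketched as follows. This is a deliberately reduced model, assuming one named reservation station per instruction and no timing; the class, the function names and the register values are hypothetical and serve only to show how tags (Qj, Qk) are replaced by values (Vj, Vk) when a result appears on the CDB.

```python
class RS:
    def __init__(self):
        self.busy = False
        self.op = None
        self.vj = self.vk = None  # operand values
        self.qj = self.qk = None  # tags of the stations producing the operands


def issue(stations, reg_stat, regs, name, op, dest, src_j, src_k):
    """Issue into station `name`; return False (stall) if it is busy."""
    rs = stations[name]
    if rs.busy:
        return False
    rs.busy, rs.op = True, op
    rs.vj = rs.vk = rs.qj = rs.qk = None
    for src, v_field, q_field in ((src_j, "vj", "qj"), (src_k, "vk", "qk")):
        if reg_stat.get(src) is not None:
            setattr(rs, q_field, reg_stat[src])  # wait for the producer's tag
        else:
            setattr(rs, v_field, regs[src])      # value is already available
    reg_stat[dest] = name                        # dest is renamed to this station
    return True


def ready(rs):
    return rs.busy and rs.qj is None and rs.qk is None


def broadcast(stations, reg_stat, regs, tag, value):
    """Write result on the CDB: wake waiting stations, update registers."""
    for rs in stations.values():
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, producer in list(reg_stat.items()):
        if producer == tag:
            regs[reg] = value
            reg_stat[reg] = None
    stations[tag].busy = False


# Hypothetical register contents, for illustration only.
regs = {"F0": 0.0, "F2": 3.0, "F4": 2.0, "F6": 4.0}
reg_stat = {}
stations = {"Mult1": RS(), "Mult2": RS()}

issue(stations, reg_stat, regs, "Mult1", "MUL.D", "F0", "F2", "F4")
issue(stations, reg_stat, regs, "Mult2", "DIV.D", "F10", "F0", "F6")
print(ready(stations["Mult1"]), ready(stations["Mult2"]))  # True False
broadcast(stations, reg_stat, regs, "Mult1", regs["F2"] * regs["F4"])
print(ready(stations["Mult2"]), stations["Mult2"].vj)      # True 6.0
```

DIV.D is issued before its F0 operand exists; it records the tag Mult1 in Qj and picks the value up from the broadcast, without going through the register file.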

19 Tomasulo's Algorithm: Nitty-Gritties
  RS / load-store buffer / register data structures contain tags that detect and eliminate hazards.
  - Register names are discarded: only values and/or pointers to the RS producing the value are kept.
  - RSID or load-store buffer ID: the virtual register IDs referenced by these tags
  - Issued instruction: source operands are renamed to RSIDs.
  - Number of RS > number of actual registers
  - All WAW and WAR hazards are eliminated.
  CDB: an FU-computed result goes directly to RS, store buffers and/or FP registers.
  - Similar to forwarding and bypassing in a statically scheduled pipeline.
  - Also, not limited by contention for an individual register bus, as in a centralised register file.

20 Tomasulo's Algorithm: Example
    L.D   F6, 32(R2)
    L.D   F2, 44(R3)
    MUL.D F0, F2, F4    ; 6 EX cycles
    SUB.D F8, F2, F6    ; 2 EX cycles
    DIV.D F10, F0, F6   ; 12 EX cycles
    ADD.D F6, F8, F2    ; 2 EX cycles
  Even for memory
  Instruction status: {Issue, Execute, Write result}
  RS status: {RSID, Busy, Op, Vj, Vk, Qj, Qk, A}
  Register status: {Qi RSID}

21 Tomasulo's Algorithm: Example
  Just after the first L.D writes its result

  Instruction status (columns: Issue, Execute, Write result)
    L.D   F6, 32(R2)
    L.D   F2, 44(R3)
    MUL.D F0, F2, F4
    SUB.D F8, F2, F6
    DIV.D F10, F0, F6
    ADD.D F6, F8, F2

  RS status
    RSID   Busy  Op     Vj  Vk                Qj     Qk     A
    Load1  N
    Load2  Y     L.D    -   -                 -      -      44+Regs[R3]
    Add1   Y     SUB.D  -   Mem[32+Regs[R2]]  Load2  -      -
    Add2   Y     ADD.D  -   -                 Add1   Load2  -
    Add3   N
    Mult1  Y     MUL.D  -   Regs[F4]          Load2  -      -
    Mult2  Y     DIV.D  -   Mem[32+Regs[R2]]  Mult1  -      -

  Register status
    Field  F0     F2     F4  F6    F8    F10    F12  ...  F30
    Qi     Mult1  Load2  -   Add2  Add1  Mult2  -         -

22 Tomasulo's Algorithm: Example
  Just before MUL.D writes its result

  Instruction status (columns: Issue, Execute, Write result)
    L.D   F6, 32(R2)
    L.D   F2, 44(R3)
    MUL.D F0, F2, F4
    SUB.D F8, F2, F6
    DIV.D F10, F0, F6
    ADD.D F6, F8, F2

  RS status
    RSID   Busy  Op     Vj                Vk                Qj     Qk  A
    Load1  N
    Load2  N
    Add1   N
    Add2   N
    Add3   N
    Mult1  Y     MUL.D  Mem[44+Regs[R3]]  Regs[F4]          -      -   -
    Mult2  Y     DIV.D  -                 Mem[32+Regs[R2]]  Mult1  -   -

  Register status
    Field  F0     F2  F4  F6  F8  F10    F12  ...  F30
    Qi     Mult1  -   -   -   -   Mult2  -         -

23 Dynamic Loop Unrolling
    Loop: L.D    F0, 0(R1)
          MUL.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDIU R1, R1, -8
          BNE    R1, R2, Loop
  If we predict that branches are taken, the hardware can unroll this loop dynamically.
  Ignore DADDIU, which is an integer operation, for the purposes of this example.
  With loop unrolling, instructions from multiple iterations can be in flight at once, sustaining a CPI close to 1.0.
  With 4 cycles for MUL.D, two iterations are enough; with 6, more iterations need to be unrolled.
  Load: check the effective address against all store buffers (A): prevents RAW hazards.
  Store: check the effective address against all load and store buffers (A): prevents WAW and WAR hazards.
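The two address checks can be written as simple predicates over the A fields of the active buffers. This is a sketch under the assumption that all earlier effective addresses are already known; the function names and the value of R1 are hypothetical.

```python
def load_may_proceed(load_addr, earlier_store_addrs):
    # RAW through memory: an earlier store to the same address must complete first.
    return load_addr not in earlier_store_addrs


def store_may_proceed(store_addr, earlier_load_addrs, earlier_store_addrs):
    # WAR (earlier load) or WAW (earlier store) through memory.
    return (store_addr not in earlier_load_addrs
            and store_addr not in earlier_store_addrs)


r1 = 1000  # hypothetical value of R1 in the first iteration

# Iteration 1 stores to 0(R1); iteration 2 loads from -8(R1): different addresses.
print(load_may_proceed(r1 - 8, {r1}))
# A load from the address of a pending store has to wait.
print(load_may_proceed(r1, {r1}))
```

Because each iteration of the loop touches a different address, the second iteration's load can go ahead while the first iteration's store is still pending.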

24 Dynamic Loop Unrolling
  All instructions issued; none completed

  Instruction status (columns: Issue, Execute, Write result)
    Instruction         Iteration
    L.D   F0, 0(R1)     1
    MUL.D F4, F0, F2    1
    S.D   F4, 0(R1)     1
    L.D   F0, 0(R1)     2
    MUL.D F4, F0, F2    2
    S.D   F4, 0(R1)     2

  RS status
    RSID    Busy  Op     Vj            Vk        Qj     Qk     A
    Load1   Y     L.D    -             -         -      -      Regs[R1]+0
    Load2   Y     L.D    -             -         -      -      Regs[R1]-8
    Add1    N
    Add2    N
    Add3    N
    Mult1   Y     MUL.D  -             Regs[F2]  Load1  -      -
    Mult2   Y     MUL.D  -             Regs[F2]  Load2  -      -
    Store1  Y     S.D    Regs[R1]      -         -      Mult1  -
    Store2  Y     S.D    Regs[R1]-8    -         -      Mult2  -

  Register status
    Field  F0     F2  F4     F6  F8  F10  F12  ...  F30
    Qi     Load2  -   Mult2  -   -   -    -         -

25 Tomasulo's Algorithm: Summary
  Branch prediction accuracy is important. More on this later.
  Complexity
  - Necessitates high-speed associative buffers for tag matching.
  - Control logic is involved, too.
  - A single CDB can be a bottleneck.
  Cache miss latencies are nicely hidden by out-of-order execution.
  Scheduling non-numeric code: register renaming, dynamic scheduling, speculation
  Compiler-independent optimisation: multiple instruction issue

26 Hardware-Based Speculation
  Accurate branch prediction alone is not sufficient to sustain a CPI of 1.0 in wide-issue processors.
  - Keep on executing, assuming the prediction is correct: fetch, issue and execute by speculation.
  - Need a way to handle incorrect predictions.
  Key ideas:
  - Dynamic branch prediction
  - Speculation, to execute instructions before control dependences are resolved
  - Dynamic scheduling across combinations of basic blocks
  W.r.t. Tomasulo's algorithm
  - Separation of result bypassing through the CDB from actual instruction completion is needed.

27 Hardware-Based Speculation (continued)
  - Using a bypassed value: speculative register read
  - Actual register file or memory update only when the instruction is no longer speculative: an additional instruction commit step
  - Commit is in-order.
  Separation of execution completion/bypassing and instruction commit: reorder buffer (ROB)
  - The ROB is the purgatory between instruction execution completion and instruction commit.
  - Supplies operands to other (speculative) instructions meanwhile.
  - Subsumes the store buffer.
  - Results are tagged using the ROB entry number rather than the RSID.

28 Tomasulo with Speculation: Schematic

29 Tomasulo with Speculation: Steps
  Issue
  - Issue if there is an empty RS and an empty slot in the ROB; else stall.
  - Operands available in registers or the ROB are sent to the RS.
  - ROB entry marked busy
  - RS tagged with the ROB entry number: used for sending the result through the CDB
  Execute
  - Monitor the CDB for operands not yet available.
  - A store needs to compute only the effective address.
  Write result
  - Write to the ROB via the CDB, based on the tag; the RS becomes free.
  - Store: value stored in the ROB if available, else keep monitoring the CDB.
  Commit
  - From the head of the ROB (instruction with its result value in the buffer)
  - Store: write the value to memory.
  - Incorrect branch prediction: flush the ROB.
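The commit step can be sketched with a queue. This is a minimal illustration, assuming each ROB entry is a dictionary with the hypothetical fields `name`, `kind`, `done`, `dest`, `value`, `addr` and `mispredicted`; it models only in-order commit and the flush on a mispredicted branch, not issue or execution.

```python
from collections import deque


def commit(rob, regs, memory):
    """Commit finished entries from the head of the ROB, in program order."""
    committed = []
    while rob and rob[0]["done"]:
        entry = rob.popleft()
        committed.append(entry["name"])
        if entry["kind"] == "branch" and entry["mispredicted"]:
            rob.clear()  # flush every younger, speculative entry
            break
        if entry["kind"] == "store":
            memory[entry["addr"]] = entry["value"]  # memory is updated only here
        elif entry["kind"] == "alu":
            regs[entry["dest"]] = entry["value"]    # register file is updated only here
    return committed


regs, memory = {}, {}
rob = deque([
    {"name": "MUL.D", "kind": "alu", "dest": "F0", "value": 6.0, "done": False},
    {"name": "SUB.D", "kind": "alu", "dest": "F8", "value": -1.0, "done": True},
])

print(commit(rob, regs, memory))  # []: SUB.D has finished but cannot pass MUL.D
rob[0]["done"] = True
print(commit(rob, regs, memory))  # ['MUL.D', 'SUB.D']
```

The values are hypothetical. The point of the sketch is that a finished instruction waits behind an unfinished older one, which is what keeps exceptions precise.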

30 Tomasulo with Speculation: Example
    L.D   F6, 32(R2)
    L.D   F2, 44(R3)
    MUL.D F0, F2, F4    ; 6 EX cycles
    SUB.D F8, F2, F6    ; 2 EX cycles
    DIV.D F10, F0, F6   ; 12 EX cycles
    ADD.D F6, F8, F2    ; 2 EX cycles
  Unlike in plain Tomasulo's algorithm, no instruction after MUL.D completes until MUL.D commits.
  Precise exception behaviour
  - Very important for exceptions like page faults.
  - Should be transparent to the user.

31 Tomasulo with Speculation: Example
  Just before MUL.D commits

  Reorder buffer (ROB)
    Entry  Busy  Instruction         State         Destination  V (Value)
    0      N     L.D   F6, 32(R2)    Commit        F6           Mem[32+Regs[R2]]
    1      N     L.D   F2, 44(R3)    Commit        F2           Mem[44+Regs[R3]]
    2      Y     MUL.D F0, F2, F4    Write result  F0           ROB[1].V * Regs[F4]
    3      Y     SUB.D F8, F2, F6    Write result  F8           ROB[1].V - ROB[0].V
    4      Y     DIV.D F10, F0, F6   Execute       F10          -
    5      Y     ADD.D F6, F8, F2    Write result  F6           ROB[3].V + ROB[1].V

  RS status
    RSID   Busy  Op     Vj                Vk                Qj      Qk  Dest    A
    Load1  N
    Load2  N
    Add1   N
    Add2   N
    Add3   N
    Mult1  N     MUL.D  Mem[44+Regs[R3]]  Regs[F4]          -       -   ROB[2]  -
    Mult2  Y     DIV.D  -                 Mem[32+Regs[R2]]  ROB[2]  -   ROB[4]  -

  FP register status
    Field  F0      F2  F4  F6      F8      F10     F12  ...  F30
    Qi     ROB[2]  -   -   ROB[5]  ROB[3]  ROB[4]  -         -
    Busy   Y       N   N   Y       Y       Y       N         N

32 Beyond IPC 1.0
  CPI less than 1.0: multiple-issue processors
  - Statically scheduled superscalar, e.g. ARM Cortex-A8: mostly embedded processors
  - Dynamically scheduled superscalar with speculation, e.g. Intel Core series (i3, i5, i7): mostly general-purpose processors
  - Very Long Instruction Word (VLIW), e.g. TI C6x, Intel Itanium: mostly DSP and scientific computing processors

33 VLIW
  Typically 5 independent operations in one instruction word: 1 integer, 2 floating point, 2 memory references
  Uncover parallelism through:
  - Loop unrolling
  - Local scheduling (of straight-line code)
  - Global scheduling
  Example
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDIU R1, R1, #-8
          BNE    R1, R2, Loop

34 VLIW (continued)
  9 cycles, 23 operations
  - IPC = 23/9 ≈ 2.56
  - Efficiency = 23 / (9 × number of operation slots per instruction word)
  - Many empty slots
  Many more FP registers needed compared to MIPS
  Cannot guarantee code will look like this.
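The issue rate and slot efficiency follow from the cycle and operation counts. The sketch below assumes 5 operation slots per instruction word, as in the typical VLIW described on the previous slide; that slot count is an assumption about this particular schedule, and the variable names are illustrative.

```python
operations = 23       # from the slide
cycles = 9            # from the slide
slots_per_word = 5    # assumption: 1 integer + 2 FP + 2 memory slots per word

ipc = operations / cycles
efficiency = operations / (cycles * slots_per_word)

print(f"IPC = {ipc:.2f}, slot efficiency = {efficiency:.0%}")
# prints "IPC = 2.56, slot efficiency = 51%"
```

Under this assumption roughly half of the available slots hold no operation, which is the "many empty slots" problem noted above.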

35 VLIW Problems
  Code size explosion
  - Loop unrolling over many iterations to uncover ILP
  - Unused instruction bits: encoding can help.
  Hazard detection
  - An FU stall means a processor stall.
  - Cache stalls are difficult to predict at compile-time.
  - Some hardware checks introduced
  Code compatibility
  - Compiled code depends on pipeline structure and FU latency.
  - Issue width
  Extension: Vector Processors
  - Work much better when simple loops need unrolling.
  - VLIW is better for less structured code.

36 Putting All Together: Dynamic Scheduling, Multiple Issue and Speculation
  Modern microarchitectures: dynamic scheduling + multiple issue + speculation
  Two approaches for multiple issue:
  - Assign reservation stations and update pipeline control tables in half clock cycles.
    Only supports 2 instructions/clock.
  - Design logic to handle any possible dependences between the instructions.
    Every possible combination of dependent instructions must be analysed.
    Complexity increases as the square of the number of instructions/clock.
  Pipelining can only help us so much. Issue logic becomes the bottleneck.

37 Modern Microarchitecture Schematic Issue and completion logic must be enhanced to support multiple instructions per clock.

38 Multiple Issue: Steps
  - Assign a reservation station and a reorder buffer entry to every instruction that might be issued in the next bundle.
    Can be done independently of the actual instruction type.
    If the instructions are mostly of the same type (same FU), the RS for that FU will fill up: break the bundle.
  - Analyse all dependences among the instructions to be issued.
  - Within-bundle dependences: the ROB entry number is used to update the RS of the dependent instruction.
  All of it in parallel in one clock cycle!
  Similar for multiple commit

39 Multiple Issue: Example
    Loop: LD     R2, 0(R1)     ; R2 = array element
          DADDIU R2, R2, #1    ; increment R2
          SD     R2, 0(R1)     ; store result
          DADDIU R1, R1, #8    ; increment pointer
          BNE    R2, R3, Loop  ; branch if not last element
  Separate integer FUs for ALU operations and effective address calculation
  2 instructions issued/committed per clock

40 Multiple Issue: Example Without Speculation

41 Multiple Issue: Example With Speculation

42 Upstream Optimisation
  Increasing instruction fetch bandwidth
  - Number of instructions fetched per cycle: average throughput
  - Handling branches and jumps
  Methods:
  - Branch target buffer
  - Return address predictor
  - Integrated instruction fetch: a monolithic unit, standard for most multiple-issue processors now

43 Branch Target Buffers Similar to cache Match current PC, get next PC

44 Branch Target Buffers (continued)
  Penalties with BTB (in MIPS pipeline)
    Instruction in buffer  Prediction  Actual branch  Penalty cycles
    Yes                    Taken       Taken          0
    Yes                    Taken       Not taken      2
    No                     -           Taken          2
    No                     -           Not taken      0
  Optimisation
  - Store target instruction(s) in addition to the target PC: eliminates fetch time and permits a larger BTB.
  - Branch folding: on a hit for an unconditional branch, the instruction is substituted from the BTB, giving zero-cycle latency.
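The penalty table can be turned into an average penalty per branch once hit rate and prediction accuracy are known. The sketch below is illustrative: the function name is hypothetical, and the rates used in the example (90% hit rate, 90% accuracy on hits, 60% of missed branches taken) are assumed values, not measurements from the lecture.

```python
# Penalty cycles from the table: (instruction in buffer, actual branch) -> cycles.
PENALTY_CYCLES = {
    (True, "taken"): 0,
    (True, "not taken"): 2,
    (False, "taken"): 2,
    (False, "not taken"): 0,
}


def expected_branch_penalty(hit_rate, accuracy_on_hit, taken_rate_on_miss):
    """Average penalty cycles per branch; every BTB hit predicts taken."""
    hit_wrong = hit_rate * (1.0 - accuracy_on_hit)       # in buffer, not taken
    miss_taken = (1.0 - hit_rate) * taken_rate_on_miss   # not in buffer, taken
    # The two remaining cases cost 0 cycles and are omitted from the sum.
    return (hit_wrong * PENALTY_CYCLES[(True, "not taken")]
            + miss_taken * PENALTY_CYCLES[(False, "taken")])


# Assumed rates, for illustration only.
print(f"{expected_branch_penalty(0.9, 0.9, 0.6):.2f} cycles per branch")
```

With these assumed rates the two penalised cases contribute 0.18 and 0.12 cycles respectively, so both BTB capacity and prediction accuracy matter.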

45 Return Address Predictor
  Indirect jumps from procedure returns
  - Procedures are called from multiple locations, so branch prediction accuracy for returns is very low.
  - Accurate prediction is possible if RAP buffer depth ≥ maximum procedure call depth.
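A return address predictor behaves like a small stack of return addresses. The class below is a minimal sketch with hypothetical names; it assumes that when the buffer is full the oldest entry is discarded, which is one possible overflow policy.

```python
class ReturnAddressPredictor:
    def __init__(self, depth):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # buffer full: the oldest return address is lost
        self.stack.append(return_address)

    def predict_return(self):
        # Predict the most recent unmatched call's return address, if any.
        return self.stack.pop() if self.stack else None


rap = ReturnAddressPredictor(depth=2)
for address in (100, 200, 300):  # three nested calls, but only two entries
    rap.on_call(address)

print(rap.predict_return())  # 300
print(rap.predict_return())  # 200
print(rap.predict_return())  # None: the entry for 100 was overwritten
```

The addresses are arbitrary. The last return has no prediction because the call depth (3) exceeded the buffer depth (2), which is the condition stated above.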

46 Integrated Instruction Fetch
  Integrated branch prediction
  - Branch prediction becomes part of the IF unit: pushed upstream
  Instruction prefetch
  - Fetch ahead of time from the instruction cache
  Instruction memory access and buffering
  - Multiple cache line accesses
  - Latency hidden through prefetch
  - On-demand instruction provision for the issue unit

47 Speculation: Opportunities and Challenges
  Register renaming (extending the physical register set) vs. reorder buffers
  - Slightly easier commit, but other problems (e.g. deallocating a physical renaming register)
  How much speculation is good?
  - L2 or TLB miss: do not speculate (large penalty)
  Speculating through multiple branches
  - High branch frequency
  - High branch clustering
  - Slow FU
  Energy efficiency challenge
  - Instructions speculated but not needed
  - Undoing the effect of wrong speculation
  Value prediction
  Address aliasing prediction

48 Limitations of ILP
  Perfect scenario
  - Infinite register renaming: no WAW, WAR (only true data dependences remain)
  - Perfect branch prediction and perfect jump prediction (only true control dependences remain)
  - Perfect memory address alias analysis
  - Perfect caches
  Realistic limits (optimistic)
  - Up to 64 instructions issued per clock with no issue restrictions (biggest bottleneck); the number of comparisons grows with the window size n
  - Tournament (branch) predictor with 1K entries and a 16-entry RAP
  - Perfect memory address disambiguation
  - Renaming registers (64 integer + 64 FP); e.g., the Intel Core i7 ROB has 128 entries

49 Limitations of ILP (continued)

50 Thread-Level Parallelism
  Multithreading
  - Only private state is duplicated: separate PC, registers and page table per thread
  Fine-grained
  - Switch threads at every clock, round-robin
  - Hides latencies from both short and long stalls
  - Slows down an individual thread
  Coarse-grained
  - Switch threads only on costly stalls
  Simultaneous (SMT)
  - Multiple instructions from independent threads are dynamically scheduled and issued

51 Thread-Level Parallelism: Example