Instruction Level Parallelism
- Leo Jones
- 7 years ago
1 Instruction Level Parallelism Pipelining achieves Instruction Level Parallelism (ILP) Multiple instructions in parallel But, problems with pipeline hazards CPI = Ideal CPI + stalls/instruction Stalls = Structural + Data (RAW/WAW/WAR) + Control How to reduce stalls? That is, how to increase ILP?
2 Techniques for Improving ILP Loop unrolling Basic pipeline scheduling Dynamic scheduling, scoreboarding, register renaming Dynamic memory disambiguation Dynamic branch prediction Multiple instruction issue per cycle Software and hardware techniques
3 Loop-Level Parallelism Basic block: straight-line code w/o branches Fraction of branches: 0.15 ILP is limited! Average basic-block size is 6-7 instructions And, these may be dependent Hence, look for parallelism beyond a basic block Loop-level parallelism is a simple example of this
4 Loop-Level Parallelism: An Example Consider the loop: for(int i = 1000; i >= 1; i = i-1) { x[i] = x[i] + C; // FP } Each iteration of the loop is independent of other iterations Loop-level parallelism To convert it into ILP: Loop unrolling (static, dynamic) Vector instructions
5 The Loop, in DLX In DLX, the loop looks like:
Loop: LD   F0, 0(R1)    // F0 is array element
      ADDD F4, F0, F2   // F2 has the scalar 'C'
      SD   0(R1), F4    // Stored result
      SUBI R1, R1, 8    // For next iteration
      BNEZ R1, Loop     // More iterations?
Assume: R1 is the initial address, F2 has the scalar value 'C', lowest address in array is '8'
6 How Many Cycles per Loop?
CC1  Loop: LD F0, 0(R1)
CC2  stall
CC3  ADDD F4, F0, F2
CC4  stall
CC5  stall
CC6  SD 0(R1), F4
CC7  SUBI R1, R1, 8
CC8  stall
CC9  BNEZ R1, Loop
CC10 stall
7 Reducing Stalls by Scheduling
CC1 Loop: LD F0, 0(R1)
CC2 SUBI R1, R1, 8
CC3 ADDD F4, F0, F2
CC4 stall
CC5 BNEZ R1, Loop
CC6 SD 8(R1), F4
Realizing that SUBI and SD can be swapped is non-trivial! Overhead versus actual work: 3 cycles of work, 3 cycles of overhead
8 Unrolling the Loop
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4     // No SUBI, BNEZ
      LD   F6, -8(R1)    // Note diff FP reg, new offset
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)  // Note diff FP reg, new offset
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)  // Note diff FP reg, new offset
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
      BNEZ R1, Loop
9 How Many Cycles per Loop?
Loop: LD   F0, 0(R1)     // 1 stall
      ADDD F4, F0, F2    // 2 stalls
      SD   0(R1), F4
      LD   F6, -8(R1)    // 1 stall
      ADDD F8, F6, F2    // 2 stalls
      SD   -8(R1), F8
      LD   F10, -16(R1)  // 1 stall
      ADDD F12, F10, F2  // 2 stalls
      SD   -16(R1), F12
      LD   F14, -24(R1)  // 1 stall
      ADDD F16, F14, F2  // 2 stalls
      SD   -24(R1), F16
      SUBI R1, R1, 32    // 1 stall
      BNEZ R1, Loop      // 1 stall
28 cycles per unrolled loop == 7 cycles per original loop
10 Scheduling the Unrolled Loop
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   0(R1), F4
      SD   -8(R1), F8
      SUBI R1, R1, 32
      SD   16(R1), F12   // 16 - 32 = -16
      BNEZ R1, Loop
      SD   8(R1), F16    // 8 - 32 = -24; fills the branch delay slot
14 cycles per unrolled loop == 3.5 cycles per original loop
11 Observations and Requirements Gain from scheduling is even higher for unrolled loop! More parallelism is exposed on unrolling Need to know that 1000 is a multiple of 4 Requirements: Determine that loop can be unrolled Use different registers to avoid conflicts Determine that SD can be moved after SUBI, and find the offset adjustment Understand dependences
12 Dependences Dependent instructions ==> cannot be in parallel Three kinds of dependences: Data dependence (RAW) Name dependence (WAW and WAR) Control dependence
13 Dependences (continued) Dependences are properties of programs Stalls are properties of the pipeline Two possibilities: Maintain dependence, but avoid stalls Eliminate dependence by code transformation
14 Data Dependence Data dependence represents data flow from one instruction to another One instruction uses the result of another Take transitive closure In our example:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
Note: dependence in memory is hard to detect 100(R4) and 80(R6) may be the same 20(R1) and 20(R1) may be different at different times
15 Name Dependence Two instructions use the same register/memory (name), but there is no flow of data Anti-dependence: WAR hazard Output dependence: WAW hazard Can do register renaming statically, or dynamically
16 Name Dependence in our Example Before renaming:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F0, -8(R1)
      ADDD F4, F0, F2
      SD   -8(R1), F4
      LD   F0, -16(R1)
      ADDD F4, F0, F2
      SD   -16(R1), F4
      LD   F0, -24(R1)
      ADDD F4, F0, F2
      SD   -24(R1), F4
      SUBI R1, R1, 32
After register renaming:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
17 Control Dependence An example: T1; if p1 { S1; } Statement S1 is control-dependent on p1, but T1 is not What this means for execution S1 cannot be moved before p1 T1 cannot be moved after p1
18 Control Dependence in our Example Without unrolling, each copy is guarded by its own branch:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
      BEQZ R1, exit
      LD   F6, 0(R1)
      ADDD F8, F6, F2
      SD   0(R1), F8
      SUBI R1, R1, 8
      BEQZ R1, exit
      // Two more such...
      SUBI R1, R1, 8
      BNEZ R1, Loop
After unrolling, the intermediate branches are removed:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
19 Handling Control Dependence Control dependence need not be maintained We need to maintain: Exception behaviour: do not cause new exceptions Data flow: ensure the right data item is used Speculation and conditional instructions are techniques to get around control dependence
20 Loop Unrolling: a Relook Our example:
for(int i = 1000; i >= 1; i = i-1) {
  x[i] = x[i] + C; // FP
}
Consider:
for(int i = 1000; i >= 1; i = i-1) {
  A[i-1] = A[i] + C[i];     // S1
  B[i-1] = B[i] + A[i-1];   // S2
}
S2 is dependent on S1 S1 is dependent on its previous iteration; same case with S2 Loop-carried dependence ==> loop iterations have to be in-order
21 Removing Loop-Carried Dependence Another example:
for(int i = 1000; i >= 1; i = i-1) {
  A[i] = A[i] + B[i];     // S1
  B[i-1] = C[i] + D[i];   // S2
}
S1 depends on the prior iteration of S2 Can be removed (no cyclic dependence):
A[1000] = A[1000] + B[1000];
for(int i = 1000; i >= 2; i = i-1) {
  B[i-1] = C[i] + D[i];        // S2
  A[i-1] = A[i-1] + B[i-1];    // S1
}
B[0] = C[1] + D[1];   // trailing copy of S2 for i = 1
22 Static vs. Dynamic Scheduling Static scheduling: limitations Dependences may not be known at compile time Even if known, compiler becomes complex Compiler has to have knowledge of pipeline Dynamic scheduling Handle dynamic dependences Simpler compiler Efficient even if code compiled for a different pipeline
23 Dynamic Scheduling For now, we will focus on overcoming data hazards The idea:
DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14
SUBD can proceed without waiting for DIVD
24 CDC 6600: A Case Study IF stage: fetch instructions onto a queue ID stage is split into two stages: Issue: decode and check for structural hazards Read operands: check for data hazards Execution may begin, and may complete out-of-order Complications in exception handling Ignore for now What is the logic for data hazard checks?
25 The CDC Scoreboard Out-of-order completion ==> WAR and WAW hazards possible Scoreboard: a data-structure for all hazard detection in the presence of out-of-order execution/completion All instructions consult the scoreboard to detect hazards
26 The Scoreboard Solution Three components: Stages of the pipeline: Issue (ID1), Read-operands (ID2), EX, WB Data structure (in hardware) Logic for hazard detection, stalling
27 Scoreboard Control & the Pipeline Stages Issue (ID1): decode, check if functional unit is free, and if a previous instruction has the same destination register No such hazard ==> scoreboard issues to the appropriate functional unit Note: structural/WAW hazards prevented by stalling here Note: stall here ==> IF queue will grow Read operands (ID2): Operand is available if no earlier instruction is going to write it, or if the register is being written currently RAW hazards are resolved here
28 Scoreboard Control & the Pipeline Stages (continued) Execute (EX): Functional units perform execution Scoreboard is notified on completion Write-Back (WB): Check for WAR hazards Stall on detection Write-back otherwise
29 Some Remarks WAW causes stall in ID1, WAR causes stall in WB No forwarding logic Output written as soon as it is available (and no WAR hazard) Structural hazard possible in register read/write CDC has 16 functional units, and 4 buses
30 The Scoreboard Data-Structures Instruction status Functional unit status Register result status Randy Katz's CS252 slides... (Lecture 10, Spring 1996) Scoreboard pipeline control A detailed example
31 Limitations of the Scoreboard Speedup of 1.7 for (compiled) FORTRAN, speedup of 2.5 for hand-coded assembly Scoreboard works only within a basic block! Some hazards still cause stalls: Structural WAR, WAW
32 Dynamic Scheduling Better than static scheduling Scoreboarding: Used by the CDC 6600 Useful only within basic block WAW and WAR stalls Tomasulo algorithm: Used in IBM 360/91 for the FP unit Main additional feature: register renaming to avoid WAR and WAW stalls
33 Register Renaming: Basic Idea Compiler maps memory --> registers statically Register renaming maps registers --> virtual registers in hardware, dynamically Should keep track of this mapping Make sure to read the current value Num. virtual registers > Num. ISA registers usually Virtual registers are known as reservation stations in the IBM 360/91
34 Tomasulo: Main Architectural Features Reservation stations: fetch and buffer operand as soon as it is available Load/store buffers: have the address (and data for store) to be loaded/stored Distributed hazard detection and execution control Common Data Bus (CDB): results passed from where generated to where needed Note: IBM 360/91 also had register-memory instructions
35 The Tomasulo Architecture [Block diagram: the FP operation queue (fed by the instruction unit) issues to reservation stations for FP ADD/SUB and FP MUL/DIV; load buffers bring data from memory, store buffers send data to memory; operand buses connect the FP registers to the reservation stations; the Common Data Bus broadcasts results to the reservation stations, store buffers, and registers.]
36 Pipeline Stages Issue: Wait for free Reservation Station (RS) or load/store buffer, and place instruction there Rename registers in the process (WAR and WAW handled here) Execute (EX): Monitor CDB for required operand Checks for RAW hazard in this process Write Result (WB): Write to CDB Picked up by any RS, store buffer, or register
37 Register Renaming In RS, operands referred to by a tag (if operand not already in a register) The tag refers to the RS (which contains the instruction) which will produce the required operand Thus each RS acts as a virtual register
38 The Data Structure Three parts, like in the scoreboard: Instruction status Reservation stations, Load/Store buffers, Register file Register status: which unit is going to produce the register value This is the register --> virtual register mapping
39 Components of RS, Reg. File, Load/Store Buffers Each RS has: Op: the operation (+, -, x, /) Vj, Vk: the operands (if available) Qj, Qk: the RS tag producing Vj/Vk (0 if Vj/Vk known) Busy: is RS busy? Each reg. in reg. file and store buffer has: Qi: tag of RS whose result should go to the reg. or the mem. locn. (blank ==> no such active RS) Load and store buffers have: Busy field, store buffer has value V to be stored
40 Maintaining the Data Structure Issue: Wait until: RS or buffer empty Updates: Qj, Qk, Vj, Vk, Busy of RS/buffer; Maintain register mapping (register status) Execute: Wait until: Qj=0 and Qk=0 (operands available) Write result: CDB result picked up by RS (update Qj, Qk, Vj, Vk), store buffers (update Qi, V), register file (update register status) Update Busy of the RS which finished
41 Some Examples Randy Katz's CS252 slides... (Lecture 11, Spring 1996) Dynamic loop unrolling example from text
42 Dynamic Loop Unrolling
Loop: LD   F0, 0(R1)    // F0 is array element
      ADDD F4, F0, F2   // F2 has the scalar 'C'
      SD   0(R1), F4    // Stored result
      SUBI R1, R1, 8    // For next iteration
      BNEZ R1, Loop     // More iterations?
Assume branch predicted to be taken Denote: load buffers as L1, L2..., ADDD RSs as A1, A2... First iteration: F0 --> L1, F4 --> A1 Second iteration: F0 --> L2, F4 --> A2
43 Summary Remarks Memory disambiguation required Drawbacks of Tomasulo: Large amount of hardware Complex control logic CDB is performance bottleneck But: Required if designing for an old ISA Multiple issue ==> register renaming and dynamic scheduling required Next class: branch prediction
44 Dealing with Control Hazards Software techniques: Branch delay slots Software branch prediction Canceling or nullifying branches Misprediction rates can be high Worse if multiple issue per cycle Hence, hardware/dynamic branch prediction
45 Branch Prediction Buffer PC --> Taken/Not-Taken (T/NT) mapping Can use just the last few bits of PC Prediction may be that of some other branch OK since correctness is not affected Shortcoming of this prediction scheme: Branch mispredicted twice for each execution of a loop Bad if loop is small
for(int i = 0; i < 10; i++) {
  x[i] = x[i] + C;
}
46 Two-Bit Predictor Have to mispredict twice before changing prediction Built-in hysteresis General case is an n-bit predictor 0 to (2^n)-1 saturating counter Counter at 2^(n-1) or above: predict as taken Counter below 2^(n-1): predict as not-taken Experimental studies: 2-bit as good as n-bit
47 Implementing Branch Prediction Buffers Small cache accessed along with the instruction in IF Or, additional 2 bits in instruction cache Note: branch prediction buffer not useful for DLX pipeline Branch target not known earlier than branch condition
48 Prediction Performance [Chart: misprediction rates, roughly 0% to 18%, for Nasa7, Matrix300, Tomcatv, Doduc, Spice, Fpppp, Gcc, Espresso, Eqntott, Li] 4096 entries in the prediction buffer SPEC89, IBM Power architecture
49 Improving Branch Prediction Two ways: increase buffer size, improve accuracy [Chart: misprediction rates for the same SPEC89 benchmarks with 4096 entries versus an infinite number of entries in the prediction buffer]
50 Improving Prediction Accuracy Predict branches based on outcomes of recent other branches
if(aa == 2) { aa = 0; }
if(bb == 2) { bb = 0; }
if(aa == bb) { // Do something
}
Correlating, or two-level predictor
51 Two-Level Predictor There are effectively two predictors for each branch: Depending on whether previous branch is T/NT
Prediction bits | Prediction if last branch NT | Prediction if last branch T
NT/NT           | NT                           | NT
NT/T            | NT                           | T
T/NT            | T                            | NT
T/T             | T                            | T
52 Two-Level Predictor (continued) Last predictor was a (1,1) predictor One bit each of history, and prediction General case is (m,n) predictor m bits of history, n bits of prediction How to implement? Have an m-bit shift register
53 Cost of Two-Level Predictor Number of bits required: Num. branch entries x 2^m x n How many bits in 4096 (0,2) predictor? 8K How many branch entries for an 8K (2,2) predictor? 1K
54 Performance of (2,2) Predictor [Chart: misprediction rates, roughly 0% to 20%, for Nasa7, Matrix300, Tomcatv, Doduc, Spice, Fpppp, Gcc, Espresso, Eqntott, Li, comparing a 4096-entry (0,2) predictor, an infinite-entry (0,2) predictor, and a 1K-entry (2,2) predictor]
55 Branch Target Buffer Branch prediction buffer is not useful for DLX Need to know target address by the end of IF Store branch target address also Branch target buffer, or cache Access branch target buffer in IF cycle Hit ==> predicted branch target known at the end of IF We also need to know if the branch is predicted T/NT
56 Branch Target Buffer (continued) Lookup based on PC --> predicted target No entry found ==> (Target = PC+4) Exact match of PC is important Since we are predicting even before knowing that it is a branch instruction Hardware is similar to a cache Need to store predicted PC only for taken predictions
57 Steps in Using a Target Buffer [Flowchart: IF: access instruction cache and target buffer. Entry found? Yes: use the predicted PC; in EX, if the branch behaves as predicted, the prediction is correct, proceed; otherwise the branch was mispredicted: restart fetch and delete the buffer entry. No: normal execution; if the instruction turns out to be a taken branch, make a new target buffer entry.]
58 Penalties in Branch Prediction
Buffer hit? | Branch taken? | Penalty
Yes         | Yes           | 0
Yes         | No            | 2
No          | -             | 2
Given a prediction accuracy of p, a buffer hit-rate of h, and a taken branch frequency of f, what is the branch penalty? h x (1-p) x 2 + (1-h) x f x 2
59 Storing Target Instructions Directly store instructions instead of target address Target buffer access is now allowed to take longer Or, branch folding can be achieved Replace fetched instruction with that found in the target buffer entry Zero cycle unconditional branch; may be conditional as well
Multiple-Issue Processors Pipelining can achieve CPI close to 1 Mechanisms for handling hazards Static or dynamic scheduling Static or dynamic branch handling Increase in transistor counts (Moore's Law): Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:
More informationThis Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings
This Unit: Multithreading (MT) CIS 501 Computer Architecture Unit 10: Hardware Multithreading Application OS Compiler Firmware CU I/O Memory Digital Circuits Gates & Transistors Why multithreading (MT)?
More informationSoftware Pipelining by Modulo Scheduling. Philip Sweany University of North Texas
Software Pipelining by Modulo Scheduling Philip Sweany University of North Texas Overview Opportunities for Loop Optimization Software Pipelining Modulo Scheduling Resource and Dependence Constraints Scheduling
More informationMidterm I SOLUTIONS March 21 st, 2012 CS252 Graduate Computer Architecture
University of California, Berkeley College of Engineering Computer Science Division EECS Spring 2012 John Kubiatowicz Midterm I SOLUTIONS March 21 st, 2012 CS252 Graduate Computer Architecture Your Name:
More informationEEM 486: Computer Architecture. Lecture 4. Performance
EEM 486: Computer Architecture Lecture 4 Performance EEM 486 Performance Purchasing perspective Given a collection of machines, which has the» Best performance?» Least cost?» Best performance / cost? Design
More informationAn Introduction to the ARM 7 Architecture
An Introduction to the ARM 7 Architecture Trevor Martin CEng, MIEE Technical Director This article gives an overview of the ARM 7 architecture and a description of its major features for a developer new
More informationInstruction scheduling
Instruction ordering Instruction scheduling Advanced Compiler Construction Michel Schinz 2015 05 21 When a compiler emits the instructions corresponding to a program, it imposes a total order on them.
More informationComputer Architectures
Computer Architectures 2. Instruction Set Architectures 2015. február 12. Budapest Gábor Horváth associate professor BUTE Dept. of Networked Systems and Services ghorvath@hit.bme.hu 2 Instruction set architectures
More informationWhat is a bus? A Bus is: Advantages of Buses. Disadvantage of Buses. Master versus Slave. The General Organization of a Bus
Datorteknik F1 bild 1 What is a bus? Slow vehicle that many people ride together well, true... A bunch of wires... A is: a shared communication link a single set of wires used to connect multiple subsystems
More informationLecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
More informationInstruction Set Architecture
Instruction Set Architecture Consider x := y+z. (x, y, z are memory variables) 1-address instructions 2-address instructions LOAD y (r :=y) ADD y,z (y := y+z) ADD z (r:=r+z) MOVE x,y (x := y) STORE x (x:=r)
More informationPentium vs. Power PC Computer Architecture and PCI Bus Interface
Pentium vs. Power PC Computer Architecture and PCI Bus Interface CSE 3322 1 Pentium vs. Power PC Computer Architecture and PCI Bus Interface Nowadays, there are two major types of microprocessors in the
More informationEnergy Efficient Design of the Reorder Buffer 1
Energy Efficient Design of the Reorder Buffer 1 Dmitry Ponomarev, Gurhan Kucuk, Kanad Ghose Department of Computer Science, State University of New York, Binghamton, NY 13902 6000 {dima, gurhan, ghose}@cs.binghamton.edu
More informationIntel Pentium 4 Processor on 90nm Technology
Intel Pentium 4 Processor on 90nm Technology Ronak Singhal August 24, 2004 Hot Chips 16 1 1 Agenda Netburst Microarchitecture Review Microarchitecture Features Hyper-Threading Technology SSE3 Intel Extended
More informationCentral Processing Unit (CPU)
Central Processing Unit (CPU) CPU is the heart and brain It interprets and executes machine level instructions Controls data transfer from/to Main Memory (MM) and CPU Detects any errors In the following
More informationInstruction Level Parallelism I: Pipelining
Instruction Level Parallelism I: Pipelining Readings: H&P Appendix A Instruction Level Parallelism I: Pipelining 1 This Unit: Pipelining Application OS Compiler Firmware CPU I/O Memory Digital Circuits
More informationPutting Checkpoints to Work in Thread Level Speculative Execution
Putting Checkpoints to Work in Thread Level Speculative Execution Salman Khan E H U N I V E R S I T Y T O H F G R E D I N B U Doctor of Philosophy Institute of Computing Systems Architecture School of
More information361 Computer Architecture Lecture 14: Cache Memory
1 361 Computer Architecture Lecture 14 Memory cache.1 The Motivation for s Memory System Processor DRAM Motivation Large memories (DRAM) are slow Small memories (SRAM) are fast Make the average access
More informationBindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27
Logistics Week 1: Wednesday, Jan 27 Because of overcrowding, we will be changing to a new room on Monday (Snee 1120). Accounts on the class cluster (crocus.csuglab.cornell.edu) will be available next week.
More informationCARNEGIE MELLON UNIVERSITY
CARNEGIE MELLON UNIVERSITY VALUE LOCALITY AND SPECULATIVE EXECUTION A DISSERTATION SUBMITTED TO THE GRADUATE SCHOOL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS for the degree of DOCTOR OF PHILOSOPHY in
More informationOBJECT-ORIENTED programs are becoming more common
IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 9, SEPTEMBER 2009 1153 Virtual Program Counter (VPC) Prediction: Very Low Cost Indirect Branch Prediction Using Conditional Branch Prediction Hardware Hyesoon
More informationSoftware Pipelining - Modulo Scheduling
EECS 583 Class 12 Software Pipelining - Modulo Scheduling University of Michigan October 15, 2014 Announcements + Reading Material HW 2 Due this Thursday Today s class reading» Iterative Modulo Scheduling:
More informationCUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application
More informationSoftware implementation of Post-Quantum Cryptography
Software implementation of Post-Quantum Cryptography Peter Schwabe Radboud University Nijmegen, The Netherlands October 20, 2013 ASCrypto 2013, Florianópolis, Brazil Part I Optimizing cryptographic software
More informationExploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager
Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor Travis Lanier Senior Product Manager 1 Cortex-A15: Next Generation Leadership Cortex-A class multi-processor
More information"JAGUAR AMD s Next Generation Low Power x86 Core. Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012
"JAGUAR AMD s Next Generation Low Power x86 Core Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012 TWO X86 CORES TUNED FOR TARGET MARKETS Mainstream Client and Server Markets Bulldozer
More informationCheckpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors
Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors Haitham Akkary Ravi Rajwar Srikanth T. Srinivasan Microprocessor Research Labs, Intel Corporation Hillsboro, Oregon
More informationIA-64 Application Developer s Architecture Guide
IA-64 Application Developer s Architecture Guide The IA-64 architecture was designed to overcome the performance limitations of today s architectures and provide maximum headroom for the future. To achieve
More informationMicroprocessor and Microcontroller Architecture
Microprocessor and Microcontroller Architecture 1 Von Neumann Architecture Stored-Program Digital Computer Digital computation in ALU Programmable via set of standard instructions input memory output Internal
More informationEECS 583 Class 11 Instruction Scheduling Software Pipelining Intro
EECS 58 Class Instruction Scheduling Software Pipelining Intro University of Michigan October 8, 04 Announcements & Reading Material Reminder: HW Class project proposals» Signup sheet available next Weds
More information150127-Microprocessor & Assembly Language
Chapter 3 Z80 Microprocessor Architecture The Z 80 is one of the most talented 8 bit microprocessors, and many microprocessor-based systems are designed around the Z80. The Z80 microprocessor needs an
More informationAssembly Language Programming
Assembly Language Programming Assemblers were the first programs to assist in programming. The idea of the assembler is simple: represent each computer instruction with an acronym (group of letters). Eg:
More informationCHAPTER 7: The CPU and Memory
CHAPTER 7: The CPU and Memory The Architecture of Computer Hardware, Systems Software & Networking: An Information Technology Approach 4th Edition, Irv Englander John Wiley and Sons 2010 PowerPoint slides
More information! Metrics! Latency and throughput. ! Reporting performance! Benchmarking and averaging. ! CPU performance equation & performance trends
This Unit CIS 501 Computer Architecture! Metrics! Latency and throughput! Reporting performance! Benchmarking and averaging Unit 2: Performance! CPU performance equation & performance trends CIS 501 (Martin/Roth):
More informationPutting it all together: Intel Nehalem. http://www.realworldtech.com/page.cfm?articleid=rwt040208182719
Putting it all together: Intel Nehalem http://www.realworldtech.com/page.cfm?articleid=rwt040208182719 Intel Nehalem Review entire term by looking at most recent microprocessor from Intel Nehalem is code
More informationEnergy-Efficient, High-Performance Heterogeneous Core Design
Energy-Efficient, High-Performance Heterogeneous Core Design Raj Parihar Core Design Session, MICRO - 2012 Advanced Computer Architecture Lab, UofR, Rochester April 18, 2013 Raj Parihar Energy-Efficient,
More informationLSN 2 Computer Processors
LSN 2 Computer Processors Department of Engineering Technology LSN 2 Computer Processors Microprocessors Design Instruction set Processor organization Processor performance Bandwidth Clock speed LSN 2
More informationPerformance analysis with Periscope
Performance analysis with Periscope M. Gerndt, V. Petkov, Y. Oleynik, S. Benedict Technische Universität München September 2010 Outline Motivation Periscope architecture Periscope performance analysis
More informationVHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU
VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU Martin Straka Doctoral Degree Programme (1), FIT BUT E-mail: strakam@fit.vutbr.cz Supervised by: Zdeněk Kotásek E-mail: kotasek@fit.vutbr.cz
More informationTypes of Workloads. Raj Jain. Washington University in St. Louis
Types of Workloads Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/ 4-1 Overview!
More information