Pipeline Efficiency


1 Pipeline Efficiency

2 Pipeline: Pipeline efficiency

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

- Ideal pipeline CPI: the best possible value (tends to 1 as the number of instructions n → ∞).
- Structural hazards: insufficient hardware.
- Data hazards: an instruction needs the results of earlier calculations.
- Control hazards: the need to foretell the future (branches and jumps).

Instruction-Level Parallelism (ILP) seeks to overlap instruction execution.
- Hardware: dynamically, at run time.
- Software: statically, at compile time.

The simplest form is loop-level parallelism.
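As a quick check on the CPI formula, the sketch below adds per-instruction stall contributions to an ideal CPI of 1. The function name `pipeline_cpi` and the stall rates are illustrative assumptions, not measurements from the lecture.

```python
def pipeline_cpi(ideal_cpi, structural, data, control):
    """Pipeline CPI = ideal CPI + stall cycles per instruction from each hazard class."""
    return ideal_cpi + structural + data + control


# Hypothetical stall rates (stall cycles per instruction), for illustration only.
cpi = pipeline_cpi(ideal_cpi=1.0, structural=0.05, data=0.30, control=0.15)
print(f"Pipeline CPI = {cpi:.2f}")
```

With these assumed rates, half a cycle per instruction is lost to stalls, so the pipeline delivers two thirds of its ideal throughput.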

3 Loops: Loop-level parallelism

- Dynamic: branch prediction.
- Static: loop unrolling.

Dependence
- Independent (parallel) instructions: simultaneous execution is possible. They can be placed in a pipeline with (possibly) only structural hazards.
- Dependent instructions: must occur in order; partial overlap is possible.
- Instruction 2 is dependent on instruction 1 if it uses the result of instruction 1. This is a data hazard: a Read After Write (RAW) hazard.
- Preserve order only where vital. Hardware and software must produce the same result as strict sequential execution (which may itself not be correct, if the code is not correct).
- Actual hazard: whether it exists, and the stall length, depend on the implementation of the pipeline.
- A dependence in the program indicates the potential for a hazard, stipulates an order, and sets an upper bound on possible performance.

4 Dependence

Name dependence
- Arises when a memory location or register is re-used; no value flows between the two instructions.
- Anti-dependence: instruction 2 may write before instruction 1 has used the value. Write After Read (WAR) hazard.
- Output dependence: instruction 1 may write after instruction 2 has written, but before the subsequent use is made of the data. Write After Write (WAW) hazard.
- Both may be resolved by using a separate register: register renaming (in hardware or software).
- Preserve order only where vital.

Control dependence
- Arises from if/then/else blocks.
- Often blocks can be executed ignoring the conditions, provided we can throw away the results.
- Ensure the system is completely unaffected by the unwanted calculations.
- Need to handle exceptions and ensure correct data flow.
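The three hazard types can be told apart purely from which registers each instruction reads and writes. The following is a minimal sketch under that assumption; the `(destination, sources)` tuple representation and the function name `classify_hazards` are hypothetical, chosen for illustration.

```python
def classify_hazards(first, second):
    """Return the potential hazards between two instructions in program order.

    Each instruction is a (destination, sources) pair,
    e.g. ("r4", ("r3", "r2")) for add $r4, $r3, $r2.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    hazards = set()
    if dest1 is not None and dest1 in srcs2:
        hazards.add("RAW")  # second reads what first writes (true dependence)
    if dest2 is not None and dest2 in srcs1:
        hazards.add("WAR")  # second overwrites what first still has to read
    if dest1 is not None and dest1 == dest2:
        hazards.add("WAW")  # both write the same register
    return hazards


load = ("r3", ("r1",))        # lw  $r3, 0($r1)
add = ("r4", ("r3", "r2"))    # add $r4, $r3, $r2
print(classify_hazards(load, add))
```

Renaming the destination of the second instruction to an unused register removes WAR and WAW results, while a RAW result cannot be renamed away.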

5 Loops: Assembler

    for (int pnt = 1000; pnt > 0; pnt--) {
        arr[pnt] = arr[pnt] + offset;
    }

This Java loop will compile to something like

          lw   $r2, offset
    Loop: lw   $r3, 0($r1)       ; load array element
          add  $r4, $r3, $r2     ; add
          sw   0($r1), $r4       ; store result
          addi $r1, $r1, -4      ; decrement pnt
          bnez $r1, Loop         ; continue

On the MIPS 5-step pipeline this becomes

          lw   $r2, offset
    Loop: lw   $r3, 0($r1)
          stall                  ; waiting for $r3
          add  $r4, $r3, $r2
          stall                  ; waiting for $r4
          stall
          sw   0($r1), $r4
          addi $r1, $r1, -4
          stall                  ; no forwarding to the branch
          bnez $r1, Loop

Why one stall and then two?

6 Loops: Re-ordering

Original order: 9 cycles

          lw   $r2, offset
    Loop: lw   $r3, 0($r1)
          stall                  ; waiting for $r3
          add  $r4, $r3, $r2
          stall                  ; waiting for $r4
          stall
          sw   0($r1), $r4
          addi $r1, $r1, -4
          stall                  ; no forwarding to the branch
          bnez $r1, Loop

Reordered: 7 cycles

          lw   $r2, offset
    Loop: lw   $r3, 0($r1)
          addi $r1, $r1, -4      ; decrement early
          add  $r4, $r3, $r2     ; removes the load stall
          stall                  ; waiting for $r4
          stall
          sw   4($r1), $r4       ; displacement compensates for the early decrement
          bnez $r1, Loop         ; $r1 already done
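The 9-cycle and 7-cycle counts can be reproduced with a small in-order issue model. This is a sketch only: the `STALLS` table is an assumption chosen to match the stall counts on the slides (one stall from load to ALU use, two from ALU result to store, one from ALU result to branch), not a description of a real MIPS pipeline, and `count_cycles` is a hypothetical helper.

```python
# Assumed extra wait (in cycles) between a producer and a dependent consumer.
STALLS = {("load", "alu"): 1, ("alu", "store"): 2, ("alu", "branch"): 1}


def count_cycles(program):
    """program: list of (kind, dest, sources). Returns cycles for one loop iteration."""
    produced = {}  # register -> (kind of producer, cycle it issued in)
    cycle = 0
    for kind, dest, sources in program:
        issue = cycle + 1  # in-order issue: at best one cycle after the previous one
        for reg in sources:
            if reg in produced:
                prod_kind, prod_cycle = produced[reg]
                wait = STALLS.get((prod_kind, kind), 0)
                issue = max(issue, prod_cycle + 1 + wait)
        cycle = issue
        if dest is not None:
            produced[dest] = (kind, issue)
    return cycle


original = [
    ("load", "r3", ("r1",)),
    ("alu", "r4", ("r3", "r2")),
    ("store", None, ("r1", "r4")),
    ("alu", "r1", ("r1",)),
    ("branch", None, ("r1",)),
]
reordered = [
    ("load", "r3", ("r1",)),
    ("alu", "r1", ("r1",)),      # addi moved up
    ("alu", "r4", ("r3", "r2")),
    ("store", None, ("r1", "r4")),
    ("branch", None, ("r1",)),
]
print(count_cycles(original), count_cycles(reordered))
```

Moving the `addi` between the load and the add fills the load stall and gives the branch its operand early, which is where the two saved cycles come from.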

7 Loops: Unrolling

          lw   $r2, offset
    Loop: lw   $r3, 0($r1)
          stall                  ; the load stall comes back
          add  $r4, $r3, $r2
          stall*2
          sw   0($r1), $r4
          lw   $r3, -4($r1)      ; second iteration
          stall
          add  $r4, $r3, $r2
          stall*2
          sw   -4($r1), $r4
          lw   $r3, -8($r1)      ; third iteration
          stall
          add  $r4, $r3, $r2
          stall*2
          sw   -8($r1), $r4
          lw   $r3, -12($r1)     ; fourth iteration
          stall
          add  $r4, $r3, $r2
          stall*2
          sw   -12($r1), $r4
          addi $r1, $r1, -16     ; decrement 4 times
          bnez $r1, Loop

Unrolled: 26 cycles, 6.5 per iteration.

Unrolling is harder if the total number of iterations is not divisible by the unroll factor.

8 Loops: General case

For a loop with upper bound U, unrolled m times: execute the unrolled loop U/m times (integer division) and the original loop U mod m times.

Optimised unrolled loop

          lw   $r2, offset
    Loop: lw   $r7, 0($r1)
          lw   $r8, -4($r1)
          lw   $r9, -8($r1)
          lw   $r10, -12($r1)
          addi $r1, $r1, -16     ; decrement 4 times
          add  $r3, $r7, $r2     ; load stall hidden
          add  $r4, $r8, $r2
          add  $r5, $r9, $r2
          add  $r6, $r10, $r2
          sw   16($r1), $r3      ; displacements compensate for the early decrement
          sw   12($r1), $r4
          sw   8($r1), $r5
          sw   4($r1), $r6
          bnez $r1, Loop

Unrolled and optimised: 14 cycles, 3.5 per iteration, compared with 9 originally. Speed-up: 9 / 3.5 ≈ 2.6.
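The U/m and U mod m split can be sketched in a high-level language. The example below assumes an unroll factor of m = 4 and a hypothetical function name `add_offset_unrolled`; it shows the structure a compiler would generate, not any performance benefit in Python itself.

```python
def add_offset_unrolled(arr, offset):
    """Add offset to every element, four elements per pass of the main loop."""
    m = 4                      # unroll factor (assumed for this sketch)
    u = len(arr)               # upper bound U
    i = 0
    for _ in range(u // m):    # unrolled loop runs U/m times
        arr[i] += offset
        arr[i + 1] += offset
        arr[i + 2] += offset
        arr[i + 3] += offset
        i += m                 # one index update and one loop test per four elements
    for _ in range(u % m):     # original loop mops up the U mod m leftovers
        arr[i] += offset
        i += 1
    return arr


print(add_offset_unrolled(list(range(10)), 5))
```

For 10 elements the unrolled body runs twice and the clean-up loop runs twice, so every element is updated exactly once.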

9 Unrolling: Decisions

- Unrolling is useful if the iterations are independent.
- Use different registers to avoid name dependences. This sets a limit on the size of the unroll.
- Eliminate test and branch instructions. They will need to be modified at the end of the code.
- Independent iterations allow loads and stores to be reordered between iterations.
- Ensure the code delivers the same result.

Note that the useful number of unrolled iterations depends on the length of the code.
- Unroll a long loop too many times and it may start generating cache misses for the code. It will also run out of independent registers.
- Long loops have little overhead from housekeeping code, so the marginal advantage of extra iterations decreases.
- Long loops have other ways to hide stalls.

10 Branches: Branch or control hazards

[Figure: The Effect of a Conditional Branch on Instruction Pipeline Operation. Timing diagram of a six-stage pipeline (FI, DI, CO, FO, EI, WO) against time for instructions 1 to 7, followed by instructions 15 and 16, with the branch penalty marked. Diagram: Stallings.]

A plot of the effect of a conditional branch at instruction 3 on the pipeline. On cycle 8, when the address is known, the pipeline is emptied (flushed). The pipe starts refilling, and there are 5 slots when no instruction completes.

11 Branches: Mitigating branch hazards

Methods to mitigate the effect:
- Multiple streams
- Prefetch branch target
- Loop buffer
- Branch prediction
- Delayed branch

12 Branches: Multiple streams

- Duplicate the pipeline and process both branch taken and branch not taken.
- If you have two pipelines, they will both need access to registers, cache and memory. This means contention, or an increased number of functional units.
- What happens when another branch enters one of the pipelines?
- The duplicate units use space, and power for operation.
- Can we get the same effect more efficiently?

13 Branches: Prefetch target buffer

- The branch target is fetched before it is clear that it is necessary.
- Meanwhile the instruction following the branch enters the pipeline.
- If the branch is not taken, the pipeline operates at full efficiency.
- If the branch is taken, the branch target has already been fetched, reducing the length of the stall.

14 Branches: Loop buffer

- The instruction fetch part of the pipeline contains a small buffer of high-speed memory, which holds the last few instructions.
- For a taken branch, the hardware takes the instruction from the buffer (if it is there), i.e. back to the start of the loop.
- The loop buffer allows for pre-fetching instructions, so there will be instructions in the buffer ahead of the current one. These are available with no memory access overhead.
- IF THEN ELSE constructs: where the branch target is only a few instructions ahead, it will already have been pre-fetched.
- It is called a loop buffer, but it has other advantages.
- If the buffer is large enough to contain the whole loop, then the loop's instructions will only be fetched once.

16 Branches: Delayed branch

Unoptimised

    Address  Instruction
    100      LDA
    101      ADD
    102      JMP
    103      MUL

Here 103 will be in the pipeline, and the pipeline will need to be cleared before proceeding. This adds to the complication of the circuit.

With NOOP

    Address  Instruction
    100      LDA
    101      ADD
    102      JMP
    103      NOOP
    104      MUL

Here the addition of a NOOP instruction means that the pipeline does not need to be cleared.

17 Branches: Delayed branch

    Address  Instruction
    100      LDA
    101      JMP
    102      ADD        ; delay slot
    103      MUL

- Here the branch instruction is pulled back and executed earlier.
- The instruction which preceded the jump is now after the jump. It can finish executing while the jump is executed and the instruction at the jump target is being fetched.
- This must be done more carefully if the branch is conditional.
- The instruction position following the branch is referred to as the delay slot.

18 Branches: Branch prediction

- This means saying ahead of time whether a branch will be taken or not.
- It turns out to be possible to guess the correct answer a significant percentage of the time.
- Consider a loop: the branch at the end of the loop is taken every time that the system executes the loop body. Only at the end of the loop is the branch not taken.

19 Prediction: Branch prediction

- Reduces stalls, but only if the prediction is correct.
- Simplest is to predict branch taken.

[Figure: bar chart of misprediction rate (axis 0% to 25%) for the SPEC programs compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp and su2cor. Bar labels as extracted: 22%, 12%, 18%, 11%, 12%, 4%, 6%, 9%, 10%, 15%.]

Some SPEC programs and their failure rate for predict branch taken (static, compile time).

Data can be collected during execution and used to modify the prediction.

20 Prediction: Dynamic branch prediction

- Depends on the underlying regularity of code and data.
- Dynamic is better than static.
- Performance is a function of the accuracy and of the cost of starting down the wrong route.
- Analogy: the weather tomorrow will be the same as today.
- Simplest is a Branch History Table (BHT): do the same as last time. This is a 1-bit finite state machine.
- A loop will have 2 mispredictions, on the first and last iterations.
- Average is only 9 (seems small??)
- 2-bit prediction: only change the prediction after two mistakes.

[Figure: state diagram of the 2-bit predictor, with two "Predict Taken" states and two "Predict Not Taken" states and transitions labelled T (taken) and NT (not taken).]
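The difference between 1-bit and 2-bit prediction on a loop branch can be checked with a short simulation. This sketch assumes the common saturating-counter form of the 2-bit scheme and an initial state of "taken"; the exact state diagram on the slide may differ, and `count_mispredictions` is a hypothetical name.

```python
def count_mispredictions(outcomes, bits):
    """Count mispredictions of a saturating-counter predictor with 1 or 2 bits."""
    top = (1 << bits) - 1       # highest counter value: 1 for 1-bit, 3 for 2-bit
    counter = top               # assumed initial state: (strongly) taken
    wrong = 0
    for taken in outcomes:
        predicted_taken = counter > top // 2
        if predicted_taken != taken:
            wrong += 1
        # Move towards the actual outcome, saturating at both ends.
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return wrong


# A loop branch taken 9 times and then not taken, with the loop run 3 times.
outcomes = ([True] * 9 + [False]) * 3
print(count_mispredictions(outcomes, bits=1), count_mispredictions(outcomes, bits=2))
```

After the first pass, the 1-bit predictor is wrong twice per loop execution (the exit and the next entry), while the 2-bit counter is wrong only at the exit, because a single not-taken outcome is not enough to flip its prediction.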

21 Returns: Return address predictors

- One target for improvement is predicting indirect jumps, whose address varies at run time.
- The return address from a procedure or method accounts for the vast majority of indirect jumps.
- SPEC95 benchmarks show 15% of branches are returns.
- Java and other OO languages use methods and shorter code runs, so there is even more advantage in optimising method returns.
- A BTB will nearly always get this wrong: calls from a single site are typically not temporally clustered. (Repeated calls from a loop may be.)
- Solution: a return address stack. Push on call, pop on return.
- Accuracy approaches 100%, depending on call depth and stack size.
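The dependence of accuracy on call depth and stack size can be illustrated with a small model. This is a sketch under stated assumptions: a fixed-size stack that discards its oldest entry on overflow, with the hypothetical names `ReturnAddressStack` and `nested_call_accuracy`; real hardware may handle overflow differently.

```python
class ReturnAddressStack:
    """Fixed-size return address predictor: push on call, pop on return."""

    def __init__(self, size):
        self.size = size
        self.entries = []

    def on_call(self, return_address):
        if len(self.entries) == self.size:
            self.entries.pop(0)          # overflow: the oldest entry is lost
        self.entries.append(return_address)

    def predict_return(self):
        return self.entries.pop() if self.entries else None


def nested_call_accuracy(depth, stack_size):
    """Make `depth` nested calls, then return from all of them; report accuracy."""
    ras = ReturnAddressStack(stack_size)
    addresses = [0x1000 + 4 * i for i in range(depth)]   # made-up return addresses
    for address in addresses:
        ras.on_call(address)
    correct = sum(ras.predict_return() == address for address in reversed(addresses))
    return correct / depth


print(nested_call_accuracy(depth=4, stack_size=8))
print(nested_call_accuracy(depth=8, stack_size=4))
```

While the call depth fits in the stack every return is predicted correctly; once the depth exceeds the stack size, the outermost returns have lost their entries and are mispredicted.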

22 Scheduling: Split instruction decode

- Instruction Decode (ID) is a step of the pipeline. It checks for structural and data hazards.
- Split it into two:
  - Issue: decode instructions and check for structural hazards.
  - Read operands: wait for data hazards to clear, then read the operands.
- Extending the 5-step pipeline to out-of-order execution creates the possibility of WAR and WAW hazards. These are solved by register renaming.
- Out-of-order execution creates problems for exceptions.
- Imprecise exceptions: raised exceptions which do not look as if the instructions were executed sequentially.
  - Instructions earlier than the exception may not have completed.
  - Instructions later than the exception may have completed.

23 CDC 6600

24 CDC kwords 3 million instructions per second

25 CDC 160

[Photographs: the computer (and operator); the multiply/divide unit; the computer, around 1960; the CPU and core memory.]

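26 CDC160: Instructions

- 12-bit word.
- Registers: PC, accumulator, address register.
- 4k words of memory.

Instruction format: OpCode F | Mode | Address E

Instructions (opcode bits shown where they survive in the transcript):
- And (0010)
- Or
- Load (0100)
- Load Complement
- Add
- Subtract
- Store
- Shift and Replace
- Add and Replace
- Add One and Replace

00 or 10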

27 CDC160: Addressing modes

Instruction format: OpCode F | Mode | Address E

- Direct and indirect: fetch the contents of E (only 6 bits of address space).
  - If E = 0, the next word holds the address (so all of memory can be addressed).
  - If E is not zero, it points to a word (in locations 0 to 63) which holds the address of the next instruction.
- Forward relative: the next instruction is at PC + E.
- Backward relative: the next instruction is at PC - E.

All possible modes are used, and this limits the instruction set to 16 instructions.
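Splitting a 12-bit word into its fields is a matter of shifts and masks. The sketch below assumes a 4-bit opcode, a 2-bit mode and a 6-bit E field, which is inferred from the slides (12-bit word, 6-bit address, 16 instructions) rather than taken from CDC documentation; `decode` is a hypothetical helper and no meaning is attached to the individual mode values.

```python
def decode(word):
    """Split a 12-bit instruction word into (opcode, mode, e) fields."""
    if not 0 <= word < (1 << 12):
        raise ValueError("expected a 12-bit word")
    opcode = (word >> 8) & 0xF   # top 4 bits: at most 16 instructions
    mode = (word >> 6) & 0x3     # next 2 bits: four addressing modes
    e = word & 0x3F              # low 6 bits: address field E (0 to 63)
    return opcode, mode, e


print(decode(0b0100_01_000101))
```

With only 6 bits in E, a single word can name just 64 locations directly, which is why the E = 0 case uses the following word to reach all of memory.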

28 CDC160: Immediate instructions

Instruction format: OpCode F | Address E

Small constants can be handled in one instruction:
- And: E (zero extended) is ANDed with Acc.
- Or: E is ORed with Acc.
- Load: E is loaded into Acc.
- Load Complement: the complement of E is loaded into Acc.
- Add: E is added to Acc.
- Subtract: E is subtracted from Acc.

Relative jumps

Instruction format: OpCode F | f/b | Address E

- 110x00: Zero Jump (all bits 0).
- 110x01: Non-Zero Jump (at least 1 bit non-zero).
- 110x10: Positive Jump (A > 0).
- 110x11: Negative Jump (A < 0).

Bit 8 selects a jump forward or backward.

29 CDC160: Other instructions

Indirect jumps
- Jump Indirect
- Jump Forward Indirect

Shifts (operand values shown where they survive in the transcript)
- Shift Left
- Shift Left 2 (SN > 37)
- Shift Left
- Shift Left 6 (SN > 37)
- Multiply by
- Multiply by 100 (SN > 37)

Control instructions
- Halt
- Error
- Transmit Program Counter into Accumulator (TPC)

Note that there are more than 16 instructions, so instructions are effectively variable length.

There are no specific function-call instructions to store the PC and transfer control. TPC allows a store and jump in fewer instructions.

30 Dynamic scheduling: Scoreboarding

- Developed for the CDC 6600 in the mid-1960s.
- Execution goal: 1 instruction per cycle.
- Instructions are executed as early as possible, when there are no structural hazards.
- If an instruction stalls, proceed to subsequent instructions and execute them, unless they depend on previous executing (or stalled) instructions.
- Many instructions are in simultaneous execution, so hardware is needed to match:
  - CDC: FP units, 5 memory-reference units, 7 integer units.
  - CDC: load/store.
  - MIPS: FP units only.
- See Hennessy, Appendix A.
- The scoreboard handles hazard detection.
- Example: two multiplier units, one adder, one divide, one for

31 Dynamic scheduling: Execution

See Hennessy, Appendix A.

32 MIPS: Static v dynamic scheduling

- Static scheduling can be done at compile time, but some things are not defined at compile time.
- Suppose we have an instruction like MUL F0, F1, F2. The instruction cannot proceed until the operands are available.
- We would like to start some other instructions in the meantime. How many depends on how long the wait is.
- If the contents of F1 need to be loaded from memory, then this instruction may need to wait. For how long? That depends on whether F1 is fetched from cache or from memory, which will not be known until run time.
- Dynamic scheduling provides a mechanism to keep the pipeline flowing, using information not

33 MIPS: Scoreboard

- Structural hazards can be mitigated by increasing the number of functional units available (FP add, FP mult, ...) and distributing the instructions between them. This takes extra logic.
- The scoreboard is some extra circuitry which takes the instructions and distributes the FP operations between multiple functional units: add, multiply, divide.
- The aim (as always) is 1 instruction per clock cycle.
- Three of the steps in the standard MIPS pipeline (ID, EX, WB) are replaced by Issue, Read Operands, Execution, Write Results.
  - Issue: decode instructions and check for structural hazards.
  - Read operands: wait until there are no data hazards, then read the operands.

34 MIPS: Scoreboard

In-order issue, out-of-order execution, out-of-order commit.

The scoreboard consists of three parts:
- Instruction status
- Functional unit status
- Register result status

35 Status: Units

Instruction status
- Which step the instruction is executing.

Functional unit status
- The status of each functional unit:

    Field    Value                Meaning
    Busy     yes/no               whether the unit is in use
    Op       add, subtract, ...   operation
    Fi       register             destination
    Fj, Fk   registers            source registers
    Qj, Qk   functional units     units producing Fj, Fk
    Rj, Rk   yes/no               yes: registers ready and not yet read

Register result status
- Which functional unit will write to each register, if a functional unit has this register as its destination.

36 Operation

- Issue: requires that the functional unit is free and that no other unit has the same destination register. The scoreboard issues the instruction and updates its internal data structure. This protects against WAW hazards. If issue stalls, the buffer between IF and Issue fills.
- Read operands: the scoreboard monitors the availability of the source operands (data flow). A source operand is available if no earlier issued instruction is going to write to it. When the source operands are available, the scoreboard tells the functional unit to begin execution. This resolves RAW hazards; instructions may be sent to execution out of order.
- Execution: the functional unit executes the instruction and informs the scoreboard when it is complete.
- Write result: the scoreboard checks for WAR hazards and stalls the completing instruction if required.

Execution is as if serial, but instructions may complete out of order. Instructions may even overtake each other.
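The three checks made at Issue, Read Operands and Write Result can be sketched with the status fields named in the slides (Busy, Fi, Fj, Fk, Qj, Qk, Rj, Rk). This is a simplified illustration, not the CDC 6600 or MIPS logic: the `UnitStatus` class, the function names and the register names are assumptions, and execution timing is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnitStatus:
    name: str
    busy: bool = False
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # unit that will produce Fj (None: value available)
    qk: Optional[str] = None
    rj: bool = False           # True: Fj ready and not yet read
    rk: bool = False


def issue(unit, dest, src1, src2, result_status):
    """Issue only if the unit is free and no active unit writes dest (WAW check)."""
    if unit.busy or dest in result_status:
        return False
    unit.busy, unit.fi, unit.fj, unit.fk = True, dest, src1, src2
    unit.qj, unit.qk = result_status.get(src1), result_status.get(src2)
    unit.rj, unit.rk = unit.qj is None, unit.qk is None
    result_status[dest] = unit.name   # register result status
    return True


def can_read_operands(unit):
    """RAW check: both sources must be available."""
    return unit.rj and unit.rk


def can_write_result(unit, units):
    """WAR check: no other active unit still has to read the old value of Fi."""
    for other in units:
        if other is unit or not other.busy:
            continue
        if (other.fj == unit.fi and other.rj) or (other.fk == unit.fi and other.rk):
            return False
    return True


mult, add, div = UnitStatus("mult"), UnitStatus("add"), UnitStatus("div")
units = [mult, add, div]
status = {}
issue(mult, "F0", "F2", "F4", status)
issue(add, "F6", "F0", "F8", status)    # F0 is still being produced by mult
print(can_read_operands(add))
```

In this example the add must wait at Read Operands because its source F0 is pending (RAW), a second instruction writing F0 would be refused at Issue (WAW), and a later instruction writing F8 would be held at Write Result until the add has read F8 (WAR).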

37 Limitations

- Amount of parallelism: if each instruction depends on its predecessor, there is none.
- Number of scoreboard entries (window size): determines how far ahead the pipeline can look for instructions.
- Number and type of functional units: determines how many instructions can execute in parallel. Designs tend to provide more multiply units, since multiply takes longer than add.
- Presence of anti-dependences and output dependences: these lead to WAR and WAW stalls.

38 Note: Balance

- The VAX 8650 had a cycle time of 55 ns, with a sophisticated pipeline.
- The VAX 8700 had a simpler pipeline, which allowed a cycle time of 45 ns.
- The 8650 had about 20% lower CPI; the 8700 had about a 20% faster clock. The two roughly cancel, but the 8700 was simpler, with less hardware.
- Manufacturers' ingenuity replaces simple clock speed. It doesn't always work.

Performance measurement
- Compiler optimisation covers some of the same ground as dynamic scheduling.
- Measuring the improvement with unoptimised code will give an over-optimistic idea of the improvement.
- Be sure what it is you are measuring.
- Complex systems have non-linear interactions.
- Measurements are hard.
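The VAX comparison can be checked with the relation time per instruction = CPI × cycle time. The sketch below uses the cycle times quoted above and a normalised CPI of 1.0 for the 8700, which is an assumption made purely so the 20% difference can be expressed; only the ratio matters.

```python
def time_per_instruction(cpi, cycle_ns):
    """Average time per instruction in nanoseconds."""
    return cpi * cycle_ns


cpi_8700 = 1.0                # normalised baseline (assumed; only the ratio matters)
cpi_8650 = 0.8 * cpi_8700     # about 20% lower CPI

t_8650 = time_per_instruction(cpi_8650, 55)
t_8700 = time_per_instruction(cpi_8700, 45)
print(f"8650: {t_8650:.1f} ns, 8700: {t_8700:.1f} ns per instruction")
```

The two results are within a few percent of each other, which is the point of the slide: the simpler pipeline reaches comparable performance with less hardware.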