Out of Order (OoO) Execution Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm)

Transcription

1 EE457 Out of Order (OoO) Execution Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm) By Gandhi Puvvada

2 References: EE557 Textbook; Prof. Dubois's EE557 Classnotes; Prof. Annavaram's slides; Prof. Patterson's lecture slides

3 Instruction Scheduling (Re-ordering of instructions): We will limit our discussion to scheduling of instructions mostly within the basic block (basic block = a straight-line code sequence with no branches). The compiler can perform static instruction scheduling. The Tomasulo Algorithm lets us schedule instructions dynamically (in hardware).

4 Static Scheduling (based on Prof. Dubois's slides). Strengths: hardware simplicity; the compiler has a global view of the code. Weaknesses: cannot be CPU-implementation specific; cannot foresee dynamic events (cache misses, data-dependent delays, conditional branches); cannot pre-compute memory addresses.


6 Simple 5-stage pipeline (IF, ID, EX, M, WB): In-order execution. RAW dependency: solve it by forwarding; if not, by stalling. Dependent instructions are stalled in the ID stage.

7 Simple 5-stage pipeline: Dependent instructions are stalled in the ID stage. (Figure: pipeline diagram with an and instruction and a lw instruction.)

8 Simple 5-stage pipeline: Dependent instructions cannot be stalled in the EX stage. Why? (Figure: pipeline diagram with an and instruction and a lw instruction.)

9 Provide multiple functional units (for simplicity, we avoid talking about the floating-point execution unit and the floating-point register file). Stall, after decoding, in queues. (Figure: queues and functional units. IF and ID with IM, followed by queues feeding the Integer, Multiply, Divide, and Load/Store (DM) units, then WB.)

10 Tomasulo's plan: OoO (out-of-order) execution. Multiple functional units (say, Integer, DM, Multiplier, Divider). Queues between the ID and EX stages (in place of the ID/EX register).

11 Out of order execution?! Problems all over??!! For the time being: no branch prediction, no speculative execution beyond branches, just stall on a conditional branch. No support for precise exceptions. Even then,

12 RAW, WAR, and WAW. RAW = Read After Write: lw $8, 40($2); add $9, $8, $7; WAR = Write After Read: add $9, $8, $6; lw $8, 40($2); WAW = Write After Write: add $9, $8, $6; lw $9, 40($2); WAW? How is it possible? Consider a printer or a FIFO.

13 RAW, WAR, and WAW (some terminology to remember). RAW = Read After Write, a true dependency: lw $8, 40($2); add $9, $8, $7; WAR = Write After Read, an anti-dependency: add $9, $8, $6; lw $8, 40($2); WAW = Write After Write, an output dependency: add $9, $8, $6; lw $9, 40($2); WAR and WAW are name dependences.

14 RAW, WAR, and WAW. In-order execution: we need to deal with RAW only. Out-of-order execution: now we need to deal with WAR and WAW besides RAW.
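The three hazard kinds can be checked mechanically for a pair of instructions. The following is a minimal sketch, not part of the original slides: the Instr tuple and the hazards function are hypothetical names, and each instruction is reduced to the register it writes and the registers it reads.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical record: one instruction reduced to its register usage."""
    text: str
    dest: Optional[str]       # register written, or None (e.g. sw)
    srcs: Tuple[str, ...]     # registers read


def hazards(first: Instr, second: Instr) -> Set[str]:
    """Hazards that arise if `second` is reordered relative to the earlier `first`."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")      # second reads what first writes (true dependency)
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")      # second overwrites what first still has to read
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")      # both write the same register
    return found


# The three pairs from the slide.
raw_pair = (Instr("lw $8, 40($2)", "$8", ("$2",)),
            Instr("add $9, $8, $7", "$9", ("$8", "$7")))
war_pair = (Instr("add $9, $8, $6", "$9", ("$8", "$6")),
            Instr("lw $8, 40($2)", "$8", ("$2",)))
waw_pair = (Instr("add $9, $8, $6", "$9", ("$8", "$6")),
            Instr("lw $9, 40($2)", "$9", ("$2",)))

for pair in (raw_pair, war_pair, waw_pair):
    print(pair[0].text, "|", pair[1].text, "->", sorted(hazards(*pair)))
```

Only the RAW case reflects a real flow of data; the other two arise purely from reuse of a register name.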

15 Limited Architectural Registers, More Physical Registers: Register Renaming. lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $8, 60($3); add $8, $8, $8; sw $8, 60($3); It is clear that the compiler is using $8 as a temporary register. If there is a delay in obtaining $2, the first part of the code cannot proceed. Unfortunately, the second part of the code cannot proceed either, because of the name dependency on $8.

16 If we had 64 registers instead of 32 registers, then perhaps the compiler might have used $48 instead of $8 and we could have executed the second part of the code before the first part! lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $48, 60($3); add $48, $48, $48; sw $48, 60($3); This is an example of name dependency.

17 Four different temporary registers can be used here as shown: $8, $18, $28, and $48 (or, using code names, LION, TIGER, CAT, and ANT). lw $8, 40($2); add $18, $8, $8; sw $18, 40($2); lw $28, 60($3); add $48, $28, $28; sw $48, 60($3); lw LION, 40($2); add TIGER, LION, LION; sw TIGER, 40($2); lw CAT, 60($3); add ANT, CAT, CAT; sw ANT, 60($3);
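The renaming shown on this slide can be reproduced with a small sketch: every destination gets a fresh name, and every source reads the most recent name given to its register. The rename function and the (op, dest, srcs) tuple encoding are assumptions made for illustration; the register names and code names come from the slide.

```python
def rename(program, fresh_names):
    """Give each destination a fresh name; sources follow the latest mapping."""
    mapping = {}                      # architectural register -> current new name
    names = iter(fresh_names)
    renamed = []
    for op, dest, srcs in program:
        # Sources are looked up BEFORE the destination is remapped,
        # so that add $8, $8, $8 reads the old $8 and writes a new one.
        new_srcs = tuple(mapping.get(reg, reg) for reg in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(names)
            mapping[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# The slide's code, with $8 used as the only temporary.
# sw has no destination register; it reads the data register and the base register.
program = [
    ("lw",  "$8", ("$2",)),
    ("add", "$8", ("$8", "$8")),
    ("sw",  None, ("$8", "$2")),
    ("lw",  "$8", ("$3",)),
    ("add", "$8", ("$8", "$8")),
    ("sw",  None, ("$8", "$3")),
]

renamed = rename(program, ["LION", "TIGER", "CAT", "ANT"])
for instr in renamed:
    print(instr)
```

After renaming, the second group of three instructions no longer shares any name with the first group, so it is free to run ahead.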

18 Can a later implementation provide 64 registers (instead of 32) while maintaining binary compatibility with previously compiled code? Answer: Yes / No. Why?

19 Answer: We cannot change the number of architectural registers. Solution: register renaming through tagging registers. This solves name dependency problems (WAR and WAW) while attending to true dependency (RAW) through waiting in queues.

20 square_root $2, $10; lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $8, 60($3); add $8, $8, $8; sw $8, 60($3); (Figure: the RST and the RF, each with entries $1 ... $31; the legend marks destination and dependent source registers in the code.) RST = Register Status Table, RF = Register File.

21 square_root $2, $10; lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $8, 60($3); add $8, $8, $8; sw $8, 60($3); (Figure: the RST and the RF, each with entries $1 ... $31.)

24 square_root $2, $10; lw $8, 40($2); add $8, $8, $8; sw $8, 40($2); lw $8, 60($3); add $8, $8, $8; sw $8, 60($3); The dispatch unit decodes and dispatches instructions. For the destination operand, an instruction carries a TAG (but not the actual register name)! For source operands, an instruction carries either the values or the TAGs of the operands (but not the actual register names)!

25 TAGs for destinations or sources or for both? A new tag is assigned to the destination register of the instruction being dispatched. For each of the source registers (source operands) of the instruction being dispatched, either the value of the source register (if it is not currently tagged) or the existing tag associated with the source register (if it has been tagged already) is conveyed to the instruction. If a tag is conveyed for a source, then the instruction needs to wait for the original instruction with that destination tag to go on to the CDB (Common Data Bus) and announce the value.
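The dispatch rule just described can be sketched as follows. This is only an illustration under stated assumptions: the RST is modeled as a dictionary from register name to pending tag, the RF as a dictionary from register name to value, tags are handed out from a plain list, and the register values are made up. The function name dispatch and the tuple encoding are hypothetical.

```python
def dispatch(instr, rst, rf, free_tags):
    """Dispatch one instruction: sources get a value or a tag, the destination gets a new tag."""
    op, dest, srcs = instr
    operands = []
    # Step 1: resolve sources first, so that add $8, $8, $8 picks up the
    # tag of the PREVIOUS producer of $8, not its own new tag.
    for reg in srcs:
        if reg in rst:
            operands.append(("tag", rst[reg]))      # wait for this tag on the CDB
        else:
            operands.append(("value", rf[reg]))     # value is already in the RF
    # Step 2: assign a new tag to the destination and record it in the RST.
    dest_tag = None
    if dest is not None:
        dest_tag = free_tags.pop(0)
        rst[dest] = dest_tag
    return (op, dest_tag, operands)


rf = {"$2": 100, "$3": 200, "$8": 0, "$10": 16}     # made-up register contents
rst = {}                                            # no register is tagged at reset
free_tags = list(range(64))                         # assumed pool of 64 tags

program = [
    ("square_root", "$2", ("$10",)),
    ("lw",  "$8", ("$2",)),
    ("add", "$8", ("$8", "$8")),
    ("sw",  None, ("$8", "$2")),
]
dispatched = [dispatch(i, rst, rf, free_tags) for i in program]
for d in dispatched:
    print(d)
```

The sketch leaves out the CDB side: when a tag and its value are broadcast, waiting instructions capture the value, and the RST entry is cleared only if it still holds that same tag.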

26 Unique TAG: Like an SSN, we need a unique TAG. SSNs are reused. Similarly, TAGs can be reused. TAGs are similar to numbered TOKENs.

27 Take a number vs. take a token: In the State Bank of India, they issue brass tokens to customers waiting for service. Tokens are reclaimed and reused.

28 TAGs (= Tokens): How many tokens should the bank cashier have to start with? What happens if the tokens run out? Does he need to keep any order in holding and issuing tokens? Does he have to collect tokens back?

29 TAG FIFO (FIFOs are taught in EE560): To issue and collect tokens (TAGs), use a circular FIFO (First-In-First-Out) unit. It is filled with (say) 64 tokens (in any order) initially on reset. Tokens return out of order anyway. Put returned tokens back and issue them again. (Figure: circular FIFO with write pointer wp and read pointer rp in three states: full, 2 tokens issued, 1 token returned.)
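A minimal software sketch of such a tag FIFO is shown below. It is an assumption-laden model rather than the hardware design: the class name TagFifo and its methods are hypothetical, and a count field is used to tell full from empty, whereas a hardware FIFO would typically compare the pointers themselves.

```python
class TagFifo:
    """Circular FIFO of free tags with a read pointer (rp) and a write pointer (wp)."""

    def __init__(self, n_tags=64):
        self.size = n_tags
        self.slots = list(range(n_tags))   # filled with all tags on reset
        self.rp = 0                        # next tag to issue
        self.wp = 0                        # next slot to receive a returned tag
        self.count = n_tags                # number of free tags (full on reset)

    def issue(self):
        """Hand out one tag, or None if none are left (dispatch must stall)."""
        if self.count == 0:
            return None
        tag = self.slots[self.rp]
        self.rp = (self.rp + 1) % self.size
        self.count -= 1
        return tag

    def give_back(self, tag):
        """Return a tag; tags may come back in any order."""
        if self.count == self.size:
            raise ValueError("FIFO already full: tag returned twice?")
        self.slots[self.wp] = tag
        self.wp = (self.wp + 1) % self.size
        self.count += 1


fifo = TagFifo(4)                  # tiny FIFO so the wrap-around is visible
first, second = fifo.issue(), fifo.issue()   # 2 tokens issued
fifo.give_back(second)                       # 1 token returned (out of order)
rest = [fifo.issue() for _ in range(4)]      # drain: the last request finds it empty
print(first, second, rest)
```

Running out of tags is not an error in this model; it simply means the dispatch unit has to wait until some instruction finishes and its tag is returned.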

30 Block diagram provided by Prof. Dubois, simplified for EE457. (Figure labels: TAG FIFO, Integer Multiplier, Int. Divider, Issue Unit.)

31 Front-End & Back-End: IFQ = Instruction Fetch Queue (a FIFO structure); Dispatch unit (including RST, RF, Tag FIFO); Load-Store and other issue queues; Issue Unit; Functional units; CDB (Common Data Bus).


33 Bottleneck in the design: CDB = Common Data Bus. Do all instructions use the CDB? sw? j (jump)? beq?

34 Load-store queue: address calculation; memory disambiguation.

35 Address calculation for lw and sw: EE557 approach for address calculation; EE457/560 approach for address calculation. Dedicated adder, to compute the address, attached to the load-store queue.

36 EE557 Memory Disambiguation

37 Memory Disambiguation. RAW: sw $2, 2000($0); lw $8, 2000($0); WAW: sw $2, 2000($0); sw $8, 2000($0); WAR: lw $2, 2000($0); sw $8, 2000($0);

38 Memory Disambiguation. RAW: sw $2, 2000($0); lw $8, 2000($0); This later lw can proceed only if there is no store ahead of it with the same address. WAW: sw $2, 2000($0); sw $8, 2000($0); This later sw can proceed only if there is no store ahead of it with the same address. WAR: lw $2, 2000($0); sw $8, 2000($0); This later sw can proceed only if there is no load ahead of it with the same address.
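The three rules above can be written as one check over a load-store queue kept in program order. This is a sketch under a strong simplifying assumption, namely that every address in the queue has already been computed; the function name may_proceed and the (kind, address) encoding are hypothetical.

```python
def may_proceed(queue, index):
    """Can the memory instruction at `index` go to memory, given the older entries ahead of it?

    `queue` is a list of (kind, address) pairs in program order, oldest first.
    Assumption: all addresses are known; real hardware also has to handle
    older entries whose addresses are still being calculated.
    """
    kind, addr = queue[index]
    for older_kind, older_addr in queue[:index]:
        if older_addr != addr:
            continue                      # different address: no conflict
        if kind == "lw" and older_kind == "sw":
            return False                  # RAW: load must wait for the older store
        if kind == "sw":
            return False                  # WAW (older sw) or WAR (older lw)
    return True                           # includes lw after lw to the same address


print(may_proceed([("sw", 2000), ("lw", 2000)], 1))   # RAW case
print(may_proceed([("sw", 2000), ("sw", 2000)], 1))   # WAW case
print(may_proceed([("lw", 2000), ("sw", 2000)], 1))   # WAR case
print(may_proceed([("sw", 2000), ("lw", 2004)], 1))   # no address match
```

Because the check scans only the entries ahead of the instruction, it relies on the queue preserving program order.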

39 Maintaining instructions in the order of arrival (issue order/program order) in a queue: Is it necessary or is it desirable? In the case of the L-S Queue? In the case of the Integer and other queues (mult queue, div queue)?

40 Maintaining instructions in the order of arrival (issue order/program order) in a queue: Is it necessary or is it desirable? In the case of the L-S Queue? NECESSARY, to enforce memory disambiguation rules. In the case of the Integer and other queues (mult queue, div queue)? DESIRABLE, so that an earlier instruction gets executed whenever possible, thereby perhaps reducing the number of instructions waiting on it.

41 Priority (based on the order of arrival) among instructions ready to execute: Is it necessary or is it desirable? Local priority within the queues. Global priority across the queues.

42 Issue Unit: CDB availability constraint. Pipelined functional unit vs. multi-cycle functional unit. Conflict resolution: is round-robin priority adequate? Well,
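For reference, round-robin conflict resolution among requesters can be sketched as below. This only illustrates what round-robin priority means; it does not answer the slide's question of whether it is adequate, and it ignores the pipelined vs. multi-cycle timing issue. The function name and the boolean request vector are assumptions.

```python
def round_robin_grant(requests, last_granted):
    """Pick the next requester after `last_granted`, wrapping around.

    `requests[i]` is True if unit i wants to issue this clock.
    Returns the index granted, or None if nobody is requesting.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n
        if requests[candidate]:
            return candidate
    return None


# Hypothetical requesters: 0 = integer queue, 1 = multiplier, 2 = divider.
print(round_robin_grant([True, False, True], last_granted=0))
print(round_robin_grant([True, False, True], last_granted=2))
```

The unit that was granted most recently gets the lowest priority on the next clock, so no requester can be starved.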

43 Conditional branches: The dispatch unit stops dispatching until the branch is resolved. The CDB broadcasts the result of the branch. Dispatching continues thereafter at either the fall-through instruction or the target instruction. A successful (taken) branch causes flushing of the IFQ, very much like a jump.

44 Conditional branches: Since we stop dispatching instructions after a branch, does it mean that this branch is the last instruction to be executed in the back-end? Is it possible that the back-end holds simultaneously (a) some instructions dispatched before the branch and (b) some instructions issued after the branch was resolved?

45 Tomasulo Loop Example. Loop: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop; Assume multiply takes 4 clocks. Assume the first load takes 8 clocks (cache miss) and the second load takes 1 clock (hit). Based on Prof. Annavaram's lecture slide.

46 How could Tomasulo overlap iterations of loops? Loop: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop; The destination registers get different TAGs in different iterations. These tags are given, in place of the source operands, to the dependent instructions following them.

47 Say, only two iterations. Let us unroll the two iterations. Loop: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop; Loop: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop; (Figure legend: destination register; dependent source register(s).)
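The tagging of the two unrolled iterations can be traced with a short sketch. Several things here are assumptions for illustration: tags are written T1, T2, ...; a source that needs no tag is shown by its register name (standing for a value read from the RF); and, because dispatch stalls at BNE until ADDI has broadcast its result, the tag on $1 is taken to be cleared before the next iteration is dispatched, while the slow LW and the MULT of the first iteration are taken to be still pending.

```python
from itertools import count

LOOP_BODY = [
    ("LW",   "$2", ("$1",)),
    ("MULT", "$4", ("$2", "$3")),
    ("SW",   None, ("$4", "$1")),
    ("ADDI", "$1", ("$1",)),
    ("BNE",  None, ("$1", "$0")),
]


def tag_iterations(body, iterations, done_before_next=("$1",)):
    """List (iteration, op, destination tag, source tags or names) in dispatch order."""
    rst = {}                       # register -> pending tag number
    next_tag = count(1)
    trace = []
    for it in range(1, iterations + 1):
        for op, dest, srcs in body:
            # Sources first: a pending register is replaced by its tag.
            src_view = tuple(f"T{rst[r]}" if r in rst else r for r in srcs)
            dest_view = None
            if dest is not None:
                rst[dest] = next(next_tag)
                dest_view = f"T{rst[dest]}"
            trace.append((it, op, dest_view, src_view))
        # Assumption: these results were broadcast before the branch resolved.
        for reg in done_before_next:
            rst.pop(reg, None)
    return trace


trace = tag_iterations(LOOP_BODY, 2)
for row in trace:
    print(row)
```

In this trace the first MULT waits on T1 and the second on T4, so the second iteration's load and multiply are not tied to the first iteration's cache miss even though both iterations name $2 and $4.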

48 Because there is no reorder buffer. Note: Your EE560 project will use a reorder buffer!