Computer Organization and Components (IS1500), fall 2015
Lecture: Pipelined Processors
Associate Professor, KTH Royal Institute of Technology
Assistant Research Engineer, University of California, Berkeley
Slides version 2

Course Structure
Module 1: C and Assembly Programming
Module 2: I/O Systems
Module 3: Logic Design
Module 4: Processor Design
Module 5: Memory Hierarchy
Module 6: Parallel Processors and Programs
[Course-structure grid: per-module lecture (L), exercise, seminar (S), and lab (LB) session codes; the course ends with a project expo.]
Abstractions in Computer Systems
- Networked Systems and Systems of Systems
- Computer System
- Application Software (Software)
- Operating System
- Instruction Set Architecture (Hardware/Software Interface)
- Microarchitecture (Digital Hardware Design)
- Logic and Building Blocks
- Digital Circuits
- Analog Circuits (Analog Design and Physics)
- Devices and Physics

Agenda
[Photo by Raysonho @ Open Grid Scheduler / Grid Engine, own work. Licensed under Creative Commons Zero, Public Domain.]
Acknowledgement: The structure and several of the good examples are derived from the book Digital Design and Computer Architecture (2013) by D. M. Harris and S. L. Harris.

Jump Path (Revisited)
[Figure: single-cycle MIPS datapath with the jump path; instruction fields 25:21, 20:16, 15:0; control signals RegWrite, Branch, RegDst, ALUSrc, ALUControl, MemWrite, MemToReg; PC, Sign Extend, ALU with Zero flag.]
Control Unit (Revisited)
Decoding the instruction: the 6-bit op field drives a Main Decoder that produces the control signals for the datapath (RegWrite, RegDst, ALUSrc, Branch, MemWrite, MemToReg, Jump) and a 2-bit ALUOp; the 6-bit funct field drives an ALU Decoder that produces ALUControl.

Performance Analysis (Revisited)
Execution time (in seconds) = (# instructions) × (clock cycles / instruction) × (seconds / clock cycle)
- Number of instructions in a program (# = number of): determined by the programmer, the compiler, or both.
- Average cycles per instruction (CPI): determined by the microarchitecture implementation.
- Seconds per cycle = clock period T_C: determined by the critical path in the logic.
For the single-cycle processor, each instruction takes one clock cycle; that is, CPI = 1. The main problem with the single-cycle processor design (last lecture) is the long critical path. Solution: pipelining.
Parallelism and Pipelining (1/6): Definitions
- Processing system: a system that takes inputs and produces outputs.
- Token: an input that is processed by the processing system and results in an output.
- Latency: the time it takes for the system to process one token.
- Throughput: the number of tokens that can be processed per time unit.

Parallelism and Pipelining (2/6): Sequential Processing
Example: assume we have a Christmas card factory with two machines (M1 and M2).
- M1: prints out the card (takes 6 s)
- M2: puts on a stamp (takes 4 s)
Approach 1: process tokens sequentially; in this case a token is a card.
The latency is 6 + 4 = 10 s. The throughput is 1/10 = 0.1 tokens per second, or 6 tokens per minute.
[Timing diagram: cards pass through M1 and then M2 one at a time, on a 0–26 s time axis.]
Parallelism and Pipelining (3/6): Parallel Processing (Spatial Parallelism)
Example: assume we have a Christmas card factory with four machines.
- M1: prints out the card (takes 6 s)
- M2: puts on a stamp (takes 4 s)
- M3: prints out the card (takes 6 s)
- M4: puts on a stamp (takes 4 s)
[Timing diagram: two cards are processed in parallel on the duplicated line, on a 0–26 s time axis.]

Parallelism and Pipelining (4/6): Pipelining (Temporal Parallelism)
Example: assume we have a Christmas card factory with two machines.
- M1: prints out the card (takes 6 s)
- M2: puts on a stamp (takes 4 s)
[Timing diagram: M2 stamps one card while M1 prints the next, on a 0–26 s time axis.]
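The three approaches can be compared numerically. A minimal sketch, assuming 6 s for printing and 4 s for stamping per card:

```python
# Card-factory sketch: latency and throughput for sequential processing,
# spatial parallelism (two copies of the line), and pipelining.

PRINT_S, STAMP_S = 6.0, 4.0

def sequential():
    latency = PRINT_S + STAMP_S          # one card at a time through both machines
    return latency, 1.0 / latency        # (latency in s, throughput in cards/s)

def spatial(copies=2):
    latency, per_line = sequential()
    return latency, copies * per_line    # duplicate the whole line

def pipelined():
    latency = PRINT_S + STAMP_S          # each card still takes 10 s end to end
    bottleneck = max(PRINT_S, STAMP_S)   # but a new card can start every 6 s
    return latency, 1.0 / bottleneck

for name, (lat, thr) in [("sequential", sequential()),
                         ("spatial x2", spatial()),
                         ("pipelined ", pipelined())]:
    print(f"{name}: latency {lat:.0f} s, throughput {thr * 60:.0f} cards/min")
```

Note that pipelining raises throughput (to the rate of the slowest stage) without changing the latency of a single card, and without duplicating hardware.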
Parallelism and Pipelining (5/6): Summary
[Summary figure comparing spatial and temporal parallelism.]

Parallelism and Pipelining (6/6): Performance Analysis for Pipelining
Idea: we introduce a pipeline in the processor. How does this affect the execution time?
Execution time (in seconds) = (# instructions) × (clock cycles / instruction) × (seconds / clock cycle)
- Pipelining does not change the number of instructions.
- Pipelining will not improve the CPI (it actually makes it slightly worse).
- Pipelining will improve the cycle period (it makes the critical path shorter).
Towards a Pipelined Processor (1/8)
Recall the single-cycle datapath (the logic for the j and beq instructions is hidden).
[Figure: single-cycle datapath with PC, instruction memory, register file, Sign Extend, ALU, and data memory.]

Towards a Pipelined Processor (2/8): Fetch Stage
A register splits the datapath into stages, forming a pipeline. First, we introduce an instruction fetch stage.
[Figure: a pipeline register after the instruction memory marks off the Fetch (F) stage.]
Towards a Pipelined Processor (3/8): Decode Stage
The decode stage decodes an instruction and reads out values from the register file.
[Figure: Fetch (F) and Decode (D) stages separated by a pipeline register.]

Towards a Pipelined Processor (4/8): Execute Stage
An execute stage performs the computation using the ALU.
[Figure: Fetch (F), Decode (D), and Execute (E) stages.]
Towards a Pipelined Processor (5/8): Memory Stage
Reading and writing to memory is done in the memory stage.
[Figure: Fetch (F), Decode (D), Execute (E), and Memory (M) stages.]

Towards a Pipelined Processor (6/8): Writeback Stage
The results are written back to the register file in the writeback stage. Can you see a problem with the writeback?
[Figure: the full five-stage datapath: Fetch (F), Decode (D), Execute (E), Memory (M), Writeback (W).]
Towards a Pipelined Processor (7/8): Writeback Stage
[Figure: the destination register number is now carried along through the pipeline registers, so that the writeback updates the register of the instruction that is actually in the writeback stage.]

Towards a Pipelined Processor (8/8): Another Issue
Can you see another issue?
[Figure: the five-stage datapath again.]
Five-Stage Pipeline
In each cycle a new instruction is fetched, but it takes 5 cycles to complete each instruction. In each cycle, all stages are handling different instructions in parallel.
Example: in cycle 6, the result of the sub instruction is written back to register $t1.
[Pipeline diagram, cycles 1–9:]
add  $s0, $s1, $s2   F D E M W
sub  $t1, $t0, $t2     F D E M W
addi $t3, $0, 55         F D E M W
xori $t4, $t5, 1           F D E M W
and  $t6, $s0, $s1           F D E M W
We can fill the pipeline because there are no dependencies between the instructions.
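A stage diagram like the one above can be generated mechanically. A minimal sketch, assuming an ideal five-stage pipeline with no hazards and one instruction issued per cycle:

```python
# Print the cycle-by-cycle stage diagram for an ideal 5-stage pipeline
# (F, D, E, M, W): instruction i enters Fetch in cycle i+1 and occupies
# one stage per cycle after that.

STAGES = ["F", "D", "E", "M", "W"]

def pipeline_diagram(instructions):
    """Return one row per instruction; column i is the stage occupied in cycle i+1."""
    n_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        row = [" "] * n_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage          # stage s is reached s cycles after fetch
        rows.append(f"{name:20s} " + " ".join(row))
    return rows

for line in pipeline_diagram(["add $s0, $s1, $s2", "sub $t1, $t0, $t2"]):
    print(line)
```

With two instructions, the second one reaches Writeback in cycle 6, matching the sub example in the slide.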
Hazards (1/4): Read after Write (RAW)
add  $s0, $s1, $s2
sub  $t0, $s0, $t1
and  $t2, $t1, $s0
xori $t4, $s0, 2
The add instruction writes back the value of $s0 in cycle 5, but $s0 is read in the decode stage already in cycle 3 (by sub). A data hazard occurs when an instruction reads a register that has not yet been written. This kind of data hazard is called a read-after-write (RAW) hazard. The and instruction will also use the wrong value for $s0.

Hazards (2/4): Solution 1: Forwarding
add  $s0, $s1, $s2
sub  $t0, $s0, $t1
and  $t2, $t1, $s0
xori $t4, $s0, 2
The result from the execute stage for add can be forwarded (also called bypassing) to the execute stage for sub. The and instruction's hazard is solved by forwarding as well. Hazard detection is implemented using a hazard detection unit that asserts control signals in the datapath when data should be forwarded. Can all data hazards be solved using forwarding?
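The forwarding decision can be sketched as a small function. The names (ex_mem_rd and so on) are my own, and the condition follows the classic textbook formulation rather than the exact unit in these slides:

```python
# Sketch of EX-stage forwarding logic: forward a result from a later pipeline
# register when it targets a source register of the instruction currently in
# Execute. Register $0 is never forwarded (it is hardwired to zero).

def forward_select(src_reg, ex_mem_rd, ex_mem_regwrite, mem_wb_rd, mem_wb_regwrite):
    """Return which value the ALU input should use for src_reg."""
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"   # forward the ALU result computed one cycle earlier
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"   # forward the value about to be written back
    return "REGFILE"      # no hazard: use the value read in Decode

# sub $t0, $s0, $t1 right after add $s0, ...: $s0 (register 16) comes from EX/MEM.
print(forward_select(src_reg=16, ex_mem_rd=16, ex_mem_regwrite=True,
                     mem_wb_rd=0, mem_wb_regwrite=False))  # EX/MEM
```

Checking the EX/MEM register first matters: when both pipeline registers hold the same destination, the younger (EX/MEM) result is the correct one.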
Hazards (3/4): Solution 1: Forwarding (partially)
lw   $s0, 20($s2)
sub  $t0, $s0, $t1
and  $t2, $t1, $s0
xori $t4, $s0, 2
The hazard for sub cannot be solved using forwarding, because the memory access result is available only at the end of cycle 4 but is needed at the beginning of cycle 4. The memory result can be forwarded to the execute stage of and after the memory stage. xori can read the data directly from the register file (the register file writes in the first part of the cycle and reads in the second part).

Hazards (4/4): Solution 2: Stalling
The solution when forwarding does not work: stalling. After stalling, the result can be forwarded to the execute stage.
lw   $s0, 20($s2)   F D E M W
sub  $t0, $s0, $t1    F D D E M W
and  $t2, $t1, $s0      F F D E M W
xori $t4, $s0, 2            F D E M W
We need to stall the pipeline: stages are repeated and the fetch of xori is delayed. The unused stage is called a bubble. Stalling results in more than one cycle per instruction (CPI > 1).
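The cost of these bubbles can be estimated. A back-of-the-envelope sketch with made-up workload fractions (not figures from the lecture):

```python
# Effect of load-use stalls on CPI: each load that is immediately followed by a
# dependent instruction inserts one bubble, i.e. one extra cycle.

def effective_cpi(base_cpi, load_fraction, load_use_fraction, stall_cycles=1):
    """CPI including one-cycle load-use stalls."""
    return base_cpi + load_fraction * load_use_fraction * stall_cycles

# Assume 25% of instructions are loads, and 40% of loads are immediately
# followed by an instruction that uses the loaded value.
cpi = effective_cpi(base_cpi=1.0, load_fraction=0.25, load_use_fraction=0.40)
print(cpi)  # 1.1
```

This is one reason compilers try to schedule an independent instruction between a load and its first use: it removes the bubble without any hardware cost.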
Control Hazards (1/5): Assume Branch Not Taken
20: beq  $s0, $s2, 10
24: sub  $t0, $s0, $t1
28: and  $t2, $t1, $s0
2C: xori $t4, $s0, 2
...
64: addi $t3, $s0, 4
The branch target address is computed and the equality comparison is done in the Execute (E) stage. If the branch is taken, the PC is updated in the Memory (M) stage. If the branch is taken, we need to flush the pipeline: the three instructions after the branch have already been fetched, giving a branch misprediction penalty of 3 cycles. Can we improve this?

Control Hazards (2/5): Improving the Pipeline
[Figure: the branch comparison and branch target computation are moved earlier in the pipeline, into the Decode stage.]
Control Hazards (3/5): Assume Branch Not Taken
20: beq  $s0, $s2, 10
24: sub  $t0, $s0, $t1
28: and  $t2, $t1, $s0
2C: xori $t4, $s0, 2
...
64: addi $t3, $s0, 4
The decode stage can now change the next PC, so that the instruction at the branch-taken address is fetched in the following cycle. The branch misprediction penalty is thereby reduced to 1 cycle. Note that we may now introduce another data hazard (if the branch operands are not available in the decode stage); this can be solved with forwarding or stalling.

Branch Delay Slot (explained)
[Slide explaining the MIPS branch delay slot.]
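The effect of the reduced penalty on CPI can be sketched with hypothetical workload numbers (branch frequency and misprediction rate below are illustration values):

```python
# Average CPI cost of branch mispredictions: each mispredicted branch wastes
# penalty_cycles cycles on flushed instructions.

def branch_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """CPI including the average cost of flushing on mispredicted branches."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Branch resolved in the Memory stage: 3 flushed instructions per misprediction.
late = branch_cpi(1.0, branch_fraction=0.20, mispredict_rate=0.5, penalty_cycles=3)
# Branch resolved in Decode: the penalty drops to 1 cycle.
early = branch_cpi(1.0, branch_fraction=0.20, mispredict_rate=0.5, penalty_cycles=1)
print(late, early)  # 1.3 1.1
```

The same formula explains why deeper pipelines (next slide) need good branch prediction: penalty_cycles grows with the number of stages before the branch is resolved.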
Control Hazards (4/5): Deeper Pipelines
Why do we sometimes want more stages than 5? Why not always have more pipeline stages?

Control Hazards (5/5): Deeper Pipelines
How can we handle deep pipelines and minimize mispredictions?
- Static branch predictors: statically (at compile time) determine whether a branch is taken or not; for instance, predict branch not taken.
- Dynamic branch predictors: dynamically (at runtime) predict whether a branch will be taken or not. They operate in the fetch stage and maintain a table, called the branch target buffer, that contains hundreds or thousands of executed branch instructions, their destinations, and information about whether the branches were taken or not.
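As an illustration of the dynamic idea (not taken from the lecture itself), here is a 2-bit saturating-counter predictor, a common building block used alongside a branch target buffer:

```python
# A 2-bit saturating-counter branch predictor: states 0-1 predict "not taken",
# states 2-3 predict "taken"; each outcome moves the counter one step. Two
# wrong outcomes in a row are needed to flip the prediction, so a single loop
# exit does not disturb an otherwise stable "taken" prediction.

class TwoBitPredictor:
    def __init__(self):
        self.state = 0  # strongly not taken

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch taken repeatedly: after two taken outcomes the predictor locks
# on to "taken", and the single not-taken loop exit does not flip it back.
p = TwoBitPredictor()
predictions = []
for taken in [True, True, True, True, False, True]:
    predictions.append(p.predict())
    p.update(taken)
print(predictions)  # [False, False, True, True, True, True]
```

A real branch target buffer would index one such counter (plus the branch target address) per branch, looked up in the fetch stage by PC.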
ARMv7
- The most popular ISA for embedded devices (9 billion devices, growing by about 2 billion a year).
- More complex addressing modes than MIPS (can do shift and add of addresses in registers in one instruction).
- Condition results are saved in special flags: negative, zero, carry, overflow.
- 16 registers, each 32 bits (integers); instruction size 32 bits (16-bit encodings in Thumb mode).
- Conditional execution of instructions, depending on condition codes.
- Example: ARM Cortex-A8, a processor at 1 GHz with a 13-stage pipeline and a branch predictor.
x86
- Standard in laptops, PCs, and in the cloud.
- CISC instructions are more powerful than those of ARM and MIPS, but require more complex hardware.
- The architecture has evolved over the last 35 years; there are 16-, 32-, and 64-bit variants.
- 8 general-purpose registers (eax, ebx, ecx, edx, esp, ebp, esi, edi).
- Variable-length instruction encoding (between 1 and 15 bytes).
- Arithmetic operations allow the destination operand to be in memory.
- Major manufacturers are Intel and AMD.

Summary
Some key takeaway points:
- Pipelining is a temporal way of achieving parallelism.
- Pipelined processors improve performance by reducing the clock period (shorter critical path).
- Pipelining introduces pipeline hazards; there are two main kinds: data hazards and control hazards.
- Data hazards are solved by forwarding or stalling.
- Control hazards are solved by flushing the pipeline and mitigated by branch prediction.
Thanks for listening!