The Pipelined CPU. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Revised 9/22/2013

Pipelining: A Faster Approach There are two types of simple control unit design:
1. The single-cycle CPU with its slow clock, which executes one instruction per clock pulse.
2. The multi-cycle CPU with its faster clock, which divides the execution of an instruction into 3, 4, or 5 phases, but takes that number of clock pulses to execute a single instruction.
We now move to the more sophisticated CPU design that allows the apparent execution of one instruction per clock cycle, even with the faster clock. This design technique is called pipelining, though it might better be viewed as an assembly line.

Pipelining Analogy Pipelined laundry: overlapping execution. Parallelism improves performance.
Four loads: Speedup = 8/3.5 ≈ 2.3
Non-stop: Speedup = 2n/(0.5n + 1.5), which approaches 4, the number of stages.
(4.5 An Overview of Pipelining, Chapter 4 The Processor)
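The speedup figures on this slide can be checked with a short calculation. This is an illustrative sketch, taking the sequential time of 2n and pipelined time of 0.5n + 1.5 straight from the slide's formulas:

```python
# Four loads of laundry: sequential execution takes 8 time units,
# while the pipelined schedule finishes at 3.5 time units.
speedup_four_loads = 8 / 3.5          # about 2.3

def speedup(n):
    """Speedup for n loads: sequential time 2n vs. pipelined 0.5n + 1.5."""
    return (2 * n) / (0.5 * n + 1.5)

# As n grows, the startup term 1.5 is amortized away and the speedup
# approaches 4, the number of pipeline stages.
print(round(speedup_four_loads, 1))   # 2.3
print(round(speedup(10_000), 2))
```

Note that even for n = 10,000 loads the speedup never quite reaches 4; the fill time of the pipeline is only amortized, never eliminated.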

The Ford Assembly Line in 1913

Mr. Ford's Idea Henry Ford began working on the assembly line concept about 1908 and had essentially perfected the idea by 1913. His motivations are worth study. In previous years, automobile manufacture was done by highly skilled technicians, each of whom assembled the whole car. It occurred to Mr. Ford that he could get more workers if he did not require such a high skill level. One way to do this was to have each worker perform only a few tasks. It soon became obvious that it was easier to bring the automobile to the worker than to have the worker (and his tools) move to the automobile. The assembly line was born.

Productivity in an Assembly Line We may ask how to measure the productivity of an assembly line. How long does it take for any individual car to be fabricated, from basic parts to finished product? This might be a day or so. How many cars are produced per hour? This might be measured in the hundreds. The key measure is the rate of production.

The Pipelined CPU The CPU pipeline is similar to an assembly line.
1. The execution of an instruction is broken into a number of simple steps, each of which can be handled by an efficient execution unit.
2. The CPU is designed so that it can execute a number of instructions simultaneously, each in its own distinct phase of execution.
3. The important number is the number of instructions completed per unit time, or equivalently the instruction issue rate.

A Constraint on Pipelining The effect of an instruction sequence cannot be changed by the pipeline. Consider the following code fragment.

add $s0, $t0, $t1   # $s0 = $t0 + $t1
sub $t2, $s0, $t3   # $t2 = $s0 - $t3

This does not have the same effect as it would if reordered.

sub $t2, $s0, $t3   # $t2 = $s0 - $t3
add $s0, $t0, $t1   # $s0 = $t0 + $t1

The Earliest Pipelines The first problem to be attacked in the development of pipelined architectures was the fetch-execute cycle. The instruction is fetched and then executed. How about fetching one instruction while a previous instruction is executing? It is here that we see one of the advantages of RISC designs, such as the MIPS. Every instruction has the same length (32 bits), so an instruction can be pre-fetched without taking time to identify it.

Instruction Pre-Fetch Break up the fetch-execute cycle and do the two steps in parallel. This dates to the IBM Stretch (1959). The prefetch buffer is implemented in the CPU with on-chip registers. Logically, it is a queue. The CDC 6600 buffer had a queue of length 8.

Flushing the Instruction Queue The CDC-6600 was word addressable. Suppose that an instruction at address X is executing. The instructions at addresses (X+1) through (X+7) are in the queue. Suppose that the instruction at address X causes a jump to address (X + 100). The instructions already in the queue are stale and will not be used. Flush the queue.

MIPS Pipeline Five stages, one step per stage:
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register

Pipelining and ISA Design The MIPS ISA was designed for pipelining.
All instructions are 32 bits: easier to fetch and decode in one cycle (cf. x86: 1- to 17-byte instructions).
Few and regular instruction formats: can decode and read registers in one step.
Load/store addressing: can calculate the address in the 3rd stage and access memory in the 4th stage.
Alignment of memory operands: a memory access takes only one cycle.

Pipeline Performance Assume the time for stages is 100 ps for register read or write and 200 ps for other stages. Compare the pipelined datapath with the single-cycle datapath.

Instr     | Instr fetch | Register read | ALU op | Memory access | Register write | Total time
lw        | 200 ps      | 100 ps        | 200 ps | 200 ps        | 100 ps         | 800 ps
sw        | 200 ps      | 100 ps        | 200 ps | 200 ps        |                | 700 ps
R-format  | 200 ps      | 100 ps        | 200 ps |               | 100 ps         | 600 ps
beq       | 200 ps      | 100 ps        | 200 ps |               |                | 500 ps
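The table's totals, and the clock periods they imply, can be recomputed from the per-stage times. A small sketch (the stage names are informal labels of mine, not part of the MIPS design):

```python
stage_time = {"IF": 200, "RegRead": 100, "ALU": 200, "MEM": 200, "RegWrite": 100}  # ps

instr_stages = {
    "lw":       ["IF", "RegRead", "ALU", "MEM", "RegWrite"],
    "sw":       ["IF", "RegRead", "ALU", "MEM"],
    "R-format": ["IF", "RegRead", "ALU", "RegWrite"],
    "beq":      ["IF", "RegRead", "ALU"],
}

# Total datapath time for each instruction class, as in the table.
totals = {name: sum(stage_time[s] for s in stages)
          for name, stages in instr_stages.items()}

# The single-cycle clock must accommodate the slowest instruction;
# the pipelined clock only needs to accommodate the slowest stage.
single_cycle_clock = max(totals.values())    # 800 ps
pipelined_clock = max(stage_time.values())   # 200 ps
print(totals)
```

This is why the pipelined clock period is 200 ps even though register access takes only 100 ps: the clock is set by the slowest stage, not the average.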

A Pipeline Needs Registers Consider a CPU with five independent instructions in different stages of execution. Each instruction needs to execute within its own context and have its own state (set of registers). Fortunately, Moore's law has allowed us to place increasingly large numbers of transistors (hence registers) in a CPU.

MIPS Pipelined Datapath (4.6 Pipelined Datapath and Control) In the datapath diagram, right-to-left flow (the WB stage writing the register file, and the MEM stage feeding the PC) leads to hazards.

Pipeline Registers Registers are needed between stages to hold the information produced in the previous cycle.

Pipelined Control Control signals are derived from the instruction, as in the single-cycle implementation.

IF for Load, Store, …

ID for Load, Store, …

EX for Load

MEM for Load

WB for Load: wrong register number (the write-register number must be carried along through the pipeline registers so that WB writes the correct destination)

Corrected Datapath for Load

Pipelines and Heat The number of stages in a pipeline is often called the depth of the pipeline. The MIPS design has depth = 5. The greater the depth of the pipeline, the faster the CPU, and the more heat the CPU generates, as it has more registers and circuits to support the pipeline.

Graphical Depiction of the Pipeline The data elements in IF and ID are read during the second half of the clock cycle. The data elements in WB are written during the first half of the clock cycle.

Pipeline Performance Single-cycle (Tc = 800 ps) versus pipelined (Tc = 200 ps).

Pipeline Speedup If all stages are balanced (i.e., all take the same time):

Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages

If the stages are not balanced, the speedup is less. The speedup is due to increased throughput; the latency (time for each instruction) does not decrease.
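Plugging in the 800 ps and 200 ps figures from the earlier table makes the throughput/latency distinction concrete. A quick sketch:

```python
t_single = 800   # ps: single-cycle datapath, one instruction per clock
t_stage = 200    # ps: the slowest stage sets the pipelined clock
stages = 5

# Throughput: one instruction completes every 200 ps instead of every 800 ps.
# The speedup is 4x, not 5x, because the stages are unbalanced.
throughput_speedup = t_single / t_stage

# Latency: each instruction now occupies all five 200 ps stages,
# so an individual instruction actually takes longer than before.
latency_pipelined = stages * t_stage
print(throughput_speedup, latency_pipelined)  # 4.0 1000
```

The latency (1000 ps) exceeds the single-cycle 800 ps, yet the machine is faster: the rate of completion is what matters, just as with the assembly line.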

The Ideal Pipeline Each instruction is issued and enters the pipeline; this step is IF (Instruction Fetch). As it progresses through the pipeline, it does not depend on the results of any other instruction now in the pipeline. The speedup of an N-stage pipeline under these conditions is about N. The older vector machines, such as the Cray, structured code to meet this ideal.

Pipeline Realities
1. It is not possible for any instruction to depend on the results of instructions that will execute in the future. This is a logical impossibility.
2. There are no issues associated with dependence on instructions that have completed execution and exited the pipeline.
3. Hazards occur due to instructions that have started execution but are still in the pipeline.
4. It is possible, and practical, to design a compiler that will minimize hazards in the pipeline. This is a desirable result of the joint design of compiler and ISA.
5. It is not possible for the compiler to eliminate all pipelining problems without reducing the CPU to a non-pipelined datapath, which is unacceptably slow.

One Simplistic Solution As we shall see in a moment, the following code causes a data hazard in the pipeline.

add $s0, $t0, $t1
sub $t2, $s0, $t3

Here is a simplistic way to avoid the hazard.

add $s0, $t0, $t1
nop                 # Do nothing for two
nop                 # clock cycles.
sub $t2, $s0, $t3

One View of the Pipeline Process This is the ideal case with no stalls.

Another View of the Pipeline

Hazards Situations that prevent starting the next instruction in the next cycle:
Structure hazards: a required resource is busy.
Data hazards: need to wait for a previous instruction to complete its data read/write.
Control hazards: deciding on a control action depends on a previous instruction.

Bubbles in the Pipeline A bubble, also called a pipeline stall, occurs when a hazard must be resolved before instruction execution can continue. The control unit will insert a nop instruction into the pipeline in order to stall it.

Structure Hazards A structure hazard is a conflict for the use of a resource. In a MIPS pipeline with a single memory, a load/store requires data access, so an instruction fetch would have to stall for that cycle, causing a pipeline bubble. Hence, pipelined datapaths require separate instruction/data memories, or separate instruction/data caches.

Avoiding the Structure Hazard

Data Hazards An instruction depends on completion of a data access by a previous instruction.

add $s0, $t0, $t1
sub $t2, $s0, $t3

Forwarding (aka Bypassing) Use a result as soon as it is computed; don't wait for it to be stored in a register. This requires extra connections in the datapath.

Load-Use Data Hazard Stalls cannot always be avoided by forwarding. If the value has not been computed when it is needed, it can't be forwarded backward in time!

Code Scheduling to Avoid Stalls Reorder code to avoid use of a load result in the next instruction. C code for A = B + E; C = B + F;

Unscheduled (13 cycles):
lw  $t1, 0($t0)
lw  $t2, 4($t0)
add $t3, $t1, $t2   # stall: waits on $t2
sw  $t3, 12($t0)
lw  $t4, 8($t0)
add $t5, $t1, $t4   # stall: waits on $t4
sw  $t5, 16($t0)

Scheduled (11 cycles):
lw  $t1, 0($t0)
lw  $t2, 4($t0)
lw  $t4, 8($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)
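The 13- versus 11-cycle counts can be reproduced by counting load-use stalls. This is an illustrative sketch, assuming forwarding resolves every other dependence and each load-use pair costs exactly one stall cycle; the tuple encoding of instructions is my own:

```python
def pipeline_cycles(instrs, stages=5):
    """Cycles for a list of (op, dest, sources) tuples on a 5-stage
    pipeline with forwarding: pipeline fill time, plus one cycle per
    instruction, plus one stall whenever a load is immediately
    followed by an instruction that uses its result."""
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return (stages - 1) + len(instrs) + stalls

unscheduled = [
    ("lw",  "$t1", ("$t0",)),
    ("lw",  "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),   # load-use stall on $t2
    ("sw",  None,  ("$t3", "$t0")),
    ("lw",  "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),   # load-use stall on $t4
    ("sw",  None,  ("$t5", "$t0")),
]
# Same instructions, with the third load hoisted above the first add.
scheduled = [unscheduled[0], unscheduled[1], unscheduled[4],
             unscheduled[2], unscheduled[3], unscheduled[5], unscheduled[6]]

print(pipeline_cycles(unscheduled))  # 13
print(pipeline_cycles(scheduled))    # 11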

Dependencies & Forwarding

Forwarding Paths

Detecting the Need to Forward Pass register numbers along the pipeline; e.g., ID/EX.RegisterRs is the register number for Rs sitting in the ID/EX pipeline register. The ALU operand register numbers in the EX stage are given by ID/EX.RegisterRs and ID/EX.RegisterRt. Data hazards occur when:
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs (forward from the EX/MEM pipeline register)
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt (forward from the EX/MEM pipeline register)
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs (forward from the MEM/WB pipeline register)
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt (forward from the MEM/WB pipeline register)
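These conditions translate directly into forwarding-unit logic. A sketch of the idea in Python; note that the reg_write and $zero guards are refinements the full hardware design adds beyond the equality tests this slide lists:

```python
def forwarding_controls(ex_mem, mem_wb, id_ex_rs, id_ex_rt):
    """Choose the source for each of the two ALU operands in EX.

    Returns 'reg' (value read in ID, no forwarding), 'ex/mem'
    (conditions 1a/1b) or 'mem/wb' (conditions 2a/2b) per operand.
    ex_mem and mem_wb model pipeline registers as dicts with
    'reg_write' and 'rd' fields.
    """
    def pick(src):
        # EX-hazard cases 1a/1b take priority over 2a/2b, so the most
        # recently produced value wins.  Never forward a result that
        # will not be written, and never forward register $zero.
        if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src:
            return "ex/mem"
        if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src:
            return "mem/wb"
        return "reg"
    return pick(id_ex_rs), pick(id_ex_rt)

# add $s0, $t0, $t1 followed by sub $t2, $s0, $t3: $s0 is register 16.
ex_mem = {"reg_write": True, "rd": 16}    # the add, now in MEM
mem_wb = {"reg_write": False, "rd": 0}    # nothing relevant in WB
print(forwarding_controls(ex_mem, mem_wb, 16, 11))  # ('ex/mem', 'reg')
```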

Datapath with Forwarding

Control Hazards A branch determines the flow of control, so fetching the next instruction depends on the branch outcome. The pipeline can't always fetch the correct instruction, since it may still be working on the ID stage of the branch. In the MIPS pipeline, we need to compare registers and compute the target early in the pipeline, which means adding hardware to do it in the ID stage.

HLL Control Hazard Examples

# Branch based on a loop structure.
# Python semantics: xx[10] is the last element processed.
for k in range(0, 11):
    xx[k] = 0

# Branch based on a conditional statement.
if (x < y):
    z = y - x
else:
    z = x - y

Branch for Loop
100 la   $t0, xx       # Get base address of xx. Pseudo-
                       # instruction taking 8 bytes.
108 li   $t1, 0        # Initialize counter
112 slti $t2, $t1, 11  # $t2 set to 1 if $t1 < 11
116 beq  $t2, $0, 4    # Exit loop: 120 + 4*4 = 136
120 sw   $0, ($t0)     # Clear word at address in $t0
124 addi $t0, $t0, 4   # Move address to next word
128 addi $t1, $t1, 1   # Increment counter
132 beq  $0, $0, -6    # Branch back to 112: 136 - 6*4 = 112
                       # End of loop
136 li   $t0, 0

Branch for Conditional
200 lw   $t1, x         # Get value of x into $t1
204 lw   $t2, y         # Get value of y into $t2
208 slt  $t0, $t1, $t2  # Set $t0 = 1 iff $t1 < $t2
212 beq  $t0, $0, 3     # Branch if $t0 == 0 ($t1 >= $t2): 216 + 3*4 = 228
216 sub  $t3, $t2, $t1  # Form y - x
220 beq  $0, $0, 1      # Go to 228: 224 + 1*4 = 228
224 sub  $t3, $t1, $t2  # Form x - y
228 sw   $t3, z         # Store result
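The address arithmetic in the comments above follows one rule: a MIPS branch target is the address of the instruction after the branch, plus the signed offset in words. A one-line sketch checks every target in these two fragments:

```python
def branch_target(pc, offset):
    # MIPS PC-relative branch: the offset is in words, relative to PC + 4.
    return pc + 4 + 4 * offset

print(branch_target(116, 4))    # 136: exits the loop
print(branch_target(132, -6))   # 112: back to the slti
print(branch_target(212, 3))    # 228: skips to the store
```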

Stall on Branch Wait until the branch outcome is determined before fetching the next instruction.

Branch Prediction Longer pipelines can't readily determine the branch outcome early, and the stall penalty becomes unacceptable. Instead, predict the outcome of the branch and stall only if the prediction is wrong. In the MIPS pipeline, we can predict branches not taken and fetch the instruction after the branch, with no delay.

MIPS with Predict Not Taken (figure: pipeline behavior when the prediction is correct versus incorrect)

More-Realistic Branch Prediction
Static branch prediction: based on typical branch behavior, e.g., loop and if-statement branches. Predict backward branches taken and forward branches not taken.
Dynamic branch prediction: hardware measures actual branch behavior, e.g., by recording the recent history of each branch, and assumes future behavior will continue the trend. When wrong, stall while re-fetching, and update the history.

Unrolling Loops Consider the following loop code:

for (k = 0; k < 3; k++)
    a[k] = b[k] + c[k];

This is logically equivalent to the following:

k = 0;
loop: a[k] = b[k] + c[k];
k = k + 1;
if (k < 3) go to loop   # Here is the branch

The Unrolled Loop In this case, unrolling the loop removes the need for branch prediction by removing the branch instruction. Here is the unrolled code:

a[0] = b[0] + c[0];
a[1] = b[1] + c[1];
a[2] = b[2] + c[2];

Note: There are no data hazards here. This is the trick used by vector computers.
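The same transformation can be sketched in Python (the array values are illustrative choices of mine). Both forms compute identical results, but the unrolled form executes no backward branch at all:

```python
b = [10, 20, 30]
c = [1, 2, 3]

# Rolled: one conditional backward branch per iteration.
a_rolled = [0, 0, 0]
for k in range(3):
    a_rolled[k] = b[k] + c[k]

# Unrolled: straight-line code, nothing for a predictor to predict.
a_unrolled = [0, 0, 0]
a_unrolled[0] = b[0] + c[0]
a_unrolled[1] = b[1] + c[1]
a_unrolled[2] = b[2] + c[2]

print(a_rolled == a_unrolled)  # True
```

The trade-off, not shown here, is code size: unrolling a long loop replicates its body many times.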

1-Bit Predictor: Shortcoming Inner-loop branches are mispredicted twice!

outer: ...
inner: ...
       beq ..., ..., inner
       beq ..., ..., outer

The predictor mispredicts as taken on the last iteration of the inner loop, then mispredicts as not taken on the first iteration of the inner loop the next time around.

2-Bit Predictor Only change prediction on two successive mispredictions
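A 2-bit predictor is a saturating counter per branch. The sketch below uses the conventional encoding (an assumption on my part: 0-1 predict not taken, 2-3 predict taken) and compares it with a 1-bit predictor on the nested-loop pattern from the previous slide, where the inner branch goes taken, taken, taken, not-taken, repeatedly:

```python
class TwoBitPredictor:
    """Saturating counter: states 0-1 predict not taken, 2-3 predict taken."""
    def __init__(self):
        self.state = 3          # start in "strongly taken" (an assumption)
    def predict(self):
        return self.state >= 2
    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

class OneBitPredictor:
    """Remembers only the last outcome."""
    def __init__(self):
        self.last = True
    def predict(self):
        return self.last
    def update(self, taken):
        self.last = taken

def mispredictions(predictor_cls, outcomes):
    p, wrong = predictor_cls(), 0
    for taken in outcomes:
        wrong += (p.predict() != taken)
        p.update(taken)
    return wrong

# Inner-loop branch: taken 3 times, then not taken, over 3 outer iterations.
pattern = [True, True, True, False] * 3
print(mispredictions(OneBitPredictor, pattern))  # 5: two per outer pass after warm-up
print(mispredictions(TwoBitPredictor, pattern))  # 3: only the final not-taken each pass
```

The single not-taken outcome nudges the 2-bit counter from 3 to 2 but leaves the prediction at taken, which is exactly the "two successive mispredictions" rule on the slide.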

Calculating the Branch Target Even with a predictor, we still need to calculate the target address, a 1-cycle penalty for a taken branch. A branch target buffer is a cache of target addresses, indexed by the PC when the instruction is fetched. If there is a hit and the instruction is a branch predicted taken, the target can be fetched immediately.
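Functionally, a branch target buffer behaves like a small map from fetch PC to last-seen target. A toy sketch, ignoring the limited size and set-indexing of a real cache:

```python
class BranchTargetBuffer:
    def __init__(self):
        self.table = {}            # fetch PC -> predicted target address

    def lookup(self, pc):
        """Return the predicted target, or None on a miss."""
        return self.table.get(pc)

    def update(self, pc, target):
        """Record the target seen when a branch at this PC was taken."""
        self.table[pc] = target

btb = BranchTargetBuffer()
btb.update(116, 136)               # the loop-exit beq from the earlier slide
print(btb.lookup(116))             # 136: fetch the target with no penalty
print(btb.lookup(120))             # None: not a known branch, fetch PC + 4
```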

The Branch Penalty Consider the following program fragment, in which the branch has been taken.

36 sub $10, $4, $8
40 beq $1, $3, 7     # Target: 40 + 4 + 7*4 = 72
44 and $12, $2, $5
48 or  $13, $2, $6
52 add $14, $4, $7
56 slt $15, $6, $7
   ... and so on ...
72 lw  $2, 50($7)

Branch Hazards (4.8 Control Hazards) If the branch outcome is determined in MEM, the instructions fetched after the branch must be flushed (set their control values to 0) and the PC redirected to the branch target.

Reducing Branch Delay Move the hardware that determines the branch outcome to the ID stage. This involves minimal additional hardware:
1. A register comparison unit, comprising XOR gates, to handle branch on equal.
2. An additional adder to determine the target address.
Now only the instruction in the IF stage must be flushed if the branch is taken.

If Branch Outcome Determined in ID

Reorder the Code This illustrates the idea of a delay slot. The instruction after the beq is always executed.

36 beq $1, $3, 8     # Target: 36 + 4 + 8*4 = 72
40 sub $10, $4, $8   # Always executed
44 and $12, $2, $5
48 or  $13, $2, $6
52 add $14, $4, $7
56 slt $15, $6, $7
   ... and so on ...
72 lw  $2, 50($7)

Scheduling the Delay Slot The placement depends on the code construct.

Pipeline Summary The BIG Picture: pipelining improves performance by increasing instruction throughput, executing multiple instructions in parallel. Each instruction has the same latency. Pipelining is subject to hazards: structure, data, and control. Instruction set design affects the complexity of the pipeline implementation.