Lecture Topics. Pipelining in Processors. Pipelining and ISA ECE 486/586. Computer Architecture. Lecture # 10. Reference:

Similar documents
Pipelining Review and Its Limitations

WAR: Write After Read

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

Lecture: Pipelining Extensions. Topics: control hazards, multi-cycle instructions, pipelining equations

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:


PROBLEMS. which was discussed in Section

Computer Architecture TDTS10

Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1

Chapter 5 Instructor's Manual

Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

LSN 2 Computer Processors

EC 362 Problem Set #2

EE282 Computer Architecture and Organization Midterm Exam February 13, (Total Time = 120 minutes, Total Points = 100)

Pipeline Hazards. Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by Arvind and Krste Asanovic

Week 1 out-of-class notes, discussions and sample problems

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University

CS352H: Computer Systems Architecture

Design of Pipelined MIPS Processor. Sept. 24 & 26, 1997

Addressing The problem. When & Where do we encounter Data? The concept of addressing data' in computations. The implications for our machine design(s)

EE361: Digital Computer Organization Course Syllabus

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

PROBLEMS #20,R0,R1 #$3A,R2,R4

Solutions. Solution The values of the signals are as follows:

on an system with an infinite number of processors. Calculate the speedup of

The Microarchitecture of Superscalar Processors

Central Processing Unit (CPU)

Instruction scheduling

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

CPU Performance Equation

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

Review: MIPS Addressing Modes/Instruction Formats

UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering. EEC180B Lab 7: MISP Processor Design Spring 1995

(Refer Slide Time: 00:01:16 min)

Computer Organization and Components

A Lab Course on Computer Architecture

IA-64 Application Developer s Architecture Guide

Instruction Set Architecture. Datapath & Control. Instruction. LC-3 Overview: Memory and Registers. CIT 595 Spring 2010

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Performance evaluation

Instruction Set Design

An Introduction to the ARM 7 Architecture

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

Instruction Set Architecture (ISA)

Let s put together a Manual Processor

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

CPU Organization and Assembly Language

Central Processing Unit

CPU Organisation and Operation

Thread level parallelism

Course on Advanced Computer Architectures

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

Chapter 4 Lecture 5 The Microarchitecture Level Integer JAVA Virtual Machine

Application Note 195. ARM11 performance monitor unit. Document number: ARM DAI 195B Issued: 15th February, 2008 Copyright ARM Limited 2007

ARM Microprocessor and ARM-Based Microcontrollers

VLIW Processors. VLIW Processors

Processor Architectures

Using Graphics and Animation to Visualize Instruction Pipelining and its Hazards

Module: Software Instruction Scheduling Part I

Intel 8086 architecture

A SystemC Transaction Level Model for the MIPS R3000 Processor

Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to:

StrongARM** SA-110 Microprocessor Instruction Timing

Learning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design

Computer organization

TIMING DIAGRAM O 8085

Introduction to Microprocessors

CS521 CSE IITG 11/23/2012

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

Microprocessor & Assembly Language

CISC, RISC, and DSP Microprocessors

CHAPTER 7: The CPU and Memory

Computer Organization. and Instruction Execution. August 22

CPU Performance. Lecture 8 CAP

CHAPTER 4 MARIE: An Introduction to a Simple Computer

MICROPROCESSOR AND MICROCOMPUTER BASICS

UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS

EEM 486: Computer Architecture. Lecture 4. Performance

Administrative Issues

A single register, called the accumulator, stores the. operand before the operation, and stores the result. Add y # add y from memory to the acc

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

Computer Architecture Lecture 3: ISA Tradeoffs. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 1/18/2013

Data Dependences. A data dependence occurs whenever one instruction needs a value produced by another.

OC By Arsene Fansi T. POLIMI

Introducción. Diseño de sistemas digitales.1

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Instruction Set Architecture

MICROPROCESSOR. Exclusive for IACE Students iacehyd.blogspot.in Ph: /422 Page 1

Property of ISA vs. Uarch?

Systems I: Computer Organization and Architecture

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

In the Beginning The first ISA appears on the IBM System 360 In the good old days

Transcription:

Lecture Topics ECE 486/586 Computer Architecture Lecture # 10 Spring 2015 Pipelining Hazards and Stalls Reference: Effect of Stalls on Pipeline Performance Structural hazards Data Hazards Appendix C: Sections C.1, C.2 Portland State University Pipelining in Processors Exploit parallelism in sequential instruction stream Resources (e.g. ALU, memory, register file) can be used concurrently by different instructions Multiple instructions processed in parallel More instructions completed per unit time Higher throughput (performance) Pipelining and ISA What is desirable in instruction sets for pipelining Variable length instructions vs. fixed length instructions? Memory operands supported for any instruction or memory operands only in loads/stores? Register operands in different places in each instruction format vs. in the same place in different instruction formats?

MIPS Choices for Pipelining 5-Stage MIPS Pipeline 32-bit fixed instruction format Memory access only via load/store instruction Opcode field for all instruction formats in the same place Source registers for R-type and I-type instruction formats in the same place Single addressing mode for load/store instructions Simple branch conditions At any given time, each pipeline stage processes a different instruction Register file appears in two stages: read in stage-2 and written in stage-5 5-Stage MIPS Pipeline Pipeline Registers How do we ensure that each stage has the correct inputs for the instruction that is being processed by that stage? Information needed by an instruction is carried through the pipeline (via pipeline registers) as the instruction proceeds from one stage to the next

Representing Pipelined Execution Pipeline Performance Clock Cycle 1 2 3 4 5 6 7 I j I j+1 I j+2 Time Pipelining At best no impact on latency Still need to wait n stages (cycles) for completion of instruction Improves throughput No single instruction executes faster but overall throughput is higher Average instruction execution time decreases Successive instructions complete in each successive cycle (no 5 cycle wait between instructions) Reality Clock determined by slowest stage Pipeline overhead Clock skew Register delay Pipeline fill and drain Pipeline Performance Pipeline Performance: Example Ideal case Balanced pipeline (each stage has the same delay) Zero overhead due to clock skew and pipeline registers Ignore pipeline fill and drain overheads = = = (in an ideal case) Example: A program consisting of 500 instructions is executed on a 5-stage processor. How many cycles would be required to complete the program, (i) without pipelining, (ii) with pipelining? Assume idealoverlap in case of pipelining. Solution: Without pipelining:each instruction will require 5 cycles. There will be no overlap amongst successive instructions. Number of cycles = 500 * 5 = 2500 With pipelining:each pipeline stage will process a different instruction every cycle. First instruction will complete in 5 cycles, then one instruction will complete in every cycle, due to ideal overlap. Number of cycles = 5 + ((500-1)*1) = 504 Speedup for ideal pipelining = 2500/504 = 4.96 (or approx. 5)

Pipeline Performance: Example Pipeline Performance (cont.) Problem: Consider a non-pipelined processor using the 5-stage datapathwith 1 ns clock cycle. Assume that due to clock skew and pipeline registers, pipelining the processor adds 0.2 ns of overhead to the clock speed. How much speedup can we expect to gain from pipelining? Assume a balanced pipeline and ignore the pipeline fill and drain overheads. Solution: Without pipelining:clock period = 1 ns, CPI = 5 With pipelining:clock period = 1 + 0.2 = 1.2 ns, CPI = 1 Speedup from pipelining = = (1 ns * 5) / (1.2 ns * 1) = 5 / 1.2 = 4.17 The potential increase in performance resulting from pipelining is proportional to the number of pipeline stages However, this increase would be achieved only if all pipeline stages require the same time to complete, and there is no interruption throughout program execution Unfortunately, this is not true there are times when an instruction cannot proceed from one stage to the next in every clock cycle I j+2 Pipeline Performance (cont.) Clock Cycle 1 2 3 4 5 6 7 I j I j+1 8 9 Time Assume that Instruction I j+1 is stalledin the decode stage for two extra cycles This will cause I j+2 to be stalled in the fetch stage, until I j+1 proceeds New instructions cannot enter the pipeline until I j+2 proceeds past the fetch stage after cycle 5 => execution time increases by two cycles Major Hurdle of Pipelining: Hazards Hazards prevent the next instruction in the instruction stream from executing during its designated clock cycle Three types of pipeline hazards Structural hazard a situation where two (or more) instructions require the use of a given hardware resource at the same time Data hazard any condition in which either the source or the destination operands of an instruction are not available, when needed in the pipeline Instruction processing will be delayed until operands become available Control hazard a delay in the availability of an instruction or the memory address needed to fetch the instruction

Major Hurdle of Pipelining: Hazards Major Hurdle of Pipelining: Hazards Hazards may require stalling the pipeline allowing some instructions to proceed while others are delayed (and no additional instructions fetched) until the conditions that caused the hazard do not exist anymore. This is called clearing the hazard Stalling increases the CPI reduces the speedup from pipelining = 1+ Three types of pipeline hazards Structural hazard a situation where two (or more) instructions require the use of a given hardware resource at the same time Data hazard any condition in which either the source or the destination operands of an instruction are not available, when needed in the pipeline Control hazard a delay in the availability of an instruction or the memory address needed to fetch the instruction Structural Hazards Structural Hazards: An Example A situation where two (or more) instructions require the use of a given hardware resource at the same time Example: In the MIPS 5-stage pipeline, both the and stages require memory access In the same cycle A new instruction is fetched from memoryin the stage A Load instruction reads data from memory in the stage What happens if the memory has only one read port?

Stalls Instructions ahead of the stall proceed to completion Stall delays instruction and those following it Cost of Stall Problem: Consider two pipelined processors. Suppose data references represent 40% of the instructions executed and that the ideal CPI of the pipelined processor, ignoring the structural hazard is 1. Assume that the processor with the hazard has a clock rate that is 1.05 times higher than the hazard free processor and incurs a one cycle stall on structural hazards as in our example. Is the pipeline with or without the structural hazard faster, and by how much? Solution: Without structural hazard:clock period = C, CPI = 1 With structural hazard:clock period = C/1.05, CPI = 1 + (0.4 * 1) = 1.4 =.. =.. =1.33 Processor without the structural hazard is 1.33 times faster How to Mitigate Structural Hazards? Major Hurdle of Pipelining: Hazards Key Idea: Add more hardware resources How to avoid structural hazard on memory access during and stages? Separate instruction and data caches How to avoid structural hazard on register access during and stages? Dual ported register file Write early in stage Read late in stage Why no structural hazard on ALU during (PC PC + 4) and stages? PC + 4 in stage uses a dedicated adder, not the ALU Three types of pipeline hazards Structural hazard a situation where two (or more) instructions require the use of a given hardware resource at the same time Data hazard any condition in which either the source or the destination operands of an instruction are not available, when needed in the pipeline Control hazard a delay in the availability of an instruction or the memory address needed to fetch the instruction

Data Hazards Consider the following two instructions executed in a program sequence: Add R2, R3, #100 Subtract R9, R2, #30 The destination register (R2) for the firstinstruction is a source register for the second instruction Register R2 carries data from the firstinstruction to the secondinstruction => There is a data dependency between these two instructions First instruction writes to register R2 in the stage Second instruction reads register R2 in the stage If secondinstruction reads R2 beforethe firstinstruction writes R2, the result of secondinstruction would be incorrect, as it would be based on R2 s old value (read-after-write dependence) To obtain the correct result, second instruction needs to wait until the first instruction has written to R2 Data Hazards Clock Cycle 1 2 3 4 5 6 7 Add R2, R3, #100 Subtract R9, R2, #30 stall stall 8 Time Subtractis stalled in decode stage for two additional cycles to delay reading R2 until the new value of R2 has been written by Add How is this pipeline stall implemented in hardware? Control circuitry recognizes the data dependency when it decodes Subtract (comparing source register ids with dest. register ids of prior instructions) During cycles 3 to 5: Add proceeds through the pipe Subtractis held in the stage In cycle 5 Addwrites R2 in the first half cycle, Subtractreads R2 in the second half cycle Types of Data Hazards Three general types of data hazards Read After Write (RAW) True dependence Caused by dependence. Subsequent instruction has actual need for data produced by earlier instruction Write after Read (WAR) False dependence Called anti-dependence. Can t occur in in-order pipelines (e.g., our simple MIPS pipeline). We ll see it later in advanced pipelines Write after Write (WAW) False dependence Called output dependence. Can t occur in in-order pipelines (e.g., our simple MIPS pipeline). We ll see it later in advanced pipelines Read-after-Write (RAW) Hazard Instr j tries to read operand before Instr i write it (where j > i) Instr i : add r1, r2, r3 Instr j : sub r4, r1, r3 This hazard results from a true dependence: an actual need for communication from the first instruction to the second

Write-after-Read (WAR) Hazard Write-after-Write (WAW) Hazard Instr j tries to write operand before Instr i reads it (where j > i) Instr i : sub r4, r1, r3 Instr j : add r1, r2, r3 This is an anti-dependence (not a true dependence) Arises from the reuse of register r1 If Instr j writes to r1 before Instr 1 reads r1, then Instr i will use the value written by Instr j => Incorrect behavior WAR hazards cannot happen in MIPS 5-stage pipeline because: Instructions are executed in order Register reads happen earlier (stage-2) than register writes (stage-5) Instr j writes to operand before Instr i writes to it (where j > i) Instr i : sub r1, r4, r3 Instr j : add r1, r2, r3 Instr k : mul r6, r1, r7 This is an output dependence (not a true dependence) Arises from the reuse of register r1 If Instr j writes to r1 before Instr i then Instr k will use the value written by Instr i => violation of program order WAW hazards cannot happen in MIPS 5-stage pipeline because: Instructions are executed in order Register writes are always in stage-5