Data Dependences. A data dependence occurs whenever one instruction needs a value produced by another.

Similar documents
CS352H: Computer Systems Architecture

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1

Pipeline Hazards. Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by Arvind and Krste Asanovic

PROBLEMS #20,R0,R1 #$3A,R2,R4

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

Lecture: Pipelining Extensions. Topics: control hazards, multi-cycle instructions, pipelining equations

VLIW Processors. VLIW Processors

WAR: Write After Read


Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats

Solutions. Solution The values of the signals are as follows:

Design of Pipelined MIPS Processor. Sept. 24 & 26, 1997

UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering. EEC180B Lab 7: MISP Processor Design Spring 1995

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

EE282 Computer Architecture and Organization Midterm Exam February 13, (Total Time = 120 minutes, Total Points = 100)

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

Course on Advanced Computer Architectures

Instruction Set Design

Pipelining Review and Its Limitations

Chapter 01: Introduction. Lesson 02 Evolution of Computers Part 2 First generation Computers

CPU Performance Equation

Giving credit where credit is due

Computer Organization and Components

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University

Central Processing Unit (CPU)

LSN 2 Computer Processors

Week 1 out-of-class notes, discussions and sample problems

Using Graphics and Animation to Visualize Instruction Pipelining and its Hazards

CS521 CSE IITG 11/23/2012

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

OC By Arsene Fansi T. POLIMI

Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately.

Introduction to the Latest Tensilica Baseband Solutions

Introduction to Cloud Computing

A Lab Course on Computer Architecture

Chapter 4 Lecture 5 The Microarchitecture Level Integer JAVA Virtual Machine

MACHINE ARCHITECTURE & LANGUAGE

Computer System: User s View. Computer System Components: High Level View. Input. Output. Computer. Computer System: Motherboard Level

Module: Software Instruction Scheduling Part I

The 104 Duke_ACC Machine

EE361: Digital Computer Organization Course Syllabus

EECS 427 RISC PROCESSOR

ARM Microprocessor and ARM-Based Microcontrollers

picojava TM : A Hardware Implementation of the Java Virtual Machine

CPU Organization and Assembly Language

l C-Programming l A real computer language l Data Representation l Everything goes down to bits and bytes l Machine representation Language

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

A s we saw in Chapter 4, a CPU contains three main sections: the register section,

Computer organization

Slide Set 8. for ENCM 369 Winter 2015 Lecture Section 01. Steve Norman, PhD, PEng

Problems and Measures Regarding Waste 1 Management and 3R Era of public health improvement Situation subsequent to the Meiji Restoration

Wiggins/Redstone: An On-line Program Specializer

Operating System Overview. Otto J. Anshus

A Unified View of Virtual Machines

AMD GPU Architecture. OpenCL Tutorial, PPAM Dominik Behr September 13th, 2009

Systems I: Computer Organization and Architecture

IA-64 Application Developer s Architecture Guide

Computer Architecture TDTS10

Lecture 2: Computer Hardware and Ports.

High Performance Processor Architecture. André Seznec IRISA/INRIA ALF project-team

Acommonvericationproblemforhardwaredesignsistodetermineifevery

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

Allocation vs. scheduling

Router Architectures

Instruction Scheduling. Software Pipelining - 2

NetFlow probe on NetFPGA

Performance evaluation

Instruction Set Architecture. Datapath & Control. Instruction. LC-3 Overview: Memory and Registers. CIT 595 Spring 2010

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

Navigating Big Data with High-Throughput, Energy-Efficient Data Partitioning

Fastboot Techniques for x86 Architectures. Marcus Bortel Field Application Engineer QNX Software Systems

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.

Chapter 1 Computer System Overview

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

CPU Organisation and Operation

(Refer Slide Time: 00:01:16 min)

Instruction scheduling

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings

SPARC64 X: Fujitsu s New Generation 16 Core Processor for the next generation UNIX servers

An Overview of Stack Architecture and the PSC 1000 Microprocessor

Introducción. Diseño de sistemas digitales.1

2

CSE 141L Computer Architecture Lab Fall Lecture 2

Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes

Embedded computing applications demand both

PROBLEMS. which was discussed in Section

Computer Organization and Architecture. Characteristics of Memory Systems. Chapter 4 Cache Memory. Location CPU Registers and control unit memory

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

StrongARM** SA-110 Microprocessor Instruction Timing

Review: MIPS Addressing Modes/Instruction Formats

Learning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

CS 61C: Great Ideas in Computer Architecture Finite State Machines. Machine Interpreta4on

Transcription:

Data Hazards 1

Hazards: Key Points Hazards cause imperfect pipelining They prevent us from achieving CPI = 1 They are generally causes by counter flow data pennces in the pipeline Three kinds Structural -- contention for hardware resources Data -- a data value is not available when/where it is need. Control -- the next instruction to execute is not known. Two ways to al with hazards Removal -- add hardware and/or complexity to work around the hazard so it does not exist Bypassing/forwarding Speculation Stall -- Sacrifice performance to prevent the hazard from occurring Stalling causes bubbles 2

Data Depennces A data pennce occurs whenever one instruction needs a value produced by another. Register values (for now) Also memory accesses (more on this later) add $s0, $t0, $t1 add $t3, $s0, $t4 and $t3, $t2, $t4 sw $t1, 0($t2) ld $t3, 0($t2) ld $t4, 16($s4) 3

Data Depennces A data pennce occurs whenever one instruction needs a value produced by another. Register values (for now) Also memory accesses (more on this later) add $s0, $t0, $t1 add $t3, $s0, $t4 and $t3, $t2, $t4 sw $t1, 0($t2) ld $t3, 0($t2) ld $t4, 16($s4) 3

Data Depennces A data pennce occurs whenever one instruction needs a value produced by another. Register values (for now) Also memory accesses (more on this later) add $s0, $t0, $t1 add $t3, $s0, $t4 and $t3, $t2, $t4 sw $t1, 0($t2) ld $t3, 0($t2) ld $t4, 16($s4) 3

Depennces in the pipeline In our simple pipeline, these instructions cause a hazard Cycles add $s0, $t0, $t1 Fetch Deco Mem Write Fetch Deco Mem Write 4

How can we fix it? Ias? 5

Solution 1: Make the compiler al with it. Expose hazards to the big A architecture A result is available N instructions after the instruction that generates it. In the meantime, the register file has the old value. lay slots What is N? Can it change? What can the compiler do? Fetch Deco Mem Write 6

Compiling for lay slots The compiler must fill the lay slots with other instructions What if it can t? add $s0, $t0, $t1 Rearrange instructions add $s0, $t0, $t1 and $t7, $t5, $t4 add $t3, $s0, $t4 and $t7, $t5, $t4 add $t3, $s0, $t4 7

Compiling for lay slots The compiler must fill the lay slots with other instructions What if it can t? No-ops add $s0, $t0, $t1 Rearrange instructions add $s0, $t0, $t1 and $t7, $t5, $t4 add $t3, $s0, $t4 and $t7, $t5, $t4 add $t3, $s0, $t4 7

Solution 2: Stall When you need a value that is not ready, stall Suspend the execution of the executing instruction and those that follow. This introduces a pipeline bubble. A bubble is a lack of work to do. It moves through the pipeline like an instruction. Cycles add $s0, $t0, $t1 Fetch Deco Mem Write Fetch Stall Deco Mem Write 8

Stalling the pipeline Freeze all pipeline stages before the stage where the hazard occurred. Disable the PC update Disable the pipeline registers This essentially equivalent to always inserting a nop when a hazard exists Insert nop control bits at stalled stage (co in our example) How is this solution still potentially better than relying on the compiler? 9

Stalling the pipeline Freeze all pipeline stages before the stage where the hazard occurred. Disable the PC update Disable the pipeline registers This essentially equivalent to always inserting a nop when a hazard exists Insert nop control bits at stalled stage (co in our example) How is this solution still potentially better than relying on the compiler? The compiler can still act like there are lay slots to avoid stalls. Implementation tails are not exposed in the ISA 9

The Impact of Stalling On Performance ET = I * CPI * CT I and CT are constant What is the impact of stalling on CPI? What do we need to know to figure it out? 10

The Impact of Stalling On Performance ET = I * CPI * CT I and CT are constant What is the impact of stalling on CPI? Fraction of instructions that stall: 30% Baseline CPI = 1 Stall CPI = 1 + 2 = 3 New CPI = 11

The Impact of Stalling On Performance ET = I * CPI * CT I and CT are constant What is the impact of stalling on CPI? Fraction of instructions that stall: 30% Baseline CPI = 1 Stall CPI = 1 + 2 = 3 New CPI = 0.3*3 + 0.7*1 = 1.6 11

Solution 3: Bypassing/Forwarding Data values are computed in Ex and Mem but publicized in write The data exists! We should use it. inputs are need results known Results "published" to registers Fetch Deco Mem Write 12

Bypassing or Forwarding Take the values, where ever they are Cycles add $s0, $t0, $t1 Fetch Deco Mem Write Fetch Deco Mem Write 13

Forwarding Paths Cycles add $s0, $t0, $t1 Fetch Deco Mem Write Fetch Deco Mem Write Fetch Deco Mem Write Fetch Deco Mem Write 14

Forwarding in Hardware Add PC 4 Read Address Add Instruc(on Memory IFetch/Dec Read Addr 1 Register Read Addr 2 File Write Addr Write Data Read Data 1 Read Data 2 Dec/Exec Shi< le< 2 Add ALU Exec/Mem Address Write Data Data Memory Read Data Mem/WB Sign 16 Extend 32

Forwarding Control The forwarding unit tects instances when the stination and source registers of executing instructions match Set the control lines on the ALU input muxes accordingly Stall if, for some reason, forwarding is not possible. 16

Forwarding for Loads Load values come from the Mem stage Cycles ld $s0, (0)$t0 Fetch Deco Mem Write Fetch Deco Mem 17

Forwarding for Loads Load values come from the Mem stage Cycles ld $s0, (0)$t0 Fetch Deco Mem Write Fetch Deco Mem Time travel presents significant implementation challenges 17

What can we do? Punt to the compiler Easy enough. Will work. Same dangers apply as before. Always stall. Forward when possible, stall otherwise Here the compiler still has leverage Co will be faster if the compiler generates co as if there is a lay slot. If the compiler can t fix it, the hardware will stall 18

Performance cost of stalling ET = I * CPI * CT CPI = 1 + %Stall * StallTime % Stall is termined by how aggressive our bypassing is and the quality of our compiler. Stall time is related to pipeline pth. In our case, it is 1 or 2, because our pipeline is shallow. In eper pipelines, it can larger. 19

Hardware Cost of Forwarding In our pipeline, adding forwarding required relatively little hardware. For eper pipelines it gets much more expensive Roughly: ALU * pipeline stages you need to forward over Some morn processor have multiple ALUs (4-5) And eper pipelines (4-5 stages of to forward across) Not all forwarding paths need to be supported. If a path does not exist, the processor will need to stall. 20