Pipelining: It's Natural!

Pipelining: It's Natural!
Laundry example: Ann, Brian, Cathy, and Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold.
  Washer takes 30 minutes
  Dryer takes 40 minutes
  Folder takes 20 minutes

Sequential Laundry
[Figure: timeline from 6 PM to midnight, task order A, B, C, D; each load takes 30 + 40 + 20 minutes, one load after another]
Sequential laundry takes 6 hours for 4 loads.
If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight, task order A, B, C, D; stage times 30, 40, 40, 40, 40, 20 minutes]
Pipelined laundry takes 3.5 hours for 4 loads.
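The two laundry totals can be checked with a few lines of Python. This is a minimal sketch, assuming identical loads and a simple model in which each later load finishes one slowest-stage time after the previous one; the function names are illustrative and not from the slides.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """The first load passes through every stage; each later load
    completes one bottleneck-stage time after the one before it."""
    return sum(stage_minutes) + (loads - 1) * max(stage_minutes)


stages = [30, 40, 20]  # washer, dryer, folder (minutes)
print(sequential_minutes(stages, 4) / 60)  # 6.0 hours
print(pipelined_minutes(stages, 4) / 60)   # 3.5 hours
```

The 40-minute dryer sets the pipeline rate, which is why the speedup (6 / 3.5, about 1.7) stays well below the number of stages.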

Pipelining Lessons
[Figure: pipelined laundry timeline, 6 PM to 9 PM, task order A, B, C, D; stage times 30, 40, 40, 40, 40, 20 minutes]
  Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
  Pipeline rate is limited by the slowest pipeline stage
  Multiple tasks operate simultaneously
  Potential speedup = number of pipe stages
  Unbalanced lengths of pipe stages reduce speedup
  Time to fill the pipeline and time to drain it reduce speedup

A simple RISC pipeline
5 Stage Pipeline: IF, ID, EX, MEM, WB

RISC subset instruction steps:
Stage  R-type            Store/Load        Branch/Jump
IF     Instruction fetch (all instruction types)
ID     Instruction decode and register fetch (all instruction types)
EX     Execution         Compute addr      Instr completion
MEM    Instr completion  Mem access
WB                       Load completion

Illustrating the datapath visually
Register write in 1st half of cycle, register read in 2nd half of cycle
Can help with answering questions like:
  # cycles to execute this code?
  Resource conflicts?

Keeping instructions from interfering with each other
Additional registers hold the stage values: IF/RF, RF/EX, EX/MEM, MEM/WB

Amdahl's Law Strikes Again
Unpipelined multi-cycle implementation:
  1 ns clock cycle
  Branches and ALU ops take 4 cycles
  Memory ops take 5 cycles
  Instruction mix: 20% branch, 40% ALU, 40% mem
  Avg instr exec time = 1 ns × (0.6 × 4 + 0.4 × 5) = 4.4 ns
5-stage pipeline:
  0.2 ns overhead for clock skew and register delay
  Each stage takes a 1 ns clock cycle
  Avg instr exec time = 1 ns + 0.2 ns = 1.2 ns
Speedup: 4.4 / 1.2 ≈ 3.7
Overhead limits speedup!
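The arithmetic on this slide can be reproduced directly. The sketch below uses only the numbers stated above; the variable names are illustrative.

```python
clock_ns = 1.0
overhead_ns = 0.2  # clock skew and register delay per pipelined cycle

# instruction class -> (fraction of instruction mix, cycles when unpipelined)
mix = {"branch": (0.20, 4), "alu": (0.40, 4), "mem": (0.40, 5)}

# Unpipelined: weighted average cycle count times the clock period.
unpipelined_ns = clock_ns * sum(frac * cycles for frac, cycles in mix.values())

# Pipelined: one instruction completes per (stretched) clock cycle.
pipelined_ns = clock_ns + overhead_ns

speedup = unpipelined_ns / pipelined_ns
print(round(unpipelined_ns, 2), round(pipelined_ns, 2), round(speedup, 2))
# 4.4 1.2 3.67
```

Without the 0.2 ns overhead the speedup would be 4.4; the overhead alone costs roughly 17% of that.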

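Speedup in Pipelining

$$\text{Speedup} = \frac{\text{Avg instr time}_{\text{unpipelined}}}{\text{Avg instr time}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{clock}_{\text{unpipelined}}}{\text{clock}_{\text{pipelined}}}$$

$$\text{CPI}_{\text{ideal}} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{Pipeline depth}} = 1$$

$$\text{CPI}_{\text{pipelined}} = \text{CPI}_{\text{ideal}} + \text{Pipeline stalls per instr} = 1 + \text{Pipeline stalls per instr}$$

With equal clock cycle times for the two implementations:

$$\text{Speedup} = \frac{\text{CPI}_{\text{unpipelined}}}{1 + \text{Pipeline stalls per instr}} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stalls per instr}}$$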
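As a quick illustration of the final relation on this slide (speedup = pipeline depth / (1 + pipeline stalls per instruction)), here is a small sketch. It assumes equal clock cycle times and an ideal CPI of 1; the function name and the sample stall rates are illustrative, not taken from the slides.

```python
def pipeline_speedup(depth, stalls_per_instr):
    """Speedup over the unpipelined machine, assuming equal clock cycle
    times and CPI_ideal = 1."""
    return depth / (1.0 + stalls_per_instr)


print(pipeline_speedup(5, 0.0))   # 5.0: no stalls, speedup equals depth
print(pipeline_speedup(5, 0.25))  # 4.0: one stall cycle per four instructions
```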

Project 3: Simplescalar Assigned: 30 Oct Due: 20 Nov, late submission 25 Nov Sorry, no further extensions Handouts will be available soon!

Pipelining Difficulties
What makes it easy?
  all instructions are the same length
  just a few instruction formats
  memory operands appear only in loads and stores
What makes it hard?
  structural hazards: suppose we had only one memory
  control hazards: need to worry about branch instructions
  data hazards: an instruction depends on a previous instruction
Other issues: control, exceptions, multi-cycle operations

Hazards in Pipelining
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  Structural hazards: HW cannot support this combination of instructions
  Data hazards: instruction depends on the result of a prior instruction still in the pipeline
  Control hazards: pipelining of branches and other instructions that change the PC
Common solution is to stall the pipeline until the hazard is resolved, inserting one or more bubbles in the pipeline

Structural Hazards
Problem with starting instruction 3 before the load is finished
A processor with only a single memory port generates conflicts

Structural Hazards
Insert stall cycles to avoid the problem
Problem solved at a cost in performance
What is the CPI?
What role do caches play?

Data Hazards
Problem with starting the next instruction before the first is finished
Dependencies that go backward in time are data hazards

Software Solution
Have the compiler guarantee no hazards
Where do we insert the nops?
  sub $2, $1, $3
  and $12, $2, $5
  or $13, $6, $2
  add $14, $2, $2
  sw $15, 100($2)
Problem: this really slows us down!
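One way to answer "where do we insert the nops?" is sketched below. It assumes a 5-stage pipeline with no ALU forwarding, in which a value written in WB can be read by an instruction in ID during the same cycle, so a consumer must sit at least 3 slots after its producer. The `MIN_DISTANCE` constant, the tuple encoding of instructions, and the `insert_nops` helper are all illustrative assumptions, not part of the slides.

```python
MIN_DISTANCE = 3  # assumed producer-to-consumer distance without forwarding


def insert_nops(program):
    """program: list of (text, dest_register_or_None, source_registers)."""
    scheduled = []
    last_write = {}  # register -> slot index of the instruction writing it
    for text, dest, sources in program:
        # Earliest slot at which every source register is readable.
        earliest = len(scheduled)
        for reg in sources:
            if reg in last_write:
                earliest = max(earliest, last_write[reg] + MIN_DISTANCE)
        scheduled.extend(["nop"] * (earliest - len(scheduled)))
        if dest is not None:
            last_write[dest] = len(scheduled)
        scheduled.append(text)
    return scheduled


program = [
    ("sub $2, $1, $3", "$2", ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2", "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2"]),
    ("sw $15, 100($2)", None, ["$15", "$2"]),
]
for line in insert_nops(program):
    print(line)
```

Under these assumptions the two nops land between `sub` and `and`; the later readers of `$2` are then far enough from `sub` that no further padding is needed.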

Forwarding
Use temporary results; don't wait for them to be written
  register file forwarding to handle read/write to the same register
  ALU forwarding
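The ALU forwarding decision can be sketched as a small selection function. This follows the usual textbook conditions for a 5-stage MIPS pipeline (forward from EX/MEM first, then from MEM/WB, never for register $0); the `PipelineReg` type and `forward_select` name are hypothetical, introduced only for illustration.

```python
from typing import NamedTuple


class PipelineReg(NamedTuple):
    reg_write: bool  # the instruction in this stage writes a register
    rd: int          # destination register number


def forward_select(src, ex_mem, mem_wb):
    """Pick where an ALU operand that reads register `src` comes from."""
    # Most recent result wins: the instruction one ahead (EX/MEM).
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src:
        return "EX/MEM"
    # Otherwise the instruction two ahead (MEM/WB).
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == src:
        return "MEM/WB"
    return "REGFILE"


# sub $2, $1, $3 followed immediately by and $12, $2, $5:
# when `and` is in EX, `sub` sits in the EX/MEM pipeline register.
print(forward_select(2, PipelineReg(True, 2), PipelineReg(False, 0)))  # EX/MEM
```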


Can't always forward
Load word can still cause a hazard: an instruction tries to read a register immediately following a load instruction that writes to the same register.
Thus, we need a hazard detection unit to stall the instruction that follows the load.
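The load-use check performed by such a hazard detection unit can be sketched as follows. It uses the usual textbook condition: stall when the instruction in EX is a load and its destination register (rt) matches a source register of the instruction in ID. The function and parameter names are hypothetical.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID needs the result of a load in EX."""
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)


# lw $2, 20($1) followed by and $4, $2, $5: one bubble is required.
print(must_stall(True, 2, 2, 5))   # True
# lw $2, 20($1) followed by or $4, $6, $7: no dependence, no stall.
print(must_stall(True, 2, 6, 7))   # False
```

After the one-cycle bubble, the loaded value can be forwarded from MEM/WB like any other result.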


Branch Hazards
When we decide to branch, other instructions are in the pipeline!
We are predicting branch not taken
[Figure: pipeline diagram, program execution order (in instructions) against time (in clock cycles, CC 1 to CC 9), for the instructions below]
  40 beq $1, $3, 7
  44 and $12, $2, $5
  48 or $13, $6, $2
  52 add $14, $2, $2
  72 lw $4, 50($7)

Four Compile-time Branch Hazard Alternatives
1) Stall until branch direction is clear
  branch cost fixed, not reducible by software (compiler)
  Pipeline freeze or flush
2) Predict Branch Not Taken
  Slightly more complex ("backing out"), higher performance
  Execute successor instructions in sequence
  Squash instructions in pipeline if branch actually taken
  PC+4 already calculated, so use it to get next instruction

Four Compile-time Branch Hazard Alternatives (continued)
3) Predict Branch Taken
  Usefulness depends on system characteristics
  In 5-stage MIPS subset, not useful since target address is not available until the same stage as the comparison (1 cycle stall needed)
  Other machines: branch target known before outcome
4) Delayed Branch
  Define branch to take place AFTER a following instruction:
    branch instruction
    sequential successor1
    branch target if taken

Delayed Branch
Where to get instructions to fill the branch delay slot?
  From before the branch instruction
  Following compile-time branch prediction:
    From the target address: only valuable when branch taken
    From fall through: only valuable when branch not taken
Compiler effectiveness for a single branch delay slot depends on:
  1) Restrictions on the instructions that can be scheduled in delay slots
  2) Compile-time ability to predict branch outcome
Support for canceling branch instructions allows compilers to fill more delay slots: hardware simply squashes mispredicted instructions

Compiler Static Prediction of Taken/Untaken Branches
Improves strategy for placing instructions in the delay slot
Two strategies:
  Backward branch predict taken, forward branch not taken
  Profile-based prediction: record branch behavior, predict branch based on prior run
[Figure: chart comparing "Always taken" with "Taken backwards, Not Taken Forwards"]

Evaluating Static Branch Prediction Strategies
Misprediction rate ignores the frequency of branches
Instructions between mispredicted branches is a better metric

Example

$$\text{Speedup} = \frac{\text{pipeline depth}}{1 + \text{pipeline stalls}} = \frac{\text{pipeline depth}}{1 + \text{branch freq} \times \text{branch penalty}}$$

3 stages before branch target known
  2 cycle penalty on all uncond since target unknown
4 stages before condition evaluated
  Penalty depends on prediction scheme
Branch penalties for 3 schemes:
Assume 4% uncond, 6% cond untaken, 10% cond taken
Quantify effect of branches on CPI
Solution:
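One way to work the example is sketched below. The branch frequencies come from the slide; the penalty table does not appear in this transcript, so the per-scheme penalties here are assumptions derived from the stated stage counts (2 cycles whenever only the target is needed, 3 cycles whenever the condition is needed, 0 when a not-taken prediction is correct). The scheme names are illustrative.

```python
# Branch frequencies from the slide.
freq = {"uncond": 0.04, "cond_untaken": 0.06, "cond_taken": 0.10}

# ASSUMED penalties (cycles), derived from "3 stages before target known"
# and "4 stages before condition evaluated"; not given in the transcript.
penalty = {
    "stall":             {"uncond": 2, "cond_untaken": 3, "cond_taken": 3},
    "predict_taken":     {"uncond": 2, "cond_untaken": 3, "cond_taken": 2},
    "predict_not_taken": {"uncond": 2, "cond_untaken": 0, "cond_taken": 3},
}


def branch_cpi(scheme):
    """Ideal CPI of 1 plus the average branch stall cycles per instruction."""
    return 1.0 + sum(freq[kind] * penalty[scheme][kind] for kind in freq)


for scheme in penalty:
    print(scheme, round(branch_cpi(scheme), 2))
# stall 1.56
# predict_taken 1.46
# predict_not_taken 1.38
```

Under these assumptions, predicting not taken gives the lowest CPI even though taken conditional branches are more frequent, because a correct not-taken prediction costs nothing while a correct taken prediction still waits for the target address.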

Revised Unpipelined Version of MIPS

Stage  R-type            Store/Load        Branch/Jump
IF     Instruction fetch (all instruction types)
ID     Instruction decode and register fetch (all instruction types)
EX     Execution         Compute addr      Instr completion
MEM    Instr completion  Mem access
WB                       Load completion

Register transfers by stage:
IF (all instructions):
  IR ← Mem[PC], NPC ← PC+4
ID (all instructions):
  A ← Regs[rs], B ← Regs[rt], Imm ← sign-ext immed of IR
EX:
  R-R or R-I: ALUOutput ← A op Imm OR ALUOutput ← A func B
  Store/Load: ALUOutput ← A+Imm
  Branch (BEQZ): ALUOutput ← NPC+(Imm<<2)
MEM:
  R-R or R-I: PC ← NPC
  Store/Load: PC ← NPC; LMD ← Mem[ALUOutput] OR Mem[ALUOutput] ← B
  Branch (BEQZ): PC ← NPC OR PC ← ALUOutput
WB:
  R-R or R-I: Regs[rd] ← ALUOutput OR Regs[rt] ← ALUOutput
  Load: Regs[rt] ← LMD

Revised MIPS Datapath

Pipelined Datapath

Events by Pipeline Stage
IF (all instructions):
  IR ← Mem[PC], NPC ← PC+4
ID (all instructions):
  A ← Regs[rs], B ← Regs[rt], Imm ← sign-ext immed of IR
EX:
  R-R or R-I: ALUOutput ← A op Imm OR ALUOutput ← A func B
  Store/Load: ALUOutput ← A+Imm
  Branch (BEQZ): ALUOutput ← NPC+(Imm<<2)
MEM:
  R-R or R-I: PC ← NPC
  Store/Load: PC ← NPC; LMD ← Mem[ALUOutput] OR Mem[ALUOutput] ← B
  Branch (BEQZ): PC ← NPC OR PC ← ALUOutput
WB:
  R-R or R-I: Regs[rd] ← ALUOutput OR Regs[rt] ← ALUOutput
  Load: Regs[rt] ← LMD