Spring 2015 :: CSE 502 Computer Architecture. Processor Pipeline. Instructor: Nima Honarmand


Processor Datapath

The Generic Instruction Cycle
Steps in processing an instruction:
- Instruction Fetch (IF_STEP)
- Instruction Decode (ID_STEP)
- Operand Fetch (OF_STEP): might be from registers or memory
- Execute (EX_STEP): perform computation on the operands
- Result Store or Write Back (RS_STEP): write the execution results back to registers or memory
The ISA determines what needs to be done in each step for each instruction; the μarch determines how HW implements the steps.
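A minimal sketch of these five steps as an interpreter loop over a toy register machine. The instruction tuple format, register names, opcodes, and the two-instruction program are illustrative assumptions, not MIPS.

```python
# Toy interpreter: one iteration of the loop performs the five generic
# steps (IF, ID, OF, EX, RS) for a single instruction.

def run(program, regs, mem):
    pc = 0
    while pc < len(program):
        insn = program[pc]                    # IF_STEP: fetch by PC
        op, dst, src1, src2 = insn            # ID_STEP: decode fields
        a, b = regs[src1], regs[src2]         # OF_STEP: read operands
        if op == "add":                       # EX_STEP: compute
            result = a + b
        elif op == "sub":
            result = a - b
        elif op == "lw":                      # load: EX computes the address
            result = mem[a + b]
        regs[dst] = result                    # RS_STEP: write back
        pc += 1
    return regs

regs = {"r0": 0, "r1": 3, "r2": 4, "r3": 0}
mem = {7: 42}
run([("add", "r3", "r1", "r2"),              # r3 = 3 + 4 = 7
     ("lw",  "r0", "r3", "r0")], regs, mem)  # r0 = mem[7 + 0] = 42
```

The point of the sketch is the fixed step order, which the ISA defines per instruction; a real datapath implements the same steps in hardware.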

Example: MIPS Instruction Set All instructions are 32 bits

A Simple MIPS Datapath
[Diagram: PC (+1) → I-cache → Reg File → ALU → D-cache → write-back. The five stages map onto the generic steps: Inst. Fetch (IF) = IF_STEP, Inst. Decode & Register Read (ID) = ID_STEP/OF_STEP, Execute (EX) = EX_STEP, Memory (MEM) and Write-Back (WB) = RS_STEP.]

Datapath vs. Control Logic
The datapath is the collection of HW components and their connections in a processor; it determines the static structure of the processor. Control logic determines the dynamic flow of data between the components (e.g., the control lines of the MUXes and ALU in the last slide). It is a function of:
- Instruction words
- State of the processor
- Execution results at each stage

Single-Instruction Datapath
Single-cycle: ins0.(fetch,dec,ex,mem,wb) | ins1.(fetch,dec,ex,mem,wb)
Multi-cycle: ins0.fetch | ins0.(dec,ex) | ins0.(mem,wb) | ins1.fetch | ins1.(dec,ex) | ins1.(mem,wb)
Both process one instruction at a time.
- Single-cycle control: hardwired; low CPI (1) but a long clock period (to accommodate the slowest instruction)
- Multi-cycle control: typically micro-programmed; short clock period but high CPI
Can we have both low CPI and a short clock period? Not if the datapath executes only one instruction at a time; there is no good way to make a single instruction go faster.

Pipelined Datapath
Start with the multi-cycle design. When insn0 goes from stage 1 to stage 2, insn1 starts stage 1. Each instruction passes through all stages, but instructions enter and leave at a faster rate.

Style        | Ideal CPI | Cycle Time (1/freq)
Single-cycle | 1         | Long
Multi-cycle  | > 1       | Short
Pipelined    | 1         | Short

A pipeline can have as many insns in flight as there are stages.
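A quick sketch comparing total cycle counts for the three styles on n instructions through a p-stage datapath, under the idealized assumptions on this slide (no hazards, perfectly balanced stages):

```python
# Idealized cycle counts; remember the single-cycle design's one cycle
# is roughly p times longer than the short cycle of the other two.

def cycles_single_cycle(n, p):
    return n                 # one (long) cycle per instruction

def cycles_multi_cycle(n, p):
    return n * p             # p short cycles per insn, one insn at a time

def cycles_pipelined(n, p):
    return p + (n - 1)       # fill the pipe once, then 1 insn completes/cycle

n, p = 100, 5
print(cycles_multi_cycle(n, p))   # 500
print(cycles_pipelined(n, p))     # 104: CPI approaches 1 with short cycles
```

With large n, pipelined CPI approaches 1 at the multi-cycle clock period, which is exactly the "both low CPI and short clock" combination the single-instruction datapath could not deliver.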

Pipeline Illustrated
[Figure: a block of combinational logic with n gate delays (BW ≈ 1/n), the same logic cut into 2 stages of n/2 gate delays each (BW ≈ 2/n), and into 3 stages of n/3 gate delays each (BW ≈ 3/n), with a pipeline latch (L) between stages.]
Pipeline latency = n gate delays + (p-1) register delays, where p = # of stages.
Pipelining improves throughput at the expense of latency.
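The slide's formulas as a sketch: a block of combinational logic totaling n gate delays is cut into p balanced stages. The delay units and the one-unit register overhead are illustrative assumptions; the latency formula follows the slide.

```python
# Throughput rises with p while end-to-end latency grows: the tradeoff
# stated on the slide.

def cycle_time(n_gates, p_stages, reg_delay=1):
    return n_gates / p_stages + reg_delay    # longest stage + latch overhead

def throughput(n_gates, p_stages, reg_delay=1):
    return 1.0 / cycle_time(n_gates, p_stages, reg_delay)

def latency(n_gates, p_stages, reg_delay=1):
    # total gate delay plus the register delays pipelining adds
    return n_gates + (p_stages - 1) * reg_delay

for p in (1, 2, 3):
    print(p, throughput(12, p), latency(12, p))
```

For n_gates = 12, going from 1 to 3 stages roughly doubles throughput while latency grows from 12 to 14 units.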

5-Stage MIPS Pipeline

Stage 1: Fetch
Fetch an instruction from the instruction cache every cycle: use the PC to index the instruction cache and increment the PC (assume no branches for now). Write state to the pipeline register (IF/ID); the next stage will read this pipeline register.

Stage 1: Fetch Diagram
[Diagram: a MUX selects between PC+1 and a branch target to update the PC; the PC indexes the instruction cache; the fetched instruction bits and PC+1 are latched into the IF/ID pipeline register for Decode.]

Stage 2: Decode
Decode the opcode bits and set up control signals for later stages. Read the input operands from the register file, as specified by the decoded instruction bits. Write state to the pipeline register (ID/EX): register contents, the immediate operand, PC+1 (even though decode didn't use it), and control signals (from the insn) for the opcode and destreg.

Stage 2: Decode Diagram
[Diagram: instruction bits from the IF/ID register select rega, regb, and destreg in the register file; the rega/regb contents, immediate, control signals, and PC+1 are latched into the ID/EX pipeline register for Execute.]

Stage 3: Execute
Perform ALU operations to calculate the result of the instruction. Control signals select the operation; the contents of rega are one input, and either the contents of regb or a constant offset (imm from the insn) is the second input. Also calculate the PC-relative branch target: PC+1+(constant offset). Write state to the pipeline register (EX/Mem): the ALU result, the contents of regb, PC+1+offset, and control signals (from the insn) for the opcode and destreg.

Stage 3: Execute Diagram
[Diagram: the ALU takes rega contents as one input and a MUX-selected choice of regb contents or the immediate as the other; an adder computes the branch target from PC+1 and the offset; the ALU result, regb contents, PC+1+offset, and control signals are latched into the EX/Mem pipeline register for Memory.]

Stage 4: Memory
Perform the data cache access. The ALU result contains the address for a LD or ST; opcode bits control the R/W and enable signals. Write state to the pipeline register (Mem/WB): the ALU result, the loaded data, and control signals (from the insn) for the opcode and destreg.

Stage 4: Memory Diagram
[Diagram: the ALU result from the EX/Mem register drives the data cache address (in_addr), regb contents drive in_data for stores, and opcode bits drive the R/W and enable signals; the loaded data, ALU result, and control signals are latched into the Mem/WB pipeline register for Write-back.]

Stage 5: Write-back
Write the result to the register file (if required): the loaded data to destreg for a LD, or the ALU result to destreg for an ALU insn. Opcode bits control the register write enable signal.

Stage 5: Write-back Diagram
[Diagram: a MUX selects between the loaded data and the ALU result from the Mem/WB register; the selected value is written to destreg in the register file.]

Putting It All Together
[Diagram: the complete 5-stage datapath. The PC, instruction cache, register file, ALU, and data cache are connected through the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers; MUXes select PC+1 vs. the branch target, the second ALU input (valb vs. offset), and the write-back data; the dest and op fields travel down the pipeline with each instruction.]

Issues With Pipelining

Pipelining Idealism
- Uniform Sub-operations: the operation can be partitioned into uniform-latency sub-ops
- Repetition of Identical Operations: the same ops are performed on many different inputs
- Independent Operations: all ops are mutually independent

Pipeline Realism
- Uniform Sub-operations... NOT! Balance the pipeline stages: stage quantization to yield balanced stages; minimize internal fragmentation (left-over time near the end of a cycle)
- Repetition of Identical Operations... NOT! Unify instruction types: coalesce instruction types into one multi-function pipe; minimize external fragmentation (idle stages to match lengths)
- Independent Operations... NOT! Resolve data and resource hazards: inter-instruction dependency detection and resolution
Pipelining is expensive.

The Generic Instruction Pipeline
Instruction Fetch (IF) → Instruction Decode (ID) → Operand Fetch (OF) → Instruction Execute (EX) → Write-back (WB)

Balancing Pipeline Stages
Example stage latencies: T_IF = 6 units, T_ID = 2 units, T_OF = 9 units, T_EX = 5 units, T_WB = 9 units.
Without pipelining: T_cyc = T_IF + T_ID + T_OF + T_EX + T_WB = 31
Pipelined: T_cyc = max{T_IF, T_ID, T_OF, T_EX, T_WB} = 9
Speedup = 31 / 9 = 3.44. Can we do better?
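The slide's arithmetic as a sketch, with the example stage latencies (in arbitrary time units) held in a table:

```python
# Unbalanced stages mean the clock is set by the slowest stage, so the
# 5-stage pipeline falls well short of the ideal 5x speedup.

stage_times = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "WB": 9}

t_unpipelined = sum(stage_times.values())   # 31: one long cycle does it all
t_cycle = max(stage_times.values())         # 9: clock set by the slowest stage

speedup = t_unpipelined / t_cycle
print(round(speedup, 2))                    # 3.44
```

Perfectly balanced stages (each 31/5 = 6.2 units) would give the full 5x, which motivates the stage-quantization methods on the next slides.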

Balancing Pipeline Stages (1/2) Two methods for stage quantization Divide sub-ops into smaller pieces Merge multiple sub-ops into one Recent/Current trends Deeper pipelines (more and more stages) Pipelining of memory accesses Multiple different pipelines/sub-pipelines

Balancing Pipeline Stages (2/2)
Coarser-grained machine cycle (4 machine cycles / instruction): merge IF and ID into one stage (T_IF&ID = 8 units), keeping OF (9), EX (5), and WB (9); # stages = 4, T_cyc = 9 units.
Finer-grained machine cycle (11 machine cycles / instruction): split the long sub-ops into roughly 3-unit pieces (IF into 2 stages, ID into 1, OF into 3, EX into 2, WB into 3); # stages = 11, T_cyc = 3 units.

Pipeline Examples
MIPS R2000/R3000 (5 stages): IF (IF_STEP: PC gen, cache read), RD (ID_STEP/OF_STEP: decode, read REG), ALU (EX_STEP), MEM (RS_STEP: cache read/write), WB (write result).
AMDAHL 470V/7 (12 stages): PC GEN, Cache Read, Cache Read, Decode, Read REG, Addr GEN, Cache Read, Cache Read, EX 1, EX 2, Check Result, Write Result.

Instruction Dependencies (1/2)
Data Dependence:
- Read-After-Write (RAW), the only true dependence: a read must wait until the earlier write finishes
- Anti-Dependence (WAR): a write must wait until the earlier read finishes (avoid clobbering)
- Output Dependence (WAW): an earlier write can't be allowed to overwrite a later write
Control Dependence (a.k.a. Procedural Dependence):
- The branch condition must execute before the branch target
- Instructions after the branch cannot run before the branch
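A sketch classifying the three data-dependence types between an older and a newer instruction, each modeled as a (reads, writes) pair of register sets. The register names and instruction pair are illustrative assumptions.

```python
# RAW/WAR/WAW classification over register sets.

def dependences(older, newer):
    o_reads, o_writes = older
    n_reads, n_writes = newer
    deps = set()
    if o_writes & n_reads:
        deps.add("RAW")   # true dependence: newer reads what older writes
    if o_reads & n_writes:
        deps.add("WAR")   # anti-dependence: newer clobbers what older reads
    if o_writes & n_writes:
        deps.add("WAW")   # output dependence: both write the same register
    return deps

add = ({"r1", "r2"}, {"r3"})   # r3 = r1 + r2
sub = ({"r3", "r4"}, {"r1"})   # r1 = r3 - r4
print(dependences(add, sub))   # RAW (on r3) and WAR (on r1)
```

The same set-intersection test works for memory dependences if addresses are used instead of register names, though addresses are generally not known until execute.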

Instruction Dependencies (2/2)
From Quicksort:
      # for ( ; (j < high) && (array[j] < array[low]); ++j);
L1:   bge   j, high, L2
      mul   $15, j, 4
      addu  $24, array, $15
      lw    $25, 0($24)
      mul   $13, low, 4
      addu  $14, array, $13
      lw    $15, 0($14)
      bge   $25, $15, L2
      addu  j, j, 1
      ...
L2:   addu  $11, $11, -1
      ...
Real code has lots of dependencies.

Hardware Dependency Analysis
The processor must handle:
- Register data dependencies (same register): RAW, WAW, WAR
- Memory data dependencies (same address): RAW, WAW, WAR
- Control dependencies

Pipeline Terminology
- Pipeline Hazards: potential violations of program dependencies due to multiple in-flight instructions; we must ensure program dependencies are not violated
- Hazard Resolution: the static method has the compiler guarantee correctness by inserting no-ops or independent insns between dependent insns; the dynamic method has hardware check at runtime, with two basic techniques: stall (costs perf.) and forward (costs HW)
- Pipeline Interlock: the hardware mechanism for dynamic hazard resolution; must detect and enforce dependencies at runtime

Pipeline: Steady State
[Timing diagram: instructions j through j+4 enter the IF-ID-RD-ALU-MEM-WB pipeline on successive cycles t0..t5; once the pipe is full, one instruction completes every cycle.]

Data Hazards
Necessary conditions:
- WAR: the write stage is earlier than the read stage. Is this possible in IF-ID-RD-EX-MEM-WB?
- WAW: one write stage is earlier than another write stage. Is this possible in IF-ID-RD-EX-MEM-WB?
- RAW: the read stage is earlier than the write stage. Is this possible in IF-ID-RD-EX-MEM-WB?
If the conditions are not met, there is no need to resolve the hazard. Check for both registers and memory.

Pipeline: Data Hazard
[Timing diagram: the same steady-state pipeline, with a dependence between nearby instructions overlapping a later instruction's register read with an earlier instruction's write.]
Only RAW is possible in our case. How to detect it? Compare the read register specifiers of newer instructions with the write register specifiers of older instructions.
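The detection rule above can be sketched as a comparison between the source specifiers of the instruction reading registers and the destination specifiers of older instructions still in flight. The stage names and field layout are illustrative assumptions about the pipeline registers.

```python
# RAW detection: does the insn in RD read a register that an older,
# still-in-flight insn (in EX/MEM/WB here) will write?

def raw_hazard(rd_srcs, in_flight_dests):
    return any(src in in_flight_dests for src in rd_srcs if src is not None)

# Example: lw r1, 0(r2) is in EX; add r3, r1, r4 is entering RD.
in_flight = {"r1"}                            # dest fields latched downstream
print(raw_hazard(("r1", "r4"), in_flight))    # hazard: stall or forward
print(raw_hazard(("r5", "r4"), in_flight))    # no hazard: proceed
```

In hardware this is a handful of equality comparators on the pipeline-register dest fields, which is exactly what the forwarding-path diagrams later show.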

Option 1: Stall on Data Hazard
[Timing diagram: insn j+2 is stalled in RD, j+3 in ID, and j+4 in IF until the hazard clears; each then resumes, one stage apart.]
Instructions in IF and ID stay put; the IF/ID pipeline latch is not updated. A no-op is sent down the pipeline (called a bubble).

Option 2: Forwarding Paths (1/3)
[Timing diagram: results are forwarded from the MEM and ALU stages back to younger instructions; many possible paths.]
Stalling is still required in some cases even with forwarding paths (e.g., a load immediately followed by a dependent instruction).

Option 2: Forwarding Paths (2/3)
[Diagram: the src1/src2/dest fields travel down the pipeline with each instruction; forwarding paths route the ALU, MEM, and WB results back to the ALU inputs, bypassing the register file.]

Option 2: Forwarding Paths (3/3)
[Diagram: comparators (=) match the src1/src2 specifiers against the dest specifiers latched in later stages to steer the forwarding MUXes.]
Deeper pipelines in general require additional forwarding paths.
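The comparator-and-MUX logic in these diagrams can be sketched as an operand-select function: prefer the youngest in-flight producer of a source register, falling back to the register file. The stage names and the (dest, value) pipeline-latch model are illustrative assumptions.

```python
# Forwarding-mux selection: newest matching result wins.

def select_operand(src, ex_mem, mem_wb, regfile):
    """ex_mem / mem_wb are (dest_reg, value) pairs latched in the
    pipeline registers, or None if the slot holds no register write."""
    if ex_mem and ex_mem[0] == src:
        return ex_mem[1]          # youngest producer (EX/Mem bypass)
    if mem_wb and mem_wb[0] == src:
        return mem_wb[1]          # older producer (Mem/WB bypass)
    return regfile[src]           # no match: architected value

regs = {"r1": 10, "r2": 20}
print(select_operand("r1", ("r1", 99), ("r1", 50), regs))   # 99
print(select_operand("r2", ("r1", 99), None, regs))         # 20
```

Checking EX/Mem before Mem/WB is what makes back-to-back writes to the same register forward the most recent value.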

Pipeline: Control Hazard
[Timing diagram: a branch in instruction i is resolved late in the pipeline, while instructions i+1 through i+4 have already been fetched behind it.]

Option 1: Stall on Control Hazard
[Timing diagram: fetch is stalled for 3 cycles after each branch until the outcome is known.]
Stop fetching until the branch outcome is known, sending no-ops down the pipe. This is easy to implement but performs poorly: one out of 6 instructions is a branch, and each branch takes 3 cycles, so CPI = 1 + 3 x 1/6 = 1.5 (a lower bound).
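The slide's lower-bound CPI arithmetic as a one-line sketch: a base CPI of 1 plus a fixed stall on every branch, weighted by branch frequency.

```python
# CPI with control-hazard stalls: base + penalty * frequency.

def cpi_with_branch_stalls(base_cpi, stall_cycles, branch_freq):
    return base_cpi + stall_cycles * branch_freq

print(cpi_with_branch_stalls(1, 3, 1/6))   # the slide's 1.5 lower bound
```

The same formula shows why prediction helps: it replaces the fixed 3-cycle stall with (misprediction rate) x (penalty), which is far smaller when predictions are usually right.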

Option 2: Prediction for Control Hazards
[Timing diagram: instructions i+2 through i+4 are fetched down the not-taken path; when the branch resolves taken, they become no-ops (speculative state cleared) and fetch is resteered to new instructions i+2 through i+4 from the branch target.]
Predict the branch not taken and send sequential instructions down the pipeline. Memory and RF writes by speculative instructions must be stopped, and the instructions killed later if the prediction was incorrect; fetch then continues from the branch target.

Option 3: Delay Slots for Control Hazards
Another option: delayed branches. The number of delay slots (ds) is the number of stages between IF and where the branch is resolved (3 in our example). The following ds instructions are always executed; put useful instructions there, otherwise no-ops. Delay slots are losing popularity: they are just a stopgap (one cycle, one instruction), they get in the way of superscalar processors (covered later) as a special case, and they survive mainly as a legacy of old RISC ISAs.

Superscalar Pipelines Instruction-Level Parallelism Beyond Simple Pipelines

Going Beyond Scalar
A scalar pipeline is limited to CPI ≥ 1.0; it can never run more than 1 insn per cycle. A superscalar can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0); superscalar means executing multiple insns in parallel.

Architectures for Instruction Parallelism
Scalar pipeline (baseline): instruction/overlap parallelism = D (the pipeline depth), operation latency = 1, peak IPC = 1.0.
[Figure: D different instructions overlapped in a D-deep pipeline, successive instructions vs. time in cycles.]

Superscalar Machine
Superscalar (pipelined) execution: instruction parallelism = D x N, operation latency = 1, peak IPC = N per cycle.
[Figure: D x N different instructions overlapped, N per stage, in a D-deep, N-wide pipeline, successive instructions vs. time in cycles.]

Superscalar Example: Pentium
Stages: Prefetch (4 32-byte buffers), Decode1 (decodes up to 2 insts), then two asymmetric pipes, each with Decode2 (read operands, address computation), Execute, and Writeback.
Both pipes handle mov, lea, simple ALU ops, push/pop, and test/cmp; the u-pipe additionally handles shift, rotate, and some FP; the v-pipe additionally handles jmp, jcc, call, and fxch.

Pentium Hazards & Stalls
Pairing rules (when can't two insns execute together?):
- Read/flow dependence: mov eax, 8 ; mov [ebp], eax
- Output dependence: mov eax, 8 ; mov eax, [ebp]
- Partial register stalls: mov al, 1 ; mov ah, 0
Function unit rules: some instructions can never be paired, e.g., MUL, DIV, PUSHA, MOVS, and some FP.

Limitations of In-Order Pipelines
If the machine parallelism is increased, dependencies reduce performance: the CPI of in-order pipelines degrades sharply as N approaches the average distance between dependent instructions. Forwarding is no longer effective, the pipeline must stall often, and in-order pipelines are rarely full.

The In-Order N-Instruction Limit
On average, parent-child separation is about 5 insns (Franklin and Sohi '92). Ex.: with superscalar degree N = 4, any dependency between instructions in the same group causes a stall, so a dependent insn must be at least N = 4 instructions away. An average separation of 5 means there are many cases where the separation is < 4, and each of these limits parallelism. A reasonable in-order superscalar is effectively N = 2.