Pipelining Review and Its Limitations


Pipelining Review and Its Limitations Yuri Baida yuri.baida@gmail.com yuriy.v.baida@intel.com October 16, 2010 Moscow Institute of Physics and Technology

Agenda Review Instruction set architecture Basic tools for computer architects Amdahl's law Pipelining Ideal pipeline Cost-performance trade-off Dependencies Hazards, interlocks & stalls Forwarding Limits of simple pipeline

Review

Instruction Set Architecture The ISA is the hardware/software interface. It defines the set of programmer-visible state, the instruction format (bit encoding), and the instruction semantics. Examples: MIPS, x86, IBM 360, JVM. Many implementations of one ISA are possible. IBM 360 implementations: Model 30 (c. 1964), z10 (c. 2008). x86 implementations: 8086, 286, 386, 486, Pentium, Pentium Pro, Pentium 4, Core 2, Core i7, AMD Athlon, Transmeta Crusoe. MIPS implementations: R2000, R4000, R10000, R18K, ... JVM implementations: HotSpot, PicoJava, ARM Jazelle, ...

Basic Tools for Computer Architects Amdahl's law, pipelining, out-of-order execution, critical path, speculation, locality. Important metrics: performance, power/energy, cost, complexity.

Amdahl's Law

Amdahl's Law: Definition Speedup = Time without enhancement / Time with enhancement. Suppose an enhancement speeds up a fraction f of a task by a factor of S, while the remaining fraction 1 - f is unaffected. Then Speedup = 1 / ((1 - f) + f / S). We should concentrate efforts on improving frequently occurring events or frequently used mechanisms.

Amdahl's Law: Example A new processor is 10 times faster, but input/output is a bottleneck: 60% of the time we wait for I/O. Let's calculate: only the remaining fraction f = 0.4 is sped up, so Speedup = 1 / (0.6 + 0.4 / 10) ≈ 1.6. It's human nature to be attracted by "10x faster"; keeping it in perspective, it's just 1.6x faster.
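A minimal C sketch of this calculation (the function name amdahl_speedup is illustrative, not from the slides):

    #include <stdio.h>

    /* Amdahl's law: a fraction f of the task is sped up by a factor of S,
       the remaining (1 - f) is unaffected. */
    static double amdahl_speedup(double f, double S) {
        return 1.0 / ((1.0 - f) + f / S);
    }

    int main(void) {
        /* Slide's example: I/O (60% of the time) is untouched, so only
           f = 0.4 of the time benefits from the 10x faster processor. */
        printf("Overall speedup: %.2f\n", amdahl_speedup(0.4, 10.0)); /* ~1.56 */
        return 0;
    }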

Pipelining

A Generic 4-stage Processor Pipeline [Block diagram: Fetch and Decode units fed from the L2 Cache; execution resources (Integer Unit, Floating Point Unit, Multimedia Unit); Write unit.] The Fetch Unit gets the next instruction from the cache. The Decode Unit determines the type of instruction. The instruction and data are sent to an Execution Unit. The Write Unit stores the result. Stage sequence: Instruction Fetch, Decode, Execute, Write; or, in the more detailed pipeline used below: Instruction Fetch, Decode, Read, Execute, Memory, Write.

Pipeline: Steady State
Cycle    1        2        3        4        5        6        7        8        9
Instr 1  Fetch    Decode   Read     Execute  Memory   Write
Instr 2           Fetch    Decode   Read     Execute  Memory   Write
Instr 3                    Fetch    Decode   Read     Execute  Memory   Write
Instr 4                             Fetch    Decode   Read     Execute  Memory   Write
Instr 5                                      Fetch    Decode   Read     Execute  Memory
Instr 6                                               Fetch    Decode   Read     Execute
Latency: elapsed time from start to completion of a particular task. Throughput: how many tasks can be completed per unit of time. Pipelining only improves throughput; each instruction still takes the full pipeline depth to complete. Real-life analogy: Henry Ford's automobile assembly line.
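A small C sketch of the throughput argument, assuming the ideal case with no stalls (the variable names are illustrative):

    #include <stdio.h>

    int main(void) {
        int k = 6;      /* pipeline depth (Fetch..Write in the diagram above) */
        int n = 1000;   /* number of instructions */

        /* The first instruction fills the pipe; after that one completes per cycle. */
        int total_cycles = k + (n - 1);
        printf("latency per instruction: %d cycles\n", k);
        printf("total: %d cycles, throughput: %.3f instr/cycle\n",
               total_cycles, (double)n / total_cycles);  /* ~0.995, approaching 1 */
        return 0;
    }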

Pipelining Illustrated [Figure: a block of combinational logic with n gate delays between latches gives throughput TP ~ 1/n; splitting it into two stages of n/2 gate delays gives TP ~ 2/n; three stages of n/3 gate delays give TP ~ 3/n.]

Pipelining Performance Model Starting from an unpipelined version with propagation delay T and throughput TP = 1/T, a k-stage pipelined version has a per-stage delay of T/k + S, where S is the delay through each latch plus clocking overhead, so TP = 1 / (T/k + S).

Hardware Cost Model Starting from an unpipelined version with hardware cost G, a k-stage pipelined version has cost G + k*L, where L is the cost of adding each latch.

Cost/Performance Trade-off [Peter M. Kogge, 1981]
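The trade-off can be made concrete by combining the two models above. A hedged C sketch (the parameter values are made up for illustration; the closed-form optimum is the standard result of minimizing (G + k*L)*(T/k + S), in the spirit of Kogge's analysis):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double T = 128.0;  /* total logic delay, unpipelined (arbitrary units) */
        double S = 2.0;    /* latch delay + clocking overhead per stage */
        double G = 512.0;  /* cost of the unpipelined logic (arbitrary units) */
        double L = 8.0;    /* cost of each pipeline latch */

        for (int k = 1; k <= 64; k *= 2) {
            double throughput = 1.0 / (T / k + S);  /* performance model above */
            double cost       = G + k * L;          /* cost model above */
            printf("k=%2d  TP=%.4f  cost=%4.0f  cost/perf=%.0f\n",
                   k, throughput, cost, cost / throughput);
        }
        /* Minimizing (G + k*L)*(T/k + S) gives k_opt = sqrt(G*T / (L*S)), = 64 here. */
        printf("k_opt = %.1f\n", sqrt(G * T / (L * S)));
        return 0;
    }

Past k_opt, the added latch cost and overhead S outweigh the throughput gained from further subdivision.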

Pipelining Idealism Uniform suboperations: the operation to be pipelined can be evenly partitioned into uniform-latency suboperations. Repetition of identical operations: the same operations are to be performed repeatedly on a large number of different inputs. Repetition of independent operations: all the repetitions of the same operation are mutually independent. Good example: an automobile assembly line.

Pipelining Reality Uniform suboperations... NOT! Balance pipeline stages: stage quantization to yield balanced stages; minimize internal fragmentation (some waiting stages). Identical operations... NOT! Unify instruction types: coalesce instruction types into one multi-function pipe; minimize external fragmentation (some idling stages). Independent operations... NOT! Resolve dependencies: inter-instruction dependency detection and resolution; minimize performance loss.


Hazards, Interlocks, and Stalls Pipeline hazards are potential violations of program dependences; we must ensure program dependences are not violated. Hazard resolution: the static method is performed at compile time, in software; the dynamic method is performed at run time, in hardware (stall, flush, or forward). A pipeline interlock is the hardware mechanism for dynamic hazard resolution; it must detect and enforce dependences at run time.
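An interlock's detection logic can be sketched in C as a comparison between the destination of each older in-flight instruction and the sources of the instruction being decoded. This is a simplified sketch assuming no forwarding; the structure and field names are illustrative, not from any particular pipeline:

    #include <stdbool.h>

    /* Simplified view of one in-flight instruction. */
    struct instr {
        int dst;    /* destination register (0 = none / $zero) */
        int src1;
        int src2;
    };

    /* RAW interlock check: stall the instruction in Decode while any older,
       still-uncompleted instruction will write one of its source registers. */
    static bool must_stall(const struct instr *decode,
                           const struct instr *older, int n_older) {
        for (int i = 0; i < n_older; i++) {
            int d = older[i].dst;
            if (d != 0 && (d == decode->src1 || d == decode->src2))
                return true;
        }
        return false;
    }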

Dependencies & Pipeline Hazards Data dependence (register or memory): true dependence (RAW), an instruction must wait for all required input operands; anti-dependence (WAR), a later write must not clobber a still-pending earlier read; output dependence (WAW), an earlier write must not clobber an already-finished later write. Control dependence: a data dependence on the instruction pointer; conditional branches cause uncertainty in instruction sequencing. Resource conflicts: two instructions need the same device.
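The three data-dependence types can be seen in an ordinary code fragment (the variable names are illustrative):

    void dependences(int a, int b) {
        int x, y;
        x = a + b;   /* (1) writes x                                          */
        y = x * 2;   /* (2) reads x  -> RAW (true dependence) on (1)          */
        x = a - b;   /* (3) writes x -> WAR (anti) on (2), WAW (output) on (1) */
        (void)x; (void)y;
    }

Only the RAW dependence actually moves a value between instructions; WAR and WAW are name conflicts that can be removed by writing to a different location (register renaming).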

Example: Quick Sort for MIPS
# for (;(j<high)&&(array[j]<array[low]);++j);
# $10 = j; $9 = high; $6 = array; $8 = low
L1: bge  $10, $9, L2
    mul  $15, $10, 4
    addu $24, $6, $15
    lw   $25, 0($24)
    mul  $13, $8, 4
    addu $14, $6, $13
    lw   $15, 0($14)
    bge  $25, $15, L2
    addu $10, $10, 1
L2: addu $11, $11, -1
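Some of the dependences are visible directly in this fragment: the $15 produced by mul $15, $10, 4 is read by addu $24, $6, $15 (RAW); the later lw $15, 0($14) overwrites $15, giving a WAR dependence with that addu and a WAW dependence with the mul; the value loaded into $15 is then read by bge $25, $15, L2 (RAW). The two bge instructions also create control dependences for everything that follows them.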


Pipeline: Data Hazards
Cycle    1        2        3        4        5        6        7        8        9
Instr 1  Fetch    Decode   Read     Execute  Memory   Write
Instr 2           Fetch    Decode   Read     Execute  Memory   Write
Instr 3                    Fetch    Decode   Read     Execute  Memory   Write
Instr 4                             Fetch    Decode   Read     Execute  Memory   Write
Instr 5                                      Fetch    Decode   Read     Execute  Memory
Instr 6                                               Fetch    Decode   Read     Execute
Instr 2 writes register rk; Instr 3 reads rk (a RAW dependence). How long should we stall for?
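One way to answer is to count cycles: the producer delivers its result in the Write stage, and the consumer needs it in its Read stage. A hedged C sketch, assuming no forwarding and no same-cycle write-then-read (stage numbers follow the diagram above):

    #include <stdio.h>

    int main(void) {
        int write_stage = 6;   /* Write is the 6th stage in the diagram */
        int read_stage  = 3;   /* Read (register read) is the 3rd stage */
        int distance    = 1;   /* Instr 3 enters the pipe 1 cycle after Instr 2 */

        /* The consumer's Read must come at least one cycle after the
           producer's Write, so it must wait: */
        int stall = (write_stage - read_stage) - distance + 1;
        if (stall < 0) stall = 0;
        printf("stall cycles: %d\n", stall);  /* -> 3, the bubbles on the next slide */
        return 0;
    }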

Pipeline: Stall on Data Hazard
Cycle    1        2        3        4        5        6        7        8        9
Instr 1  Fetch    Decode   Read     Execute  Memory   Write
Instr 2           Fetch    Decode   Read     Execute  Memory   Write
Instr 3                    Fetch    Decode   Stalled  Stalled  Stalled  Read     Execute
Instr 4                             Fetch    Stalled  Stalled  Stalled  Decode   Read
Instr 5                                      Stalled  Stalled  Stalled  Fetch    Decode
Instr 6
Instr 2 writes register rk and Instr 3 reads it, so three bubbles enter the execution stages. Make the younger instruction wait until the hazard has passed: stop all up-stream stages and drain all down-stream stages.

Pipeline: Forwarding
Cycle    1        2        3        4        5        6        7        8        9
Instr 1  Fetch    Decode   Read     Execute  Memory   Write
Instr 2           Fetch    Decode   Read     Execute  Memory   Write
Instr 3                    Fetch    Decode   Read     Execute  Memory   Write
Instr 4                             Fetch    Decode   Read     Execute  Memory   Write
Instr 5                                      Fetch    Decode   Read     Execute  Memory
Instr 6                                               Fetch    Decode   Read     Execute
With forwarding (bypassing), a result is routed from the stage that produces it directly to the dependent instruction's Execute stage, so most of the stall cycles disappear.
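Forwarding can be sketched as the operand-selection logic in front of the Execute stage: instead of always taking the register-file value, the newest in-flight value of the source register is used if one exists. The structure and names below are illustrative assumptions, not the slides' design:

    /* Pick the value of source register 'src' for the instruction about to execute.
       Newer results take priority; the register file is the fallback. */
    static int forward_operand(int src,
                               int ex_mem_dst, int ex_mem_result,  /* just executed   */
                               int mem_wb_dst, int mem_wb_result,  /* one stage older */
                               const int regfile[32]) {
        if (src != 0 && src == ex_mem_dst) return ex_mem_result;   /* bypass from EX/MEM */
        if (src != 0 && src == mem_wb_dst) return mem_wb_result;   /* bypass from MEM/WB */
        return regfile[src];                                       /* no hazard: normal read */
    }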

Limitations of Simple Pipelined Processors (aka Scalar Processors) Upper bound on scalar pipeline throughput: limited to IPC = 1. Inefficiencies of very deep pipelines: clocking overheads, longer hazards and stalls. Performance lost to the in-order pipeline: unnecessary stalls, inefficient unification into a single pipeline. Long latency for each instruction: hazards and the associated stalls.

Limitations of Deeper Pipelines Execution time = (code size) x (CPI) x (cycle time). Pipelining shortens the cycle time from T (unpipelined) toward T/k + S (k-stage pipelined), but as k grows the cycle time is eventually limited by the latch delay and clocking overhead S.

Put it All Together: Limits to Deeper Pipelines [Figure] Source: Ed Grochowski, 1997

Acknowledgements These slides contain material developed and copyrighted by Grant McFarland (Intel), Christos Kozyrakis (Stanford University), Arvind (MIT), and Joel Emer (Intel/MIT).