Basic Pipelining. Basic Pipelining

Similar documents
Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1

Design of Pipelined MIPS Processor. Sept. 24 & 26, 1997

CS352H: Computer Systems Architecture

Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats

Lecture: Pipelining Extensions. Topics: control hazards, multi-cycle instructions, pipelining equations

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University

Pipeline Hazards. Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by Arvind and Krste Asanovic

Review: MIPS Addressing Modes/Instruction Formats

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

WAR: Write After Read

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

PROBLEMS #20,R0,R1 #$3A,R2,R4

Course on Advanced Computer Architectures

Solutions. Solution The values of the signals are as follows:

Pipelining Review and Its Limitations

Data Dependences. A data dependence occurs whenever one instruction needs a value produced by another.

Instruction Set Architecture (ISA)

Using Graphics and Animation to Visualize Instruction Pipelining and its Hazards

Microprocessor and Microcontroller Architecture

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Week 1 out-of-class notes, discussions and sample problems

Chapter 5 Instructor's Manual

Thread level parallelism

Computer organization

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

EE361: Digital Computer Organization Course Syllabus


LSN 2 Computer Processors

Computer Organization and Components

UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering. EEC180B Lab 7: MISP Processor Design Spring 1995

CS521 CSE IITG 11/23/2012

Introduction to Cloud Computing

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Module: Software Instruction Scheduling Part I

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU

Performance evaluation

Introducción. Diseño de sistemas digitales.1

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

Giving credit where credit is due

CHAPTER 1 INTRODUCTION

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

CHAPTER 4 MARIE: An Introduction to a Simple Computer

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

Chapter 9 Computer Design Basics!

VLIW Processors. VLIW Processors

(Refer Slide Time: 00:01:16 min)

UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

TIMING DIAGRAM O 8085

Navigating Big Data with High-Throughput, Energy-Efficient Data Partitioning

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

A s we saw in Chapter 4, a CPU contains three main sections: the register section,

! Metrics! Latency and throughput. ! Reporting performance! Benchmarking and averaging. ! CPU performance equation & performance trends

Operating System Impact on SMT Architecture

PROBLEMS. which was discussed in Section

on an system with an infinite number of processors. Calculate the speedup of

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

l C-Programming l A real computer language l Data Representation l Everything goes down to bits and bytes l Machine representation Language

Computer Architecture TDTS10

EE282 Computer Architecture and Organization Midterm Exam February 13, (Total Time = 120 minutes, Total Points = 100)

Pipelined MIPS Processor. Dmitri Strukov ECE 154A

EC 362 Problem Set #2

Chapter 4 Lecture 5 The Microarchitecture Level Integer JAVA Virtual Machine

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Contents. Chapter 1. Introduction

CPU Performance Equation

Computer Organization and Architecture

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

Software Pipelining - Modulo Scheduling

Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x

Central Processing Unit (CPU)

CS 61C: Great Ideas in Computer Architecture Finite State Machines. Machine Interpreta4on

CPU Performance. Lecture 8 CAP

Architectures and Platforms

Computer Architecture Syllabus of Qualifying Examination

Multi-Threading Performance on Commodity Multi-Core Processors

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings

EECS 583 Class 11 Instruction Scheduling Software Pipelining Intro

SOC architecture and design

Types Of Operating Systems

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

REDUCED INSTRUCTION SET COMPUTERS

Chapter 1: Introduction. What is an Operating System?

Performance Metrics and Scalability Analysis. Performance Metrics and Scalability Analysis

ARM Microprocessor and ARM-Based Microcontrollers

Chapter 01: Introduction. Lesson 02 Evolution of Computers Part 2 First generation Computers

FPGA-based Multithreading for In-Memory Hash Joins

Memory ICS 233. Computer Architecture and Assembly Language Prof. Muhamed Mudawar

CSEE W4824 Computer Architecture Fall 2012

路 論 Chapter 15 System-Level Physical Design

Router Architectures

The Quest for Speed - Memory. Cache Memory. A Solution: Memory Hierarchy. Memory Hierarchy

Unit A451: Computer systems and programming. Section 2: Computing Hardware 1/5: Central Processing Unit

Processor Architectures

Instruction Set Design

Instruction Set Architecture. Datapath & Control. Instruction. LC-3 Overview: Memory and Registers. CIT 595 Spring 2010

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Transcription:

Basic Pipelining * In a typical system, speedup is achieved through parallelism at all levels: Multi -user, multi-tasking, multi-processing, multi-programming, multi-threading, compiler optimizations and many other techniques. * Pipelining is a technique for overlapping operations during execution. Today this is a key feature that makes fast CPUs. * Different types of pipeline: instruction pipeline, operation pipeline, multi-issue pipelines. Huang - Spring 2008 CSE 340 1

Topics of Discussion * What is a pipeline? * A simple implementation of MIPS64 * Basic pipeline of MIPS64 * Performance issues * Structural hazards * Data hazards * Control hazards * Implementation issues * Handling multicycle operations * Instruction set design and pipelining * Summary Huang - Spring 2008 CSE 340 2

What is a Pipeline? * Pipeline is like an auto assembly line. * A pipeline has many steps (or stages, segments). * Each stage carries out a different part of instruction or operation. * The stages are connected to form a pipe. * An instruction or operation enters through one end and progresses thru the stages and exit thru the other end. * Pipelining is an implementation technique that exploits parallelism among the instructions in a sequential instruction stream. Huang - Spring 2008 CSE 340 3

Pipeline Characteristics * Throughput: Number of items (cars, instructions) exiting the pipeline per unit time. * Stage time: The pipeline designer s goal is to balance the length of each pipeline stage. In a balanced pipeline, stage time = time per instruction on unpipelined machine/ number of stages. * In many instances, stage time = max ( times for all stages). * Pipeline yields a reduction in CPI. CPI approx. = stage time. Huang - Spring 2008 CSE 340 4

Implementation of MIPS64 ISA * MIPS64 instructions can be implemented in at most five cycles: Instruction fetch (IF): IR <-- Mem[PC] NPC <-- PC+4 Instruction decode (ID) (register fetch): A <-- Regs[rs] B <-- Regs[rt] Imm <-- sign-extended immediate field of IR Huang - Spring 2008 CSE 340 5

Implementation of MIPS64 ISA (contd.) Execution/Effective address (EX): Four alternatives: - Mem. references: ALUoutput <-- A + Imm - Register-Register ALU inst: ALUoutput <-- A func B - Register-Immediate: ALUoutput <-- A op Imm - Branch: ALUoutput <-- NPC + (Imm << 2) Cond <-- (A == 0) Huang - Spring 2008 CSE 340 6

Implementation of MIPS64 ISA (contd.) Memory Access/branch completion (Mem): The PC is updated for all instructions: PC <-- NPC - Memory reference LMD <-- Mem[ALUoutput] or Mem[ALUoutput] <-- B - Branch: if (cond) PC <-- ALUoutput Huang - Spring 2008 CSE 340 7

Implementation of MIPS64 ISA (contd.) Write back (WB): - Register-register ALU inst: Regs[rd] <-- ALUoutput - Register-immediate ALU inst: Regs[rt] <-- ALUoutput - Load instruction: Regs[rt] <-- LMD Study the components of MIPS64 pipeline hardware diagram. Huang - Spring 2008 CSE 340 8

Timing and Control What s missing in the RTL description of MIPS given above is the timing and control information. e.g. (Add R 1, R 2, R 3 ) Add.t 0 : IR <-- Mem[PC] NPC <-- PC+4 Add.t 1 : A <-- Regs[IR 6..10 ] B <-- Regs[IR 11..15 ] Add.t 2 : ALUoutput <-- A op B Add.t 3 : do nothing (idling) Add.t 4 : Regs[IR 16..20 ] <-- ALUoutput Huang - Spring 2008 CSE 340 9

Timing and Control - Branch Br.t 0 : IR <-- Mem[PC] NPC <-- PC + 4 Br.t 1 : A <-- Regs[IR 6..10 ] Imm <-- IR 16..31 with sign Br.t 2 : ALUoutput <-- NPC + Imm Cond <-- (A op 0) Br.t 3 : if (cond) PC <-- ALUoutput else PC <-- NPC Huang - Spring 2008 CSE 340 10

Basic Pipeline of MIPS * Five stages: IF, ID, EX, MEM, WB * On each clock cycle an instruction is fetched and begins its five-cycle execution. * Performance is up to five times that of a machine that is non-pipelined. * What do we need in the implementation of the data path to support pipelining? Huang - Spring 2008 CSE 340 11

Pipelining the MIPS Datapath * Separate instruction and data caches eliminating a conflict that would arise between instruction fetch and data memory access. This design avoids resource conflict. * We need to avoid register file access conflict: it is accessed once during ID and another time during WB stage. * Update PC every cycle. So mux from memory access stage is to be moved to IF stage. * All operations in one stage should complete within a clock cycle. * Values passed from one stage to the next must be placed in buffers/latches. (I use buffers instead of registers to avoid confusion with regular registers.) Huang - Spring 2008 CSE 340 12

Pipelining the MIPS Datapath (contd.) * How to arrive at the above list of requirements? Examine what happens in each pipeline stage depending on the instruction type. Make a list of all the possibilities. * RTL statements of the events on every stage of the MIPS pipeline is given in Fig. A.19. * To control this pipeline, we only need to determine how to set the control on the four multiplexers (mux). -- The first one inputs to PC. Let s call it MUX1. -- The next two input to ALU: MUX2,MUX3. -- The fourth one input to register file: MUX4. Huang - Spring 2008 CSE 340 13

Controlling the Pipeline * Let s refer to interface between stages IF and ID, IF/ID, and the other interfaces between stages ID/EX, EX/MEM, and MEM/WB. * MUX1 is controlled by the condition checking done at EX/MEM. Based on this condition, EX/MEM.cond, the MUX1 selects the current PC or the branch target as the instruction address. Huang - Spring 2008 CSE 340 14

Controlling the Pipeline * MUX2 and MUX3 are controlled by the type of instruction. MUX2 is set by whether the instruction is a branch or not. MUX3 is set by whether the instruction is Register-Register ALU operation or any other operation. * MUX4 is controlled by whether the instruction in the WB stage is a load or an ALU operation. * In addition MUX4 chooses the correct portion of the IR in the MEM/WB buffer to specify the register destination field. Huang - Spring 2008 CSE 340 15

Pipeline Performance - Example1 * In general: 40% ALU, 20% branch, 40% memory * Design1: Unpipelined. 10ns clock cycles. ALU operations and branches take 4 cycles, memory operations take 5 cycles. * Design2: Pipelined. Clock skew and setup add 1 ns overhead to clock cycle. * What is the speedup? Huang - Spring 2008 CSE 340 16

Pipeline Performance - Example1 (contd.) * Design 1: Average instruction executin time = clock cycle * CPI = 10ns*(4*0.4+4*0.2+5*0.4) = 44ns * Design 2: Average instruction execution time at steady state is clock cycle time = 10ns + 1ns = 11ns Therefore, the speedup is 44/11=4. Huang - Spring 2008 CSE 340 17

Pipeline Performance - Example2 * Assume times for each functional unit of a pipeline to be: 10ns, 8ns, 10ns, 10ns and 7ns. Overhead is 1ns per stage. Compute the speedup of the datapath. -- Pipelined: stage time = max(10,8,10,10,7)+overhead = 10 + 1 = 11ns. This is the average instruction execution time at steady state. -- Unpipelined: 10+8+10+10+7 = 45ns. Therefore the speedup is 45/11=4.1. Huang - Spring 2008 CSE 340 18

Pipeline Hazards * Hazards reduce the performance from the ideal speedup gained by pipelines. -- Structural hazard: Resource conflict. Hardware cannot support all possible combinations of instructions in simultaneous overlapped execution. -- Data hazard: When an instruction depends on the results of the previous instruction. -- Control hazard: Due to branches and other instructions that affect the PC. Huang - Spring 2008 CSE 340 19

Pipeline Stalls * A stall is the delay in cycles caused due to any of the hazards mentioned above. * Speedup (with stalls): 1/(1+pipeline stall per instruction)*number of pipeline stages * So what is the speedup for an ideal pipeline with no stalls? Huang - Spring 2008 CSE 340 20