Overview. Chapter 2. CPI Equation. Instruction Level Parallelism. Basic Pipeline Scheduling. Sample Pipeline


Overview (Chapter 2: Instruction-Level Parallelism and Its Exploitation)
- Instruction-level parallelism
- Dynamic scheduling techniques
  - Scoreboarding
  - Tomasulo's algorithm
- Reducing branch cost with dynamic hardware prediction
  - Basic branch prediction and branch-prediction buffers
  - Branch-target buffers
- Overview of superscalar and VLIW processors

CPI Equation

    Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls
                   + WAR stalls + WAW stalls + Control stalls

    Technique                                   Reduces
    Loop unrolling                              Control stalls
    Basic pipeline scheduling                   RAW stalls
    Dynamic scheduling with scoreboarding       RAW stalls
    Dynamic scheduling with register renaming   WAR and WAW stalls
    Dynamic branch prediction                   Control stalls
    Issuing multiple instructions per cycle     Ideal CPI
    Compiler dependence analysis                Ideal CPI and data stalls
    Software pipelining and trace scheduling    Ideal CPI and data stalls
    Speculation                                 All data and control stalls
    Dynamic memory disambiguation               RAW stalls involving memory

Instruction Level Parallelism
- Potential overlap among instructions
- Few possibilities within a basic block
  - Blocks are small (6-7 instructions)
  - Instructions are dependent
- Exploit ILP across multiple basic blocks, e.g. the iterations of a loop:

    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

- An alternative to vector instructions

Basic Pipeline Scheduling
- Find sequences of unrelated instructions
- The compiler's ability to schedule depends on:
  - the amount of ILP available in the program
  - the latencies of the functional units
- Latency assumptions for the examples:
  - standard MIPS integer pipeline
  - no structural hazards (fully pipelined or duplicated units)
- Latencies of FP operations (stall cycles between producer and dependent consumer):

    Instruction producing result   Instruction using result   Latency
    FP ALU op                      FP ALU op                  3
    FP ALU op                      SD                         2
    LD                             FP ALU op                  1
    LD                             SD                         0

Sample Pipeline
[Pipeline diagram: IF, ID, four FP execution stages (FP1-FP4), DM, WB. An FP ALU op dependent on a preceding FP ALU op stalls three cycles in ID before entering FP1; an SD dependent on an FP ALU op stalls two cycles before its EX/DM stages.]
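The CPI equation above is just a sum of per-instruction stall contributions, which makes it easy to evaluate numerically. The sketch below uses illustrative stall numbers (not taken from the notes) to show how removing one component, here the WAR and WAW stalls eliminated by register renaming, translates into speedup.

```python
# Illustrative numbers (assumed, not from the notes): evaluating the
# pipeline CPI equation and the speedup from removing stall components.

def pipeline_cpi(ideal, structural, raw, war, waw, control):
    # Pipeline CPI = ideal CPI + per-instruction stall contributions
    return ideal + structural + raw + war + waw + control

base = pipeline_cpi(ideal=1.0, structural=0.1, raw=0.5, war=0.2,
                    waw=0.1, control=0.4)
# Register renaming removes the WAR and WAW stall components:
renamed = pipeline_cpi(ideal=1.0, structural=0.1, raw=0.5, war=0.0,
                       waw=0.0, control=0.4)
print(round(base, 2), round(renamed, 2), round(base / renamed, 2))  # 2.3 2.0 1.15
```

The same arithmetic applies to any row of the table: each technique attacks one term of the sum, and the achievable speedup is bounded by how large that term is.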

Basic Scheduling

    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

Sequential MIPS assembly code:

    Loop:  LD    F0, 0(R1)
           ADDD  F4, F0, F2
           SD    0(R1), F4
           SUBI  R1, R1, #8
           BNEZ  R1, Loop

Pipelined execution (10 cycles per iteration):

    LD    F0, 0(R1)      1
    stall                2
    ADDD  F4, F0, F2     3
    stall                4
    stall                5
    SD    0(R1), F4      6
    SUBI  R1, R1, #8     7
    stall                8
    BNEZ  R1, Loop       9
    stall                10

Scheduled pipelined execution (6 cycles per iteration; the SD moves into the branch delay slot, with its offset adjusted because it now follows the SUBI):

    LD    F0, 0(R1)      1
    SUBI  R1, R1, #8     2
    ADDD  F4, F0, F2     3
    stall                4
    BNEZ  R1, Loop       5
    SD    8(R1), F4      6

Loop Unrolling

Unrolled loop (four copies of the body):

    Loop:  LD    F0, 0(R1)
           ADDD  F4, F0, F2
           SD    0(R1), F4
           LD    F6, -8(R1)
           ADDD  F8, F6, F2
           SD    -8(R1), F8
           LD    F10, -16(R1)
           ADDD  F12, F10, F2
           SD    -16(R1), F12
           LD    F14, -24(R1)
           ADDD  F16, F14, F2
           SD    -24(R1), F16
           SUBI  R1, R1, #32
           BNEZ  R1, Loop

Scheduled unrolled loop:

    Loop:  LD    F0, 0(R1)
           LD    F6, -8(R1)
           LD    F10, -16(R1)
           LD    F14, -24(R1)
           ADDD  F4, F0, F2
           ADDD  F8, F6, F2
           ADDD  F12, F10, F2
           ADDD  F16, F14, F2
           SD    0(R1), F4
           SD    -8(R1), F8
           SUBI  R1, R1, #32
           SD    16(R1), F12
           BNEZ  R1, Loop
           SD    8(R1), F16

Dynamic Scheduling
- Scheduling separates dependent instructions
  - Static: performed by the compiler
  - Dynamic: performed by the hardware
- Advantages of dynamic scheduling
  - Handles dependences unknown at compile time
  - Simplifies the compiler
  - Optimization is done at run time
- Disadvantage: cannot eliminate true data dependences

Out-of-Order Execution (1/2)
The central idea of dynamic scheduling.

In-order execution:

    DIVD  F0, F2, F4    IF ID DIV ..
    ADDD  F10, F0, F8      IF ID stall stall stall ..
    SUBD  F12, F8, F14        IF stall stall ..

Out-of-order execution:

    DIVD  F0, F2, F4    IF ID DIV ..
    SUBD  F12, F8, F14     IF ID A1 A2 A3 A4 ..
    ADDD  F10, F0, F8         IF ID stall ..

Out-of-Order Execution (2/2)
- Split the issue process in ID into two steps:
  - Issue: decode the instruction, check structural hazards (in-order)
  - Read operands: wait until there are no data hazards, then read operands
- Out-of-order execution/completion introduces:
  - Exception handling problems
  - WAR hazards

Dynamic Scheduling with a Scoreboard
- Details in Appendix A.7
- Allows out-of-order execution when there are
  - sufficient resources, and
  - no data dependencies
- The scoreboard is responsible for issue, execution and hazards
- Functional units with long delays are duplicated and fully pipelined
- CDC 6600: 16 functional units
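The stall counts in these examples can be reproduced mechanically. The sketch below is my own simplified model, not a tool from the notes: it issues one instruction per cycle in program order and delays each consumer according to the FP latency table; integer latencies and branch delays are ignored, so it reproduces the three data stalls of the original loop body and the zero stalls of the scheduled unrolled loop.

```python
# Simplified in-order issue model (assumed, not the notes' tool).
# Latency table: extra stall cycles between a producer and a dependent consumer.
LATENCY = {('LD', 'FPALU'): 1, ('FPALU', 'FPALU'): 3,
           ('FPALU', 'SD'): 2, ('LD', 'SD'): 0}

def issue_cycles(instrs):
    """instrs: list of (kind, dest, sources); returns each issue cycle."""
    issue, writer = [], {}
    for i, (kind, dest, srcs) in enumerate(instrs):
        earliest = issue[-1] + 1 if issue else 0   # in-order, 1 issue/cycle
        for s in srcs:                             # delay for each producer
            if s in writer:
                p = writer[s]
                earliest = max(earliest, issue[p] + 1
                               + LATENCY.get((instrs[p][0], kind), 0))
        issue.append(earliest)
        if dest:
            writer[dest] = i
    return issue

def stalls(instrs):
    return issue_cycles(instrs)[-1] - (len(instrs) - 1)

body = [('LD', 'F0', []), ('FPALU', 'F4', ['F0']),
        ('SD', None, ['F4']), ('INT', 'R1', ['R1'])]

unrolled = [('LD', 'F0', []), ('LD', 'F6', []),
            ('LD', 'F10', []), ('LD', 'F14', []),
            ('FPALU', 'F4', ['F0']), ('FPALU', 'F8', ['F6']),
            ('FPALU', 'F12', ['F10']), ('FPALU', 'F16', ['F14']),
            ('SD', None, ['F4']), ('SD', None, ['F8']),
            ('INT', 'R1', ['R1']), ('SD', None, ['F12']),
            ('INT', None, ['R1']), ('SD', None, ['F16'])]

print(stalls(body), stalls(unrolled))  # 3 0
```

The unscheduled body loses three cycles to data stalls; the scheduled unrolled loop hides all of them because every consumer is separated from its producer by enough independent instructions.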

MIPS with Scoreboard
[Figure: MIPS datapath with the scoreboard controlling the functional units.]

Scoreboard Operation
- The scoreboard centralizes hazard management
  - Every instruction goes through the scoreboard
  - The scoreboard determines when the instruction can read its operands and begin execution
  - It monitors changes in the hardware and decides when a stalled instruction can execute
  - It controls when instructions can write their results
- New pipeline: ID, EX, WB become Issue, Read Regs, Execution, Write

Execution Process
- Issue
  - The functional unit is free (structural hazard check)
  - No active instruction has the same destination register Rd (WAW check)
- Read operands
  - Checks the availability of source operands
  - Resolves RAW hazards dynamically (out-of-order execution)
- Execution
  - The functional unit begins execution when the operands arrive
  - It notifies the scoreboard when it has completed execution
- Write result
  - The scoreboard checks for WAR hazards
  - Stalls the completing instruction if necessary

Scoreboard Data Structure
- Instruction status: indicates the pipeline stage of each instruction
- Functional unit status:
  - Busy: whether the functional unit is busy
  - Op: operation to perform in the unit (+, -, etc.)
  - Fi: destination register
  - Fj, Fk: source register numbers
  - Qj, Qk: functional units producing Fj, Fk
  - Rj, Rk: flags indicating when Fj, Fk are ready
- Register result status: the FU that will write each register

Scoreboard Data Structure (1/3): instruction status

    Instruction        Issue  Read operands  Execution completed  Write
    LD    F6, 34(R2)   Y      Y              Y                    Y
    LD    F2, 45(R3)   Y      Y              Y
    MULTD F0, F2, F4   Y
    SUBD  F8, F6, F2   Y
    DIVD  F10, F0, F6  Y
    ADDD  F6, F8, F2

Scoreboard Data Structure (2/3): functional unit status and register result status

    Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj  Rk
    Integer  Y     Load  F2   R3                        N
    Mult1    Y     Mult  F0   F2  F4  Integer           N   Y
    Mult2    N
    Add      Y     Sub   F8   F6  F2           Integer  Y   N
    Divide   Y     Div   F10  F0  F6  Mult1             N   Y

    Register result status (F0 F2 F4 F6 F8 F10 F12 ... F30):
    F0: Mult1   F2: Integer   F8: Add   F10: Divide
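The issue-stage checks in the Execution Process can be expressed directly over the scoreboard tables. The sketch below is a minimal illustration with invented Python names, initialized to the state of the example tables above: an instruction may issue only if its functional unit is not busy (structural hazard) and no active instruction will write its destination register (WAW hazard).

```python
# Minimal sketch of the scoreboard's issue test (names assumed, state taken
# from the example tables: Add is busy with SUBD, Mult1 will write F0, etc.)

fu_busy = {'Integer': True, 'Mult1': True, 'Mult2': False,
           'Add': True, 'Divide': True}
register_result = {'F0': 'Mult1', 'F2': 'Integer',
                   'F8': 'Add', 'F10': 'Divide'}

def can_issue(unit, dest):
    if fu_busy[unit]:                # structural hazard: unit still busy
        return False
    if dest in register_result:      # WAW hazard: a pending writer exists
        return False
    return True

print(can_issue('Add', 'F6'))    # False: Add is busy with SUBD, so ADDD F6 waits
print(can_issue('Mult2', 'F0'))  # False: Mult1 will already write F0 (WAW)

# Once SUBD writes F8 and frees the Add unit, ADDD F6, F8, F2 can issue:
fu_busy['Add'] = False
del register_result['F8']
print(can_issue('Add', 'F6'))    # True
```

This is exactly why ADDD F6, F8, F2 has no Y in the Issue column of the instruction status table: the structural hazard on the Add unit blocks it, independent of its operands.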

Scoreboard Data Structure (3/3)
[Figure: the three scoreboard tables combined for the example code.]

Scoreboard Algorithm
[Figure: the checks and bookkeeping required at each pipeline stage.]

Scoreboard Limitations
- Amount of available ILP
- Number of scoreboard entries
  - Limited to a basic block
  - Can be extended beyond a branch
- Number and types of functional units
  - Structural hazards can increase with dynamic scheduling
- Presence of anti- and output dependences
  - These lead to WAR and WAW stalls

Tomasulo Approach
- Another approach to eliminating stalls
- Combines the scoreboard with register renaming (to avoid WAR and WAW hazards)
- Designed for the IBM 360/91
  - High FP performance was needed for the whole 360 family
  - Only four double-precision FP registers
  - Long memory access and long FP delays
- Can support overlapped execution of multiple iterations of a loop
[Figure: basic structure of the Tomasulo-based FP unit: reservation stations, load/store buffers, and the common data bus (CDB).]

Stages
- Issue
  - Requires an empty reservation station or buffer
  - Send available operands to the reservation station
  - Use the name of a reservation station for operands that are not yet available
- Execute
  - Execute the operation once both operands are available
  - Monitor the CDB for the availability of missing operands
- Write result
  - When the result is available, write it to the CDB
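The three stages above can be sketched in a few lines. This is my own simplified model with invented station names, not the 360/91 design: at issue, each operand is captured either as a value (V) or as the tag (Q) of the reservation station that will produce it, and the destination register is renamed to the issuing station, which is what removes WAR and WAW hazards; a CDB broadcast then wakes up every waiting station.

```python
# Hedged sketch of Tomasulo-style issue and CDB broadcast (names assumed).
regs = {'F0': 1.0, 'F2': 2.0, 'F4': 4.0, 'F8': 8.0,
        'F10': 0.0, 'F12': 0.0, 'F14': 14.0}
reg_tag = {}   # register -> reservation station that will write it (renaming)
rs = {}        # station -> {'op', 'V': {src: value}, 'Q': {src: tag}}

def issue(station, op, dest, srcs):
    entry = {'op': op, 'V': {}, 'Q': {}}
    for s in srcs:
        if s in reg_tag:
            entry['Q'][s] = reg_tag[s]   # operand pending: record producer tag
        else:
            entry['V'][s] = regs[s]      # operand available: capture value now
    rs[station] = entry
    reg_tag[dest] = station              # rename dest: avoids WAR and WAW

def ready(station):
    return not rs[station]['Q']          # execute once all operands captured

def broadcast(station, value):
    # Common data bus: wake up waiting stations, update renamed registers.
    for e in rs.values():
        for s, tag in list(e['Q'].items()):
            if tag == station:
                del e['Q'][s]
                e['V'][s] = value
    for r, tag in list(reg_tag.items()):
        if tag == station:
            regs[r] = value
            del reg_tag[r]
    del rs[station]

issue('Div1', 'DIVD', 'F10', ['F0', 'F2'])
issue('Add1', 'ADDD', 'F12', ['F10', 'F8'])  # waits on Div1's tag for F10
issue('Add2', 'SUBD', 'F14', ['F8', 'F4'])   # independent: ready immediately
print(ready('Add2'), ready('Add1'))  # True False
broadcast('Div1', 0.5)               # DIVD completes on the CDB
print(ready('Add1'), regs['F10'])    # True 0.5
```

Note how SUBD starts before the older ADDD, just as in the out-of-order example earlier, and how a later write to F10 would go to a new tag rather than conflicting with the pending one.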

Example (1/2)
[Figure: Tomasulo hardware state for the example code.]

Example (2/2)
[Figure: Tomasulo hardware state, continued.]

Tomasulo's Algorithm
- An enhanced and detailed design appears in Fig. 2.12 of the textbook

Loop Iterations
Tomasulo's scheme can overlap multiple iterations of a loop such as:

    Loop:  LD    F0, 0(R1)
           MULTD F4, F0, F2
           SD    0(R1), F4
           SUBI  R1, R1, #8
           BNEZ  R1, Loop

Dynamic Hardware Prediction
- Importance of control dependences
  - Branches and jumps are frequent
  - They become the limiting factor as ILP increases (Amdahl's law)
- Schemes to attack control dependences
  - Static
    - Basic (stall the pipeline)
    - Predict-not-taken and predict-taken
    - Delayed branch and canceling branch
  - Dynamic predictors
- Effectiveness of dynamic prediction schemes
  - Accuracy
  - Cost

Basic Branch Prediction Buffers
- a.k.a. Branch History Table (BHT): a small direct-mapped cache of taken/not-taken (T/NT) bits
- Indexed by the low-order bits of the branch instruction's address
- On T (predict taken), fetch from the branch target; on NT (predict not taken), fetch from PC + 4
[Figure: BHT indexed by the branch PC, yielding T (predict taken) or NT (predict not taken).]
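Because the BHT is a small direct-mapped structure indexed by low-order PC bits, two distant branches can map to the same entry and interfere with each other. The sketch below illustrates this aliasing with assumed parameters (a 4K-entry table of one-bit predictions, word-aligned instructions); the sizes and names are mine, not from the notes.

```python
# Direct-mapped branch history table sketch (sizes and names assumed).
SIZE = 4096                 # 4K one-bit entries
bht = [1] * SIZE            # 1 = predict taken, 0 = predict not taken

def index(pc):
    return (pc >> 2) & (SIZE - 1)   # low-order bits of the word address

def predict(pc):
    return bht[index(pc)] == 1      # True: fetch from target; False: PC + 4

def update(pc, taken):
    bht[index(pc)] = 1 if taken else 0

# Two branches 16 KiB apart alias to the same entry:
print(index(0x1000) == index(0x1000 + SIZE * 4))  # True

# A not-taken outcome at one address flips the prediction at the other:
update(0x1000, False)
print(predict(0x1000 + SIZE * 4))  # False
```

Aliasing does not break correctness, since the prediction is only a hint checked later in the pipeline, but it does reduce accuracy, which is why buffer size appears among the limitations discussed below.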

N-bit Branch Prediction Buffers
- Use an n-bit saturating counter per entry
- For a loop branch, only the loop exit causes a misprediction
- A 2-bit predictor is almost as good as any general n-bit predictor

Prediction Accuracy of a 4K-entry 2-bit Prediction Buffer
[Figure: measured misprediction rates per benchmark.]

Branch-Target Buffers
- Further reduce control stalls (hopefully to 0)
- Store the predicted target address in the buffer
- Access the buffer during IF
[Figure: branch-target buffer organization and lookup.]

Prediction with BTB
[Figure: the steps for handling an instruction with a branch-target buffer.]

Performance Issues
- Limitations of branch prediction schemes
  - Prediction accuracy (80% - 95%), which depends on
    - the type of program
    - the size of the buffer
- Penalty of misprediction
  - Fetch from both directions to reduce the penalty; the memory system must be dual-ported or have an interleaved cache
  - Otherwise, fetch from one path and then from the other

Five Primary Approaches in Use for Multiple-issue Processors
[Figure/table: the five approaches compared.]
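The claim that only the loop exit mispredicts is easy to check by simulation. The sketch below is my own comparison, not from the notes: it runs a 1-bit predictor and a 2-bit saturating counter over a loop branch taken nine times then not taken, repeated three times, with both predictors assumed to start in their strongest taken state.

```python
# Assumed initial states: 1-bit starts at "taken", 2-bit at strongly taken (3).

def mispredicts_1bit(outcomes, state=1):
    miss = 0
    for taken in outcomes:
        if (state == 1) != taken:
            miss += 1
        state = 1 if taken else 0        # remember only the last outcome
    return miss

def mispredicts_2bit(outcomes, state=3):
    miss = 0
    for taken in outcomes:
        if (state >= 2) != taken:        # predict taken in states 2 and 3
            miss += 1
        # saturating counter: increment on taken, decrement on not taken
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return miss

loop = ([True] * 9 + [False]) * 3        # loop branch: taken x9, then exit
print(mispredicts_1bit(loop), mispredicts_2bit(loop))  # 5 3
```

The 1-bit predictor mispredicts twice per subsequent loop execution, once at the exit and again on re-entry, while the 2-bit counter only drops to "weakly taken" at the exit and so mispredicts exactly once per execution, which is the hysteresis argument behind the bullet above.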