INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

Similar documents

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

WAR: Write After Read

The Microarchitecture of Superscalar Processors

CPU Performance Equation

CS352H: Computer Systems Architecture

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Lecture: Pipelining Extensions. Topics: control hazards, multi-cycle instructions, pipelining equations

Computer Architecture TDTS10

Course on Advanced Computer Architectures

VLIW Processors. VLIW Processors

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

Pipeline Hazards. Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by Arvind and Krste Asanovic

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes

PROBLEMS #20,R0,R1 #$3A,R2,R4

Pipelining Review and Its Limitations

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

Superscalar Processors. Superscalar Processors: Branch Prediction Dynamic Scheduling. Superscalar: A Sequential Architecture. Superscalar Terminology

Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1

Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

Module: Software Instruction Scheduling Part I

Giving credit where credit is due

A Lab Course on Computer Architecture

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

Introduction to Cloud Computing

UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering. EEC180B Lab 7: MISP Processor Design Spring 1995

LSN 2 Computer Processors

Multithreading Lin Gao cs9244 report, 2006

Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉志尉 National Chiao Tung University

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

Computer Organization and Components

IA-64 Application Developer s Architecture Guide

Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield.

Design of Pipelined MIPS Processor. Sept. 24 & 26, 1997

OC By Arsene Fansi T. POLIMI

Midterm I SOLUTIONS March 21 st, 2012 CS252 Graduate Computer Architecture

Boosting Beyond Static Scheduling in a Superscalar Processor

Application Note 195. ARM11 performance monitor unit. Document number: ARM DAI 195B Issued: 15th February, 2008 Copyright ARM Limited 2007

Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats

Instruction Set Design

Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors

AMD Opteron Quad-Core

The Pentium 4 and the G4e: An Architectural Comparison

SOC architecture and design

High Performance Processor Architecture. André Seznec IRISA/INRIA ALF project-team

Using Graphics and Animation to Visualize Instruction Pipelining and its Hazards

Week 1 out-of-class notes, discussions and sample problems

CS521 CSE IITG 11/23/2012

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Instruction Set Architecture (ISA)

Basic Performance Measurements for AMD Athlon 64, AMD Opteron and AMD Phenom Processors

Instruction scheduling

Putting Checkpoints to Work in Thread Level Speculative Execution

CARNEGIE MELLON UNIVERSITY

Computer Architecture Lecture 3: ISA Tradeoffs. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 1/18/2013

Software Pipelining. Y.N. Srikant. NPTEL Course on Compiler Design. Department of Computer Science Indian Institute of Science Bangalore

Streamlining Data Cache Access with Fast Address Calculation

Implementation of Core Coalition on FPGAs

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings

Energy Efficient Design of the Reorder Buffer 1

Chapter 11 I/O Management and Disk Scheduling

EE282 Computer Architecture and Organization Midterm Exam February 13, (Total Time = 120 minutes, Total Points = 100)

CPU Organization and Assembly Language

Computer Organization and Architecture

Learning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design

CPU Organisation and Operation

"JAGUAR AMD s Next Generation Low Power x86 Core. Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS

COMPUTER HARDWARE. Input- Output and Communication Memory Systems

Performance evaluation

Intel 8086 architecture

Putting it all together: Intel Nehalem.

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

StrongARM** SA-110 Microprocessor Instruction Timing

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Software implementation of Post-Quantum Cryptography

ELE 356 Computer Engineering II. Section 1 Foundations Class 6 Architecture

Study Material for MS-07/MCA/204

Central Processing Unit (CPU)

Operating System Impact on SMT Architecture

Thread level parallelism

Data Dependences. A data dependence occurs whenever one instruction needs a value produced by another.

Low Power AMD Athlon 64 and AMD Opteron Processors

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

An Out-of-Order RiSC-16 Tomasulo + Reorder Buffer = Interruptible Out-of-Order

Introduction to Microprocessors

Introducción. Diseño de sistemas digitales.1

Pentium vs. Power PC Computer Architecture and PCI Bus Interface

Instruction Level Parallelism I: Pipelining

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

EECS 583 Class 11 Instruction Scheduling Software Pipelining Intro

Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to:

VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU

Enterprise Applications

Transcription:

Course on: Advanced Computer Architectures INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Prof. Cristina Silvano Politecnico di Milano cristina.silvano@polimi.it

Prof. Silvano, Politecnico di Milano 2 Outline Hardware-based Speculation Reorder Buffer Speculative Tomasulo Algorithm Prof. Cristina Silvano - Politecnico di Milano 2

Prof. Silvano, Politecnico di Milano Hardware-based Speculation

ILP Summary: Scoreboard 4 HW exploiting ILP Works when can t know dependence at compile time. Code for one machine runs well on another Key idea of Scoreboard: Allow instructions behind stall to proceed Enables out-of-order execution => out-of-order completion ID stage checked both for structural & data dependencies Original version didn t handle forwarding (CDC 6600 in 1963) No automatic register renaming Pipeline stalls for WAR and WAW hazards Prof. Cristina Silvano - Politecnico di Milano 4

ILP Summary: Tomasulo Reservations stations: implicit renaming to larger set of registers + buffering source operands Prevents registers as bottleneck Avoids WAR, WAW hazards of Scoreboard Allows dynamic loop unrolling Not limited to basic blocks (integer units gets ahead, beyond branches) Helps cache misses as well Lasting Contributions Dynamic scheduling Register renaming Load/store disambiguation 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264 5 Prof. Cristina Silvano - Politecnico di Milano 5

Problem: Fetch unit Stream of Instructions To Execute Prof. Silvano, Politecnico di Milano 6 Instruction Fetch with Branch Prediction Out-Of-Order Execution Unit Correctness Feedback on Branch Results Instruction fetch decoupled from execution Often issue logic (+ rename) included with Fetch Prof. Cristina Silvano - Politecnico di Milano 6

7 Branches must be solved quickly for loop overlap! In our loop-unrolling example, we relied on the fact that branches were under control of fast integer unit in order to get overlap! Loop: LD F0 0 R1 MULTD F4 F0 F2 SD F4 0 R1 SUBI R1 R1 #8 BNEZ R1 Loop What happens if branch depends on result of MULTD?? We completely lose all of our advantages! Need to be able to predict the branch outcome. If we were to predict that branch was taken, this would be right most of the time. Problem much worse for superscalar machines! Prof. Cristina Silvano - Politecnico di Milano 7

Prof. Silvano, Politecnico di Milano 8 Prediction: Branches, Dependencies, Data Prediction has become essential to get good performance from scalar instruction streams. We will discuss predicting branches, but architects are now predicting everything: data dependencies, actual data, and results of groups of instructions At what point does computation become a probabilistic operation + verification? We are pretty close with control hazards already Why does prediction work? Underlying algorithm has regularities. Data being operated on have regularities. Instruction sequence has redundancies that are artifacts of way that humans/compilers think about problems. Prof. Cristina Silvano - Politecnico di Milano 8

Prof. Silvano, Politecnico di Milano 9 Hardware-based Speculation The key idea behind speculation is: to issue and execute instructions dependent on a branch before the branch outcome is known; to allow instructions to execute out-of-order but to force them to commit in-order; to prevent any irrevocable action (such as updating state or taking an exception) until an instruction commits; When an instruction is no longer speculative, we allow it to update the register file or memory (instruction commit). ReOrder Buffer (ROB) to hold the results of instructions that have completed execution but have not yet committed or to pass results among instructions that may be speculated. Prof. Cristina Silvano - Politecnico di Milano 9

Hardware-based Speculation 10 Outcome of branches is speculated and program is executed as if speculation was correct (without speculation, the simple dynamic scheduling would only fetch and decode, not execute!) Mechanisms are necessary to handle incorrect speculation hardware speculation extend dynamic scheduling beyond a branch (i.e. behind the basic block) Prof. Cristina Silvano - Politecnico di Milano 10

HW-based Speculation 11 HW-based Speculation combines 3 ideas: Dynamic Branch Prediction to choose which instruction to execute before the branch outcome is known Speculation to execute instructions before control dependences are resolved Dynamic Scheduling supporting out-of-order execution but in-order commit to prevent any irrevocable actions (such as register update or taking exception) until an instruction commits Prof. Cristina Silvano - Politecnico di Milano 11

Hardware-based Speculation 12 Issue and execute an instruction dependent on a branch before the branch outcome is known. Commit is always made in order. Commit of a speculative instruction is made only when the branch outcome is known. The same holds for exceptions (synchronous or asynchronous) deviations of control flow Follows the predicted flow of data values to choose when to execute an instruction; Essentially, a data flow mode of execution: instructions start execution as soon as their operands are available (RAW hazards are solved) Prof. Cristina Silvano - Politecnico di Milano 12

Hardware-based Speculation Adopted in PowerPC 603/604, MIPS R10000/ R12000, Pentium II/III/4, AMD K5/K6 Athlon. Extends hardware support for Tomasulo algorithm: to support speculation, commit phase is separated from execution phase, ReOrder Buffer is introduced Speculative Tomasulo Algorithm with ReOrder Buffer 13 Prof. Cristina Silvano - Politecnico di Milano 13

Hardware-based Speculation Basic Tomasulo algorithm: instruction writes result in RF and Reservation Stations, where subsequent instructions find the results to be used as operands With speculation, results are written only when instruction commits and it is known whether the instruction (no more speculative) had to be executed. Key idea: executing out of order, committing in order. Boosting 14 Prof. Cristina Silvano - Politecnico di Milano 14

Relationship between precise interrupts and speculation 15 Speculation is a form of guessing. Important for branch prediction: Need to take our best shot at predicting branch direction. If we issue multiple instructions per cycle, lose lots of potential instructions otherwise: Consider 4 instructions per cycle If take single cycle to decide on branch, waste from 4-7 instruction slots! If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly: This is exactly same as precise exceptions! Technique for both precise interrupts/exceptions and speculation: in-order completion or commit Prof. Cristina Silvano - Politecnico di Milano 15

What about Precise Exceptions/Interrupts? Prof. Silvano, Politecnico di Milano 16 Both Scoreboard and Tomasulo have: In-order issue, out-of-order execution, out-of-order completion To allow for precise interrupts there is the need for a way to resynchronize execution with instruction stream (such as with issue-order) Easiest way is with in-order completion (i.e. reorder buffer) Prof. Cristina Silvano - Politecnico di Milano 16

HW support for precise interrupts Concept of ReOrder Buffer (ROB): Prof. Silvano, Politecnico di Milano 17 Holds instructions in FIFO order, exactly as they were issued Each ROB entry contains PC, dest reg, result, exception status When instructions complete, results placed into ROB Supplies operands to other instruction between execution complete & commit more registers like RS Tag results with ROB buffer number instead of reservation station Instructions commit values at head of ROB placed in registers As a result, easy to undo speculated instructions on mispredicted branches or on exceptions Prof. Cristina Silvano - Politecnico di Milano 17

HW support for more ILP 18 ReOrder Buffer also useful for speculation Speculation: Allow an instruction without any consequences (including exceptions) if branch is not actually taken ( HW undo ); called boosting Combine branch prediction to choose which instructions to execute with dynamic scheduling to execute before branches solved Prof. Cristina Silvano - Politecnico di Milano 18

HW support for more ILP (2) 19 Separate speculative bypassing of results from real bypassing of results When instruction no longer speculative, write boosted results (instruction commit) or discard boosted results Execute out-of-order but commit in-order to prevent irrevocable action (update state or exception) until instruction commits Prof. Cristina Silvano - Politecnico di Milano 19

ReOrder Buffer (ROB)

21 ReOrder Buffer (ROB) Buffer to hold the results of instructions that have finished execution but non committed Buffer to pass results among instructions that can be speculated Support out-of-order execution but in-order commit Speculative Tomasulo Algorithm with ROB: Pointers are directed toward ROB slot. A register or memory is updated only when the instruction reaches the head of ROB (that is until the instruction is no longer speculative). Prof. Cristina Silvano - Politecnico di Milano 21

ReOrder Buffer (ROB) Prof. Silvano, Politecnico di Milano FP Op Queue Reorder Buffer FP Regs Commit path Res Stations FP Adder Res Stations FP Adder Prof. Cristina Silvano - Politecnico di Milano 22

23 ReOrder Buffer (ROB) ROB completely replaces the Store Buffers The renaming function of Reservation Stations is replaced by ROB Reservation Stations now used only to buffer instructions and operands to FUs (to reduce structural hazards). Pointers now are directed toward ROB entries Processors with ROB can dynamically execute while maintaining precise interrupt model because instruction commit happens in order. Prof. Cristina Silvano - Politecnico di Milano 23

ReOrder Buffer (ROB) 4 fields: Instruction Type Destination: RF number (for load and ALU ops) Memory address (for stores) Value: To hold the value of the instruction result until the instruction commits Ready: Indicates that the instruction has completed execution and the value is ready 24 Prof. Cristina Silvano - Politecnico di Milano 24

25 ReOrder Buffer (ROB) Reorder buffer can be operand source More registers like reservation stations Use reorder buffer number instead of reservation station when execution completes Supplies operands between execution complete & commit Once operand commits, result is put into register Instructions commit As a result, its easy to undo speculated instructions on mispredicted branches or on exceptions Prof. Cristina Silvano - Politecnico di Milano 25

26 ReOrder Buffer (ROB) Originally (1988) introduced to solve precise interrupt problem; generalized to grant sequential consistency; Basically, ROB is a circular buffer with tail pointer (indicating next free entry) and head pointer indicating the instruction that will commit (leaving ROB) first. Prof. Cristina Silvano - Politecnico di Milano 26

ReOrder Buffer (ROB) 27 Instructions are written in ROB in strict program order when an instruction is issued, an entry is allocated to it in sequence. Entry indicates status of instruction: issued (i), in execution (x), finished (f) (+ other items!), An instruction can commit (retire) iff 1. It has finished, and 2. All previous instructions have already retired. Prof. Cristina Silvano - Politecnico di Milano 27

28 ReOrder Buffer (ROB) Head (next instruction to be retired) I n-1 I n f Active entries x I n+k i i Tail (first free entry): allocate subsequent instructions to subsequent entries, in order Prof. Cristina Silvano - Politecnico di Milano 28

29 ReOrder Buffer (ROB) Only retiring instructions can complete, i.e., update architectural registers and memory; ROB can support both speculative execution and exception handling Speculative execution: each ROB entry is extended to include a speculative status field, indicating whether instruction has been executed speculatively; Finished instructions cannot retire as long as they are in speculative status. Interrupt handling: Exceptions generated in connection with instruction execution are made precise by accepting exception request only when instruction becomes next to retire (exceptions are processed in order) Prof. Cristina Silvano - Politecnico di Milano 29

Prof. Silvano, Politecnico di Milano 30 Speculative Tomasulo Algorithm

Prof. Silvano, Politecnico di Milano 31 Speculative Tomasulo Architecture for an FPU

Prof. Silvano, Politecnico di Milano 32 Four Steps of Speculative Tomasulo Algorithm 1. Issue get instruction from FP Op Queue If reservation station and ROB slot free, issue instr & send operands & ROB no. for destination (this stage sometimes called dispatch ) 2. Execution operate on operands (EX) When both operands ready then execute; if not ready, watch CDB for result; when both operands in reservation station, execute; checks RAW (sometimes called issue ) 3. Write result finish execution (WB) Write on Common Data Bus to all awaiting FUs & ROB; mark reservation station available. 4. Commit update register with ROB result When instr. at head of ROB & result present, update register with result (or store to memory) and remove instr from ROB. Mispredicted branch flushes ROB (sometimes called graduation ) Prof. Cristina Silvano - Politecnico di Milano 32

Steps of Speculative Tomasulo s Algorithm (2) 4. Commit: 3 different possible sequences: 1. Normal commit: instruction reaches the head of the ROB, result is present in the buffer. Result is stored in the register, instruction is removed from ROB; 2. Store commit: as above, but memory rather than register is updated; 3. Instruction is a branch with incorrect prediction: it indicates that speculation was wrong. ROB is flushed ( graduation ), execution restarts at correct successor of the branch. If the branch was correctly predicted, branch is finished 33 Prof. Cristina Silvano - Politecnico di Milano 33

Speculative Tomasulo s Algorithm Tomasulo s Boosting needs a buffer for uncommitted results (ROB). Each entry in ROB contains four fields: Instruction type field indicates whether instruction is a branch (no destination result), a store (has memory address destination), or a load/alu (register destination) Destination field: supplies register number (for loads and ALU instructions) or memory address (for stores) where results should be written; Value field: used to hold value of result until instruction commits Ready field: indicates that instruction has completed execution, value is ready 34 Prof. Cristina Silvano - Politecnico di Milano 34

ReOrder Buffer (ROB) 35 ROB completely replaces Store Buffers: stores execute in two steps, the second one when instruction commits; Renaming function of reservation stations completely replaced by ROB entries; Reservation stations now only queue operations (and operands) to FUs between the time they issue and the time they begin execution; Results are tagged with ROB entry number rather than with RS number ROB entry assigned to instruction must be tracked in the reservation stations. Prof. Cristina Silvano - Politecnico di Milano 35

36 ReOrder Buffer (ROB) and Branch Misprediction All instructions except for mispredicted branches (or incorrectly speculated loads) commit when reaching head of ROB; Mispredicted branches reaches head of ROB wrong speculation is indicated, the ROB entries depending on the mispredicted branch are flushed, execution restarts at correct successor of branch. Speculative actions are easily undone. Processors with ROB can dynamically execute while maintaining a precise interrupt model: If instruction I j causes interrupt, CPU waits until I j reaches the head of the ROB and takes the interrupt, flushing all other pending instructions. Prof. Cristina Silvano - Politecnico di Milano 36

37 Exception Handling Not recognize the exception until it is ready to commit If speculated instruction raises an exception record in the ROB If branch misprediction and the instruction should not have been executed, then flush exception If the instruction reaches the head of ROB and it is no longer speculative, the exception should be taken Prof. Cristina Silvano - Politecnico di Milano 37

Prof. Silvano, Politecnico di Milano 38 In-Order Commit for Precise Exceptions In-order Out-of-order In-order Fetch Decode Reorder Buffer Commit Kill Inject handler PC Kill Execute Kill Exception? Instructions fetched and decoded into instruction reorder buffer in-order Execution is out-of-order ( out-of-order completion) Commit (write-back to architectural state, i.e., regfile & memory) is in-order Temporary storage needed to hold results before commit (shadow registers and store buffers) Prof. Cristina Silvano - Politecnico di Milano 38

Tomasulo with ReOrder Buffer 39 Prof. Cristina Silvano - Politecnico di Milano 39

Tomasulo with ReOrder Buffer 40 Prof. Cristina Silvano - Politecnico di Milano 40

Tomasulo with ReOrder Buffer 41 Prof. Cristina Silvano - Politecnico di Milano 41

Tomasulo with ReOrder Buffer 42 Prof. Cristina Silvano - Politecnico di Milano 42

Tomasulo with ReOrder Buffer 43 Prof. Cristina Silvano - Politecnico di Milano 43

Tomasulo with ReOrder Buffer 44 Prof. Cristina Silvano - Politecnico di Milano 44

Tomasulo with ReOrder Buffer 45 Prof. Cristina Silvano - Politecnico di Milano 45

Tomasulo with ReOrder Buffer 46 Prof. Cristina Silvano - Politecnico di Milano 46

Tomasulo with ReOrder Buffer 47 Prof. Cristina Silvano - Politecnico di Milano 47

Tomasulo with ReOrder Buffer 48 Prof. Cristina Silvano - Politecnico di Milano 48

Tomasulo with ReOrder Buffer 49 Prof. Cristina Silvano - Politecnico di Milano 49

Tomasulo with ReOrder Buffer 50 Prof. Cristina Silvano - Politecnico di Milano 50

Tomasulo with ReOrder Buffer 51 Prof. Cristina Silvano - Politecnico di Milano 51

Tomasulo with ReOrder Buffer 52 Prof. Cristina Silvano - Politecnico di Milano 52

Tomasulo with ReOrder Buffer 53 Prof. Cristina Silvano - Politecnico di Milano 53

Tomasulo with ReOrder Buffer 54 Prof. Cristina Silvano - Politecnico di Milano 54

Tomasulo with ReOrder Buffer 55 Prof. Cristina Silvano - Politecnico di Milano 55