Instruction Level Parallelism Part I - Introduction

Course on: Advanced Computer Architectures
Instruction Level Parallelism Part I - Introduction
Prof. Cristina Silvano, Politecnico di Milano, email: cristina.silvano@polimi.it

Outline of Part I
- Introduction to ILP
- Dependences and Hazards
- Dynamic scheduling vs static scheduling
- Superscalar vs VLIW

Getting higher performance... In a pipelined machine, the actual CPI is derived as:
CPI_pipeline = CPI_ideal + Structural Stalls + Data Hazard Stalls + Control Stalls + Memory Stalls
Reducing any right-hand term moves CPI_pipeline toward CPI_ideal (and increases the Instructions Per Clock: IPC = 1 / CPI). Best case: the maximum throughput is to complete 1 instruction per clock: IPC_ideal = 1; CPI_ideal = 1.
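
To make the relationship concrete, here is a minimal sketch (in Python) that evaluates the CPI equation above and the corresponding IPC; the per-instruction stall figures are made-up numbers for illustration, not measurements of any real pipeline.

```python
# Illustrative only: the stall contributions below are assumed values,
# expressed as stall cycles per instruction.
def pipeline_cpi(cpi_ideal, structural, data_hazard, control, memory):
    """CPI_pipeline = CPI_ideal + sum of the stall cycles per instruction."""
    return cpi_ideal + structural + data_hazard + control + memory

cpi = pipeline_cpi(cpi_ideal=1.0, structural=0.05,
                   data_hazard=0.30, control=0.20, memory=0.25)
ipc = 1.0 / cpi
print(f"CPI_pipeline = {cpi:.2f}, IPC = {ipc:.2f}")  # CPI_pipeline = 1.80, IPC = 0.56
```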

Sequential vs. Pipelining Execution [pipeline timing diagram] In sequential execution each instruction (IF ID EX MEM WB) completes every 10 ns; in pipelined execution a new instruction starts, and once the pipeline is full one completes, every 2 ns. IPC_ideal = 1; CPI_ideal = 1.

Summary of Pipelining Basics
Hazards limit performance:
- Structural: need more HW resources
- Data: need forwarding, compiler scheduling
- Control: early evaluation, branch delay slot, static and dynamic branch prediction
Increasing the length of the pipe (superpipelining) increases the impact of hazards. Pipelining helps instruction throughput, not latency.

Summary: Types of Data Hazards
Consider executing a sequence of instructions of the form r_k <- (r_i) op (r_j):
- Data dependence: r3 <- (r1) op (r2) followed by r5 <- (r3) op (r4) -> Read-after-Write (RAW) hazard
- Anti-dependence: r3 <- (r1) op (r2) followed by r1 <- (r4) op (r5) -> Write-after-Read (WAR) hazard
- Output dependence: r3 <- (r1) op (r2) followed by r3 <- (r6) op (r7) -> Write-after-Write (WAW) hazard
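
The same classification can be expressed mechanically. The sketch below (an illustration, not part of the original course material) takes the destination and source registers of two instructions in program order and reports which hazards the pair can generate; the register names are just labels.

```python
# Classify the hazard(s) that a later instruction j may create with an
# earlier instruction i, given destination and source registers.
def classify_hazard(i_dest, i_srcs, j_dest, j_srcs):
    hazards = []
    if i_dest in j_srcs:      # j reads what i writes -> true (data) dependence
        hazards.append("RAW")
    if j_dest in i_srcs:      # j writes what i reads -> anti-dependence
        hazards.append("WAR")
    if j_dest == i_dest:      # both write the same name -> output dependence
        hazards.append("WAW")
    return hazards or ["none"]

# r3 <- r1 op r2  followed by  r5 <- r3 op r4
print(classify_hazard("r3", {"r1", "r2"}, "r5", {"r3", "r4"}))  # ['RAW']
# r3 <- r1 op r2  followed by  r1 <- r4 op r5
print(classify_hazard("r3", {"r1", "r2"}, "r1", {"r4", "r5"}))  # ['WAR']
# r3 <- r1 op r2  followed by  r3 <- r6 op r7
print(classify_hazard("r3", {"r1", "r2"}, "r3", {"r6", "r7"}))  # ['WAW']
```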

Complex Pipelining [datapath diagram: IF, ID, Issue, WB stages feeding ALU, Mem, Fadd, Fmul, Fdiv units, with GPRs and FPRs]
- Long-latency or partially pipelined floating-point units
- Multiple functional and memory units
- Memory systems with variable access time
- Precise exceptions

Complex In-Order Pipeline [datapath diagram: integer pipeline (Decode, X1-X3) and FP pipelines (Fadd, Fmul, unpipelined FDiv) sharing the W stage and the commit point]
- Delay writeback so all operations have the same latency to the W stage
- Write ports are never oversubscribed (one instruction in and one instruction out every cycle)
- Instructions commit in order, which simplifies the implementation of precise exceptions

Some basic concepts and definitions To reach higher performance (for a given technology), more parallelism must be extracted from the program; in other words, multiple issue. Dependences must be detected and resolved, and instructions must be re-ordered (scheduled) so as to achieve the highest parallelism of instruction execution compatible with the available resources.

Definition of Instruction Level Parallelism ILP = exploit the potential overlap of execution among unrelated instructions. Overlapping is possible whenever there are:
- No structural hazards
- No RAW, WAR, or WAW hazards
- No control hazards

ILP: Dual-Issue Pipelining Execution [pipeline timing diagram] Two instructions enter the pipeline every 2 ns, so two instructions complete per cycle once the pipeline is full. IPC_ideal = 2; CPI_ideal = 0.5.

2-issue MIPS Pipeline Architecture [datapath diagram: PC, instruction memory, register file, two ALUs, two sign extenders, data memory] Two instructions are issued per clock: one ALU or branch instruction and one load/store instruction.
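
As a rough illustration of that issue restriction, the following sketch packs an instruction sequence into per-cycle issue packets under the assumed rule "at most one ALU/branch plus one load/store per cycle"; it deliberately ignores data hazards, which a real issue unit would also have to check.

```python
# A simplified sketch of the 2-issue slot constraint described above.
# Instruction classes ('alu', 'branch', 'mem') are assumed labels.
def pack_dual_issue(instrs):
    """instrs: list of (text, kind). Returns per-cycle issue packets in order."""
    packets, current = [], {}
    for text, kind in instrs:
        slot = "alu_br" if kind in ("alu", "branch") else "mem"
        if slot in current:            # slot already taken -> start a new cycle
            packets.append(current)
            current = {}
        current[slot] = text
    if current:
        packets.append(current)
    return packets

code = [("add r1,r2,r3", "alu"), ("lw r4,0(r6)", "mem"),
        ("sub r5,r6,r7", "alu"), ("sw r8,4(r6)", "mem"),
        ("beq r1,r5,L1", "branch")]
for cycle, pkt in enumerate(pack_dual_issue(code)):
    print(cycle, pkt)   # cycle 0: add + lw, cycle 1: sub + sw, cycle 2: beq
```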

Getting higher performance... In a multiple-issue pipelined machine, the ideal CPI would be CPI_ideal < 1. Considering for example a 2-issue processor, in the best case the maximum throughput is to complete 2 instructions per clock: IPC_ideal = 2; CPI_ideal = 0.5.

Dependences Determining the dependences among instructions is critical to defining the amount of parallelism existing in a program. If two instructions are dependent, they cannot execute in parallel: they must be executed in order, or only partially overlapped. There are three different types of dependences:
- Data dependences (or true data dependences)
- Name dependences
- Control dependences

Name Dependences A name dependence occurs when two instructions use the same register or memory location (called a name), but there is no flow of data between the instructions associated with that name. There are two types of name dependences between an instruction i that precedes instruction j in program order:
- Antidependence: j writes a register or memory location that instruction i reads (it can generate a WAR hazard). The original instruction ordering must be preserved to ensure that i reads the correct value.
- Output dependence: i and j write the same register or memory location (it can generate a WAW hazard). The original instruction ordering must be preserved to ensure that the value finally written corresponds to j.

Name Dependences Name dependences are not true data dependences, since there is no value (no data flow) being transmitted between the instructions. If the name (register number or memory location) used in the instructions could be changed, the instructions would not conflict. Dependences through memory locations are more difficult to detect (the memory disambiguation problem), since two addresses may refer to the same location yet look different. Register renaming is easier: it can be done either statically by the compiler or dynamically by the hardware.
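
The sketch below illustrates the renaming idea under simplifying assumptions (every instruction has one destination and two source registers, and physical registers are unlimited): each write receives a fresh physical name, so the WAR and WAW conflicts of the earlier example disappear while the true (RAW) dependences remain.

```python
# A minimal sketch of register renaming, as it could be done statically by a
# compiler or dynamically by the hardware. Names ("r3", "p1", ...) are labels.
from itertools import count

def rename(instrs):
    """instrs: list of (dest, src1, src2) architectural register names."""
    fresh = (f"p{i}" for i in count(1))
    mapping = {}                          # architectural name -> latest physical name
    renamed = []
    for dest, s1, s2 in instrs:
        s1 = mapping.get(s1, s1)          # sources read the latest mapping (keeps RAW)
        s2 = mapping.get(s2, s2)
        mapping[dest] = next(fresh)       # every write gets a new physical register
        renamed.append((mapping[dest], s1, s2))
    return renamed

# r3 <- r1,r2 ; r5 <- r3,r4 (RAW) ; r1 <- r4,r5 (WAR on r1) ; r3 <- r6,r7 (WAW on r3)
prog = [("r3", "r1", "r2"), ("r5", "r3", "r4"),
        ("r1", "r4", "r5"), ("r3", "r6", "r7")]
for inst in rename(prog):
    print(inst)
# ('p1','r1','r2'), ('p2','p1','r4'), ('p3','r4','p2'), ('p4','r6','r7')
```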

Data Dependences and Hazards A data/name dependence can potentially generate a data hazard (RAW, WAW, or WAR), but the actual hazard and the number of stalls needed to eliminate it are a property of the pipeline. RAW hazards correspond to true data dependences, WAW hazards correspond to output dependences, and WAR hazards correspond to antidependences. Dependences are a property of the program, while hazards are a property of the pipeline.

Control Dependences A control dependence determines the ordering of instructions, and it is preserved by two properties:
- Executing instructions in program order, to ensure that an instruction that occurs before a branch is executed before the branch.
- Detecting control hazards, to ensure that an instruction that is control dependent on a branch is not executed until the branch direction is known.
Although preserving control dependence is a simple way to preserve program order, control dependence is not the critical property that must be preserved.

Program Properties Two properties are critical to program correctness (and are normally preserved by maintaining both data and control dependences):
1. Data flow: the actual flow of data values among the instructions that produce results and those that consume them.
2. Exception behavior: preserving exception behavior means that any change in the ordering of instruction execution must not change how exceptions are raised in the program.

Instruction Level Parallelism Two strategies to support ILP:
- Dynamic scheduling: depend on the hardware to locate parallelism
- Static scheduling: rely on software (the compiler) to identify potential parallelism
Hardware-intensive approaches dominate the desktop and server markets.

Basic Assumptions
- We consider single-issue processors.
- The Instruction Fetch stage precedes the Issue stage and may fetch either into an Instruction Register (IR) or into a queue of pending instructions; instructions are then issued from the IR or from the queue.
- The Execution stage may require multiple cycles, depending on the operation type.

Key Idea: Dynamic Scheduling Problem: data dependences that cannot be hidden with bypassing or forwarding cause hardware stalls of the pipeline. Solution: allow instructions behind a stall to proceed. The hardware dynamically rearranges the instruction execution to reduce stalls, enabling out-of-order execution and completion (commit). First implemented in the CDC 6600 (1963).

Example 1
DIVD F0,F2,F4
ADDD F10,F0,F8   # RAW on F0
SUBD F12,F8,F14
RAW hazard: ADDD stalls waiting for F0 (i.e., waiting for DIVD to complete). Without dynamic scheduling, SUBD would stall as well, even though it is not data dependent on anything in the pipeline. Basic idea: enable SUBD to proceed (out-of-order execution).
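
A toy timing model makes the point. The latencies below (24 cycles for DIVD, 2 cycles for ADDD/SUBD) are assumptions chosen only for illustration; with in-order execution SUBD is held behind the stalled ADDD, while with out-of-order execution it completes almost immediately.

```python
# Toy timing sketch of the example above; latencies are assumed values.
LAT = {"DIVD": 24, "ADDD": 2, "SUBD": 2}

# (opcode, destination, sources)
prog = [("DIVD", "F0",  ["F2", "F4"]),
        ("ADDD", "F10", ["F0", "F8"]),    # RAW on F0 -> must wait for DIVD
        ("SUBD", "F12", ["F8", "F14"])]   # independent of DIVD and ADDD

def finish_times(in_order):
    done = {}          # register -> cycle in which its value becomes available
    prev_start = 0
    for op, dest, srcs in prog:
        start = max((done.get(s, 0) for s in srcs), default=0)
        if in_order:
            start = max(start, prev_start)  # cannot begin before the instruction ahead
        done[dest] = start + LAT[op]
        prev_start = start
    return done

print("in-order:    ", finish_times(True))   # SUBD result (F12) ready at cycle 26
print("out-of-order:", finish_times(False))  # SUBD result (F12) ready at cycle 2
```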

Dynamic Scheduling The hardware reorders the instruction execution to reduce pipeline stalls while maintaining data flow and exception behavior. Main advantages (pros):
- It handles cases where dependences are unknown at compile time
- It simplifies the compiler
- It allows compiled code to run efficiently on a different pipeline (code portability)
These advantages are gained at the cost of (cons):
- A significant increase in hardware complexity
- Increased power consumption
- Possibly imprecise exceptions

Dynamic Scheduling In a simple pipeline, hazards due to data dependences that cannot be hidden by forwarding stall the pipeline: no new instructions are fetched or issued. With dynamic scheduling, the hardware reorders instruction execution so as to reduce stalls, while maintaining data flow and exception behaviour. Typical example: superscalar processors.

Dynamic Scheduling (2) Basically:
- Instructions are fetched and issued in program order (in-order issue)
- Execution begins as soon as operands are available, possibly out of order (note: possible even with pipelined scalar architectures)
- Out-of-order execution introduces the possibility of WAR and WAW data hazards
- Out-of-order execution implies out-of-order completion, unless there is a re-order buffer to obtain in-order completion

Static Scheduling Compilers can use sophisticated algorithms for code scheduling to exploit ILP (Instruction Level Parallelism). The size of a basic block (a straight-line code sequence with no branches in except at the entry and no branches out except at the exit) is usually quite small, so the amount of parallelism available within a basic block is quite small. Example: for typical MIPS programs the average branch frequency is between 15% and 25%, so from about 4 (1/0.25) to 7 (roughly 1/0.15) instructions execute between a pair of branches.

Static Scheduling Data dependences can further limit the amount of ILP we can exploit within a basic block to much less than the average basic block size. To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks (i.e. across branches, as in trace scheduling).

Static Scheduling Static detection and resolution of dependences (static scheduling) is accomplished by the compiler: dependences are avoided by code reordering. The output of the compiler is code reordered into dependency-free code. Typical example: VLIW (Very Long Instruction Word) processors, which expect dependency-free code generated by the compiler.
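
To give a feel for what the compiler does, here is a highly simplified scheduling sketch (not an actual VLIW compiler): it greedily packs operations into fixed-width bundles of an assumed width of three slots, placing each operation no earlier than the bundle after its producers, and for brevity it checks only true (RAW) dependences rather than functional-unit types or latencies.

```python
# Greedy packing of operations into fixed-width VLIW bundles (illustrative).
WIDTH = 3

def schedule_vliw(instrs):
    """instrs: list of (text, dest, set_of_sources). Returns a list of bundles."""
    bundles, written = [], []              # written[i] = registers written by bundle i
    for text, dest, srcs in instrs:
        # earliest bundle strictly after every bundle producing one of our sources
        earliest = max((i + 1 for i, regs in enumerate(written) if regs & srcs),
                       default=0)
        i = earliest
        while i < len(bundles) and len(bundles[i]) >= WIDTH:
            i += 1                         # skip full bundles
        if i == len(bundles):
            bundles.append([])
            written.append(set())
        bundles[i].append(text)
        written[i].add(dest)
    return bundles

prog = [("add r1,r2,r3", "r1", {"r2", "r3"}),
        ("mul r4,r1,r5", "r4", {"r1", "r5"}),   # depends on the add
        ("sub r6,r7,r8", "r6", {"r7", "r8"}),   # independent
        ("lw  r9,0(r2)", "r9", {"r2"})]         # independent
for cycle, bundle in enumerate(schedule_vliw(prog)):
    print(cycle, bundle)   # bundle 0: add, sub, lw ; bundle 1: mul
```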

Main Limits of Static Scheduling
- Unpredictable branches
- Variable memory latency (unpredictable cache misses)
- Code size explosion
- Compiler complexity

Summary of Instruction Level Parallelism Two strategies to support ILP:
- Dynamic scheduling: depend on the hardware to locate parallelism
- Static scheduling: rely on the compiler to identify potential parallelism
Hardware-intensive approaches dominate the desktop and server markets.

Several steps towards exploiting more ILP [sequence of figure slides, culminating in Single-Issue Out-of-Order Execution and then Multiple-Issue Out-of-Order Execution]

Superscalar Processor [figure slide]

Dynamic Scheduler [figure slides]

Dynamic Scheduling is expensive!
- Large amount of logic, significant area cost: the PowerPC 750 instruction sequencer is approximately 70% of the area of all the execution units (integer units + load/store units + FP unit)!
- Cycle time limited by the scheduling logic (dispatcher and associated dependency-checking logic)
- Design verification extremely complex
- Very complex, irregular logic

Issue-Width limited in practice The issue width is the number of instructions that can be issued (and fetched, decoded, etc.) in a single cycle by a multiple-issue (also called ILP) processor. When superscalar processors were invented, 2-issue and, soon after, 4-issue processors were created (i.e. 4 instructions executed in a single cycle, ideal CPI = 1/4).

Issue-Width limited in practice Today the maximum (and rare) width is 6; nothing wider exists. The widths of current processors range from single-issue (ARM11, UltraSPARC-T1) through 2-issue (UltraSPARC-T2/T3, Cortex-A8 & A9, Atom, Bobcat), 3-issue (Pentium Pro/II/III/M, Athlon, Pentium 4, Athlon 64/Phenom, Cortex-A15), 4-issue (UltraSPARC-III/IV, PowerPC G4e, Core 2, Core i, Core i*2, Bulldozer), and 5-issue (PowerPC G5), up to 6-issue (Itanium, which is however a VLIW). Why the limit? It is too hard to decide which 8, or 16, instructions can execute every cycle (too many combinations); it takes too long to compute, so the frequency of the processor would have to be decreased.

Summary of superscalar and dynamic scheduling
Main advantage:
- Very high performance; ideal CPI very low: CPI_ideal = 1 / issue-width
Disadvantages:
- Very expensive logic to decide dependences and independences, i.e. to decide which instructions can be issued every clock cycle
- It does not scale: it is almost impractical to make the issue-width greater than 4 (we would have to slow down the clock)

Superscalar vs VLIW Scheduling means deciding when and where to execute an instruction, i.e. in which cycle and in which functional unit.
- For a superscalar processor it is decided at run time, by custom logic in hardware.
- For a VLIW processor it is decided at compile time, by the compiler, and therefore by a software program.
VLIW is good for embedded processors: simpler HW design (no dynamic scheduler), smaller area, and cheaper.

Challenges for VLIW
- Compiler technology: the compiler needs to find a lot of parallelism in order to keep the multiple functional units of the processor busy.
- Binary incompatibility: a consequence of the larger exposure of the microarchitecture (i.e. of implementation choices) to the compiler in the generated code.

Advantages of SW vs HW Scheduling [figure slide]

Current Superscalar & VLIW processors Dynamically scheduled superscalar processors are the commercial state of the art for general-purpose computing: current implementations of Intel Core i, Alpha, PowerPC, MIPS, SPARC, etc. are all superscalar. VLIW processors are primarily successful as embedded media processors for consumer electronic devices:
- TriMedia media processors by NXP (formerly Philips Semiconductors)
- The C6000 DSP family by Texas Instruments
- The STMicroelectronics ST200 family
- The SHARC DSP by Analog Devices
Itanium 2 is the only general-purpose VLIW, a hybrid VLIW (EPIC, Explicitly Parallel Instruction Computing).