
Multiple-Issue Processors
- Pipelining can achieve a CPI close to 1
  - Mechanisms for handling hazards
  - Static or dynamic scheduling
  - Static or dynamic branch handling
- Increase in transistor counts (Moore's Law):
  - More complex hardware mechanisms
  - Or more replicated functional units (more parallelism)
- Solution: start more than one instruction in the same clock cycle
  - CPI < 1 (or IPC > 1, Instructions per Cycle)
- Two approaches:
  - Superscalar: instructions are chosen dynamically by the hardware
  - VLIW (Very Long Instruction Word): instructions are chosen statically by the compiler (and assembled into a single long "instruction")

Inf3 Computer Architecture - 2011-2012
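As a rough illustration of the CPI and IPC relationship on this slide, the Python sketch below uses the standard CPU time relation (instruction count times CPI divided by clock rate). The function names, the instruction count and the clock rate are made up for the example and do not come from the slides.

```python
def ipc_from_cpi(cpi: float) -> float:
    """IPC is simply the reciprocal of CPI."""
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return 1.0 / cpi


def execution_time(instructions: int, cpi: float, clock_hz: float) -> float:
    """CPU time in seconds = instruction count * CPI / clock rate."""
    return instructions * cpi / clock_hz


# Hypothetical workload: one million instructions on a 1 GHz clock.
scalar_time = execution_time(1_000_000, cpi=1.0, clock_hz=1e9)
dual_issue_time = execution_time(1_000_000, cpi=0.5, clock_hz=1e9)

print(ipc_from_cpi(0.5))              # 2.0 instructions per cycle
print(scalar_time / dual_issue_time)  # speedup over the CPI = 1 pipeline
```

A CPI of 0.5 is only reachable if two independent instructions are available in every cycle, which is the constraint the following slides examine.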

Superscalar Processors
- Hardware attempts to issue up to n instructions on every cycle, where n is called the issue width of the processor; the processor is said to have n issue slots and to be an n-issue processor
- Instructions issued must respect data dependences
  - In some cycles not all issue slots can be used
- Extra hardware is needed to detect more combinations of dependences and hazards and to provide more bypasses
- Branches?
  - With branch prediction, we can predict branches and fetch instructions
  - Can we execute such predicted instructions?
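The point that not every issue slot can be filled in every cycle can be sketched with a toy in-order issue model. This is a simplification, not the hardware described in the lecture: it assumes every result is available in the cycle after issue, checks only RAW and WAW dependences inside one issue group, and uses a made-up instruction format of (destination, sources).

```python
from typing import List, Tuple

# Hypothetical instruction format: (destination register, source registers).
Instr = Tuple[str, Tuple[str, ...]]


def issue_groups(program: List[Instr], width: int) -> List[List[Instr]]:
    """Group instructions in program order, at most `width` per cycle.

    A group is closed early when the next instruction reads or writes a
    register written by an instruction already in the same group.
    """
    groups: List[List[Instr]] = []
    i = 0
    while i < len(program):
        group: List[Instr] = []
        written = set()
        while i < len(program) and len(group) < width:
            dest, srcs = program[i]
            if dest in written or any(s in written for s in srcs):
                break  # dependence on this group: wait for the next cycle
            group.append(program[i])
            written.add(dest)
            i += 1
        groups.append(group)
    return groups


program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r1", "r5")),  # reads r1, so it cannot issue with the first one
    ("r6", ("r7", "r8")),
    ("r9", ("r6", "r4")),
]
print([len(g) for g in issue_groups(program, width=2)])  # [1, 2, 1]
```

With a 2-issue machine the four instructions take three cycles instead of two, because two of the four slots in the first and last cycles stay empty.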

Speculative Execution
- Speculative execution: execute control-dependent instructions even when we are not sure if they should be executed
- Hardware undo in case of a misprediction
- Multi-issue + branch prediction + Tomasulo
- Implemented in most current processors
- Key idea: execute out of order but commit in order

Remember Control Hazards in the 5-Stage Pipeline?
- When a branch is executed, the PC is not affected until the branch instruction reaches the MEM stage.
- By this time 3 instructions have been fetched from the fall-through path.

                          c1   c2   c3   c4   c5   c6   c7   c8   c9   c10
  BEQZ R1, label          IF   Reg  ALU  Mem  Reg
  SUB  R4, R2, R5              IF   Reg  ALU  Mem  Reg   (killed)
  AND  R6, R2, R7                   IF   Reg  ALU  Mem  Reg   (killed)
  OR   R8, R2, R9                        IF   Reg  ALU  Mem  Reg   (killed)
  :
  label: XOR R10, R1, R11                     IF   Reg  ALU  Mem  Reg

- Kill the instructions in EX, DEC and IF (SUB, AND and OR) as they move forwards
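The cost of the three killed instructions can be folded into the CPI. The sketch below assumes a base CPI of 1, that only taken branches pay the 3-cycle penalty, and an illustrative branch mix (20% branches, 60% of them taken) that is not taken from the slides.

```python
def cpi_with_branch_penalty(
    base_cpi: float,
    branch_fraction: float,
    taken_fraction: float,
    penalty_cycles: int = 3,
) -> float:
    """Average CPI when every taken branch wastes `penalty_cycles` slots."""
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles


# Hypothetical mix: 20% of instructions are branches, 60% of those are taken.
cpi = cpi_with_branch_penalty(1.0, branch_fraction=0.2, taken_fraction=0.6)
print(round(cpi, 2))  # 1.36
```

Under these assumed numbers the pipeline loses roughly a quarter of its peak throughput to control hazards, which is why a multiple-issue design needs branch prediction.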

Tomasulo with Hardware Speculation
- Issue:
  - Get instruction from queue
  - Issue if an RS is free and an ROB entry is also free
  - Stall if no RS or no free ROB entry
  - Instructions are now tagged with the ROB entry number, not RS.Id
- Execute:
  - Same as before: monitor the CDB and start the instruction when operands are available
- Write Result:
  - Broadcast result with ROB identifier
  - ROB captures result to commit later
  - Store operations are saved in the ROB until store data is available and the store instruction is committed
- Commit:
  - If branch, check prediction and squash following instructions if incorrect
  - If store, send data and address to memory unit and perform write action
  - Else, update register with new value and release ROB entry

[Figure: Tomasulo datapath with a reorder buffer. Blocks shown: instruction fetch unit feeding the instruction queue (st f4, 8(r2); add f4, f5, f3; mul f3, f1, f2; ld f1, 4(r1)), reorder buffer, FP registers, address units, load buffers, reservation stations, FP adders, FP multipliers, memory unit.]
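The issue, write-result and commit steps can be sketched with a minimal reorder buffer model. This is only an illustration under strong assumptions: it handles register-writing instructions only (no stores, no branch squashing, no reservation stations), and the class and method names are invented for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional


@dataclass
class RobEntry:
    tag: int                       # ROB entry number used to tag the instruction
    dest: str                      # architectural destination register
    value: Optional[float] = None
    done: bool = False


class ReorderBuffer:
    def __init__(self, size: int) -> None:
        self.size = size
        self.entries: Deque[RobEntry] = deque()
        self.next_tag = 0

    def issue(self, dest: str) -> Optional[int]:
        """Allocate an entry in program order; None means stall (ROB full)."""
        if len(self.entries) == self.size:
            return None
        entry = RobEntry(tag=self.next_tag, dest=dest)
        self.entries.append(entry)
        self.next_tag += 1
        return entry.tag

    def write_result(self, tag: int, value: float) -> None:
        """Capture a broadcast result; this may happen in any order."""
        for entry in self.entries:
            if entry.tag == tag:
                entry.value, entry.done = value, True
                return
        raise KeyError(tag)

    def commit(self, registers: Dict[str, float]) -> List[int]:
        """Retire finished entries from the head only, so commit is in order."""
        committed = []
        while self.entries and self.entries[0].done:
            entry = self.entries.popleft()
            registers[entry.dest] = entry.value
            committed.append(entry.tag)
        return committed


regs: Dict[str, float] = {}
rob = ReorderBuffer(size=4)
t0, t1, t2 = rob.issue("f1"), rob.issue("f3"), rob.issue("f4")
rob.write_result(t2, 3.0)   # the youngest instruction finishes first
print(rob.commit(regs))     # [] because the head entry is not done yet
rob.write_result(t0, 1.0)
print(rob.commit(regs))     # [0]
rob.write_result(t1, 2.0)
print(rob.commit(regs))     # [1, 2]
```

Because registers are updated only at commit, a mispredicted branch could be undone by discarding the younger entries before they reach the head; that squash path is left out of this sketch.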

VLIW Processors
- Compiler chooses and packs independent instructions into a single long instruction word, or bundle
- No need for hardware to check the instructions for dependences
- Not all portions of the long instruction word will be used in every cycle
- Compiler must be able to expose a lot of parallelism in the instruction flow
- Example:

  Memory op 1      Memory op 2      fp op 1          fp op 2          int op
  ld f18,-32(r1)   ld f22,-40(r1)   addd f4,f0,f2    addd f8,f6,f2    (unused)
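How a compiler might fill such bundles can be sketched with a greedy packer. This is not the scheduling algorithm of any particular VLIW compiler: it assumes unit latency, considers only RAW dependences (as if registers had already been renamed), and uses the slot mix from the example above (two memory slots, two FP slots, one integer slot). The operation tuples and the extra dependent `addd f20` are invented for the illustration.

```python
from typing import Dict, List, Tuple

SLOTS = {"mem": 2, "fp": 2, "int": 1}  # slots per bundle, as in the example

# Hypothetical op format: (text, slot kind, destination, sources).
Op = Tuple[str, str, str, Tuple[str, ...]]


def pack_bundles(ops: List[Op]) -> List[Dict[str, List[str]]]:
    """Place each op in the earliest bundle that has a free slot of its kind
    and comes after the bundles producing its source registers."""
    bundles: List[Dict[str, List[str]]] = []
    ready_at: Dict[str, int] = {}  # register -> first bundle allowed to read it
    for text, kind, dest, srcs in ops:
        b = max((ready_at.get(s, 0) for s in srcs), default=0)
        while True:
            if b == len(bundles):
                bundles.append({k: [] for k in SLOTS})
            if len(bundles[b][kind]) < SLOTS[kind]:
                break
            b += 1
        bundles[b][kind].append(text)
        ready_at[dest] = b + 1
    return bundles


ops = [
    ("ld f18,-32(r1)", "mem", "f18", ("r1",)),
    ("ld f22,-40(r1)", "mem", "f22", ("r1",)),
    ("addd f4,f0,f2", "fp", "f4", ("f0", "f2")),
    ("addd f8,f6,f2", "fp", "f8", ("f6", "f2")),
    ("addd f20,f18,f2", "fp", "f20", ("f18", "f2")),  # depends on the first load
]
for number, bundle in enumerate(pack_bundles(ops)):
    print(number, bundle)
```

The first four operations land in one bundle with the integer slot left empty, and the dependent add is pushed to a second bundle that uses only one of its five slots.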

Superscalar vs. VLIW Processors
+ Superscalar can handle dynamic events like cache misses, unpredictable memory dependences, branches, etc.
+ Superscalar can exploit old binaries from previous implementations
- Superscalar complexity limits issue width to 4 or 8
+ VLIW requires a much simpler hardware implementation
+ VLIW implementations can have wider issue
- VLIW requires more complex compiler support
- VLIW cannot use old binaries when the pipeline implementation changes
- VLIW code size increases because of empty issue slots

What Are the Practical Limitations to ILP?
- Limitations on max issue width and instruction window size
- Effects of realistic branch prediction
- The effect of limited numbers of registers (ROB size)
- The effects of cache performance

  Benchmark   Instruction issues per cycle
  gcc         55
  espresso    63
  li          18
  fpppp       75
  doduc       119
  tomcatv     150

H&P, fig 3.1 (p.157) shows the available ILP in a perfect processor, with none of the above constraints. This is shown for 6 of the SPEC92 benchmarks: the first 3 are integer and the final 3 are floating-point benchmarks. These levels of ILP are impossible to achieve in practice, due to the limitations on window size, issue width, branch prediction accuracy and cache performance.

Effect of Instruction Window
[Figure: bar chart of instructions per clock (axis 0 to 160) for gcc, espresso, li, fpppp, doduc and tomcatv at instruction window sizes of Infinite, 2048, 512, 128 and 32.]

Branch Prediction Limits ILP
- We've looked at various branch prediction schemes
- Here we see the accuracy achieved by 3 standard schemes
- Profile-based, 2-bit counter and tournament predictors are in use today

H&P, fig 3.2 (p.159) shows the branch prediction accuracy for three types of branch prediction scheme.

[Figure: bar chart of branch prediction accuracy (axis 0 to 100) for gcc, espresso, li, fpppp, doduc and tomcatv under the profile-based, 2-bit counter and tournament predictors.]
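Of the three schemes named here, the 2-bit counter is the simplest to sketch. The table size, the indexing by word-aligned PC bits, the weakly-not-taken initial state and the loop-branch trace below are all assumptions chosen for the illustration, not details from the slides.

```python
from typing import List, Tuple


class TwoBitPredictor:
    """Table of 2-bit saturating counters: 0-1 predict not taken, 2-3 taken."""

    def __init__(self, entries: int = 1024) -> None:
        self.entries = entries
        self.counters = [1] * entries  # assumed start: weakly not taken

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries  # drop byte offset, wrap into table

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor: TwoBitPredictor, trace: List[Tuple[int, bool]]) -> float:
    """Fraction of branches in `trace` that the predictor gets right."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)


# Hypothetical loop branch: taken 9 times, then not taken, repeated 10 times.
trace = [(0x40, taken) for _ in range(10) for taken in [True] * 9 + [False]]
print(accuracy(TwoBitPredictor(), trace))  # 0.89
```

After warm-up the counter mispredicts only the loop exit, so the accuracy approaches 90% for this trace; a single mispredicted exit does not flip the prediction for the next loop entry, which is the point of using two bits.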

Effect of Branch Prediction

Effect of Limited ROB

Limits to Multiple-Issue
- Inherent limitation of ILP in most programs:
  - In practice we would need N independent instructions to keep a W-issue processor busy, where N = W * pipeline depth
  - Data and control dependences significantly limit the amount of ILP
- Complexity of the hardware based on issue width:
  - Number of functional units increases linearly: OK
  - Number of ports for register file increases linearly: bad
  - Number of ports for memory increases linearly: bad
  - Number of dependence tests increases quadratically: bad
  - Bypass/forwarding logic and wires increase quadratically: bad
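The linear versus quadratic growth can be made concrete with a small calculation. The sketch assumes that N is the issue width multiplied by the number of pipeline stages, that each instruction has two source operands, and that every source of an instruction is compared against the destination of every earlier instruction in the same issue group; the 5-stage depth and the function names are illustrative choices.

```python
def instructions_in_flight(issue_width: int, pipeline_stages: int) -> int:
    """Independent instructions needed to keep every slot of every stage busy."""
    return issue_width * pipeline_stages


def dependence_comparators(issue_width: int, sources_per_instr: int = 2) -> int:
    """Comparisons needed to cross-check one issue group for dependences.

    There are W*(W-1)/2 ordered (earlier, later) pairs in a group, and each
    source of the later instruction is compared with the earlier destination.
    """
    return sources_per_instr * issue_width * (issue_width - 1) // 2


for width in (1, 2, 4, 8):
    print(width, instructions_in_flight(width, 5), dependence_comparators(width))
```

Doubling the width from 4 to 8 doubles the number of instructions that must be in flight (20 to 40 with 5 stages) but more than quadruples the comparator count (12 to 56).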

Summary of Factors Limiting ILP in Real Programs
- Compared with an ideal processor:
  - Finite number of registers (introduces WAW and WAR stalls)
  - Imperfect branch prediction (pipeline flushes)
  - Limited issue width
  - Instruction fetch delays (cache misses)
  - Limited instruction window
- Implications for future performance growth?
  - Single processor has inherent limits
  - To use future silicon area, need to go to multiple cores