CISC 360 Computer Architecture Seth Morecraft (morecraf@udel.edu) Course Web Site: http://www.eecis.udel.edu/~morecraf/cisc360

Static & Dynamic Scheduling Scheduling: act of finding independent instructions Static done at compile time by the compiler (software) Dynamic done at runtime by the processor (hardware)

Code Scheduling Why schedule code? Scalar pipelines: fill in load-to-use delay slots to improve CPI Superscalar: place independent instructions together As above, load-to-use delay slots Allow multiple-issue decode logic to let them execute at the same time

Compiler Scheduling Compiler can schedule (move) instructions to reduce stalls Basic pipeline scheduling: eliminate back-to-back load-use pairs Example code sequence: a = b + c; d = f - e;
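The payoff of filling load-to-use delay slots can be sketched with a toy model. Everything here is a hypothetical illustration (the register names, the instruction tuples, and the one-cycle load-to-use delay are assumptions, not the course's machine):

```python
# Hypothetical machine model: one stall whenever an instruction consumes a
# load result in the very next slot. Instruction = (dest, srcs, is_load).
def count_load_use_stalls(insns):
    stalls = 0
    for i in range(1, len(insns)):
        dest, _, is_load = insns[i - 1]
        _, srcs, _ = insns[i]
        if is_load and dest in srcs:  # consumer immediately behind its load
            stalls += 1
    return stalls

# a = b + c; d = f - e;  compiled naively: each add/sub trails its loads
naive = [
    ("r1", (), True),             # ld r1, [b]
    ("r2", (), True),             # ld r2, [c]
    ("r3", ("r1", "r2"), False),  # add r3, r1, r2  <- uses r2 right after load
    ("r4", (), True),             # ld r4, [f]
    ("r5", (), True),             # ld r5, [e]
    ("r6", ("r4", "r5"), False),  # sub r6, r4, r5  <- uses r5 right after load
]
# Scheduled: hoist the independent loads so no load feeds the next slot
scheduled = [
    ("r1", (), True),
    ("r2", (), True),
    ("r4", (), True),
    ("r5", (), True),
    ("r3", ("r1", "r2"), False),
    ("r6", ("r4", "r5"), False),
]
print(count_load_use_stalls(naive))      # 2
print(count_load_use_stalls(scheduled))  # 0
```

The two computations are independent, which is exactly what gives the compiler something to put in the delay slots.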

Compiler Scheduling

Compiler Scheduling Requires Large scheduling scope Independent instructions to put between load-use pairs + Original example: large scope, two independent computations This example: small scope, one computation

Compiler Scheduling Requires Compiler can create larger scheduling scopes For example: loop unrolling & function inlining * Side Note: Branches can really mess this up
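Loop unrolling, one of the scope-enlarging transformations above, can be sketched in plain code (a hypothetical array-increment loop; the unroll factor of 4 and the assumption that the trip count divides evenly are mine, not the slides'):

```python
# Original loop: one update per iteration, tiny scheduling scope
def inc_all(A):
    for i in range(len(A)):
        A[i] += 1

# Unrolled by 4 (assumes len(A) % 4 == 0): four independent updates per
# iteration give the scheduler more instructions to interleave
def inc_all_unrolled(A):
    for i in range(0, len(A), 4):
        A[i] += 1
        A[i + 1] += 1
        A[i + 2] += 1
        A[i + 3] += 1

A, B = list(range(8)), list(range(8))
inc_all(A)
inc_all_unrolled(B)
print(A == B)  # True: same result, larger scheduling scope
```

A real compiler would also add a remainder loop for trip counts that don't divide evenly, and a branch inside the body would break the unrolled block apart, which is the "branches can really mess this up" caveat.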

Compiler Scheduling Requires Enough registers To hold additional live values Example code contains 7 different values (including sp) Before: max 3 values live at any time, 3 registers enough After: max 4 values live, 3 registers not enough

Compiler Scheduling Requires Alias analysis Ability to tell whether load/store reference same memory locations Effectively, whether load/store can be rearranged Must be conservative

Compiler Scheduling Limitations Scheduling scope Example: can't generally move memory operations past branches Limited number of registers (set by ISA) Inexact memory aliasing information Often prevents the compiler from reordering loads above stores Cache misses (or any runtime event) confound scheduling How can the compiler know which loads will miss? Can impact the compiler's scheduling decisions

Can Hardware Overcome These Limits? Dynamically-scheduled processors Also called out-of-order processors Hardware re-schedules insns within a sliding window of von Neumann insns As with pipelining and superscalar, ISA unchanged Same hardware/software interface, appearance of in-order execution

Can Hardware Overcome These Limits? Increases scheduling scope Does loop unrolling transparently! Uses branch prediction to unroll branches Examples: Pentium Pro/II/III (3-wide), Core 2 (4-wide), Alpha 21264 (4-wide), MIPS R10000 (4-wide), Power5 (5-wide)

Out-of-Order to the Rescue Dynamic scheduling done by the hardware Still superscalar, but now out-of-order, too Allows instructions to issue when dependences are ready

Out-of-Order Execution Also called dynamic scheduling Done by the hardware on-the-fly during execution Looks at a window of instructions waiting to execute Each cycle, picks the next ready instruction(s)

Out-of-Order Execution Two steps to enable out-of-order execution: Step #1: Register renaming to avoid false dependencies Step #2: Dynamically schedule to enforce true dependencies Key to understanding out-of-order execution: Data dependencies

Dependence types RAW (Read After Write) = true dependence WAW (Write After Write) = output dependence (false) WAR (Write After Read) = anti-dependence (false) WAW & WAR are false and can be totally eliminated by renaming
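The three dependence types can be captured in a few lines. This is a sketch over hypothetical two-field instructions (destination register plus source-register set), not any real ISA:

```python
# Classify the dependences from an earlier insn to a later one.
# insn = (dest_reg, src_regs)
def dependences(early, late):
    d_early, s_early = early
    d_late, s_late = late
    out = []
    if d_early in s_late:
        out.append("RAW")  # true dependence: later insn reads earlier result
    if d_early == d_late:
        out.append("WAW")  # output dependence (false): both write same reg
    if d_late in s_early:
        out.append("WAR")  # anti-dependence (false): later write, earlier read
    return out

# add r1, r2, r3  followed by  sub r2, r1, r4
print(dependences(("r1", {"r2", "r3"}), ("r2", {"r1", "r4"})))  # ['RAW', 'WAR']
```

Only the RAW edge constrains execution order; renaming makes the WAR (and any WAW) edges disappear.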

Step #1: Register Renaming To eliminate register conflicts/hazards Architected vs. physical registers: a level of indirection Names: r1, r2, r3 Locations: p1, p2, p3, p4, p5, p6, p7 Original mapping: r1→p1, r2→p2, r3→p3; p4–p7 are available

Step #1: Register Renaming Renaming: conceptually, write each register once + Removes false dependences + Leaves true dependences intact! When to reuse a physical register? After the overwriting insn is done (committed)

Register Renaming Two key data structures: maptable[architectural_reg] => physical_reg Free list: allocate (new) & free registers (implemented as a queue) Algorithm: at decode, for each instruction, look up sources in the map table and map the destination to a newly allocated physical register At commit: once all older instructions have committed, free the overwritten physical register
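The map-table-plus-free-list algorithm can be sketched directly. The register names and the 8-physical-register budget are illustrative assumptions:

```python
from collections import deque

# Decode-time renaming sketch: map table + FIFO free list
maptable = {"r1": "p1", "r2": "p2", "r3": "p3"}
freelist = deque(["p4", "p5", "p6", "p7", "p8"])

def rename(dest, srcs):
    new_srcs = [maptable[s] for s in srcs]  # read sources under current mapping
    prev = maptable[dest]                   # freed later, when this insn commits
    maptable[dest] = freelist.popleft()     # each write gets a fresh physical reg
    return maptable[dest], new_srcs, prev

# Two back-to-back writers of r1: a WAW hazard in architectural names,
# but renaming gives each write its own physical register
print(rename("r1", ["r2", "r3"]))  # ('p4', ['p2', 'p3'], 'p1')
print(rename("r1", ["r1"]))        # ('p5', ['p4'], 'p4')
```

The second call reads the *renamed* r1 (p4), so the true RAW dependence survives, while the two writes no longer conflict. The `prev` register returned by each call is what commit hands back to the free list.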

Also Have Dependencies via Memory WAR/WAW are false dependencies - But can't rename memory the same way as registers Why? Addresses are not known at rename Need to use other tricks

Store Queue (SQ) Solves two problems Allows for recovery of speculative stores Allows out-of-order stores Store Queue (SQ) At dispatch, each store is given a slot in the Store Queue First-in-first-out (FIFO) queue Each entry contains: address, value, and age

Store Queue (SQ) Operation: Dispatch (in-order): allocate entry in SQ (stall if full) Execute (out-of-order): write store value into store queue Commit (in-order): read value from SQ and write into data cache Branch recovery: remove entries from the store queue

Memory Forwarding Loads also search the Store Queue (in parallel with cache access) Conceptually like register bypassing, but different implementation Why? Addresses unknown until execute

Memory WAR Hazards Need to make sure that a load doesn't read a younger store's result Need to get values based on program order, not execution order Bad solution: require all stores/loads to execute in-order Good solution: add age fields to the store queue (SQ) Loads read from the matching store that is earlier (or "older") than they are Another reason the SQ is a FIFO queue
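The age-based search can be sketched as a scan over the store queue. The addresses, values, and dispatch-order ages below are hypothetical:

```python
# Age-ordered store queue sketch: a load forwards from the youngest store
# that is still older than the load and matches its address
store_queue = []  # (age, addr, value), kept in dispatch (FIFO) order

def sq_forward(load_age, load_addr):
    match = None
    for age, addr, value in store_queue:
        if addr == load_addr and age < load_age:
            match = value  # keep scanning: the youngest older store wins
    return match  # None means no forwarding: read the data cache instead

store_queue.append((1, 0x100, 10))
store_queue.append((3, 0x100, 30))
store_queue.append((5, 0x200, 50))
print(sq_forward(4, 0x100))  # 30: forwards from the age-3 store, not age-1
print(sq_forward(2, 0x100))  # 10: the age-3 store is younger than this load
print(sq_forward(6, 0x300))  # None: no matching store, go to cache
```

The second lookup is the WAR case: the age check makes the load ignore the age-3 store even though that store may have already executed.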

Load Queue Detects load ordering violations Load execution: Write address into Load Queue Also note any store forwarded from here Store execution: Search Load Queue Younger load with same addr? Didn't forward from younger store? (optimization for full renaming)

Store Queue + Load Queue Store Queue: handles forwarding Entry per store (allocated @ dispatch, deallocated @ commit) Written by stores (@ execute) Searched by loads (@ execute) Read from to write data cache (@ commit)

Store Queue + Load Queue Load Queue: detects ordering violations Entry per load (allocated @ dispatch, deallocated @ commit) Written by loads (@ execute) Searched by stores (@ execute) Both together Allows aggressive load scheduling Stores don t constrain load execution
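The load queue's violation check at store execution can be sketched the same way (ages, addresses, and the flat list representation are all assumptions of this toy model):

```python
# Load queue sketch: checked when a store executes (ages = dispatch order)
load_queue = []  # (age, addr, forwarded_from_age or None)

def store_executes(store_age, store_addr):
    violations = []
    for age, addr, fwd in load_queue:
        # younger load, same address, and it didn't already get this store's value
        if age > store_age and addr == store_addr and fwd != store_age:
            violations.append(age)
    return violations  # these loads (and everything younger) must be squashed

load_queue.append((4, 0x100, None))  # this load executed early, read the cache
load_queue.append((6, 0x200, None))
print(store_executes(2, 0x100))  # [4]: the age-4 load read stale data
print(store_executes(7, 0x200))  # []: store is younger than every load here
```

A hit means the load ran before an older store to the same address and got the wrong value, so the pipeline flushes from that load onward, exactly the aggressive-scheduling bet the slides describe.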

Step #2: Dynamic Scheduling Instructions fetch/decoded/renamed into Instruction Buffer Also called instruction window or instruction scheduler Instructions (conceptually) check ready bits every cycle Execute oldest ready instruction, set output as ready
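The ready-bit check and oldest-first select can be sketched in a few lines. The window contents and physical register names are hypothetical:

```python
# Select-logic sketch: each cycle, issue the oldest instruction whose source
# physical registers are all ready, then mark its destination ready (wakeup)
window = [  # (age, dest, srcs) in an out-of-order instruction window
    (1, "p4", ["p9"]),       # stuck: p9 is, say, a pending cache miss
    (2, "p5", ["p2", "p3"]),
    (3, "p6", ["p5"]),
]
ready = {"p2", "p3"}

def issue_one():
    for age, dest, srcs in sorted(window):  # oldest first
        if all(s in ready for s in srcs):
            window.remove((age, dest, srcs))
            ready.add(dest)                 # wake up dependents
            return age
    return None  # nothing ready this cycle

print(issue_one())  # 2: the age-1 insn is blocked, so the next oldest issues
print(issue_one())  # 3: its input p5 just became ready
print(issue_one())  # None: only the miss-blocked insn remains
```

This is where the out-of-order behavior actually happens: the blocked age-1 instruction does not stop younger, independent instructions from issuing around it.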

Dynamic Scheduling Operation (Recap) Dynamic scheduling Totally in the hardware (not visible to software) Also called out-of-order execution (OoO) Fetch many instructions into instruction window Use branch prediction to speculate past (multiple) branches Flush pipeline on branch misprediction

Dynamic Scheduling Operation (Recap) Rename registers to avoid false dependencies Execute instructions as soon as possible Register dependencies are known Handling memory dependencies is more tricky Commit instructions in order If anything strange happens before commit, just flush the pipeline

How much out-of-order? Core i7 (Sandy Bridge): 168-entry reorder buffer, 160 integer registers, 54-entry scheduler

Out of Order: Benefits Allows speculative re-ordering Loads / stores Branch prediction to look past branches Done by hardware Compiler may want different schedule for different hw configs Hardware has only its own configuration to deal with

Out of Order: Benefits Schedule can change due to cache misses A different schedule is optimal on a miss than on a cache hit Memory-level parallelism Executes around cache misses to find independent instructions Finds and initiates independent misses, reducing memory latency Especially good at hiding L2 hits (~12 cycles in Core i7)

Challenges for Out-of-Order Cores Design complexity More complicated than in-order? Certainly! But, we have managed to overcome the design complexity Clock frequency Can we build a high ILP machine at high clock frequency? Yes, with some additional pipe stages, clever design

Challenges for Out-of-Order Cores Limits to (efficiently) scaling the window and ILP Large physical register file Fast register renaming/wakeup/select/load queue/store queue Active areas of micro-architectural research Branch & memory dependence prediction (limits effective window size) 95% branch prediction accuracy: 1 mis-prediction per 20 branches, or roughly 1 per 100 insns Plus all the issues of building a wide in-order superscalar Power efficiency Today, even mobile phone chips are out-of-order cores

Wrap-up: Hardware vs. Software Scheduling Static scheduling Performed by the compiler, limited in several ways Dynamic scheduling Performed by the hardware, overcomes those limitations Static limitation → dynamic mitigation: Number of registers in the ISA → register renaming Scheduling scope → branch prediction & speculation Inexact memory aliasing information → speculative memory ops Unknown latencies of cache misses → execute when ready Which to do? Compiler does what it can, hardware does the rest Why? Dynamic scheduling is needed to sustain more than 2-way issue Helps with hiding memory latency (execute around misses) Intel Core i7 is four-wide execute with a scheduling window of 100+ Even mobile phones have dynamically scheduled cores (ARM A9)