Similar documents
INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

VLIW Processors. VLIW Processors

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

WAR: Write After Read

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x

CS352H: Computer Systems Architecture

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

Putting it all together: Intel Nehalem.

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

CPU Performance Equation

The Pentium 4 and the G4e: An Architectural Comparison

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

IA-64 Application Developer s Architecture Guide

CPU Organization and Assembly Language

PROBLEMS #20,R0,R1 #$3A,R2,R4

Introduction to Cloud Computing

Pipelining Review and Its Limitations

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

Computer Organization and Components

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

Intel Pentium 4 Processor on 90nm Technology

Computer Architecture TDTS10

LSN 2 Computer Processors

Thread level parallelism

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager

POWER8 Performance Analysis

"JAGUAR AMD s Next Generation Low Power x86 Core. Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012

Generations of the computer. processors.

Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield.

2

The Microarchitecture of Superscalar Processors

Instruction Set Design

Intel 8086 architecture

SOC architecture and design

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University

Design Cycle for Microprocessors

Chapter 6. Inside the System Unit. What You Will Learn... Computers Are Your Future. What You Will Learn... Describing Hardware Performance

RUNAHEAD EXECUTION: AN EFFECTIVE ALTERNATIVE TO LARGE INSTRUCTION WINDOWS

Multithreading Lin Gao cs9244 report, 2006

The IA-32 processor architecture

Pentium vs. Power PC Computer Architecture and PCI Bus Interface

Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to:

OC By Arsene Fansi T. POLIMI

Instruction Set Architecture (ISA)

CS521 CSE IITG 11/23/2012

Week 1 out-of-class notes, discussions and sample problems

Chapter 1 Computer System Overview

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

Operating Systems. Virtual Memory

Coming Challenges in Microarchitecture and Architecture

Overview. CPU Manufacturers. Current Intel and AMD Offerings

Multi-Threading Performance on Commodity Multi-Core Processors

CPU Session 1. Praktikum Parallele Rechnerarchtitekturen. Praktikum Parallele Rechnerarchitekturen / Johannes Hofmann April 14,

Lecture: Pipelining Extensions. Topics: control hazards, multi-cycle instructions, pipelining equations

High Performance Processor Architecture. André Seznec IRISA/INRIA ALF project-team

Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

Module: Software Instruction Scheduling Part I

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

The Microarchitecture of the Pentium 4 Processor

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings

Processor Architectures

Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 processors By Dr David Levinthal PhD. Version 1.0

NVIDIA Tegra 4 Family CPU Architecture

Low Power AMD Athlon 64 and AMD Opteron Processors

<Insert Picture Here> T4: A Highly Threaded Server-on-a-Chip with Native Support for Heterogeneous Computing

Introduction to Microprocessors

Computer Organization and Architecture

Central Processing Unit (CPU)

Data Dependences. A data dependence occurs whenever one instruction needs a value produced by another.

AMD GPU Architecture. OpenCL Tutorial, PPAM Dominik Behr September 13th, 2009

Chapter 4 System Unit Components. Discovering Computers Your Interactive Guide to the Digital World

a storage location directly on the CPU, used for temporary storage of small amounts of data during processing.

Comparative Performance Review of SHA-3 Candidates

ELE 356 Computer Engineering II. Section 1 Foundations Class 6 Architecture

Enterprise Applications

Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats

Discovering Computers Living in a Digital World

SPARC64 VIIIfx: CPU for the K computer

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

EE282 Computer Architecture and Organization Midterm Exam February 13, (Total Time = 120 minutes, Total Points = 100)

The Motherboard Chapter #5

Performance Analysis of Dual Core, Core 2 Duo and Core i3 Intel Processor

Pipeline Hazards. Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by Arvind and Krste Asanovic

CHAPTER 7: The CPU and Memory

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

THE BASICS OF PERFORMANCE- MONITORING HARDWARE

Operating System Overview. Otto J. Anshus

Transcription:

More on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction Multiple-issue issue static dynamic (superscalar) Pipelines in real processors - Pentium 4, AMD Athlon (next class) Pipelining Exploits parallelism between instructions Instruction-level parallelism (ILP) Gain performance Pipelining overlaps more instructions Subdivides instruction execution into subtasks Data Hazards Read after Write (RAW) add r0,, r1, r2 sub r4, r3, r0 Write after Write (WAW) add r0,, r1, r2 sub r0,, r4, r5 Write after Read (WAR) add r2, r1, r0 sub r0,, r3, r4 WAW and WAR Only possible if instructions can be executed in parallel or out-of-order register renaming Control Hazards Resolution of branch condition does not occur until MEM stage Causes stalls Approaches to handling: Stall (pipeline bubbles) Delayed branch slot Static prediction» Does not rely on runtime information» various approaches: always taken (or, always not taken) backwards taken, forwards not taken (loops) If branch is mispredicted, need to undo all instructions in the pipeline along the incorrect path (need extra resources) Dynamic Branch Prediction Use execution behavior to make a prediction bimodal branch prediction 2 bit counters use recent branch behavior to predict future Not taken 00 01 10 11 Taken

Multiple Issue Another approach to increasing instruction level parallelism Launch many instructions in each stage Need more resources (instead of 1 washer and dryer, may need multiple washers and dryers, AND more people to help fold clothes) Disadvantage Disadvantage» Need additional resources to keep all resources busy Multiple Issue Additional tasks Packaging instructions into issue slots» how many instructions?» which instructions? Handling data and control hazards Static Multiple Issue Uses compiler to determine: issue packets handle data and control hazards» register renaming removes WAR and WAW dependences add r0,, r1, r2 sub r0,, r4, r5 Rename r0 in sub to r3 Dynamic Multiple Issue aka superscalar Dynamically chooses instructions to execute next, possibly reorders them (to avoid stalls) Reservation stations Holds operands and operation information Reorder buffer Stores results to be written to register file Writes results to register file in-order Real Processors Intel Pentium 4 and AMD Athlon Main Points Difference in approaches Intel Pentium 4» Increase pipeline depth to handle more instructions» Increase clock frequency AMD Athlon» Try to balance performance (IPC) and operating frequency Big Picture Similarities Both use» pipelining» dynamic multiple issue (superscalar)» out-of of-order order execution» branch prediction Both execute IA-32 instructions

Intel Pentium 4 Processor Outline Pentium 4 20 stages Instruction Set Architecture Instruction Stream Pentium 4 revisions Northwood (1/2002) 21 stages Prescott (2/2004) 31 stages Introduction Intel Pentium 4 processor Latest IA-32 processor equipped with a full set of IA-32 SIMD (single-instruction instruction multiple data) operations IA-32 Intel architecture 32-bit (IA-32) 80386 instruction set (1985) CISC, 32-bit addresses Registers Eight 32-bit registers Eight FP stack registers 6 segment registers IA-32 (cont d) Pentium III vs. Pentium 4 Pipeline Addressing modes Register indirect (mem[reg( mem[reg]) Base + displacement (mem[reg( + const]) Base + scaled index (mem[reg( + (2 scale x index)]) Base + scaled index + displacement (mem[reg( + (2 scale x index) + displacement]) SIMD instruction sets MMX (Pentium II)» Eight 64-bit MMX registers, integer ops only SSE (Streaming SIMD Extension, Pentium III)» Eight 128-bit registers

Instruction Set Architecture Instruction Stream Pentium4 ISA = Pentium3 ISA + SSE2 (Streaming SIMD Extensions 2) SSE2 is an architectural enhancement to the IA-32 architecture Instruction Stream Features Added Trace Cache Improved branch predictor Terminology µop Micro-op, op, already decoded RISC-like instructions Front end instruction fetch and issue Front End Prefetches instructions that are likely to be executed Fetches instructions that haven t t been prefetched Decodes instruction into µops Generates µops for complex instructions or special purpose code Predicts branches Decoder Single decoder that can operate at a maximum of 1 instruction per cycle Receives instructions from L2 cache 64 bits at a time Some complex instructions must enlist the help of the microcode ROM Trace Cache Primary instruction cache Stores decoded µops ~12K capacity On a Trace Cache miss, instructions are fetched and decoded from the L2 cache

What is a Trace Cache? I1 I2 br r2, L1 I3 I4 I5 L1: I6 I7 Traditional instruction cache Trace cache I1 I2 I3 I4 Pentium 4 Trace Cache Has its own branch predictor that directs where instruction fetching needs to go next in the Trace Cache Removes Decoding costs on frequently decoded instructions Extra latency to decode instructions upon branch mispredictions I1 I2 I6 I7 Microcode ROM Used for complex IA-32 instructions (> 4 µops), such as string move, and for fault and interrupt handling When a complex instruction is encountered, the Trace Cache jumps into the microcode ROM which then issues the µops After the microcode ROM finishes, the front end of the machine resumes fetching µops from the Trace Cache Branch Prediction Predicts ALL near branches Includes conditional branches, unconditional calls and returns, and indirect branches Does not predict far transfers Includes far calls, irets,, and software interrupts Branch Prediction Dynamically predict the direction and target of branches based on PC using branch target buffer (BTB) If no dynamic prediction is available, statically predict Taken for backwards looping branches Not taken for forward branches Traces are built across predicted branches to avoid branch penalties Branch Target Buffer Stores branch target addresses Uses a branch history table and a branch target buffer to predict Updating occurs when branch is retired

Return Address Stack 16 entries Predicts return addresses for procedure calls Allows branches and their targets to coexist in a single cache line Increases parallelism since decode bandwidth is not wasted Branch Hints P4 permits software to provide hints to the branch prediction and trace formation hardware to enhance performance Take the forms of prefixes to conditional branch instructions Used only at trace build time and have no effect on already built traces Out-of of-order Execution Designed to optimize performance by handling the most common operations in the most common context as fast as possible 126 µops can in flight at once Up to 48 loads / 24 stores Issue Instructions are fetched and decoded by translation engine Translation engine builds instructions into sequences of µops Stores µops to trace cache Trace cache can issue 3 µops per cycle Execution Execution Units Can dispatch up to 6 µops per cycle Exceeds trace cache and retirement µop bandwidth Allows for greater flexibility in issuing µops to different execution units

Double-pumped ALUs ALU executes an operation on both rising and falling edges of clock cycle Retirement Can retire 3 µops per cycle Precise exceptions Reorder buffer to organize completed µops Also keeps track of branches and sends updated branch information to the BTB Execution Pipeline Execution Pipeline Register Renaming Register Renaming (2) 8-entry architectural register file 128-entry physical register file 2 RAT Frontend RAT and Retirement RAT Data does not need to be copied between register files when the instruction retires

Loses consistently to AMD In terms of performance, the Pentium 4 is as slow or slower than existing Pentium III and AMD Athlon processors In terms of price, an entry level Pentium 4 sells for about double the cost of a similar Pentium III or AMD Athlon-based system 1.5GHz clock rate is more hype than substance Northwood Prescott 1/2002 Differences from Willamette Socket 478 21 stage pipeline 512 KB L2 cache 2.0 GHz, 2.2 GHz clock frequency 0.13µ fabrication process (130 nm)» 55 million transistors Prescott 2/2004 Differences 31 stage pipeline! 1MB L2 cache 3.8 GHz clock frequency 0.9µ fabrication process SSE3