CS:APP Chapter 4
Computer Architecture: Wrap-Up

William J. Taffe, Plymouth State University
using the slides of Randal E. Bryant, Carnegie Mellon University

Overview
  Wrap-Up of PIPE Design
    - Performance analysis
    - Fetch stage design
    - Exceptional conditions
  Modern High-Performance Processors
    - Out-of-order execution

CS 4300 Computer Architecture

Performance Metrics
  Clock rate
    - Measured in megahertz or gigahertz
    - Function of stage partitioning and circuit design
      - Keep amount of work per stage small
  Rate at which instructions executed
    - CPI: cycles per instruction
    - On average, how many clock cycles does each instruction require?
    - Function of pipeline design and benchmark programs
      - E.g., how frequently are branches mispredicted?

CPI for PIPE
  CPI ≈ 1.0
    - Fetch instruction each clock cycle
    - Effectively process new instruction almost every cycle
      - Although each individual instruction has latency of 5 cycles
  CPI > 1.0
    - Sometimes must stall or cancel branches
  Computing CPI
    - C clock cycles
    - I instructions executed to completion
    - B bubbles injected (C = I + B)
    - CPI = C/I = (I + B)/I = 1.0 + B/I
    - Factor B/I represents average penalty due to bubbles

CPI for PIPE (Cont.)
  B/I = LP + MP + RP  (typical values shown)
  LP: Penalty due to load/use hazard stalling
    - Fraction of instructions that are loads: 0.25
    - Fraction of load instructions requiring stall: 0.20
    - Number of bubbles injected each time: 1
    - LP = 0.25 * 0.20 * 1 = 0.05
  MP: Penalty due to mispredicted branches
    - Fraction of instructions that are cond. jumps: 0.20
    - Fraction of cond. jumps mispredicted: 0.40
    - Number of bubbles injected each time: 2
    - MP = 0.20 * 0.40 * 2 = 0.16
  RP: Penalty due to ret instructions
    - Fraction of instructions that are returns: 0.02
    - Number of bubbles injected each time: 3
    - RP = 0.02 * 3 = 0.06
  Net effect of penalties: 0.05 + 0.16 + 0.06 = 0.27
    - CPI = 1.27 (Not bad!)
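
The penalty arithmetic above can be reproduced with a short Python sketch. The function name pipe_cpi and its parameter names are hypothetical, chosen only for this illustration; the numbers are the typical values from the slide.

```python
def pipe_cpi(load_frac, load_stall_frac, load_bubbles,
             jump_frac, mispredict_frac, jump_bubbles,
             ret_frac, ret_bubbles):
    """Return (CPI, LP, MP, RP) using CPI = 1.0 + B/I with B/I = LP + MP + RP."""
    lp = load_frac * load_stall_frac * load_bubbles   # load/use stalls
    mp = jump_frac * mispredict_frac * jump_bubbles   # mispredicted branches
    rp = ret_frac * ret_bubbles                       # ret instructions
    return 1.0 + lp + mp + rp, lp, mp, rp


# Typical values from the slide.
cpi, lp, mp, rp = pipe_cpi(0.25, 0.20, 1, 0.20, 0.40, 2, 0.02, 3)
print(f"LP={lp:.2f} MP={mp:.2f} RP={rp:.2f} CPI={cpi:.2f}")
```

Changing one input at a time shows which hazard dominates: with these values the branch misprediction term contributes more than the other two combined.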

Fetch Logic Revisited
  During Fetch Cycle
    1. Select PC
    2. Read bytes from instruction memory
    3. Examine icode to determine instruction length
    4. Increment PC
  Timing
    - Steps 2 & 4 require significant amount of time

  [Figure: fetch stage logic. Blocks: Select PC, Instruction memory, Split, Align, Instr valid, Need regids, Need valC, PC increment, Predict PC. Signals: Byte 0, Bytes 1-5. Pipeline registers: F (predPC) and D (icode, ifun, rA, rB, valC, valP). Inputs to Select PC: M_icode, M_Bch, M_valA, W_icode, W_valM.]

Standard Fetch Timing
  [Figure: within 1 clock cycle, in sequence: Select PC, Mem. Read (producing need_regids, need_valC), Increment.]
  Must Perform Everything in Sequence
    - Can't compute incremented PC until know how much to increment it by

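A Fast PC Increment Circuit
  [Figure: the PC is split into its high-order 29 bits and low-order 3 bits. A fast 3-bit adder combines the low-order 3 bits with the inputs need_regids, 0, need_valC and produces a carry. A slow 29-bit incrementer works on the high-order 29 bits. A MUX with inputs 0 and 1, controlled by the carry, selects the high-order 29 bits of the output incrPC.]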

Modified Fetch Timing
  [Figure: within 1 clock cycle: Select PC, then Mem. Read (producing need_regids, need_valC), 3-bit add, MUX; the 29-bit incrementer runs alongside the memory read. The standard cycle is shown for comparison.]
  29-Bit Incrementer
    - Acts as soon as PC selected
    - Output not needed until final MUX
    - Works in parallel with memory read
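
The split incrementer described on the two slides above can be sketched in Python. This is an illustrative model, not the actual circuit: it assumes a 32-bit PC, assumes the instruction length is 1 + need_regids + 4 * need_valC bytes, and models the constant 1 as a carry-in to the 3-bit adder. The function name fast_incr_pc is hypothetical.

```python
def fast_incr_pc(pc, need_regids, need_valc):
    """Model of the split PC incrementer (assumes a 32-bit PC)."""
    low = pc & 0b111                 # low-order 3 bits
    high = pc >> 3                   # high-order 29 bits

    # Fast path: 3-bit adder. The addend bits are {need_valc, 0, need_regids};
    # the +1 for the opcode byte is modeled as a carry-in (an assumption).
    addend = (int(need_valc) << 2) | int(need_regids)
    total = low + addend + 1         # at most 7 + 5 + 1 = 13, so carry is 0 or 1
    new_low = total & 0b111
    carry = total >> 3

    # Slow path: 29-bit incrementer, started as soon as the PC is known.
    high_plus_1 = (high + 1) & 0x1FFFFFFF

    # Final MUX: the carry picks the original or the incremented high bits.
    new_high = high_plus_1 if carry else high
    return (new_high << 3) | new_low


def reference_incr_pc(pc, need_regids, need_valc):
    """Straightforward 32-bit addition, used only to check the model."""
    return (pc + 1 + int(need_regids) + 4 * int(need_valc)) & 0xFFFFFFFF


print(hex(fast_incr_pc(0x006, 1, 1)))   # 6-byte instruction at 0x006
```

Because the increment is at most 6, the low-order sum can carry at most 1 into the high-order bits, which is why a single incrementer plus a MUX is enough.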

More Realistic Fetch Logic
  [Figure: Fetch Control (driven by Other PC Controls and Instr. Length) feeds a Fetch Box attached to the Instruction Cache; the Fetch Box holds the Current Block and Next Block and outputs the Current Instruction as Byte 0 and Bytes 1-5.]
  Fetch Box
    - Integrated into instruction cache
    - Fetches entire cache block (16 or 32 bytes)
    - Selects current instruction from current block
    - Works ahead to fetch next block
      - As reaches end of current block
      - At branch target

Exceptions
  Conditions under which pipeline cannot continue normal operation
  Causes (and which instruction completes)
    - Halt instruction (Current)
    - Bad address for instruction or data (Previous)
    - Invalid instruction (Previous)
    - Pipeline control error (Previous)
  Desired Action
    - Complete some instructions
      - Either current or previous (depends on exception type)
    - Discard others
    - Call exception handler
      - Like an unexpected procedure call

Exception Examples
  Detect in Fetch Stage
      jmp $-1                      # Invalid jump target
      .byte 0xFF                   # Invalid instruction code
      halt                         # Halt instruction
  Detect in Memory Stage
      irmovl $100,%eax
      rmmovl %eax,0x10000(%eax)    # Invalid address

Exceptions in Pipeline Processor #1
  # demo-exc1.ys
      irmovl $100,%eax
      rmmovl %eax,0x10000(%eax)    # Invalid address
      nop
      .byte 0xFF                   # Invalid instruction code

                                      1  2  3  4  5
  0x000: irmovl $100,%eax             F  D  E  M  W
  0x006: rmmovl %eax,0x10000(%eax)       F  D  E  M
  0x00c: nop                                F  D  E
  0x00d: .byte 0xFF                            F  D

  Exception detected
    - .byte 0xFF: in fetch stage, cycle 4
    - rmmovl: in memory stage, cycle 5
  Desired Behavior
    - rmmovl should cause exception

Exceptions in Pipeline Processor #2
  # demo-exc2.ys
  0x000:    xorl %eax,%eax     # Set condition codes
  0x002:    jne t              # Not taken
  0x007:    irmovl $1,%eax
  0x00d:    irmovl $2,%edx
  0x013:    halt
  0x014: t: .byte 0xFF         # Target

                          1  2  3  4  5  6  7  8  9
  0x000: xorl %eax,%eax   F  D  E  M  W
  0x002: jne t               F  D  E  M  W
  0x014: t: .byte 0xFF          F  D  E  M  W
  0x???: (I'm lost!)               F  D  E  M  W
  0x007: irmovl $1,%eax               F  D  E  M  W

  Exception detected
    - .byte 0xFF at the branch target, in fetch stage
  Desired Behavior
    - No exception should occur

Maintaining Exception Ordering
  Pipeline registers
    W: exc, icode, valE, valM, dstE, dstM
    M: exc, icode, Bch, valE, valA, dstE, dstM
    E: exc, icode, ifun, valC, valA, valB, dstE, dstM, srcA, srcB
    D: exc, icode, ifun, rA, rB, valC, valP
    F: predPC
  - Add exception status field to pipeline registers
  - Fetch stage sets to either AOK, ADR (when bad fetch address), or INS (illegal instruction)
  - Decode & execute pass values through
  - Memory either passes through or sets to ADR
  - Exception triggered only when instruction hits write back
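
A small Python sketch can show why triggering only at write back preserves program order. It is a simplified model under stated assumptions: instruction i (counting from 0) is fetched in cycle i + 1, there are no stalls or cancelled instructions, and each instruction is described by a hypothetical (name, detecting stage, status) tuple. The program is demo-exc1.ys from the earlier slide.

```python
STAGE_OFFSET = {"F": 0, "D": 1, "E": 2, "M": 3, "W": 4}


def first_detected(program):
    """Naive policy: act on whichever exception is detected earliest in time."""
    faults = [(i + 1 + STAGE_OFFSET[stage], name, stat)
              for i, (name, stage, stat) in enumerate(program)
              if stat != "AOK"]
    return min(faults)[1:] if faults else None


def first_at_write_back(program):
    """Status-field policy: an exception triggers only in write back."""
    faults = [(i + 1 + STAGE_OFFSET["W"], name, stat)
              for i, (name, stage, stat) in enumerate(program)
              if stat != "AOK"]
    return min(faults)[1:] if faults else None


# demo-exc1.ys: (name, stage that detects the exception, resulting status)
demo_exc1 = [
    ("irmovl", None, "AOK"),
    ("rmmovl", "M", "ADR"),       # bad data address, found in memory stage
    ("nop", None, "AOK"),
    (".byte 0xFF", "F", "INS"),   # illegal instruction, found in fetch stage
]

print(first_detected(demo_exc1))       # later instruction wins: wrong order
print(first_at_write_back(demo_exc1))  # rmmovl wins: program order
```

In this model the naive policy reports the .byte 0xFF instruction (detected in cycle 4), while the status-field policy reports rmmovl, because instructions reach write back in program order.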

Side Effects in Pipeline Processor
  # demo-exc3.ys
      irmovl $100,%eax
      rmmovl %eax,0x10000(%eax)    # Invalid address
      addl %eax,%eax               # Sets condition codes

                                      1  2  3  4  5
  0x000: irmovl $100,%eax             F  D  E  M  W
  0x006: rmmovl %eax,0x10000(%eax)       F  D  E  M
  0x00c: addl %eax,%eax                     F  D  E

  In cycle 5
    - Exception detected for rmmovl in memory stage
    - Condition code set by addl in execute stage
  Desired Behavior
    - rmmovl should cause exception
    - No following instruction should have any effect

Avoiding Side Effects
  Presence of Exception Should Disable State Update
    - When detect exception in memory stage
      - Disable condition code setting in execute
      - Must happen in same clock cycle
    - When exception passes to write-back stage
      - Disable memory write in memory stage
      - Disable condition code setting in execute stage
  Implementation
    - Hardwired into the design of the PIPE simulator
    - You have no control over this

Rest of Exception Handling
  Calling Exception Handler
    - Push PC onto stack
      - Either PC of faulting instruction or of next instruction
      - Usually pass through pipeline along with exception status
    - Jump to handler address
      - Usually fixed address
      - Defined as part of ISA
  Implementation
    - Haven't tried it yet!

Modern CPU Design
  [Figure: block diagram of a modern CPU. Instruction Control: Fetch Control, Instruction Cache (Address in, Instructions out), Instruction Decode (Operations out), Retirement Unit, Register File. Execution: Functional Units (Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store) and Data Cache (Addr., Data). Signals back to Instruction Control: Operation Results, Register Updates, Prediction OK?]

Instruction Control
  [Figure: Instruction Control portion of the CPU diagram: Fetch Control, Instruction Cache (Address in, Instructions out), Instruction Decode (Operations out), Retirement Unit, Register File.]
  Grabs Instruction Bytes From Memory
    - Based on current PC + predicted targets for predicted branches
    - Hardware dynamically guesses whether branches taken/not taken and (possibly) branch target
  Translates Instructions Into Operations
    - Primitive steps required to perform instruction
    - Typical instruction requires 1-3 operations
  Converts Register References Into Tags
    - Abstract identifier linking destination of one operation with sources of later operations

Execution Unit
  [Figure: Execution portion of the CPU diagram: Operations in; Functional Units (Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store); Data Cache (Addr., Data); Operation Results, Register Updates, and Prediction OK? out.]
  Multiple functional units
    - Each can operate independently
  Operations performed as soon as operands available
    - Not necessarily in program order
    - Within limits of functional units
  Control logic
    - Ensures behavior equivalent to sequential program execution

CPU Capabilities of Pentium III
  Multiple Instructions Can Execute in Parallel
    - 1 load
    - 1 store
    - 2 integer (one may be branch)
    - 1 FP Addition
    - 1 FP Multiplication or Division
  Some Instructions Take > 1 Cycle, but Can be Pipelined

    Instruction                  Latency   Cycles/Issue
    Load / Store                    3           1
    Integer Multiply                4           1
    Integer Divide                 36          36
    Double/Single FP Multiply       5           2
    Double/Single FP Add            3           1
    Double/Single FP Divide        38          38
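
To make the difference between latency and cycles/issue concrete, the following Python sketch estimates how long a run of independent operations occupies one functional unit. It assumes the usual pipelining estimate, latency + (n - 1) * issue interval, with a single unit, no data dependences between the operations, and no other stalls; the dictionary keys and function name are hypothetical.

```python
# (latency, cycles per issue), taken from the table above.
PENTIUM_III = {
    "load/store": (3, 1),
    "integer multiply": (4, 1),
    "integer divide": (36, 36),
    "fp multiply": (5, 2),
    "fp add": (3, 1),
    "fp divide": (38, 38),
}


def cycles_for_independent_ops(op, n):
    """Cycles until the last of n independent ops of one kind completes."""
    if n <= 0:
        return 0
    latency, issue = PENTIUM_III[op]
    # First result after `latency` cycles; each later op starts `issue` cycles
    # after the one before it.
    return latency + (n - 1) * issue


for op in ("integer multiply", "fp multiply", "integer divide"):
    print(op, cycles_for_independent_ops(op, 10))
```

Under these assumptions, ten integer multiplies overlap almost completely, while ten integer divides do not overlap at all because the divider's issue interval equals its latency.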

PentiumPro Block Diagram
  P6 Microarchitecture
    - PentiumPro
    - Pentium II
    - Pentium III
  [Figure: block diagram, from Microprocessor Report 2/16/95]

PentiumPro Operation
  Translates instructions dynamically into Uops
    - Uops 118 bits wide
    - Holds operation, two sources, and destination
  Executes Uops with Out of Order engine
    - Uop executed when
      - Operands available
      - Functional unit available
    - Execution controlled by Reservation Stations
      - Keeps track of data dependencies between uops
      - Allocates resources

PentiumPro Branch Prediction
  Critical to Performance
    - 11-15 cycle penalty for misprediction
  Branch Target Buffer
    - 512 entries
    - 4 bits of history
    - Adaptive algorithm
      - Can recognize repeated patterns, e.g., alternating taken / not taken
  Handling BTB misses
    - Detect in cycle 6
    - Predict taken for negative offset, not taken for positive
      - Loops vs. conditionals

Example Branch Prediction
  Branch History
    - Encode information about prior history of branch instructions
    - Predict whether or not branch will be taken
  State Machine
    [Figure: four states, left to right: No!, No?, Yes?, Yes!. Edges labeled T move right; edges labeled NT move left.]
    - Each time branch taken, transition to right
    - When not taken, transition to left
    - Predict branch taken when in state Yes! or Yes?
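
The state machine above behaves like a two-bit saturating counter, which can be sketched in Python as follows. The class and function names are hypothetical, and the starting state (No?) is an assumption, since the slide does not say where the machine starts.

```python
STATES = ["No!", "No?", "Yes?", "Yes!"]   # left to right, as in the figure


class TwoBitPredictor:
    def __init__(self, state="No?"):      # starting state is an assumption
        self.index = STATES.index(state)

    def predict(self):
        """Predict taken when in state Yes? or Yes!."""
        return self.index >= 2

    def update(self, taken):
        """Move right on a taken branch, left otherwise, saturating at the ends."""
        if taken:
            self.index = min(self.index + 1, len(STATES) - 1)
        else:
            self.index = max(self.index - 1, 0)


def count_mispredictions(outcomes, start="No?"):
    predictor = TwoBitPredictor(start)
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch: taken nine times, then falls through once.
loop_branch = [True] * 9 + [False]
print(count_mispredictions(loop_branch))
```

For the loop branch in this example the predictor misses only on the first iteration and on the final exit, and the single exit does not push it out of the "Yes" half, so the next run of the loop starts with a correct prediction.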

Pentium 4 Block Diagram
  [Figure: block diagram, from Intel Tech. Journal Q1, 2001]
  Next generation microarchitecture

Pentium 4 Features
  Trace Cache
    [Figure: L2 Cache supplies IA32 Instrs. to the Instruct. Decoder, which supplies uops to the Trace Cache, which supplies Operations.]
    - Replaces traditional instruction cache
    - Caches instructions in decoded form
    - Reduces required rate for instruction decoder
  Double-Pumped ALUs
    - Simple instructions (add) run at 2X clock rate
  Very Deep Pipeline
    - 20+ cycle branch penalty
    - Enables very high clock rates
    - Slower than Pentium III for a given clock rate

Processor Summary
  Design Technique
    - Create uniform framework for all instructions
      - Want to share hardware among instructions
    - Connect standard logic blocks with bits of control logic
  Operation
    - State held in memories and clocked registers
    - Computation done by combinational logic
    - Clocking of registers/memories sufficient to control overall behavior
  Enhancing Performance
    - Pipelining increases throughput and improves resource utilization
    - Must make sure maintains ISA behavior