CS429: Computer Organization and Architecture

Dr. Bill Young
Department of Computer Sciences, University of Texas at Austin
Last updated: October 31, 2016 at 13:22
CS429 Slideset 16

Data Hazard vs. Control Hazard
There are two types of hazards that interfere with flow through a pipeline.
  Data hazard: results from an instruction are not available when needed by a subsequent instruction.
  Control hazard: a branch in the control flow makes it ambiguous which instruction to fetch next.

How Do We Fix the Pipeline?
1. Pad the program with NOPs. That could mean two things:
     Change the program itself. That violates our Pipeline Correctness Axiom. Why?
     Make the implementation behave as if there were NOPs inserted.
2. Stall the pipeline.
     Data hazards: wait for the producing instruction to complete, then proceed with the consuming instruction.
     Control hazards: wait until the new PC has been determined, then begin fetching.
How is this better than inserting NOPs into the program?

How Do We Fix the Pipeline?
3. Forward data within the pipeline.
     Grab the result from somewhere in the pipe, after it has been computed but before it has been written back.
     This gives an opportunity to avoid performance degradation due to hazards!
The implemented solution is a combination of 2 and 3: forward data when possible and stall the pipeline where necessary.

Data Forwarding
Naive pipeline:
  A register isn't written until completion of the write-back stage.
  Source operands are read from the register file in the decode stage.
  The value needs to be in the register file at the start of the stage.
Observation: the value is generated in the execute or memory stage.
Trick: pass the value directly from the generating instruction to the decode stage.
  It needs to be available at the end of the decode stage.

Data Forwarding Example
[Figure: pipeline diagram of a data forwarding example]

Bypass Paths
Decode stage:
  Forwarding logic selects valA and valB.
  Normally from the register file.
  Forwarding: get valA or valB from a later pipeline stage.
Forwarding sources:
  Execute: valE
  Memory: valE, valM
  Write back: valE, valM
[Figure: pipelined datapath showing forwarding paths from e_valE, m_valM, M_valE, W_valM, and W_valE back into the decode stage]

Data Forwarding Example 2
Register %rdx: generated by the ALU during the previous cycle; forwarded from memory as valA.
Register %rax: value just generated by the ALU; forwarded from execute as valB.

Implementing Forwarding
Add new feedback paths from the E, M, and W pipeline registers into the decode stage.
Create logic blocks to select from multiple sources for valA and valB in the decode stage.
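The selection logic described above can be sketched in Python. This is a minimal illustration, not the actual hardware description: the function name select_val, the list-of-pairs representation, and the example values 10 and 3 are hypothetical, and the priority order (earliest pipeline stage first, so the most recent producer wins) is an assumption that the slides do not state explicitly.

```python
# Hypothetical sketch of decode-stage forwarding selection.
# Assumption: sources are checked from the execute stage down to write-back,
# so the most recently issued producer of a register takes priority.

def select_val(src_reg, reg_file, fwd_sources):
    """Return the operand value for src_reg.

    fwd_sources is a list of (destination_register, value) pairs ordered as
    e_valE, m_valM, M_valE, W_valM, W_valE. A destination of None means that
    source is not writing any register this cycle.
    """
    if src_reg is None:            # the instruction does not read this operand
        return None
    for dst, value in fwd_sources:
        if dst == src_reg:
            return value           # forward the pending result
    return reg_file[src_reg]       # no pending write: read the register file


reg_file = {"%rax": 0, "%rdx": 0, "%rcx": 7}
# Example values only: %rax is being produced in execute (e_valE),
# %rdx was produced one cycle earlier and now sits in M_valE.
fwd = [("%rax", 3), (None, None), ("%rdx", 10), (None, None), (None, None)]

val_a = select_val("%rdx", reg_file, fwd)   # forwarded from memory stage
val_b = select_val("%rax", reg_file, fwd)   # forwarded from execute stage
val_c = select_val("%rcx", reg_file, fwd)   # falls back to the register file
```

The same function serves both valA and valB; only the source register passed in differs.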


Limitation of Forwarding
Load-use dependency:
  Value needed by end of decode stage in cycle 7.
  Value read from memory in memory stage of cycle 8.

Avoiding Load/Use Hazard
Stall the using instruction for one cycle.
It can then pick up the loaded value by forwarding from the memory stage.

Control for Load/Use Hazard
Stall instructions in the fetch and decode stages.
Inject a bubble into the execute stage.

Control for Load/Use Hazard
Condition         F       D       E       M       W
Load/Use Hazard   stall   stall   bubble  normal  normal
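The slides give the actions taken for a load/use hazard but not the condition that triggers them. A plausible detection rule, sketched below as an assumption, is that the instruction in the execute stage is a load and its memory destination register is one of the source registers of the instruction in decode. The function name and the set LOAD_ICODES are hypothetical.

```python
# Assumed detection rule (not stated in the slides): a load is in execute
# and the instruction in decode reads the register being loaded.
LOAD_ICODES = {"mrmovq", "popq"}   # instructions that write a value read from memory


def load_use_hazard(E_icode, E_dstM, d_srcA, d_srcB):
    """Return True when the decode-stage instruction must stall one cycle."""
    return (
        E_icode in LOAD_ICODES
        and E_dstM is not None
        and E_dstM in (d_srcA, d_srcB)
    )


# mrmovq 0(%rdx), %rax followed immediately by addq %rbx, %rax
stall_needed = load_use_hazard("mrmovq", "%rax", "%rbx", "%rax")
# irmovq results can be forwarded from execute, so no stall is needed
no_stall = load_use_hazard("irmovq", None, "%rbx", "%rax")
```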

Recall Our Prediction Strategy
Note that we could have used a different prediction strategy.
Instructions that don't transfer control:
  Predict next PC to be valP; this is always reliable.
Call and unconditional jumps:
  Predict next PC to be valC (destination); this is always reliable.
Conditional jumps:
  Predict next PC to be valC (destination).
  Only correct if the branch is taken; right about 60% of the time.
Return instruction:
  Don't try to predict.
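The prediction strategy above amounts to a small case analysis. The sketch below is illustrative only: the function name predict_pc, the use of mnemonic strings instead of instruction codes, and returning None for ret are choices made for this example.

```python
# Illustrative sketch of the next-PC prediction rule.
# Assumption: instructions are identified by mnemonic strings.
JUMPS = {"jmp", "jle", "jl", "je", "jne", "jge", "jg"}


def predict_pc(mnemonic, valC, valP):
    """Return the predicted next PC, or None when no prediction is made."""
    if mnemonic == "ret":
        return None            # don't try to predict; fetch must wait
    if mnemonic == "call" or mnemonic in JUMPS:
        return valC            # always predict the destination (branch taken)
    return valP                # sequential successor


# Values from the branch misprediction example: jne at 0x002 targets 0x016
# and falls through to 0x00b.
predicted_after_jne = predict_pc("jne", valC=0x016, valP=0x00B)
predicted_after_xorq = predict_pc("xorq", valC=None, valP=0x002)
```

For the jne in that example the prediction is 0x016, which turns out to be wrong because the branch is not taken.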

Branch Misprediction Example
  0x000:    xorq %rax, %rax
  0x002:    jne target         # Not taken
  0x00b:    irmovq $1, %rax    # Fall through
  0x015:    halt
  0x016: target:
  0x016:    irmovq $2, %rdx    # Target
  0x020:    irmovq $3, %rcx    # Target + 1
  0x02a:    halt
Should only execute the first 4 instructions.

Handling Misprediction
Predict branch as taken:
  Fetch 2 instructions at the target.
Cancel when mispredicted:
  Detect branch not taken in the execute stage.
  On the following cycle, replace the instructions in the execute and decode stages with bubbles.
  No side effects have occurred yet.

Control for Misprediction
Condition             F       D       E       M       W
Mispredicted Branch   normal  bubble  bubble  normal  normal

Return Example
      irmovq Stack, %rsp   # Initialize stack pointer
      call p               # Procedure call
      irmovq $5, %rsi      # Return point
      halt
  .pos 0x20
  p:  irmovq $-1, %rdi     # procedure
      ret
      irmovq $1, %rax      # should not be executed
      irmovq $2, %rcx      # should not be executed
      irmovq $3, %rdx      # should not be executed
      irmovq $4, %rbx      # should not be executed
  .pos 0x100
  Stack:                   # Stack pointer
Previously, the pipeline executed three additional instructions after ret.

Correct Return Example
  ret
  bubble
  bubble
  bubble
  irmovq $5, %rsi    # Return
As ret passes through the pipeline, stall at the fetch stage while ret is in the decode, execute, and memory stages.
Inject a bubble into the decode stage.
Release the stall when ret reaches the write-back stage.

Control for Return
  ret
  bubble
  bubble
  bubble
  irmovq $5, %rsi    # Return
Condition        F       D       E       M       W
Processing ret   stall   bubble  normal  normal  normal

Special Control Cases
Action (on next cycle):
Condition             F       D       E       M       W
Processing ret        stall   bubble  normal  normal  normal
Load/Use Hazard       stall   stall   bubble  normal  normal
Mispredicted Branch   normal  bubble  bubble  normal  normal
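The table above can be read as a lookup from condition to per-stage action. The following sketch encodes it and applies it to a toy model of the pipeline registers, to show how stall and bubble differ. The names CONTROL, stage_actions, and next_cycle, the string labels for conditions, and the dictionary model of the pipeline registers are all hypothetical; the fetch-stage stall is not modeled beyond the caller choosing what is fetched.

```python
# Toy model: each pipeline register (D, E, M, W) holds an instruction name.
STAGES = ("F", "D", "E", "M", "W")

CONTROL = {
    "ret":        ("stall",  "bubble", "normal", "normal", "normal"),
    "load_use":   ("stall",  "stall",  "bubble", "normal", "normal"),
    "mispredict": ("normal", "bubble", "bubble", "normal", "normal"),
}


def stage_actions(condition=None):
    """Return the action for each stage; all normal when no condition holds."""
    actions = CONTROL.get(condition, ("normal",) * len(STAGES))
    return dict(zip(STAGES, actions))


def next_cycle(regs, fetched, condition=None):
    """Compute the pipeline register contents for the next cycle."""
    act = stage_actions(condition)
    # What each register would receive if nothing special happened.
    incoming = {"D": fetched, "E": regs["D"], "M": regs["E"], "W": regs["M"]}
    new_regs = {}
    for stage in ("D", "E", "M", "W"):
        if act[stage] == "stall":
            new_regs[stage] = regs[stage]      # hold the current contents
        elif act[stage] == "bubble":
            new_regs[stage] = "bubble"         # replace with a no-op
        else:
            new_regs[stage] = incoming[stage]  # normal advance
    return new_regs


# Load/use: mrmovq is in execute and the dependent addq is in decode.
regs = {"D": "addq", "E": "mrmovq", "M": "irmovq", "W": "nop"}
after_load_use = next_cycle(regs, fetched="halt", condition="load_use")

# Misprediction: jne is in execute, one target instruction is in decode.
regs = {"D": "target", "E": "jne", "M": "xorq", "W": "nop"}
after_mispredict = next_cycle(regs, fetched="target+1", condition="mispredict")
```

In the load/use case addq stays in decode while a bubble follows mrmovq; in the misprediction case both wrongly fetched instructions are replaced by bubbles.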

Pipeline Summary
Data hazards:
  Most are handled by forwarding with no performance penalty.
  A load/use hazard requires a one-cycle stall.
Control hazards:
  Cancel instructions when a mispredicted branch is detected; two cycles wasted.
  Stall the fetch stage while ret passes through the pipeline; three cycles wasted.
Control combinations:
  Must be analyzed carefully.
  The first version had a subtle bug.
  It only arises with an unusual instruction combination.

Performance Analysis with Pipelining
\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]
Ideal pipelined machine: CPI = 1
  One instruction completed per cycle.
  But much faster cycle time than the unpipelined machine.
However, hazards work against the ideal.
  Hazards resolved using forwarding are fine.
  Stalling degrades performance, and the instruction completion rate is interrupted.
CPI is a measure of the architectural efficiency of the design.

Computing CPI
CPI is a function of useful instructions (C_i) and bubbles (C_b):
\[ \text{CPI} = \frac{C_i + C_b}{C_i} = 1.0 + \frac{C_b}{C_i} \]
You can reformulate this to account for:
  load penalties (lp)
  branch misprediction penalties (mp)
  return penalties (rp)
\[ \text{CPI} = 1.0 + lp + mp + rp \]
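As a quick check of the first formula, the following sketch computes CPI from raw counts. The function name and the example counts (1000 instructions, 310 bubbles) are made up for illustration and do not come from a measurement.

```python
import math


def cpi_from_counts(useful_instructions, bubbles):
    """CPI = (C_i + C_b) / C_i, i.e. 1.0 plus bubbles per useful instruction."""
    if useful_instructions <= 0:
        raise ValueError("need at least one useful instruction")
    return (useful_instructions + bubbles) / useful_instructions


# Hypothetical run: 1000 useful instructions with 310 injected bubbles.
example_cpi = cpi_from_counts(1000, 310)
ideal_cpi = cpi_from_counts(1000, 0)
assert math.isclose(example_cpi, 1.31)
assert ideal_cpi == 1.0
```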

Computing CPI (2)
So, how do we determine the penalties?
Depends on how often each situation occurs on average.
  How often does a load occur, and how often does that load cause a stall?
  How often does a branch occur, and how often is it mispredicted?
  How often does a return occur?
We can measure these using:
  a simulator, or
  hardware performance counters.
We can also estimate them through historical averages.
  Then use the estimates to make early design tradeoffs for the architecture.

Computing CPI (3)
Cause        Name   Instruction   Condition   Stalls   Product
                    Frequency     Frequency
Load/use     lp     0.30          0.3         1        0.09
Mispredict   mp     0.20          0.4         2        0.16
Return       rp     0.02          1.0         3        0.06
Total penalty                                          0.31
CPI = 1 + 0.31 = 1.31, which is 31% worse than the ideal CPI of 1.
This is not ideal. It gets worse when:
  you also account for non-ideal memory access latency;
  the pipeline is deeper (where stalls per hazard increase).
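The arithmetic in the table can be reproduced with a short script. Each penalty is the product of instruction frequency, condition frequency, and stall cycles; the dictionary layout and function name below are choices made for this sketch, while the numbers are the ones from the table.

```python
import math

# (instruction frequency, condition frequency, stall cycles) per cause
PENALTY_INPUTS = {
    "lp": (0.30, 0.3, 1),   # load/use
    "mp": (0.20, 0.4, 2),   # mispredicted branch
    "rp": (0.02, 1.0, 3),   # return
}


def penalty(instr_freq, cond_freq, stalls):
    """Average bubbles per instruction contributed by one cause."""
    return instr_freq * cond_freq * stalls


penalties = {name: penalty(*row) for name, row in PENALTY_INPUTS.items()}
total_penalty = sum(penalties.values())
cpi = 1.0 + total_penalty

# Floating-point products are compared with a tolerance rather than exactly.
assert math.isclose(penalties["lp"], 0.09)
assert math.isclose(total_penalty, 0.31)
assert math.isclose(cpi, 1.31)
```

Changing a single entry, for example a deeper pipeline that raises the misprediction stall count, shows directly how CPI responds.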