Pipelining: An Instruction Assembly Line


Processor Designs So Far / Pipelining: An Assembly Line
EEC 170 Computer Architecture, FQ 05 (courtesy of Prof. Kent Wilken and Prof. John Owens)
We have a single-cycle design with a low CPI but a high clock cycle time (CC). We have a multicycle design with a low CC but a high CPI. We want the best of both: low CC and low CPI. This is achieved using pipelining.

Pipelining is Natural! Laundry Example
Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold. The washer takes 30 minutes, the dryer takes 40 minutes, and the folder takes 20 minutes.

Laundry Timing
How long does laundry take with a single-cycle design (wash/dry/fold is one clock)? What is the clock cycle time? The CPI? How many clocks? How long does laundry take with a multiple-cycle design (the longest of wash/dry/fold is one clock)? What is the CC? The CPI? How many clocks?

Sequential Laundry vs. Pipelined Laundry: Start Work ASAP
(Timing charts: loads A-D plotted in task order against time, from 6 PM to midnight.)
Sequential laundry takes 6 hours for 4 loads. If they learned pipelining, how long would laundry take?
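The laundry arithmetic above can be checked with a short sketch. The stage times (30/40/20 minutes) and the closed-form pipelined formula are assumptions that hold here because the dryer is the single bottleneck stage:

```python
# Laundry pipelining timing: wash 30 min, dry 40 min, fold 20 min, 4 loads.
stages = [30, 40, 20]   # minutes per stage
loads = 4

# Sequential: every load finishes all stages before the next load starts.
sequential = loads * sum(stages)            # 4 * 90 = 360 min

# Pipelined (asynchronous stages): a load enters a stage as soon as both
# the load and the stage are free.  With the dryer as the bottleneck,
# total time = sum(stages) + (loads - 1) * max(stages).
pipelined = sum(stages) + (loads - 1) * max(stages)   # 90 + 3*40 = 210 min

print(sequential / 60, "hours sequential")  # 6.0
print(pipelined / 60, "hours pipelined")    # 3.5
```

This reproduces the slide's 6 hours sequential versus 3.5 hours pipelined.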

Laundry Timing
How long does laundry take with a pipelined design (the longest of wash/dry/fold is one clock)? What is the CC? How many clocks?

Pipelined Laundry: Start Work ASAP
(Timing chart: loads A-D in task order against time, 6 PM to midnight.)
Pipelined laundry takes 3.5 hours for 4 loads.

Pipelining Lessons
Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload. The pipeline rate is limited by the slowest pipeline stage. Multiple tasks operate simultaneously using different resources. Potential speedup = number of pipe stages. Unbalanced lengths of pipe stages reduce the speedup. Time to fill the pipeline and time to drain it also reduce the speedup.

Sandwich Bar Analogy
Pipelining: multiple people going through the sandwich bar at the same time. If you're in front of the pickles but you don't need the pickles, you still hold up the line behind you: a stall for dependences, as on a car assembly line.

Digital System Efficiency
A synchronous digital system doing too much work between clocks can be inefficient, because the logic can be static for much of the clock period.
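The pipelining lessons above (speedup approaches the number of stages; unbalanced stages reduce it) can be quantified with a small helper. The function name and the example stage times are mine, for illustration:

```python
def pipeline_speedup(stage_times, k):
    """Speedup of a pipeline over non-pipelined execution of k tasks.

    Non-pipelined: each task runs every stage back to back.
    Pipelined: one task enters per slot of the slowest stage.
    """
    sequential = k * sum(stage_times)
    pipelined = sum(stage_times) + (k - 1) * max(stage_times)
    return sequential / pipelined

# Balanced 3-stage pipeline: speedup approaches 3 (the stage count).
balanced = pipeline_speedup([30, 30, 30], 1000)

# Unbalanced stages: limited by the slowest stage, so speedup is lower
# (it approaches sum(stages)/max(stage) = 90/40 = 2.25, not 3).
unbalanced = pipeline_speedup([30, 40, 20], 1000)
```

Note that even the balanced case never quite reaches 3: the fill-and-drain term `sum(stage_times)` is the overhead the slide mentions.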

Pipeline Efficiency
Efficiency can be improved by I. subdividing the logic into multiple stages and shortening the CC.
(Diagram sequence: one long logic block split into Logic 1, Logic 2, Logic 3, with pipeline registers between the stages.)

Single-Instruction Pipelined Efficiency
Executing a single instruction on the pipelined design is even less efficient. Why? Only one stage's logic is active in any given cycle; the other stages sit idle.

Pipeline Efficiency
Efficiency can be improved by I. subdividing the logic into multiple stages and shortening the CC, and by II. overlapping operation, so that every stage is busy with a different instruction in every cycle.

Pipeline Depth
If three pipeline stages are better than non-pipelined, are four stages better than three? What is the limit to the number of stages? What is the CC at that limit?

Why Pipeline?
Suppose we execute 100 instructions. How long does this take on each architecture?
Single-cycle machine: 4.5 ns/cycle x 1 CPI x 100 inst = 450 ns.
Multicycle machine: 1.0 ns/cycle x 4.2 CPI (due to instruction mix) x 100 inst = 420 ns.
Ideal pipelined machine: 1.0 ns/cycle x (1 CPI x 100 inst + 4-cycle fill) = 104 ns.

Fill Costs
Suppose we pipeline different numbers of instructions. What is the overhead of pipelining? For the ideal pipelined machine (1.0 ns/cycle, CPI = 1), the fill cost is fixed, so its relative cost shrinks as the run grows:
100 instructions: 100 ns runtime + 4 ns overhead = 4% overhead.
10,000 instructions: 10,000 ns runtime + 4 ns overhead = 0.04% overhead.
1,000,000 instructions: 1,000,000 ns runtime + 4 ns overhead = 0.0004% overhead.

Pipelined Multiplier
We can take the Wallace-tree multiplier we developed in Chapter 3 and create a two-stage pipeline: stage 1 reduces the partial-product terms; stage 2 does the final addition using a carry-lookahead adder. The two operations are roughly balanced; both take O(log n) time.

Wallace-Tree Multiplier: 2-Stage Pipeline
(Diagram: multiplicand bits x7..x0 and multiplier bits y7..y0 feed the partial-product reduction stage; a pipeline register separates it from the final carry-lookahead adder, which produces the result.)
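The fill-cost table is one division. A sketch, assuming a 4-cycle fill at CPI = 1 (the function name and defaults are mine):

```python
def fill_overhead_pct(n_inst, fill_cycles=4, cycle_ns=1.0):
    """Pipeline fill cost as a percentage of useful runtime (CPI = 1)."""
    runtime = n_inst * cycle_ns         # useful work
    overhead = fill_cycles * cycle_ns   # fixed, independent of n_inst
    return 100 * overhead / runtime

for n in (100, 10_000, 1_000_000):
    print(n, "instructions:", fill_overhead_pct(n), "% overhead")
```

Because the overhead is a constant while the runtime grows linearly, the percentage falls by 100x for every 100x more instructions pipelined.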

Pipelined Processor Design
We'll take the single-cycle design and transform it into a pipelined design. What we formerly called a phase of an instruction's execution will now become a pipeline stage.

Single-Cycle Design Phases
Fetch, Decode, Execution, Memory, Writeback.
(Diagram: the single-cycle datapath — PC, instruction memory, register file, ALU, data memory, sign extension 16 -> 32, shift-left-2, PC + 4 adder.)

Pipelined Design
The same datapath is cut at the phase boundaries into five stages, with pipeline registers between them: IF/ID, ID/EX, EX/MEM, and MEM/WB.

Walking Instructions Through the Pipeline
(Diagram sequence: the pipelined datapath highlighted stage by stage for each instruction class.)
Load word (lw): Execution computes the effective address, Memory reads the data, Writeback posts it to the register file.
Store word (sw): Execution computes the effective address, Memory writes the register value to memory; nothing happens in Writeback.
R-type: Execution performs the ALU operation; the result passes through Memory unused and is posted to the register file in Writeback.
Branch: Execution compares the registers and computes the target address, and the PC is updated; nothing happens in Memory or Writeback.

Overlapping Execution
Parts of different instructions execute in each pipeline stage during each cycle:

CC1    CC2    CC3    CC4    CC5 ...
lw $1, 100($0)
       lw $2, 200($0)
              lw $3, 300($0)
                     lw $4, 400($0)
                            lw $5, 500($0)

Once the pipeline is full, each of the five instructions occupies a different stage (IF, ID, EX, MEM, WB) in every cycle.
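The staircase above can be generated mechanically for an ideal (hazard-free) pipeline. The function and table layout are mine, as a sketch:

```python
# Cycle-by-cycle stage occupancy for an ideal 5-stage pipeline.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(n_instructions):
    """Map (instruction index, cycle) -> stage name, no hazards."""
    table = {}
    for i in range(n_instructions):
        for s, name in enumerate(STAGES):
            table[(i, i + s + 1)] = name   # inst i enters IF in cycle i+1
    return table

sched = schedule(5)
# In cycle 5 the pipeline is full: inst 0 is in WB, ..., inst 4 is in IF.
for i in range(5):
    print("CC5, inst", i, "->", sched[(i, 5)])
```

Each instruction still takes five cycles of latency; the win is that one instruction completes per cycle once the pipeline is full.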

Pipelined Datapath with Control Lines
(Diagram: the pipelined datapath with control. The control signals — PCSrc, Branch, MemRead, MemWrite, MemtoReg, RegWrite, ALUSrc, ALUOp, RegDst — are generated during Decode and carried down the pipeline in the ID/EX, EX/MEM, and MEM/WB registers alongside the data.)

Data Hazards
Data hazards can occur when:
1. Instructions I1 and I2 are both in the pipeline,
2. I2 follows I1, and
3. I1 produces a register result that is used by I2.
The problem: because I1 has not yet posted its result to the register file, I2 may read an obsolete value.

Data Hazards: Example
sub $2, $1, $3    # writes $2 in WB (CC5)
and $12, $2, $5   # reads $2 in ID (CC3)
or  $13, $6, $2   # reads $2 in ID (CC4)
add $14, $2, $2   # reads $2 in ID (CC5)
sw  $15, 100($2)  # reads $2 in ID (CC6)

$2 is not written by sub until CC5, but the following instructions depend on $2 as soon as CC3.
sw requires $2 at CC6, and $2 is written to the register file at CC5, so it is not a problem.
add requires $2 at CC5; the register file is written in the first half of the cycle and read in the second half, so add reads the new value and there is no problem.
or requires $2 at CC4, and $2 is not written to the file until CC5: hazard. However, $2 is available at the end of CC3 from an internal pipeline latch.
and requires $2 at CC3, and $2 is not written until CC5: hazard. However, $2 is again available in time from an internal pipeline latch.

Forwarding
Because the needed values already exist in internal pipeline latches, most hazards can be resolved by forwarding: sending the internal pipeline result from I1 to the stage where it is needed by I2, which follows it in the pipeline. The result is still posted to the register file when I1 reaches the WB stage. Forwarding requires hardware modifications to the datapath: forwarding paths from the EX/MEM and MEM/WB pipeline registers back to the Execution stage, and new muxes (controlled by ForwardA and ForwardB) to select each ALU operand from the register file or the newest forwarded value. A forwarding unit compares the destination register numbers in EX/MEM and MEM/WB against the source registers Rs and Rt in ID/EX.

Forwarding control logic used to decide whether to forward from the Memory stage:
if (EX/MEM.RegWrite                                  # must be writing a register
    and (EX/MEM.RegisterRd != 0)                     # $0 is always up to date
    and (EX/MEM.RegisterRd == ID/EX.RegisterRs))     # stage-2 and stage-3 source operands match
  ForwardA = 10
if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd != 0)
    and (EX/MEM.RegisterRd == ID/EX.RegisterRt))
  ForwardB = 10

Forwarding control logic used to decide whether to forward from the Writeback stage:
if (MEM/WB.RegWrite                                  # must be writing a register
    and (MEM/WB.RegisterRd != 0)                     # $0 is always up to date
    and not (EX/MEM.RegWrite
             and (EX/MEM.RegisterRd == ID/EX.RegisterRs))  # a forward from the Memory stage is newer, so it has priority
    and (MEM/WB.RegisterRd == ID/EX.RegisterRs))
  ForwardA = 01
if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd != 0)
    and not (EX/MEM.RegWrite
             and (EX/MEM.RegisterRd == ID/EX.RegisterRt))
    and (MEM/WB.RegisterRd == ID/EX.RegisterRt))
  ForwardB = 01
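The forwarding-unit conditions can be sketched as a function. The dictionary layout and integer mux encodings (0 = register file, 2 = from EX/MEM, 1 = from MEM/WB, standing in for the binary selects) are mine:

```python
def forward(ex_mem, mem_wb, id_ex_rs, id_ex_rt):
    """Return (ForwardA, ForwardB) mux selects for the two ALU operands."""
    fa = fb = 0
    # Check the older MEM/WB result first ...
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0:
        if mem_wb["Rd"] == id_ex_rs:
            fa = 1
        if mem_wb["Rd"] == id_ex_rt:
            fb = 1
    # ... so the newer EX/MEM result, checked second, wins on a tie.
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0:
        if ex_mem["Rd"] == id_ex_rs:
            fa = 2
        if ex_mem["Rd"] == id_ex_rt:
            fb = 2
    return fa, fb

# sub $2,$1,$3 is in EX/MEM when and $12,$2,$5 reaches EX:
print(forward({"RegWrite": True, "Rd": 2},
              {"RegWrite": False, "Rd": 0}, 2, 5))   # (2, 0)
```

Checking MEM/WB first and letting EX/MEM overwrite it encodes the "forward from the Memory stage is newer, has priority" clause without an explicit negated condition.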

Complete Datapath with Forwarding
(Diagram sequence: the pipelined datapath with the forwarding unit and its muxes; worked frames show sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2) flowing through, with $2 forwarded from EX/MEM and MEM/WB to the Execution stage as needed.)

Load Hazards
For loads, most hazards can be resolved with forwarding, but some cannot: the loaded value is not available until the end of the load's Memory stage, which is too late for the Execution stage of the immediately following instruction.
lw  $2, 20($1)
and $4, $2, $5
or  $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

Load Hazard Solution #1
Have the compiler insert a nop to space the instructions further apart:
lw  $2, 20($1)
nop
and $4, $2, $5
or  $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

Load Hazard Solution #2
The nop resolves the load hazard, but the best solution is for the compiler to reorder (schedule) instructions to resolve the hazard. slt is independent of every instruction except lw, so it can be moved up to resolve the hazard without a nop:
lw  $2, 20($1)
slt $1, $6, $7
and $4, $2, $5
or  $8, $2, $6
add $9, $4, $2

Load Hazard Solution #3
Have the pipeline detect the hazard and automatically inject a nop. This reduces the size of the program and thereby reduces cache misses. The lw and the instructions before it advance in the next CC; the following instructions stall, and a nop is injected into EX.

Detecting a Load Hazard
We need to add a new hazard detection unit; detection occurs in the ID pipeline stage. A load hazard occurs under this condition:
if (ID/EX.MemRead                                    # lw in stage 3
    and ((ID/EX.RegisterRt == IF/ID.RegisterRs)
      or (ID/EX.RegisterRt == IF/ID.RegisterRt)))    # stage-2 Rs or Rt uses the lw result
  stall the pipeline
This logic is correct but not precise (it creates too many stalls). Why?

Injecting a Nop (Pipeline Bubble)
Set the stalled instruction's control bits to all zeros; stall IF and ID.
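The load-hazard condition above is one boolean expression. A sketch (the function and argument names are mine):

```python
def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """True if the ID-stage instruction must stall behind a load in EX.

    Deliberately imprecise, like the slide's logic: it stalls whenever
    the register numbers match, even if the ID-stage instruction does
    not actually read that field.
    """
    return bool(id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt))

# lw $2, 20($1) in EX while and $4, $2, $5 is in ID:
print(load_use_stall(True, 2, 2, 5))   # True -> inject a bubble
```

The imprecision the slide asks about lives in the last clause: for example, an instruction whose Rt field is a destination (or an immediate-format instruction that ignores Rt) matches the comparison without actually consuming the loaded value.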

Refined Nop Injection
Setting all control bits to zero is correct but is overkill. What is the minimum set of control bits that must be set to zero?

Early Branch Resolution
The branch is resolved in the ID stage (with a dedicated register comparator and target adder) to reduce the branch penalty.

Delayed Branches
Further reduce the branch penalty by delaying the effect of the branch by one cycle: the instruction after the branch (the branch delay slot) always executes before the PC changes to the target. The delay-slot instruction must be independent of the branch, i.e., "safe". An independent instruction can be moved into the delay slot from (a) above the branch, (b) the taken path, or (c) the not-taken path. With branch-taken probability p:
(a) from above: best, because the instruction is always useful.
(b) from the taken path: lacking statistics for the specific branch, taken is more likely (e.g., 2/3). Less valuable than (a) because the fraction of useful work is p rather than 1.
(c) from the not-taken path: the useful-work fraction is 1 - p.

Branch Target Copying
If the branch-target instruction is reachable via other program paths, it must be copied into the delay slot, not moved, and the branch retargeted to the instruction after the former target. The dynamic instruction count decreases, although the static count does not.

Branches in the Delay Slot
The delay slot following a jump can almost always be filled by moving or copying the jump's target. A branch generally cannot fill a delay slot; placing a JAL in a delay slot, for example, results in a return to the wrong address!
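The three fill strategies differ only in the expected fraction of useful work as a function of the taken probability p. A tiny sketch (names are mine):

```python
def useful_fraction(strategy, p):
    """Expected fraction of delay-slot executions that do useful work."""
    return {
        "from_above":     1.0,      # always executes usefully
        "from_taken":     p,        # useful only when the branch is taken
        "from_not_taken": 1 - p,    # useful only on fall-through
    }[strategy]

for s in ("from_above", "from_taken", "from_not_taken"):
    print(s, useful_fraction(s, 2 / 3))
```

This makes the slide's ranking explicit: with the generic assumption p = 2/3, filling from the taken path salvages two-thirds of the slots, the not-taken path only one-third, and code from above the branch all of them.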

Profiled Delayed Branch
Where a safe instruction can be moved from either the taken or the not-taken path, fill the delay slot based on profile statistics; this maximizes the useful work. E.g., for p = 0.38, fill from the not-taken path (useful fraction 1 - p = 0.62).

More Branch Optimizations
A jump is eliminated if it is the only instruction left in its block after the move, so the transformation can eliminate two instructions. Better still is to reverse the sense of the branch (e.g., beqz to bnez) so that the profitable path becomes the fall-through.

Delayed Branches and Exceptions
With delayed branches, an exception must save the PCs of the next two instructions rather than one. E.g., if a load in the delay slot page-faults, both the delay-slot PC and the branch-target PC must be saved to resume correctly.

Delayed Branching Performance
For typical integer code a useful instruction is executed in the delay slot about 70% of the time; the branch cost is then (cycles lost per branch = 0.3) x (fraction of instructions that are branches) = roughly an 8% performance loss. A significant reduction, but we still want a smaller branch cost.
NOTE: if you consider slides 6 and 7 as the baseline, then the CPI for a branch instruction in the baseline implementation is higher, and a 75% performance loss is correct. This is a bit different than what we discussed in class.

Branch Prediction
Use a "magic box" to: determine that the instruction being fetched is a branch, predict the branch outcome (taken/not taken), and produce the target address — all while the instruction is being fetched!
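The branch-cost arithmetic above is CPI bookkeeping. A sketch, where the 0.3 lost cycles and the 0.15 branch fraction are illustrative assumptions rather than the slide's exact figures:

```python
def branch_extra_cpi(lost_cycles_per_branch, branch_fraction):
    """Extra CPI (and, against an ideal CPI of 1, the fractional
    performance loss) contributed by branch stalls."""
    return lost_cycles_per_branch * branch_fraction

# 0.3 cycles lost per branch, branches ~15% of instructions (assumed):
loss = branch_extra_cpi(0.3, 0.15)
print(f"{100 * loss:.1f}% performance loss")
```

The same function with 1 full cycle lost per branch shows why an unfilled delay slot (or a branch resolved late) costs several times more.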

Branch Prediction
Resolve the branch effectively in the IF stage whenever the prediction is correct, using a predictor that supplies a predicted target address.

Simple Branch Prediction Table: Branch Target Buffer (BTB)
The predictor module is a 2^n-entry table indexed by the n lower bits of the instruction address. Its fields are:
1. The (32 - n) upper address bits (the tag).
2. An IsBranch bit.
3. The target address (32 bits).
4. A prediction bit.
The table is filled during past executions of each branch. Because the BTB is small, it is faster than the instruction memory, so the prediction is ready in time.
(Diagram: the BTB in context — the PC indexes the table, a tag compare produces Hit; Hit, IsBranch, and the prediction together produce Branch_Taken, which selects the predicted target address as the next PC.)

Two-Bit Branch Predictor
Branch history using four states (two bits) avoids the loop problem — a one-bit predictor mispredicts a loop branch twice per run, at the final iteration and again at the first iteration of the next run. The states are: strong not-taken, weak not-taken, weak taken, strong taken, implemented as a state machine: a taken branch moves one step toward strong taken, a not-taken branch one step toward strong not-taken. You can think of this predictor as a saturating counter: +1 for taken, -1 for not taken.
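The saturating-counter view of the two-bit predictor is easy to sketch directly. The class name and the 0-3 state encoding are mine:

```python
class TwoBitPredictor:
    """Two-bit saturating counter:
    0 = strong not-taken, 1 = weak not-taken,
    2 = weak taken,       3 = strong taken."""

    def __init__(self, state=1):
        self.state = state

    def predict(self):
        return self.state >= 2          # True = predict taken

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# The loop problem: a 10-iteration loop branch, predictor warmed up.
loop = TwoBitPredictor(3)
mispredicts = 0
for taken in [True] * 9 + [False]:
    if loop.predict() != taken:
        mispredicts += 1
    loop.update(taken)
# Only the final not-taken iteration mispredicts; the counter drops to
# weak taken (2), so the next run's first iteration is still predicted
# correctly -- a one-bit predictor would miss it too.
print("mispredicts per run:", mispredicts)
```
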

Branch Target Buffer, Complete
The complete BTB replaces the single prediction bit with the two-bit state. Because the BTB is small, it is still faster than the instruction memory.
(Diagram: the final BTB in context — the tag compare produces Hit; Hit, IsBranch, and the two-bit state produce Branch_Taken and the predicted target address.)

Confirming Branch Prediction
The prediction must be confirmed in the ID stage, where the branch actually resolves, and the branch-history state updated. We must consider the pipeline action for the following cases:
1. No hit, no branch: do nothing.
2. No hit, branch: (a) nullify the IF instruction, (b) branch to the computed target address.
3. Hit, condition = prediction: do nothing.
4. Hit, condition = taken, prediction = not taken: (a) nullify the IF instruction, (b) branch to the computed target address.
5. Hit, condition = not taken, prediction = taken: (a) nullify the IF instruction, (b) branch to the instruction after the branch (the ID-stage PC + 4).

Next PC Using Branch Prediction
Using branch prediction we now have four choices for the next PC: IF_PC + 4, the predicted target address, ID_PC + 4, and the computed target address.

BTB Updates
We must consider BTB updates for the following cases:
1. No hit, no branch: do nothing.
2. No hit, branch: allocate a new BTB entry; set the prediction to the weak state matching the branch condition.
3. Hit, condition = taken: ++prediction (saturating).
4. Hit, condition = not taken: --prediction (saturating).
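The BTB update cases above can be sketched as a direct-mapped table with a two-bit counter per entry. The class layout, tuple encoding, and the convention that `update` is only called for resolved branches (covering the "no hit, no branch: do nothing" case by omission) are mine:

```python
class BTB:
    """Direct-mapped branch target buffer, indexed by the n low PC bits,
    tagged with the remaining upper bits; 2-bit counter per entry
    (0-1 predict not-taken, 2-3 predict taken)."""

    def __init__(self, n):
        self.n = n
        self.table = {}                 # index -> (tag, target, counter)

    def _split(self, pc):
        return pc % (1 << self.n), pc >> self.n   # (index, tag)

    def lookup(self, pc):
        """Return (predicted target, predict_taken) on a hit, else None."""
        idx, tag = self._split(pc)
        entry = self.table.get(idx)
        if entry and entry[0] == tag:
            _, target, ctr = entry
            return target, ctr >= 2
        return None

    def update(self, pc, taken, target):
        """Called when a branch resolves in the ID stage."""
        idx, tag = self._split(pc)
        entry = self.table.get(idx)
        if not (entry and entry[0] == tag):
            # No hit, branch: allocate, prediction set to the weak state
            # matching the observed condition.
            self.table[idx] = (tag, target, 2 if taken else 1)
        else:
            ctr = entry[2]
            ctr = min(3, ctr + 1) if taken else max(0, ctr - 1)  # saturate
            self.table[idx] = (tag, target, ctr)

btb = BTB(4)
print(btb.lookup(0x40))                 # miss -> None
btb.update(0x40, True, 0x80)            # allocate, weak taken
print(btb.lookup(0x40))                 # (0x80, True)
```

Because the table is direct-mapped, two branches whose PCs share the low n bits conflict: the second allocation simply evicts the first, which mirrors the behavior of a small hardware BTB.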