Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1

Similar documents

Design of Pipelined MIPS Processor. Sept. 24 & 26, 1997

CS352H: Computer Systems Architecture

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

Solutions. Solution The values of the signals are as follows:

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

Pipeline Hazards. Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by Arvind and Krste Asanovic

Lecture: Pipelining Extensions. Topics: control hazards, multi-cycle instructions, pipelining equations

PROBLEMS #20,R0,R1 #$3A,R2,R4

Data Dependences. A data dependence occurs whenever one instruction needs a value produced by another.

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering. EEC180B Lab 7: MISP Processor Design Spring 1995

WAR: Write After Read

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

Week 1 out-of-class notes, discussions and sample problems

Pipelining Review and Its Limitations

Course on Advanced Computer Architectures

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats

Computer organization

EE282 Computer Architecture and Organization Midterm Exam February 13, (Total Time = 120 minutes, Total Points = 100)

Review: MIPS Addressing Modes/Instruction Formats

PROBLEMS. which was discussed in Section

Using Graphics and Animation to Visualize Instruction Pipelining and its Hazards

Computer Organization and Components

CPU Performance Equation

Introduction to Cloud Computing

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉志尉 National Chiao Tung University

Introducción. Diseño de sistemas digitales.1

Instruction Set Architecture

EEM 486: Computer Architecture. Lecture 4. Performance

COMP 303 MIPS Processor Design Project 4: MIPS Processor Due Date: 11 December :59

Instruction Set Architecture (ISA) Design. Classification Categories

VLIW Processors. VLIW Processors

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

EE361: Digital Computer Organization Course Syllabus

CPU Organization and Assembly Language

Module: Software Instruction Scheduling Part I

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

Let s put together a Manual Processor

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

Instruction Set Design

LSN 2 Computer Processors

Addressing The problem. When & Where do we encounter Data? The concept of addressing data' in computations. The implications for our machine design(s)

Computer Architecture TDTS10

Central Processing Unit (CPU)

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

on an system with an infinite number of processors. Calculate the speedup of

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to:

CPU Organisation and Operation

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes

EC 362 Problem Set #2

What is a bus? A Bus is: Advantages of Buses. Disadvantage of Buses. Master versus Slave. The General Organization of a Bus

IA-64 Application Developer s Architecture Guide

Computer Organization and Architecture

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

Thread level parallelism

l C-Programming l A real computer language l Data Representation l Everything goes down to bits and bytes l Machine representation Language

Reduced Instruction Set Computer (RISC)

Instruction Set Architecture (ISA)

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

An Introduction to the ARM 7 Architecture

StrongARM** SA-110 Microprocessor Instruction Timing

A Lab Course on Computer Architecture

picojava TM : A Hardware Implementation of the Java Virtual Machine

CS521 CSE IITG 11/23/2012

Chapter 4 Register Transfer and Microoperations. Section 4.1 Register Transfer Language

Instruction Set Architecture. Datapath & Control. Instruction. LC-3 Overview: Memory and Registers. CIT 595 Spring 2010

A SystemC Transaction Level Model for the MIPS R3000 Processor

A s we saw in Chapter 4, a CPU contains three main sections: the register section,

Chapter 4 Lecture 5 The Microarchitecture Level Integer JAVA Virtual Machine

A FPGA Implementation of a MIPS RISC Processor for Computer Architecture Education

Learning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design

Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x

CSE 141L Computer Architecture Lab Fall Lecture 2

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

A single register, called the accumulator, stores the. operand before the operation, and stores the result. Add y # add y from memory to the acc

Chapter 1 Computer System Overview

Central Processing Unit

Assembly Language Programming

Microprocessor & Assembly Language

The Microarchitecture of Superscalar Processors

(Refer Slide Time: 00:01:16 min)

TIMING DIAGRAM O 8085

Performance evaluation

Input / Ouput devices. I/O Chapter 8. Goals & Constraints. Measures of Performance. Anatomy of a Disk Drive. Introduction - 8.1

COMPUTER HARDWARE. Input- Output and Communication Memory Systems

OC By Arsene Fansi T. POLIMI

MACHINE ARCHITECTURE & LANGUAGE

8085 INSTRUCTION SET

EECS 583 Class 11 Instruction Scheduling Software Pipelining Intro

! Metrics! Latency and throughput. ! Reporting performance! Benchmarking and averaging. ! CPU performance equation & performance trends

SIM-PL: Software for teaching computer hardware at secondary schools in the Netherlands

Transcription:

Pipeline Hazards Structure hazard Data hazard

Pipeline hazard: the major hurdle A hazard is a condition that prevents an instruction in the pipe from executing its next scheduled pipe stage Taxonomy of hazard Structural hazards These are conflicts over hardware resources. Data hazards Instruction depends on result of prior computation which is not ready (computed or stored) yet Control hazards branch condition and the branch PC are not available in time to fetch an instruction on the next clock 1.2

Structural hazard: Pipe Stage Contention Structural hazards Occurs when two or more instructions want to use the same hardware resource in the same cycle Causes bubble (stall) in pipelined machines Overcome by replicating hardware resources Multiple accesses to the register file Multiple accesses to memory some functional unit is not fully pipelined. Not fully pipelined functional units 1.3

Multi access to the register file Simply insert a stall, speedup will be decreased. We can resolve it with double bump 1.4

Double Bump Works! allow WRITE-then-READ in one clock cycle (double bump) 1.5

Multi access to Single Memory Port Time (clock cycles) I n st r. O rd Ld/St Instr 1 Instr 2 Mem Reg Mem Reg Mem Reg Mem Reg Mem Reg Mem Reg e r Instr 3 Mem Reg Mem Reg 1.6

Insert Stall Time (clock cycles) I n s t r. O r d e r Ld/St Instr 1 Instr 2 Stall Instr 3 Mem Reg Mem Reg Mem Reg Mem Reg Mem Reg Mem Reg Bubble Bubble Bubble Bubble Bubble Mem Reg Mem Reg 1.7

Split instruction and data memory Time (clock cycles) I n st r. O rd Ld/St Instr 1 Instr 2 IM Reg DM Reg IM Reg DM Reg IM Reg DM Reg e r Instr 3 IM Reg DM Reg Split instruction and data memory / multiple memory port / instruction buffer means: fetch the instruction and data inference using different hardware resources. 1.8

Not fully pipelined function unit : may cause structural hazard Unpipelined Float Adder ADDD IF ID ADDD WB ADDD IF ID stall stall stall stall stall ADDD Not fully pipelined Adder ADDD IF ID A1 A2 A3 WB ADDD IF ID stall A1 A2 A3 Fully pipelined Adder ADDD IF ID A1 A2 A3 A4 A5 A6 WB ADDD IF ID A1 A2 A3 A4 A5 A6 WB Or multiple unpipelined Float Adder ADDD IF ID ADDD1 WB ADDD IF ID ADDD2 WB 1.9

Example Machine without structural hazards will always have a lower CPI Data reference constitute 40% of the mix Ideal CPI ignoring the structural hazard is 1 The processor with the structural hazard has a clock rate that is 1.05 times higher than that of a processor without structural hazard. Answer Average instruction time = CPI Clock cycle time =(1+0.4 1) CC ideal /1.05 = 1.3 Cc ideal Clearly, the processor without the structural hazard is faster. 1.10

Why allow machine with structural hazard? To reduce cost. i.e. adding split caches, requires twice the memory bandwidth. also fully pipelined floating point units costs lots of gates. It is not worth the cost if the hazard does not occur very often. To reduce latency of the unit. Making functional units pipelined adds delay (pipeline overhead -> registers.) An unpipelined version may require fewer clocks per operation. Reducing latency has other performance benefits, as we will see. 1.11

Example Example: impact of structural hazard to performance Many machines have unpipelined float-point multiplier. The function unit time of FP multiplier is 6 clock cycles FP multiply has a frequency of 14% in a SPECfp benchmark Will the structural hzard have a large performance impact on the SPECfp benchmark? 1.12

Answer to the example In the best case: FP multiplies are distributed uniformly. There is one multiply in every 7 clock. 1/14% Then there will be no structural hazard,then there is no performance penalty at all. In the worst case: the multiplies are all clustered with no intervening instructions. Then every multiply instruction have to stall 5 clock cycles to wait for the multiplier be released. The CPI will increase 70% to 1.7, if the ideal CPI is 1. Experiment result: This structural hazard increase execution time by less than 3%. 1.13

Summary of Structural hazard Taxonomy of Hazards Structural hazards These are conflicts over hardware resources. OK, maybe add extra hardware resources; or full pipelined the functional units; otherwise still have to stall Allow machine with Structural hazard, since it happens not so often Data hazards Instruction depends on result of prior computation which is not ready (computed or stored) yet Control hazards branch condition and the branch PC are not available in time to fetch an instruction on the next clock 1.14

Pipeline Hazards Taxonomy of Hazards Structural hazards These are conflicts over hardware resources. Data hazards Instruction depends on result of prior computation which is not ready (computed or stored) yet Control hazards branch condition and the branch PC are not available in time to fetch an instruction on the next clock 1.15

Data hazard Data hazards occur when the pipeline changes the order of read/write accesses to operands comparing with that in sequential executing. Let s see an Example DADD R1, R1, R3 DSUB R4, R1, R5 AND R6, R1, R7 OR R8, R1, R9 XOR R10, R1, R11 1.16

Coping with data hazards:example Time ( clock cycle) I n s t r. ADD R1,R2,R3 SUB R4, R1, R5 IM Reg IM R1, read DM R1 w DM Reg. O r d e r AND R6,R1,R7 OR R8,R1,R9 XOR R10,R1,R11 No Hazrd IM R1, read IM R1, read DM IM R1, read 1.17

Data hazard Basic structure An instruction in flight wants to use a data value that s not done yet Done means it s been computed and it s located where I would normally expect to go look in the pipe hardware to find it Basic cause You are used to assuming a purely sequential model of instruction execution Instruction N finishes before instruction N+k, for k >= 1 There are dependencies now between nearby instructions ( near in sequential order of fetch from memory) Consequence+ Data hazards -- instructions want data values that are not done yet, or not in the right place yet 1.18

Somecases Double Bump can do! Time ( clock cycle) I n s t r. ADD R1,R2,R3 SUB R4, R1, R5 IM Reg IM R1, read DM R1 w DM Reg. O r d e r AND R6,R1,R7 OR R8,R1,R9 double bump can do! XOR R10,R1,R11 No Hazard IM R1, read IM R1, read DM IM R1, read 1.19

Proposed solution Proposed solution Don t let them overlap like this? Mechanics Don t let the instruction flow through the pipe In particular, don t let it WRITE any bits anywhere in the pipe hardware that represents REAL CPU state (e.g., register file, memory) Let the instruction wait until the hazard resolved. Name for this operation: PIPELINE STALL 1.20

How do we stall? Insert nop by compiler Time ( clock cycle) I n s t r. ADD R1,R2,R3 NOP IM Reg DM R1 w Bubble Bubble Bubble Bubble Bubble. O r d e r NOP (ADD R0, R0, R0) SUB R4,R1,R5 double bump can do! AND R6, R1,R7 No Hazard Bubble Bubble Bubble Bubble IM R1, read IM R1, read 1.21

How do we stall? Add hardware Interlock! Add extra hardware to detect stall situations Watches the instruction field bits Looks for read versus write conflicts in particular pipe stages Basically, a bunch of careful case logic Add extra hardware to push bubbles thru pipe Actually, relatively easy Can just let the instruction you want to stall GO FORWARD through the pipe but, TURN OFF the bits that allow any results to get written into the machine state So, the instruction executes (it does the work), but doesn t save 1.22

Interlock: insert stalls I n s t r.. O r d e r ADD R1,R2,R3 DSUB, R4, R1,R5 IM Time ( clock cycle) Reg IM AND R6,R1,R7 No Hazard Bubble DM Bubble Empty slots in the pipe called bubbles; means no real instruction work getting saved here R1 w R1, read IM R1, read How the interlock is implementated? 1.23

Recall MIPS Instruction format add R8, R17, R18 is stored in binary format as 00000010 00110010 01000000 00100000 MIPS lays out instructions into fields op operation of the instruction rs first register source operand rt second register source operand rd register destination operand shamt shift amount funct function (select type of operation) 1.24

Detect: Data Hazard Logic Rs =? Rd Rt =? Rd between IF/ID and ID/EX, EX/MEM Stages Rs Rt Rd Rd Rd 1.25

Example DSUB R2, R1, R3 Rd = R2 Rs = R1 Rt = R3 AND R12, R2, R5 Rd = R12 Rs = R2 Rt = R5 OR R13, R6, R2 Rd = R13 Rs = R6 Rt = R2 DADDR14, R2, R2 Rd = R14 Rs = R2 Rt = R2 SW R15, 100(R2) Rd = R15 Rs = R2 Rt = XX SUB-AND Hazard ID/EX.RegRd(sub) == IF/ID. RegRs(and) == R2 SUB-OR Hazard EX/MEM.RegRd(sub) == IF/ID. RegRt(or) == R2 AND-OR: No Hazard ID/EX.RegRd(and)==R12 IF/ID.RegRt Or IF/ID.RegRs 1.26

How to delay the instruction? The Interlock can simulate the NOP: Once it is detected need to add a stall, then Clear the ID/EX.IR to be the instruction of NOP. disable the write signal: wreg, wmem. Reserve the IF/ID. IR unchanged for one more clock cycle. Disable the write signal: WritePC, WriteIR. 1.27

Hardware simulates NOP Time ( clock cycle) I n s t r. ADD R1,R2,R3 NOP IM Reg IM DM R1 w Bubble Bubble Bubble Bubble. O r d e r NOP (ADD R0, R0, R0) SUB R4,R1,R5 double bump can do! AND R6, R1,R7 No Hazard IM Bubble IM Bubble R1, read Bubble IM R1, read 1.28

Forwarding: reduce data hazard stalls If the result you need does not exist AT ALL yet, you are out of luck, sorry. But, what if the result exists, but is not stored back yet? Instead of stalling until the result is stored back in its natural home grab the result on the fly from inside the pipe, and send it to the other instruction (another pipe stage) that wants to use it 1.29

Forwarding Generic name: forwarding ( bypass, shortcircuiting) Instead of waiting to store the result, we forward it immediately (more or less) to the instruction that wants it Mechanically, we add buses to the datapath to move these values around, and these buses always point backwards in the datapath, from later stages to earlier stages 1.30

Forwarding:reduce data hazard stalls Data may be already computed - just not in the Register File Time ( clock cycle) I n s t r.. O r d e r ADD R1,R2,R3 SUB R4, R1, R5 AND R6,R1,R7 IM Reg IM R1, read IM R1 DM R1, read R1 R1 w DM Reg DM EX/MEM.output input port MEM/WB.output input port 1.31

Hardware Change for Forwarding NextPC Registers Immediate ID/EX mux mux When to use the red. Blue EX/MEM Data Memory purple forwording path? MEM/WR mux Source sink EX/Mem.output input MEM/WB.output input MEM/WB.LMD input 1.32

When to use the forwarding path? EX/Mem.output input The previous instruction in EX/MEM is The instruction in ID/EX has Rs or Rt source register EX/MEM.Rd == ID/EX.Rs or EX/MEM.Rd ==ID/EX.Rt MEM/WB.output input The previous instruction in MEM/WB is The instruction in ID/EX has Rs or Rt source register MEM/WB.Rd == ID/EX.Rs or MEM/WB.Rd == IF/ID.Rt MEM/WB.LMD input The previous instruction in MEM/WB is Load The instruction in ID/EX has Rs or Rt source register MEM/WB.Rd == ID/EX.Rs or MEM/WB.Rd == ID/EX.Rt 1.33

How to know if it s or Load? : Wreg == 1 && W2Reg == 0 Load: Rd Wreg ==1 && W2Reg == 1 If reg-reg, then Rd = Rd If Load or reg-imm, then Rd = Rt Rs or Rt If it s Not JMP then it has Rs If it s reg-reg or Store or Branch then it has Rt 1.34

How we implement forwarding in Lab? EX/MEM.output input port 3 Forwarding paths: MEM/WB.output input port (0) MEM/WB.LDMR input port(1) 1.35

Internal forwarding Logic If (Mem.WriteReg and(mem.registerrd 0) and(mem.registerrd ==Exe.RegisterRS)) ForwardA = 01 If (Mem.WriteReg and(mem.registerrd 0) and(mem.registerrd ==Exe.RegisterRt)) ForwardB = 01 If (WB.WriteReg and(wb.registerrd 0) and(mem.registerrd Exe.RegisterRS)) and(wb.registerrd == Exe.RegisterRs)) ForwardA = 10 If (WB.WriteReg and(wb.registerrd 0) and(mem.registerrd Exe.RegisterRt)) and(wb.registerrd == Exe.RegisterRt)) ForwardB = 10 1.36

How to select the forwarding path: the forwarding logic PA-36 in Edition 3th and Edition 4th 含源指令的锁存器源指令的操作含目标指令的锁存器目标指令的操作 ( 后继指令 ) 旁路电路的输入端比较检测 ( 相等则直接送结果 EX/MEM R-R ID/EX,LD,ST,Branch 上端 EX/MEM.IR16..20=ID/EX.IR.6..10 EX/MEM R-R ID/EX R-R 下端 EX/MEM.IR16..20=ID/EX.IR11.15 MEM/WB R-R ID/EX,LD,ST,Branch 上端 MEM/WB.IR16..20=ID/EX.IR6..10 MEM/WB R-R ID/EX R-R 下端 MEM/WB.IR16..20=ID/EX.IR11..15 EX/MEM R-I ID/EX,LD,ST,Branch 上端 EX/MEM.IR11..15=ID/EX.IR6..10 EX/MEM R-I ID/EX R-R 下端 EX/MEM/.IR11..15=ID/EX.IR11..15 EX/MEM.Rd == ID/EX.rs or rt What s the difference with MEM/WB R-I ID/EX,LD,ST,Branch 上端 MEM/WB.IR11..15=ID/EX.IR6..10 MEM/WB R-I ID/EX R-R 下端 MEM/WB.IR11..15=ID/EX.IR11..15 MEM/WB Load ID/EX,LD,ST,Branch 上端 MEM/WB.IR11..15=ID/EX.IR6..10 ID/EX.Rd == IF/ID.rs or rt? MEM/WB Load ID/EX R-R 下端 MEM/WB.IR11..15=ID/EX.IR11..15 1.37

Forwarding path to other input entry Could you give an example where this bypass will be used? store MEM/WB.LMD DM input load 1.38

Forwarding Doesn t Always Work 1.39

So we have to insert stall: Load stall Time (clock cycles) I n s t r. O r d e r lw r1, 0(r2) sub r4,r1,r6 and r6,r1,r7 or r8,r1,r9 Ifetch Reg Ifetch Reg Ifetch DMem Bubble Bubble Bubble Reg Reg Ifetch DMem Reg Reg DMem Reg DMem 1.40

How to implement Load Interlock Detect when should use Load Interlock PA34 situation No dependence Dependence requiring stall Dependence overcome by forwarding Dependence with accesses in order Example code sequence LD R1, 45(R2) DADD R5,R6,R7 DSUB R8,R6,R7 OR R9,R6,R7 LD R1, 45(R2) DADD R5,R1,R7 DSUB R8,R6,R7 OR R9,R6,R7 LD R1, 45(R2) DADD R5,R6,R7 DSUB R8,R1,R7 OR R9,R6,R7 LD R1, 45(R2) DADD R5,R6,R7 DSUB R8,R6,R7 OR R9,R1,R7 Action No hazard possible because of no dependence Comparators detect the use of R1 in the DADD and stall the DADD (and DSUB and OR ) before the DADD begins EX Comparators detect the use of R1 in DSUB and forward result of load to in time for DSUB to begin EX No action required because read of R1 by OR occurs in the second half of the ID phase, while the write of the loaded data occurred in the first half. 1.41

The logic to detect for Load interlock Opcode field of ID/EX Load Opcode Field of IF/ID Reg-Reg Matching operand fields ID/EX.IR[rt]==IF/ID.IR[rs] Load Reg-Reg ID/EX.IR[rt]==IF/ID.IR[rt] Load Load,store, immediate, branch ID/EX.IR[rt]==IF/ID.IR[rs] 1.42

Example of Forwarding and Load Delay Why forwarding? ADD R4, R5, R2 LW R15, 0(R4) SW R15, 4(R2) Why load delay? ADD R4, R5, R2 LW R15, 0(R4) SW R15, 4(R2) 1.43

Solution (without forwarding) 1.44

Solution (with forwarding) EX/MEM.output input port MEM/WB.output DM data write port 1.45

The performance influence of load stall Example Assume 30% of the instructions are loads. Half the time, instruction following a load instruction depends on the result of the load. If hazard causes a single cycle delay, how much faster is the ideal pipeline? Answer CPI = 1+30% 50% 1=1.15 The performance decrease about 15% due to load stall. 1.46

Fraction of load that cause a stall 45% 40% 35% 30% 25% 20% 15% 10% 5% 0% 24% 41% 12% 23% 24% 20% 20% 10% 10% 4% 1.47 Fraction of loads that cause a stall compress eqntott espresso gcc li doduc ear hydro2d mdijdp su2cor

Instruction reordering by compiler to avoid load stall Try producing fast code for a = b + c; d = e f; assuming a, b, c, d,e, and f in memory. Slow code: LW LW ADD SW LW LW SUB SW Rb,b Rc,c Ra,Rb,Rc a,ra Re,e Rf,f Rd,Re,Rf d,rd Fast code: LW LW LW ADD LW SW SUB SW Rb,b Rc,c Re,e Ra,Rb,Rc Rf,f a,ra Rd,Re,Rf d,rd 1.48

Another case: has stall even with forwarding path alu r1, r2, R3 Ifetch Reg DMem Reg BNE r4,r1,l1 Ifetch Bubble Reg DMem Reg Assume branch is resolved in ID stage. 1.49

What we can do? Move the Forwarding path to ID stage Move the forwarding control logic to ID stage 1.50

Summary of Data Hazard Taxonomy of Hazards Structural hazards These are conflicts over hardware resources. Data hazards Instruction depends on result of prior computation which is not ready (computed or stored) yet OK, we did these, Double Bump, Forwarding path, compiler scheduling, otherwise have to stall Control hazards branch condition and the branch PC are not available in time to fetch an instruction on the next clock 1.51

Related materials an overview of pipelining computer processors http://granite.sru.edu/~venkatra/pipelining1.html pipelining: a summery http://wwwee.eng.hawaii.edu/~tep/ee461/notes/pipeline/summ ary.html 1.52