Hiroaki Kobayashi 10/08/2002


Multiple instructions are overlapped in execution. The process of instruction execution is divided into two or more steps, called pipe stages or pipe segments, and different stages complete different parts of different instructions in parallel. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end. Pipelining takes advantage of the parallelism that exists among the actions needed to execute an instruction in a sequential instruction stream. Unlike some speedup techniques, it is not visible to the programmer or compiler.

Key properties of a RISC instruction set: All operations on data apply to data in registers and typically change the entire register. The only operations that affect memory are load and store operations, which move data from memory to a register or from a register to memory, respectively. The instruction formats are few in number, with all instructions typically being one size.

Five steps to execute an instruction:
I. Instruction fetch cycle (IF)
II. Instruction decode/register fetch cycle (ID)
III. Execution/effective address cycle (EX)
IV. Memory access (MEM)
V. Write-back cycle (WB)

Clock number     1    2    3    4    5    6    7    8    9
Instruction i    IF   ID   EX   MEM  WB
Instruction i+1       IF   ID   EX   MEM  WB
Instruction i+2            IF   ID   EX   MEM  WB
Instruction i+3                 IF   ID   EX   MEM  WB
Instruction i+4                      IF   ID   EX   MEM  WB


Throughput: how often an instruction exits the pipeline.
Processor cycle: the time required to move an instruction one step down the pipeline. Because all stages proceed at the same time, the length of a processor cycle is determined by the time required for the slowest pipe stage; the longest step determines the time between advancing the line.

Ideal time per instruction on the pipelined processor = Time per instruction on the unpipelined machine / Number of pipe stages

Ideally, an n-stage pipeline is n times faster, but in practice the stages will not be perfectly balanced, and pipelining adds overhead: latch delay and clock skew.

Example: Assume that a processor has a 1 ns clock cycle and uses 4 cycles for ALU operations and branches and 5 cycles for memory operations, where the relative frequencies of these operations are 40% (ALU), 20% (branches), and 40% (memory). Suppose that, due to clock skew and setup time, pipelining adds 0.2 ns of overhead to the clock cycle. Question: Ignoring any latency impact, how much speedup in the instruction execution rate do we gain from a pipeline?
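The arithmetic in this example can be checked directly; a minimal sketch in Python, using only the numbers given above:

```python
# Average instruction time on the unpipelined processor:
# each instruction takes its full multicycle latency at a 1 ns clock.
clock_ns = 1.0
avg_time_unpipelined = clock_ns * (0.4 * 4    # ALU operations
                                   + 0.2 * 4  # branches
                                   + 0.4 * 5) # memory operations

# On the pipelined processor the ideal CPI is 1, but clock skew and
# setup add 0.2 ns of overhead to every cycle.
avg_time_pipelined = clock_ns + 0.2

speedup = avg_time_unpipelined / avg_time_pipelined
print(f"unpipelined: {avg_time_unpipelined:.1f} ns/instr")  # 4.4 ns
print(f"pipelined:   {avg_time_pipelined:.1f} ns/instr")    # 1.2 ns
print(f"speedup:     {speedup:.2f}")                        # about 3.67
```

So the pipeline yields a speedup of about 3.7, well short of the ideal factor of 5, because of the 0.2 ns per-cycle overhead.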

Structural hazards arise from resource conflicts, when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution. Data hazards arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline. Control hazards arise from the pipelining of branches and other instructions that change the PC. Hazards in pipelines can make it necessary to stall the pipeline!

Speedup from pipelining
  = Average instruction time unpipelined / Average instruction time pipelined
  = (CPI unpipelined x Clock cycle unpipelined) / (CPI pipelined x Clock cycle pipelined)
  = (CPI unpipelined / CPI pipelined) x (Clock cycle unpipelined / Clock cycle pipelined)

Because the ideal CPI on a pipelined processor is almost always 1,
  CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
                = 1 + Pipeline stall clock cycles per instruction

If there is no pipeline overhead,
  Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)

In the simple case, the unpipelined CPI is equal to the depth of the pipeline, so
  Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)

If pipelining improves the clock cycle time,
  Speedup = (CPI unpipelined x Clock cycle unpipelined) / (CPI pipelined x Clock cycle pipelined)
          = [1 / (1 + Pipeline stall cycles per instruction)] x (Clock cycle unpipelined / Clock cycle pipelined)

In cases where the pipe stages are perfectly balanced and there is no overhead,
  Clock cycle pipelined = Clock cycle unpipelined / Pipeline depth
  Pipeline depth = Clock cycle unpipelined / Clock cycle pipelined

Therefore,
  Speedup = [1 / (1 + Pipeline stall cycles per instruction)] x (Clock cycle unpipelined / Clock cycle pipelined)
          = Pipeline depth / (1 + Pipeline stall cycles per instruction)

Structural hazards occur when some combination of instructions cannot be accommodated because of resource conflicts: some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute. Examples: a read access in ID and a write access in WB to the register file; a single memory pipeline shared by data and instructions.
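The final formula lends itself to a one-line helper; a sketch, where the 0.5-stall figure in the usage example is an illustrative assumption, not a number from the text:

```python
def pipeline_speedup(depth, stalls_per_instruction):
    """Speedup of a balanced, overhead-free pipeline over the
    unpipelined processor: depth / (1 + stall cycles per instruction)."""
    return depth / (1 + stalls_per_instruction)

# Ideal case: no stalls, so an n-stage pipeline is n times faster.
assert pipeline_speedup(5, 0.0) == 5.0

# With hazards: 0.5 stall cycles per instruction on a 5-stage pipeline
# cuts the speedup from 5x to about 3.33x.
print(pipeline_speedup(5, 0.5))
```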

With a single-memory pipeline, a data access by instruction i conflicts with the instruction fetch of instruction i+3, which must stall:

Clock number     1    2    3    4      5      6    7    8    9    10
Instruction i    IF   ID   EX   MEM    WB
Instruction i+1       IF   ID   EX     MEM    WB
Instruction i+2            IF   ID     EX     MEM  WB
Instruction i+3                 stall  IF     ID   EX   MEM  WB
Instruction i+4                        stall  IF   ID   EX   MEM

Solution: separate (cache) memory systems for instructions and data.

Suppose that data references constitute 40% of the mix, the ideal CPI of the pipelined processor is 1, and the processor with the structural hazard has a clock rate that is 1.05 times higher than the clock rate of the processor without the hazard. Question: Disregarding any other performance losses, is the pipeline with or without the structural hazard faster, and by how much?

A processor without structural hazards will always have a lower CPI. Why, then, would a designer allow a structural hazard? It is a tradeoff between cost and performance gains: pipelining all the functional units, or duplicating them, may be too costly. For example, processors that support both an instruction and a data cache access every cycle require twice as much total memory bandwidth, and often need higher bandwidth at the pins.
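The question above works out as follows; a sketch in Python using only the figures given (40% data references, ideal CPI of 1, a 1.05x clock rate for the hazard-laden design):

```python
# Clock cycle of the processor WITHOUT the hazard, in arbitrary units.
clock_without = 1.0
# The processor WITH the hazard clocks 1.05x higher, i.e. a shorter cycle.
clock_with = clock_without / 1.05

# Without the hazard: ideal CPI of 1.
time_without = 1.0 * clock_without

# With the hazard: every data reference (40% of the mix) costs one
# extra stall cycle, so CPI = 1 + 0.4 = 1.4.
time_with = (1 + 0.4) * clock_with

print(time_with / time_without)  # about 1.33
```

So the pipeline without the structural hazard is roughly 1.3 times faster, even though the other design has the faster clock.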

Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined processor. Example (operation destination, source1, source2):

DADD R1, R2, R3
DSUB R4, R1, R5
AND  R6, R1, R7
OR   R8, R1, R9
XOR  R10, R1, R11

Every instruction after the DADD reads the R1 it produces; whether a given reader sees a data hazard or no hazard depends on how far down the pipeline it is when the DADD writes its result back.
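The dependences in this sequence can be found mechanically; a minimal sketch, where the tuple encoding of instructions is an assumption for illustration:

```python
# Each instruction: (opcode, destination, source1, source2)
code = [
    ("DADD", "R1",  "R2", "R3"),
    ("DSUB", "R4",  "R1", "R5"),
    ("AND",  "R6",  "R1", "R7"),
    ("OR",   "R8",  "R1", "R9"),
    ("XOR",  "R10", "R1", "R11"),
]

def raw_dependences(instructions):
    """Return (producer_index, consumer_index, register) for every
    read-after-write dependence in program order."""
    deps = []
    for j, (_, _, src1, src2) in enumerate(instructions):
        for i in range(j):
            dest = instructions[i][1]
            if dest in (src1, src2):
                deps.append((i, j, dest))
    return deps

for i, j, reg in raw_dependences(code):
    print(f"{code[j][0]} depends on {code[i][0]} through {reg}")
```

Running this reports that DSUB, AND, OR, and XOR all depend on DADD through R1, which is exactly the hazard pattern the example illustrates.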

If the result can be moved from the pipeline register where the DADD stores it to where the DSUB needs it, the need for a stall can be avoided. Data forwarding (bypassing) mechanism: the ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs. If the forwarding hardware detects that a previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file. [Figure: data forwarding path]
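The selection logic described above can be sketched as follows, assuming a toy model in which each pipeline register holds a (destination register, value) pair:

```python
def alu_input(src_reg, regfile, ex_mem, mem_wb):
    """Select one ALU input. ex_mem and mem_wb model the pipeline
    registers as (dest_reg, value) pairs, or None if they hold no
    pending ALU result.

    Forwarding priority: the most recent producer wins, so EX/MEM
    (the instruction immediately ahead) is checked before MEM/WB.
    """
    if ex_mem is not None and ex_mem[0] == src_reg:
        return ex_mem[1]        # forwarded from the previous ALU op
    if mem_wb is not None and mem_wb[0] == src_reg:
        return mem_wb[1]        # forwarded from two instructions back
    return regfile[src_reg]     # no hazard: normal register file read

# DADD R1,R2,R3 has just finished EX; DSUB R4,R1,R5 is entering EX.
regfile = {"R1": 0, "R5": 7}    # R1 in the register file is still stale
ex_mem = ("R1", 42)             # DADD's result, not yet written back
assert alu_input("R1", regfile, ex_mem, None) == 42  # bypassed value
assert alu_input("R5", regfile, ex_mem, None) == 7   # ordinary read
```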

Example:
LD   R1, 0(R2)
DSUB R4, R1, R5
AND  R6, R1, R7
OR   R8, R1, R9

In the sequence
LD   R1, 0(R2)
DSUB R4, R1, R5
AND  R6, R1, R7
OR   R8, R1, R9
the pipeline interlock delays each of the dependent instructions by one cycle:
LD   R1, 0(R2)
DSUB R4, R1, R5   (stall)
AND  R6, R1, R7   (stall)
OR   R8, R1, R9   (stall)
CPI increases by the length of the stall!

Execution of a branch instruction may or may not change the PC to something other than its current execution sequence:
Branch instruction
Branch successor   (stall)
Branch successor + 1
Branch successor + 2
Fetch is restarted once the branch target is known.
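The interlock's effect on the issue sequence can be sketched as follows; a toy model that assumes full forwarding, so only the instruction immediately after the load needs a bubble (the later instructions are simply pushed back by it):

```python
def insert_load_stalls(instructions):
    """instructions: list of (opcode, dest, src1, src2) tuples.
    Return the issue sequence with a 'stall' bubble inserted whenever a
    load's result is used by the immediately following instruction
    (the value is not available until after MEM, one cycle too late)."""
    issued = []
    for idx, (op, dest, src1, src2) in enumerate(instructions):
        prev = instructions[idx - 1] if idx > 0 else None
        if prev and prev[0] == "LD" and prev[1] in (src1, src2):
            issued.append("stall")  # one-cycle load-use bubble
        issued.append(op)
    return issued

code = [
    ("LD",   "R1", "R2", None),
    ("DSUB", "R4", "R1", "R5"),
    ("AND",  "R6", "R1", "R7"),
    ("OR",   "R8", "R1", "R9"),
]
print(insert_load_stalls(code))  # ['LD', 'stall', 'DSUB', 'AND', 'OR']
```

The single bubble after the LD is what delays DSUB, AND, and OR each by one cycle relative to the stall-free schedule.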

Delayed branch:
Branch instruction
Sequential successor (branch delay slot)
Branch target, if taken

The instruction in the delay slot is executed whether or not the branch is taken; for a taken branch, the execution order is: taken branch instruction, branch delay instruction, branch target, branch target + 1.

Execution of the instruction in the delay slot will be nullified when the branch is incorrectly predicted.