Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats




Pipelining
CSE 410, Spring 2005
Computer Systems
http://www.cs.washington.edu/410

Execution Cycle
1. Instruction Fetch
2. Instruction Decode
3. Execute
4. Memory
5. Write Back

IF and ID Stages
1. Instruction Fetch
» Get the next instruction from memory
» Increment Program Counter value by 4
2. Instruction Decode
» Figure out what the instruction says to do
» Get values from the named registers
» Simple instruction format means we know which registers we may need before the instruction is fully decoded

Simple MIPS Instruction Formats
R   op code   source 1   source 2   dest     shamt    function
    6 bits    5 bits     5 bits     5 bits   5 bits   6 bits
I   op code   base reg   src/dest   offset or immediate value
    6 bits    5 bits     5 bits     16 bits
J   op code   word offset
    6 bits    26 bits
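The fixed field positions are what let the decode stage pull out register numbers before it knows which instruction it has. A minimal sketch of that idea follows. The function name `decode` and the short field names `rs`, `rt`, `rd`, `imm`, and `target` are not from the slides; here they stand for source 1 / base reg, source 2 / src/dest, dest, the 16-bit immediate, and the 26-bit word offset.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into the fields of all three formats.

    Every field is extracted unconditionally; which ones are meaningful
    depends on the op code (R, I, or J format).
    """
    return {
        "op":     (word >> 26) & 0x3F,   # 6 bits, present in every format
        "rs":     (word >> 21) & 0x1F,   # R: source 1, I: base reg
        "rt":     (word >> 16) & 0x1F,   # R: source 2, I: src/dest
        "rd":     (word >> 11) & 0x1F,   # R: dest
        "shamt":  (word >> 6) & 0x1F,    # R: shift amount
        "funct":  word & 0x3F,           # R: function
        "imm":    word & 0xFFFF,         # I: offset or immediate value
        "target": word & 0x3FFFFFF,      # J: word offset
    }


# add $s0, $s1, $s2 as an R-format word:
# 000000 10001 10010 10000 00000 100000
fields = decode(0x02328020)
print(fields["op"], fields["rs"], fields["rt"], fields["rd"], fields["funct"])
# prints: 0 17 18 16 32
```

Registers 17, 18, and 16 are $s1, $s2, and $s0, so the register reads can start while the op code and function fields are still being interpreted.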

EX, MEM, and WB stages
3. Execute
» On a memory reference, add up base and offset
» On an arithmetic instruction, do the math
4. Memory Access
» If load or store, access memory
» If branch, replace PC with destination address
» Otherwise do nothing
5. Write back
» Place the results in the appropriate register

Example: add $s0, $s1, $s2
IF   get instruction at PC from memory
     op code   source 1   source 2   dest     shamt    function
     000000    10001      10010      10000    00000    100000
ID   determine what instruction is and read registers
     » 000000 with 100000 is the add instruction
     » get contents of $s1 and $s2 (e.g. $s1=7, $s2=12)
EX   add 7 and 12 = 19
MEM  do nothing for this instruction
WB   store 19 in register $s0

Example: lw $t2, 16($s0)
IF   get instruction at PC from memory
     op code   base reg   src/dest   offset or immediate value
     100011    10000      01010      0000000000010000
ID   determine what 100011 is
     » 100011 is lw
     » get contents of $s0 and $t2 (at this point we don't know that we don't care about $t2)
       $s0=0x200D1C00, $t2=77763
EX   add 16 to 0x200D1C00 = 0x200D1C10
MEM  load the word stored at 0x200D1C10
WB   store loaded value in $t2

Latency & Throughput
(Timing diagram: clock cycles 1 to 10; inst 1 followed by inst 2, not overlapped.)
Latency: the time it takes for an individual instruction to execute
» What's the latency for this implementation?
» One instruction takes 5 clock cycles
» Cycles per Instruction (CPI) = 5
Throughput: the number of instructions that execute per unit time
» What's the throughput of this implementation?
» One instruction is completed every 5 clock cycles
» Average CPI = 5
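The two worked examples and the cycle count can be traced with a small non-overlapped model. This is only a sketch: the function name `run_unpipelined`, the tuple layout for instructions, and the word stored at 0x200D1C10 (42 here) are made up for illustration, since the slides do not give the memory contents.

```python
STAGES = 5  # IF, ID, EX, MEM, WB, one clock cycle each


def run_unpipelined(program, regs, mem):
    """Run instructions one at a time, with no overlap, and return cycles used.

    Each instruction is a tuple:
      ("add", dest, src1, src2)   or   ("lw", dest, base, offset)
    """
    cycles = 0
    for op, dest, a, b in program:        # IF/ID: fetch and identify operands
        if op == "add":
            result = regs[a] + regs[b]    # EX: do the math (MEM does nothing)
        elif op == "lw":
            address = regs[a] + b         # EX: add up base and offset
            result = mem[address]         # MEM: load the word
        else:
            raise ValueError(f"unsupported instruction: {op}")
        regs[dest] = result               # WB: place result in the register
        cycles += STAGES
    return cycles


regs = {"$s0": 0x200D1C00, "$s1": 7, "$s2": 12, "$t2": 77763}
mem = {0x200D1C10: 42}                    # hypothetical memory contents
program = [("lw", "$t2", "$s0", 16), ("add", "$s0", "$s1", "$s2")]

cycles = run_unpipelined(program, regs, mem)
print(regs["$t2"], regs["$s0"], cycles, cycles / len(program))
# prints: 42 19 10 5.0
```

Two instructions take 10 cycles, which is the CPI of 5 stated above.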

A case for pipelining
If execution is non-overlapped, the functional units are underutilized because each unit is used only once every five cycles
If Instruction Set Architecture is carefully designed, organization of the functional units can be arranged so that they execute in parallel
Pipelining overlaps the stages of execution so every stage has something to do each cycle

Pipelined Latency & Throughput
(Timing diagram: clock cycles 1 to 9; inst 1 through inst 5, overlapped.)
What's the latency of this implementation?
What's the throughput of this implementation?

Pipelined Analysis
Throughput is good! A pipeline with N stages could improve throughput by N times, but
» each stage must take the same amount of time
» each stage must always have work to do
» there may be some overhead to implement
Also, latency for each instruction may go up
» Within some limits, we don't care
(Figure: sequential vs. overlapped execution; axes: increasing number of instructions, increasing time.)
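The "N times" claim can be checked with a little arithmetic. The sketch below assumes an ideal pipeline: one cycle per stage, no stalls, and no implementation overhead. The function names are made up for illustration.

```python
def unpipelined_cycles(n_instructions, n_stages=5):
    """Each instruction runs all of its stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=5):
    """The first instruction fills the pipeline; each later one finishes a cycle after."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


for n in (1, 5, 1000):
    speedup = unpipelined_cycles(n) / pipelined_cycles(n)
    print(n, unpipelined_cycles(n), pipelined_cycles(n), round(speedup, 2))
# prints:
# 1 5 5 1.0
# 5 25 9 2.78
# 1000 5000 1004 4.98
```

Five instructions finish in 9 cycles, matching the cycle axis of the pipelined diagram. The latency of a single instruction is still 5 cycles, but the speedup approaches the stage count of 5 as the instruction stream gets longer.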

MIPS ISA: Born to Pipeline
Instructions all one length
» simplifies Instruction Fetch stage
Regular format
» simplifies Instruction Decode
Few memory operands, only registers
» only load and store instructions (such as lw and sw) access memory
Aligned memory operands
» only one memory access per operand

Memory accesses
Efficient pipeline requires each stage to take about the same amount of time
CPU is much faster than memory hardware
Cache is provided on chip
» i-cache holds instructions
» d-cache holds data
» critical feature for successful RISC pipeline
» more about caches next week

The Hazards of Parallel Activity
Any time you get several things going at once, you run the risk of interactions and dependencies
» juggling doesn't take kindly to irregular events
Unwinding activities after they have started can be very costly in terms of performance
» drop everything on the floor and start over

Design for Speed
Most of what we talk about next relates to the CPU hardware itself
» problems keeping a pipeline full
» solutions that are used in the MIPS design
Some programmer visible effects remain
» many are hidden by the assembler or compiler
» the code that you write tells what you want done, but the tools rearrange it for speed

Pipeline Hazards
Structural hazards
» Instructions in different stages need the same resource, e.g. memory
Data hazards
» data not available to perform next operation
Control hazards
» data not available to make branch decision

Structural Hazards
Concurrent instructions want same resource
» lw instruction in stage four (memory access)
» add instruction in stage one (instruction fetch)
» Both of these actions require access to memory; they would collide if not designed for
Add more hardware to eliminate problem
» separate instruction and data caches
Or stall (cheaper & easier), not usually done

Data Hazards
When an instruction depends on the results of a previous instruction still in the pipeline
This is a data dependency
add $s0, $s1, $s2     ($s0 is written here)
add $s4, $s3, $s0     ($s0 is read here)

Stall for register data dependency
Stall the pipeline until the result is available
» this would create a 3-cycle pipeline bubble
(Pipeline diagram: add s0,s1,s2 followed by stall cycles before the dependent add.)
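The 3-cycle bubble follows from the stage timing. The sketch below assumes the simplest case: no forwarding, and a register written in WB can be read by ID only in the following cycle. The function name `count_raw_stalls` and the tuple layout (destination first, then source registers) are made up for illustration.

```python
WB_TO_ID_GAP = 4  # ID -> EX -> MEM -> WB, then readable in the next cycle


def count_raw_stalls(program):
    """Count stall cycles caused by read-after-write dependences.

    Each instruction is a tuple (dest, src1, src2, ...).
    Assumes an in-order five-stage pipeline with no forwarding.
    """
    ready = {}        # register -> first cycle in which ID may read it
    prev_id = 1       # so the first instruction decodes in cycle 2
    stalls = 0
    for dest, *sources in program:
        earliest = prev_id + 1                      # ID cycle with no hazard
        id_cycle = max([earliest] + [ready.get(r, 0) for r in sources])
        stalls += id_cycle - earliest               # bubbles inserted
        ready[dest] = id_cycle + WB_TO_ID_GAP
        prev_id = id_cycle
    return stalls


program = [
    ("$s0", "$s1", "$s2"),   # add $s0, $s1, $s2
    ("$s4", "$s3", "$s0"),   # add $s4, $s3, $s0  (reads $s0)
]
print(count_raw_stalls(program))
# prints: 3
```

Placing one unrelated instruction between the two adds would bring the count down to 2 under the same assumptions, since the dependent add reaches ID one cycle later anyway.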

Read & Write in same Cycle
Write the register in the first part of the clock cycle
Read it in the second part of the clock cycle
A 2-cycle stall is still required
(Pipeline diagram: add s0,s1,s2; write $s0 and read $s0 in the same cycle.)

Solution: Forwarding
The value of $s0 is known internally after cycle 3 (after the first instruction's EX stage)
The value of $s0 isn't needed until cycle 4 (before the second instruction's EX stage)
If we forward the result there isn't a stall
(Pipeline diagram: add s0,s1,s2 with the EX result forwarded to the next instruction.)

Another data hazard
What if the first instruction is lw?
» s0 isn't known until after the MEM stage
» We can't forward back into the past
Either stall or reorder instructions
(Pipeline diagram: lw s0,0(s2); forwarding backward in time is marked NO!)

Stall for lw hazard
We can stall for one cycle, but we hate to stall
(Pipeline diagram: lw s0,0(s2); the dependent instruction proceeds IF, ID, stall, EX, MEM, WB.)
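The three stall counts on these slides (2 with split-cycle register access, 0 with forwarding, 1 for a load) fit one small rule. This is a sketch with made-up names. It assumes a five-stage pipeline in which an ALU result is ready after EX, a loaded value is ready after MEM, and the register file is written in the first half of a cycle and read in the second.

```python
def stall_cycles(producer, distance, forwarding):
    """Stalls for a consumer issued `distance` instructions after `producer`.

    producer:   "alu" (result ready after EX) or "lw" (result ready after MEM)
    distance:   1 means the consumer is the very next instruction
    forwarding: True if results are forwarded to the consumer's EX stage
    """
    if forwarding:
        # ALU result feeds the next EX directly; a load needs one more cycle.
        min_distance = 2 if producer == "lw" else 1
    else:
        # Consumer's ID must share a cycle with the producer's WB.
        min_distance = 3
    return max(0, min_distance - distance)


print(stall_cycles("alu", 1, forwarding=False))  # prints: 2
print(stall_cycles("alu", 1, forwarding=True))   # prints: 0
print(stall_cycles("lw", 1, forwarding=True))    # prints: 1
print(stall_cycles("lw", 2, forwarding=True))    # prints: 0
```

The last line is the case that motivates reordering: with one unrelated instruction between the lw and its consumer, the load stall disappears.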

Instruction Reorder for lw hazard
(Pipeline diagram: lw s0,0(s2) followed by sub t4,t2,t3.)

Reordering Instructions
Try to execute an unrelated instruction between the two instructions
(Pipeline diagram: lw s0,0(s2), then sub t4,t2,t3.)
Reordering instructions is a common technique for avoiding pipeline stalls
Static reordering
» programmer, compiler and assembler do this
Dynamic reordering
» modern processors can see several instructions
» they execute any that have no dependency
» this is known as out-of-order execution and is complicated to implement, but effective
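Static reordering can be sketched as a small pass over straight-line code. It is a toy model, not how any particular compiler or assembler does it: the function names and the (op, dest, sources) tuple layout are made up for illustration, branches are not modeled, and memory instructions are never moved. The dependent add in the usage example is assumed from the earlier data hazard slide.

```python
def independent(a, b):
    """True if a and b may be swapped: no RAW, WAR, or WAW dependence."""
    _, a_dest, a_srcs = a
    _, b_dest, b_srcs = b
    return a_dest != b_dest and a_dest not in b_srcs and b_dest not in a_srcs


def fill_load_slots(program):
    """Move an unrelated later instruction between a lw and its immediate consumer.

    Each instruction is (op, dest, [sources]); sw would use dest=None.
    """
    prog = list(program)
    i = 0
    while i + 1 < len(prog):
        op, dest, _ = prog[i]
        if op == "lw" and dest in prog[i + 1][2]:          # load-use pair found
            for k in range(i + 2, len(prog)):
                cand = prog[k]
                movable = (
                    cand[0] not in ("lw", "sw")            # keep memory order
                    and dest not in cand[2]                # must not use the load
                    and all(independent(cand, prog[j]) for j in range(i + 1, k))
                )
                if movable:
                    prog.insert(i + 1, prog.pop(k))        # hoist into the slot
                    break
        i += 1
    return prog


program = [
    ("lw",  "s0", ["s2"]),          # lw  s0,0(s2)
    ("add", "s4", ["s3", "s0"]),    # add s4,s3,s0  (assumed consumer of s0)
    ("sub", "t4", ["t2", "t3"]),    # sub t4,t2,t3  (unrelated)
]
print([op for op, _, _ in fill_load_slots(program)])
# prints: ['lw', 'sub', 'add']
```

If no later instruction is independent of everything it would have to jump over, the program is left unchanged and the one-cycle stall remains.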