HW 5 Solutions. Manoj Mardithaya

Question 1: Processor Performance

The critical path latencies for the 7 major blocks in a simple processor are given below.

     IMem    Add     Mux     ALU     Regs    DMem     Control
a.   400ps   100ps   30ps    120ps   200ps   350ps    100ps
b.   500ps   150ps   100ps   180ps   220ps   1000ps   65ps

For each part, answer the following questions:

1. What is the critical path for a MIPS ADD instruction? Explain your break-up.

Using the diagram in the book, three of the many possible paths through the circuit are:

   Add + Add + Mux
   IMem + Control + Mux + ALU + DMem + Mux
   IMem + Mux + Regs + Mux + ALU + DMem + Mux

The longest of these (and of the other, shorter combinations not listed) is the critical path. Note that the instruction type does not affect the critical path: the path through the circuit is the same regardless, because it determines how quickly you can start the next operation. Even if an instruction finishes its meaningful work early, the signals still have to propagate through the rest of the circuit so that timing stays in sync.

For both a. and b., the longest is the third path:

a. 400 + 30 + 200 + 30 + 120 + 350 + 30 = 1160ps
b. 500 + 100 + 220 + 100 + 180 + 1000 + 100 = 2200ps

2. If the number of registers is doubled, this increases Regs by 100ps and Control by 20ps. This results in 10% fewer instructions due to fewer loads/stores. What is the new critical path for a MIPS ADD instruction?

Again, the 10% fewer instructions does not affect the critical path (it would affect program runtime, because of the fewer instructions). The changes do not alter which path is critical; they only lengthen it, to 1260ps and 2300ps respectively.
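As a quick sanity check on the arithmetic above, here is a small Python sketch (not part of the original solution) that sums each of the three candidate paths over the latency table. The dictionary and path lists are just an illustrative encoding of the table, not a full model of the datapath.

```python
# Sketch: compare candidate path delays from the latency table (values in ps).
# The path lists mirror the three paths named in the solution; they are
# illustrative, not an exhaustive enumeration of the datapath.

latencies = {
    "a": {"IMem": 400, "Add": 100, "Mux": 30, "ALU": 120,
          "Regs": 200, "DMem": 350, "Control": 100},
    "b": {"IMem": 500, "Add": 150, "Mux": 100, "ALU": 180,
          "Regs": 220, "DMem": 1000, "Control": 65},
}

paths = [
    ["Add", "Add", "Mux"],
    ["IMem", "Control", "Mux", "ALU", "DMem", "Mux"],
    ["IMem", "Mux", "Regs", "Mux", "ALU", "DMem", "Mux"],
]

for design, lat in latencies.items():
    delays = [sum(lat[block] for block in path) for path in paths]
    print(design, delays, "critical path =", max(delays), "ps")
    # a -> [230, 1030, 1160] critical path = 1160 ps
    # b -> [400, 1945, 2200] critical path = 2200 ps
```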

Question 2: Pipelining

The 5 stages of the processor have the following latencies:

     Fetch   Decode   Execute   Memory   Writeback
a.   300ps   400ps    350ps     500ps    100ps
b.   200ps   150ps    120ps     190ps    140ps

Assume that when pipelining, each pipeline stage costs 20ps extra for the registers between pipeline stages.

1. Non-pipelined processor: what is the cycle time? What is the latency of an instruction? What is the throughput?

There is no pipelining, so the cycle time has to allow an instruction to go through all the stages in each cycle. Therefore:

a. CT = 1650ps
b. CT = 800ps

The latency of an instruction is the same, since each instruction takes 1 cycle to go from the beginning of fetch to the end of writeback. The throughput, similarly, is 1 / (cycle time) instructions per second.

2. Pipelined processor: what is the cycle time? What is the latency of an instruction? What is the throughput?

Pipelining into 5 stages reduces the cycle time to the length of the longest stage. Additionally, the cycle time needs to be slightly longer to accommodate the register at the end of the stage.

a. CT = 520ps
b. CT = 220ps

The latency for both is 5 × (cycle time), since an instruction needs to go through 5 pipeline stages, spending 1 cycle in each, before it commits. The throughput for both is still 1 instruction/cycle; throughput improves because the cycle time is shorter.

3. If you could split one of the pipeline stages into 2 equal halves, which one would you choose? What is the new cycle time? What is the new latency? What is the new throughput?

Splitting the longest stage is the only way to reduce the cycle time. After splitting it, the new cycle time is set by the new longest stage.

a. Old longest stage is Memory. New CT = 420ps
b. Old longest stage is Fetch. New CT = 210ps

The new latency is 6 × (cycle time), since an instruction now needs to go through 6 pipeline stages. The throughput for both is still 1 instruction/cycle; it improves because the cycle time is shorter.
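The cycle-time, latency, and split-stage numbers above can be reproduced with a short sketch like the one below (an illustration, not part of the original solution). The stage dictionary and the REG_OVERHEAD constant simply restate the problem data.

```python
# Sketch: cycle time, latency, and split-stage numbers for Question 2 (values in ps).

REG_OVERHEAD = 20  # extra register delay per pipeline stage, per the problem statement

stages = {
    "a": {"Fetch": 300, "Decode": 400, "Execute": 350, "Memory": 500, "Writeback": 100},
    "b": {"Fetch": 200, "Decode": 150, "Execute": 120, "Memory": 190, "Writeback": 140},
}

for design, lat in stages.items():
    # Part 1: non-pipelined -- one long cycle through every stage.
    ct_single = sum(lat.values())

    # Part 2: pipelined -- longest stage plus register overhead.
    ct_pipe = max(lat.values()) + REG_OVERHEAD

    # Part 3: split the longest stage into two equal halves, then re-evaluate.
    split = dict(lat)
    longest = max(split, key=split.get)
    split[longest + "/1"] = split[longest + "/2"] = split.pop(longest) / 2
    ct_split = max(split.values()) + REG_OVERHEAD

    print(design, ct_single, ct_pipe, 5 * ct_pipe, ct_split, 6 * ct_split)
    # a: 1650  520  2600  420  2520
    # b:  800  220  1100  210  1260
```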

4. Assume the distribution of instructions that run on the processor is:

   50% ALU
   25% BEQ
   15% LW
   10% SW

Assuming there are no stalls or hazards, what is the utilization of the data memory? What is the utilization of the register block's write port? (Utilization is measured as the percentage of clock cycles used.)

Data memory is used only by LW and SW instructions in the MIPS ISA, so its utilization is 25% of the clock cycles. The register write port may be used by ALU and LW instructions, so its utilization is 65% of the clock cycles. Calculating utilization as a percentage of time, though, gives different results: unless the corresponding stage (data memory or writeback) is the one dictating the cycle time, that circuit sits idle for part of every clock cycle.

Question 3: Stalling

Sequence of instructions:

1. lw  $s2, 0($s1)
2. lw  $s1, 40($s6)
3. sub $s6, $s1, $s2
4. add $s6, $s2, $s2
5. or  $s3, $s6, $zero
6. sw  $s6, 50($s1)

1. Data dependencies:

A data dependence is a dependence of one instruction B on another instruction A because the value produced by A is read by B. So when listing data dependencies, you need to mention A, B, and the location (in this case, the register) that causes the dependence. It also does not matter how far apart the two instructions are. Don't confuse data dependencies with hazards!

Dependencies:

3 depends on 1 ($s2)
3 depends on 2 ($s1)
4 depends on 1 ($s2)
5 depends on 4 ($s6)
6 depends on 2 ($s1)
6 depends on 4 ($s6)
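If you want to check the dependency list mechanically, here is a small Python sketch that finds the read-after-write dependencies in the six-instruction sequence. The (number, destination, sources) encoding is an assumption made for illustration, not real MIPS decoding.

```python
# Sketch: list RAW (read-after-write) data dependencies for the Question 3 sequence.
# Each entry is (instruction number, destination register, source registers).

program = [
    (1, "$s2", ["$s1"]),           # lw  $s2, 0($s1)
    (2, "$s1", ["$s6"]),           # lw  $s1, 40($s6)
    (3, "$s6", ["$s1", "$s2"]),    # sub $s6, $s1, $s2
    (4, "$s6", ["$s2", "$s2"]),    # add $s6, $s2, $s2
    (5, "$s3", ["$s6", "$zero"]),  # or  $s3, $s6, $zero
    (6, None,  ["$s6", "$s1"]),    # sw  $s6, 50($s1)  (a store writes no register)
]

for i, (num_b, _, sources) in enumerate(program):
    for reg in dict.fromkeys(sources):            # de-duplicate repeated sources
        # Find the most recent earlier instruction that wrote this register.
        for num_a, dest, _ in reversed(program[:i]):
            if dest == reg:
                print(f"{num_b} depends on {num_a} ({reg})")
                break
# Prints the same six dependencies listed above (in a slightly different order).
```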

2. Assume the 5-stage MIPS pipeline with no forwarding, and each stage takes 1 cycle. Instead of inserting nops, you let the processor stall on hazards. How many times does the processor stall? How long is each stall (in cycles)? What is the execution time (in cycles) for the whole program?

Ignoring the stalls for a moment, the program takes 10 cycles to execute, not 6: during the first 4 cycles the pipeline is still filling up, so no instruction commits (finishes). It is also not 30 cycles, because when the first instruction commits, the 2nd instruction is nearly done and will commit in the next cycle. Remember, pipelines allow multiple instructions to be executing at the same time.

With the stalls, there are only two stalls: after the 2nd load, and after the add. Both occur because the next instruction needs the value being produced. Without forwarding, the next instruction is stuck in the fetch stage until the previous instruction writes back. These are 2-cycle stalls (when in doubt, draw diagrams like the ones on the Implementing MIPS slides, slide 63). So to answer the question: 2 stalls, 2 cycles each, and the total is 10 + 2 × 2 = 14 cycles to execute the program.

3. Assume the 5-stage MIPS pipeline with full forwarding. Write the program with nops to eliminate the hazards. (Hint: time travel is not possible!)

Again, draw a diagram. For the second stall, the result of the add is available in the register at the end of the execute stage just as the next instruction wants to move into the execute stage, so you can forward the value of $s6 as the input to the execute stage (as the argument for the or); this removes the second 2-cycle stall. However, the loaded value is not ready until the end of the memory stage, so you cannot use forwarding to remove both cycles of the first stall; you still need to wait 1 cycle. The solution is to place a nop after the second load. (Note: this is also the reason why MIPS has load delay slots.)

Speaking of delay slots, the question was ambiguous about whether load delay slots are used. If they are, the answers change slightly:

Part 1: There is one fewer dependence; 3 does not depend on 2, because it is in the delay slot.
Part 2: The first stall is only 1 cycle, so the program executes in 13 cycles.
Part 3: No nops are required, because of the delay slot.

Question 4: More Pipelines

You are given a non-pipelined processor design which has a cycle time of 10ns and an average CPI of 1.4. Calculate the latency speedup in the following questions. The solutions given assume that pipelining keeps the base CPI of 1.4. Since the question is ambiguous, you could instead assume pipelining changes the CPI to 1; the method for computing the answers still applies.
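The cycle counts above follow the usual formula: pipeline depth + instructions - 1, plus any stall cycles. A minimal sketch of that arithmetic, assuming the stall lengths stated in the solution, is:

```python
# Sketch: execution time for Question 3, part 2 (5-stage pipeline, no forwarding).
# Uses the standard formula: pipeline fill + one commit per cycle + stall cycles.

STAGES = 5
instructions = 6

base_cycles = STAGES + instructions - 1      # 10 cycles if there were no hazards

stalls = [2, 2]                              # after the 2nd lw and after the add
print(base_cycles + sum(stalls))             # 14 cycles, as computed above

# Delay-slot variant noted at the end of part 3: the first stall shrinks to 1 cycle.
print(base_cycles + sum([1, 2]))             # 13 cycles
```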

1. What is the best speedup you can get by pipelining it into 5 stages?

Since IC and CPI don't change, and, in the best case, pipelining reduces CT to 2ns:

Speedup = CT_old / CT_new = 10ns / 2ns = 5x

2. If the 5 stages are 1ns, 1.5ns, 4ns, 3ns, and 0.5ns, what is the best speedup you can get compared to the original processor?

The cycle time is limited by the slowest stage, so CT = 4ns.

Speedup = CT_old / CT_new = 10ns / 4ns = 2.5x

3. If each pipeline stage added also adds 20ps due to register setup delay, what is the best speedup you can get compared to the original processor?

Adding the pipeline-register delay to the cycle time gives CT = 4.02ns.

Speedup = CT_old / CT_new = 10ns / 4.02ns = 2.49x

4. The pipeline from Q4.3 stalls 20% of the time for 1 cycle and 5% of the time for 2 cycles (these occurrences are disjoint). What is the new CPI? What is the speedup compared to the original processor?

20% of the time: CPI = 1.4 + 1 = 2.4
5% of the time:  CPI = 1.4 + 2 = 3.4
75% of the time: CPI = 1.4

New average CPI: 2.4 × 0.2 + 3.4 × 0.05 + 1.4 × 0.75 = 1.7

Speedup = (CPI_old × CT_old) / (CPI_new × CT_new) = (1.4 × 10) / (1.7 × 4.02) = 2.049x
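The four speedup calculations can be checked with a few lines of Python. This simply restates the arithmetic from the solution; the variable names (ct_old, cpi_base, and so on) are chosen for illustration.

```python
# Sketch: the speedup arithmetic for Question 4 (cycle times in ns).

ct_old, cpi_base = 10.0, 1.4

# Part 1: ideal 5-stage split.
print(ct_old / (ct_old / 5))                      # 5.0x

# Part 2: unbalanced stages -- the slowest one sets the cycle time.
stages_ns = [1.0, 1.5, 4.0, 3.0, 0.5]
ct_new = max(stages_ns)
print(ct_old / ct_new)                            # 2.5x

# Part 3: add 20 ps (0.02 ns) of pipeline-register setup delay.
ct_new += 0.02
print(ct_old / ct_new)                            # ~2.49x

# Part 4: weighted CPI with disjoint stall cases, then latency speedup.
cpi_new = 0.20 * (cpi_base + 1) + 0.05 * (cpi_base + 2) + 0.75 * cpi_base
speedup = (cpi_base * ct_old) / (cpi_new * ct_new)
print(cpi_new, speedup)                           # ~1.7, ~2.049x
```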