
Pipelining HW

Q. Can a MIPS SW instruction executing in a simple 5-stage pipelined implementation have a data-dependency hazard of any type that results in a nop bubble? If so, show an example; if not, prove it cannot happen. You may assume that any data-forwarding paths that could prevent a bubble have been implemented.

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

label_k:  LW    $1, offset_i( $2 )
          BReq  $1, $ZERO, label_i

Suppose this runs on M1, a 5-stage pipelined MIPS machine with full data forwarding, and with stalls and nullification as needed for data and control hazards. Ignoring pipeline fill and drain overhead, what is the average branch CPI if every branch is taken? What is the average LW CPI? What is the overall average CPI?

Q. Compare the performance of M1 with M2, a simple 5-stage pipelined MIPS with no forwarding, stalls, or nullification: all hazards are handled by the compiler inserting NOPs as needed. Also compare to M3, a single-cycle MIPS implementation with an 800 ps clock cycle. The pipelined machines have a 200 ps clock cycle.
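The bookkeeping behind these CPI and speedup questions can be sketched as follows. The stall penalties used here are assumptions for illustration, not values given in the problem; substitute the penalties your pipeline actually incurs.

```python
# Sketch of the CPI bookkeeping for the LW/BReq trace. The penalties
# below are ASSUMED placeholders: a 1-cycle load-use stall before the
# dependent branch, and a 1-cycle bubble per taken branch.

def avg_cpi(base_cpi, stall_cycles):
    """CPI charged to one instruction class = base CPI + stalls it causes."""
    return base_cpi + stall_cycles

def exec_time(cpi, cycle_ps, n_instructions):
    """Total execution time in picoseconds: time = IC * CPI * clock."""
    return n_instructions * cpi * cycle_ps

lw_cpi = avg_cpi(1, 1)            # assumed load-use stall feeding BReq
br_cpi = avg_cpi(1, 1)            # assumed 1-cycle bubble per taken branch
overall = (lw_cpi + br_cpi) / 2   # the trace alternates LW and BReq

print(overall)                    # 2.0 under these assumed penalties
```

The same `exec_time` helper lets you compare M1, M2, and M3 directly once each machine's CPI is known, since total time is instruction count times CPI times cycle time.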

Q. The previous question did not consider hierarchical refinement. That is, we might consider implementing parallelism in various ways in the functional units by, for instance, interleaving and pipelining. As a limiting case, using Amdahl's law and without any restrictions on the possible improvements except minimum flip-flop timing considerations, derive an absolute limit on speedup.

Q. The single-cycle MIPS implementation presented in the P&H textbook has a control ROM, presented in Figure 4.22. Only one state is needed to execute each instruction: there is one ROM word per opcode, and the instruction's opcode bits are fed directly to the ROM's address input. Figure 4.24 (below) shows the single-cycle MIPS, including the elements that implement the J (jump) instruction. Add a row to the ROM that implements the control signals for J-format instructions. NB: having 6-bit opcodes implies a 64-word ROM. We implement only five of those words, for the opcodes 000000 (R-format), 100011 (LW), 101011 (SW), 000100 (BR), and 000010 (J); the other ROM words are ignored. Figure 4.22 has been rearranged (below) to make the ROM look like a typical memory layout.
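One way to picture the ROM-as-memory layout is a lookup table keyed by the 6-bit opcode. A minimal sketch in Python, using the standard P&H Figure 4.22 signal values for the four known rows; the J row is deliberately left open, since filling it in is the exercise ('X' marks a don't-care):

```python
# Control ROM as a lookup table keyed by the 6-bit opcode string.
# Signal order in each ROM word:
#   RegDst, ALUSrc, MemtoReg, RegWrite, MemRead,
#   MemWrite, Branch, ALUOp1, ALUOp0, Jump
FIELDS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite", "MemRead",
          "MemWrite", "Branch", "ALUOp1", "ALUOp0", "Jump")

ROM = {
    "000000": "1001000100",  # R-format
    "100011": "0111100000",  # LW
    "101011": "X1X0010000",  # SW
    "000100": "X0X0001010",  # BR (BEQ)
    # "000010": ...          # J -- the row the question asks you to add
}

def control(opcode):
    """Decode one ROM word into named control signals."""
    return dict(zip(FIELDS, ROM[opcode]))

print(control("100011")["MemRead"])   # '1': LW reads data memory
```

The dict stands in for the 64-word ROM with the unused words omitted; a hardware ROM would simply return a don't-care pattern for the ignored addresses.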

Q. Figure 4.29 (below) in the P&H text shows the pipeline graphical notation for execution of an ADD instruction. Gray areas represent the hardware elements that are active in each clock cycle. The register file is shown twice, once in ID and once in WB. Since writes into the register file occur on the falling clock edge, the register file is gray for the first half of the WB cycle, as the write data is sampled then. An instruction reads the register file during the second half of its ID cycle, so the right half of the register file is gray in ID. (Memory is not gray because ADD does not access memory.) By this convention, if the register file's registers were positive-edge triggered, the register file would be gray for the entire ID cycle: the read would initiate at the positive clock edge that moved the ADD instruction into the ID stage and would not complete until the next positive clock edge clocked the read data into the EX stage. How would you handle write-back if the register file were positive-edge triggered?

An execution trace contains the following mix of MIPS instructions: 45% 3-register operate, 25% loads, 10% stores, and 20% branches (half of them taken). We are considering pipelining the single-cycle MIPS implementation into a 5-stage design. The clock cycle time is 800 ps for the single-cycle implementation and 200 ps for the pipelined version.

Q. Let's analyze an optimistic limiting case. Assume all operate instructions have forwarding that avoids data-dependency stalls. Assume loads and stores cause no bubbles because the compiler always moves independent instructions into any delay slots, and assume all branching avoids data dependencies in the same manner. With branch hazard detection, the only bubbles come from taken branches. What is the average CPI in this case (ignore pipeline fill and drain overhead)? What is the speedup of the pipelined version over the single-cycle version?

Q. The pessimistic assumption is just the opposite: all possible delays are incurred. Find the CPI and speedup in this case.

Q. Returning to the optimistic case, suppose we spend more to make the pipeline deeper: we split each stage into k sub-stages, so the pipeline depth is 5k, at an added expense of O(5k) dollars. What is the limiting speedup in this case?

Q. Considering the costs, which alternative seems the most practical? Explain.
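A worked check of the optimistic case, under the assumption (not stated explicitly above) that each taken branch costs exactly one bubble cycle:

```python
# Optimistic-case CPI for the given instruction mix, assuming each
# taken branch injects exactly one bubble (an assumption, not given).

mix = {"operate": 0.45, "load": 0.25, "store": 0.10, "branch": 0.20}
taken_fraction = 0.5
branch_penalty = 1           # assumed: 1 nullified slot per taken branch

cpi = 1 + mix["branch"] * taken_fraction * branch_penalty
speedup = (1 * 800) / (cpi * 200)   # single-cycle time / pipelined time

print(round(cpi, 3), round(speedup, 3))
```

The same structure handles the pessimistic case: replace the single taken-branch term with one penalty term per hazard class, each weighted by its frequency.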

Suppose the functional-unit delays in a 5-stage pipelined implementation are (200, 100, 250, 200, 100) ps for (IMEM, DECODE, EXECUTE, DMEM, WRITE-BACK). We are considering two possible improvements: (1) add carry-lookahead circuitry, which typically speeds up an ALU by 25%, or (2) replace the register file with one containing double-word registers and double the datapath to DMEM. Data fetches are then no slower, and data fetching needs only half the number of LW executions (program data-access pattern permitting).

Q. Assuming both improvements cost about the same, which is more cost-effective? Is doing both worth considering? Assume the typical job mix is 50% ALU operations, 35% load-store, and 15% branches. Analyze using Amdahl's Law.

Q. Now redo the analysis comparing execution times directly. That is, account for the clock-rate speedup and the code-size reduction.
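The stage-delay arithmetic can be sketched as below. Two labeled assumptions: "speeds up the ALU by 25%" is read as a 25% delay reduction (either reading leaves the critical path at 200 ps), and the split of the 35% load-store traffic into loads versus stores is a placeholder, since the problem does not give it.

```python
# Clock-rate effect of improvement 1, plus an Amdahl's-law helper for
# improvement 2. ASSUMPTIONS: 25% ALU speedup = 25% shorter EXECUTE
# delay; loads are taken as half of the 35% load/store mix (placeholder).

stage_ps = {"IMEM": 200, "DECODE": 100, "EXECUTE": 250,
            "DMEM": 200, "WB": 100}

# Improvement 1: faster ALU shortens EXECUTE; the clock is the slowest stage.
old_clock = max(stage_ps.values())               # 250 ps
fast_alu = dict(stage_ps, EXECUTE=250 * 0.75)    # EXECUTE -> 187.5 ps
new_clock = max(fast_alu.values())               # 200 ps (IMEM/DMEM now limit)
clock_speedup = old_clock / new_clock            # 1.25

# Improvement 2: Amdahl's law, speedup = 1 / ((1 - f) + f / s).
def amdahl(f, s):
    """Overall speedup when fraction f of the work is sped up by factor s."""
    return 1 / ((1 - f) + f / s)

loads = 0.35 / 2   # placeholder split of the load/store mix
print(clock_speedup, round(amdahl(loads, 2), 4))
```

Note that improvement 2 reduces the dynamic instruction count rather than the CPI or the clock, which is why the second question asks you to compare execution times directly instead of stopping at Amdahl's law.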

Q. In a 5-stage pipelined MIPS, an exception can be caused by an instruction in any functional unit except WB. In fact, four instructions could cause exceptions simultaneously, and a hardware interrupt might also occur in the same cycle. The cause register can record all the causes. If the EPC is to hold the address of only one instruction, which instruction's address is the most logical choice? Does the choice matter to the functionality of the operating system?

Q. Suppose an instruction stream looks like this (first executed at top):

LW
ADD
LW
SUB
JR
ADD

Suppose the 5-stage MIPS processor is executing in user mode and the JR (jump via register) instruction's argument register contains 0x80008000. Which instructions will continue execution, and which will be flushed from the pipeline? In addition, suppose the SUB instruction causes an overflow exception. The cause register acts like a stack and can push another cause. What happens to the OS routine that started in response to the first exception? What address could be put in the EPC to handle this? Assuming we do as you suggest, does this processor have precise exceptions?

Q. Suppose we add 1-bit branch prediction to our 5-stage MIPS, but we put the branch instruction's full 32-bit address into the branch prediction table instead of just the low address bits. Does this solve the problem of mis-prediction on entering a loop? How?

Q. Suppose we know that $2 always has the value 400 before entering the loop. Show how a compiler could use unrolling and register renaming to take advantage of a VLIW machine that has two parallel integer units.

Loop:  SUB   $4, $3, $1
       ADD   $4, $6, $4
       SW    $4, base( $2 )
       SUB   $2, $2, 4
       BRneq $2, $ZERO, Loop