Computer Architecture Exercise 4
Due: 26 November 2009, 12:15 pm

The following sources should deepen your understanding of Tomasulo's algorithm:

R. M. Tomasulo - An Efficient Algorithm For Exploiting Multiple Arithmetic Units
http://www.csd.uoc.gr/~hy425/lectures/tomasulo.pdf

UMASS - Simulation of the Tomasulo's algorithm used for Dynamic Scheduling
http://www.ecs.umass.edu/ece/koren/architecture/tomasulo1/tomasulo.htm

Part 1. Case Study - Register Renaming, Dynamic Scheduling, Multiple Issue Designs

1.1. Assume a five-stage single-pipeline microarchitecture (fetch, decode, execute, memory, write back) and the code below. All ops take 1 cycle except LW and SW, which take 1 + 2 cycles, and branches, which take 1 + 1 cycles. There is no forwarding.

Given is the following example code:

    Loop: LW   R1, 0(R2)     I0
          ADDI R1, R1, #1    I1
          SW   R1, 0(R2)     I2
          ADDI R2, R2, #4    I3
          SUB  R4, R3, R2    I4
          BNZ  R4, Loop

Note: An instruction does not enter the execution phase until all of its operands are ready. Decode and write-back can overlap. Branch overhead is the number of cycles needed to decide whether a branch is taken or not, counted from the fetch. (A small timing sketch follows part d) below.)

a) Show the phases of each instruction per clock cycle for one iteration of the loop.

b) How many clock cycles per loop iteration are lost to branch overhead?

c) Assume a static branch predictor, capable of recognizing a backwards branch in the decode stage (heuristic: if PC(branch target) < PC(current), the branch closes a loop, so take it). Now how many clock cycles are wasted on branch overhead?

d) Assume a dynamic branch predictor. How many cycles are lost on a correct prediction?
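
To make the timing rules concrete, here is a minimal Python sketch, added as an illustration and not the official solution. It computes the earliest execute and write-back cycle of each instruction in one iteration under one reading of the rules above: one fetch per cycle, in-order execution, no forwarding, and operands usable one cycle after the producer's write-back. The tuple encoding, the wb_of dictionary, and the placement of the extra LW/SW/branch cycles between EX and WB are assumptions of this sketch; adjust them to your model of the pipeline.

    # Sketch: earliest EX and WB cycles for one loop iteration.
    ops = [  # (label, name, dest, sources, extra_cycles)
        ("I0", "LW",   "R1", ["R2"],       2),   # 1 + 2 cycles
        ("I1", "ADDI", "R1", ["R1"],       0),
        ("I2", "SW",   None, ["R1", "R2"], 2),   # 1 + 2 cycles
        ("I3", "ADDI", "R2", ["R2"],       0),
        ("I4", "SUB",  "R4", ["R3", "R2"], 0),
        ("",   "BNZ",  None, ["R4"],       1),   # 1 + 1 cycles, unlabeled in the handout
    ]

    wb_of = {}    # register -> cycle its new value is written back
    prev_ex = 0   # EX cycle of the previous instruction (in-order execution)
    for i, (label, name, dest, srcs, extra) in enumerate(ops):
        fetch = i + 1                                  # one IF per cycle
        ready = max((wb_of.get(r, 0) for r in srcs), default=0)
        ex = max(fetch + 2, prev_ex + 1, ready + 1)    # IF, ID, then EX; no forwarding
        wb = ex + 2 + extra                            # EX, MEM (+extra), WB -- an assumption
        if dest:
            wb_of[dest] = wb
        prev_ex = ex
        print(f"{label or '--':4s} {name:4s} EX: cycle {ex:2d}  WB: cycle {wb:2d}")

Running it prints one line per instruction, which you can compare against your hand-drawn phase diagram for part a).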

1.2. Let's consider what dynamic scheduling might achieve here. Assume a microarchitecture as shown in figure 1. Assume that the ALUs can do all arithmetic ops (MULTD, DIVD, ADDD, ADDI, SUB) and branches, and that the Reservation Station (RS) can dispatch at most one operation to each functional unit per cycle (one op to each ALU plus one memory op to the LD/ST unit).

[Figure 1: An out-of-order microarchitecture]

Given is the following example code:

    I0. MULTD F2, F0, F2
    I1. DIVD  F8, F2, F0
    I2. LD    F4, 0(Ry)
    I3. ADDD  F4, F0, F4
    I4. ADDD  F10, F8, F2
    I5. SD    F4, 0(Ry)
    I6. ADDI  Rx, Rx, #8
    I7. ADDI  Ry, Ry, #8

Table 2: Latencies.

    Instruction        Latency beyond single cycle
    Memory LD          +3
    Memory SD          +1
    Integer ADD, SUB   +0
    Branches           +1
    ADDD               +2
    MULTD              +4
    DIVD               +10

a) Suppose the whole code is resident in the RS in clock cycle N, with latencies as given in table 1. Show how the RS should dispatch these instructions out-of-order, clock by clock, to obtain optimal performance on this code. (Consider the given RS restrictions. Also assume that results must be written into the RS before they are available for use, i.e., no bypassing.) How many clock cycles does the code sequence take? (A small dispatch-scheduler sketch follows part d) below.)

b) Part a) lets the RS try to optimally schedule these instructions. In reality, however, the whole instruction sequence of interest is not usually present in the RS. Instead, various events clear the RS, and as a new code sequence streams in from the decoder, the RS must choose to dispatch what it has. Suppose that the RS is empty. In cycle 0 the first two register-renamed instructions of this sequence appear in the RS. Assume it takes 1 clock cycle to dispatch any op, and assume functional unit latencies are as in table 1. Further assume that the front end (decoder/register-renamer) will continue to supply two new instructions per clock cycle. Show the cycle-by-cycle order of dispatch of the RS. How many clock cycles does this code sequence require now?

c) If you wanted to improve the results of part b), which would have helped most: (1) another ALU; (2) another LD/ST unit; (3) cutting the longest latency in half? What's the speedup?

d) Now let's consider speculation, the act of fetching, decoding, and executing beyond one or more conditional branches. Our motivation to do this is twofold: the dispatch schedule we came up with in part b) had lots of nops, and we know computers spend most of their time executing loops (which implies that the branch back to the top of the loop is pretty predictable). Loops tell us where to find more work to do; our sparse dispatch schedule suggests we have opportunities to do some of that work earlier than before. In part c) you found the critical path through the loop. Imagine folding a second copy of that path onto the schedule you got in part a). How many more clock cycles would be required to do two loops' worth of work (assuming all instructions are resident in the RS)? (Assume all functional units are fully pipelined.)
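
As an aid for parts a) and b), here is a hedged Python sketch of a greedy, earliest-ready dispatcher under the stated rules: two ALU slots and one memory slot per cycle, one cycle to dispatch, and a result usable only in a cycle after it has been written into the RS (no bypassing). The deps table is my reading of the code above and lists only true (RAW) dependences, assuming name dependences are already handled by renaming. Greedy program-order priority is not guaranteed optimal; use the output as a cross-check against your hand schedule, not as the answer.

    # Sketch: greedy out-of-order dispatch for the Part 1.2 code.
    LAT = {"LD": 3, "SD": 1, "ADDI": 0, "SUB": 0, "ADDD": 2, "MULTD": 4, "DIVD": 10}
    code = [  # (label, op)
        ("I0", "MULTD"), ("I1", "DIVD"), ("I2", "LD"), ("I3", "ADDD"),
        ("I4", "ADDD"), ("I5", "SD"), ("I6", "ADDI"), ("I7", "ADDI"),
    ]
    # true (RAW) dependences between instructions, my reading of the listing
    deps = {0: [], 1: [0], 2: [], 3: [2], 4: [0, 1], 5: [3], 6: [], 7: []}

    finish = {}   # instruction index -> cycle its result is written into the RS
    cycle = 0
    while len(finish) < len(code):
        cycle += 1
        slots = {"ALU": 2, "MEM": 1}            # per-cycle dispatch limits
        for i, (label, op) in enumerate(code):  # greedy, program-order priority
            unit = "MEM" if op in ("LD", "SD") else "ALU"
            if i in finish or slots[unit] == 0:
                continue
            # operand usable only in a cycle after the producer wrote the RS
            if all(d in finish and finish[d] < cycle for d in deps[i]):
                slots[unit] -= 1
                finish[i] = cycle + LAT[op]     # 1 cycle dispatch + extra latency
                print(f"cycle {cycle:2d}: dispatch {label} ({op}), done in cycle {finish[i]}")
    print("last result written in cycle", max(finish.values()))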

Part 2. Home Work - Register Renaming, Dynamic Scheduling, Multiple Issue Designs [9 pt.]

Let's consider what dynamic scheduling might achieve here. Assume a microarchitecture as shown in figure 1. Assume that the ALUs can do all arithmetic ops (MULTD, DIVD, ADDD, ADDI, SUB) and branches, and that the Reservation Station (RS) can dispatch at most one operation to each functional unit per cycle (one op to each ALU plus one memory op to the LD/ST unit).

a) Suppose all of the instructions from the sequence are present in the RS, with no renaming having been done. Highlight any instructions in the code where register renaming would improve performance (that is, where name dependences are resolved or instructions can be dispatched earlier). Hint: look for RAW and WAW hazards. Assume the functional unit latencies as in table 2. [1 pt.] (A small hazard-scanning sketch follows part e) below.)

Given is the following example code:

    I0. MULTD F2, F0, F3
    I1. DIVD  F2, F2, F0
    I3. ADDD  F2, F8, F3
    I4. ADDD  F4, F2, F4
    I5. ADDI  Ry, Ry, #8
    I6. SD    F4, 0(Ry)
    I7. ADDI  Rx, Rx, #8

Table 2: Latencies.

    Instruction        Latency beyond single cycle
    Memory LD          +4
    Memory SD          +1
    Integer ADD, SUB   +0
    Branches           +1
    ADDD               +1
    MULTD              +5
    DIVD               +12

b) Suppose the register-renamed version of the code from a) (given below) is resident in the RS in clock cycle N, with latencies as given in table 2. Show how the RS should dispatch these instructions out-of-order, clock by clock, to obtain optimal performance on this code. (Assume the same RS restrictions as in a). Also assume that results must be written into the RS before they are available for use, i.e., no bypassing.) How many clock cycles does the code sequence take? [2 pt.]

    I0. DIVD  F8, F2, F0
    I1. MULTD F2, F6, F2
    I2. LD    F4, 0(Ry)
    I3. ADDD  F4, F0, F4
    I4. ADDD  F10, F8, F2
    I5. ADDI  Rx, Rx, #8
    I6. ADDI  Ry, Ry, #8
    I7. SD    F4, 0(Ry)

c) Part b) lets the RS try to optimally schedule these instructions. In reality, however, the whole instruction sequence of interest is not usually present in the RS. Instead, various events clear the RS, and as a new code sequence streams in from the decoder, the RS must choose to dispatch what it has. Suppose that the RS is empty. In cycle 0 the first two register-renamed instructions of this sequence appear in the RS. Assume it takes 1 clock cycle to dispatch any op, and assume functional unit latencies are as in table 2. Further assume that the front end (decoder/register-renamer) will continue to supply two new instructions per clock cycle. Show the cycle-by-cycle order of dispatch of the RS. How many clock cycles does this code sequence require now? [2 pt.]

d) If you wanted to improve the results of part c), which would have helped most: (1) another ALU; (2) another LD/ST unit; (3) full bypassing of ALU results to subsequent operations; (4) cutting the longest latency in half? What's the speedup? [2 pt.]

e) Now let's consider speculation, the act of fetching, decoding, and executing beyond one or more conditional branches. Our motivation to do this is twofold: the dispatch schedule we came up with in part c) had lots of nops, and we know computers spend most of their time executing loops (which implies that the branch back to the top of the loop is pretty predictable). Loops tell us where to find more work to do; our sparse dispatch schedule suggests we have opportunities to do some of that work earlier than before. In part d) you found the critical path through the loop. Imagine folding a second copy of that path onto the schedule you got in part b). How many more clock cycles would be required to do two loops' worth of work (assuming all instructions are resident in the RS)? (Assume all functional units are fully pipelined.) [2 pt.]
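
For part a), the following Python sketch scans the un-renamed code and reports its data dependences: RAW are true dependences, while WAW and WAR are name dependences that register renaming removes. It is an added illustration, not part of the assignment; the tuple encoding and the last_writer/readers bookkeeping are my own.

    # Sketch: classify RAW / WAW / WAR dependences in the Part 2 a) code.
    code = [  # (label, dest, sources)
        ("I0", "F2", ["F0", "F3"]),   # MULTD F2, F0, F3
        ("I1", "F2", ["F2", "F0"]),   # DIVD  F2, F2, F0
        ("I3", "F2", ["F8", "F3"]),   # ADDD  F2, F8, F3
        ("I4", "F4", ["F2", "F4"]),   # ADDD  F4, F2, F4
        ("I5", "Ry", ["Ry"]),         # ADDI  Ry, Ry, #8
        ("I6", None, ["F4", "Ry"]),   # SD    F4, 0(Ry)
        ("I7", "Rx", ["Rx"]),         # ADDI  Rx, Rx, #8
    ]

    last_writer = {}   # register -> label of its most recent writer
    readers = {}       # register -> labels that read it since its last write
    for label, dest, srcs in code:
        for r in srcs:
            if r in last_writer:
                print(f"RAW: {label} reads {r} written by {last_writer[r]}")
            readers.setdefault(r, []).append(label)
        if dest:
            if dest in last_writer:
                print(f"WAW: {label} rewrites {dest} after {last_writer[dest]} -> rename helps")
            for rdr in readers.get(dest, []):
                if rdr != label:
                    print(f"WAR: {label} overwrites {dest} read by {rdr} -> rename helps")
            last_writer[dest] = label
            readers[dest] = []

The WAW reports on F2 mark exactly the instructions the renamed listing in part b) rewrites to use fresh destination registers.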

Part 3. Home Work - Scoreboarding and Tomasulo's Approach [5 pt.]

Both Scoreboarding and Tomasulo's approach allow out-of-order execution. Go through the slides again and answer the following questions. (A toy sketch of the Common Data Bus mechanism appears at the end of this sheet.)

a) How does the Scoreboard control solve WAW and WAR hazards? [1 pt.]

b) How does Tomasulo's approach solve WAW and WAR hazards? [1 pt.]

c) What are the main differences between Scoreboarding and Tomasulo's approach? [1 pt.]

d) What is the Common Data Bus for, and what is its benefit? [1 pt.]

e) What problem occurs if two functional units finish their computation at the same time and want to write their results? [1 pt.]

Your solutions need to be traceable!

Coordinator: Dipl.-Inf. Robert Hartmann
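
As background for Part 3 d) and e), here is a toy Python model of the Common Data Bus, added as an illustration and not taken from the lecture slides. Waiting reservation-station operands hold a producer tag; one (tag, value) broadcast per cycle updates every matching operand; and two units finishing together must arbitrate for the single bus. All entry and tag names (Add1, Mult1, Div1, Vj, Vk) are illustrative.

    # Toy CDB model: tag broadcast and single-bus arbitration.
    rs = {  # RS entry -> operand slots, each either a producer tag or a value
        "Add1": {"Vj": ("tag", "Mult1"), "Vk": ("val", 2.0)},
        "Add2": {"Vj": ("tag", "Mult1"), "Vk": ("tag", "Div1")},
    }

    def broadcast(tag, value):
        # every waiting entry snoops the bus and captures a matching result
        for name, entry in rs.items():
            for slot, (kind, payload) in entry.items():
                if kind == "tag" and payload == tag:
                    entry[slot] = ("val", value)
                    print(f"  {name}.{slot} captured {value} from {tag}")

    finished = [("Mult1", 6.0), ("Div1", 3.0)]  # both units finish this cycle
    print("cycle 1: arbitration grants the bus to", finished[0][0])
    broadcast(*finished[0])
    print("cycle 2: the loser broadcasts a cycle later")  # the conflict in e)
    broadcast(*finished[1])

Note how a single broadcast satisfies both Add1 and Add2 at once, which is the benefit asked about in d), and how the second finisher must wait for the bus, which is the problem asked about in e).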