University of California, Berkeley
College of Engineering
Computer Science Division, EECS
Spring 2012                                      John Kubiatowicz

Midterm I SOLUTIONS
March 21st, 2012
CS252 Graduate Computer Architecture

Your Name: ____________________    SID Number: ____________________

Problem   Possible   Score
1         25
2         25
3         30
4         20
Total     100
Question #1: Short Answer [25pts]

Problem 1a[3pts]: What hardware structure can be used to support branch prediction, data prediction, and precise exceptions in an out-of-order processor? Explain what this structure is (including what information it holds) and how it is used with implicit register renaming to recover from a bad prediction or exception.

The Reorder Buffer is used to recover from branch misprediction and data misprediction, and to restore the processor to a precise state. The Reorder Buffer holds pending instructions in program order; in addition to the instruction itself, it holds the result of the instruction until the instruction is committed, as well as any exception results. With implicit register renaming (i.e., original Tomasulo), instructions are committed in program order by writing their results to the register file. As a result, we can recover from a bad prediction or exception by simply throwing out the contents of the Reorder Buffer and refetching the earliest instruction that didn't complete (which was at the head of the Reorder Buffer when we flushed the buffer).

Problem 1b[3pts]: What is explicit register renaming? What is involved with implementing explicit register renaming in a 2-way superscalar processor (don't forget the free list!)?

Explicit register renaming is the process of translating the programmer-visible (logical) register names to physical register names. Implementing register renaming for a 2-way superscalar processor requires the ability to translate four source registers and two destination registers for two instructions simultaneously. The translation table must have four read ports and two write ports. Two details: we must be able to see whether the destination register of the first instruction is the same as either of the source registers of the second instruction (and do an appropriate replacement). Further, we must have a free-list mechanism that can allocate two free registers at a time; we might use the reorder buffer to know which physical registers are no longer in use and can be placed on the free list (up to two at a time). (A small sketch of this bookkeeping appears after problem 1d below.)

Problem 1c[2pts]: Suppose that we start with a basic Tomasulo architecture that includes branch prediction. What changes are required to execute 4 instructions per cycle?

We need to be able to issue 4 instructions at a time to the reservation stations. We need to have 4 result buses for up to four instruction completions per cycle. We need 8 read ports and 4 write ports on the register file. We need to have enough parallel execution resources to keep up to 4 operations going at once.

Problem 1d[2pts]: What could prevent the above architecture (in 1c) from sustaining 4 instructions per cycle? What could you do to improve utilization of the pipeline?

There are many possible problems. Structural hazards can stall the pipeline (specifically, insufficient queue slots in the reservation stations). Long-latency loads and stores could fill up buffers. Insufficient branch prediction could prevent issuing 4 instructions per cycle. Insufficient instruction-level parallelism can cause problems. We could improve the situation with simultaneous multithreading.
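Not part of the original solution: a C sketch of the rename bookkeeping described in 1b. The table sizes, structure names, and the free-list-as-stack organization are illustrative assumptions; the point is the four source lookups, two destination allocations, and the intra-group dependence check.

    #include <stdint.h>

    #define NUM_LOGICAL  32
    #define NUM_PHYSICAL 64

    typedef struct {
        uint8_t map[NUM_LOGICAL];        /* logical -> physical translation table    */
        uint8_t free_list[NUM_PHYSICAL]; /* stack of free physical register numbers  */
        int     free_count;
    } RenameState;

    typedef struct { uint8_t src1, src2, dst; } Inst;             /* logical numbers  */
    typedef struct { uint8_t src1, src2, dst, old_dst; } Renamed; /* physical numbers */

    /* Rename two instructions in one cycle: 4 source lookups, 2 destination allocations. */
    int rename_pair(RenameState *rs, const Inst in[2], Renamed out[2]) {
        if (rs->free_count < 2) return 0;       /* structural stall: not enough free regs */
        /* 1. Four parallel source lookups against the old map. */
        for (int i = 0; i < 2; i++) {
            out[i].src1 = rs->map[in[i].src1];
            out[i].src2 = rs->map[in[i].src2];
        }
        /* 2. Allocate two destinations; remember old mappings so they can be freed
              (e.g., via the reorder buffer) when each instruction commits.          */
        for (int i = 0; i < 2; i++) {
            out[i].old_dst = rs->map[in[i].dst];
            out[i].dst     = rs->free_list[--rs->free_count];
        }
        if (in[1].dst == in[0].dst) out[1].old_dst = out[0].dst;  /* same-group WAW   */
        /* 3. Intra-group check: inst 1's sources must see inst 0's new destination. */
        if (in[1].src1 == in[0].dst) out[1].src1 = out[0].dst;
        if (in[1].src2 == in[0].dst) out[1].src2 = out[0].dst;
        /* 4. Update the map table (inst 1 wins on a WAW to the same logical register). */
        rs->map[in[0].dst] = out[0].dst;
        rs->map[in[1].dst] = out[1].dst;
        return 1;
    }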
Problem 1e[3pts]: Name three reasons that industry leaders (e.g., Intel) decided around 2002 to stop trying to improve individual processor performance and start producing multicore processors instead.

There were a number of reasons; we took any reasonable ones. Some examples: (1) the amount of ILP that could be automatically extracted by hardware had run out of steam, (2) power consumption had hit a high point and would have had to go higher to continue performance improvements, (3) designers found it impossible to keep increasing clock rates.

Problem 1f[2pts]: Most branches in a program are highly biased, i.e., they can be predicted by a simple one-level predictor. What can the compiler do to improve the number of branches that are in this category?

The compiler can perform a node-splitting operation in which branches that can be reached along multiple paths are replicated multiple times, once for each different path through the code. The resulting branches are typically more highly biased than they were before replication.

Problem 1g[2pts]: When is it better to handle events via interrupts rather than polling? How about the reverse? Be specific.

Interrupts work well as a notification mechanism when events are either unpredictable or infrequent. Interrupts can also be useful to guarantee processing of events (from a security standpoint), since interrupts often involve handing control to the kernel. Polling works well when events are regular or predictable, or can be delayed for a long time without affecting functionality.

Problem 1h[3pts]: Why is it advantageous to have a prime number of banks of DRAM in a vector processor? Can you say how to map a binary address to a prime number of banks without the expense of implementing a general modulus operation in hardware?

A prime number of banks provides high performance for a larger number of possible strides than a non-prime number of banks (i.e., all strides that do not have that particular prime as a factor). If the prime is of the form 2^m - 1 (i.e., a Mersenne prime), then the bank ID can be extracted simply from a binary number by realizing that 2^x mod (2^m - 1) = 2^(x mod m). This means that we can divide the address into m-bit chunks and add them together. We take the result and repeat the same step until no more than m bits remain (special case: if the result equals 2^m - 1, treat it as 0).
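Not part of the original solution: a minimal C sketch of the chunk-and-add reduction described above (the function name and 32-bit word width are illustrative assumptions).

    #include <stdint.h>

    /* Bank index = addr mod (2^m - 1), computed without a general divider:
     * split the address into m-bit chunks, add them, and repeat until the
     * result fits in m bits; a result of 2^m - 1 is treated as 0.           */
    uint32_t bank_index(uint32_t addr, unsigned m) {
        uint32_t mask = (1u << m) - 1;       /* 2^m - 1, e.g. 7 banks for m = 3 */
        while (addr > mask) {
            uint32_t sum = 0;
            while (addr) {                   /* sum the m-bit chunks */
                sum += addr & mask;
                addr >>= m;
            }
            addr = sum;
        }
        return (addr == mask) ? 0 : addr;    /* special case: 2^m - 1 maps to bank 0 */
    }

For example, bank_index(14, 3) adds the chunks 6 and 1 to get 7, which is then treated as 0, matching 14 mod 7.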
Problem 1i[3pts]: Explain how exception conditions could occur out-of-order (in time) with a 5-stage, in-order pipeline. A diagram is probably the easiest way to illustrate. How can such a pipeline produce a precise exception point?

Exceptions can occur out-of-order in time because exceptions can occur in different stages. For example, illegal-instruction faults occur in the D stage, overflow errors occur in the E stage, and memory faults occur in the M stage. As a result, instructions that are later in program order can have exceptions that occur earlier in time, for instance:

    Instruction 1:  F  D  E  M  W
    Instruction 2:     F  D  E  M  W

In the above example, the D stage of instruction 2 occurs earlier in time than the M stage of instruction 1, even though instruction 1 is the precise exception point. The five-stage pipeline reorders exceptions simply by waiting to handle an exception until a well-defined point in the pipeline, such as the end of the memory stage. Thus, whenever an exception occurs, the corresponding stage simply sets an exception field in the pipeline rather than stopping the pipeline. The memory stage then looks at this field to decide which instruction should be the precise exception point.

Problem 1j[2pts]: Suppose your virtual memory system has 4KB pages. Further, suppose you have a 64KB first-level cache. Explain how you could fully overlap the TLB lookup and cache access. Use a diagram to help with your explanation.

The simple answer to this question is to make sure that the bits examined by the TLB and the bits used for the cache index are different bits. Since pages are 4KB in size, this means that only the lower 12 bits should be used to index the cache. To handle 64KB of first-level cache, we must have 64KB/4KB = 16-way associativity. The diagram below shows the relevant information. Note that we still have to do the TLB lookup and the way selection in series (thus, perhaps "fully overlap" is slightly misleading).

[Figure: the address is split into a 20-bit page number (sent to the TLB for an associative lookup), a 7-bit cache index, and a 5-bit block offset. The index selects a set in the 16-way, 64KB cache; the tags of that set are compared against the frame number produced by the TLB, and the resulting hit/miss signals drive an output mux that selects the data.]
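Not part of the original solution: a small C sketch of the address split implied by the figure. The 32-bit address and 32-byte cache lines are assumptions consistent with the 20/7/5 bit fields shown; the key property is that the set index uses only untranslated page-offset bits, so it can be read in parallel with the TLB lookup.

    #include <stdint.h>

    /* 4KB pages, 64KB 16-way cache with 32-byte lines:
     * 128 sets x 32 bytes x 16 ways = 64KB.                              */
    typedef struct { uint32_t vpn, set, offset; } AddrFields;

    AddrFields split_address(uint32_t vaddr) {
        AddrFields f;
        f.offset = vaddr & 0x1F;          /* bits [4:0]  : byte within 32-byte line    */
        f.set    = (vaddr >> 5) & 0x7F;   /* bits [11:5] : cache set, untranslated     */
        f.vpn    = vaddr >> 12;           /* bits [31:12]: virtual page #, sent to TLB */
        return f;
    }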
Question #2: In-Order Superscalar Processor [25pts]

Consider a dual-issue, in-order pipeline with one fetch stage, one decode stage, multiple execution stages (which include memory access), and a single writeback stage. Assume that the execution stages are organized into two parallel execution pipelines (call them "even" and "odd") that support all possible simultaneous combinations of two instructions. Instructions wait in the decode stage until all of their dependencies have been satisfied. Further, since this is an in-order pipeline, new instructions will be forced to wait behind stalled instructions. On each cycle, the decode stage takes zero, one, or two ready instructions from the fetch stage, gathers operands from the register file or the forwarding network, then dispatches them to the execution stages. If fewer than 2 instructions are dispatched on a particular cycle, then NOPs are sent to the execution stages. When two instructions are dispatched, the even pipeline receives the earlier instruction. When only one instruction is dispatched, it is placed in the even pipeline.

Assume that each of the execution pipelines consists of a single linear sequence of stages in which later stages serve as no-ops for shorter operations (or: every instruction takes the same number of stages to execute, but results of shorter operations are available for forwarding sooner). All operations are fully pipelined and results are forwarded as soon as they are complete. Assume that the execution pipelines have the following execution latencies: addf (2 cycles), multf (3 cycles), divf (4 cycles), integer ops (1 cycle). Assume that memory instructions take 4 cycles of execution: one for address calculation done by the integer execution stage, two unbreakable cycles for the actual cache access, and one cycle to check the cache tag. Finally, assume that branch conditions are computed by the integer execution units.

Problem 2a[2pts]: Suppose that this pipeline must be backward-compatible with an earlier, in-order, 5-stage pipeline that has a single branch-delay slot. Assuming that there is only a single fetch stage, will there be any bubbles in the pipeline from branches? Explain.

Yes. Even if we can process branch instructions in the decode stage, we will end up with a two-instruction bubble in the pipeline at least part of the time. Suppose that a branch and its delay slot get fetched in the same cycle. Then, when these two instructions are in the decode stage, there will be two other instructions being fetched; these next two instructions must be discarded if the branch is taken. If, on the other hand, the branch is fetched as an odd instruction, there will be one instruction in the fetch stage that will have to be discarded if the branch is taken. Note that more complex branch conditions are likely not computed until after the first execution stage, leading to more bubbles.

Problem 2b[2pts]: Suppose that the fetch stage takes 2 cycles. Will this change your answer from (2a)? Explain. If you added branch prediction, what would the pipeline have to do when the prediction was wrong?

Yes, this will make the situation in 2a even worse, potentially forcing the discarding of 2 additional instructions under some circumstances (or more if the branch is computed in an execute stage).
Because the pipeline is strictly in-order, all that we have to do on a branch misprediction is (1) identify the instructions after the branch that are already in the pipeline (2 instructions in F1, at least 1 in F2 and possibly 2 if the branch condition is computed in decode, 2 more if the branch is computed in E) and mark them for flushing (i.e., inhibit writeback of their results), and (2) start fetching from the correct branch target.
Problem 2c[10pts]: Below is a start at a simple diagram for the pipelines of this processor. 1) Finish the diagram. Stages are boxes with letters inside: use F1 and F2 for the fetch stages, D for a decode stage, EX1 through EX4 for the execution stages of each pipeline, and W for a writeback stage. Memory instructions take 1 cycle (EX1) to compute the address, two cycles to fetch (or write) data, and 1 cycle to check tags (TagC). Clearly label which is the even pipeline. Include arrows for forward information flow if this is not obvious. 2) Next, describe what is being performed in each of the 4 stages (including partial results). 3) Show all forwarding paths (as arrows). Your pipeline should never stall unless a value is not ready. Assume, for the moment, that there are never any cache misses. Label each bypass arrow with the types of instructions that will forward their results along that path (i.e., use M for multf, D for divf, A for addf, I for integer operations, and Ld for load results). [Hint: think carefully about inputs to store instructions!]

[Figure: F1 and F2 feed D, which dispatches into the EVEN and ODD pipelines. Each pipeline consists of EX1, EX2 (MEM1), EX3 (MEM2), and EX4 (TagC), followed by W. Bypass arrows run from the end of EX1 (labeled I), EX2 (I, A), EX3 (I, A, M, Ld), and EX4 (I, A, M, Ld, D) of each pipeline back to the inputs of both EX1 stages, and additional arcs feed I and A results into the end of the EX1 stages to supply store data.]

STAGES:
F1/F2: first and second cycle of fetch.
D: decode stage; stall until ready to dispatch; fetch values from registers.
EX1: integer operations; compute memory address; first stage of addf, multf, and divf.
EX2: first cycle of memory operation (MEM1); second stage of addf, multf, divf.
EX3: second cycle of memory operation (MEM2, load results ready); third stage of multf, divf.
EX4: tag-check cycle of memory operation (TagC); fourth stage of divf.
W: writeback (write data to register file).

Note that the arcs feeding into the end of the EX1 stages are for optimizing stores. It is important to note that there are no such arcs for the result of a load, since we cannot start the store until we have checked the tag (thus the normal feed of Ld results to the end of the D stage would be used for stores). If you take the "no cache misses" constraint literally, then we would feed from MEM2 to EX1.
Problem 2d[2pts]: Could this particular pipeline benefit from explicit register renaming? Why or why not? Be very explicit in your answer.

No. Explicit register renaming is not really necessary here, because the pipeline (1) is in order and commits in order at the W stage and (2) uses bypassing. Consequently, there are no WAR or WAW hazards to worry about.

Problem 2e[2pts]: Note that we assume that a load is not completed until the end of EX3 and that a store must have its value by the beginning of EX2. Consider the following common sequence for a memory copy:

    loop: ld   r1, 0(r2)
          st   r1, 0(r3)
          add  r2, r2, #4
          subi r4, r4, #1
          add  r3, r3, #4
          bne  r4, r0, loop
          nop

Why can't the load and store be dispatched in the same cycle? What is the minimum number of instructions that must be placed between them to avoid stalling? Explain.

They cannot be dispatched in the same cycle, since data from the load is not available until the MEM2 stage and is not available to start something that cannot be aborted until after the TagC stage. The store instruction needs its data by the beginning of the MEM1 stage. Thus, we must feed from the end of the TagC stage to the beginning of the MEM1 stage at the earliest; there must be at least 2 cycles of instructions between the load and the store (which means at least 4 instructions between them, and possibly 6 instructions if the load is in the even pipeline and the store is in the odd pipeline).

Problem 2f[2pts]: Assume that the following multiply-accumulate operation is extremely common for some important applications:

    multf f2, f1, f2
    addf  f3, f3, f2

How could you modify the pipeline from (2c) so that the above two operations could always be dispatched together in the same cycle? Explain with a figure. Would there be any negative consequences to this organization?

Assuming that multiplies take 3 cycles and adds take 2, you could arrange to process floating-point adds in cycles 4 and 5 of the odd pipeline (as shown below). Possible consequences: more bypassing logic and more stalling (results of an add in the odd pipeline are not available until EX5).

[Figure: both pipelines are extended to five execution stages, EX1 through EX5, with EX2 through EX4 doubling as MEM1, MEM2, and TagC. The multf result from the even pipeline is forwarded across to the odd pipeline, where the paired addf starts in EX4 and completes in EX5.]
Problem 2g[2pts]: Assume that the above pipeline can experience cache misses. What actions must happen if a cache miss is discovered when a load is traversing the TagC stage? Are any instructions other than that particular load affected? Explain how to resolve any issues.

First, it is important to note that the load itself has the wrong value! Further, since we forward for maximum performance, we may have forwarded this wrong value from the MEM2 stage to one or two instructions that are now in the EX1 stage. The easiest thing to do when we discover a cache miss is to (1) start filling the cache, (2) flush all instructions in stages earlier than W (including the instructions in EX4/TagC), then (3) restart fetching from those instructions (including the original load) after the cache miss has completed. This simple solution only needs to replace the cache line in the cache; we do not have to figure out which instructions have gotten wrong values, since we restart them all.

We could selectively do better than this by doing the following on a cache miss: (1) stall all instructions in the pipeline, which means that we do not update any of the latches, then (2) when the data comes back, not only update the cache but also forward the particular part of the cache line we originally loaded to the EX1 latches (if necessary) and to the TagC latches (just to have the right data for the writeback stage). We then let the pipeline recompute the failed cycle. Note that, by stalling the pipeline, we effectively recompute anything that might have used a wrong value and continue as if nothing incorrect ever happened. It is important to note that values are written back in order, in the write stage! Those of you who mentioned a ROB or register renaming were not taking the in-order nature of this pipeline into account (and didn't get credit for that answer!). We flush an instruction by turning it into a NOP.

Problem 2h[3pts]: Notice that the TagC stage is after the two memory data stages. Assuming that the above pipeline can experience cache misses on store instructions, how can you avoid overwriting data incorrectly during such a cache miss? Explain with a diagram any extra hardware that you may need to make stores work correctly. Be explicit. Do you need to change any of the arcs from (2c)?

The trick here is to wait until after we check the tag (in TagC) before writing to the cache. We can do this by splitting the tag lookup and the data storage of a store operation. We look up the tag for the current instruction, but do the store-back for a previous store instruction that has already been checked. Notice that there can be up to three stores in flight per pipeline. Also, during loads we need to check for matching pending stores from each pipeline. Below is one half of the hardware (it must be matched against the other pipeline as well).

[Figure: three store-address/store-data latch pairs track the stores in EX1, MEM1, and MEM2/TagC; a mux selects which buffered store's address and data are actually written into the cache once its tag check has succeeded.]
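Not part of the original solution: a rough C sketch of the per-pipeline pending-store buffer described above. Sizes, names, and the assumption that slot 0 holds the oldest store are illustrative; the essential behavior is that store data reaches the cache only after its tag check, and younger loads must check the buffered stores for an address match.

    #include <stdint.h>
    #include <stdbool.h>

    #define PENDING 3                   /* up to three in-flight stores per pipeline */

    typedef struct { uint32_t addr, data; bool valid; } PendingStore;
    typedef struct { PendingStore q[PENDING]; } StoreBuffer;      /* slot 0 = oldest  */

    /* A store enters the buffer at EX1 and only writes the cache after TagC succeeds. */
    void store_enqueue(StoreBuffer *sb, int slot, uint32_t addr, uint32_t data) {
        sb->q[slot] = (PendingStore){ addr, data, true };
    }

    /* A younger load must check all pending stores (in both pipelines) for a match. */
    bool load_forward(const StoreBuffer *sb, uint32_t addr, uint32_t *data) {
        for (int i = PENDING - 1; i >= 0; i--) {      /* youngest matching store wins */
            if (sb->q[i].valid && sb->q[i].addr == addr) {
                *data = sb->q[i].data;
                return true;
            }
        }
        return false;
    }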
EXTRA CREDIT: Problem 2i[5pts]: Briefly describe the logic that would be required in the decode stage of this pipeline. In five (5) sentences or less (and possibly a small figure), describe a mechanism that would permit the decode stage to decide which of two instructions presented to it could be dispatched.
Problem #3: Software Scheduling [30pts]

For this problem, assume that we have a fully pipelined, single-issue, in-order processor with the following numbers of execution cycles:
1. Floating-point multiply: 4 cycles
2. Floating-point square root: 11 cycles
3. Floating-point add: 2 cycles
4. Integer operations: 1 cycle

Assume that there is one branch delay slot, that there is no delay between integer operations and dependent branch instructions, and that memory loads and stores require 2 memory cycles (plus the address computation). All functional units are fully pipelined and bypassed.

Problem 3a[3pts]: Compute the following latencies between instructions to avoid stalls (i.e., how many unrelated instructions must be inserted between two instructions of the following types to avoid stalls)? The first one is given:

    Between a ldf and addf:            2 insts
    Between an addf and sqrtf:         1 inst
    Between an addf and stf:           0 insts
    Between a sqrtf and stf:           9 insts
    Between a ldf and multf:           2 insts
    Between a multf and addf:          3 insts
    Between a sqrtf and addf:          10 insts
    Between an integer op and branch:  0 insts

The following code takes an array of 2D vectors, V[] (split into 1D arrays of coordinates, X[] and Y[]), and computes the sum of the norms, namely the sum over i of ||V[i]||. It also stores the individual norms into array D[]. Let r1 point at array X, r2 at array Y, and r3 at array D. Let r4 hold the length of the arrays. Assume that F9 = 0 before the start of execution.

    cnorm: ldf   F3,0(r1)   ; Load x[i]
           multf F4,F3,F3   ; x[i]^2                  <- stall 2 cycles
           ldf   F5,0(r2)   ; Load y[i]
           multf F6,F5,F5   ; y[i]^2                  <- stall 2 cycles
           addf  F7,F4,F6   ; x[i]^2 + y[i]^2         <- stall 3 cycles
           sqrtf F8,F7      ; sqrt(x[i]^2 + y[i]^2)   <- stall 1 cycle
           addf  F9,F9,F8   ; accumulate sum          <- stall 10 cycles
           stf   0(r3),F8   ; d[i] = sqrt(x[i]^2 + y[i]^2)
           addi  r1,r1,#4
           addi  r2,r2,#4
           addi  r3,r3,#4
           subi  r4,r4,#1
           bnez  r4,cnorm
           nop

Problem 3b[2pts]: How many cycles does this loop take per iteration? Indicate stalls in the above code by labeling each of them with a number of cycles of stall.

Total cycles = 14 instructions + 18 stall cycles = 32 cycles/iteration
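For reference (not part of the original handout), the following C sketch shows what the cnorm loop computes; the function and parameter names are illustrative, and sqrtf here is the C library square root from <math.h>, not the exam's sqrtf opcode.

    #include <math.h>

    /* Reference semantics of the cnorm loop: D[i] = ||V[i]||, and the return value
       plays the role of F9, accumulating the sum of the norms.                      */
    float cnorm(const float *X, const float *Y, float *D, int n) {
        float sum = 0.0f;                        /* F9, assumed 0 on entry */
        for (int i = 0; i < n; i++) {
            float norm = sqrtf(X[i]*X[i] + Y[i]*Y[i]);
            D[i] = norm;
            sum += norm;
        }
        return sum;
    }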
Problem 3c[4pts]: Reschedule this code to run with as few cycles per iteration as possible. Do not unroll it or software pipeline it. How many cycles do you get per iteration of the loop now?

    cnorm: ldf   F3,0(r1)
           ldf   F5,0(r2)
           multf F4,F3,F3    <- stall 1 cycle
           multf F6,F5,F5
           addi  r1,r1,#4
           addf  F7,F4,F6    <- stall 2 cycles
           sqrtf F8,F7       <- stall 1 cycle
           addi  r2,r2,#4
           addi  r3,r3,#4
           subi  r4,r4,#1
           stf   -4(r3),F8   <- stall 6 cycles
           bnez  r4,cnorm
           addf  F9,F9,F8

Total cycles = 13 instructions + 10 stall cycles = 23 cycles/iteration

Problem 3d[6pts]: Unroll the loop once and schedule it to run with as few cycles as possible. Ignore startup code. What is the average number of cycles per iteration of the original loop? Hint: to make this easier, use the tens digit in naming registers for the second iteration of the loop, i.e., F3 becomes F13.

    cnorm: ldf   F3,0(r1)
           ldf   F5,0(r2)
           ldf   F13,4(r1)
           multf F4,F3,F3
           ldf   F15,4(r2)
           multf F6,F5,F5
           multf F14,F13,F13
           multf F16,F15,F15
           addf  F7,F4,F6       <- stall 1 cycle
           addf  F17,F14,F16    <- stall 1 cycle
           sqrtf F8,F7
           sqrtf F18,F17
           addi  r1,r1,#8
           addi  r2,r2,#8
           addi  r3,r3,#8
           subi  r4,r4,#2
           stf   -8(r3),F8      <- stall 4 cycles
           stf   -4(r3),F18
           addf  F9,F9,F8
           bnez  r4,cnorm
           addf  F9,F9,F18

Total cycles = 21 instructions + 6 stall cycles = 27 cycles, or 13.5 cycles per original iteration
Problem 3e[5pts]: Software pipeline this loop to avoid stalls. Use as few instructions as possible. Your code should have no more than one copy of the original instructions. What is the average number of cycles per iteration? Ignore startup and exit code.

This code can be organized as a software pipeline with 5 stages: (1) ldf/ldf, (2) multf/multf, (3) addf, (4) sqrtf, (5) stf/addf. (A C-level sketch of the resulting steady state appears after problem 3f below.)

    cnorm: stf   0(r3),F8    ; stage 5: r1+16, r2+16, r3+16, r4+4
           addf  F9,F9,F8    ;
           sqrtf F8,F7       ; stage 4: r1+12, r2+12, r3+12, r4+3
           addf  F7,F4,F6    ; stage 3: r1+8,  r2+8,  r3+8,  r4+2
           multf F4,F3,F3    ; stage 2: r1+4,  r2+4,  r3+4,  r4+1
           multf F6,F5,F5    ;
           ldf   F3,16(r1)   ; stage 1: r1+0,  r2+0,  r3+0,  r4+0
           ldf   F5,16(r2)   ;
           addi  r1,r1,#4
           addi  r2,r2,#4
           subi  r4,r4,#1
           bnez  r4,cnorm
           addi  r3,r3,#4    ; (delay slot)

Total cycles = 13 instructions + 0 stalls = 13 cycles/iteration

Problem 3f[3pts]: Assuming that we are allowed to perform loads off the end of the arrays at r1 and r2, show how we can add a small amount of startup code and a very simple type of exit code to make your software-pipelined loop a complete loop for any number of iterations > 0. HINT: Assume that you clean up the first few iterations at the end, but be careful about F9!

STARTUP CODE:

    cnorm_ent: movfi F9, 0f     ; Init F9 = 0
               movfi F8, 0f     ; Init F8 = 0
               movfi F7, 0f     ; Init F7 = 0
               movfi F4, 0f     ; Init F4, F6 = 0
               movfi F6, 0f     ;
               movfi F3, 0f     ; Init F3, F5 = 0
               movfi F5, 0f     ;
               mov   r11, r1    ; Save r1, r2, r3, r4
               mov   r12, r2
               mov   r13, r3
               mov   r14, r4

EXIT CODE:

    filtersc: mov  r1, r11      ; Restore address of x[]
              mov  r2, r12      ; Restore address of y[]
              mov  r3, r13      ; Restore address of d[]
              mov  r4, r14      ; Restore iteration count
              slti r5,r4,#5     ; Fewer than 5 iterations?
              bne  r5, finloop  ; Yes: use iteration count as the fixup count
              nop
              addi r4,r0,#4     ; No: peg fixup count at 4
    finloop:  <Code from 3a or 3c here>  ; Fix up to 4 iterations (keep adding into F9)
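Not part of the original solution: a C-level sketch of the steady state of the 5-stage software pipeline from 3e. Variable names are illustrative, and instead of the end-of-loop fixup scheme of 3f it simply runs four extra passes to drain the pipeline; the guard on stage 1 keeps the loads in bounds.

    #include <math.h>

    /* On pass i, stage k works on logical iteration i-k+1, so the five stages of
       one pass touch five consecutive iterations, just like the assembly above.   */
    float cnorm_swp(const float *X, const float *Y, float *D, int n) {
        float sum = 0.0f, x = 0, y = 0, mx = 0, my = 0, s = 0, r = 0;
        for (int i = 0; i < n + 4; i++) {
            if (i >= 4) { D[i-4] = r; sum += r; }   /* stage 5: stf / addf   */
            if (i >= 3) r = sqrtf(s);               /* stage 4: sqrtf        */
            if (i >= 2) s = mx + my;                /* stage 3: addf         */
            if (i >= 1) { mx = x*x; my = y*y; }     /* stage 2: multf/multf  */
            if (i <  n) { x = X[i]; y = Y[i]; }     /* stage 1: ldf / ldf    */
        }
        return sum;
    }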
Problem 3g[5pts]: Suppose that we have a vector pipeline in addition to everything else. Assuming that the hardware vector size is at least as big as the number of loop iterations (i.e., the value in r4 at entrance), produce a vectorized version of the cnorm loop. If you are not sure about the name of a particular instruction, use a name that seems reasonable, but make sure to put comments in your code to make it obvious what you mean. Assume that there are no native vector reduction operations. Your code should take the same four arguments (r1, r2, r3, and r4) and produce the same result (F9).

    cnorm: MOVL  r4           ; Set vector length
           LVS   V3,r1,4      ; Load x[] into V3
           MULTV V4,V3,V3     ; Square of x[] -> V4
           LVS   V5,r2,4      ; Load y[] into V5
           MULTV V6,V5,V5     ; Square of y[] -> V6
           ADDV  V7,V4,V6     ; Sum of squares -> V7
           SQRTV V8,V7        ; Square roots -> V8
           SV    V8,r3,4      ; Result -> d[]

    ; The following ignores the non-associativity of rounding and assumes that the
    ; uninitialized elements of V8 are 0 (may need extra work to ensure this)
    reduce:   movi r5, MAXLEN  ; Assume a power of 2 > 1
              MOVL r5          ; Set vector length
    red_loop: VEXTHALF V9,V8   ; Extract top half of V8 into V9
              ADDV V8,V8,V9    ; Bottom half holds partial sums
              srl  r5,r5,#1    ; Divide length by 2
              slti r6,r5,#2    ; Length now < 2?
              beqz r6,red_loop ; No (still > 1): keep reducing
              MOVL r5          ; Set vector length (delay slot)
              SVEXT F9,V8,#0   ; Move element 0 of V8 into F9

Note that we did the reduction assuming that floating-point addition is associative (which it isn't, due to rounding). Note that, unless you are worried about the reproducibility of two different versions of this code, it is not clear that the different answer you get from this code is any less correct than the answer you would get from the original code. Always understand your problem statement! We also used two vector extract instructions: one that extracts the top half of a vector register (relative to the vector length) into the bottom half of a different register, and one that extracts an individual vector element into a scalar register. Note that we also accepted a serial loop that scanned through d[] to compute the result (since this would technically be correct if we cared about the ordering of the adds in the reduction).

Problem 3h[2pts]: Describe generally (in a couple of sentences) what you would have to do to your code in 3g if the size of a hardware vector is less than the number of iterations of the loop.

You would have to strip-mine the code, namely divide it into multiple chunks no larger than the hardware vector length. The code would look roughly like this:

    cnorm_sm:   modi r5,r4,MAXLEN  ; Compute remainder
    cnorm_loop: MOVL r5            ; Set vector length
                ; <Compute one vector-length's worth of work>
                ; <Update F9 by adding the partial sum for this chunk>
                ; <Update array starting points (r1, r2, r3)>
                sub  r4,r4,r5
                bne  r4,cnorm_loop
                movi r5,MAXLEN     ; (delay slot) remaining chunks are full length
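Not part of the original solution: a C-level sketch of the strip-mining structure, with MAXLEN standing in for the hardware vector length and compute_chunk standing in for one vector-length's worth of the 3g code.

    #include <math.h>

    #define MAXLEN 64                            /* hardware vector length (assumed) */

    /* Stand-in for one vector-length's worth of the 3g vector code. */
    static void compute_chunk(const float *X, const float *Y, float *D, int vl, float *sum) {
        for (int i = 0; i < vl; i++) {
            D[i] = sqrtf(X[i]*X[i] + Y[i]*Y[i]);
            *sum += D[i];
        }
    }

    /* Strip-mining: run the remainder-length chunk first, then full-length chunks,
       so the vector length only needs to change once.                              */
    void cnorm_strip(const float *X, const float *Y, float *D, int n, float *sum) {
        int done = 0;
        int vl = n % MAXLEN;                     /* remainder chunk (may be 0)       */
        while (done < n) {
            compute_chunk(X + done, Y + done, D + done, vl, sum);
            done += vl;
            vl = MAXLEN;                         /* all remaining chunks are full    */
        }
    }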
Problem 4: Paper Potpourri [20pts]

Problem 4a[3pts]: One of the papers that we read showed that the delay of the bypass network scales quadratically with issue width. Can you give an intuition as to why this is true? Assuming that you want to double the number of instructions issued per cycle without slowing your clock cycle down by a factor of 4, what could you do?

The intuition behind this result is that the length of the bypass network varies at least linearly with the number of units that need to bypass their values. Since the delay of a wire varies quadratically with its length (assuming it is not repeated), this gives us a quadratic factor. In fact, you could imagine adding repeaters to give some relief to this scaling factor, but the width of the bypass muxes also increases with issue width, still leading to super-linear delay. You could mitigate this delay increase by clustering functional units into smaller groups.

Problem 4b[2pts]: Higher-dimensional networks (e.g., hypercubes) can route messages with fewer hops than lower-dimensional networks. Nonetheless, the exploration paper that we read on k-ary n-cubes (Bill Dally) showed that high-dimensional networks were not necessarily the lowest-latency networks. Explain what assumptions lead to this conclusion and reflect on when such assumptions are valid.

The important assumption here was that wiring is limited by physical constraints, i.e., the cross-sectional area of the wires across the bisection must be the same regardless of the degree of the network. Thus, when optimizing for overall latency, lower-dimensional networks might perform better because they can support wider (higher-bandwidth) connections.

Problem 4c[2pts]: The "Future of Wires" paper ultimately concluded that multicore was an important future architectural innovation (as opposed to more complex uniprocessor cores). Can you give two arguments that lead to that conclusion?

One of the primary arguments was that reasonable assumptions about the behavior of wires and clock rates (measured in units of FO4) lead one to conclude that the number of clock cycles needed to cross a typical chip is increasing. Thus, putting the long-distance wires into a generalized network that connects small processors (i.e., multicore) is the easiest way to handle multi-cycle communication, rather than trying to build a single large multi-issue processor. A second argument involved the observation that CAD tools generate routing errors in proportion to the number of transistors on the chip, and each of these errors requires painstaking correction by hand. Multicore is the best way to handle Moore's-law (i.e., exponential) growth in the number of transistors per chip: simply produce and debug one processor macro, then replicate it across the chip (with a regular network).

Problem 4d[3pts]: Sketch out the following branch predictors: gshare, PAp, Tournament.

[Figure: gshare XORs the branch address with the global branch history register (GBHR) to index a global pattern history table (GPHT) of 2-bit counters. PAp uses the branch address to select a per-address branch history register (PABHR), which in turn indexes a per-address pattern history table (PAPHT). The tournament predictor runs two component predictors in parallel and uses a chooser table, indexed by the branch address, to drive a mux that selects between their predictions.]
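Not part of the original solution: a minimal C sketch of the gshare predictor in the figure. The table size, history length, instruction-address shift, and 2-bit counter update policy are illustrative assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    #define PHT_BITS 12                          /* 4K entries of 2-bit counters (assumed) */
    #define PHT_SIZE (1u << PHT_BITS)

    static uint8_t  pht[PHT_SIZE];               /* 2-bit saturating counters              */
    static uint32_t ghr;                         /* global branch history register (GBHR)  */

    bool gshare_predict(uint32_t pc) {
        uint32_t idx = ((pc >> 2) ^ ghr) & (PHT_SIZE - 1);   /* PC XOR global history      */
        return pht[idx] >= 2;                                /* predict taken if counter>=2 */
    }

    void gshare_update(uint32_t pc, bool taken) {
        uint32_t idx = ((pc >> 2) ^ ghr) & (PHT_SIZE - 1);
        if (taken  && pht[idx] < 3) pht[idx]++;              /* saturating increment       */
        if (!taken && pht[idx] > 0) pht[idx]--;              /* saturating decrement       */
        ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1);   /* shift in the outcome */
    }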
Problem 4e[2pts]: What is a simple technique using virtual channels that will permit arbitrary adaptation (around faults and congestion) while still guaranteeing deadlock freedom?

Divide the virtual channels into two categories: adaptive and deadlock-free. The deadlock-free network can use any technique to be deadlock free, such as routing in dimension order (requiring only 1 virtual channel per physical channel). Then, to route a message, use virtual channels in the adaptive category any way you want (routing around faults or congestion); if you get stuck for a while, transition to the deadlock-free network. Most of the time, you will never have to leave the adaptive network. The most important constraint here is that once a message starts routing on the deadlock-free network, it must never transition back to the adaptive network.

Problem 4f[3pts]: What was the basic idea behind Trace Scheduling for VLIW, and why is it necessary to get good performance from a VLIW? Explain why a VLIW might need to perform many simultaneous condition checks when implementing Trace Scheduling.

Trace Scheduling looks at traces of program execution to identify common paths through the program (across multiple branches, i.e., basic blocks); once these paths have been identified, it compresses all of the instructions in a trace together. The resulting superblock of instructions is scheduled as if the branches always take the path identified in the trace. Checks are placed at the end of the block to catch violations of the predicted branch directions; further, special fixup code is generated to correct the state of the program if branch directions are violated. Trace scheduling is necessary for a VLIW since we need to generate a large block of instructions (without branches) in order to find enough parallelism to fill up the slots in VLIW instructions. Without trace scheduling, the instructions in the program couldn't cross branch boundaries, a serious constraint when branches might occur every 5 instructions. As described above, we might need to check many branch conditions at the end of our scheduled block to see if we need to execute fixup code.

Problem 4g[3pts]: What is Simultaneous Multithreading? What hardware enhancements would be required to transform a superscalar, out-of-order processor into one that performs simultaneous multithreading? Be explicit.

Simultaneous Multithreading is a technique that allows instructions from multiple threads to exist in the pipeline at the same time. The term is applied to superscalar, out-of-order pipelines. Hardware enhancements for Simultaneous Multithreading include (1) multiple PCs and branch-prediction hardware, (2) fetch logic to choose instructions from multiple threads, (3) additional renaming resources to support more than one thread, including more translation-table space and additional physical registers, (4) multiple commit logic to handle commits from multiple threads, and (5) possibly additional TLB space to allow each thread to operate in a different address space.

Problem 4h[2pts]: What is coarse-grained multithreading (as implemented by the Sparcle processor)? Name at least 2 ways in which the Alewife multiprocessor utilized coarse-grained multithreading.

Coarse-grained multithreading is a form of multithreading that switches from one thread to another infrequently, say at events such as cache misses or synchronization misses, rather than switching every instruction.
Another definition of coarse-grained multithreading is that instructions from different threads never coexist in the pipeline at the same time. With coarse-grained multithreading, the overhead of switching from one thread to another can be multiple cycles (in Alewife it was 14 cycles). Alewife used coarse-grained multithreading in a number of ways, including switching: (1) on cache misses to global memory, (2) on synchronization misses (such as with fine-grained synchronization), and (3) to handle the threads generated by incoming messages.