2013 Advanced Computer Architecture Mid-term Exam

1. Amdahl's law

When making changes to optimize part of a processor, it is often the case that speeding up one type of instruction comes at the cost of slowing down something else. For example, if we put in a complicated fast floating point unit, that takes space, and something else might have to be moved farther away from the middle to accommodate it, adding an extra cycle of delay to reach that unit. The basic Amdahl's law equation does not take this trade-off into account.

a) If the new fast floating point unit speeds up floating point operations by a factor of 2 on average, and floating point operations take 20% of the original program's execution time, what is the overall speedup (ignoring the penalty to any other instructions)?

b) Now assume that speeding up the floating point unit slowed down data cache accesses, which consume 10% of the execution time. What is the overall speedup now?

Answers:
a. 1/(0.8 + 0.20/2) = 1.11
b. 1/(0.7 + 0.20/2 + 0.10 × 3/2) = 1.05, where the factor 3/2 is the slowdown applied to data cache accesses.

2. Cache performance optimization

The transpose of a matrix interchanges its rows and columns; this is illustrated below:

Here is a simple C loop to show the transpose:

    for (i = 0; i < 3; i++) {
        for (j = 0; j < 3; j++) {
            output[j][i] = input[i][j];
        }
    }
Assume that both the input and output matrices are stored in the row major order (row major order means that the row index changes fastest). Assume that you are executing a 256 × 256 double-precision transpose on a processor with a 16KB fully associative (don't worry about cache conflicts), least recently used (LRU) replacement L1 data cache with 64 byte blocks. Assume that L1 cache misses require 16 cycles and always hit in the L2 cache. For the simple implementation given above, this execution order would be non-ideal for the input matrix; however, applying a loop interchange optimization would create a non-ideal order for the output matrix. Because loop interchange is not sufficient to improve its performance, it must be blocked instead.

a) What should be the minimum size of the cache to take advantage of blocked execution?

Each element is 8B. Since a 64B cache line holds 8 elements, and each column access results in fetching a new line for the non-ideal matrix, we need a minimum of 8 × 8 (64 elements) for each matrix. Hence, the minimum cache size is 128 × 8B = 1KB (128 = 64 + 64 for the two matrices).

b) How do the relative numbers of misses in the blocked and unblocked versions compare in the minimum sized cache above?

The blocked version only has to fetch each input and output element once. The unblocked version will have one cache miss for every 64B/8B = 8 row elements. Each column requires 64B × 256 of storage, or 16KB. Thus, column elements will be replaced in the cache before they can be used again. Hence the unblocked version will have 9 misses (1 row and 8 columns) for every 2 in the blocked version.

c) Write code to perform a transpose with a block size parameter B which uses B × B blocks.

    for (i = 0; i < 256; i = i + B)
        for (j = 0; j < 256; j = j + B)
            for (m = 0; m < B; m++)
                for (n = 0; n < B; n++)
                    output[j + n][i + m] = input[i + m][j + n];

Three C's of Cache Misses (Short Answer)
Mark whether the following modifications will cause each of the categories to increase, decrease, or whether the modification will have no effect. You can assume the baseline cache is set associative. Explain your reasoning to receive credit.

Double the associativity (capacity and line size constant; halves # of sets)
    Compulsory misses: No effect. If the data wasn't ever in the cache, increasing associativity under these constraints won't change that.
    Conflict misses: Decrease. Typically higher associativity reduces conflict misses because there are more places to put the same element.
    Capacity misses: No effect. Capacity was given as a constant.

Adding a victim cache
    Compulsory misses: No effect. A victim cache only holds lines that were previously held in the cache.
    Conflict misses: Decrease. The victim cache holds the victim of a conflict, so it can be used again later.
    Capacity misses: Decrease. Slightly larger cache capacity. An answer of "no effect", on the grounds that the victim cache doesn't count towards the capacity total, lost only 0.5 points.

Adding prefetching
    Compulsory misses: Decrease. Hopefully prefetched data is there when needed.
    Conflict misses: No effect (prefetching doesn't affect placement), or possibly increase (prefetched data could pollute the cache).
    Capacity misses: No effect (prefetching doesn't change the capacity), or possibly increase (prefetched data could pollute the cache).
Ben is designing a 7-stage in-order pipeline and is concerned about the performance implications of branches. The baseline processor he is considering is similar to the classic 5-stage RISC pipeline, except instruction cache access and data cache access each take two stages, as shown in Figure 2-1. Initially, the pipeline lacks any branch prediction mechanism. All branches in Ben's ISA are simple enough that they can execute without an ALU. His ISA has no branch delay slots.

Figure 2-1: Ben's in-order pipeline

Problem 2.A

What is the earliest stage that branches can be resolved in Ben's pipeline? How many instructions are squashed on a taken branch? Assuming 1 in 6 instructions is a branch and 3/5 of branches are taken, and assuming a base CPI of 1, what is the CPI of Ben's pipeline?

Simple branches can be resolved in decode, so two fetches are wasted on a taken branch. The CPI is 5/6 (non-branches) + 1/6 * 2/5 (untaken branches) + 3 * 1/6 * 3/5 (taken branches), or 6/5.

    Branch resolution stage:    Decode
    # Instructions squashed:    2
    CPI:                        6/5
Problem 3.A

Describe how precise exceptions are maintained in out-of-order processors.

Exceptions are detected when an instruction executes out-of-order and are saved in the ROB. Since instructions commit in-order from the ROB, exceptions can still be taken in program order by not actually taking an exception until the corresponding instruction is at the head of the ROB, about to commit.

Problem 3.B

Consider an out-of-order processor with register renaming using a unified physical register file. A new physical register is allocated for each instruction's destination register in the decode stage, but since physical registers are a finite resource, they must be deallocated at some point in time. Carefully explain when it is safe to deallocate a physical register.

The physical register can be freed when the next writer of the same architectural register commits. At that point, it is guaranteed that no instructions remaining in the pipeline need to read the old physical register.

Compute the Clocks Per Instruction (CPI) of a machine which has an average CPI for ALU operations of 1.1, a CPI for branches/jumps of 3.0, and a hit rate of 60% in the cache. A hit in the cache takes 1 cycle pipelined and a cache miss takes 120 cycles. Assume 22% of instructions are loads, 12% are stores, 20% are branches/jumps and the balance are ALU operations.

    CPI = P_ALU * CPI_ALU + P_BR/JMP * CPI_BR/JMP + P_LD/ST * (P_HIT * CPI_HIT + P_MISS * CPI_MISS)
        = (1 - 0.22 - 0.12 - 0.2) * 1.1 + 0.2 * 3.0 + (0.22 + 0.12) * (0.6 * 1 + (1 - 0.6) * 120)
        = 17.63
For the following code snippet, identify all of the RAW, WAW, and WAR hazards. Provide a list for each hazard. Hint: remember that you have to check more than neighboring instructions.

    I0: LD    F4, 0(Rx)
    I1: MULTD F2, F0, F2
    I2: DIVD  F8, F4, F2
    I3: LD    F4, 0(Ry)
    I4: ADDD  F6, F0, F4
    I5: SUBD  F8, F8, F6
    I6: SD    F8, 0(Ry)

RAW:
    I0, I2 (F4)
    I0, I4 (F4)
    I1, I2 (F2)
    I2, I5 (F8)
    I2, I6 (F8)
    I3, I4 (F4)
    I4, I5 (F6)
    I5, I6 (F8)

WAW:
    I0, I3 (F4)
    I2, I5 (F8)

WAR:
    I2, I3 (F4)

1. For the following snippets of code, select the single architectural feature that will most improve the performance of the code. Explain your choice, including a description of why the other features will not improve performance as much and your assumptions about the machine design.

    ADD.D F0, F1, F8
    ADD.D F2, F3, F8
    ADD.D F4, F5, F8
    ADD.D F6, F7, F8
Options:
    A. Out-of-Order Issue with Renaming
    B. Branch Prediction
    C. Superscalar

Answer: C

A: There are no WAR, WAW, or RAW hazards in this code, so out-of-order issue with renaming does not help performance.
B: There is no branch in this code, so branch prediction cannot improve performance.
C: The four instructions are independent and can be fetched, executed, and written back in parallel, so a superscalar design gives the largest performance improvement.

Consider the execution of the following loop, which searches an array, on a processor that is in-order single-issue, with out-of-order execution and write-back, dynamic scheduling, and speculation:

    Loop: LD    R2, 0(R1)     ; R2 = array element
          DADDI R2, R2, #1    ; increment R2
          SD    R2, 0(R1)
          DADDI R1, R1, #-4   ; decrement pointer
          BNEZ  R2, Loop      ; branch if the element != 0

Assume that there are separate integer functional units for effective address calculation, for ALU operations, and for branch condition evaluation. Also assume that only one instruction can write back per cycle. Complete the following table for the first three iterations of this loop for the machine.
Iteration  Instruction          Issues at  Executes at  Memory access at  Write CDB at  Comment
1          LD    R2, 0(R1)      1          2            3                 4             First issue
1          DADDI R2, R2, #1     2          5            -                 6
1          SD    R2, 0(R1)      3          7            8                 -
1          DADDI R1, R1, #-4    4          8            -                 9
1          BNEZ  R2, Loop       5          7            -                 -
2          LD    R2, 0(R1)      8          10           11                12
2          DADDI R2, R2, #1     9          13           -                 14
2          SD    R2, 0(R1)      10         15           16                -
2          DADDI R1, R1, #-4    11         16           -                 17
2          BNEZ  R2, Loop       12         13           -                 -
3          LD    R2, 0(R1)      14         16           17                18
3          DADDI R2, R2, #1     15         19           -                 20
3          SD    R2, 0(R1)      16         21           22                -
3          DADDI R1, R1, #-4    17         22           -                 23
3          BNEZ  R2, Loop       18         19           -                 -

Notice that:
1. There are separate integer functional units for effective address calculation, for ALU operations, and for branch condition evaluation. This means that the execute stages of LD/SD, DADDI, and BNEZ can overlap.
2. No bypassing.
3. No branch prediction.
4. The memory access stages of different LD/SD instructions cannot overlap.
Assume that you have the following pipeline. It can issue two instructions per cycle and can commit one instruction per cycle. Draw the pipeline diagram of the following code sequence executing.

    MUL R6, R7, R8
    ADD R9, R10, R11
    ADD R11, R12, R13
    ADD R13, R14, R15
    ADD R19, R13, R10
    LW  R2, R3
    ADD R12, R16, R19
    LW  R5, R2
    ADD R15, R20, R21
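Pipeline diagram over cycles 1 to 17. Each row lists, in order, the stages that the instruction passes through:

    MUL R6, R7, R8       F D I Y0 Y1 Y2 Y3 W C
    ADD R9, R10, R11     F D I X0 W C
    ADD R11, R12, R13    F D I X0 W C
    ADD R13, R14, R15    F D I I X0 W C
    ADD R19, R13, R10    F D I I I X0 W C
    LW  R2, R3           F D I L0 L1 W C
    ADD R12, R16, R19    F D I I I I X0 W C
    LW  R5, R2           F D I I I I I L0 L1 W C
    ADD R15, R20, R21    F D I I X0 W C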