ECE 562 Advanced Computer Architecture: Chapter 1-2 Sampling Questions

1. Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls

Based on our study, explain how to reduce each type of stall.

2. Your boss is trying to decide between a single-processor system and a dual-processor system. The table below gives the performance on two sets of benchmarks: a memory benchmark and a processor benchmark. You know that your application will spend 30% of its time on memory-centric computations and 70% of its time on processor-centric computations. How much speedup do you anticipate if you suggest that your boss move from a Pentium to an Athlon 64 X on a CPU-intensive application suite?

3. Consider the following code segment. Identify the data dependencies by marking them with arrows and labeling them by name (RAW, WAR, WAW).

i: R4 <- R0 + R2
j: R8 <- R0 * R4
k: R4 <- R4 - R2

Simulate the execution of the code using the basic 5-stage pipeline (F, D, E, M, and W) with a single memory port and with neither forwarding nor cycle splitting. Add/subtract takes 2 cycles, and multiply takes 3 cycles. The first instruction has been done for you. Extend the table as needed.
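The dependence labeling asked for in question 3 can be cross-checked mechanically. The sketch below is only an illustration: it assumes each instruction writes its first register and reads the other two, and the names `Instr` and `find_dependences` are made up for this example. It compares every pair of instructions and does not account for an intervening write that would hide an older value, which is enough for a three-instruction segment.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    label: str
    dest: str
    srcs: tuple


def find_dependences(program):
    """Return (earlier, later, kind, register) for every pairwise register dependence."""
    found = []
    for a in range(len(program)):
        for b in range(a + 1, len(program)):
            early, late = program[a], program[b]
            if early.dest in late.srcs:      # later instruction reads what the earlier one wrote
                found.append((early.label, late.label, "RAW", early.dest))
            if late.dest in early.srcs:      # later instruction overwrites what the earlier one read
                found.append((early.label, late.label, "WAR", late.dest))
            if late.dest == early.dest:      # both instructions write the same register
                found.append((early.label, late.label, "WAW", late.dest))
    return found


# The three instructions of question 3 (destination first, then sources).
program = [
    Instr("i", "R4", ("R0", "R2")),
    Instr("j", "R8", ("R0", "R4")),
    Instr("k", "R4", ("R4", "R2")),
]

deps = find_dependences(program)
for earlier, later, kind, reg in deps:
    print(f"{earlier} -> {later}: {kind} on {reg}")
```

Only the RAW entries are true data dependences; the WAR and WAW entries are name dependences on R4.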

Cycle #:
i: R4 <- R0 + R2    F D E E M W
j: R8 <- R0 * R4
k: R4 <- R4 - R2

Total number of cycles to complete the code: ?

5. Imagine that your company is trying to decide between a single-processor system and a dual-processor system. Figure 1.26 gives the performance on two sets of benchmarks: a memory benchmark and a processor benchmark. You know that your application will spend 30% of its time on memory-centric computations and 70% of its time on processor-centric computations. You are using a dual-core Athlon processor, and you are choosing between two ways to implement the same algorithm. The first is to create a large lookup table to store 4K words of data. When you need the result, you look up the answer. The second method would be to calculate the result in a very tight loop. What are the advantages and disadvantages of each implementation? In what situation does the performance of the Pentium equal that of the Pentium D 820? (When there are 88.89% memory operations and 11.1% processor operations.)

6. Think about what latency numbers really mean: they indicate the number of cycles a given function requires to produce its output, nothing more. If the overall pipeline stalls for the latency cycles of each functional unit, then you are at least guaranteed that any pair of back-to-back instructions (a "producer" followed by a "consumer") will execute correctly. But not all instruction pairs have a producer/consumer relationship. Sometimes two adjacent instructions have nothing to do with each other. How many cycles would the loop body in the code sequence in Figure 2.35 require if the pipeline detected true data dependences and only stalled on those, rather than blindly stalling everything just because one functional unit is busy? Show the code with <stall> inserted where necessary to accommodate stated latencies. (Hint: An instruction with latency +2 needs 2 <stall> cycles to be inserted into the code sequence. Think of it this way: a 1-cycle instruction has latency 1 + 0, meaning zero extra wait states. So latency 1 + 1 implies 1 stall cycle; latency 1 + N has N extra stall cycles.)

Code sequence:

Loop: LD    F2, 0(Rx)
I0:   MULTD F2, F0, F2
I1:   DIVD  F8, F2, F0
I2:   LD    F4, 0(Ry)
I3:   ADDD  F4, F0, F4
I4:   ADDD  F10, F8, F2
I5:   SD    F4, 0(Ry)
I6:   ADDI  Rx, Rx, #8
I7:   ADDI  Ry, Ry, #8
I8:   SUB   R20, R4, Rx
I9:   BNZ   R20, Loop

Latencies beyond single cycle:

Memory LD          +3
Memory SD          +1
Integer ADD, SUB   +0
Branches           +1
ADDD               +2
MULTD              +4
DIVD               +10
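The hint above (latency 1 + N means N extra stall cycles between a producer and a back-to-back consumer) can be turned into a small stall-insertion sketch. This is one possible model, not the textbook's solution: it assumes a single-issue, in-order pipeline that stalls only on true (RAW) dependences, it does not add any cycles after the last instruction (such as a branch delay), and the tuple layout and the name `insert_stalls` are made up for this example.

```python
def insert_stalls(program):
    """program: list of (text, dest, srcs, extra_latency) tuples.

    Returns the instruction texts with "<stall>" entries inserted wherever a
    consumer would otherwise issue before its producer's result is ready.
    """
    ready = {}   # register -> first cycle in which a consumer may issue
    lines = []
    cycle = 0
    for text, dest, srcs, extra in program:
        cycle += 1                                   # next free issue slot
        earliest = max([ready.get(r, 0) for r in srcs] + [cycle])
        lines.extend(["<stall>"] * (earliest - cycle))
        cycle = earliest
        lines.append(text)
        if dest is not None:
            ready[dest] = cycle + 1 + extra          # latency 1 + extra
    return lines


# First three instructions of the loop, with the latencies listed above.
excerpt = [
    ("LD F2, 0(Rx)",     "F2", ("Rx",),      3),
    ("MULTD F2, F0, F2", "F2", ("F0", "F2"), 4),
    ("DIVD F8, F2, F0",  "F8", ("F2", "F0"), 10),
]

listing = insert_stalls(excerpt)
print("\n".join(listing))
print("cycles:", len(listing))
```

An independent instruction placed between a producer and its consumer fills one of the stall slots, which is the effect the reordering exercises rely on.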

Sample questions in the textbook:

Chapter 1:

Case Study 1: Chip Fabrication Cost

Concepts illustrated by this case study: Fabrication Cost; Fabrication Yield; Defect Tolerance through Redundancy

There are many factors involved in the price of a computer chip. New, smaller technology gives a boost in performance and a drop in required chip area. In the smaller technology, one can either keep the small area or place more hardware on the chip in order to get more functionality. In this case study, we explore how different design decisions involving fabrication technology, area, and redundancy affect the cost of chips.

1.1 [10/10/Discussion] <1.5, 1.5> Figure 1.22 gives the relevant chip statistics that influence the cost of several current chips. In the next few exercises, you will be exploring the trade-offs involved between the AMD Opteron, a single-chip processor, and the Sun Niagara, an 8-core chip.
a. [10] <1.5> What is the yield for the AMD Opteron?
b. [10] <1.5> What is the yield for an 8-core Sun Niagara processor?
c. [Discussion] <1.4, 1.6> Why does the Sun Niagara have a worse yield than the AMD Opteron, even though they have the same defect rate?

1.3 [20/20/10/10/20] <1.7> Your colleague at Sun suggests that, since the yield is so poor, it might make sense to sell two sets of chips, one with 8 working processors and one with 6 working processors. We will solve this exercise by viewing the yield as a probability of no defects occurring in a certain area given the defect rate. For the Niagara, calculate probabilities based on each Niagara core separately (this may not be entirely accurate, since the yield equation is based on empirical evidence rather than a mathematical calculation relating the probabilities of finding errors in different portions of the chip).
a. [20] <1.7> Using the yield equation for the defect rate above, what is the probability that a defect will occur on a single Niagara core (assuming the chip is divided evenly between the cores) in an 8-core chip?
b. [20] <1.7> What is the probability that a defect will occur on one or two cores (but not more than that)?
c. [10] <1.7> What is the probability that a defect will occur on none of the cores?
d. [10] <1.7> Given your answers to parts (b) and (c), what is the number of 6-core chips you will sell for every 8-core chip?
e. [20] <1.7> If you sell your 8-core chips for $150 each, the 6-core chips for $100 each, the cost per die sold is $80, your research and development budget was $200 million, and testing itself costs $1.50 per chip, how many processors would you need to sell in order to recoup costs?

Case Study 2: Power Consumption in Computer Systems

Concepts illustrated by this case study: Amdahl's Law; Redundancy; MTTF; Power Consumption

Power consumption in modern systems is dependent on a variety of factors, including the chip clock frequency, efficiency, the disk drive speed, disk drive utilization, and DRAM. The following exercises explore the impact on power that different design decisions and/or use scenarios have.
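For the yield exercises in Case Study 1 (1.1 and 1.3), the arithmetic can be sketched as follows. The yield model here is an assumption, since the handout does not reproduce it: an empirical die-yield formula of the form wafer yield × (1 + defect density × die area / α)^(−α) with α = 4. The die area and defect density below are placeholders, not the values from Figure 1.22, and the function names are made up for this example. Treating each core independently follows the simplification stated in Exercise 1.3.

```python
from math import comb


def die_yield(defects_per_cm2, die_area_cm2, wafer_yield=1.0, alpha=4.0):
    """Assumed empirical model: wafer_yield * (1 + D * A / alpha) ** -alpha."""
    return wafer_yield * (1.0 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


def prob_exactly_k_defective(n_cores, k, p_core_defect):
    """Binomial probability that exactly k of n independent cores have a defect."""
    return comb(n_cores, k) * p_core_defect ** k * (1.0 - p_core_defect) ** (n_cores - k)


# Placeholder inputs (NOT the Figure 1.22 values): a 4.0 cm^2 die split evenly
# across 8 cores, with 0.5 defects per cm^2.
core_area = 4.0 / 8
p_core_defect = 1.0 - die_yield(0.5, core_area)

p_none = prob_exactly_k_defective(8, 0, p_core_defect)
p_one_or_two = (prob_exactly_k_defective(8, 1, p_core_defect)
                + prob_exactly_k_defective(8, 2, p_core_defect))

print(f"P(defect on one core)      = {p_core_defect:.4f}")
print(f"P(no defective cores)      = {p_none:.4f}")
print(f"P(1 or 2 defective cores)  = {p_one_or_two:.4f}")
print(f"6-core chips per 8-core chip = {p_one_or_two / p_none:.3f}")
```

The last ratio corresponds to part (d): chips with one or two bad cores can be sold as 6-core parts, and chips with no bad cores as 8-core parts.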

1.4 [20/10/20] <1.6> Figure 1.23 presents the power consumption of several computer system components. In this exercise, we will explore how the hard drive affects power consumption for the system.
a. [20] <1.6> Assuming the maximum load for each component, and a power supply efficiency of 70%, what wattage must the server's power supply deliver to a system with a Sun Niagara 8-core chip, 2 GB 184-pin Kingston DRAM, and two 7200 rpm hard drives?
b. [10] <1.6> How much power will the 7200 rpm disk drive consume if it is idle roughly 40% of the time?
c. [20] <1.6> Assume that rpm is the only factor in how long a disk is not idle (which is an oversimplification of disk performance). In other words, assume that for the same set of requests, a 5400 rpm disk will require twice as much time to read data as a 10,800 rpm disk. What percentage of the time would the 5400 rpm disk drive be idle to perform the same transactions as in part (b)?

1.6 [10/10/Discussion] <1.2, 1.9> Figure 1.24 gives a comparison of power and performance for several benchmarks comparing two servers: Sun Fire T2000 (which uses Niagara) and IBM x346 (using Intel Xeon processors).
a. [10] <1.9> Calculate the performance/power ratio for each processor on each benchmark.
b. [10] <1.9> If power is your main concern, which would you choose?
c. [Discussion] <1.2> For the database benchmarks, the cheaper the system, the lower the cost per database operation. This is counterintuitive: larger systems have more throughput, so one might think that buying a larger system would mean a larger absolute cost but a lower per-operation cost. Since this is true, why do any larger server farms buy expensive servers? (Hint: Look at Exercise 1.4 for some reasons.)

Case Study 3: The Cost of Reliability (and Failure) in Web Servers

Concepts illustrated by this case study: TPCC; Reliability of Web Servers; MTTF

This set of exercises deals with the cost of not having reliable Web servers. The data is in two sets: one gives various statistics for Gap.com, which was down for maintenance for two weeks in 2005 [AP 2005]. The other is for Amazon.com, which was not down, but has better statistics on high-load sales days. The exercises combine the two data sets and require estimating the economic cost of the shutdown.

1.9 [10/10] <1.8> The main reliability measure is MTTF. We will now look at different systems and how design decisions affect their reliability. Refer to Figure 1.25 for company statistics.
a. [10] <1.8> We have a single processor with an FIT of 100. What is the MTTF for this system?
b. [10] <1.8> If it takes 1 day to get the system running again, what is the availability of the system?
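The reliability arithmetic behind Exercise 1.9 can be sketched in a few lines. The sketch assumes the usual definitions, which the handout does not spell out: FIT counts failures per billion (10^9) device-hours, so MTTF = 10^9 / FIT hours, and availability = MTTF / (MTTF + MTTR). The function names are made up for this example.

```python
HOURS_PER_BILLION = 1e9  # FIT is assumed to mean failures per 10^9 hours


def mttf_hours(fit):
    """Mean time to failure in hours for a component with the given FIT rate."""
    return HOURS_PER_BILLION / fit


def availability(mttf, mttr):
    """Fraction of time the system is up; both arguments in the same unit."""
    return mttf / (mttf + mttr)


single_mttf = mttf_hours(100)                 # single processor, FIT = 100
single_avail = availability(single_mttf, 24)  # 1 day (24 hours) to repair

print(f"MTTF = {single_mttf:.0f} hours")
print(f"Availability = {single_avail:.7f}")
```

Under the same assumptions, the failure rates of components add when any single failure brings the whole system down, so a system's FIT is the sum of its components' FIT values.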

1.10 [20] <1.8> Imagine that the government, to cut costs, is going to build a supercomputer out of the cheap processor system in Exercise 1.9 rather than a special-purpose reliable system. What is the MTTF for a system with 1000 processors? Assume that if one fails, they all fail.

Case Study 4: Performance

Concepts illustrated by this case study: Arithmetic Mean; Geometric Mean; Parallelism; Amdahl's Law; Weighted Averages

In this set of exercises, you are to make sense of Figure 1.26, which presents the performance of selected processors and a fictional one (Processor X), as reported by [...]. For each system, two benchmarks were run. One benchmark exercised the memory hierarchy, giving an indication of the speed of the memory for that system. The other benchmark, Dhrystone, is a CPU-intensive benchmark that does not exercise the memory system. Both benchmarks are displayed in order to distill the effects that different design decisions have on memory and CPU performance.

[10/10/20] <1.9> Imagine that your company is trying to decide between a single-processor system and a dual-processor system. Figure 1.26 gives the performance on two sets of benchmarks: a memory benchmark and a processor benchmark. You know that your application will spend 40% of its time on memory-centric computations and 60% of its time on processor-centric computations.
a. [10] <1.9> Calculate the weighted execution time of the benchmarks.
b. [10] <1.9> How much speedup do you anticipate getting if you move from using a Pentium to an Athlon 64 X on a CPU-intensive application suite?
c. [20] <1.9> At what ratio of memory to processor computation would the performance of the Pentium be equal to the Pentium D 820?

1.14 [10/10/20/20] <1.10> Your company has just bought a new dual Pentium processor, and you have been tasked with optimizing your software for this processor. You will run two applications on this dual Pentium, but the resource requirements are not equal. The first application needs 80% of the resources, and the other only 20% of the resources.
a. [10] <1.10> Given that 40% of the first application is parallelizable, how much speedup would you achieve with that application if run in isolation?
b. [10] <1.10> Given that 99% of the second application is parallelizable, how much speedup would this application observe if run in isolation?
c. [20] <1.10> Given that 40% of the first application is parallelizable, how much overall system speedup would you observe if you parallelized it?

d. [20] <1.10> Given that 99% of the second application is parallelizable, how much overall system speedup would you get?

Chapter 2:

Case Study 1: Exploring the Impact of Microarchitectural Techniques

Concepts illustrated by this case study: Basic Instruction Scheduling, Reordering, Dispatch; Multiple Issue and Hazards; Register Renaming; Out-of-Order and Speculative Execution; Where to Spend Out-of-Order Resources

You are tasked with designing a new processor microarchitecture, and you are trying to figure out how best to allocate your hardware resources. Which of the hardware and software techniques you learned in Chapter 2 should you apply? You have a list of latencies for the functional units and for memory, as well as some representative code. Your boss has been somewhat vague about the performance requirements of your new design, but you know from experience that, all else being equal, faster is usually better. Start with the basics. Figure 2.35 provides a sequence of instructions and list of latencies.

2.1 [10] <1.8, 2.1, 2.2> What would be the baseline performance (in cycles, per loop iteration) of the code sequence in Figure 2.35 if no new instruction execution could be initiated until the previous instruction execution had completed? Ignore front-end fetch and decode. Assume for now that execution does not stall for lack of the next instruction, but only one instruction/cycle can be issued. Assume the branch is taken, and that there is a 1-cycle branch delay slot.

2.2 [10] <1.8, 2.1, 2.2> Think about what latency numbers really mean: they indicate the number of cycles a given function requires to produce its output, nothing more. If the overall pipeline stalls for the latency cycles of each functional unit, then you are at least guaranteed that any pair of back-to-back instructions (a "producer" followed by a "consumer") will execute correctly. But not all instruction pairs have a producer/consumer relationship. Sometimes two adjacent instructions have nothing to do with each other. How many cycles would the loop body in the code sequence in Figure 2.35 require if the pipeline detected true data dependences and only stalled on those, rather than blindly stalling everything just because one functional unit is busy? Show the code with <stall> inserted where necessary to accommodate stated latencies. (Hint: An instruction with latency +2 needs 2 <stall> cycles to be inserted into the code sequence. Think of it this way: a 1-cycle instruction has latency 1 + 0, meaning zero extra wait states. So latency 1 + 1 implies 1 stall cycle; latency 1 + N has N extra stall cycles.)

2.3 [15] <2.6, 2.7> Consider a multiple-issue design. Suppose you have two execution pipelines, each capable of beginning execution of one instruction per cycle, and enough fetch/decode bandwidth in the front end so that it will not stall your execution. Assume results can be immediately forwarded from one execution unit to another, or to itself. Further assume that the only reason an execution pipeline would stall is to observe a true data dependence. Now how many cycles does the loop require?

2.4 [10] <2.6, 2.7> In the multiple-issue design of Exercise 2.3, you may have recognized some subtle issues. Even though the two pipelines have the exact same instruction repertoire, they are neither identical nor interchangeable, because there is an implicit ordering between them that must reflect the ordering of the instructions in the original program. If instruction N + 1 begins execution in Execution Pipe 1 at the same time that instruction N begins in Pipe 0, and N + 1 happens to require a shorter execution latency than N, then N + 1 will complete before N (even though program ordering would have implied otherwise). Recite at least two reasons why that could be hazardous and will require special considerations in the microarchitecture. Give an example of two instructions from the code in Figure 2.35 that demonstrate this hazard.

2.5 [20] <2.7> Reorder the instructions to improve performance of the code in Figure [...]. Assume the two-pipe machine in Exercise 2.3, and that the out-of-order completion issues of Exercise 2.4 have been dealt with successfully. Just worry about observing true data dependences and functional unit latencies for now. How many cycles does your reordered code take?

2.6 [10/10] <2.1, 2.2> Every cycle that does not initiate a new operation in a pipe is a lost opportunity, in the sense that your hardware is not living up to its potential.
a. [10] <2.1, 2.2> In your reordered code from Exercise 2.5, what fraction of all cycles, counting both pipes, were wasted (did not initiate a new op)?
b. [10] <2.1, 2.2> Loop unrolling is one standard compiler technique for finding more parallelism in code, in order to minimize the lost opportunities for performance.
c. Hand-unroll two iterations of the loop in your reordered code from Exercise 2.5. What speedup did you obtain? (For this exercise, just color the N + 1 iteration's instructions green to distinguish them from the Nth iteration's; if you were actually unrolling the loop you would have to reassign registers to prevent collisions between the iterations.)

2.8 [20] <2.4> Exercise 2.7 explored simple register renaming: when the hardware register renamer sees a source register, it substitutes the destination T register of the last instruction to have targeted that source register. When the rename table sees a destination register, it substitutes the next available T for it. But superscalar designs need to handle multiple instructions per clock cycle at every stage in the machine, including the register renaming. A simple scalar processor would therefore look up both src register mappings for each instruction, and allocate a new destination mapping per clock cycle. Superscalar processors must be able to do that as well, but they must also ensure that any dest-to-src relationships between the two concurrent instructions are handled correctly. Consider the sample code sequence in Figure [...]. Assume that we would like to simultaneously rename the first two instructions. Further assume that the next two available T registers to be used are known at the beginning of the clock cycle in which these two instructions are being renamed. Conceptually, what we want is for the first instruction to do its rename table lookups, and then update the table per its destination's T register. Then the second instruction would do exactly the same thing, and any inter-instruction dependency would thereby be handled correctly. But there's not enough time to write that T register designation into the renaming table and then look it up again for the second instruction, all in the same clock cycle. That register substitution must instead be done live (in parallel with the register rename table update). Figure 2.39 shows a circuit diagram, using multiplexers and comparators, that will accomplish the necessary on-the-fly register renaming. Your task is to show the cycle-by-cycle state of the rename table for every instruction of the code. Assume the table starts out with every entry equal to its index (T0 = 0; T1 = 1, ...).

[10/10/10] <2.3> Assume a five-stage single-pipeline microarchitecture (fetch, decode, execute, memory, write back) and the code in Figure [...]. All ops are 1 cycle except LW and SW, which are 1 + 2 cycles, and branches, which are [...] cycles. There is no forwarding. Show the phases of each instruction per clock cycle for one iteration of the loop.
a. [10] <2.3> How many clock cycles per loop iteration are lost to branch overhead?
b. [10] <2.3> Assume a static branch predictor, capable of recognizing a backwards branch in the decode stage. Now how many clock cycles are wasted on branch overhead?
c. [10] <2.3> Assume a dynamic branch predictor. How many cycles are lost on a correct prediction?

2.12 [20/20/20/10/20] <2.4, 2.7, 2.10> Let's consider what dynamic scheduling might achieve here. Assume a microarchitecture as shown in Figure [...]. Assume that the ALUs can do all arithmetic ops (MULTD, DIVD, ADDD, ADDI, SUB) and branches, and that the Reservation Station (RS) can dispatch at most one operation to each functional unit per cycle (one op to each ALU plus one memory op to the LD/ST unit).
a. [15] <2.4> Suppose all of the instructions from the sequence in Figure 2.35 are present in the RS, with no renaming having been done. Highlight any instructions in the code where register renaming would improve performance. Hint: Look for RAW and WAW hazards. Assume the same functional unit latencies as in Figure [...].
b. [20] <2.4> Suppose the register-renamed version of the code from part (a) is resident in the RS in clock cycle N, with latencies as given in Figure [...]. Show how the RS should dispatch these instructions out-of-order, clock by clock, to obtain optimal performance on this code. (Assume the same RS restrictions as in part (a). Also assume that results must be written into the RS before they're available for use; i.e., no bypassing.) How many clock cycles does the code sequence take?

Case Study 2: Modeling a Branch Predictor

Concept illustrated by this case study: Modeling a Branch Predictor

Besides studying microarchitecture techniques, to really understand computer architecture you must also program computers. Getting your hands dirty by directly modeling various microarchitectural ideas is better yet. Write a C or Java program to model a 2,1 branch predictor. Your program will read a series of lines from a file named history.txt (available on the companion CD; see Figure 2.43). Each line of that file has three data items, separated by tabs. The first datum on each line is the address of the branch instruction in hex. The second datum is the branch target address in hex. The third datum is a 1 or a 0; 1 indicates a taken branch, and 0 indicates not taken. The total number of branches your model will consider is, of course, equal to the number of lines in the file. Assume a direct-mapped BTB, and don't worry about instruction lengths or alignment (i.e., if your BTB has four entries, then branch instructions at 0x0, 0x1, 0x2, and 0x3 will reside in those four entries, but a branch instruction at 0x4 will overwrite BTB[0]). For each line in the input file, your model will read the pair of data values, adjust the various tables per the branch predictor being modeled, and collect key performance statistics. The final output of your program will look like that shown in Figure [...]. Make the number of BTB entries in your model a command-line option.

[20/10/10/10/10/10/10] <2.3> Write a model of a simple four-state branch target buffer with 64 entries.
a. [20] <2.3> What is the overall hit rate in the BTB (the fraction of times a branch was looked up in the BTB and found present)?
b. [10] <2.3> What is the overall branch misprediction rate on a cold start (the fraction of times a branch was correctly predicted taken or not taken, regardless of whether that prediction belonged to the branch being predicted)?
c. [10] <2.3> Find the most common branch. What was its contribution to the overall number of correct predictions? (Hint: Count the number of times that branch occurs in the history.txt file, then track how each instance of that branch fares within the BTB model.)
d. [10] <2.3> How many capacity misses did your branch predictor suffer?
e. [10] <2.3> What is the effect of a cold start versus a warm start? To find out, run the same input data set once to initialize the history table, and then again to collect the new set of statistics.
f. [10] <2.3> Cold-start the BTB 4 more times, with BTB sizes 16, 32, and 64. Graph the resulting five misprediction rates. Also graph the five hit rates.
g. [10] Submit the well-written, commented source code for your branch target buffer model.
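As a starting point for the four-state BTB model, the core bookkeeping might look like the sketch below. It is written in Python for brevity, so it would still need to be ported to C or Java to meet the exercise as stated. Several choices are assumptions rather than requirements from the text: the four states are modeled as a 2-bit saturating counter, a BTB miss is treated as a not-taken prediction, and a newly allocated entry starts in the weakly-not-taken state. The names `parse_history` and `simulate_btb` are made up for this example.

```python
def parse_history(lines):
    """Yield (branch_addr, target_addr, taken) from tab-separated text lines."""
    for line in lines:
        if not line.strip():
            continue
        addr, target, taken = line.split("\t")
        yield int(addr, 16), int(target, 16), taken.strip() == "1"


def simulate_btb(records, num_entries=64):
    """Direct-mapped BTB with a 2-bit saturating counter per entry."""
    btb = [None] * num_entries          # each entry: [branch_addr, target, counter]
    lookups = hits = correct = 0
    for addr, target, taken in records:
        lookups += 1
        index = addr % num_entries      # no alignment handling, as the text allows
        entry = btb[index]
        if entry is None or entry[0] != addr:
            predicted_taken = False     # assumption: a miss predicts not taken
            entry = [addr, target, 1]   # assumption: start weakly not taken
            btb[index] = entry
        else:
            hits += 1
            predicted_taken = entry[2] >= 2
        if predicted_taken == taken:
            correct += 1
        # Saturating update of the four-state counter (0..3).
        entry[2] = min(3, entry[2] + 1) if taken else max(0, entry[2] - 1)
        if taken:
            entry[1] = target
    if lookups == 0:
        return {"lookups": 0, "hit_rate": 0.0, "misprediction_rate": 0.0}
    return {
        "lookups": lookups,
        "hit_rate": hits / lookups,
        "misprediction_rate": 1 - correct / lookups,
    }


# Tiny made-up trace: the same branch, taken four times in a row.
sample = ["0x10\t0x40\t1"] * 4
stats = simulate_btb(parse_history(sample), num_entries=4)
print(stats)
```

For the real exercise, the records would come from `open("history.txt")` and the number of entries from a command-line argument; a warm start can be modeled by keeping the table between two passes over the same data.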