CIS371 Computer Organization and Design Final Exam Solutions Prof. Martin Wednesday, May 2nd, 2012

1. [ 12 Points ] Datapath & Pipelining.

(a) Consider a simple in-order five-stage pipeline with a two-cycle branch misprediction penalty and a single-cycle load-use delay penalty. For a specific program, 30% of the instructions are loads, 20% are branches, and the remaining 50% of instructions are simple single-cycle ALU operations. Half of the load instructions are followed immediately by a dependent instruction, and 75% of branches are predicted correctly. What is the average CPI of this program on this processor?

Answer: [3 Points] The CPI is 1, plus one extra cycle for each load followed by a dependent instruction (15% of instructions), plus two extra cycles for each mispredicted branch (25% of the 20% of instructions that are branches). Thus, the CPI is 1 + (1 × 0.15) + (2 × 0.25 × 0.20) = 1.25.

(b) The scalar pipeline in the lab assignment had no structural hazards. What two aspects of the design were responsible for avoiding such structural hazards?

Answer: [2 points] (1) Enough resources. Replicating the resource so that both instructions can proceed in parallel. The pipeline had separate memory ports for instruction fetch and data accesses. (2) Pipeline organization. The 5-stage pipeline from the lab assignment avoids structural hazards on the register file write port by delaying all register writes until the W stage (ALU instructions could write the register file in M, but this could cause a structural hazard with a preceding load instruction trying to write the register file in W).

(c) The superscalar pipeline from the lab did have a structural hazard. What was the cause of the hazard and how was it handled in the design?

Answer: [2 points] There was only one load/store port, so if the pipeline attempted two memory operations in the same cycle, the second of the two was forced to stall to avoid the structural hazard.

(d) The maximum speedup achievable by pipelining a single-cycle datapath into five stages is 5x. Give three distinct reasons why this ideal speedup is generally not achieved in practice:

Answer: [3 points] Any three of: (1) Data hazards (load-to-use data dependencies) will hurt CPI. (2) Control hazards (branch mispredictions) will hurt CPI. (3) The stages won't be divided exactly evenly, preventing a 5x clock improvement. (4) The pipeline register overhead will also prevent a 5x improvement in clock.

(e) The maximum speedup achievable by converting a scalar single-cycle datapath into a two-issue superscalar single-cycle processor is 2x. Give two distinct reasons why this ideal speedup is generally not achieved in practice:

Answer: [2 points] (1) Adding the additional control logic and register file ports could hurt the clock frequency. (2) Not every pair of instructions would be independent, so sometimes the pipeline would only execute one instruction (rather than the peak throughput of two instructions per cycle).
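The CPI arithmetic in part (a) can be checked with a short calculation. This is only a sketch: the function name `average_cpi` and its parameter names are illustrative assumptions, not anything taken from the exam or the lab.

```python
def average_cpi(load_frac, dependent_load_frac, load_use_penalty,
                branch_frac, mispredict_rate, mispredict_penalty,
                base_cpi=1.0):
    """Base CPI plus the average number of stall cycles per instruction."""
    # Stalls from loads that are immediately followed by a dependent instruction.
    load_stalls = load_frac * dependent_load_frac * load_use_penalty
    # Stalls from mispredicted branches.
    branch_stalls = branch_frac * mispredict_rate * mispredict_penalty
    return base_cpi + load_stalls + branch_stalls


# Numbers from part (a): 30% loads (half followed by a dependent instruction,
# one-cycle penalty) and 20% branches (25% mispredicted, two-cycle penalty).
cpi = average_cpi(load_frac=0.30, dependent_load_frac=0.5, load_use_penalty=1,
                  branch_frac=0.20, mispredict_rate=0.25, mispredict_penalty=2)
print(f"CPI = {cpi:.2f}")
```

Writing the stall terms separately makes it easy to see that loads contribute 0.15 cycles per instruction and branches 0.10.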

2. [ 8 Points ] Branch Prediction Conflicts and Tagged Predictors. You're a microprocessor designer, and your simulations of an important workload indicate that branch instructions at the following two 32-bit addresses (in binary) are executed frequently:

Address A:
Address B:

Answer: Note: This question had by far the lowest score on the exam (average of less than 2 points out of 8). This question was obviously much more difficult than I anticipated, perhaps because the predictor in the lab assignment was a BTB-style predictor versus the branch-direction predictor discussed in this question.

(a) The above two instructions are likely to conflict (hash to the same entry) in a branch predictor. For a simple bimodal predictor of two-bit saturating counters, how many entries must the predictor have to prevent these two branches from interfering (conflicting) with each other? How many total bytes must the predictor be?

Answer: These two addresses have the lower 15 bits in common. Thus, the index would need to be at least 16 bits to map these two addresses to different entries in the branch predictor. That would require a 64k-entry predictor, which at two bits per entry is 16KB.

(b) You have a brilliant idea: Why not create a set-associative branch direction predictor? Your simulations indicate that a predictor with just 2048 entries (512 bytes) would be sufficient if it weren't for these two trouble branches. Consider a two-way set-associative predictor that uses a straightforward tagging strategy (one tag for each two-bit counter, instructions have 32-bit addresses). How large (in KBs) is a two-way set-associative 2048-entry tagged predictor?

Answer: The 2048-entry predictor has 1024 sets in each of its two ways, so the lower 10 bits are used to index into the table. Thus, the tags are each 32 - 10 = 22 bits. Adding the two-bit counter, each entry is 24 bits or 3 bytes. 2K entries at 3 bytes each is 6KB. (Actually, unless random replacement was used, an LRU bit is also needed, which would increase the size of each entry to 25 bits, for 6.25KB.)

(c) You have another brilliant idea: Because this is just a predictor, it can be wrong, so it doesn't actually need the full tags, just enough of a tag to avoid this particular conflict. With this new insight, (1) how large should the tags be and (2) what is the total size in KBs of this predictor?

Answer: To avoid the conflict, we only need to tag the lower 16 bits of the address. 10 bits are already covered by indexing into the 2k-entry predictor (1024 sets). Thus, only 6 tag bits are needed. Six bits plus the 2-bit saturating counter is a total of 8 bits (1 byte) per entry. 2k entries at 1 byte each is 2KB. (As above, an LRU bit would add a bit to each entry (9 bits) for a total of 2.25KB.)

(d) How might a predictor capture both the conflict-mitigating benefits of a tagged set-associative predictor and most of the area efficiency of a tag-less predictor?

Answer: Some sort of hybrid predictor. Perhaps something similar in concept to how a small victim buffer can mitigate conflicts in direct-mapped caches: form a hybrid of a small set-associative predictor and a large tag-less predictor. Access the predictors in parallel; if the tag matches, use the prediction from the set-associative predictor. Otherwise, use the prediction from the large tag-less predictor. To make the most efficient use of the small predictor, insert a new entry (new tag) into it only when a branch is mispredicted (if the large table is getting a branch correct, there is no need to put that branch in the small predictor).
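The storage sizes in parts (a) through (c) all follow from one formula: entries multiplied by bits per entry. The sketch below reproduces that arithmetic; the helper names `predictor_bits` and `to_kb` are illustrative assumptions, and the LRU bit is counted per entry, as in the answers above.

```python
def predictor_bits(entries, counter_bits=2, tag_bits=0, lru_bits=0):
    """Total storage in bits, with counter, tag and LRU bits in every entry."""
    return entries * (counter_bits + tag_bits + lru_bits)


def to_kb(bits):
    """Convert bits to kilobytes (1 KB = 1024 bytes)."""
    return bits / 8 / 1024


ADDRESS_BITS = 32
ENTRIES = 2048
WAYS = 2
index_bits = (ENTRIES // WAYS).bit_length() - 1   # log2(1024 sets) = 10

# (a) Tag-less bimodal predictor indexed by 16 address bits.
bimodal_kb = to_kb(predictor_bits(2 ** 16))

# (b) Two-way set-associative predictor with full tags.
full_tag_bits = ADDRESS_BITS - index_bits          # 22
full_kb = to_kb(predictor_bits(ENTRIES, tag_bits=full_tag_bits))
full_lru_kb = to_kb(predictor_bits(ENTRIES, tag_bits=full_tag_bits, lru_bits=1))

# (c) Partial tags that only distinguish the low 16 address bits.
partial_tag_bits = 16 - index_bits                 # 6
partial_kb = to_kb(predictor_bits(ENTRIES, tag_bits=partial_tag_bits))
partial_lru_kb = to_kb(predictor_bits(ENTRIES, tag_bits=partial_tag_bits, lru_bits=1))

for name, kb in [("bimodal, 64k entries", bimodal_kb),
                 ("full tags", full_kb), ("full tags + LRU", full_lru_kb),
                 ("partial tags", partial_kb), ("partial tags + LRU", partial_lru_kb)]:
    print(f"{name}: {kb} KB")
```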

3. [ 10 Points ] Caching. Consider two different cache configurations for an 8-bit processor. Both caches have two 16-byte blocks (for a total capacity of 32 bytes), but one is direct-mapped and the other is two-way set-associative and uses the least-recently used (LRU) replacement algorithm. All caches begin empty.

(a) For the direct-mapped cache, how many bits are in the tag, index, and offset?

Answer: [2 Points] tag: 3, index: 1, offset: 4

(b) For the two-way set-associative cache, how many bits are in the tag, index, and offset?

Answer: [2 Points] tag: 4, index: 0, offset: 4

(c) Give a short sequence (4 or fewer) of addresses in which the set-associative cache has a better hit rate than the direct-mapped cache. (Give the addresses as an 8-digit binary number.) Also give the miss rate for each.

Sequence: Answer: [3 Points] , ,
Miss rate on direct-mapped: Answer: 100%.
Miss rate on set-associative: Answer: 66%.

(d) Give a short sequence (4 or fewer) of addresses in which the direct-mapped cache has a better hit rate than the set-associative cache. (Give the addresses as an 8-digit binary number.) Also give the miss rate for each.

Sequence: Answer: [3 Points] , , ,
Miss rate on direct-mapped: Answer: 75%
Miss rate on set-associative: Answer: 100%
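The two behaviours asked for in parts (c) and (d) can be explored with a small simulation of both caches. This is a minimal sketch: the function names are made up for illustration, and the address sequences `seq_c` and `seq_d` are hypothetical examples chosen to match the stated miss rates, not necessarily the sequences from the original solution.

```python
from collections import OrderedDict

OFFSET_BITS = 4  # 16-byte blocks


def miss_rate_direct_mapped(addresses, num_sets=2):
    """Direct-mapped cache: each set holds exactly one block."""
    sets = {}     # index -> tag currently stored in that set
    misses = 0
    for addr in addresses:
        block = addr >> OFFSET_BITS
        index, tag = block % num_sets, block // num_sets
        if sets.get(index) != tag:
            misses += 1
            sets[index] = tag
    return misses / len(addresses)


def miss_rate_two_way_lru(addresses, num_blocks=2):
    """Two-way set-associative cache with one set and LRU replacement."""
    blocks = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for addr in addresses:
        tag = addr >> OFFSET_BITS
        if tag in blocks:
            blocks.move_to_end(tag)          # mark as most recently used
        else:
            misses += 1
            if len(blocks) == num_blocks:
                blocks.popitem(last=False)   # evict the least recently used block
            blocks[tag] = True
    return misses / len(addresses)


# Hypothetical sequence for (c): two blocks that collide in the direct-mapped cache.
seq_c = [0b00000000, 0b00100000, 0b00000000]
# Hypothetical sequence for (d): a third block evicts the LRU block that is reused next.
seq_d = [0b00000000, 0b00010000, 0b00110000, 0b00000000]

for name, seq in [("(c)", seq_c), ("(d)", seq_d)]:
    print(name, "direct-mapped:", miss_rate_direct_mapped(seq),
          "set-associative:", miss_rate_two_way_lru(seq))
```

In `seq_c` the set-associative cache wins because it can hold both conflicting blocks at once. In `seq_d` the direct-mapped cache wins because its fixed mapping protects the first block from eviction, while LRU throws it away just before it is reused.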

4. [ 11 Points ] Memory Hierarchy Calculations.

(a) Consider a simple core with a single level of cache that achieves 1 CPI when every memory operation hits in the cache. The miss penalty to the main memory is 150 cycles. For a specific workload, the miss rate is 1 miss per every 50 instructions (or 0.02 misses per instruction). What is the CPI of this chip for this workload?

Answer: [2 points] 1 + (0.02 × 150) = 4 CPI.

(b) Consider adding a second-level cache in between the first-level cache and memory. For this workload, a miss in the second-level cache occurs only once every 250 instructions (0.004 misses per instruction) and has the same latency as accessing memory as above (150 cycles). The first-level cache is unchanged. Calculate under what conditions adding this cache will at least double the performance (halve the CPI).

Answer: [3 points] As the previous CPI was 4, we need to target a CPI of 2. Thus, we can only spend 1 CPI in the memory system. The unknown parameter is the access latency L of the L2 cache: (0.02 × L) + (0.004 × 150) = 1. Solving for L gives L = 20. Thus, the access latency of the second-level cache must be 20 cycles or faster.

Now let's revisit this design in the context of energy. Assume that when all memory operations hit in the cache, the core consumes 1 nano-joule per instruction (njpi). At 1 CPI at 1 GHz, that is equal to expending 1 Watt of power. That is, njpi * IPC * (clock frequency in GHz) = Watts. Hint: apply what you have learned about calculating CPIs.

(c) Of course, the memory hierarchy also consumes energy. Using the same miss rates (1 miss per every 50 instructions), if a main memory access consumes 100 nJ, calculate the nano-joules per instruction of the single-level cache hierarchy.

Answer: [1 point] The calculation is largely the same as doing the CPI calculation: 1 + (0.02 × 100) = 3 njpi.

(d) Perform a similar calculation (to part (c)) for the two-level hierarchy, assuming the energy to access the second-level cache is 30 nJ and it incurs one miss every 250 instructions.

Answer: [1 point] This calculation is also largely the same as doing a CPI calculation: 1 + (0.02 × 30) + (0.004 × 100) = 2 njpi.

(e) Did adding the second level of cache increase or decrease the energy per instruction?

Answer: [1 point] It decreased the energy per instruction.

(f) Now consider energy per second, also known as power and measured in Watts. Assuming that adding the second level of cache does halve the CPI, does it increase or decrease the power dissipation of the chip? Why?

Answer: [1 point] It increased the energy per second (power). By the formula above at 1 GHz, power goes from 3 njpi × 0.25 IPC = 0.75 Watts to 2 njpi × 0.5 IPC = 1 Watt.

(g) Based on these two metrics, would you expect battery life of a mobile device using this chip to increase or decrease? Explain.

Answer: [2 points] If the user is doing the same amount of computation, the battery life of the device is expected to improve (increase), because of the improved energy efficiency. However, because the chip got 2x faster but only 1.5x more energy efficient, the overall power actually went up. Thus, if the user increases the amount of computation they do (which is reasonable, as the computation is faster), then the battery life will actually get worse (decrease).
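The CPI, energy and power figures in this question all use the same pattern: a base cost plus (events per instruction × cost per event). The sketch below works through them under the numbers given in the question; the helper name `per_instruction_cost` and the variable names are illustrative assumptions.

```python
def per_instruction_cost(base, events):
    """Base cost plus the sum of (events per instruction * cost per event)."""
    return base + sum(rate * cost for rate, cost in events)


L1_MISSES = 0.02       # first-level misses per instruction
L2_MISSES = 0.004      # second-level misses per instruction
MEM_LATENCY = 150      # cycles
CLOCK_GHZ = 1.0

# (a) CPI with a single level of cache.
cpi_one_level = per_instruction_cost(1, [(L1_MISSES, MEM_LATENCY)])

# (b) Largest L2 latency that still halves the CPI:
#     1 + L1_MISSES * L + L2_MISSES * MEM_LATENCY = cpi_one_level / 2
cpi_two_level = cpi_one_level / 2
max_l2_latency = (cpi_two_level - 1 - L2_MISSES * MEM_LATENCY) / L1_MISSES

# (c), (d) Energy per instruction in nJ (100 nJ per memory access, 30 nJ per L2 access).
njpi_one_level = per_instruction_cost(1, [(L1_MISSES, 100)])
njpi_two_level = per_instruction_cost(1, [(L1_MISSES, 30), (L2_MISSES, 100)])

# (f) Power in Watts = njpi * IPC * clock frequency in GHz.
watts_one_level = njpi_one_level * (1 / cpi_one_level) * CLOCK_GHZ
watts_two_level = njpi_two_level * (1 / cpi_two_level) * CLOCK_GHZ

print(f"CPI: {cpi_one_level:.2f} -> {cpi_two_level:.2f} "
      f"(L2 latency at most {max_l2_latency:.1f} cycles)")
print(f"Energy per instruction: {njpi_one_level:.2f} -> {njpi_two_level:.2f} njpi")
print(f"Power: {watts_one_level:.2f} -> {watts_two_level:.2f} W")
```

Energy per instruction falls while power rises because instructions now complete twice as fast, which is the tension that part (g) asks about.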

5. [ 11 Points ] Scheduling. When the code below is executed, assume the loop iterates thousands of times, and thus you should ignore any startup or initialization effects.

LOOP: MUL V <- X * X
      ADD Y <- Y + 1
      STORE 1 -> [Y+V]
      BRANCH-IF (V == 0), EXIT
      LOAD X <- [X]
      BRANCH LOOP

LOAD and MUL have a latency of more than one cycle; all other operations have a latency of one cycle. The pipeline is fully bypassed, has no structural hazards, and all execution units are fully pipelined. The hardware's branch direction and branch target prediction are both perfect. The pipeline is non-superscalar, so it can execute at most one instruction per cycle. Initially, consider an in-order pipeline.

(a) Assuming the latency of MUL and LOAD are both two cycles (one-cycle use penalty), approximately how many cycles would it take to execute this loop 1000 times (to the nearest 1000 cycles)? Why?

Answer: [2 points] As there are six instructions per iteration, it should take at least six cycles plus any stalls. As there is an independent instruction after both the MUL and LOAD, all of the stall cycles are hidden. Thus, 6 × 1000 = 6000 cycles.

(b) Assuming the latency of MUL and LOAD are both increased to four cycles (three-cycle use penalty), approximately how many cycles would it take to execute this loop 1000 times (to the nearest 1000 cycles)? Why?

Answer: [3 points] With a three-cycle use penalty, the pipeline must add two stall cycles after the MUL (one of the three penalty cycles is hidden by the independent ADD). In addition, there is a cross-loop dependency between the LOAD and the use of X, adding two more stall cycles. Thus, 6 + 2 + 2 = 10 cycles per iteration, or 10,000 cycles for 1000 iterations.

(c) Static scheduling does not improve the performance of this loop. Why not? Give two specific limitations that combine to prevent static scheduling from being effective in this example.

Answer: [4 points] Here we were looking for two specific limitations. First, the LOAD cannot be moved before the BRANCH-IF, as the LOAD might access invalid memory: when V is zero, X would be zero, so the LOAD would be a NULL reference, which would raise a page fault and cause the operating system to kill the program. This is the scheduling scope problem discussed in class. Second, the LOAD cannot be moved before the STORE, as the store might write to the same location. This is unlikely, but the compiler needs to generate correct code for all cases. This is the memory aliasing problem discussed in class. The answer "not enough registers" is not a correct answer for this example, as moving up the LOAD does not increase the number of registers needed.

(d) Now consider a dynamically scheduled (out-of-order) pipeline that supports a large number of in-flight instructions (100+). The pipeline is still non-superscalar, so it can execute a maximum of one instruction per cycle. Assuming the same four-cycle latency (three-cycle use penalty) for MUL and LOAD instructions, approximately how many cycles would it take to execute this loop 1000 times (to the nearest 1000 cycles)? Why?

Answer: [2 points] 6 cycles per iteration, or 6000 cycles. Why? Dynamic scheduling can move the LOAD up past the BRANCH and STORE instructions, fully hiding its execution latency. This also puts another instruction between the MUL and its use. If only a single iteration was considered, this would take 7 cycles per iteration. However, dynamically scheduled processors can execute instructions across iterations. The first few iterations will incur this execution stall, but the fetch engine will continue fetching and renaming one instruction per cycle, increasing the number of instructions in the out-of-order scheduling window. Eventually, there will be enough instructions in the window to find independent instructions. Thus, the remaining stall will be filled in, resulting in 6 cycles per iteration (the maximum sustainable for a scalar pipeline). In fact, a superscalar pipeline could sustain one iteration every four cycles, because the critical path of the computation is only the LOAD feeding the LOAD in the next iteration.
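The in-order cycle counts in parts (a) and (b) can be reproduced with a small issue-time model: an instruction issues in the first cycle that is both after the previous instruction and no earlier than the cycle in which all of its source registers are ready. This is a simplified sketch, not the lab pipeline. The encoding of the loop, the function names, and the idealized out-of-order bound at the end are all assumptions made for illustration.

```python
import math

# (destination register, source registers, uses the long MUL/LOAD latency?)
LOOP = [
    ("V", ["X"], True),          # MUL  V <- X * X
    ("Y", ["Y"], False),         # ADD  Y <- Y + 1
    (None, ["Y", "V"], False),   # STORE 1 -> [Y+V]
    (None, ["V"], False),        # BRANCH-IF (V == 0), EXIT
    ("X", ["X"], True),          # LOAD X <- [X]
    (None, [], False),           # BRANCH LOOP
]


def in_order_cycles_per_iteration(long_latency, iterations=1000):
    """Steady-state cycles per iteration on a scalar, fully bypassed in-order pipeline."""
    ready = {}            # register -> first cycle in which its value can be used
    next_slot = 0         # earliest cycle in which the next instruction may issue
    iteration_starts = []
    for _ in range(iterations):
        for position, (dest, sources, is_long) in enumerate(LOOP):
            issue = max([next_slot] + [ready.get(src, 0) for src in sources])
            if position == 0:
                iteration_starts.append(issue)
            if dest is not None:
                ready[dest] = issue + (long_latency if is_long else 1)
            next_slot = issue + 1
    # Ignore startup effects by measuring the gap between the last two iterations.
    return iteration_starts[-1] - iteration_starts[-2]


def out_of_order_bound(long_latency, issue_width=1):
    """Idealized bound: limited by issue bandwidth or by the LOAD-to-LOAD chain."""
    return max(math.ceil(len(LOOP) / issue_width), long_latency)


print("in-order, latency 2:", in_order_cycles_per_iteration(2))
print("in-order, latency 4:", in_order_cycles_per_iteration(4))
print("out-of-order bound, scalar:", out_of_order_bound(4, issue_width=1))
print("out-of-order bound, two-wide:", out_of_order_bound(4, issue_width=2))
```

The out-of-order function is only a lower bound that assumes a large enough window and perfect prediction, which matches the assumptions stated in part (d).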

6. [ 10 Points ] Thread-Level Parallelism and Multicore.

(a) What is the primary disadvantage of coarse-grained locking?

Answer: It limits parallelism because of lock contention.

(b) What are two disadvantages of employing fine-grained locking?

Answer: [2 points] (1) More difficult programming and (2) each lock acquire and release adds runtime overhead.

(c) What new mechanism has recently been proposed to help ameliorate the locking granularity problem?

Answer: Transactional memory.

(d) What is the primary advantage of the MSI protocol over the simpler VI cache coherence protocol?

Answer: MSI allows multiple processors to share read access to a block, reducing the number of coherence-induced cache misses.

(e) What is the primary advantage of the MESI protocol over the simpler MSI cache coherence protocol?

Answer: MESI gives a processor a clean read/write copy of a block if no other processors are sharing the block. This avoids an upgrade miss on a subsequent write.

(f) What is false sharing? How can software help avoid false sharing?

Answer: [2 points] False sharing is when two processors are accessing different parts of the same cache block (and at least one of the processors is writing the block). Software can help by placing data accessed by different processors in different cache blocks.

(g) What is a memory fence (also known as memory barrier)? Where are they typically used?

Answer: [2 points] Memory fences are instructions used to ensure ordering among instructions in systems with relaxed memory consistency models. They are used when writing synchronization operations, such as after a lock acquire and before a lock release.
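The differences between VI, MSI and MESI in parts (d) and (e) can be illustrated with a toy model that counts coherence-related misses (including upgrade misses) for accesses to a single cache block. This is a deliberately simplified sketch: the function name, the access encoding, and the state bookkeeping are assumptions made for illustration, and details of real protocols such as writebacks and transient states are ignored.

```python
def coherence_misses(accesses, protocol):
    """Count misses and upgrade misses for accesses to ONE block.

    accesses: list of (cpu, "R" or "W"); protocol: "VI", "MSI" or "MESI".
    """
    state = {}    # cpu -> "V", "M", "E" or "S"; a cpu not present is Invalid
    misses = 0
    for cpu, op in accesses:
        mine = state.get(cpu, "I")
        if protocol == "VI":
            # Only one cache may hold the block at a time, even for reads.
            if mine == "I":
                misses += 1
                state = {cpu: "V"}
            continue
        if op == "R":
            if mine == "I":
                misses += 1
                others = [c for c in state if c != cpu]
                for c in others:
                    state[c] = "S"           # current holders drop to Shared
                alone = protocol == "MESI" and not others
                state[cpu] = "E" if alone else "S"
        else:
            if mine in ("I", "S"):
                misses += 1                  # write miss or upgrade miss
            # From E the write is silent; either way this cpu ends in Modified
            # and every other copy is invalidated.
            state = {cpu: "M"}
    return misses


# Two processors alternately reading the same block (part (d)).
readers = [(0, "R"), (1, "R"), (0, "R"), (1, "R")]
# One processor reading and then writing a block nobody else holds (part (e)).
private = [(0, "R"), (0, "W")]

for protocol in ("VI", "MSI", "MESI"):
    print(protocol, "shared reads:", coherence_misses(readers, protocol),
          "private read-then-write:", coherence_misses(private, protocol))
```

In this model VI misses on every alternating read, MSI and MESI miss only on the first read by each processor, and only MESI avoids the upgrade miss on the private read-then-write pattern.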

7. [ 16 Points ] Parallelism At All Levels. Throughout the semester we've explored parallelism at many different levels (from very fine-grained parallelism within an ALU to multiple cores on a chip). Give an example use of parallelism at five different levels of granularity. These should be big-picture major approaches for using parallelism to extract significant performance (for example, each could provide a 4x or more performance improvement). For each of the examples, give: (i) the name or term, (ii) the specific reasons for (or benefits of) employing parallelism, and (iii) the primary disadvantage or challenge of exploiting parallelism at that level of granularity.

Answer:

(a) Arithmetic (carry-select addition) - reduces the latency of arithmetic operations at the small cost of some additional logic.

(b) Pipelining (multiple phases of instructions) - greatly increases the clock frequency (and thus throughput) of the datapath. The disadvantages are that (1) it increases the complexity of the design (for example, to handle stalls) and (2) it has diminishing benefits because of branch mispredictions and other pipeline stalls.

(c) Superscalar (multiple independent instructions in the same cycle) - up to a point, can multiply the performance of a processor. The advantage is that it can be done transparently to the programmer, but it does require a sophisticated hardware implementation (and it can hurt clock frequency and energy).

(d) Multiprocessors (multiple pipelines all executing concurrently) - adding more processor cores adds many more transistors, but has the potential to linearly increase performance. Its main disadvantages are (1) that it requires programmers to write parallel software and (2) few programs have enough inherent parallelism to achieve linear speedup on a large number of cores.

(e) Vectors/SIMD (one instruction that performs the same arithmetic operation on multiple data elements in parallel) - vectors can increase performance, but they require programmer-inserted annotations.

As an aside, all of these forms of parallelism meet with diminishing returns beyond some point. That is why all of these forms of parallelism are employed, as exploiting any one of them isn't as good as using them all to a moderate degree.

Give a specific concrete instance of a system that exploits all of these forms of parallelism:

Answer: The XBox 360

Score Distribution

[Figure: histogram of the number of students (vertical axis) versus score (horizontal axis).]

Mean: 46 points (59%)
Median: 46 points (59%)
High: 70 (90%)