1. Amdahl's law

Three enhancements with the following speedups are proposed for a new architecture:

Speedup1 = 30
Speedup2 = 20
Speedup3 = 10

Only one enhancement is usable at a time.

a) If enhancements 1 and 2 are each usable for 30% of the time, what fraction of the time must enhancement 3 be used to achieve an overall speedup of 10?
b) Assume that for some benchmark, the fraction of use is 15% for each of enhancements 1 and 2 and 70% for enhancement 3. We want to maximize performance. If only one enhancement can be implemented, which should it be? If two enhancements can be implemented, which should be chosen?

2. Measuring processor time

After graduating, you are asked to become the lead computer designer at Hyper Computer, Inc. Your study of the usage of high-level language constructs suggests that procedure calls are one of the most expensive operations. You have invented a new architecture with an ISA that reduces the loads and stores normally associated with procedure calls and returns. The first thing you do is run some experiments with and without this optimization. Your experiments use the same state-of-the-art optimizing compiler that will be used with either version of the computer. These experiments reveal the following information:

- The clock cycle time of the optimized version is 5% lower than that of the unoptimized version.
- Thirty percent of the instructions in the unoptimized version are loads or stores.
- The optimized version executes two-thirds as many loads and stores as the unoptimized version. For all other instructions the dynamic execution counts are unchanged.
- Every instruction (including load and store) in the unoptimized version takes one clock cycle.
- Due to the optimization, the procedure call and return instructions take one extra cycle in the optimized version, and these instructions account for 5% of the total instruction count in the optimized version.

Which is faster? Justify your decision quantitatively.

3. Amdahl's Law

A particular program P running on a single-processor system takes time T to complete. Let us assume that 40% of the program's code is associated with "data management housekeeping" (according to Amdahl) and, therefore, can only execute sequentially on a single processor. Let us further assume that the rest of the program (60%) is "embarrassingly parallel" in that it can easily be divided into smaller tasks executing concurrently across multiple processors (without any interdependencies or communications among the tasks).

(a) Calculate T2, T4, and T8, which are the times to execute program P on a two-, four-, and eight-processor system, respectively.
(b) Calculate T∞ on a system with an infinite number of processors. Calculate the speedup of the program on this system, where speedup is defined as T/T∞. What does this correspond to?

4. Amdahl's Law II

Amdahl compares and contrasts the performance of three different machines in his paper (Machines A, B, C). For this problem, consider Machines X, Y, Z configured as follows.

- Machine X: A scalar processor with 1 arithmetic unit, running at frequency 2f.
- Machine Y: An array processor with 4 arithmetic units, running at frequency f.
- Machine Z: A VLIW processor with 4 arithmetic units, running at frequency f.

Additionally, we define "instruction-level parallelism" as the degree to which an application's instructions are independent of each other and, therefore, can be executed concurrently.

(a) If a large portion of an application's code has very low instruction-level parallelism, on which machine would it run the fastest, if any, and why?
(b) Describe the characteristics of an application that would perform better on Machine Y than on Machine Z.
(c) Describe the characteristics of an application that would perform better on Machine Z than on Machine Y.

5. Your company has just bought a new dual Pentium processor, and you have been tasked with optimizing your software for this processor. You will run two applications on this dual Pentium, but the resource requirements are not equal. The first application needs 80% of the resources, and the other only 20% of the resources.

a. Given that 30% of the first application is parallelizable, how much speedup would you achieve with that application if run in isolation?
b. Given that 90% of the second application is parallelizable, how much speedup would this application observe if run in isolation?
c. Given that 30% of the first application is parallelizable, how much overall system speedup would you observe if you parallelized it?
d. Given that 90% of the second application is parallelizable, how much overall system speedup would you get?

6. In the load-store architecture of MIPS, operands of arithmetic and logical instructions must be from registers. For a typical integer program, the instruction distribution and CPI of 4 groups are given in the following table.

a. Calculate the average CPI of the integer program.
b. Now, assume that a set of new memory-register arithmetic and logical instructions is added to the ISA. Each memory-register ALU instruction combines one Load and one original ALU instruction. It takes 4 cycles to execute this new type of instruction. Assume 60% of the load instructions can be combined for the program; calculate the new CPI of the integer program.
c. Assume the modification increases the overall cycle time by 5%. Is this modification really worthwhile?

7. Your company's internal studies show that a single-core system is sufficient for the demand on your processing power. You are exploring, however, whether you could save power by using two cores.

a. Assume that your application is 80% parallelizable. By how much could you decrease the frequency and get the same performance?
b. Assume that the voltage may be decreased linearly with the frequency. Using the equation in Section 1.5, how much dynamic power would the dual-core system require as compared to the single-core system?
c. Now assume that the voltage may not decrease below 30% of the original voltage. This voltage is referred to as the "voltage floor", and any voltage lower than that will lose the state. Using the equation in Section 1.5, how much dynamic power would the dual-core system from part (a) require compared to the single-core system when taking into account the voltage floor?

8. You find yourself in a game show presented with 2 machines. You are supposed to pick the fastest one to win an awesome prize! You are given the following information about the two machines A and B (running different compilers).

Machine A has a clock rate of 2 GHz with the following measurements.

Machine B has a clock rate of 2.5 GHz with the following measurements.

To make sure you don't parrot the answers given from the audience, the host asks you the following questions.

a. What is the average CPI of each of machines A and B?
b. On which machine is the program faster with respect to
   i. Execution time
   ii. MIPS rating

9. 30% of a benchmark program's execution time is from multiply operations. Uber cool hardware speeds up these operations 12 times! Suppose the program took 20 seconds to execute without the enhanced hardware. What will be the overall speedup achieved? During its enhanced operation, what is the new execution time, and what is the percentage of time multiply operations take?

10. When making changes to optimize part of a processor, it is often the case that speeding up one type of instruction comes at the cost of slowing down something else. For example, if we put in a complicated fast floating-point unit, that takes space, and something might have to be moved farther away from the middle to accommodate it, adding an extra cycle in delay to reach that unit. The basic Amdahl's law equation does not take this trade-off into account.

Let's assume that for some benchmark program, 15% of the original execution time is taken up by floating point operations, 25% by data accesses, and 30% by I/O operations. You have 3 teams of engineers who come up with cool hardware to enhance each of the operations! But unfortunately, they inadvertently end up affecting other operations as well. Your job is to choose the hardware with the highest overall speedup and reward that team with a bag of goodies!

- Team A comes up with an improvement on the floating point hardware. It speeds up floating point operations 12 times but slows down data accesses by 1.25 times and I/O operations by 1.1 times.
- Team B comes up with an improvement on the data access hardware. It speeds up data accesses by 2.5 times. It slows down the I/O operations by 1.5 times but speeds up floating point operations by 2 times!
- Team C comes up with an improvement for I/O operations which speeds them up by 6 times but slows down data accesses by 2.5 times and leaves the floating point operations unchanged.

11.
Suppose that when Program A is run, the user CPU time is 3 seconds, the elapsed wall-clock time is 4 seconds, and the system performance is 10 MFLOP/sec. Assume that there are no other processes taking any significant amount of time, and the computer is either doing calculations in the CPU, or doing I/O, but it can't do both at the same time. We now replace the processor with one that runs six times faster, but doesn't affect the I/O speed. What will the user CPU time, the wall-clock time, and the MFLOP/sec performance be now?

12. You are on the design team for a new processor. The clock of the processor runs at 200 MHz. The following table gives instruction frequencies for Benchmark B, as well as how many cycles the instructions take, for the different classes of instructions. For this problem, we assume that (unlike many of today's computers) the processor only executes one instruction at a time.

Instruction Type          Frequency   Cycles
Loads & Stores            30%         6 cycles
Arithmetic Instructions   50%         4 cycles
All Others                20%         3 cycles

Calculate the CPI for Benchmark B.

The CPU execution time on the benchmark is exactly 11 seconds. What is the "native MIPS" processor speed for the benchmark in millions of instructions per second?

The hardware expert says that if you double the number of registers, the cycle time must be increased by 20%. What would the new clock speed be (in MHz)?

The compiler expert says that if you double the number of registers, then the compiler will generate code that requires only half the number of Loads & Stores. What would the new CPI be on the benchmark?

How many CPU seconds will the benchmark take if we double the number of registers (taking into account both changes described above)?
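
The relationships used in problem 12 (average CPI, native MIPS, and CPU time) can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function names are hypothetical, "cycle time increased by 20%" is read as a clock rate of 200/1.2 MHz, and "half the number of Loads & Stores" is read as shrinking the total instruction count to 85% of the original while the other instruction counts stay unchanged.

```python
# Instruction mix from the problem 12 table:
# class -> (count relative to the ORIGINAL instruction count, cycles per instruction).
MIX = {
    "loads_stores": (0.30, 6),
    "arithmetic": (0.50, 4),
    "other": (0.20, 3),
}


def average_cpi(mix):
    """Average CPI = total cycles / total instructions for the given mix."""
    total_cycles = sum(frac * cycles for frac, cycles in mix.values())
    total_instructions = sum(frac for frac, _ in mix.values())
    return total_cycles / total_instructions


def native_mips(clock_mhz, cpi):
    """Millions of instructions per second at a given clock rate and CPI."""
    return clock_mhz / cpi


def cpu_time(instruction_count, cpi, clock_hz):
    """CPU time = instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_hz


clock_mhz = 200.0
cpi = average_cpi(MIX)
# Instruction count implied by the stated 11-second run time.
instruction_count = 11.0 * clock_mhz * 1e6 / cpi

# Assumption: doubling the registers halves loads/stores (0.30 -> 0.15 of the
# original count) and lengthens the cycle time by 20%.
new_mix = dict(MIX)
new_mix["loads_stores"] = (0.15, 6)
new_clock_mhz = clock_mhz / 1.2
new_cpi = average_cpi(new_mix)
new_instruction_count = instruction_count * sum(f for f, _ in new_mix.values())
new_time = cpu_time(new_instruction_count, new_cpi, new_clock_mhz * 1e6)

print(f"CPI: {cpi:.3f}, native MIPS: {native_mips(clock_mhz, cpi):.2f}")
print(f"new clock: {new_clock_mhz:.2f} MHz, new CPI: {new_cpi:.3f}")
print(f"new CPU time: {new_time:.2f} s")
```

Keeping the mix fractions relative to the original instruction count, instead of renormalizing them to 100%, lets the same `average_cpi` helper handle a program whose instruction count has changed.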