Week 1 out-of-class notes, discussions and sample problems

Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types of processors.

Handheld/mobile devices: we need powerful processors that are energy efficient (due to battery restrictions) and produce little heat (due to the lack of a fan), yet offer real-time performance and graphics processing. The ARM family of processors is the most common, based on the Acorn RISC Machine, first introduced in the mid 80s. Around 1990, Apple partnered with Acorn (the joint venture became Advanced RISC Machines Ltd.) to develop new ARM processor cores, and it is on these that most current ARM processors are based. The ARM family is generally denoted by the following features:
- Load/store instruction set
- 16 32-bit registers, some of which are reserved for use by the OS
- Fixed-length 32-bit instructions
- Single clock-cycle execution for most instructions
- Conditional execution rather than branch prediction
- Condition codes used only if specified
- Indexed addressing modes

The early ARM processors used a 3-stage pipeline, later expanded to as many as 13 stages. Branch prediction was added to later versions to improve over conditional execution (we will talk about conditional execution later in the semester). Later versions also implemented the Thumb instruction set, which consists of 16-bit instructions. This allows two instructions to be fetched at a time, and possibly executed together. Thumb instructions can be mixed in with ordinary code. The idea behind ARM is to have a scaled-back ISA so that the processors can squeeze a good deal of parallelism out of code. Since most handheld devices are running only one or a few apps at a time, there is less need for large memories, the fastest clock speeds or the power found in larger computers. This keeps power consumption, heat production and cost down.

Desktop/laptop computers: for these devices, we need to manage the tradeoff between price and performance. Obviously users want better performance but are only willing to spend between $300 and $2500 on a desktop/laptop unit. The largest requirements are to support modest multitasking (e.g., up to 10 processes at a time), graphics and other forms of multimedia, Internet communication and common forms of productivity software, as well as the luxury of running a complex operating system that handles user duties with little interaction. Memory requirements are somewhat lofty because users will multitask and because the Windows and Mac operating systems are large. This requires not only 4-8 GB of RAM but also as many as 3 levels of cache, organized in such a way that cache performance does not negatively impact the processor. Additionally, modern processors give off a good amount of heat, so cooling fans must be available. The most common PC processors today are the latest generations of Intel Pentium, Xeon, Celeron and now Core processors. AMD is currently one of the few competitors in the PC market, offering the FX, Phenom II and Athlon II processors.

Servers: introduced in the 1980s to serve as file servers, servers are now more generically titled and range in usage from simple file servers (often found in LANs, or used as web or database servers on the Internet) to servicing distributed processing for ATM machines, airline reservations and on-line services (e.g., the Amazon web site, the Google search engine).
In the latter case, the authors refer to this as a cluster, or warehouse-scale computer. Cloud computing also fits in this category. Higher-end servers reach supercomputer status. Costs for a server range from $5K through $10M, and up to $200M for a cluster. The most important aspect of this class of computer is throughput: the number of services handled per unit of time. Throughput is impacted as much by memory capacity and telecommunications as it is by processor capability. Scalability is another important feature, primarily impacted by how easy it is to add memory and hard disk space to the computer(s).
The server/cluster end has largely replaced the mainframe computers of old. There is a wide range of processors used by servers, but the more significant performance increase comes from multiprocessing rather than from improvements made to a single processor, as we see in the PC market.

Embedded computers: at the other extreme from clusters is the embedded computer, a processor embedded in another device (e.g., microwave oven, car engine). These devices are often 8-bit or 16-bit processors with minimal storage and modest power requirements. They often cost less than $5 and seldom cost more than $100.

What we want to cover in this class is the common set of processor improvements that we see standard in most processors, no matter which platform they are intended for. The primary tool for processor improvement is parallelism. There are many forms of parallelism, which the authors divide between data-level and task-level. We implement these using instruction-level parallelism through pipelining and speculative execution, vector-level parallelism using an SIMD-style architecture, thread-level parallelism and request-level parallelism (not covered in this course). The two main efforts to achieve parallelism are through the processor and through the cache. In the processor, we use pipelining and multiple functional units so that one or more instructions can be issued each clock cycle. Due to the complexity of modern processors, instructions may finish execution out of order, and therefore we need additional hardware to re-order the instructions upon completion. In the cache, we want to ensure as few cache misses as possible so that the instruction issue stage of the pipeline does not stall and instructions do not wait on memory. So, the principle of locality of reference is applied. We will visit many other cache improvements later in the semester.

Above all, we focus on the common case. As we saw in class, Amdahl's Law shows us that no matter what level of speedup we might achieve through some improvement, it is the common case that will win out. Consider, for instance, an improvement that can be used 80% of the time and increases performance by 50% (a speedup of 1.5) versus an improvement that can be used 25% of the time and increases performance by a factor of 10.

Improvement 1: 1 / (1 - .8 + .8 / 1.5) = 1.36 (36% speedup)
Improvement 2: 1 / (1 - .25 + .25 / 10) = 1.29 (29% speedup)
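Since we will use Amdahl's Law repeatedly, here is a minimal Python sketch (the function name amdahl_speedup is mine, not from the text) that reproduces the two results:

    def amdahl_speedup(fraction, speedup):
        # overall speedup when 'fraction' of execution time
        # is improved by a factor of 'speedup'
        return 1 / ((1 - fraction) + fraction / speedup)

    print(amdahl_speedup(0.80, 1.5))   # improvement 1: about 1.36
    print(amdahl_speedup(0.25, 10))    # improvement 2: about 1.29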
Later in the semester, we will look at the initial x86 pipeline and see how the CISC features of x86 complicated the pipeline to the point of poor performance. We cover the MIPS instruction set because it is a model instruction set to aim for; that is, it was designed specifically to promote an efficient pipeline. MIPS was originally developed in the early 80s. Because of this, it lacks some features that we now want present to help support further parallelism. For instance, there are no vector processing instructions in MIPS (we will briefly visit this later in the semester), nor are there graphics processing instructions (we will not examine these, although they are in the textbook). As covered in class, the typical MIPS processor uses a 5-stage fetch-execute cycle. Next week, in the out-of-class notes, you will compare it to the MIPS R4000, which uses an 8-stage fetch-execute cycle.

We wrap up the notes for the out-of-class portion by looking at several example problems. Also visit the discussion board.

1. It seems that a quad core processor should speed up a computer by a factor of 4, but it doesn't. Use Amdahl's Law to compute the percentage of program execution that would have to be distributed across all four cores to achieve an overall speedup of 3. Of 2. Of 1.5. Of 1.25.

Answer: We want to solve for x in y = 1 / (1 - x + x / 4), where y is 3, 2, 1.5 and 1.25. This involves a little algebra, but we wind up with x = 4 / 3 * (1 - 1 / y). For y = 3, x = .889. For y = 2, x = .667. For y = 1.5, x = .444. For y = 1.25, x = .267. So to achieve a speedup of 1.25, all four cores must be in use about 26.7% of the time, but to achieve a speedup of 3, all four cores must be in use 88.9% of the time.
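The closed form is easy to sanity-check by plugging each x back into Amdahl's Law; a small sketch:

    # closed form derived above: x = (4/3) * (1 - 1/y) for 4 cores
    for y in (3, 2, 1.5, 1.25):
        x = (4 / 3) * (1 - 1 / y)
        assert abs(1 / (1 - x + x / 4) - y) < 1e-9   # confirms the algebra
        print(y, round(x, 3))                        # .889, .667, .444, .267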
2. Let's compare a CISC machine versus a RISC machine on a benchmark. Assume the following characteristics of the two machines.

CISC: CPI of 4 for load/store, 3 for ALU/branch and 10 for call/return; CPU clock rate of 2.75 GHz.
RISC: CPI of 1.4 (the machine is pipelined, so the ideal CPI is 1.0, but overhead and stalls make it 1.4); CPU clock rate of 2 GHz.
Since the CISC machine has more complex instructions, the IC for the CISC machine is 40% smaller than the IC for the RISC machine.
The benchmark has a breakdown of 38% loads, 10% stores, 35% ALU operations, 3% calls, 3% returns and 11% branches.

Which machine will run the benchmark in less time, and by how much?

Answer: use CPU time = IC * CPI * clock cycle time.
RISC: IC_RISC * CPI_RISC * clock cycle time_RISC = IC_RISC * 1.4 * 1 / 2 GHz = 0.7 ns * IC_RISC
CISC: IC_CISC * CPI_CISC * clock cycle time_CISC = IC_RISC * 0.6 * (4 * .38 + 4 * .10 + 3 * .35 + 10 * .03 + 10 * .03 + 3 * .11) * 1 / 2.75 GHz = IC_RISC * 0.6 * 3.9 / 2.75 GHz = 0.851 ns * IC_RISC
Since the CISC machine has the higher CPU time, the RISC machine is faster by 0.851 / 0.7 = 1.216, or about 22%.
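The same comparison in executable form; a minimal sketch where the dictionaries are just my encoding of the benchmark mix and the CISC CPIs given above:

    mix = {'load': .38, 'store': .10, 'alu': .35,
           'call': .03, 'ret': .03, 'branch': .11}
    cisc_cpi = {'load': 4, 'store': 4, 'alu': 3,
                'call': 10, 'ret': 10, 'branch': 3}

    avg_cpi = sum(mix[k] * cisc_cpi[k] for k in mix)   # 3.9
    cisc_time = 0.6 * avg_cpi / 2.75e9                 # seconds per RISC instruction
    risc_time = 1.4 / 2e9
    print(cisc_time / risc_time)                       # about 1.216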
3. The MIPS instruction set passes parameters through memory, thus slowing down function calls. An alternate architecture, Berkeley RISC, uses register windows. Register windows place the local variables of a function into a set of registers. Those being passed as parameters to another function are placed into another set of registers, which overlap the registers available to the called function; thus, the window consists of overlapping registers. See the figure below. Let's assume that using register windows causes these memory accesses to be replaced by register operations, so rather than accruing the CPI of a load or store for each parameter, each parameter accrues the CPI of an ALU operation. Assume we have the following CPI breakdown: loads/stores: 4; ALU and unconditional branches: 2; conditional branches: 3; procedure calls and returns: 15. Architects are trying to decide whether to use additional registers in a CPU for register windows or just for more registers in the register file. If we go with ordinary registers, the number of loads and stores is reduced by 40% and 30% respectively (because we can put more into registers). If we go with register windows, the procedure call/return CPI is reduced greatly: let's assume the CPI of a procedure call reduces to 4.5 and that of a return reduces to 3. Which should we use for a benchmark of: 40% loads, 13% stores, 31% ALU, 8% conditional branches, 2% unconditional branches, 3% procedure calls, 3% returns?

Answer: CPU time = IC * CPI * clock cycle time. The last value will not change between the two approaches. If we use register windows, CPI reduces; if we add more registers, IC reduces because of fewer loads and stores.

CPI_original = .40 * 4 + .13 * 4 + .31 * 2 + .08 * 3 + .02 * 2 + .03 * 15 + .03 * 15 = 3.92
CPI_regwindows = .40 * 4 + .13 * 4 + .31 * 2 + .08 * 3 + .02 * 2 + .03 * 4.5 + .03 * 3 = 3.245

We have to figure out the new breakdown of instructions if we have fewer loads and stores:
.40 * .40 = .16, so 16% fewer instructions from loads
.13 * .30 = .039, so 3.9% fewer instructions from stores
So there will be .16 + .039 = .199 fewer instructions; we now recompute the breakdown of instructions given an IC of 1.00 - .199 = .801:
Loads = (.40 - .16) / .801 = .300
Stores = (.13 - .039) / .801 = .114
ALU = .31 / .801 = .387
Conditional branches = .08 / .801 = .100
Unconditional branches = .02 / .801 = .025
Procedure calls = .03 / .801 = .037
Returns = .03 / .801 = .037
CPI_newregisters = .300 * 4 + .114 * 4 + .387 * 2 + .100 * 3 + .025 * 2 + .037 * 15 + .037 * 15 = 3.89
IC_newregisters = .801 * IC_original

CPU time_regwindows = IC_original * 3.245 * clock cycle time = 3.245 * IC_original * clock cycle time
CPU time_newregisters = IC_original * .801 * 3.89 * clock cycle time = 3.116 * IC_original * clock cycle time

The version using the additional registers as ordinary registers is faster, so the speedup of using them as ordinary registers instead of as register windows is 3.245 / 3.116 = 1.041, or a little over 4%.
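The renormalization step is the easiest part to get wrong, so here is a minimal Python sketch of the comparison (the dictionaries are my own encoding of the problem's numbers):

    mix = {'load': .40, 'store': .13, 'alu': .31, 'cbr': .08,
           'ubr': .02, 'call': .03, 'ret': .03}
    cpi = {'load': 4, 'store': 4, 'alu': 2, 'cbr': 3,
           'ubr': 2, 'call': 15, 'ret': 15}

    # Option 1: register windows change only the call/return CPI.
    cpi_win = dict(cpi, call=4.5, ret=3)
    time_win = sum(mix[k] * cpi_win[k] for k in mix)   # 3.245

    # Option 2: extra ordinary registers remove 40% of loads and
    # 30% of stores; relative CPU time = new IC fraction * new CPI,
    # which collapses to a weighted sum over the surviving mix.
    new = dict(mix, load=.40 * .60, store=.13 * .70)
    time_reg = sum(new[k] * cpi[k] for k in new)       # about 3.124
    print(time_win / time_reg)                         # about 1.04

Note that carrying full precision gives a relative time of 3.124 and a speedup of about 1.039; the 3.116 and 1.041 above come from rounding the renormalized fractions to three decimal places. Either way, the ordinary-register option wins by about 4%.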
4. In the 1980s and 1990s, architects debated whether the RISC or CISC approach was better. The list below gives some of the differences in philosophy between the two forms of architecture. For each of the following, explain how it would improve CPU time in terms of which term in our CPU time formula would be decreased: IC, CPI, clock cycle time, or some combination. NOTE: some of these may also increase a term, but you do not need to discuss what increases, only what decreases.
a. In RISC, there are a great number of registers available, less so in a CISC machine
b. In CISC, there can be complex addressing modes, such as indirect addressing to obtain the datum pointed to by a pointer
c. In RISC, a pipeline is used to perform each part of the fetch-execute cycle as an independent stage
d. In CISC, variable-sized instruction lengths are common so that multiple memory operands can be accessed by a single instruction

Answers:
a. With more registers, there is less need for loads and stores, so IC decreases. However, since CISC machines often have memory-register operations (such as add x, y, z), the actual impact is most felt in CPI: the add instruction in a RISC machine has a low CPI since its operands must already be in registers, whereas the CISC add instruction has a much higher CPI if it involves accessing memory one or more times per instruction.
b. The complex addressing modes allow memory accesses in single operations, whereas in a RISC architecture without complex addressing modes, something like indirect addressing takes multiple operations; therefore this feature lowers IC.
c. Since all operations are pipelined, their CPI is reduced to approximately 1; therefore the pipeline lowers CPI.
d. The variable-sized instruction length allows instructions to carry out multiple tasks, and therefore fewer instructions are needed, lowering IC.

5. Let's see what might happen if we add a register-memory ALU mode to MIPS. We could replace the two instructions

LW R1, 0(R2)
DADDU R3, R3, R1

with

DADDU R3, 0(R2)

So that the new instruction fits in the 32-bit instruction length format, we restrict it to be a two-operand instruction where the first operand is both a source and a destination register. Assume that to accommodate the memory fetch as part of this instruction, we increase the clock cycle time by 15%. Using the gcc benchmark (see figure A.27, p. A-41), what percentage of loads would have to be eliminated so that this new mode can execute gcc in the same amount of time?

Answer: We want CPU time_old = CPU time_new, where CPU time = IC * CPI * clock cycle time. We will assume that CPI does not change, and we know clock cycle time_new is 15% longer than clock cycle time_old. To balance out, we need IC_new * 1.15 = IC_old, so IC_new = IC_old / 1.15, or about 87% of the old IC, a reduction of about 13%. Since loads make up 25.1% of the total instructions, the eliminated loads must supply that 13%, so we have to eliminate .13 / .251 = .52, or about 52% of the loads.
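Numerically, with the 25.1% load frequency cited above (a sketch; variable names are mine):

    loads = 0.251            # fraction of gcc instructions that are loads
    penalty = 1.15           # new clock cycle is 15% longer

    # CPU time_old = CPU time_new with CPI unchanged:
    # IC_old * cct = IC_new * (1.15 * cct)  =>  IC_new = IC_old / 1.15
    removed = 1 - 1 / penalty      # about .13 of all instructions
    print(removed / loads)         # about .52: eliminate ~52% of the loads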
6. The autoincrement and autodecrement modes are common in CISC computers. These modes are used when accessing an array, by automatically incrementing or decrementing the register storing the offset. The change occurs after the access for the increment, and before the access for the decrement. Let's see what happens in some standard array code with the new mode:

for (i = 0; i < 1000; i++) a[i] = b[i] + c[i];

Assume that R1, R2 and R3 store the starting addresses of arrays a, b and c respectively, and that they are all int arrays. If we introduce an autoincrement instruction like LWI Rx, 0(Ry) in place of the LW instruction of MIPS, how will it impact performance? Below are the two versions of the code, without and with the autoincrement instructions. The CPI for our machine is as follows: 5 for loads/stores, 2 for ALU and 3 for branches. The autoincrement load/store also has a CPI of 5 but requires that we lengthen the clock cycle by 25%. Is the new mode worth pursuing?

     DADD R4, R0, R0      // R4 is the loop variable i
     DADDI R5, R0, #1000  // R5 = 1000
top: DSUB R6, R5, R4
     BEQZ R6, out         // exit for loop after 1000 iterations
     LW R7, 0(R2)         // R7 = b[i]
     LW R8, 0(R3)         // R8 = c[i]
     DADD R9, R7, R8      // R9 = b[i] + c[i]
     SW R9, 0(R1)         // a[i] = R9
     DADDI R1, R1, #4
     DADDI R2, R2, #4
     DADDI R3, R3, #4
     DADDI R4, R4, #1
     J top
out: ...

     DADD R4, R0, R0      // R4 is the loop variable i
     DADDI R5, R0, #1000  // R5 = 1000
top: DSUB R6, R5, R4
     BEQZ R6, out         // exit for loop after 1000 iterations
     LWI R7, 0(R2)        // R7 = b[i], then R2 = R2 + 4
     LWI R8, 0(R3)        // R8 = c[i], then R3 = R3 + 4
     DADD R9, R7, R8      // R9 = b[i] + c[i]
     SWI R9, 0(R1)        // a[i] = R9, then R1 = R1 + 4
     DADDI R4, R4, #1
     J top
out: ...

Answer: We compare the two CPU times, where CPU time = IC * CPI * clock cycle time. The original machine has a shorter clock cycle time, while the newer machine has a reduced IC * CPI because we can remove three of the DADDI instructions.

CPU time_original = IC * CPI * clock cycle time_original
CPU time_new = IC * CPI * clock cycle time_new

We compute IC * CPI (total clock cycles) as follows. The original code has 2 ALU operations outside of the loop plus a loop of 6 ALU, 2 branch and 3 load/store instructions per iteration, giving a total of IC * CPI = 2 * 2 + 1000 * (6 * 2 + 2 * 3 + 3 * 5) = 33,004 clock cycles. The new code has 2 ALU operations outside of the loop plus a loop of 3 ALU, 2 branch and 3 load/store-increment instructions per iteration, giving a total of IC * CPI = 2 * 2 + 1000 * (3 * 2 + 2 * 3 + 3 * 5) = 27,004 clock cycles.

Clock cycle time_new = clock cycle time_old * 1.25
CPU time_old = 33,004 * clock cycle time_old
CPU time_new = 27,004 * clock cycle time_new = 27,004 * clock cycle time_old * 1.25
Speedup = CPU time_old / CPU time_new = 33,004 / (27,004 * 1.25) = 0.978, so we see a slowdown, not a speedup.
7. As an alternative to #6, let's assume that the clock speed does not change, but that the CPI for LWI and SWI is 6. Is the change worth it?

Answer: Here, clock cycle time does not change, so we only have to compare IC * CPI for the two machines. The old machine's IC * CPI does not change. The new machine has IC * CPI = 2 * 2 + 1000 * (3 * 2 + 2 * 3 + 3 * 6) = 30,004. Since this is a reduction, the new mode would be worth it in this case. The speedup is 33,004 / 30,004 = 1.10.
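Both problems reduce to counting cycles, so one short Python sketch (a hand-built cycle count under the stated CPIs, with function and parameter names of my own) replays #6 and #7:

    def cycles(alu, branch, mem, cpi_mem, iters=1000, setup=4):
        # setup: the two ALU instructions before the loop (2 * CPI 2 = 4 cycles);
        # per iteration: ALU ops at CPI 2, branches at CPI 3, memory ops at cpi_mem
        return setup + iters * (alu * 2 + branch * 3 + mem * cpi_mem)

    base = cycles(alu=6, branch=2, mem=3, cpi_mem=5)          # 33,004 cycles
    auto = cycles(alu=3, branch=2, mem=3, cpi_mem=5)          # 27,004 cycles

    print(base / (auto * 1.25))                               # #6: 0.978, a slowdown
    print(base / cycles(alu=3, branch=2, mem=3, cpi_mem=6))   # #7: 1.10, a speedup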