CSEE W4824 Computer Architecture Fall 2012


Lecture 2: Performance Metrics and Quantitative Principles of Computer Design
Luca Carloni
Department of Computer Science
Columbia University in the City of New York

Announcements: CS Distinguished Lecture
- Wed, Oct. 12th, 11:00 am, Davis Auditorium
- "What Should a Well-informed Person Know about Computers?"
- Brian Kernighan (Princeton Univ.)
  - His book with Dennis Ritchie, the creator of the C programming language, is considered the bible of C
  - At Bell Labs he contributed to the development of Unix, working with the Unix creators K. Thompson and D. Ritchie
  - He is also a coauthor of the widely used AWK and AMPL programming languages, and of the EQN and PIC typesetting languages
  - In collaboration with Shen Lin he devised well-known heuristics for two important NP-complete optimization problems: graph partitioning and the travelling salesman problem

Computer Architects and Quantitative Approach
- Design ideas and trade-offs are tested by using tools to estimate the impact on performance, power and cost (an iterative process)
- analytical reasoning and fundamental design principles
  - equations for basic metrics: cost, performance, power
- simulations at various levels
  - system level, ISA, micro-architecture, memory, RTL, gate, circuit level
- benchmark programs representing typical workloads

How to Define Performance?
[Table: columns Airplane, Passenger Capacity, Cruising Range (miles), Cruising Speed (m.p.h.), Passenger Throughput (passengers x m.p.h.); rows Boeing, Boeing, Concorde, Douglas DC; only trailing throughput digits survive: ,750 / ,700 / ,200 / ,424]

Two Key Performance Metrics
- Time to run the task
  - execution time, response time, elapsed time, latency
- Tasks per time unit
  - execution rate, bandwidth, throughput
[Table: columns Airplane, DC to Paris, Speed, Passengers, Throughput (passengers x mph); row Boeing: hours value not preserved, 610 mph, throughput ending in ,700; row Concorde: 3 hours, 1350 mph, throughput ending in ,200]

Latency vs. Throughput
- Latency
  - real time necessary to complete a task
  - important when the focus is on a single task
    - a computer user who is working with a single application
    - a critical task of a real-time embedded system
- Throughput (aka Bandwidth)
  - number of tasks completed per unit of time
  - a metric independent of the exact number of executed tasks
  - important when the focus is on running many tasks
    - a manager of a large data-processing center is interested in the total amount of work done in a given time

Latency lags Bandwidth
- Bandwidth has outpaced latency across the main computer technologies
- There is an old network saying: "Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed: you can't bribe God." [Anonymous]

Latency and Throughput: The Classic 5-Stage Pipeline
- Pipelining increases the instruction throughput
  - number of instructions completed per unit of time
- but it does not reduce (in fact, it usually slightly increases) the execution time of an individual instruction
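The pipelining trade-off between latency and throughput can be illustrated with a small sketch. It assumes an ideal pipeline with no stalls and equal-length stages; the stage time, latch overhead, and instruction count are hypothetical values chosen for illustration, not figures from the lecture.

```python
def unpipelined_time(n_instructions, n_stages, stage_time_ns):
    """Total time when each instruction finishes all stages before the next starts."""
    return n_instructions * n_stages * stage_time_ns


def pipelined_time(n_instructions, n_stages, stage_time_ns, latch_overhead_ns=0.0):
    """Total time for an ideal pipeline (no stalls): fill the pipe once,
    then complete one instruction per clock cycle."""
    cycle_ns = stage_time_ns + latch_overhead_ns
    return (n_stages + n_instructions - 1) * cycle_ns


def instruction_latency(n_stages, stage_time_ns, latch_overhead_ns=0.0):
    """Time for a single instruction to traverse all stages."""
    return n_stages * (stage_time_ns + latch_overhead_ns)


# Hypothetical parameters: 5 stages of 1 ns each, 0.1 ns of latch overhead per stage.
N, STAGES, STAGE_NS, OVERHEAD_NS = 1000, 5, 1.0, 0.1

t_unpiped = unpipelined_time(N, STAGES, STAGE_NS)
t_piped = pipelined_time(N, STAGES, STAGE_NS, OVERHEAD_NS)

print(f"throughput gain: {t_unpiped / t_piped:.2f}x")
print(f"latency per instruction, unpipelined: {instruction_latency(STAGES, STAGE_NS):.2f} ns")
print(f"latency per instruction, pipelined:   {instruction_latency(STAGES, STAGE_NS, OVERHEAD_NS):.2f} ns")
```

Under these assumptions the program finishes several times sooner, while each individual instruction takes slightly longer because of the per-stage overhead.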

Performance Metrics
- Machine X is n times faster than machine Y:

$$ n = \frac{\mathrm{ExecutionTime}(Y)}{\mathrm{ExecutionTime}(X)} = \frac{\mathrm{Performance}(X)}{\mathrm{Performance}(Y)} $$

- Performance and execution time are reciprocal
  - to improve performance means to increase performance
  - to improve execution time means to decrease execution time
- Example
  - ExecutionTime(Y) = 4.8, ExecutionTime(X) = 3.6
  - n = 4.8 / 3.6 = 1.33, i.e. X is 33% faster than Y

Make the Common Case Fast
- the most important, pervasive, and simple principle of computer design
  - in making a design trade-off, favor the frequent case rather than the infrequent case
  - when determining how to allocate resources, favor the frequent event rather than the rare event
  - when optimizing the design of a module, target the average functional behavior
  - besides, the frequent case is often simpler
- Two questions follow:
  1. How to determine what the frequent case is?
  2. How to determine the amount of the possible performance gain in making the frequent case faster?
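The "n times faster" definition above reduces to a ratio of execution times. A minimal sketch, using the example numbers from the slide (the function name is an illustrative choice):

```python
def times_faster(execution_time_y, execution_time_x):
    """Return n such that machine X is n times faster than machine Y.

    Because performance is the reciprocal of execution time,
    n = time(Y) / time(X) = performance(X) / performance(Y).
    """
    if execution_time_x <= 0 or execution_time_y <= 0:
        raise ValueError("execution times must be positive")
    return execution_time_y / execution_time_x


n = times_faster(4.8, 3.6)
print(f"X is {n:.2f} times faster than Y ({(n - 1) * 100:.0f}% faster)")
```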

Simulation and Simulation Levels
- ISA (functional) simulator
  - executes the program and collects ISA-level statistics
  - e.g., frequency of instructions
- Memory simulator
  - the ISA simulator is run together with a model of the memory system
  - get cache hit/miss rates, study memory hierarchy options
- Full performance simulator
  - adds a detailed performance model to a functional simulator
  - models all interactions, stalls, (mis)speculations
  - generates accurate statistics

Simulation Tradeoffs
- ISA simulator
  - 10x slower than the real processor
  - [factor missing]x faster than a detailed performance simulator
- Key points
  - use the right level of simulation to answer a specific question
    - e.g., an ISA simulator to get instruction mix statistics
  - use fast, idealized models for non-critical components
    - e.g., assume a perfect main memory for applications that present an optimal cache hit ratio
  - simulation is a powerful tool for architectural exploration, but analytical reasoning should always be applied before starting long simulations

Benchmark Suites
- Sets of programs to simulate typical workloads
- Several types
  - real software applications (GCC, Word, ...)
    - most accurate but typically longer to process
    - portability problems (OS/compiler dependencies), GUI
  - kernels (Livermore Loops, Linpack, ...)
    - small, key pieces taken from real programs
    - limited picture, but good to isolate the performance of individual features of a machine
  - synthetic benchmarks (Whetstone, Dhrystone, ...)
    - try to match the average frequency of operations on operands of a real program
    - may easily mislead compiler and hardware designers

Amdahl's Law
- What is the overall speedup after improving a component x of a system?

$$ \mathrm{speedup} = \frac{\mathrm{originalExecutionTime}}{\mathrm{newExecutionTime}} = \frac{\mathrm{newPerformance}}{\mathrm{originalPerformance}} $$

- If component x is improved by a factor $S_x$ and component x affects a fraction $F_x$ of the overall execution time, then

$$ \mathrm{speedup} = \frac{1}{(1 - F_x) + \dfrac{F_x}{S_x}} $$

  - $(1 - F_x)$ is the original execution time of the unimproved part
  - $F_x / S_x$ is the new execution time of the improved part

Amdahl's Law - Example
- If we optimize the module for the floating-point instructions by a factor of 2, but the system will normally run programs with only 20% of floating-point instructions, then the speedup is only

$$ \mathrm{speedup} = \frac{1}{(1 - F_x) + \dfrac{F_x}{S_x}} = \frac{1}{(1 - 0.2) + \dfrac{0.2}{2}} = \frac{1}{0.9} \approx 1.11 $$

Amdahl's Law - Example
- If $S_x = 100$, what is the overall speedup as a function of $F_x$?
[Plot: Speedup vs. Optimized Fraction]
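Both Amdahl's Law examples above can be checked numerically. The following is a minimal sketch of the formula speedup = 1 / ((1 - Fx) + Fx / Sx); the function name and the sampled values of Fx are illustrative choices.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when a fraction of execution time is sped up by a factor.

    fraction: share of the original execution time affected (Fx, between 0 and 1)
    factor:   improvement of the affected part (Sx, greater than 0)
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Floating-point example: Fx = 0.2, Sx = 2.
print(f"FP example: {amdahl_speedup(0.2, 2):.2f}")

# Speedup as a function of the optimized fraction for Sx = 100.
for fx in (0.1, 0.5, 0.9, 0.99, 1.0):
    print(f"Fx = {fx:<4}  speedup = {amdahl_speedup(fx, 100):.2f}")
```

The sweep shows how slowly the overall speedup approaches 100: the optimized fraction must be very close to 1 before most of the improvement is visible.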

Amdahl's Law and the Law of Diminishing Returns
- the closer $F_x$ is to 1, the closer the overall speedup is to $S_x$
  - i.e. [make the common case fast]
- however, as $S_x \to \infty$, speedup $\to 1 / (1 - F_x)$
  - i.e., once $F_x / S_x$ is small with respect to $(1 - F_x)$, the price/performance ratio falls rapidly as $S_x$ is increased
- the incremental improvement in speedup gained by an additional improvement in the performance of just a portion of the computation diminishes as improvements are added

Amdahl's Law - Reference
- Gene Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities", AFIPS '67
- Amdahl's Law: special case of parallelization
  - if F is the fraction of a calculation that can be parallelized and (1 - F) is the fraction that is sequential (i.e. cannot benefit from parallelization), then Amdahl's Law gives the maximum speedup that can be achieved by using N processors as

$$ \mathrm{speedup} = \frac{1}{(1 - F) + \dfrac{F}{N}} $$

- Example
  - if F is only 90%, the calculation can be sped up by at most a factor of 10, no matter how many processors are used
  - the key to parallel computing is to increase F
  - but there is also Gustafson's Law
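The diminishing returns described above become visible when the processor count is swept for a fixed parallel fraction. This sketch uses the F = 90% example from the slide; the function names and the chosen processor counts are illustrative.

```python
def parallel_speedup(parallel_fraction, n_processors):
    """Amdahl's Law for parallelization: 1 / ((1 - F) + F / N)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_processors)


def speedup_limit(parallel_fraction):
    """Upper bound on speedup as N grows without bound: 1 / (1 - F)."""
    if parallel_fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - parallel_fraction)


F = 0.9
previous = 1.0
for n in (1, 2, 4, 8, 16, 64, 256, 1024):
    s = parallel_speedup(F, n)
    # The gain from each additional batch of processors shrinks as N grows.
    print(f"N = {n:>4}  speedup = {s:5.2f}  gain over previous row = {s - previous:4.2f}")
    previous = s

print(f"limit for F = {F}: {speedup_limit(F):.1f}")
```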

Principle of Locality
- Temporal Locality
  - a resource that is referenced at one point in time will be referenced again sometime in the near future
- Spatial Locality
  - the likelihood of referencing a resource is higher if a resource near it was just referenced
- 90/10 Locality Rule of Thumb
  - a program spends 90% of its execution time in only 10% of its code
  - hence, it is possible to predict with reasonable accuracy what instructions and data a program will use in the near future, based on its accesses in the recent past
  - this is a consequence of how we program and how we store data in memory

Principle of Locality - Example
- Cache memory
  - directly exploits temporal locality, providing faster access to a smaller subset of the main memory that contains copies of recently used data
  - but the data in the cache are not necessarily spatially close in the main memory
  - still, when a cache miss occurs, a fixed-size block of contiguous memory cells is retrieved from the main memory, based on the principle of spatial locality
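How a cache benefits from both kinds of locality can be sketched with a toy simulator. This is a simplified direct-mapped model, assumed here for illustration only; the cache geometry (64 lines of 16-byte blocks) and the access patterns are hypothetical.

```python
def hit_rate(addresses, n_lines=64, block_size=16):
    """Hit rate of a toy direct-mapped cache over a sequence of byte addresses.

    On a miss the whole block containing the address is fetched, so later
    accesses to neighboring addresses (spatial locality) or to the same
    address (temporal locality) can hit.
    """
    if not addresses:
        return 0.0
    resident = [None] * n_lines          # block number held by each cache line
    hits = 0
    for addr in addresses:
        block = addr // block_size       # which memory block the address falls in
        line = block % n_lines           # the single line this block may occupy
        if resident[line] == block:
            hits += 1
        else:
            resident[line] = block       # miss: replace the line with the new block
    return hits / len(addresses)


sequential = list(range(4096))                  # spatial locality: neighbors share a block
strided = list(range(0, 4096 * 16, 16))         # one access per block, never reused
small_loop = list(range(0, 256, 16)) * 10       # temporal locality: same 16 blocks revisited

print(f"sequential: {hit_rate(sequential):.4f}")
print(f"strided:    {hit_rate(strided):.4f}")
print(f"small loop: {hit_rate(small_loop):.4f}")
```

The sequential scan hits on 15 of every 16 accesses because each miss brings in a full block, the strided scan never hits, and the small loop misses only on its first pass.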

CPU Time
- CPU Time
  - user CPU Time: spent in the user program
  - system CPU Time: spent in the OS performing tasks required by the program
    - harder to measure and to compare across architectures
- CPU performance = user CPU time on an unloaded system
- CPU Time = (Clock Cycles for a Program) x (Clock Cycle Time) = (Clock Cycles for a Program) / (Clock Frequency)
  - most computers run with a single clock signal (strictly synchronous design) whose discrete time events are called cycles, periods, or ticks
  - a µP with a 1 ns clock period runs at a clock frequency of 1 GHz

CPU Time: Three Main Factors
- CPU Time = (Clock Cycles for a Program) x CCT
- IC = instruction count
  - number of instructions executed for a program
- CPI = clock cycles per instruction = (Clock Cycles for a Program) / IC
  - average number of clock cycles per instruction of a program
  - its reciprocal is IPC = instructions per clock cycle
- CPU Time = IC x CPI x CCT
  - CPU Time depends equally on these three factors
  - a 10% improvement in any of these leads to a 10% improvement in CPU time
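The three-factor CPU time equation above can be exercised directly. In this sketch the instruction count and CPI are hypothetical values; only the 1 ns / 1 GHz clock pairing comes from the slide.

```python
def cpu_time_seconds(instruction_count, cpi, clock_cycle_time_s):
    """CPU Time = IC x CPI x CCT."""
    return instruction_count * cpi * clock_cycle_time_s


def cpu_time_from_frequency(instruction_count, cpi, clock_hz):
    """Same quantity expressed with clock frequency: IC x CPI / f."""
    return instruction_count * cpi / clock_hz


# Hypothetical program: one billion instructions at an average CPI of 2.0.
IC, CPI, CCT = 1_000_000_000, 2.0, 1e-9   # 1 ns clock period = 1 GHz

baseline = cpu_time_seconds(IC, CPI, CCT)
better_cpi = cpu_time_seconds(IC, CPI * 0.9, CCT)   # 10% lower CPI

print(f"baseline CPU time:      {baseline:.2f} s")
print(f"with 10% lower CPI:     {better_cpi:.2f} s")
print(f"via frequency (1 GHz):  {cpu_time_from_frequency(IC, CPI, 1e9):.2f} s")
```

Scaling any one of the three factors by 0.9 scales the product by 0.9, which is the sense in which CPU time depends equally on all three.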

CPU Time - Dependencies
- CPU Time = IC x CPI x CCT
[Table: rows IC, CPI, CCT; columns Program, Compiler, ISA, HW organization, HW technology; the dependency marks are not preserved]
- some interdependencies, but many techniques improve a single factor

Improving Performance by Exploiting Parallelism
- at the system level
  - use multiple processors, multiple disks
  - scalability is key to adaptively distribute workload in server apps
- at the single microprocessor level
  - exploit instruction level parallelism (ILP)
  - e.g., pipelining overlaps the execution of instructions to reduce the overall program CPU Time
    - reduces CPI by overlapping instructions in time
    - possible because many subsequent instructions are independent
  - e.g., parallel computation
    - reduces CPI by overlapping instructions in space
    - duplicates hardware modules such as ALUs
- at the circuit level
  - carry-lookahead adders speed up sums from linear to logarithmic

CPU Time broken down per instruction
- CPU Time = IC x CPI x CCT

$$ \mathrm{CPU\ Time} = \Big( \sum_i IC_i \times CPI_i \Big) \times CCT $$

$$ CPI = \frac{\sum_i (IC_i \times CPI_i)}{IC} = \sum_i (IF_i \times CPI_i), \qquad IF_i = \frac{IC_i}{IC} $$

- frequent instructions have larger contributions to CPI
- CPI should be measured to include pipeline/memory effects
  - it is not sufficient to calculate it from the reference manual table
- NOTE: it is ok to compare two designs based only on CPI (or IPC) only if IC and CCT are the same!

Example: Average Instruction Execution Time
- Assuming a simple un-pipelined processor with CCT = 2 ns
[Table: columns Operation, IF_i, CPI_i, IF_i x CPI_i, (% Time); rows ALU, Load, Store, Branch; cell values not preserved]
- $CPI = \sum_i (IF_i \times CPI_i) = 4.3$
- Average instruction execution time = CPI x CCT = 4.3 x 2 ns = 8.6 ns
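The weighted-CPI calculation above can be sketched as follows. The per-class values in the slide's table did not survive, so the instruction mix below is entirely hypothetical and does not reproduce the slide's CPI of 4.3; only the 2 ns clock cycle time is taken from the example.

```python
def average_cpi(mix):
    """CPI = sum over classes of IF_i x CPI_i.

    mix maps an instruction class name to (frequency, cycles per instruction).
    """
    total_frequency = sum(freq for freq, _ in mix.values())
    if abs(total_frequency - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    return sum(freq * cpi for freq, cpi in mix.values())


def time_share(mix):
    """Fraction of execution time spent in each class: IF_i x CPI_i / CPI."""
    cpi = average_cpi(mix)
    return {name: freq * cycles / cpi for name, (freq, cycles) in mix.items()}


# Hypothetical mix (NOT the values from the lecture's table).
hypothetical_mix = {
    "ALU":    (0.40, 4),
    "Load":   (0.30, 5),
    "Store":  (0.10, 4),
    "Branch": (0.20, 3),
}
CCT_NS = 2.0  # clock cycle time from the slide's example

cpi = average_cpi(hypothetical_mix)
print(f"CPI = {cpi:.2f}")
print(f"average instruction time = {cpi * CCT_NS:.2f} ns")
for name, share in time_share(hypothetical_mix).items():
    print(f"{name:<6} {share * 100:5.1f}% of time")
```

In this mix, loads are 30% of the instructions but account for a larger share of the time because their CPI is above average.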

Example: Speedup From 5-stage Pipelining
- Assumption
  - after pipelining, the slowest stage forces an effective clock period equal to (CCT + clock overhead) = (2 + 0.2) ns = 2.2 ns
- Question
  - What is the speedup from pipelining?

$$ \mathrm{speedup} = \frac{(\text{Average Instruction Time})_{\text{unpipelined}}}{(\text{Average Instruction Time})_{\text{pipelined}}} = \frac{8.6}{2.2} \approx 3.9 $$

Another Key Metric: Power Dissipation
- Energy
  - measured in Joules
- Power
  - rate of energy consumption [Watts = Joules/sec]
  - instantaneous power P = V x I: the voltage drop across a component times the current flowing through it
- Example
  - system A: higher peak power, lower total energy
  - system B: lower peak power, higher total energy
[Figure: power over time for systems A and B. Source: K. Asanovic, MIT]
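The system A versus system B comparison above rests on energy being power integrated over time. A minimal sketch with piecewise-constant power traces; the traces are hypothetical numbers chosen to reproduce the qualitative relationship, not data from the figure.

```python
def total_energy_joules(trace):
    """Energy = sum of power x duration over a piecewise-constant power trace.

    trace is a list of (duration_seconds, power_watts) segments.
    """
    return sum(duration_s * power_w for duration_s, power_w in trace)


def peak_power_watts(trace):
    """Highest power level reached in the trace."""
    return max(power_w for _, power_w in trace)


# Hypothetical traces: A finishes quickly with a short burst, B runs long at low power.
system_a = [(1.0, 2.0), (2.0, 10.0), (1.0, 2.0)]
system_b = [(8.0, 4.0)]

for name, trace in (("A", system_a), ("B", system_b)):
    print(f"system {name}: peak = {peak_power_watts(trace):.1f} W, "
          f"energy = {total_energy_joules(trace):.1f} J")
```

Peak power drives cooling and power-delivery design, while total energy drives battery life and operating cost, so the two systems can rank differently depending on which metric matters.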

Power Consumption of CMOS Transistors
- Dynamic Power
  - traditionally the dominant component
  - dissipated when a transistor switches (i.e. data dependent)
- Static Power
  - becoming more important with transistor scaling
  - due to leakage current that flows even if there is no switching activity
  - proportional to the number of transistors on the chip
- Challenges
  - power is the key limitation to chip design
  - distribute power on-chip
  - remove heat
  - prevent hot spots
  - low power design (clock gating, DVFS)

Example: Dynamic Power Consumption
- Assume a 0.25 µm CMOS chip with a voltage supply Vdd = 2.5 V, a clock frequency F = 500 MHz, and an average load capacitance of CL = 15 fF/gate (assuming a fan-out of 4)
- What is the power consumption per gate?
  - Approximately, Pavg = CL x Vdd^2 x F ≈ 50 µW
- For a design with 1 million gates, assuming that a transition occurs at every clock edge, this would result in an average power consumption of ~50 W!
- In reality, not all gates on the chip switch at the full rate of 500 MHz. The actual activity is substantially lower and it is estimated by the switching capacitance
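The per-gate figure above follows from the standard dynamic power relation P = CL x Vdd^2 x F. The sketch below recomputes it with the slide's parameters; the `activity` argument is an added assumption (a simple scaling factor for gates that switch less often than every cycle) and is not a quantity given in the lecture.

```python
def dynamic_power_watts(load_capacitance_f, vdd_v, frequency_hz, activity=1.0):
    """Dynamic power of one gate: activity x CL x Vdd^2 x F.

    activity = 1.0 models a transition at every clock edge; smaller values
    are a hypothetical way to model gates that switch less often.
    """
    return activity * load_capacitance_f * vdd_v ** 2 * frequency_hz


CL = 15e-15      # 15 fF per gate
VDD = 2.5        # volts
FREQ = 500e6     # 500 MHz
GATES = 1_000_000

per_gate = dynamic_power_watts(CL, VDD, FREQ)
print(f"per gate:           {per_gate * 1e6:.1f} uW")
print(f"1M gates, all on:   {per_gate * GATES:.1f} W")
print(f"1M gates, 10% act.: {dynamic_power_watts(CL, VDD, FREQ, activity=0.1) * GATES:.1f} W")
```

The exact product is slightly below the rounded 50 µW and 50 W quoted in the example, and the last line shows how strongly the chip-level figure depends on the assumed switching activity.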

Dynamic Voltage Frequency Scaling
- DVFS is a low-power design technique that is becoming pervasive in modern processors
- Example: if the voltage and frequency of a processing core are both reduced by 15%, what would be the impact on dynamic power?

$$ \frac{P_{new}}{P_{old}} = \frac{C \times (V \times 0.85)^2 \times (F \times 0.85)}{C \times V^2 \times F} = 0.85^3 \approx 0.61 $$

- Dynamic power drops to about 61% of its original value, a saving of roughly 39%; equivalently, $P_{old} / P_{new} \approx 1.63$, so the new operating point is roughly 63% more power efficient

Assigned Readings
- Computer Architecture: A Quantitative Approach, Fifth Edition
  - John Hennessy (Stanford University) and Dave Patterson (UC Berkeley)
  - Morgan Kaufmann (Elsevier)
- Read Sections [section numbers not preserved]