Performance Measurement and Analysis in Computer

Quantitative Computer
[Figure: design cycle with the stages Measurement, Model, Innovation, Proposed Implementation, Analysis]

How to measure, analyze, and specify computer system performance
or
"My computer is faster than your computer!"

What is Performance?
  Execution Time? Throughput? Of what?
  What is relative performance?
  How is it specified?

How to measure Execution Time?
  % time program
  ... program results ...
  160.7u 19.9s 4:15 71%
  %

  Wall-clock time?
  User CPU time?
  User + kernel CPU time?
  Answer:
Relative Performance can be confusing
  A runs in 12 seconds
  B runs in 20 seconds
  A/B = 0.6, so A is 40% faster, or 1.4X faster, or B is 40% slower
  B/A = 1.67, so A is 67% faster, or 1.67X faster, or B is 67% slower
  Relative performance needs a precise definition.

Relative Performance, the Definition
  $\text{Speedup (of } x \text{ over } y) = \text{Relative Performance} = \frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution Time}_Y}{\text{Execution Time}_X} = n$
  We can remove all ambiguity by always constraining n to be > 1 => machine x is n times faster than y.

Examples
  Your program runs in 5 minutes on an Intel Xeon, but 2 minutes on a Core i7 processor. How much faster is the i7 processor?
  Another program runs in 10 minutes with the standard compiler, but when recompiled with a new compiler, the program runs in 9 minutes. How much faster is the newly compiled program (what is the speedup)?

How to Specify Performance, in summary
  Performance only has meaning in the context of a program or workload (MIPS, GFLOPS???).
  When talking about the performance of a single machine, we talk about response time or throughput.
  When talking about relative performance, we will say machine y has a speedup of n over machine x, based on the ratio of their execution times for a workload:
    speedup of 1.6
    1.6 times as fast
    60% speedup [correct, but more often misinterpreted]
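The speedup definition above can be checked with a few lines of Python. This is only a minimal sketch: the function names `speedup` and `describe` are made up for illustration, and the inputs are the execution times from the two example questions (in minutes).

```python
def speedup(time_x: float, time_y: float) -> float:
    """Speedup of x over y = Execution Time_Y / Execution Time_X."""
    if time_x <= 0 or time_y <= 0:
        raise ValueError("execution times must be positive")
    return time_y / time_x


def describe(name_a: str, time_a: float, name_b: str, time_b: float) -> str:
    """Report the speedup with the faster system first, so that n >= 1."""
    if time_a <= time_b:
        fast, slow, n = name_a, name_b, speedup(time_a, time_b)
    else:
        fast, slow, n = name_b, name_a, speedup(time_b, time_a)
    return f"{fast} has a speedup of {n:.2f} over {slow}"


if __name__ == "__main__":
    print(describe("Xeon", 5, "Core i7", 2))
    print(describe("standard compiler", 10, "new compiler", 9))
```

Ordering the ratio so that the slower time is always in the numerator keeps n above 1, which is the convention the definition asks for.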
But What Workload?
  Synthetic workloads: whetstone, dhrystone, ...
  Toy benchmarks: puzzle, quicksort, sieve, ...
  Kernels: livermore loops, linpack
  Real programs

  To maximize their efforts, architects will attempt to mirror the decision process of the market.
  When the market uses poor measurement methodology, we can get poor architectures!

SPEC: System Performance Evaluation Cooperative
  First Round, 1989: 10 programs yielding a single number
  Second Round, 1992: SpecInt92 (6 integer programs) and SpecFP92 (14 floating point programs). Compiler flags unlimited.
  Third Round, 1995: Single flag setting for all programs; new set of programs
  Fourth Round, 2000: More complex programs, larger data sets
  Fifth Round, 2006: Longer running time, some larger data sets, more application areas

  SPEC combines real programs with enforced measurement standards.

SPEC First Round
  One program: 99% of time in a single line of code
  New front-end compiler could improve dramatically
  [Figure: bar chart, y-axis from 0 to 800, x-axis "Benchmark": gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]
How to Summarize Performance
  Real workloads typically involve multiple programs, and thus, multiple results.
  Popular benchmarks (e.g., SPEC, livermore loops, ...) involve multiple programs.
  Everyone wants to summarize results with a single number.
  But the summarized result can be dramatically skewed by the method used to combine them.

How to Summarize Performance

               Computer A   Computer B   Computer C
  Program 1             1           10           20
  Program 2          1000          100           20
  Total time         1001          110           40

  Which machine is fastest?

How to Summarize Performance
  Arithmetic Mean: $\frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$
  Weighted Arithmetic Mean: $\sum_{i=1}^{n} \text{Time}_i \times \text{Weight}_i$, where the sum of the weights is 1.
  Geometric Mean: $\sqrt[n]{\prod_{i=1}^{n} \text{ExecutionTimeRatio}_i}$, where $\text{ExecutionTimeRatio}_i = \frac{\text{Execution Time}_i}{\text{Execution Time}_{\text{base}}}$
  Harmonic Mean: $\frac{n}{\sum_{i=1}^{n} \frac{1}{\text{Rate}_i}}$

Summarizing Performance

                  A       B      C    W(1)   W(2)   W(3)
  Program 1       1      10     20    .5     .909   .999
  Program 2    1000     100     20    .5     .091   .001
  AM/W(1)     500.5      55     20
  AM/W(2)     91.82   18.18     20
  AM/W(3)         2   10.09     20
  GM           31.6    31.6     20

  Which machine is fastest now?
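The summary table can be reproduced with a short Python sketch. The helper names are illustrative only. The W(2) weights are assumed to be exactly 10/11 and 1/11, which the table rounds to .909 and .091; that assumption is what yields the 91.82 and 18.18 entries.

```python
import math

# Execution times in seconds, ordered as [Program 1, Program 2].
TIMES = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}

W1 = [0.5, 0.5]          # equal weights
W2 = [10 / 11, 1 / 11]   # assumption: exact values behind .909 and .091


def arithmetic_mean(times):
    return sum(times) / len(times)


def weighted_arithmetic_mean(times, weights):
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(t * w for t, w in zip(times, weights))


def geometric_mean(times):
    return math.prod(times) ** (1 / len(times))


if __name__ == "__main__":
    for machine, times in TIMES.items():
        print(
            machine,
            f"AM/W(1)={weighted_arithmetic_mean(times, W1):.2f}",
            f"AM/W(2)={weighted_arithmetic_mean(times, W2):.2f}",
            f"GM={geometric_mean(times):.1f}",
        )
```

Changing the weights changes which machine looks fastest, while the geometric mean rates A and B as equal even though their total times differ by roughly a factor of nine.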
Summarizing Performance
  Even the unweighted arithmetic mean implies a weighting.
  The geometric mean does not necessarily predict execution time for any mix of the programs.
  Ratios of geometric means never change (regardless of which machine is used as the base), and always give equal weight to all benchmarks.
  To give unequal weight requires a weighted arithmetic mean.

Analyzing Performance
  That was all about measuring performance.
  What tools do we use to analyze (predict) performance in the absence of something to measure?
    models, equations, queueing theory, mean value analysis, instruction-level simulation, gate-level simulation, ...
  [Figure: design cycle with the stages Measurement, Model, Innovation, Proposed Implementation, Analysis]

Speedup (due to architectural change)
  Speedup is just relative performance on the same machine with something changed. From before, then:
  $\text{speedup} = \text{relative performance} = \frac{\text{ET for entire task without change}}{\text{ET for entire task with change}}$
  Suppose the change only affects part of execution time.

Amdahl's Law
  The impact of a performance improvement is limited by the percent of execution time affected by the improvement.
  $\text{Execution Time after improvement} = \frac{\text{Execution Time Affected}}{\text{Amount of Improvement}} + \text{Execution Time Unaffected}$
  Make the common case fast!!
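Amdahl's Law translates directly into code. The sketch below normalizes the original execution time to 1, so the affected part is a fraction between 0 and 1. The function name and the 90% figure in the usage loop are illustrative assumptions.

```python
def amdahl_speedup(fraction_affected: float, improvement: float) -> float:
    """Overall speedup when `fraction_affected` of the original execution
    time is made `improvement` times faster and the rest is unchanged."""
    if not 0.0 <= fraction_affected <= 1.0:
        raise ValueError("fraction_affected must be between 0 and 1")
    if improvement <= 0:
        raise ValueError("improvement must be positive")
    time_after = fraction_affected / improvement + (1.0 - fraction_affected)
    return 1.0 / time_after


if __name__ == "__main__":
    # Hypothetical workload: 90% of the time benefits from the change.
    for factor in (1, 2, 4, 1_000_000):
        print(f"improvement {factor:>9}x -> speedup {amdahl_speedup(0.9, factor):.2f}")
```

However large the improvement factor becomes, the speedup stays below 1 / (1 - fraction_affected), which is 10 in this example.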
Amdahl's Law and Massive Parallelism

  Time affected    Time unaffected    Speedup
  .9               .1                 1.0
  .45              .1                 1/.55 = 1.82
  .225             .1                 1/.325 = 3.08
  (toward 0)       .1                 < 10

Examples
  Program A runs for 30 seconds, but 5 seconds of that time is just waiting for memory. If we double the speed of the memory subsystem, what is the speedup?
  FP instructions account for 10% of the execution time of program B. Should we double the speed of the FP instructions, or speed up integer by 20%?
  How much do we need to speed up the memory to get a 20% improvement in program A?

What is Time? How many clock cycles?
  CPU Execution Time = CPU clock cycles * Clock cycle time = CPU clock cycles / Clock rate
  Every conventional processor has a clock with an associated clock cycle time or clock rate.
  Every program runs in an integral number of clock cycles.
  GHz = billions of cycles/second
  X GHz = 1/X nanoseconds cycle time

  Number of CPU cycles = Instructions executed * Average Clock Cycles per Instruction (CPI)
  or
  CPI = CPU clock cycles / Instruction count
All Together Now
  $\text{CPU Execution Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time}$
  Units: seconds = instructions * cycles/instruction * seconds/cycle

Examples
  4 GHz processor, program runs in 30 seconds, executing 40 billion instructions: CPI = ??
  If we reduce CPI to 2.4, ET = ??
  A new compiler reduces IC to 32 billion, but increases CPI to 2.6: good or bad?
  A 2 GHz Core i7 has a CPI of .9, a 4 GHz Core i7 has a CPI of 1.1 (why?): what's the speedup for that workload?

Who Affects Performance?
  CPU Execution Time = Instruction Count * CPI * Clock Cycle Time
  programmer
  compiler
  instruction-set architect
  machine architect
  hardware designer
  materials scientist/physicist/silicon engineer

What Affects Performance?
  CPU Execution Time = Instruction Count * CPI * Clock Cycle Time
  pipelining
  superpipelining
  cache
  from CISC to RISC
  superscalar
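The execution time equation and the 4 GHz examples above can be worked through in a few lines of Python. This is a sketch rather than a reference solution: the function names are illustrative, and clock rate (in Hz) is used in place of clock cycle time, since the two are reciprocals.

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU execution time in seconds: IC * CPI * cycle time = IC * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


def cpi_from_time(exec_time_s: float, instruction_count: float, clock_rate_hz: float) -> float:
    """Average cycles per instruction: total clock cycles / instruction count."""
    return exec_time_s * clock_rate_hz / instruction_count


if __name__ == "__main__":
    clock = 4e9  # 4 GHz
    print(f"CPI = {cpi_from_time(30, 40e9, clock):.1f}")
    print(f"ET with CPI 2.4 = {cpu_time(40e9, 2.4, clock):.1f} s")
    print(f"ET with IC 32 billion, CPI 2.6 = {cpu_time(32e9, 2.6, clock):.1f} s")

    # Same workload on both Core i7 parts, so the instruction count cancels out.
    ic = 1e9  # arbitrary placeholder instruction count
    ratio = cpu_time(ic, 0.9, 2e9) / cpu_time(ic, 1.1, 4e9)
    print(f"Speedup of the 4 GHz part over the 2 GHz part = {ratio:.2f}")
```

The compiler question shows why no single factor can be judged alone: the instruction count drops and the CPI rises, and only the product decides whether execution time improves.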
Key Points
  We need to be precise about how to specify performance.
  Performance is only meaningful in the context of a workload.
  Be careful how you summarize performance.
  Amdahl's law
  ET = IC * CPI * CT