Notes on Calculating Computer Performance

Bruce Jacob and Trevor Mudge
Advanced Computer Architecture Lab
EECS Department, University of Michigan
{blj,tnm}@umich.edu

Abstract

This report explains what it means to characterize the performance of a computer, and which methods are appropriate and inappropriate for the task. The most widely used metric is the performance on the SPEC benchmark suite of programs; currently, the results of running the SPEC benchmark suite are compiled into a single number using the geometric mean. The primary reason for using the geometric mean is that it preserves values across normalization, but unfortunately, it does not preserve total run time, which is probably the figure of greatest interest when performances are being compared. Cycles per Instruction (CPI) is another widely used metric, but this method is invalid, even if comparing machines with identical clock speeds. Comparing CPI values to judge performance falls prey to the same problems as averaging normalized values. In general, normalized values must not be averaged and instead of the geometric mean, either the harmonic or the arithmetic mean is the appropriate method for averaging a set of running times. The arithmetic mean should be used to average times, and the harmonic mean should be used to average rates (1/time). A number of published SPECmarks are recomputed using these means to demonstrate the effect of choosing a favorable algorithm.

1.0 Performance and the Use of Means

We want to summarize the performance of a computer; the easiest way uses a single number that can be compared against the numbers of other machines. This typically involves running tests on the machine and taking some sort of mean; the mean of a set of numbers is the central value when the set represents fluctuations about that value. There are a number of different ways to define a mean value; among them the arithmetic mean, the geometric mean, and the harmonic mean.
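The arithmetic mean is defined as follows:

\[ \mathrm{ArithmeticMean}(a_1, a_2, a_3, \ldots, a_N) = \frac{1}{N} \sum_{i=1}^{N} a_i \]

The geometric mean is defined as follows:

\[ \mathrm{GeometricMean}(a_1, a_2, a_3, \ldots, a_N) = \sqrt[N]{\prod_{i=1}^{N} a_i} \]

The harmonic mean is defined as follows:

\[ \mathrm{HarmonicMean}(a_1, a_2, a_3, \ldots, a_N) = \frac{N}{\sum_{i=1}^{N} \frac{1}{a_i}} \]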
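The three definitions can be written out directly. The following is a minimal Python sketch, not code from the report; the function names and the sample values 2 and 8 are chosen here purely for illustration.

```python
from fractions import Fraction
from math import prod


def arithmetic_mean(values):
    """Sum of the values divided by their count."""
    values = list(values)
    return sum(values) / len(values)


def geometric_mean(values):
    """N-th root of the product of the N values (returned as a float)."""
    values = list(values)
    return float(prod(values)) ** (1.0 / len(values))


def harmonic_mean(values):
    """Count of the values divided by the sum of their reciprocals."""
    values = list(values)
    return len(values) / sum(1 / v for v in values)


# Illustrative sample only; Fractions keep the arithmetic exact.
sample = [Fraction(2), Fraction(8)]
print(arithmetic_mean(sample))  # 5
print(geometric_mean(sample))   # 4.0
print(harmonic_mean(sample))    # 16/5
```

For positive values that are not all equal, the harmonic mean is the smallest of the three and the arithmetic mean the largest, as the sample shows.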

In the mathematical sense, the geometric mean of a set of n values is the length of one side of an n-dimensional cube having the same volume as an n-dimensional rectangle whose sides are given by the n values. As this is neither intuitive nor informative, the wisdom of using the geometric mean for anything is questionable [1]. Its only apparent advantage is that it is unaffected by normalization: whether you normalize by a set of weights first or by the geometric mean of the weights afterward, the result is the same. This property has been used to suggest that the geometric mean is superior, since it produces the same results when comparing several computers irrespective of which computer's times are used as the normalization factor [Fleming86]. However, the argument was rebutted in [Smith88], where the meaninglessness of the geometric mean was first illustrated.

[1] Compare this to just one physical interpretation of the arithmetic mean: finding the center of gravity in a set of objects (possibly having different weights) placed along a see-saw. There are countless other interpretations which are just as intuitive and meaningful.

In this report, we will consider only the arithmetic and harmonic means. Since the two are inverses of each other,

\[ \mathrm{ArithmeticMean}(a_1, a_2, a_3, \ldots) = \frac{1}{\mathrm{HarmonicMean}\left(\frac{1}{a_1}, \frac{1}{a_2}, \frac{1}{a_3}, \ldots\right)} \]

and since the arithmetic mean (the average) is more easily visualized than the harmonic mean, we will stick to the average from now on, relating it back to the harmonic mean when appropriate.

UNITS AND MEANS: An Example

We begin with a simple illustrative example of what can go wrong when we try to summarize performance. Rather than demonstrate incorrectness, the intent is to confuse the issue by hinting at the subtle interactions of units and means. A machine is timed running two benchmark tests and receives the following scores:

    test1: 3 sec (most machines run it in 12 seconds)
    test2: 300 sec (most machines run it in 600 seconds)

How fast is the machine? Let us look at different ways of calculating performance:

Method 1: one way of looking at this is by the ratios of the running times:

    test1: 3/12    test2: 300/600

The machine's performance on test 1 is four times faster than an average machine, its performance on test 2 is twice as fast as average, therefore our machine is (on average) three times as fast as most machines.

Method 2: another way of looking at this is by the ratios of the running times:

    test1: 3/12    test2: 300/600

The machine's running time on test 1 is 1/4 the time it takes most machines, its running time on test 2 is 1/2 the time it takes most machines, so our machine (on average) takes 3/8 the time a typical machine does to run a program, or, put another way, our machine is 8/3 (2.67) times as fast as the average machine.

Method 3: yet another way of looking at this is by the ratios of the running times:

    test1: 3/12    test2: 300/600

The machine ran the benchmarks in a total of 303 seconds, the average machine runs the benchmarks in 612 seconds, therefore our machine takes 0.495 the amount of time to run the benchmarks as most machines do, and so is roughly twice as fast as the typical machine (on average).
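The first three methods can be reproduced from the two example scores. This is an illustrative Python sketch only; the variable names are invented here and the report itself gives no code.

```python
from fractions import Fraction

ours = [Fraction(3), Fraction(300)]      # measured times in seconds
typical = [Fraction(12), Fraction(600)]  # times on "most machines"

# Method 1: arithmetic mean of the speedups (typical time / our time).
speedups = [t / o for o, t in zip(ours, typical)]
method1 = sum(speedups) / len(speedups)

# Method 2: arithmetic mean of the time fractions (our time / typical
# time), inverted to express it as "times as fast".
time_fractions = [o / t for o, t in zip(ours, typical)]
method2 = 1 / (sum(time_fractions) / len(time_fractions))

# Method 3: ratio of the total running times.
method3 = sum(typical) / sum(ours)

print(method1)         # 3
print(method2)         # 8/3
print(float(method3))  # about 2.02
```

Each value answers a different question about the same four measurements, which is why the three disagree.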

Method 4: and then you can always look at the ratios of the running times...

How can these calculations seem reasonable and yet produce completely different results? The answer is that they seem reasonable because they are reasonable; they all give perfectly accurate answers, just not to the same question. Like in many other areas, the answers are not hard to come by; the difficult part is in asking the right questions.

2.0 The Semantics of Means

In general, there are a number of possibilities for finding the performance, given a set of experimental times and a set of reference times. One can take the average of the raw times, the raw rates (inverse of time) [1], the ratios of the times (experimental time over reference), or the ratios of the rates (reference time over experimental). Each option represents a different question and as such gives a different answer; each has a different meaning as well as a different set of implications. An average need not be meaningless, but it may be if the implications are not true. If one understands the implications of averaging rates, times, and their ratios, then one is less apt to wind up with meaningless information.

THE SEMANTICS OF TIME, RATE, AND RATIO

Remember the correspondence between the arithmetic and harmonic means, where each rate is the inverse of the corresponding time:

\[ \mathrm{ArithmeticMean}(\mathit{times}) = \frac{1}{\mathrm{HarmonicMean}(\mathit{rates})} \]

\[ \mathrm{ArithmeticMean}(\mathit{rates}) = \frac{1}{\mathrm{HarmonicMean}(\mathit{times})} \]

The Semantics of Time

A set of times is a collection of numbers representing Time Taken per Unit Somethings Accomplished. The information contained in their arithmetic mean is therefore On Average, How Much Time is Taken per Unit Somethings Accomplished; the average amount of time it takes to accomplish a prototypical task. On Average in this case is defined across Somethings and not Time. For example, a book is read in two hours, another in four; the average is 3 hours per book.
If books similar to these are read continuously one after another and the reader's progress is sampled in time (say once every minute) then the value of 4 hrs/book will come up twice as often as the value of 2 hrs/book, giving an incorrect average of 10/3 hours per book. However, if the reading time is sampled per book (say once every book), the average will come out correctly.

Time is what we are concerned with in comparing the performance of computers. Though it is just as important a measure of performance, we are not concerned with throughput since juggling both would confuse the point. In this paper we want to know how long it takes to perform a task, rather than how many tasks the machine can perform per unit time. If the set of times is taken from representative programs, then the average will be an accurate predictor of how long a typical program would take, and thus indicate the machine's performance.

[1] We will use the word rate to describe a unit where time is in the denominator despite what may be in the numerator (unless it is also time, in which case the unit is a pure number). Time and rate are related in that the arithmetic mean of one is the inverse of the harmonic mean of the other.

The Semantics of Rate

A set of rates is in units of Somethings Accomplished per Unit Time, and the information contained in their arithmetic mean is then On Average, How Many Somethings You Can Expect to Accomplish per Unit Time. Here, the average is defined across Time and not Somethings; if you intend to take the arithmetic mean of a set of rates, the rates should represent instantaneous measures taken in Time and should NOT represent measurements taken for every Something Accomplished.

Take the above book example; if we try to average 1/2 book per hour and 1/4 book per hour (the values obtained if we sample over books), we obtain a measurement of 3/8 books per hour; what good is this information? It cannot be combined with the number of books we read to produce how long it should have taken (it took 6 hours, not 16/3 hours). This confusion arises because of an incorrect use of the arithmetic mean. If, however, we sample the reading rate at periodic points in time, we find that there will be twice as many values of 1/4 book per hour as 1/2 book per hour, which will give us (1/4 + 1/4 + 1/2, divided by 3) an average rate of 1/3 book per hour, corresponding nicely with reality.

When measuring computers, we are generally presented with a set of values taken per task completed (a set of benchmark results, each the time taken to perform one of several tests), not a set of instantaneous measurements of progress, sampled every unit of time. Therefore, in general, finding the arithmetic mean of a set of rates is not a good idea, as it will lead to erroneous and misleading results. Use the harmonic mean instead.

The Semantics of Ratios

Computer performance is often represented by a ratio of rates or times. It is a unitless number, and when the reference time is in the numerator (as in a ratio of rates) the measurement means how much faster one thing is than another. When the reference time is in the denominator (as in a ratio of times) the measurement means what fraction of time the machine in question takes to perform a task, relative to the reference machine.

What does it mean to average a set of unitless ratios? The arithmetic mean of a set of ratios is a weighted average where the weights happen to be the running times of the reference machine. What information is contained in this value?
If the reference times are thought of as the expected amount of time for each benchmark, the weighting might ensure that no benchmark result counts more than any other, and the arithmetic mean would then represent what proportion of the expected time the average benchmark takes.

3.0 Problems with Normalization

Problems arise if we take the average of a set of normalized numbers. The following examples demonstrate the errors that occur. The first example compares the performance of two machines, using a third as a benchmark. The second example extends the first to show the error in using CPI values to compare performance.

EXAMPLE I: Average Normalized by Reference Times

There are two machines, A and B, and a reference machine. There are two tests, T1 and T2, and we obtain the following scores for the machines:

    Scenario I     Test T1    Test T2
    Machine A:     10 sec     100 sec
    Machine B:     1 sec      1000 sec
    Reference:     1 sec      100 sec

In scenario I, the performance of machine A relative to the reference machine is 0.1 on test T1 and 1 on test T2. The performance of machine B relative to the reference machine is 1 on test T1 and 0.1 on test T2. Since time is in the denominator (the reference is in the numerator), we are averaging rates, therefore we use the harmonic mean. The fact that the reference value is also in units of time is irrelevant; the time measurement we are concerned with is in the denominator, thus we are averaging rates.
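The normalization step described for Scenario I can be sketched as follows. This is a hedged illustration in Python; `relative_performance` and `harmonic_mean` are helper names invented here, not anything defined by the report.

```python
from fractions import Fraction


def harmonic_mean(values):
    """Count of the values divided by the sum of their reciprocals."""
    values = list(values)
    return len(values) / sum(1 / v for v in values)


def relative_performance(machine_times, reference_times):
    """Per-test ratio of reference time to machine time (a ratio of rates)."""
    return [Fraction(r, m) for m, r in zip(machine_times, reference_times)]


# Scenario I running times in seconds, for tests T1 and T2.
machine_a = [10, 100]
machine_b = [1, 1000]
reference = [1, 100]

perf_a = relative_performance(machine_a, reference)  # 1/10 and 1
perf_b = relative_performance(machine_b, reference)  # 1 and 1/10
print(harmonic_mean(perf_a), harmonic_mean(perf_b))  # 2/11 2/11
```

Because `reference` is an ordinary argument, the same two machines can be summarized against other reference times without changing their measured running times.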

The performance results of Scenario I:

    Scenario I     Harmonic Mean
    Machine A:     HMean(0.1, 1) = 2/11
    Machine B:     HMean(1, 0.1) = 2/11

The two machines perform equally well. This makes intuitive sense; on one test machine A was ten times faster, on the other test machine B was ten times faster. Therefore they should be of equal performance. As it turns out, this line of reasoning is erroneous. Let us consider scenario II, where the only thing that has changed is the reference machine's times (from 100 seconds on test T2 to 10 seconds):

    Scenario II    Test T1    Test T2
    Machine A:     10 sec     100 sec
    Machine B:     1 sec      1000 sec
    Reference:     1 sec      10 sec

Here, the performance numbers for A relative to the reference machine are 1/10 and 1/10, the performance numbers for B are 1 and 1/100, and these are the results:

    Scenario II    Harmonic Mean
    Machine A:     HMean(0.1, 0.1) = 1/10
    Machine B:     HMean(1, 0.01) = 2/101

According to this, machine A performs about 5 times better than machine B. And if we try yet another scenario changing only the reference machine's performance on test T2, we obtain the result that machine A performs worse than machine B.

    Scenario III   Test T1    Test T2     Harmonic Mean
    Machine A:     10 sec     100 sec     HMean(0.1, 10) = 20/101
    Machine B:     1 sec      1000 sec    HMean(1, 1) = 1
    Reference:     1 sec      1000 sec

The lesson: do not average test results which have been normalized.

EXAMPLE II: Average Normalized by Number of Operations

The example extends even further; what if the numbers were not a set of normalized running times but CPI measurements? Taking the average of a set of CPI values should not be susceptible to this kind of error, because the numbers are not unitless; they are not the ratio of the running times of two arbitrary machines. Let us test this theory. Let us take the average of a set of CPI values, in three scenarios. The units are cycles per instruction, and since the time-related portion (cycles) is in the numerator, we will be able to use the arithmetic mean.
The following are the three scenarios, where the only difference between each scenario is the number of instructions performed in Test2. The running times for each machine on each test do not change, therefore we should expect the performance of each machine relative to the other to remain the same.

    Scenario I       Test1        Test2          Arithmetic Mean
    Machine A:       10 cycles    100 cycles     AMean(10, 10) = 10 CPI
    Machine B:       1 cycle      1000 cycles    AMean(1, 100) = 50.5 CPI
    Instructions:    1 instr      10 instr
    Result: Machine A faster

    Scenario II      Test1        Test2          Arithmetic Mean
    Machine A:       10 cycles    100 cycles     AMean(10, 1) = 5.5 CPI
    Machine B:       1 cycle      1000 cycles    AMean(1, 10) = 5.5 CPI
    Instructions:    1 instr      100 instr
    Result: Equal performance

    Scenario III     Test1        Test2          Arithmetic Mean
    Machine A:       10 cycles    100 cycles     AMean(10, 0.1) = 5.05 CPI
    Machine B:       1 cycle      1000 cycles    AMean(1, 1) = 1 CPI
    Instructions:    1 instr      1000 instr
    Result: Machine B faster

However, we obtain the anomalous result that the machines have different relative performances which depend upon the number of instructions that were executed. The theory is flawed. Average CPI values are not valid measures of computer performance. Taking the average of a set of CPI values is not inherently wrong, but the result cannot be used to compare performance. The erroneous behavior is due to normalizing the values before averaging them. If we average the running times before normalization, we get a value of 55 cycles for Machine A, and a value of 500.5 cycles for Machine B. This alone is the valid comparison. Again, this example is not meant to imply that average CPI values are meaningless; they are simply meaningless when used to compare the performance of machines.

ON SPECMARKS

We have demonstrated that the following is an erroneous method to find a performance number for a machine, based upon a set of test results:

\[ \mathrm{AVG} = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathit{OurTime}_i}{\mathit{RefTime}_i} \]

The implication is that ratios such as SPECmarks should never be averaged; they should first be converted back into the original time values or rate values, and then averaged. The following demonstrates the relation between this and SPEC. AVG has been calculated using erroneous methods, and so it is a meaningless number. However, this meaningless number can be easily transformed into the harmonic mean of SPECmarks, as the following demonstrates:

\[ \frac{1}{\mathrm{AVG}} = \frac{N}{\sum_{i=1}^{N} \frac{\mathit{OurTime}_i}{\mathit{RefTime}_i}} = \frac{N}{\sum_{i=1}^{N} \frac{1}{\mathit{RefTime}_i / \mathit{OurTime}_i}} = \frac{N}{\sum_{i=1}^{N} \frac{1}{\mathit{SPECmark}_i}} = \mathrm{HarmonicMean}(\mathit{SPECmarks}) \]

To repeat, performance ratios should not be averaged. Normalized values should not be averaged. Individual SPECmarks, which are the ratio of the reference machine's running time to the test machine's running time, are normalized values. Their average is therefore meaningless. The only meaningful performance number is the ratio of the arithmetic means of the reference and test machines' running times.

4.0 The Meaning of Performance

We have determined that the arithmetic mean is appropriate for averaging times (which implies that the harmonic mean is appropriate for averaging rates), and that normalization, if performed, should be carried out after the averaging. The question arises: what does this mean? When we say that the following describes the performance of a machine based upon the running of a number of standardized tests (which is the ratio of the arithmetic means, with the constant terms cancelling out),

\[ \frac{\sum_{i=1}^{N} \mathit{OurTime}_i}{\sum_{i=1}^{N} \mathit{RefTime}_i} \]

then we implicitly believe that every test counts equally, in that on average it is used the same number of times as all other tests. This means that tests which are much longer than others will count more in the results.

POINT OF VIEW: Performance is Time Saved

We wish to be able to say, "this machine is X times faster than that machine." Ambiguity arises because we are often unclear on the concept of performance. What do we mean when we talk about the performance of a machine? Why do we wish to be able to say "this machine is X times faster than that machine"? The reason is that we have been using that machine (machine A) for some time and wish to know how much time we would save by using this machine (machine B) instead.

How can we measure this? First, we find out what programs we tend to run on machine A. These programs (or ones similar to them) will be used as the benchmark suite to run on machine B. Next, we measure how often we tend to use the programs. These values will be used as weights in computing the average (programs used more should count more), but the problem is that it is not clear whether we should use values in units of time or number of occurrences; do we count each program the number of times per day it is used or the number of hours per day it is used?

We have an idea about how often we use programs; for instance, every time we edit a source file we might recompile. So we would assign equal weights to the word processing benchmark and the compiler benchmark. We might run a different set of 3 or 4 n-body simulations every time we recompiled the simulator; we would then weight the simulator benchmark 3 or 4 times as heavily as the compiler and text editor. Of course, it is not quite as simple as this, but you get the point; we tend to know how often we use a program, independent of how slowly or quickly the machine we use performs it. What does this buy us?
Say for the moment that we consider all benchmarks in the suite equally important (we use each as often as the other); all we need to do is total up the times it took the new machine to perform the tests, total up the times it took the reference machine to perform the tests, and compare the two results. It does not matter if one test takes three minutes and another takes three days; if the reference machine performs the short test in less than a second (indicating that your new machine is extremely slow) and it performs the long test in three days and six hours (indicating that your new machine is marginally faster than the old one), the time saved is about six hours. Even if you use the short program a hundred times as often as the long program, the time saved is still an hour over the old machine.

The error is that we considered performance to be a value which can be averaged; the problem is our perception that performance is a simple number. The reason for the problem is that we often forget the difference between the following statements:

    on average, the amount of time saved by using machine A over machine B is...
    on average, the relative performance of machine A to machine B is...

HOW WRONG IS WRONG: Performance Comparisons of 7 High-Profile Computers

What effect does this have upon performance calculations, besides being wrong? How wrong is it? Presented in the following figures are comparisons of machine performances, with the performance numbers calculated according to the geometric mean, the harmonic mean, and the arithmetic mean. The number produced by the geometric mean is the number published as the computer's SPEC rating. It is found by taking the geometric mean of the SPEC ratios. It is a meaningless number. The number produced by the harmonic mean is simply the harmonic mean of the SPEC ratios; it, too, is a meaningless number. The final number is produced the correct way, by deriving the original time measurements from the SPECmark and the published running times for the VAX 11/780. These numbers are averaged with the arithmetic mean and compared to the arithmetic mean of the VAX.

The numbers are shown in Fig. 1 and Fig. 2, showing the difference between using various appropriate and inappropriate methods. The published SPECint92 and SPECfp92 numbers are to the left, the value recomputed using the harmonic mean is in the middle, and the value recomputed with the raw running times is on the right. The differences are on the order of ten percent; not enormous, but certainly enough to reorder the list if the examples chosen had been clustered together. As it is, the 72MHz POWER2 chip turns out to be faster than the 200MHz Alpha in both integer and floating point when the averages are recomputed. The numbers are taken from [Corp93a] and [Corp93b].

[Fig. 1: Comparative SPECmarks, Integer. Bar chart; vertical axis: SPECmarks, 0 to 150; bars per machine: SPECint92 (GMean), Harmonic Mean of Ratios, Ratio of Arithmetic Means; machines: 50MHz 88110, 66MHz PowerPC, 66MHz Pentium, 150MHz R4400, 90MHz HP 7100, 72MHz POWER2, 200MHz Alpha.]
A number of published SPECmarks are shown, compared to the values recomputed using the harmonic mean on the individual SPECmarks and the arithmetic mean on the individual running times. As neither of the means was computed with any weight information, all tests are weighted equally. Only the Ratio of Arithmetic Means is correct.

5.0 Rethinking Performance

We usually know what we need to do; we are interested in how much of it we can get done with this computer versus that one. In this context, the only thing that matters is how much time is saved by using one machine over another. The fallacy is in considering performance a measure unto itself. Performance is in reality a specific instance of the following: two machines, a set of programs to be run on them, and an indication of how important each of the programs is to us. Performance is therefore not a single number, but really a collection of implications. It is nothing more or less than the measure of how much time we save running our tests on the machines in question. If someone else has similar needs to ours, our performance numbers will be useful to them. However, two people with different sets of criteria will likely walk away with two completely different performance numbers for the same machine.
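The view of performance as time saved over a weighted workload can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions: the function names, the running times, and both sets of usage counts below are hypothetical values invented for the example, not measurements from the report.

```python
def total_time(times, weights):
    """Total time spent when program i is run weights[i] times."""
    return sum(t * w for t, w in zip(times, weights))


def time_saved(old_times, new_times, weights):
    """Time saved by moving the weighted workload to the new machine."""
    return total_time(old_times, weights) - total_time(new_times, weights)


def speedup(old_times, new_times, weights):
    """Ratio of weighted total times (old machine over new machine)."""
    return total_time(old_times, weights) / total_time(new_times, weights)


# Hypothetical seconds per run for an editor, a compiler, and a simulator.
old_machine = [30.0, 120.0, 600.0]
new_machine = [20.0, 100.0, 300.0]

# Two hypothetical users with different usage patterns (runs per day).
developer = [10, 10, 3]
researcher = [1, 1, 50]

print(time_saved(old_machine, new_machine, developer))  # 1200.0
print(speedup(old_machine, new_machine, developer))     # about 1.57
print(speedup(old_machine, new_machine, researcher))    # about 1.99
```

The same pair of machines yields a different figure for each user, since the weights are part of what is being measured.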

[Fig. 2: Comparative SPECmarks, Floating Point. Bar chart; vertical axis: SPECmarks, 0 to 300; bars per machine: SPECfp92 (GMean), Harmonic Mean of Ratios, Ratio of Arithmetic Means; machines: 50MHz 88110, 66MHz Pentium, 66MHz PowerPC, 150MHz R4400, 90MHz HP 7100, 200MHz Alpha, 72MHz POWER2.]
A number of published SPECmarks are shown, compared to the values recomputed using the harmonic mean on the individual SPECmarks and the arithmetic mean on the individual running times. As neither of the means was computed with any weight information, all tests are weighted equally. Only the Ratio of Arithmetic Means is correct.

6.0 Summary

Interpretations of the arithmetic, geometric, and harmonic means have been given, with the geometric mean written off as a curiosity. Numerous examples have illustrated the reasons for using different means in different circumstances, with an attempt to give insight into the semantics of the various choices. The primary results include the following:

RULES TO LIVE BY

1. When presented with a number of times for a set of benchmarks, the appropriate average is the arithmetic mean.
2. When presented with a number of rate ratios for a set of benchmarks (reference time over experimental time, such as in SPECmarks), sum the individual running times and use the ratio of the sums (equivalent to the ratio of the arithmetic means).
3. When presented with a number of time ratios for a set of benchmarks (experimental time over reference time), sum the individual running times and use the ratio of the sums (equivalent to the ratio of the arithmetic means).
4. When presented with a set of rates, first determine if they are per benchmark or sampled periodically in time. If per benchmark (which is more likely), the harmonic mean is appropriate; if sampled in time, the arithmetic mean is appropriate.

References

[Corp93a] Standard Performance Evaluation Corp. SPEC Newsletter, September 1993.
[Corp93b] Standard Performance Evaluation Corp. SPEC Newsletter, December 1993.
[Fleming86] Philip J. Fleming and John J. Wallace. How not to lie with statistics: the correct way to summarize benchmark results. CACM, 29(3):218-221, March 1986.
[Smith88] James E. Smith. Characterizing Computer Performance with a Single Number. CACM, 31(10):1202-1206, October 1988.

University of Michigan Tech Report CSE-TR-231-95, March 1995