Toward Accurate Performance Evaluation using Hardware Counters


Wiplove Mathur and Jeanine Cook
Klipsch School of Electrical and Computer Engineering
New Mexico State University, Las Cruces, NM

ABSTRACT
On-chip performance counters are gaining popularity as an analysis and validation tool. Various drivers and interfaces have been developed to access these counters. Most contemporary processors have between two and six physical counters that can monitor an equal number of unique events simultaneously at fixed sampling periods. Through multiplexing and estimation, an even greater number of unique events can be monitored using round-robin scheduling of event sets. When program execution is sampled in multiplexed mode, the counters are assigned to a subset of events (limited by the number of physical counters) and are incremented appropriately. During a given sampling slice, the remaining events in the set do not access the counters, and their respective counts must be estimated. Our work addresses the error associated with the estimation of event counts in multiplexed mode. We quantify this error and propose new estimation algorithms that result in much improved accuracy.

1. INTRODUCTION
Performance counters, or Performance Monitoring Counters (PMCs), are counters built into the CPU chip. They can be programmed (through event-select registers) to count a specified event from a pool of events such as L1 data cache accesses, load misses, or branches taken. These performance counters are the least intrusive and an accurate technique for counting and monitoring performance [16]. Moreover, the statistics are collected in real time and on the hardware platform under test, providing a high degree of confidence in the results. Most contemporary processors have between two and six physical counters that can monitor an equal number of unique events simultaneously. The ability of PMCs to monitor only a small number of events simultaneously limits their usage in performance analysis and validation.

In contrast, simulators are widely used to gather performance data for all desired events in a single run of the simulated architecture [18]. The behavior of different events can be correlated to different units of the microprocessor, and relevant performance information can be obtained by relating different metrics to one another. With the data of various events available at the same cycle, an accurate view of the processor state can be obtained. However, when simulating only a microarchitecture (without the rest of the system), system-level details (such as interfaces to buses, interrupt controllers, disks, and video memory) are not taken into consideration. The behavior of a workload is also affected by external factors (such as operating system and TLB effects [11]), which suggests that the performance data generated through simulation may not be completely accurate unless the simulator performs full-system simulation (e.g., Simics [3], SimOS [4]). The ability to obtain simulation-like data therefore greatly increases the usability of PMCs for performance measurement and analysis.

The paper is organized as follows: Section 2 discusses the background of interfaces to PMCs. Section 3 describes the experimental methodology. The methodology and algorithms used to develop the estimation techniques are described in Section 4, and Section 5 discusses the results obtained by implementing the estimation algorithms.
The paper concludes, along with suggestions for future work, in Section 6.

2. BACKGROUND

2.1 PMC Interface
Several interfaces are available to access the PMCs on different microprocessor families. Processor-specific interfaces include Intel's VTune software for Intel processors [5], IBM's Performance Monitor API [14], Compaq's Digital Continuous Profiling Infrastructure (DCPI) for Alpha processors [6, 7], perf-monitor for UltraSPARC-I/II processors [2], and Rabbit for Intel and AMD processors [12]. Additionally, interfaces that provide portable instrumentation on multiple platforms include the Performance Counter Library (PCL) [8] and the Performance Application Programming Interface (PAPI) [9]. PCL and PAPI support performance counters on Alpha, MIPS, Pentium, PowerPC, and UltraSPARC processors. Moreover, PAPI explicitly supports multithreading and multiplexing. We use PAPI as our interface to access the PMCs on a Pentium-III microprocessor (see Section 3). The normal operation of PAPI gives aggregate counts of the events that the PMCs are set to monitor. A Pentium-III processor has two physical counters available for monitoring desired events [1]; therefore, a maximum of two events can be monitored simultaneously.

Figure 1: Estimation of counts of a multiplexed event in an interval.

Multiplexing, described in the next section, is used to count more than two events simultaneously.

2.2 Multiplexed Mode
The multiplexed mode of counting is used to monitor more than one event during program execution. During each time slice, a different event is monitored, as shown in Figure 1. The sequence in which the events are counted is set in an event-list. At the end of each time slice, the current event count is read and stored in a file, and then the next event in the event-list is monitored (after resetting the counter). This sequence continues throughout event monitoring. For example, consider events A, B, C, and D being monitored by the counters in multiplexed mode. Figure 1 shows a possible sequence in which the events may be monitored. Event A is physically counted only once in the entire interval; the counts corresponding to A are not known while the other events (B, C, and D) are being monitored. Since the aggregate count of an event over the complete interval (including the time slices when other events are monitored) is desired, an estimate is computed by the multiplexing software.

The built-in multiplexing feature of PAPI is used to run the counters in multiplexed mode. Since the Pentium-III does not support hardware multiplexing, PAPI implements it in software. PAPI has adopted the MPX software library, which was developed and implemented by May [15]. The switching of events (as described above) is triggered using the Unix interval timer. During the initialization of multiplexing, setitimer is called to set the ITIMER_PROF interval timer to a specified interval (10 milliseconds by default) and sigaction is called to install SIGPROF as a trap. When ITIMER_PROF expires, SIGPROF is sent to the process, which halts the counter, stores the current count, and starts counting the next event. The counts that are not physically measured in an interval are then estimated, as discussed in Section 4.
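To make the switching mechanism concrete, the following is a minimal, self-contained sketch of timer-driven event rotation using setitimer and SIGPROF, in the spirit of the MPX scheme described above. It is an illustration, not PAPI's actual implementation: the helpers program_event and read_and_reset_counter are hypothetical stubs standing in for the hardware-specific work that MPX performs through the counter driver, and the 10 ms slice length is only an example.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define NUM_EVENTS 4                      /* events A, B, C, D in the example */

static volatile sig_atomic_t current_event = 0;   /* index into the event-list */
static volatile long long slice_count[NUM_EVENTS];

/* Hypothetical stand-ins for real PMC programming code. */
static void program_event(int event_id) { (void)event_id; }
static long long read_and_reset_counter(void) { return 12345; }

/* SIGPROF handler: log the finished slice and rotate to the next event. */
static void switch_event(int sig)
{
    (void)sig;
    slice_count[current_event] = read_and_reset_counter();
    current_event = (current_event + 1) % NUM_EVENTS;   /* round-robin event-list */
    program_event(current_event);
}

int main(void)
{
    struct sigaction sa;
    struct itimerval timer;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = switch_event;
    sigaction(SIGPROF, &sa, NULL);                /* SIGPROF acts as the trap   */

    memset(&timer, 0, sizeof timer);
    timer.it_interval.tv_usec = 10000;            /* 10 ms time slice (example) */
    timer.it_value = timer.it_interval;
    setitimer(ITIMER_PROF, &timer, NULL);         /* start the slice timer      */

    program_event(current_event);

    volatile double work = 0.0;                   /* busy loop so the CPU-time  */
    for (long i = 0; i < 100000000L; i++)         /* timer actually expires     */
        work += (double)i;

    for (int e = 0; e < NUM_EVENTS; e++)
        printf("event %d, last slice count: %lld\n", e, slice_count[e]);
    return 0;
}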
3. EXPERIMENTAL METHODOLOGY
This section describes the components of our experimental methodology. Below we describe the software we use to interface to and monitor the performance counters, the events that we monitor with this software, and the benchmarks we use in performance analysis.

3.1 PAPI
In our work, we use PAPI, version 2.3.2, to interface to the performance counters of a Pentium-III dual-processor machine. On this machine, we run the Red Hat Linux 7.3 operating system; the Linux kernel is patched with perfctr, a Linux/x86 performance-monitoring counters driver [17]. PAPI uses this package to gain access to the counters on Linux/x86 platforms [10]. The PAPI code is compiled in DEBUG mode, and the debug data is stored in a file at the end of every time slice. The debug data comprises the counter values and the event ID, in addition to other information. Additionally, a timer and a signal handler (similar to the one discussed in Section 2.2) are set for reading the counters at regular time slices in the non-multiplexed mode. The non-multiplexed mode of counting involves monitoring one fixed event for the complete measurement phase. As shown in Figure 2, a particular event A is monitored during every time slice.

In MPX, one of the two counters is always set to read the total number of cycles (CYC) executed by the code being instrumented, whereas the second counter is used to monitor another event of interest. Therefore, we set the event-list in non-multiplexed mode such that one of the events being counted is always the total number of cycles.

3.2 Benchmarks and Event Sets
We chose a subset of workloads from the SPEC CPU2000 benchmark suite [19] to use in the performance analysis of the proposed estimation techniques. These workloads and a brief description of their respective functionality are shown in Table 1. We use the reference input size in all experiments.

We use a subset of events to study the accuracy of the estimation techniques used in multiplexing. The Pentium-III (P6 architecture) has two performance counters that can be configured to count any of a large set of events [1]. PAPI interfaces to a subset of these events; a list is given in Table 2. This list contains most, but not all, of the events that can be monitored through PAPI. The six events that we monitor and use in this work are listed in Table 4 (Section 5).

All the benchmark source codes were hand-instrumented with PAPI calls. Pseudocode for collecting the event counts in multiplexed or non-multiplexed mode is shown below:

main()
{
    /* Benchmark variables defined */
    /* Define PAPI variables */
    /* Set the timers for sampling in multiplexed / non-multiplexed mode */
    /* Enable the multiplex feature if the counters are to be run in multiplexed mode */
    /* Create the eventset */
    /* Start the counters */

    /* --- Benchmark code executes --- */

    /* Stop the counters */
    return 0;
}
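A fleshed-out version of this pseudocode is sketched below against the PAPI 3-style C API; the paper itself used PAPI 2.3.2, whose calling conventions differ slightly, so treat this as an illustrative sketch rather than the authors' harness. The six preset events mirror the acronyms of Table 4, and the component-assignment call is needed only on newer PAPI releases.

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

#define NUM_EVENTS 6

static void check(int rc, const char *what)
{
    if (rc != PAPI_OK) {
        fprintf(stderr, "%s failed: %s\n", what, PAPI_strerror(rc));
        exit(1);
    }
}

int main(void)
{
    int evset = PAPI_NULL;
    /* The six events of Table 4, as PAPI presets. */
    int events[NUM_EVENTS] = { PAPI_L1_DCM, PAPI_L1_DCA, PAPI_TOT_INS,
                               PAPI_L1_LDM, PAPI_L1_STM, PAPI_BR_TKN };
    long long counts[NUM_EVENTS] = { 0 };

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    check(PAPI_multiplex_init(), "PAPI_multiplex_init");

    check(PAPI_create_eventset(&evset), "PAPI_create_eventset");
    /* Recent PAPI releases require binding the event set to the CPU component
     * before multiplexing can be enabled; older versions skip this call. */
    check(PAPI_assign_eventset_component(evset, 0), "PAPI_assign_eventset_component");
    check(PAPI_set_multiplex(evset), "PAPI_set_multiplex");
    for (int i = 0; i < NUM_EVENTS; i++)
        check(PAPI_add_event(evset, events[i]), "PAPI_add_event");

    check(PAPI_start(evset), "PAPI_start");

    /* --- benchmark code executes here; the paper's harness also issues
     *     PAPI_read() at every timer-driven time slice --- */

    check(PAPI_stop(evset, counts), "PAPI_stop");
    for (int i = 0; i < NUM_EVENTS; i++)
        printf("event %d: %lld\n", i, counts[i]);
    return 0;
}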

Table 1: Benchmarks used in performance analysis.

Integer:
  crafty  - Game playing; high-performance computer chess
  mcf     - Combinatorial optimization; single-depot scheduling for mass transportation
  parser  - Word processing; syntactic English parser
  twolf   - Computer-aided design; microchip placement and routing
  vpr     - Computer-aided design; FPGA placement and routing
  vortex  - Database; object-oriented database

Floating-point:
  art     - Neural networks; object recognition in a thermal image
  ammp    - Computational chemistry; molecular dynamics using ODEs
  equake  - Simulation; seismic wave propagation

Table 2: Events counted by PAPI on the Pentium-III; the six events used in this study are those listed in Table 4.

  L1/L2 instruction and data caches - hits, misses, accesses, reads, writes, TLB misses
  Instruction mix                   - total instructions executed, load/store instructions issued, FP instructions executed, FP multiply instructions, FP divide instructions
  (Conditional) branch prediction   - total branch instructions executed, branch instructions taken/not taken, branches mispredicted/predicted correctly

Table 3: Number of multiplexed intervals collected per benchmark and coefficient of variation across three different execution runs.

After declaring the variables of the benchmark, the PAPI library is initialized, followed by setting the timers for the required mode. The default counting of events is in the non-multiplexed mode, so the multiplex feature is enabled if required. The eventset is created and the counters are started, after which the benchmark code executes in its normal sequence. The counters are read at regular time slices and are stopped just before the completion of the benchmark.

4. METHODOLOGY AND ALGORITHMS
The workloads are instrumented with the multiplexed and non-multiplexed code (as discussed in Section 3.2) to monitor six different events (listed in Table 4). We execute each workload in each mode three times to reduce the error due to variability of data collection across different executions of the same workload. Our aim is to obtain the minimum absolute error between the multiplexed and the non-multiplexed counts in every interval. The non-multiplexed counts reflect the actual, accurate count of an event, since the event is monitored continuously throughout the execution of the workload, which is not the case for multiplexed counts (see Section 2.2). The steps followed to calculate the statistics are as follows:

1. Workloads are executed as mentioned above.
2. The non-multiplexed data vector consists of six counts for each equivalent multiplexed interval. The sum of these six counts is the non-multiplexed event count, count^nm_i.
3. The multiplexed event count, count^m_i, is estimated for every interval using one of the algorithms described in the following sections.
4. The estimation error, count^m_i - count^nm_i, is calculated for every interval.

Table 3 shows the number of multiplexed intervals that occur in the full execution of each workload, where an interval is defined to contain one time slice during which one of the multiplexed events is counted while the remaining event counts are estimated. The coefficient of variation is calculated across the three execution runs of each benchmark in multiplexed mode. A small sketch of the per-interval error and coefficient-of-variation computations follows.
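The short sketch below illustrates, with made-up numbers, how the per-interval estimation error of Step 4 and the coefficient of variation reported in Table 3 can be computed; the variable names and values are ours, not taken from the measurement harness.

#include <math.h>
#include <stdio.h>

#define RUNS 3   /* each workload is executed three times per mode */

/* Coefficient of variation = standard deviation / mean. */
static double coeff_of_variation(const double x[RUNS])
{
    double mean = 0.0, var = 0.0;
    for (int r = 0; r < RUNS; r++) mean += x[r] / RUNS;
    for (int r = 0; r < RUNS; r++) var += (x[r] - mean) * (x[r] - mean) / RUNS;
    return sqrt(var) / mean;
}

int main(void)
{
    /* One multiplexed interval: estimated count vs. the sum of the six
     * non-multiplexed slice counts (Step 2). Values are illustrative. */
    double count_m  = 9.6e5;
    double count_nm = 1.0e6;
    printf("interval error = %.1f%%\n", 100.0 * fabs(count_m - count_nm) / count_nm);

    /* Total counts observed in the three multiplexed runs (illustrative). */
    double totals[RUNS] = { 1.02e9, 0.99e9, 1.01e9 };
    printf("coefficient of variation = %.4f\n", coeff_of_variation(totals));
    return 0;
}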
Figure 2: Data format of non-multiplexed and multiplexed counters at regular time slices.

Figure 2 shows the format of the data obtained by reading the counters (in non-multiplexed and multiplexed modes) at regular time slices. cyc_{i(k)} is the total number of cycles elapsed, starting from the instant the code begins to be instrumented; i is the interval in which each of the multiplexed events is physically monitored once by the PMC, and k is the time slice (finest granularity) for which a counter accumulates event occurrences (after the counter is reset at the end of time slice k-1).

Therefore, if n events are being multiplexed, then k can take values from 1 to n. For simplicity, we assume cyc_{i(n)} to be equivalent to cyc_{(i+1)(0)}.

Figure 3: Conversion of event counts to respective rates in multiplexed mode.

Figure 3 shows the rate plot of all the events measured in the multiplexed mode. Some important variables (at the i-th interval and k-th time slice) used in this paper are listed below:

  n               = the number of events being multiplexed (n = 6 in our study)
  k               = the time slice during which the event count is sampled
  rate^m_i, rate^nm_{i(k)} = rate of occurrence of an event in multiplexed and non-multiplexed mode, respectively
  count^m_{i(n)}  = the number of times an event has occurred in the i-th interval, in the time slice between cyc_{i(n-1)} and cyc_{i(n)}
  count^nm_{i(k)} = the number of times an event has occurred in the k-th time slice of the i-th interval (the period between cyc_{i(k-1)} and cyc_{i(k)})
  slope_i         = slope of the rate between the (i-1)-th and i-th intervals

    rate^m_i = count^m_{i(n)} / (cyc_{i(n)} - cyc_{i(n-1)})                       (1)

    rate^nm_{i(k)} = count^nm_{i(k)} / (cyc_{i(k)} - cyc_{i(k-1)}),  1 <= k <= n  (2)

    count^nm_i = sum_{k=1}^{n} count^nm_{i(k)}                                    (3)

    slope_i = (rate^m_i - rate^m_{i-1}) / (cyc_{i(n)} - cyc_{(i-1)(n)})           (4)

We discuss the estimation algorithms in the following sections.
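These quantities translate directly into a few helper routines. The sketch below is an illustrative rendering of Eqns. (1), (2), and (4); the argument names are ours and the numbers in main are arbitrary (Eqn. (3) is simply the sum of the six slice counts of a non-multiplexed interval).

#include <stdio.h>

/* Eqns. (1)/(2): an occurrence rate is a slice count divided by the cycles the slice spans. */
static double rate(long long count, long long cyc_end, long long cyc_start)
{
    return (double)count / (double)(cyc_end - cyc_start);
}

/* Eqn. (4): slope of the rate between intervals i-1 and i. */
static double slope(double rate_i, double rate_prev,
                    long long cyc_i_n, long long cyc_prev_n)
{
    return (rate_i - rate_prev) / (double)(cyc_i_n - cyc_prev_n);
}

int main(void)
{
    /* One event, two consecutive intervals; cycle stamps follow Figure 2. */
    long long cyc_prev_n = 600, cyc_i_n = 1200;   /* interval boundaries (cycles)      */
    double r_prev = rate(80, 600, 500);           /* Eqn. (1) for the slice ending i-1 */
    double r_i    = rate(120, 1200, 1100);        /* Eqn. (1) for the slice ending i   */

    printf("rate_{i-1} = %.2f  rate_i = %.2f  slope_i = %.5f\n",
           r_prev, r_i, slope(r_i, r_prev, cyc_i_n, cyc_prev_n));
    return 0;
}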

4.1 Base Algorithm
The estimation algorithm used in PAPI (henceforth called the base algorithm) was developed and implemented by May [15]. It is used to estimate the counts of the multiplexed event in each interval.

Figure 4: Event-count calculation of a multiplexed event using the base algorithm.

Consider the case shown in Figures 2 and 3. We discuss the base algorithm used to estimate the count of event A in the i-th interval. Event A is monitored in the time slice k = 4 (from total cycles cyc_{i(3)} to cyc_{i(4)}). If count^m_{i(4)} is the number of occurrences of event A in this time slice, then the rate of event A can be calculated using Eqn. 1 as:

    rate^m_i = count^m_{i(4)} / (cyc_{i(4)} - cyc_{i(3)})                         (5)

Figure 4 shows the plot of rate versus total cycles for event A alone. The rates of event A, rate^m_i and rate^m_{i-1}, corresponding to intervals i and i-1 respectively, can be calculated using Eqn. 5. In the base algorithm, rate^m_i is assumed to be constant for the entire i-th interval, and the count of event A is estimated by:

    count^m_i ≈ rate^m_i * (cyc_{i(n)} - cyc_{(i-1)(n)})                          (6)

Recall that rate^m_i is calculated using the data corresponding to the period between i(n-1) and i(n) (one time slice), whereas the count between (i-1)(n) and i(n) (one interval) is being estimated.

4.2 Trapezoid-area Method
Figure 5: Event-count calculation of a multiplexed event using the trapezoid-area method.

Figure 5 shows the plot of rate versus total cycles for a multiplexed event A. The rates of event A, rate^m_i and rate^m_{i-1}, corresponding to intervals i and i-1 respectively, are calculated using Eqn. 5. In the trapezoid-area method, the rate of occurrence of event A is assumed to change linearly within an interval. Thus, the estimated count of the multiplexed event A in the i-th interval is given by the area under the trapezoid PQRS (see Figure 5). Mathematically,

    count^m_i ≈ 0.5 * (rate^m_i + rate^m_{i-1}) * (cyc_{i(n)} - cyc_{(i-1)(n)})   (7)

4.3 Divided-interval Rectangular-area Method
We describe a simple algorithm, the divided-interval rectangular-area method, for estimating the count of an event A.

Figure 6: Event-count calculation of a multiplexed event using the divided-interval rectangular-area method.

Figure 6 shows the plot of rate versus total cycles for a multiplexed event A. The rates rate^m_i and rate^m_{i-1}, corresponding to intervals i and i-1 respectively, are calculated using Eqn. 5. The algorithm proceeds in the following steps (a code sketch of the estimators of Sections 4.1-4.3 follows this list):

1. The i-th interval (where the count is being estimated) is divided into j equal parts (see Figure 6). The rate at the k-th division is calculated using linear interpolation:

       rate^m_{i(k)} = slope_i * (cyc_{i(k)} - cyc_{(i-1)(n)}) + rate^m_{i-1}     (8)

   where slope_i is given by Eqn. 4.

2. The area corresponding to the k-th division is calculated by assuming the rate to remain constant between cycles cyc_{i(k-1)} and cyc_{i(k)}. Thus, the area under rectangle PQRS in Figure 6 is given by:

       count^m_{i(k)} ≈ rate^m_{i(k)} * (cyc_{i(k)} - cyc_{i(k-1)})               (9)

   where the value of rate^m_{i(k)} is obtained from Eqn. 8.

3. Steps 1 and 2 are repeated for 1 <= k <= j.

4. The estimated count of the multiplexed event A in the i-th interval is given by:

       count^m_i = sum_{k=1}^{j} count^m_{i(k)}                                   (10)

   In our case, j = n (= 6).
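The sketch below implements the three estimators of Sections 4.1-4.3 exactly as given by Eqns. (6), (7), and (8)-(10). Function and parameter names are our own; the rates are the per-cycle occurrence rates of Eqn. (5), and cyc[] holds the cumulative cycle stamps of Figure 2, with cyc[0] being the end of the previous interval.

#include <stdio.h>

/* Eqn. (6), base algorithm: hold the most recent rate constant over the interval. */
static double est_base(double rate_i, long long cyc_i_n, long long cyc_prev_n)
{
    return rate_i * (double)(cyc_i_n - cyc_prev_n);
}

/* Eqn. (7), trapezoid-area method: assume the rate changes linearly over the interval. */
static double est_trapezoid(double rate_i, double rate_prev,
                            long long cyc_i_n, long long cyc_prev_n)
{
    return 0.5 * (rate_i + rate_prev) * (double)(cyc_i_n - cyc_prev_n);
}

/* Eqns. (8)-(10), divided-interval rectangular-area method: split the interval
 * at the recorded slice boundaries cyc[0..j] and sum one rectangle per division,
 * using the linearly interpolated rate at each division's right edge. */
static double est_divided_rect(double rate_i, double rate_prev,
                               const long long cyc[], int j)
{
    double slope = (rate_i - rate_prev) / (double)(cyc[j] - cyc[0]);   /* Eqn. (4)  */
    double total = 0.0;
    for (int k = 1; k <= j; k++) {
        double rate_k = slope * (double)(cyc[k] - cyc[0]) + rate_prev; /* Eqn. (8)  */
        total += rate_k * (double)(cyc[k] - cyc[k - 1]);               /* Eqn. (9)  */
    }
    return total;                                                      /* Eqn. (10) */
}

int main(void)
{
    /* Illustrative numbers: six slice boundaries within one interval. */
    long long cyc[7] = { 0, 100, 200, 300, 400, 500, 600 };
    double rate_prev = 0.8, rate_i = 1.2;   /* events per cycle in intervals i-1 and i */

    printf("base             : %.1f\n", est_base(rate_i, cyc[6], cyc[0]));
    printf("trapezoid        : %.1f\n", est_trapezoid(rate_i, rate_prev, cyc[6], cyc[0]));
    printf("divided-interval : %.1f\n", est_divided_rect(rate_i, rate_prev, cyc, 6));
    return 0;
}

With the linearly increasing rate used in main, the base algorithm overestimates the interval count (720 against the trapezoid's 600), which is exactly the kind of error the improved estimators are meant to reduce.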

4.4 Positional Mean Error (PME)
The Positional Mean Error (PME) algorithm is a two-phase algorithm. Phase 1 involves calculating the rate corrections, or the positional mean errors, and Phase 2 consists of using the PMEs to correct the multiplexed rates and estimate the event count. A code sketch of both phases is given after Section 4.5.

Figure 7: Event-count calculation of a multiplexed event using the Positional Mean Error method.

The following steps comprise Phase 1:

1. The rate of event A in multiplexed mode at the k-th position, rate^m_{i(k)}, is calculated using linear interpolation:

       rate^m_{i(k)} = slope_i * (cyc_{i(k)} - cyc_{(i-1)(n)}) + rate^m_{i-1}     (11)

   where slope_i is given by Eqn. 4.

2. The difference between the rate of event A in non-multiplexed mode, rate^nm_{i(k)}, and the rate calculated in Step 1 is:

       e_k = rate^nm_{i(k)} - rate^m_{i(k)}                                       (12)

   This is shown in Figure 7; rate^nm_{i(k)} is given by Eqn. 2. This difference is the positional error (for position k) in the i-th interval and is calculated for 1 <= k <= n and every interval i (n = 6 in our case).

3. The PME is then given by:

       pme_k = (1 / i_total) * sum_i e_k                                          (13)

   where pme_k is the positional mean error for the k-th position and i_total is the total number of intervals.

Phase 1 produces n PMEs that are used in Phase 2 for estimating the event counts. Phase 2 includes the following steps:

1. Same as Step 1 of Phase 1.

2. Calculate the corrected rate at the k-th position, for every k:

       c_rate^m_{i(k)} = rate^m_{i(k)} + pme_k                                    (14)

3. Assuming a linear rate between corrected positional rates, the count in the k-th slice division is estimated using the trapezoid-area method discussed in Section 4.2:

       count^m_{i(k)} ≈ 0.5 * (c_rate^m_{i(k)} + c_rate^m_{i(k-1)}) * (cyc_{i(k)} - cyc_{i(k-1)})   (15)

4. The estimated count in the i-th interval is then given by:

       count^m_i = sum_{k=1}^{n} count^m_{i(k)}                                   (16)

4.5 Multiple Linear Regression Model (MLR)
The Multiple Linear Regression (MLR) model allows prediction of a response variable as a function of predictor variables using a linear model [13]. In vector notation, it is given by:

       y = Xb + e                                                                 (17)

where
   y = a column vector of the non-multiplexed counts, aggregated in the respective multiplexed intervals;
   X = a matrix in which each column element is a divided-interval trapezoidal area, as shown in Figure 8, and each row corresponds to a particular interval;
   b = the predictor parameters.

Figure 8: Event-count calculation of a multiplexed event using the MLR method.

Hence, the multiplexed sub-interval areas are represented as a linear model of the actual (non-multiplexed) count. The predictor parameter estimate is given by:

       b = (X^T X)^(-1) (X^T y)                                                   (18)

The estimated parameters are then used to scale the trapezoid areas in an interval, and the sum of the scaled areas is the estimated multiplexed count of that interval. Mathematically,

       scaled_x_k = b[k] * x_k                                                    (19)

       count^m_i = sum_{k=1}^{j} scaled_x_k                                       (20)

For our study, the sample size is 0.5 of the population size.
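As a companion to Section 4.4, the following sketch walks through both PME phases. The per-interval arrays of multiplexed rates, calibration (non-multiplexed) rates, and cycle stamps are assumed layouts for this illustration; the corrected rate at the interval's left edge (position 0), which the description leaves implicit, is simply taken to be the uncorrected rate.

#include <stdio.h>

#define N 6   /* number of multiplexed events / time slices per interval */

/* Eqn. (11): linearly interpolated multiplexed rate at position k of an interval.
 * cyc[0] denotes the end of the previous interval, cyc_{(i-1)(n)}. */
static double interp_rate(double rate_i, double rate_prev,
                          const long long cyc[N + 1], int k)
{
    double slope = (rate_i - rate_prev) / (double)(cyc[N] - cyc[0]);   /* Eqn. (4) */
    return slope * (double)(cyc[k] - cyc[0]) + rate_prev;
}

/* Phase 1, Eqns. (12)-(13): average the positional errors over the intervals.
 * The first interval is skipped because it has no predecessor rate. */
static void compute_pme(int n_intervals, const double rate_m[],
                        double rate_nm[][N + 1], long long cyc[][N + 1],
                        double pme[N + 1])
{
    for (int k = 1; k <= N; k++) {
        double sum = 0.0;
        for (int i = 1; i < n_intervals; i++) {
            double r = interp_rate(rate_m[i], rate_m[i - 1], cyc[i], k);
            sum += rate_nm[i][k] - r;                       /* Eqn. (12) */
        }
        pme[k] = sum / (double)(n_intervals - 1);           /* Eqn. (13) */
    }
}

/* Phase 2, Eqns. (14)-(16): corrected rates, then trapezoid areas per slice. */
static double estimate_pme(double rate_i, double rate_prev,
                           const long long cyc[N + 1], const double pme[N + 1])
{
    double total = 0.0;
    double prev = rate_prev;   /* corrected rate at the left edge (assumption) */
    for (int k = 1; k <= N; k++) {
        double cur = interp_rate(rate_i, rate_prev, cyc, k) + pme[k];  /* Eqn. (14) */
        total += 0.5 * (cur + prev) * (double)(cyc[k] - cyc[k - 1]);   /* Eqn. (15) */
        prev = cur;
    }
    return total;                                                      /* Eqn. (16) */
}

int main(void)
{
    /* Three intervals of synthetic data, just to exercise both phases. */
    long long cyc[3][N + 1] = {
        { -600, -500, -400, -300, -200, -100, 0 },
        { 0, 100, 200, 300, 400, 500, 600 },
        { 600, 700, 800, 900, 1000, 1100, 1200 },
    };
    double rate_m[3] = { 0.8, 1.0, 1.2 };        /* multiplexed-mode rates          */
    double rate_nm[3][N + 1] = { { 0 } };        /* calibration (non-mux) rates     */
    double pme[N + 1] = { 0 };

    for (int i = 1; i < 3; i++)
        for (int k = 1; k <= N; k++)
            rate_nm[i][k] = rate_m[i] + 0.05;    /* pretend the true rate is higher */

    compute_pme(3, rate_m, rate_nm, cyc, pme);
    printf("PME-corrected count for interval 2: %.1f\n",
           estimate_pme(rate_m[2], rate_m[1], cyc[2], pme));
    return 0;
}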

Table 4: Acronyms for the multiplexed events monitored.

  L1 data cache misses         dcm
  L1 data cache accesses       dca
  Instructions committed       ins
  L1 load misses               ldm
  L1 store misses              stm
  Conditional branches taken   brtkn

5. RESULTS
The error in the base algorithm is computed by comparing the estimated multiplexed counts to the non-multiplexed counts of the same event. We distributed the errors into groups: less than 5%, 5-10%, 10-50%, and above 50%. Table 4 lists the acronyms used for the six events being multiplexed, and Figures 9 and 11 show the histograms for the integer and floating-point workloads, respectively. The histograms indicate that a very high percentage of interval counts in multiplexed mode are inaccurately estimated. This behavior is observed for all the integer (with the exception of twolf) and floating-point workloads. For instance, when estimating the count of store misses (stm) in mcf (Figure 9(b)), as much as 50% of the intervals were estimated with above 50% error and 42% of the intervals were estimated with 5-50% error. Almost every event in every workload has at least 50% of its intervals estimated with more than 5% error. Similar behavior is observed in the floating-point workloads, with as many as 93% of the intervals estimated with greater than 50% error for the store misses in equake (Figure 11(c)). This shows that the estimation accuracy of the base algorithm is very low.

We show the results of applying the algorithms described in Sections 4.1 through 4.5 in Figures 10 and 12, which describe the accuracy of each algorithm for each event in terms of error. The error for each algorithm is computed by comparing the estimated multiplexed counts to the non-multiplexed counts of the same event. The total absolute error, sum_k |count^m_k - count^nm_k|, is computed for all the algorithms and compared with that of the base algorithm. Each data point is normalized to the error computed for the base algorithm (a small sketch of this normalization follows at the end of this section); values less than 1 indicate less error in the estimation of the multiplexed counts, and the lower the normalized value, the better the algorithm.

For all the benchmarks, the proposed estimation algorithms result in decreased error compared to the base algorithm for every event. For the benchmark crafty, the reduction in error varies between 10% and 40% over the set of events, as shown in Figure 10(a). For data cache misses (dcm) and load misses (ldm), the best of the proposed methods reduce the error by around 40%, while the other methods in general yield smaller reductions, around 10% for crafty. Similar improvements are observed for the floating-point benchmarks shown in Figure 12. The error reduction varies between 7% and 40% for all the floating-point workloads across the six events. One of the proposed algorithms proved to be the best for equake, producing an estimation-error reduction of almost 40% for the store misses (stm).
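For reference, each normalized bar in Figures 10 and 12 can be formed as below: sum the absolute per-interval error of an estimator and divide by the same sum for the base algorithm. The arrays are illustrative placeholders, not measured data.

#include <math.h>
#include <stdio.h>

/* Total absolute error of an estimator over all intervals. */
static double total_abs_error(const double est[], const double actual[], int n)
{
    double sum = 0.0;
    for (int k = 0; k < n; k++)
        sum += fabs(est[k] - actual[k]);
    return sum;
}

int main(void)
{
    double actual[] = { 100, 120, 90, 110 };   /* non-multiplexed counts            */
    double base[]   = { 140, 80, 150, 60 };    /* base-algorithm estimates          */
    double other[]  = { 115, 108, 101, 95 };   /* estimates from a proposed method  */

    /* Values below 1.0 mean the proposed method beats the base algorithm. */
    double norm = total_abs_error(other, actual, 4) / total_abs_error(base, actual, 4);
    printf("normalized total absolute error = %.3f\n", norm);
    return 0;
}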
6. CONCLUSIONS AND FUTURE WORK
The algorithms discussed in this paper reduce the estimation error for all the multiplexed events and all the workloads. Improvements of up to 40% are achieved by two of the proposed algorithms and up to 30% by the other two. Utilizing any of these techniques will greatly reduce the estimation errors of the multiplexed counts. PME and MLR require a pre-calculated library of correction parameters (corresponding to each event-workload pair) for their implementation, whereas the trapezoid-area and divided-interval methods are more generic and independent of any event or workload.

Since the interval size is defined by time (10 milliseconds in our case), the event counts cannot be collected at a specified cycle. Therefore, it is difficult to collect cycle-synchronized performance metrics that can provide a complete snapshot of program behavior. We plan to address this in the future by incorporating an algorithm that we have developed into the techniques discussed in this paper.

Figure 9: Histogram plot showing the distribution of errors in estimating multiplexed event counts, integer benchmarks: (a) crafty, (b) mcf, (c) parser, (d) twolf, (e) vortex, (f) vpr.

Figure 10: Total absolute error of the estimated multiplexed counts (normalized to the base algorithm), integer benchmarks: (a) crafty, (b) mcf, (c) parser, (d) twolf, (e) vortex, (f) vpr.

Figure 11: Histogram plot showing the distribution of errors in estimating multiplexed event counts, floating-point benchmarks: (a) art, (b) ammp, (c) equake.

Figure 12: Total absolute error of the estimated multiplexed counts (normalized to the base algorithm), floating-point benchmarks: (a) art, (b) ammp, (c) equake.

7. REFERENCES

[1] Intel architecture software developer's manual, volume 3: System programming guide. Intel document number 243192.
[2] Perf-monitor for UltraSPARC. mch/perf-monitor/.
[3] SIMICS.
[4] The SimOS complete system simulator.
[5] VTune profiling software.
[6] DIGITAL continuous profiling infrastructure project. Oct. 1997.
[7] J. Anderson, L. Berc, J. Dean, S. Ghemawat, M. Henzinger, S. Leung, D. Sites, M. Vandevoorde, C. Waldspurger, and W. Weihl. Continuous profiling: Where have all the cycles gone?, 1997.
[8] R. Berrendorf, H. Ziegler, and B. Mohr. PCL - the performance counter library: A common interface to access hardware performance counters on microprocessors. Research Centre Juelich GmbH.
[9] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci. A portable programming interface for performance evaluation on modern processors. The International Journal of High Performance Computing Applications, 14(3):189-204, Fall 2000.
[10] J. Dongarra, K. London, S. Moore, P. Mucci, and D. Terpstra. Using PAPI for hardware performance monitoring on Linux systems. In Conference on Linux Clusters: The HPC Revolution, June 2001.
[11] J. Gibson, R. Kunz, D. Ofelt, and M. Heinrich. FLASH vs. (simulated) FLASH: Closing the simulation loop. In Architectural Support for Programming Languages and Operating Systems, pages 49-58, 2000.
[12] D. Heller. Rabbit: A performance counters library for Intel/AMD processors and Linux.
[13] R. Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation and Modeling. John Wiley & Sons, Inc., 1991.
[14] F. E. Levine and C. P. Roth. A programmer's view of performance monitoring in the PowerPC microprocessor. IBM Journal of Research and Development, 41(3), May 1997.
[15] J. M. May. MPX: Software for multiplexing hardware performance counters in multithreaded programs. In Proceedings of the 15th International Parallel and Distributed Processing Symposium, April 2001.
[16] A. K. Ojha. Techniques in least-intrusive computer system performance monitoring. In Proceedings of IEEE SoutheastCon, 2001.
[17] M. Pettersson. Linux x86 performance-monitoring counters driver. mikpe/linux/perfctr/.
[18] K. Skadron, M. Martonosi, D. I. August, M. D. Hill, D. J. Lilja, and V. S. Pai. Challenges in computer architecture evaluation. IEEE Computer, August 2003.
[19] Standard Performance Evaluation Corporation (SPEC).


More information

Distributed Systems. Virtualization. Paul Krzyzanowski pxk@cs.rutgers.edu

Distributed Systems. Virtualization. Paul Krzyzanowski pxk@cs.rutgers.edu Distributed Systems Virtualization Paul Krzyzanowski pxk@cs.rutgers.edu Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Virtualization

More information

A Lab Course on Computer Architecture

A Lab Course on Computer Architecture A Lab Course on Computer Architecture Pedro López José Duato Depto. de Informática de Sistemas y Computadores Facultad de Informática Universidad Politécnica de Valencia Camino de Vera s/n, 46071 - Valencia,

More information

Attaining EDF Task Scheduling with O(1) Time Complexity

Attaining EDF Task Scheduling with O(1) Time Complexity Attaining EDF Task Scheduling with O(1) Time Complexity Verber Domen University of Maribor, Faculty of Electrical Engineering and Computer Sciences, Maribor, Slovenia (e-mail: domen.verber@uni-mb.si) Abstract:

More information

Per-core Power Estimation and Power Aware Scheduling Strategies for CMPs

Per-core Power Estimation and Power Aware Scheduling Strategies for CMPs Per-core Power Estimation and Power Aware Scheduling Strategies for CMPs Master of Science Thesis in Integrated Electronic System Design BHAVISHYA GOEL Chalmers University of Technology University of Gothenburg

More information

Operating Systems: Basic Concepts and History

Operating Systems: Basic Concepts and History Introduction to Operating Systems Operating Systems: Basic Concepts and History An operating system is the interface between the user and the architecture. User Applications Operating System Hardware Virtual

More information

Going Linux on Massive Multicore

Going Linux on Massive Multicore Embedded Linux Conference Europe 2013 Going Linux on Massive Multicore Marta Rybczyńska 24th October, 2013 Agenda Architecture Linux Port Core Peripherals Debugging Summary and Future Plans 2 Agenda Architecture

More information

Precise and Accurate Processor Simulation

Precise and Accurate Processor Simulation Precise and Accurate Processor Simulation Harold Cain, Kevin Lepak, Brandon Schwartz, and Mikko H. Lipasti University of Wisconsin Madison http://www.ece.wisc.edu/~pharm Performance Modeling Analytical

More information

Real-Time Systems Prof. Dr. Rajib Mall Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Real-Time Systems Prof. Dr. Rajib Mall Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Real-Time Systems Prof. Dr. Rajib Mall Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No. # 26 Real - Time POSIX. (Contd.) Ok Good morning, so let us get

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

EEM 486: Computer Architecture. Lecture 4. Performance

EEM 486: Computer Architecture. Lecture 4. Performance EEM 486: Computer Architecture Lecture 4 Performance EEM 486 Performance Purchasing perspective Given a collection of machines, which has the» Best performance?» Least cost?» Best performance / cost? Design

More information

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007 Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

More information