Toward Accurate Performance Evaluation using Hardware Counters


Wiplove Mathur and Jeanine Cook
Klipsch School of Electrical and Computer Engineering
New Mexico State University, Las Cruces, NM

ABSTRACT
On-chip performance counters are gaining popularity as an analysis and validation tool. Various drivers and interfaces have been developed to access these counters. Most contemporary processors have between two and six physical counters that can monitor an equal number of unique events simultaneously at fixed sampling periods. Through multiplexing and estimation, an even greater number of unique events can be monitored using round-robin scheduling of event sets. When program execution is sampled in multiplexed mode, the counters are assigned to a subset of events (limited by the number of physical counters) and are incremented appropriately. During a given sampling slice, the remaining events in the set do not access the counters, and their respective counts must be estimated. Our work addresses the error associated with the estimation of event counts in multiplexed mode. We quantify this error and propose new estimation algorithms that result in much improved accuracy.

1. INTRODUCTION
Performance counters, or Performance Monitoring Counters (PMCs), are counters built into the CPU chip. They can be programmed (through event-select registers) to count a specified event from a pool of events such as L1 data cache accesses, load misses, or branches taken. These performance counters are the least intrusive and an accurate technique for counting and monitoring performance [16]. Moreover, the statistics are collected in real time and on the hardware platform under test, providing a high degree of confidence in the results. Most contemporary processors have between two and six physical counters that can monitor an equal number of unique events simultaneously. The ability of PMCs to monitor only a small number of events simultaneously limits their usage in performance analysis and validation.

In contrast, simulators are widely used to gather performance data for all desired events in a single run of the simulated architecture [18]. The behavior of different events can be correlated to different units of the microprocessor, and relevant performance information can be obtained by relating different metrics to one another. With the data of various events available at the same cycle, an accurate view of the processor state can be obtained. However, when simulating only a microarchitecture (without the rest of the system), system-level details (such as interfaces to buses, interrupt controllers, disks, and video memory) are not taken into consideration. The behavior of a workload is also affected by external factors (such as operating system and TLB effects [11]), which suggests that the performance data generated through simulation may not be completely accurate unless the simulator performs full-system simulation (e.g., Simics [3], SimOS [4]). The ability to obtain simulation-like data therefore greatly increases the usability of PMCs for performance measurement and analysis.

The paper is organized as follows: Section 2 discusses the background of interfaces to PMCs. Section 3 describes the experimental methodology. The methodology and algorithms used to develop the estimation techniques are described in Section 4, and Section 5 discusses the results obtained by implementing the estimation algorithms.
The paper concludes, along with suggestions for future work, in Section 6.

2. BACKGROUND

2.1 PMC Interface
Several interfaces are available to access the PMCs on different microprocessor families. Processor-specific interfaces include Intel's VTune software for Intel processors [5], IBM's Performance Monitor API [14], Compaq's Digital Continuous Profiling Infrastructure (DCPI) for Alpha processors [6, 7], perf-monitor for UltraSPARC-I/II processors [2], and Rabbit for Intel and AMD processors [12]. Additionally, interfaces that provide portable instrumentation on multiple platforms include the Performance Counter Library (PCL) [8] and the Performance Application Programming Interface (PAPI) [9]. PCL and PAPI support performance counters on Alpha, MIPS, Pentium, PowerPC, and UltraSPARC processors. Moreover, PAPI explicitly supports multithreading and multiplexing. We use PAPI as our interface to access the PMCs on a Pentium-III microprocessor (see Section 3). The normal operation of PAPI gives aggregate counts of the events that the PMCs are set to monitor. A Pentium-III processor has two physical counters available for monitoring desired events [1]; therefore, a maximum of two events can be monitored simultaneously.

Figure 1: Estimation of counts of a multiplexed event in an interval.

Multiplexing, described in the next section, is used to count more than two events simultaneously.

2.2 Multiplexed Mode
The multiplexed mode of counting is used to monitor more than one event during program execution. During each time slice, a different event is monitored, as shown in Figure 1. The sequence in which the events are counted is set in an event-list. At the end of each time slice, the current event count is read and stored in a file, and then the next event in the event-list is monitored (after resetting the counter). This sequence continues throughout event monitoring. For example, consider events A, B, C, and D being monitored by the counters in multiplexed mode. Figure 1 shows a possible sequence in which the events may be monitored. Event A is physically counted only once in the entire interval; the counts corresponding to A are not known while the other events (B, C, and D) are being monitored. Since the aggregate count of an event over the complete interval (including the time slices when other events are monitored) is desired, an estimate is computed by the multiplexing software.

The built-in multiplexing feature of PAPI is used to run the counters in multiplexed mode. Since the Pentium-III does not support hardware multiplexing, PAPI implements it in software. PAPI has adopted the MPX software library, which was developed and implemented by May [15]. The switching of events (as described above) is triggered using the Unix interval timer. During the initialization of multiplexing, setitimer is called to set the ITIMER_PROF interval timer to a specified interval (10 milliseconds by default) and sigaction is called to install SIGPROF as a trap. When ITIMER_PROF expires, SIGPROF is sent to the process, which halts the counter, stores the current count, and starts counting the next event. The counts that are not physically measured in an interval are then estimated, as discussed in Section 4.
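To make the switching mechanism concrete, the following is a minimal, self-contained sketch of timer-driven event rotation using setitimer and SIGPROF, in the spirit of the MPX scheme described above. It is an illustration, not PAPI's actual implementation: the helpers program_event and read_and_reset_counter are hypothetical stubs standing in for the hardware-specific work that MPX performs through the counter driver, and the 10 ms slice length is only an example.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define NUM_EVENTS 4                      /* events A, B, C, D in the example */

static volatile sig_atomic_t current_event = 0;   /* index into the event-list */
static volatile long long slice_count[NUM_EVENTS];

/* Hypothetical stand-ins for real PMC programming code. */
static void program_event(int event_id) { (void)event_id; }
static long long read_and_reset_counter(void) { return 12345; }

/* SIGPROF handler: log the finished slice and rotate to the next event. */
static void switch_event(int sig)
{
    (void)sig;
    slice_count[current_event] = read_and_reset_counter();
    current_event = (current_event + 1) % NUM_EVENTS;   /* round-robin event-list */
    program_event(current_event);
}

int main(void)
{
    struct sigaction sa;
    struct itimerval timer;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = switch_event;
    sigaction(SIGPROF, &sa, NULL);                /* SIGPROF acts as the trap   */

    memset(&timer, 0, sizeof timer);
    timer.it_interval.tv_usec = 10000;            /* 10 ms time slice (example) */
    timer.it_value = timer.it_interval;
    setitimer(ITIMER_PROF, &timer, NULL);         /* start the slice timer      */

    program_event(current_event);

    volatile double work = 0.0;                   /* busy loop so the CPU-time  */
    for (long i = 0; i < 100000000L; i++)         /* timer actually expires     */
        work += (double)i;

    for (int e = 0; e < NUM_EVENTS; e++)
        printf("event %d, last slice count: %lld\n", e, slice_count[e]);
    return 0;
}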
3. EXPERIMENTAL METHODOLOGY
This section describes the components of our experimental methodology. Below we describe the software we use to interface to and monitor the performance counters, the events that we monitor with this software, and the benchmarks we use in performance analysis.

3.1 PAPI
In our work, we use PAPI, version 2.3.2, to interface to the performance counters of a Pentium-III dual-processor machine. On this machine, we run the Red Hat Linux 7.3 operating system; the Linux kernel is patched with perfctr, a Linux/x86 performance-monitoring counters driver [17]. PAPI uses this package to gain access to the counters on Linux/x86 platforms [10]. The PAPI code is compiled in DEBUG mode, and the debug data is stored in a file at the end of every time slice. The debug data comprises the counter values and the event ID, in addition to other information. Additionally, a timer and a signal handler (similar to the one discussed in Section 2.2) are set for reading the counters at regular time slices in the non-multiplexed mode. The non-multiplexed mode of counting involves monitoring one fixed event for the complete measurement phase. As shown in Figure 2, a particular event A is monitored during every time slice.

In MPX, one of the two counters is always set to read the total number of cycles (CYC) executed by the code being instrumented, whereas the second counter is used to monitor another event of interest. Therefore, we set the event-list in non-multiplexed mode such that one of the events being counted is always the total number of cycles.

3.2 Benchmarks and Event Sets
We chose a subset of workloads from the SPEC CPU2000 benchmark suite [19] to use in the performance analysis of the proposed estimation techniques. These workloads and a brief description of their respective functionality are shown in Table 1. We use the reference input size in all experiments.

We use a subset of events to study the accuracy of the estimation techniques used in multiplexing. The Pentium-III (P6 architecture) has two performance counters that can be configured to count any of a large set of events [1]. PAPI interfaces to a subset of these events; a list is given in Table 2. This list contains most, but not all, of the events that can be monitored through PAPI. The six events that we monitor and use in this work are listed in Table 4 (Section 5).

All the benchmark source codes were hand-instrumented with PAPI calls. Pseudocode for collecting the event counts in multiplexed or non-multiplexed mode is shown below:

main()
{
    /* Benchmark variables defined */
    /* Define PAPI variables */
    /* Set the timers for sampling in multiplexed / non-multiplexed mode */
    /* Enable the multiplex feature if the counters are to be run in multiplexed mode */
    /* Create the eventset */
    /* Start the counters */

    /* --- Benchmark code executes --- */

    /* Stop the counters */
    return 0;
}
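A fleshed-out version of this pseudocode is sketched below against the PAPI 3-style C API; the paper itself used PAPI 2.3.2, whose calling conventions differ slightly, so treat this as an illustrative sketch rather than the authors' harness. The six preset events mirror the acronyms of Table 4, and the component-assignment call is needed only on newer PAPI releases.

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

#define NUM_EVENTS 6

static void check(int rc, const char *what)
{
    if (rc != PAPI_OK) {
        fprintf(stderr, "%s failed: %s\n", what, PAPI_strerror(rc));
        exit(1);
    }
}

int main(void)
{
    int evset = PAPI_NULL;
    /* The six events of Table 4, as PAPI presets. */
    int events[NUM_EVENTS] = { PAPI_L1_DCM, PAPI_L1_DCA, PAPI_TOT_INS,
                               PAPI_L1_LDM, PAPI_L1_STM, PAPI_BR_TKN };
    long long counts[NUM_EVENTS] = { 0 };

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    check(PAPI_multiplex_init(), "PAPI_multiplex_init");

    check(PAPI_create_eventset(&evset), "PAPI_create_eventset");
    /* Recent PAPI releases require binding the event set to the CPU component
     * before multiplexing can be enabled; older versions skip this call. */
    check(PAPI_assign_eventset_component(evset, 0), "PAPI_assign_eventset_component");
    check(PAPI_set_multiplex(evset), "PAPI_set_multiplex");
    for (int i = 0; i < NUM_EVENTS; i++)
        check(PAPI_add_event(evset, events[i]), "PAPI_add_event");

    check(PAPI_start(evset), "PAPI_start");

    /* --- benchmark code executes here; the paper's harness also issues
     *     PAPI_read() at every timer-driven time slice --- */

    check(PAPI_stop(evset, counts), "PAPI_stop");
    for (int i = 0; i < NUM_EVENTS; i++)
        printf("event %d: %lld\n", i, counts[i]);
    return 0;
}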

Table 1: Benchmarks used in performance analysis.

Integer:
  crafty  - Game playing; high-performance computer chess
  mcf     - Combinatorial optimization; single-depot scheduling for mass transportation
  parser  - Word processing; syntactic English parser
  twolf   - Computer-aided design; microchip placement and routing
  vpr     - Computer-aided design; FPGA placement and routing
  vortex  - Database; object-oriented database

Floating-point:
  art     - Neural networks; object recognition in a thermal image
  ammp    - Computational chemistry; molecular dynamics using ODEs
  equake  - Simulation; seismic wave propagation

Table 2: Events counted by PAPI on the Pentium-III; the six events used in this study are those listed in Table 4.

  L1/L2 instruction and data caches - hits, misses, accesses, reads, writes, TLB misses
  Instruction mix                   - total instructions executed, load/store instructions issued, FP instructions executed, FP multiply instructions, FP divide instructions
  (Conditional) branch prediction   - total branch instructions executed, branch instructions taken/not taken, branches mispredicted/predicted correctly

Table 3: Number of multiplexed intervals collected per benchmark and coefficient of variation across three different execution runs.

After declaring the variables of the benchmark, the PAPI library is initialized, followed by setting the timers for the required mode. The default counting of events is in the non-multiplexed mode, so the multiplex feature is enabled if required. The eventset is created and the counters are started, after which the benchmark code executes in its normal sequence. The counters are read at regular time slices and are stopped just before the completion of the benchmark.

4. METHODOLOGY AND ALGORITHMS
The workloads are instrumented with the multiplexed and non-multiplexed code (as discussed in Section 3.2) to monitor six different events (listed in Table 4). We execute each workload in each mode three times to reduce the error due to variability of data collection across different executions of the same workload. Our aim is to obtain the minimum absolute error between the multiplexed and the non-multiplexed counts in every interval. The non-multiplexed counts reflect the actual, accurate count of an event, since the event is monitored continuously throughout the execution of the workload, which is not the case for multiplexed counts (see Section 2.2). The steps followed to calculate the statistics are as follows:

1. Workloads are executed as mentioned above.
2. The non-multiplexed data vector consists of six counts for each equivalent multiplexed interval. The sum of these six counts is the non-multiplexed event count, count^nm_i.
3. The multiplexed event count, count^m_i, is estimated for every interval using one of the algorithms described in the following sections.
4. The estimation error, count^m_i - count^nm_i, is calculated for every interval.

Table 3 shows the number of multiplexed intervals that occur in the full execution of each workload, where an interval is defined to contain one time slice during which one of the multiplexed events is counted while the remaining event counts are estimated. The coefficient of variation is calculated across the three execution runs of each benchmark in multiplexed mode. A small sketch of the per-interval error and coefficient-of-variation computations follows.
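The short sketch below illustrates, with made-up numbers, how the per-interval estimation error of Step 4 and the coefficient of variation reported in Table 3 can be computed; the variable names and values are ours, not taken from the measurement harness.

#include <math.h>
#include <stdio.h>

#define RUNS 3   /* each workload is executed three times per mode */

/* Coefficient of variation = standard deviation / mean. */
static double coeff_of_variation(const double x[RUNS])
{
    double mean = 0.0, var = 0.0;
    for (int r = 0; r < RUNS; r++) mean += x[r] / RUNS;
    for (int r = 0; r < RUNS; r++) var += (x[r] - mean) * (x[r] - mean) / RUNS;
    return sqrt(var) / mean;
}

int main(void)
{
    /* One multiplexed interval: estimated count vs. the sum of the six
     * non-multiplexed slice counts (Step 2). Values are illustrative. */
    double count_m  = 9.6e5;
    double count_nm = 1.0e6;
    printf("interval error = %.1f%%\n", 100.0 * fabs(count_m - count_nm) / count_nm);

    /* Total counts observed in the three multiplexed runs (illustrative). */
    double totals[RUNS] = { 1.02e9, 0.99e9, 1.01e9 };
    printf("coefficient of variation = %.4f\n", coeff_of_variation(totals));
    return 0;
}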
Figure 2: Data format of non-multiplexed and multiplexed counters at regular time slices.

Figure 2 shows the format of the data obtained by reading the counters (in non-multiplexed and multiplexed modes) at regular time slices. cyc_{i(k)} is the total number of cycles elapsed, starting from the instant the code begins to be instrumented; i is the interval in which each of the multiplexed events is physically monitored once by the PMC, and k is the time slice (finest granularity) for which a counter accumulates event occurrences (after the counter is reset at the end of time slice k-1).

Therefore, if n events are being multiplexed, then k can take values from 1 to n. For simplicity, we assume cyc_{i(n)} to be equivalent to cyc_{(i+1)(0)}.

Figure 3: Conversion of event counts to respective rates in multiplexed mode.

Figure 3 shows the rate plot of all the events measured in the multiplexed mode. Some important variables (at the i-th interval and k-th time slice) used in this paper are listed below:

  n               = the number of events being multiplexed (n = 6 in our study)
  k               = the time slice during which the event count is sampled
  rate^m_i, rate^nm_{i(k)} = rate of occurrence of an event in multiplexed and non-multiplexed mode, respectively
  count^m_{i(n)}  = the number of times an event has occurred in the i-th interval, in the time slice between cyc_{i(n-1)} and cyc_{i(n)}
  count^nm_{i(k)} = the number of times an event has occurred in the k-th time slice of the i-th interval (the period between cyc_{i(k-1)} and cyc_{i(k)})
  slope_i         = slope of the rate between the (i-1)-th and i-th intervals

    rate^m_i = count^m_{i(n)} / (cyc_{i(n)} - cyc_{i(n-1)})                       (1)

    rate^nm_{i(k)} = count^nm_{i(k)} / (cyc_{i(k)} - cyc_{i(k-1)}),  1 <= k <= n  (2)

    count^nm_i = sum_{k=1}^{n} count^nm_{i(k)}                                    (3)

    slope_i = (rate^m_i - rate^m_{i-1}) / (cyc_{i(n)} - cyc_{(i-1)(n)})           (4)

We discuss the estimation algorithms in the following sections.
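These quantities translate directly into a few helper routines. The sketch below is an illustrative rendering of Eqns. (1), (2), and (4); the argument names are ours and the numbers in main are arbitrary (Eqn. (3) is simply the sum of the six slice counts of a non-multiplexed interval).

#include <stdio.h>

/* Eqns. (1)/(2): an occurrence rate is a slice count divided by the cycles the slice spans. */
static double rate(long long count, long long cyc_end, long long cyc_start)
{
    return (double)count / (double)(cyc_end - cyc_start);
}

/* Eqn. (4): slope of the rate between intervals i-1 and i. */
static double slope(double rate_i, double rate_prev,
                    long long cyc_i_n, long long cyc_prev_n)
{
    return (rate_i - rate_prev) / (double)(cyc_i_n - cyc_prev_n);
}

int main(void)
{
    /* One event, two consecutive intervals; cycle stamps follow Figure 2. */
    long long cyc_prev_n = 600, cyc_i_n = 1200;   /* interval boundaries (cycles)      */
    double r_prev = rate(80, 600, 500);           /* Eqn. (1) for the slice ending i-1 */
    double r_i    = rate(120, 1200, 1100);        /* Eqn. (1) for the slice ending i   */

    printf("rate_{i-1} = %.2f  rate_i = %.2f  slope_i = %.5f\n",
           r_prev, r_i, slope(r_i, r_prev, cyc_i_n, cyc_prev_n));
    return 0;
}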

4.1 Base Algorithm
The estimation algorithm used in PAPI (henceforth called the base algorithm) was developed and implemented by May [15]. It is used to estimate the counts of the multiplexed event in each interval.

Figure 4: Event-count calculation of a multiplexed event using the base algorithm.

Consider the case shown in Figures 2 and 3. We discuss the base algorithm used to estimate the count of event A in the i-th interval. Event A is monitored in the time slice k = 4 (from total cycles cyc_{i(3)} to cyc_{i(4)}). If count^m_{i(4)} is the number of occurrences of event A in this time slice, then the rate of event A can be calculated using Eqn. 1 as:

    rate^m_i = count^m_{i(4)} / (cyc_{i(4)} - cyc_{i(3)})                         (5)

Figure 4 shows the plot of rate versus total cycles for event A alone. The rates of event A, rate^m_i and rate^m_{i-1}, corresponding to intervals i and i-1 respectively, can be calculated using Eqn. 5. In the base algorithm, rate^m_i is assumed to be constant for the entire i-th interval, and the count of event A is estimated by:

    count^m_i ≈ rate^m_i * (cyc_{i(n)} - cyc_{(i-1)(n)})                          (6)

Recall that rate^m_i is calculated using the data corresponding to the period between i(n-1) and i(n) (one time slice), whereas the count between (i-1)(n) and i(n) (one interval) is being estimated.

4.2 Trapezoid-area Method
Figure 5: Event-count calculation of a multiplexed event using the trapezoid-area method.

Figure 5 shows the plot of rate versus total cycles for a multiplexed event A. The rates of event A, rate^m_i and rate^m_{i-1}, corresponding to intervals i and i-1 respectively, are calculated using Eqn. 5. In the trapezoid-area method, the rate of occurrence of event A is assumed to change linearly within an interval. Thus, the estimated count of the multiplexed event A in the i-th interval is given by the area under the trapezoid PQRS (see Figure 5). Mathematically,

    count^m_i ≈ 0.5 * (rate^m_i + rate^m_{i-1}) * (cyc_{i(n)} - cyc_{(i-1)(n)})   (7)

4.3 Divided-interval Rectangular-area Method
We describe a simple algorithm, the divided-interval rectangular-area method, for estimating the count of an event A.

Figure 6: Event-count calculation of a multiplexed event using the divided-interval rectangular-area method.

Figure 6 shows the plot of rate versus total cycles for a multiplexed event A. The rates rate^m_i and rate^m_{i-1}, corresponding to intervals i and i-1 respectively, are calculated using Eqn. 5. The algorithm proceeds in the following steps (a code sketch of the estimators of Sections 4.1-4.3 follows this list):

1. The i-th interval (where the count is being estimated) is divided into j equal parts (see Figure 6). The rate at the k-th division is calculated using linear interpolation:

       rate^m_{i(k)} = slope_i * (cyc_{i(k)} - cyc_{(i-1)(n)}) + rate^m_{i-1}     (8)

   where slope_i is given by Eqn. 4.

2. The area corresponding to the k-th division is calculated by assuming the rate to remain constant between cycles cyc_{i(k-1)} and cyc_{i(k)}. Thus, the area under rectangle PQRS in Figure 6 is given by:

       count^m_{i(k)} ≈ rate^m_{i(k)} * (cyc_{i(k)} - cyc_{i(k-1)})               (9)

   where the value of rate^m_{i(k)} is obtained from Eqn. 8.

3. Steps 1 and 2 are repeated for 1 <= k <= j.

4. The estimated count of the multiplexed event A in the i-th interval is given by:

       count^m_i = sum_{k=1}^{j} count^m_{i(k)}                                   (10)

   In our case, j = n (= 6).
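The sketch below implements the three estimators of Sections 4.1-4.3 exactly as given by Eqns. (6), (7), and (8)-(10). Function and parameter names are our own; the rates are the per-cycle occurrence rates of Eqn. (5), and cyc[] holds the cumulative cycle stamps of Figure 2, with cyc[0] being the end of the previous interval.

#include <stdio.h>

/* Eqn. (6), base algorithm: hold the most recent rate constant over the interval. */
static double est_base(double rate_i, long long cyc_i_n, long long cyc_prev_n)
{
    return rate_i * (double)(cyc_i_n - cyc_prev_n);
}

/* Eqn. (7), trapezoid-area method: assume the rate changes linearly over the interval. */
static double est_trapezoid(double rate_i, double rate_prev,
                            long long cyc_i_n, long long cyc_prev_n)
{
    return 0.5 * (rate_i + rate_prev) * (double)(cyc_i_n - cyc_prev_n);
}

/* Eqns. (8)-(10), divided-interval rectangular-area method: split the interval
 * at the recorded slice boundaries cyc[0..j] and sum one rectangle per division,
 * using the linearly interpolated rate at each division's right edge. */
static double est_divided_rect(double rate_i, double rate_prev,
                               const long long cyc[], int j)
{
    double slope = (rate_i - rate_prev) / (double)(cyc[j] - cyc[0]);   /* Eqn. (4)  */
    double total = 0.0;
    for (int k = 1; k <= j; k++) {
        double rate_k = slope * (double)(cyc[k] - cyc[0]) + rate_prev; /* Eqn. (8)  */
        total += rate_k * (double)(cyc[k] - cyc[k - 1]);               /* Eqn. (9)  */
    }
    return total;                                                      /* Eqn. (10) */
}

int main(void)
{
    /* Illustrative numbers: six slice boundaries within one interval. */
    long long cyc[7] = { 0, 100, 200, 300, 400, 500, 600 };
    double rate_prev = 0.8, rate_i = 1.2;   /* events per cycle in intervals i-1 and i */

    printf("base             : %.1f\n", est_base(rate_i, cyc[6], cyc[0]));
    printf("trapezoid        : %.1f\n", est_trapezoid(rate_i, rate_prev, cyc[6], cyc[0]));
    printf("divided-interval : %.1f\n", est_divided_rect(rate_i, rate_prev, cyc, 6));
    return 0;
}

With the linearly increasing rate used in main, the base algorithm overestimates the interval count (720 against the trapezoid's 600), which is exactly the kind of error the improved estimators are meant to reduce.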

4.4 Positional Mean Error (PME)
The Positional Mean Error (PME) algorithm is a two-phase algorithm. Phase 1 involves calculating the rate corrections, or the positional mean errors, and Phase 2 consists of using the PMEs to correct the multiplexed rates and estimate the event count. A code sketch of both phases is given after Section 4.5.

Figure 7: Event-count calculation of a multiplexed event using the Positional Mean Error method.

The following steps comprise Phase 1:

1. The rate of event A in multiplexed mode at the k-th position, rate^m_{i(k)}, is calculated using linear interpolation:

       rate^m_{i(k)} = slope_i * (cyc_{i(k)} - cyc_{(i-1)(n)}) + rate^m_{i-1}     (11)

   where slope_i is given by Eqn. 4.

2. The difference between the rate of event A in non-multiplexed mode, rate^nm_{i(k)}, and the rate calculated in Step 1 is:

       e_k = rate^nm_{i(k)} - rate^m_{i(k)}                                       (12)

   This is shown in Figure 7; rate^nm_{i(k)} is given by Eqn. 2. This difference is the positional error (for position k) in the i-th interval and is calculated for 1 <= k <= n and every interval i (n = 6 in our case).

3. The PME is then given by:

       pme_k = (1 / i_total) * sum_i e_k                                          (13)

   where pme_k is the positional mean error for the k-th position and i_total is the total number of intervals.

Phase 1 produces n PMEs that are used in Phase 2 for estimating the event counts. Phase 2 includes the following steps:

1. Same as Step 1 of Phase 1.

2. Calculate the corrected rate at the k-th position, for every k:

       c_rate^m_{i(k)} = rate^m_{i(k)} + pme_k                                    (14)

3. Assuming a linear rate between corrected positional rates, the count in the k-th slice division is estimated using the trapezoid-area method discussed in Section 4.2:

       count^m_{i(k)} ≈ 0.5 * (c_rate^m_{i(k)} + c_rate^m_{i(k-1)}) * (cyc_{i(k)} - cyc_{i(k-1)})   (15)

4. The estimated count in the i-th interval is then given by:

       count^m_i = sum_{k=1}^{n} count^m_{i(k)}                                   (16)

4.5 Multiple Linear Regression Model (MLR)
The Multiple Linear Regression (MLR) model allows prediction of a response variable as a function of predictor variables using a linear model [13]. In vector notation, it is given by:

       y = Xb + e                                                                 (17)

where
   y = a column vector of the non-multiplexed counts, aggregated in the respective multiplexed intervals;
   X = a matrix in which each column element is a divided-interval trapezoidal area, as shown in Figure 8, and each row corresponds to a particular interval;
   b = the predictor parameters.

Figure 8: Event-count calculation of a multiplexed event using the MLR method.

Hence, the multiplexed sub-interval areas are represented as a linear model of the actual (non-multiplexed) count. The predictor parameter estimate is given by:

       b = (X^T X)^(-1) (X^T y)                                                   (18)

The estimated parameters are then used to scale the trapezoid areas in an interval, and the sum of the scaled areas is the estimated multiplexed count of that interval. Mathematically,

       scaled_x_k = b[k] * x_k                                                    (19)

       count^m_i = sum_{k=1}^{j} scaled_x_k                                       (20)

For our study, the sample size is 0.5 of the population size.
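As a companion to Section 4.4, the following sketch walks through both PME phases. The per-interval arrays of multiplexed rates, calibration (non-multiplexed) rates, and cycle stamps are assumed layouts for this illustration; the corrected rate at the interval's left edge (position 0), which the description leaves implicit, is simply taken to be the uncorrected rate.

#include <stdio.h>

#define N 6   /* number of multiplexed events / time slices per interval */

/* Eqn. (11): linearly interpolated multiplexed rate at position k of an interval.
 * cyc[0] denotes the end of the previous interval, cyc_{(i-1)(n)}. */
static double interp_rate(double rate_i, double rate_prev,
                          const long long cyc[N + 1], int k)
{
    double slope = (rate_i - rate_prev) / (double)(cyc[N] - cyc[0]);   /* Eqn. (4) */
    return slope * (double)(cyc[k] - cyc[0]) + rate_prev;
}

/* Phase 1, Eqns. (12)-(13): average the positional errors over the intervals.
 * The first interval is skipped because it has no predecessor rate. */
static void compute_pme(int n_intervals, const double rate_m[],
                        double rate_nm[][N + 1], long long cyc[][N + 1],
                        double pme[N + 1])
{
    for (int k = 1; k <= N; k++) {
        double sum = 0.0;
        for (int i = 1; i < n_intervals; i++) {
            double r = interp_rate(rate_m[i], rate_m[i - 1], cyc[i], k);
            sum += rate_nm[i][k] - r;                       /* Eqn. (12) */
        }
        pme[k] = sum / (double)(n_intervals - 1);           /* Eqn. (13) */
    }
}

/* Phase 2, Eqns. (14)-(16): corrected rates, then trapezoid areas per slice. */
static double estimate_pme(double rate_i, double rate_prev,
                           const long long cyc[N + 1], const double pme[N + 1])
{
    double total = 0.0;
    double prev = rate_prev;   /* corrected rate at the left edge (assumption) */
    for (int k = 1; k <= N; k++) {
        double cur = interp_rate(rate_i, rate_prev, cyc, k) + pme[k];  /* Eqn. (14) */
        total += 0.5 * (cur + prev) * (double)(cyc[k] - cyc[k - 1]);   /* Eqn. (15) */
        prev = cur;
    }
    return total;                                                      /* Eqn. (16) */
}

int main(void)
{
    /* Three intervals of synthetic data, just to exercise both phases. */
    long long cyc[3][N + 1] = {
        { -600, -500, -400, -300, -200, -100, 0 },
        { 0, 100, 200, 300, 400, 500, 600 },
        { 600, 700, 800, 900, 1000, 1100, 1200 },
    };
    double rate_m[3] = { 0.8, 1.0, 1.2 };        /* multiplexed-mode rates          */
    double rate_nm[3][N + 1] = { { 0 } };        /* calibration (non-mux) rates     */
    double pme[N + 1] = { 0 };

    for (int i = 1; i < 3; i++)
        for (int k = 1; k <= N; k++)
            rate_nm[i][k] = rate_m[i] + 0.05;    /* pretend the true rate is higher */

    compute_pme(3, rate_m, rate_nm, cyc, pme);
    printf("PME-corrected count for interval 2: %.1f\n",
           estimate_pme(rate_m[2], rate_m[1], cyc[2], pme));
    return 0;
}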

Table 4: Acronyms for the multiplexed events monitored.

  L1 data cache misses         dcm
  L1 data cache accesses       dca
  Instructions committed       ins
  L1 load misses               ldm
  L1 store misses              stm
  Conditional branches taken   brtkn

5. RESULTS
The error in the base algorithm is computed by comparing the estimated multiplexed counts to the non-multiplexed counts of the same event. We distributed the errors into groups: less than 5%, 5-10%, 10-50%, and above 50%. Table 4 lists the acronyms used for the six events being multiplexed, and Figures 9 and 11 show the histograms for the integer and floating-point workloads, respectively. The histograms indicate that a very high percentage of interval counts in multiplexed mode are inaccurately estimated. This behavior is observed for all the integer (with the exception of twolf) and floating-point workloads. For instance, when estimating the count of store misses (stm) in mcf (Figure 9(b)), as much as 50% of the intervals were estimated with above 50% error and 42% of the intervals were estimated with 5-50% error. Almost every event in every workload has at least 50% of its intervals estimated with more than 5% error. Similar behavior is observed in the floating-point workloads, with as many as 93% of the intervals estimated with greater than 50% error for the store misses in equake (Figure 11(c)). This shows that the estimation accuracy of the base algorithm is very low.

We show the results of applying the algorithms described in Sections 4.1 through 4.5 in Figures 10 and 12, which describe the accuracy of each algorithm for each event in terms of error. The error for each algorithm is computed by comparing the estimated multiplexed counts to the non-multiplexed counts of the same event. The total absolute error, sum_k |count^m_k - count^nm_k|, is computed for all the algorithms and compared with that of the base algorithm. Each data point is normalized to the error computed for the base algorithm (a small sketch of this normalization follows at the end of this section); values less than 1 indicate less error in the estimation of the multiplexed counts, and the lower the normalized value, the better the algorithm.

For all the benchmarks, the proposed estimation algorithms result in decreased error compared to the base algorithm for every event. For the benchmark crafty, the reduction in error varies between 10% and 40% over the set of events, as shown in Figure 10(a). For data cache misses (dcm) and load misses (ldm), the best of the proposed methods reduce the error by around 40%, while the other methods in general yield smaller reductions, around 10% for crafty. Similar improvements are observed for the floating-point benchmarks shown in Figure 12. The error reduction varies between 7% and 40% for all the floating-point workloads across the six events. One of the proposed algorithms proved to be the best for equake, producing an estimation-error reduction of almost 40% for the store misses (stm).
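For reference, each normalized bar in Figures 10 and 12 can be formed as below: sum the absolute per-interval error of an estimator and divide by the same sum for the base algorithm. The arrays are illustrative placeholders, not measured data.

#include <math.h>
#include <stdio.h>

/* Total absolute error of an estimator over all intervals. */
static double total_abs_error(const double est[], const double actual[], int n)
{
    double sum = 0.0;
    for (int k = 0; k < n; k++)
        sum += fabs(est[k] - actual[k]);
    return sum;
}

int main(void)
{
    double actual[] = { 100, 120, 90, 110 };   /* non-multiplexed counts            */
    double base[]   = { 140, 80, 150, 60 };    /* base-algorithm estimates          */
    double other[]  = { 115, 108, 101, 95 };   /* estimates from a proposed method  */

    /* Values below 1.0 mean the proposed method beats the base algorithm. */
    double norm = total_abs_error(other, actual, 4) / total_abs_error(base, actual, 4);
    printf("normalized total absolute error = %.3f\n", norm);
    return 0;
}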
6. CONCLUSIONS AND FUTURE WORK
The algorithms discussed in this paper reduce the estimation error for all the multiplexed events and all the workloads. Improvements of up to 40% are achieved by two of the proposed algorithms and up to 30% by the other two. Utilizing any of these techniques will greatly reduce the estimation errors of the multiplexed counts. PME and MLR require a pre-calculated library of correction parameters (corresponding to each event-workload pair) for their implementation, whereas the trapezoid-area and divided-interval methods are more generic and independent of any event or workload.

Since the interval size is defined by time (10 milliseconds in our case), the event counts cannot be collected at a specified cycle. Therefore, it is difficult to collect cycle-synchronized performance metrics that can provide a complete snapshot of program behavior. We plan to address this in the future by incorporating an algorithm that we have developed into the techniques discussed in this paper.

Figure 9: Histogram plot showing the distribution of errors in estimating multiplexed event counts, integer benchmarks: (a) crafty, (b) mcf, (c) parser, (d) twolf, (e) vortex, (f) vpr.

Figure 10: Total absolute error of the estimated multiplexed counts (normalized to the base algorithm), integer benchmarks: (a) crafty, (b) mcf, (c) parser, (d) twolf, (e) vortex, (f) vpr.

Figure 11: Histogram plot showing the distribution of errors in estimating multiplexed event counts, floating-point benchmarks: (a) art, (b) ammp, (c) equake.

Figure 12: Total absolute error of the estimated multiplexed counts (normalized to the base algorithm), floating-point benchmarks: (a) art, (b) ammp, (c) equake.

7. REFERENCES

[1] Intel architecture software developer's manual, volume 3: System programming guide. Intel document number 243192.
[2] Perf-monitor for UltraSPARC. mch/perf-monitor/.
[3] SIMICS.
[4] The SimOS complete system simulator.
[5] VTune profiling software.
[6] DIGITAL continuous profiling infrastructure project. Oct. 1997.
[7] J. Anderson, L. Berc, J. Dean, S. Ghemawat, M. Henzinger, S. Leung, D. Sites, M. Vandevoorde, C. Waldspurger, and W. Weihl. Continuous profiling: Where have all the cycles gone?, 1997.
[8] R. Berrendorf, H. Ziegler, and B. Mohr. PCL - the performance counter library: A common interface to access hardware performance counters on microprocessors. Research Centre Juelich GmbH.
[9] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci. A portable programming interface for performance evaluation on modern processors. The International Journal of High Performance Computing Applications, 14(3):189-204, Fall 2000.
[10] J. Dongarra, K. London, S. Moore, P. Mucci, and D. Terpstra. Using PAPI for hardware performance monitoring on Linux systems. In Conference on Linux Clusters: The HPC Revolution, June 2001.
[11] J. Gibson, R. Kunz, D. Ofelt, and M. Heinrich. FLASH vs. (simulated) FLASH: Closing the simulation loop. In Architectural Support for Programming Languages and Operating Systems, pages 49-58, 2000.
[12] D. Heller. Rabbit: A performance counters library for Intel/AMD processors and Linux.
[13] R. Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation and Modeling. John Wiley & Sons, Inc., 1991.
[14] F. E. Levine and C. P. Roth. A programmer's view of performance monitoring in the PowerPC microprocessor. IBM Journal of Research and Development, 41(3), May 1997.
[15] J. M. May. MPX: Software for multiplexing hardware performance counters in multithreaded programs. In Proceedings of the 15th International Parallel and Distributed Processing Symposium, April 2001.
[16] A. K. Ojha. Techniques in least-intrusive computer system performance monitoring. In Proceedings of IEEE SoutheastCon, 2001.
[17] M. Pettersson. Linux x86 performance-monitoring counters driver. mikpe/linux/perfctr/.
[18] K. Skadron, M. Martonosi, D. I. August, M. D. Hill, D. J. Lilja, and V. S. Pai. Challenges in computer architecture evaluation. IEEE Computer, August 2003.
[19] Standard Performance Evaluation Corporation (SPEC).


More information

Distributed Systems. Virtualization. Paul Krzyzanowski pxk@cs.rutgers.edu

Distributed Systems. Virtualization. Paul Krzyzanowski pxk@cs.rutgers.edu Distributed Systems Virtualization Paul Krzyzanowski pxk@cs.rutgers.edu Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Virtualization

More information

A Lab Course on Computer Architecture

A Lab Course on Computer Architecture A Lab Course on Computer Architecture Pedro López José Duato Depto. de Informática de Sistemas y Computadores Facultad de Informática Universidad Politécnica de Valencia Camino de Vera s/n, 46071 - Valencia,

More information

Attaining EDF Task Scheduling with O(1) Time Complexity

Attaining EDF Task Scheduling with O(1) Time Complexity Attaining EDF Task Scheduling with O(1) Time Complexity Verber Domen University of Maribor, Faculty of Electrical Engineering and Computer Sciences, Maribor, Slovenia (e-mail: domen.verber@uni-mb.si) Abstract:

More information

Per-core Power Estimation and Power Aware Scheduling Strategies for CMPs

Per-core Power Estimation and Power Aware Scheduling Strategies for CMPs Per-core Power Estimation and Power Aware Scheduling Strategies for CMPs Master of Science Thesis in Integrated Electronic System Design BHAVISHYA GOEL Chalmers University of Technology University of Gothenburg

More information

Operating Systems: Basic Concepts and History

Operating Systems: Basic Concepts and History Introduction to Operating Systems Operating Systems: Basic Concepts and History An operating system is the interface between the user and the architecture. User Applications Operating System Hardware Virtual

More information

Going Linux on Massive Multicore

Going Linux on Massive Multicore Embedded Linux Conference Europe 2013 Going Linux on Massive Multicore Marta Rybczyńska 24th October, 2013 Agenda Architecture Linux Port Core Peripherals Debugging Summary and Future Plans 2 Agenda Architecture

More information

Precise and Accurate Processor Simulation

Precise and Accurate Processor Simulation Precise and Accurate Processor Simulation Harold Cain, Kevin Lepak, Brandon Schwartz, and Mikko H. Lipasti University of Wisconsin Madison http://www.ece.wisc.edu/~pharm Performance Modeling Analytical

More information

Real-Time Systems Prof. Dr. Rajib Mall Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Real-Time Systems Prof. Dr. Rajib Mall Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Real-Time Systems Prof. Dr. Rajib Mall Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No. # 26 Real - Time POSIX. (Contd.) Ok Good morning, so let us get

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

EEM 486: Computer Architecture. Lecture 4. Performance

EEM 486: Computer Architecture. Lecture 4. Performance EEM 486: Computer Architecture Lecture 4 Performance EEM 486 Performance Purchasing perspective Given a collection of machines, which has the» Best performance?» Least cost?» Best performance / cost? Design

More information

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007 Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

More information