Measuring Energy and Power with PAPI
Vincent M. Weaver, Matt Johnson, Kiran Kasichayanula, James Ralph, Piotr Luszczek, Dan Terpstra, and Shirley Moore
Innovative Computing Laboratory, University of Tennessee

Abstract: Energy and power consumption are becoming critical metrics in the design and usage of high performance systems. We have extended the Performance API (PAPI) analysis library to measure and report energy and power values. These values are reported using the existing PAPI API, allowing code previously instrumented for performance counters to also measure power and energy. Higher-level tools that build on PAPI will automatically gain support for power and energy readings when used with the newest version of PAPI. We describe in detail the types of energy and power readings available through PAPI. We support external power meters, as well as values provided internally by recent CPUs and GPUs. Measurements are provided directly to the instrumented process, allowing immediate code analysis in real time. We provide examples showing results that can be obtained with our infrastructure.

Index Terms: energy measurement; power measurement; performance analysis

I. INTRODUCTION

The Performance API (PAPI) [1] framework has traditionally provided low-level cross-platform access to the hardware performance counters available on most modern CPUs. With the advent of component PAPI (PAPI-C) [2], PAPI has been extended to provide a wider variety of performance data from various sources. Recently a number of new components have been added that provide the ability to measure a system's energy and power usage.
Energy and power have become increasingly important components of overall system behavior in high-performance computing (HPC). Power and energy concerns were once primarily of interest to embedded developers.
Now that HPC machines have hundreds of thousands of cores [3], the ability to reduce consumption by just a few Watts per CPU quickly adds up to major power, cooling, and monetary savings. There has been a lot of HPC interest in this area recently, including the Green 500 [4] list of energy-efficient supercomputers.
PAPI's ability to be extended by components allows adding support for energy and power measurements without any changes needed to the core infrastructure. Existing code that is already instrumented for measuring performance counters can be re-used; the new power and energy events will show up in event listings just like other performance events, and can be measured with the same existing PAPI API. This will allow current users of PAPI on HPC systems to analyze power and energy with little additional effort.
There are many existing tools that provide access to power and energy measurements (often these come with the power measuring hardware). PAPI's advantage is that it allows measuring a diverse set of hardware with one common interface. Users only instrument their code once, and then can use it with minimal changes as their code is moved between different machines with different hardware. Without PAPI the instrumented code would have to be re-written depending on what power measurement hardware it is running on.
Another benefit of PAPI is that in addition to measuring energy and power, it also provides access to other values, such as CPU performance counters, GPU counters, network, and I/O. All of these can be measured at the same time, providing for a richer analysis environment. Many of the other advanced PAPI features, such as sampling and profiling, can potentially be used in conjunction with these new power and energy events. Higher-level tools that build on top of PAPI (such as TAU [5], HPCToolkit [6], or Vampir [7]) automatically get support for these new measurements as soon as they are paired with an updated PAPI version.
We will describe in detail the various types of power and energy measurements that will be available in the PAPI 5.0 release, as well as show examples of the data that can be gathered.

II. RELATED WORK

There are various existing tools that provide access to power and energy values. In general these tools do not have a cross-platform API like PAPI, nor are they deployed as widely. PAPI has the benefit of allowing energy measurements at the same time as CPU and other performance counter measurements, allowing analysis of low-level energy behavior at the source code level. PAPI can also act as an abstraction library, so most of the tools listed below could be given PAPI component interfaces.
The tool that provides the most similar functionality to PAPI is the Intel Energy Checker SDK [8]. It provides an API for instrumenting code and gathering energy information from a variety of external power meters and system counters. It supports various operating systems, but is limited to Intel architectures.
PowerPack [9] provides an interface for measuring power from a variety of external power sources. The API provides routines for starting and stopping the gathering of data on the remote machine. Unlike PAPI, the measurements are gathered out-of-band (on a separate machine) and thus cannot be provided directly to the running process in real time.
IBM PowerExecutive [10] allows monitoring power and energy on IBM blade servers. As with PowerPack, the data is gathered and analyzed by a tool (in this case IBM Director) running on a separate machine.
Shin et al. [11] construct a power board for an ARM system that estimates power and communicates with a front-end tool via PCI. Various tools are described that use the gathered information, but there is not a generic API for accessing it.
The Linux Energy Attribution and Accounting Platform (LEA²P) [12] acquires data on a system with hardware custom-modified to provide power readings via a data acquisition board. These values are passed into the Linux kernel, made available via the /proc filesystem, and can be read in-band.
PowerScope [13] uses a digital multimeter to perform offline analysis using statistical sampling. It provides a kernel-level interface (via system calls) to start and stop measurements; this requires modifying the operating system. The benefit of this system is that power information is kept in the process table, allowing one to map energy usage in a detailed per-process way.
The Energy Endoscope [14] is an embedded wireless sensor network that provides detailed real-time energy measurements via a custom-designed helper chip. The Linux kernel is modified to report energy in /proc/stat along with other processor stats.
Isci and Martonosi [15] combine external power meter measurements with performance counter results to generate power readings with a modeled CPU. The readings are gathered on an external machine.
Bellosa [16] proposes Joule Watcher, an infrastructure that uses hardware performance counters to estimate power and provide this information to the kernel for scheduling decisions. He also proposes a generic API to provide this information to users.

III. BACKGROUND

PAPI users have recently become more concerned with energy and power measurements.
Part of this is due to the addition of embedded system support (including ARM and MIPS processors), and part is from the current interest in energy efficiency in PAPI's traditional HPC environment. With PAPI-C (component PAPI) it is straightforward to add extra PAPI components that report values outside of the usual hardware performance counters that were long the mainstay of PAPI. The PAPI API returns unsigned 64-bit integers; as long as a power or energy value can fit that constraint, no changes at all need to be made to existing PAPI code.

A. New PAPI Interfaces

The existing PAPI interface is sufficient for providing power and energy values, but the recent PAPI 5.0 release adds many features that improve the collection of this information. The most important new feature is enhanced event information support. The user can query an event and obtain far richer details than were available previously. The new interface allows specifying units for a returned value, letting a user know whether the values they are getting are in Watts, Joules, or perhaps even nanojoules without having to look in the system documentation. Another new feature is the ability to return values other than unsigned integers, including floating point. This allows returning power values in human-friendly amounts such as 9.645 Watts rather than 9645 milliwatts. Additional event information is provided that will help external tools analyze the results, especially when trying to correlate power results with other measurements. PAPI now provides the frequency with which the value is updated and whether the value returned is instantaneous (like an average power reading) or cumulative (total energy).

B. Limitations

There are some limitations when measuring power and energy using PAPI. Typically these readings are system-wide: it is not possible to map the results exactly to the user's code, especially on multi-core systems.
Often a user is interested in knowing where the power usage comes from: power supply inefficiencies, the CPU, network card, memory, etc. With external power meters it is not possible to break down the full-system power measurements into per-component values. Since power optimization for the various hardware components requires different strategies, having only total system power might not provide enough information to allow optimization.
Ideally one could correlate power and energy with CPU and other PAPI measurements. This can be done; values can be measured at the same time (although in separate event sets). However, due to the nature of the measurements it is hard to get an exact correlation.
Another issue is that of measurement overhead. Since PAPI has to run on the system gathering the results, it contributes to the overall power budget of the system. Tools that measure power externally do not have this problem.

IV. PAPI ENERGY AND POWER COMPONENTS

The new PAPI 5.0 release adds support for various power and energy components. PAPI components measure power and energy in-band: a program is instrumented with PAPI calls and can read measurement data into the running process. The data can be stored to disk for later offline analysis, but by default it is available for immediate action. This contrasts with other tools that only support out-of-band measurements: they can only analyze code at a later time, and the program being profiled is not aware of its current power or energy status.
We use linear algebra routines that perform one-sided factorization of dense matrices to compare various methods of measuring energy. In particular, we test Cholesky factorization from PLASMA [17] on the processor side and LU factorization on the GPU using MAGMA [18]. Both of these are computationally bound and thus show variable power draw by the computing device: either CPU or GPU. Our tests also show memory effects by including memory-bound operations such as filling the matrices with initial values.
Fig. 1. PLASMA Cholesky power usage (CPU, memory, motherboard, and fan) gathered by PowerPack (not PAPI). Results were gathered out-of-band; PAPI can gather similar data in-band.

For comparison purposes, Figure 1 shows PLASMA Cholesky results gathered with PowerPack [9] (not PAPI) on a machine custom-wired for power measurement. Results are gathered on a separate machine (which has the advantage of not including the overhead of the measurement in the power readings). We show that PAPI can generate similar results from a variety of power measurement devices.

A. External Measurement

The most common type of power measurement infrastructure is one where an external power meter is used. For PAPI to access the data, the values have to be passed back to the machine being measured. This is usually done via a serial or USB connection. The easiest type of equipment to use in this case is a power pass-through device; it looks like a power strip, and allows measuring the power consumption of anything plugged into it. More intrusive full-system instrumentation can be done, where wires are hooked into power supplies, disks, processor sockets, and DIMM sockets. This enables fine-grained power measurement but usually incurs extensive installation costs.

1) Watt's Up Pro Power Meter: The Watt's Up Pro power meter is an external measurement device that a system plugs into instead of a wall outlet; it provides various measurements via a USB serial connection. The metrics collected include average power, voltage, current, and various others. Energy can be derived based on the average power and time. The results are system-wide and low resolution, with updates only once a second. Writing a PAPI driver for this device is nontrivial, as the results become available every second whether requested or not. Data can potentially be lost if the on-board logging memory is full and a read does not happen in the one-second time window.
Since PAPI users cannot be expected to have their code interrupt itself once a second to measure data, the PAPI component forks a helper thread that reads the data on a regular basis and then returns overall values when an instrumented program requests them.
Some data gathered from a Watt's Up Pro device are shown in Figure 2. The results are coarse due to the one-second sampling frequency of the device. This can be good enough for validation and global investigations, but is probably not detailed enough when tuning code for energy efficiency. However, the general trends in power consumption for the code in question (Cholesky factorization from PLASMA [17]) are similar to the much finer-grained graph in Figure 1. In Figure 2 the initial spike in power consumption to about 50 W (two seconds into the run) represents data generation (creation of a random matrix) and corresponds to a flat ledge at about 130 W in Figure 1. Four seconds into the run, both figures indicate a fluctuation around the maximum power level for the whole run. The fluctuations are much more accurately portrayed in Figure 1, indicating the need for granularity substantially finer than the 1 second available from the Watt's Up Pro device.

2) PowerMon 2: The PowerMon 2 [19] card sits between a system's power supply and its various components. It measures voltage and current on 8 different lines, monitoring most of the power going into the computer. Measurements happen at a frequency of up to 3 kHz; this is multiplexed across a user-selected subset of the 8 channels. We are working on a PAPI component for this device, but support is currently not available. We foresee using this device to provide energy results at a level of detail not available with other external power meters.

B. Internal Measurement

Recent computer hardware includes support for measuring energy and power consumption internally. This allows fine-grained power analysis without having to custom-instrument the hardware.
Fig. 2. PLASMA Cholesky factorization (N=1, threads=2) power gathered with a Watt's Up Pro device on an Intel Core2 laptop. Results are coarse due to the one-second sampling frequency.

Access to the measurements usually requires direct low-level hardware reads, although sometimes the operating system or a library will do this for you.

1) Intel RAPL: Recent Intel SandyBridge chips include the Running Average Power Limit (RAPL) interface, which is described in the Intel Software Developer's Manual [20]. RAPL's overall design goal is to provide an infrastructure for keeping processors inside a given user-specified power envelope. The internal circuitry can estimate current energy usage based on a model driven by hardware counters, temperature, and leakage models. The results of this model are available to the user via a model-specific register (MSR), with an update frequency on the order of milliseconds. The power model has been validated by Intel [21] to closely follow the actual energy being used. PAPI provides access to the values returned by the power model.
Accessing MSRs requires ring-0 access to the hardware; typically only the operating system kernel can do this. This means accessing the RAPL values requires a kernel driver. Currently Linux does not provide such a driver; one has been proposed [22] but it is unlikely to be merged into the main kernel tree any time soon. To get around this problem, we use the Linux MSR driver, which exports MSR access to userspace via a special device file. If the MSR driver is enabled and given proper read-only permissions, then PAPI can access these registers directly without needing kernel support.
There are some limitations to accessing RAPL this way. The results are system-wide values and cannot easily be attributed to individual threads. This is no worse than measurements of any other shared resource; on modern Intel chips the last-level caches and the uncore events share this limitation.
RAPL reports various energy readings. These include the energy usage of the total processor package and the total combined energy used by all the cores (referred to as Power-Plane 0 (PP0)). PP0 also includes all of the processor caches. Some versions of SandyBridge chips also report power usage by the on-board GPU (Power-Plane 1 (PP1)). SandyBridge EP chips do not support the GPU measurement, but instead report energy readings for the DRAM interface.
While the RAPL values can be measured in-band and consumed by the program, since RAPL is system-wide a separate process may be used to measure energy and power. In this way the running code does not need to be instrumented and some of the PAPI overhead can be avoided. We use this method to gather the results presented here.
We take measurements on a SandyBridge EP machine. It has 2 CPU packages, each with 8 cores, and each core with 2 threads. Figure 3 shows some average power measurements gathered while doing Cholesky factorization using the PLASMA library. Notice that the energy usage by each package varies, despite all of the cores doing similar work. Part of this is likely due to variations in the cores at the silicon level, as noticed by Rountree et al. [23]. Figure 4 shows the same measurements using the Intel MKL library [24]. Figure 5 shows energy measurements comparing the same Cholesky factorization using both PLASMA and Intel MKL on the same hardware. The PAPI results show that for this case, PLASMA uses energy more quickly, but finishes faster and uses less total energy for the calculation.

2) AMD Application Power Management: Recent AMD Family 15h processors can report "Current Power In Watts" [25] via the Processor Power in TDP MSR. We are investigating PAPI support for this and hope to deploy a component similar in nature and scope to the Intel RAPL component.
Fig. 3. PLASMA Cholesky factorization (N=3, threads=16) power usage measured with RAPL on SandyBridge EP, showing total package, Power-Plane 0 (PP0), and DRAM readings for both packages. PP0 is the total usage for all 8 cores in a package.

Fig. 4. Intel MKL Cholesky factorization (N=3, threads=16) power usage measured with RAPL on SandyBridge EP, showing the same readings as Figure 3. PP0 is the total usage for all 8 cores in a package.

Fig. 5. Energy usage of two different implementations (PLASMA and MKL) of Cholesky factorization (N=3, threads=16) on SandyBridge EP measured with RAPL.
Fig. 6. MAGMA LU with size 1, power measurement on an NVIDIA Fermi C2075, gathered with NVML.

3) NVIDIA Management Library: Recent NVIDIA GPUs can report power usage via the NVIDIA Management Library (NVML) [26]. The nvmlDeviceGetPowerUsage() routine exports the current power; on Fermi C2075 GPUs it has milliwatt resolution within ±5 W and is updated at roughly 60 Hz. The power reported is that for the entire board, including GPU and memory.
Gathering detailed performance information from a GPU is difficult: once you dispatch code to a GPU, the running CPU has no control over it until the GPU returns upon completion. This means that it is not generally possible to attribute particular GPU code to particular power readings. NVIDIA provides a high-level utility called nvidia-smi which can be used to measure power, but its sampling interval is too long to obtain useful measurements. In order to provide better power measurements we have constructed an NVML component [27] for PAPI and have validated the results using a Kill-A-Watt power meter.
Figure 6 shows data gathered on an NVIDIA Fermi C2075 card running a MAGMA [28] kernel using the LU algorithm [29] with a matrix size of 1k. The MAGMA LU factorization is a compute-bound algorithm (expressed in terms of GEMMs); it uses a hybridization methodology to split the computation between the CPU host and the GPU. The split aims to match LU's algorithmic requirements to the architectural strengths of the GPU and the CPU. In the case of LU, this translates into having all matrix-matrix (GEMM) multiplications done on the GPU, and the panel factorizations on the CPU. The design of the algorithm allows, for big enough matrices, total overlap of the CPU work with the large matrix-matrix multiplications on the GPU. As a result, the MAGMA LU algorithm runs at the speed of performing GEMMs on the GPU.
Our experiments have shown that MAGMA's GEMM operations completely utilize the GPU, maximizing its power consumption. This explains why the hybrid LU factorization also maximizes GPU power consumption; because it reduces the time taken, the overall energy consumption is minimized.

C. Estimated Power

Various researchers have proposed using hardware performance counters to model energy and power consumption [15], [30], [31], [32], [33], [16], [34], [35], [36]. Goel et al. [36] have shown that power can be modeled to within 1% using just four hardware performance counters. Using the PAPI user-defined events infrastructure [37], an event can be created that derives an estimated power value from the hardware counters. This can be used to measure power on systems that do not have hardware power measurement available.

V. CONCLUSION

The PAPI library can now provide transparent access to power and energy measurements via existing interfaces. Existing programs that already have PAPI instrumentation for CPU performance measurements can quickly be adapted to measure power, and existing tools will gain access to the new power events with a simple PAPI upgrade.
With larger and larger clusters being built, energy consumption has become one of the defining constraints. PAPI has been continually extended to provide support for the most up-to-date performance measurements on modern systems. The addition of power and energy measurements allows PAPI users to stay
on top of this increasingly important area in the rapidly changing HPC environment.

ACKNOWLEDGMENT

This material is based upon work supported by the National Science Foundation under Grant No. and the U.S. Department of Energy Office of Science under contract DE-FC2-6ER.

REFERENCES

[1] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, "A portable programming interface for performance evaluation on modern processors," International Journal of High Performance Computing Applications, vol. 14, no. 3, 2000.
[2] D. Terpstra, H. Jagode, H. You, and J. Dongarra, "Collecting performance data with PAPI-C," in 3rd Parallel Tools Workshop, 2009.
[3] Top 500 supercomputing sites.
[4] Top Green 500 list: Environmentally responsible supercomputing.
[5] S. Shende and A. Malony, "The TAU parallel performance system," International Journal of High Performance Computing Applications, vol. 20, no. 2, 2006.
[6] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. Tallent, "HPCToolkit: Tools for performance analysis of optimized parallel programs," Concurrency and Computation: Practice and Experience, vol. 22, no. 6, 2010.
[7] W. Nagel, A. Arnold, M. Weber, H.-C. Hoppe, and K. Solchenbach, "VAMPIR: Visualization and analysis of MPI resources," Supercomputer, vol. 12, no. 1, pp. 69-80, 1996.
[8] Intel, Intel Energy Checker: Software Developer Kit User Guide, 2010.
[9] R. Ge, X. Feng, S. Song, H.-C. Chang, D. Li, and K. Cameron, "PowerPack: Energy profiling and analysis of high-performance systems and applications," IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 6, May 2010.
[10] P. Popa, "Managing server energy consumption using IBM PowerExecutive," IBM Systems and Technology Group, Tech. Rep., 2006.
[11] D. Shin, H. Shim, Y. Joo, H.-S. Yun, J. Kim, and N. Chang, "Energy-monitoring tool for low-power embedded programs," IEEE Design & Test of Computers, vol. 19, no. 4, pp. 7-17, July/August 2002.
[12] S. Ryffel, "LEA²P: The Linux energy attribution and accounting platform," Master's thesis, Swiss Federal Institute of Technology, Jan. 2009.
[13] J. Flinn and M. Satyanarayanan, "PowerScope: a tool for profiling the energy usage of mobile applications," in Proc. of the 2nd IEEE Workshop on Mobile Computing Systems and Applications, Feb. 1999.
[14] T. Stathopoulos, D. McIntire, and W. Kaiser, "The energy endoscope: Real-time detailed energy accounting for wireless sensor nodes," in Proc. of the International Conference on Information Processing in Sensor Networks, Apr. 2008.
[15] C. Isci and M. Martonosi, "Runtime power monitoring in high-end processors: Methodology and empirical data," in Proc. IEEE/ACM 36th Annual International Symposium on Microarchitecture, Dec. 2003.
[16] F. Bellosa, "The benefits of event-driven energy accounting in power-sensitive systems," in Proc. of the 9th ACM SIGOPS European Workshop, 2000.
[17] PLASMA Users' Guide, Parallel Linear Algebra Software for Multicore Architectures, Version 2.3, University of Tennessee Knoxville, Nov. 2010.
[18] S. Tomov, R. Nath, H. Ltaief, and J. Dongarra, "Dense linear algebra solvers for multicore with GPU accelerators," in Proc. 24th IEEE/ACM International Parallel and Distributed Processing Symposium, Apr. 2010.
[19] D. Bedard, R. Fowler, M. Linn, and A. Porterfield, "PowerMon 2: Fine-grained, integrated power measurement," Renaissance Computing Institute, Tech. Rep. TR-09-04, 2009.
[20] Intel, Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide, 2009.
[21] E. Rotem, A. Naveh, D. Rajwan, A. Anathakrishnan, and E. Weissmann, "Power-management architecture of the Intel microarchitecture code-named Sandy Bridge," IEEE Micro, vol. 32, no. 2, pp. 20-27, 2012.
[22] Z. Rui, "[patch 2/3] introduce intel rapl driver," linux-kernel mailing list, May 2011.
[23] B. Rountree, D. Ahn, B. de Supinski, D. Lowenthal, and M. Schulz, "Beyond DVFS: A first look at performance under a hardware-enforced power bound," in Proc. of the 8th Workshop on High-Performance, Power-Aware Computing, May 2012.
[24] Intel, Math Kernel Library (MKL).
[25] AMD, AMD Family 15h Processor BIOS and Kernel Developer Guide, 2011.
[26] NVML Reference Manual, NVIDIA, 2012.
[27] K. Kasichayanula, "Power aware computing on GPUs," Master's thesis, University of Tennessee, Knoxville, May 2012.
[28] E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, and S. Tomov, "Faster, cheaper, better: a hybridization methodology to develop linear algebra software for GPUs," LAPACK Working Note 23.
[29] S. Yamazaki, S. Tomov, and J. Dongarra, "One-sided dense matrix factorizations on a multicore with multiple GPU accelerators," in Proc. of the 2012 International Conference on Computational Science, Jun. 2012.
[30] K. Singh, M. Bhadauria, and S. McKee, "Real time power estimation of multi-cores via performance counters," in Proc. Workshop on Design, Architecture and Simulation of Chip Multi-Processors, Nov. 2008.
[31] I. Kadayif, T. Chinoda, M. Kandemir, N. Vijaykirsnan, M. Irwin, and A. Sivasubramaniam, "vEC: virtual energy counters," in Proc. of the 2001 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, Jun. 2001.
[32] V. Tiwari, S. Malik, and A. Wolfe, "Power analysis of embedded software: a first step towards software power minimization," IEEE Transactions on VLSI, vol. 3, no. 4, 1994.
[33] J. Russell and M. Jacome, "Software power estimation and optimization for high performance, 32-bit embedded processors," in Proc. IEEE International Conference on Computer Design, Oct. 1998.
[34] R. Joseph and M. Martonosi, "Run-time power estimation in high-performance microprocessors," in Proc. IEEE/ACM International Symposium on Low Power Electronics and Design, Aug. 2001.
[35] J. Haid, G. Kaefer, C. Steger, and R. Weiss, "Run-time energy estimation in system-on-a-chip designs," in Proc. of the Asia and South Pacific Design Automation Conference, Jan. 2003.
[36] B. Goel, S. McKee, R. Gioiosa, K. Singh, M. Bhadauria, and M. Cesati, "Portable, scalable, per-core power estimation for intelligent resource management," in First International Green Computing Conference, Aug. 2010.
[37] S. Moore and J. Ralph, "User-defined events for hardware performance monitoring," in Proc. 11th Workshop on Tools for Program Development and Analysis in Computational Science, Jun. 2011.
A Dynamic Resource Management with Energy Saving Mechanism for Supporting Cloud Computing
A Dynamic Resource Management with Energy Saving Mechanism for Supporting Cloud Computing Liang-Teh Lee, Kang-Yuan Liu, Hui-Yang Huang and Chia-Ying Tseng Department of Computer Science and Engineering,
Improving Time to Solution with Automated Performance Analysis
Improving Time to Solution with Automated Performance Analysis Shirley Moore, Felix Wolf, and Jack Dongarra Innovative Computing Laboratory University of Tennessee {shirley,fwolf,dongarra}@cs.utk.edu Bernd
Power-Aware High-Performance Scientific Computing
Power-Aware High-Performance Scientific Computing Padma Raghavan Scalable Computing Laboratory Department of Computer Science Engineering The Pennsylvania State University http://www.cse.psu.edu/~raghavan
Parallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis
Energy Constrained Resource Scheduling for Cloud Environment
Energy Constrained Resource Scheduling for Cloud Environment 1 R.Selvi, 2 S.Russia, 3 V.K.Anitha 1 2 nd Year M.E.(Software Engineering), 2 Assistant Professor Department of IT KSR Institute for Engineering
Pedraforca: ARM + GPU prototype
www.bsc.es Pedraforca: ARM + GPU prototype Filippo Mantovani Workshop on exascale and PRACE prototypes Barcelona, 20 May 2014 Overview Goals: Test the performance, scalability, and energy efficiency of
A Pattern-Based Approach to. Automated Application Performance Analysis
A Pattern-Based Approach to Automated Application Performance Analysis Nikhil Bhatia, Shirley Moore, Felix Wolf, and Jack Dongarra Innovative Computing Laboratory University of Tennessee (bhatia, shirley,
Next Generation Operating Systems
Next Generation Operating Systems Zeljko Susnjar, Cisco CTG June 2015 The end of CPU scaling Future computing challenges Power efficiency Performance == parallelism Cisco Confidential 2 Paradox of the
Data Structure Oriented Monitoring for OpenMP Programs
A Data Structure Oriented Monitoring Environment for Fortran OpenMP Programs Edmond Kereku, Tianchao Li, Michael Gerndt, and Josef Weidendorfer Institut für Informatik, Technische Universität München,
High Performance Computing in CST STUDIO SUITE
High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver
Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat
Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat Why Computers Are Getting Slower The traditional approach better performance Why computers are
The Assessment of Benchmarks Executed on Bare-Metal and Using Para-Virtualisation
The Assessment of Benchmarks Executed on Bare-Metal and Using Para-Virtualisation Mark Baker, Garry Smith and Ahmad Hasaan SSE, University of Reading Paravirtualization A full assessment of paravirtualization
Next Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
22S:295 Seminar in Applied Statistics High Performance Computing in Statistics
22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC
Part I Courses Syllabus
Part I Courses Syllabus This document provides detailed information about the basic courses of the MHPC first part activities. The list of courses is the following 1.1 Scientific Programming Environment
Agenda. Context. System Power Management Issues. Power Capping Overview. Power capping participants. Recommendations
Power Capping Linux Agenda Context System Power Management Issues Power Capping Overview Power capping participants Recommendations Introduction of Linux Power Capping Framework 2 Power Hungry World Worldwide,
Introduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
Generating Real-Time Profiles of Runtime Energy Consumption for Java Applications
Generating Real-Time Profiles of Runtime Energy Consumption for Java Applications Muhammad Nassar, Julian Jarrett, Iman Saleh, M. Brian Blake Department of Computer Science University of Miami Coral Gables,
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems About me David Rioja Redondo Telecommunication Engineer - Universidad de Alcalá >2 years building and managing clusters UPM
Power efficiency and power management in HP ProLiant servers
Power efficiency and power management in HP ProLiant servers Technology brief Introduction... 2 Built-in power efficiencies in ProLiant servers... 2 Optimizing internal cooling and fan power with Sea of
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
HPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
Virtuoso and Database Scalability
Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of
Full and Para Virtualization
Full and Para Virtualization Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF x86 Hardware Virtualization The x86 architecture offers four levels
Infrastructure Matters: POWER8 vs. Xeon x86
Advisory Infrastructure Matters: POWER8 vs. Xeon x86 Executive Summary This report compares IBM s new POWER8-based scale-out Power System to Intel E5 v2 x86- based scale-out systems. A follow-on report
ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING
ACCELERATING COMMERCIAL LINEAR DYNAMIC AND Vladimir Belsky Director of Solver Development* Luis Crivelli Director of Solver Development* Matt Dunbar Chief Architect* Mikhail Belyi Development Group Manager*
A Brief Survery of Linux Performance Engineering. Philip J. Mucci University of Tennessee, Knoxville [email protected]
A Brief Survery of Linux Performance Engineering Philip J. Mucci University of Tennessee, Knoxville [email protected] Overview On chip Hardware Performance Counters Linux Performance Counter Infrastructure
big.little Technology Moves Towards Fully Heterogeneous Global Task Scheduling Improving Energy Efficiency and Performance in Mobile Devices
big.little Technology Moves Towards Fully Heterogeneous Global Task Scheduling Improving Energy Efficiency and Performance in Mobile Devices Brian Jeff November, 2013 Abstract ARM big.little processing
www.xenon.com.au STORAGE HIGH SPEED INTERCONNECTS HIGH PERFORMANCE COMPUTING VISUALISATION GPU COMPUTING
www.xenon.com.au STORAGE HIGH SPEED INTERCONNECTS HIGH PERFORMANCE COMPUTING GPU COMPUTING VISUALISATION XENON Accelerating Exploration Mineral, oil and gas exploration is an expensive and challenging
How To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) (
TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 7 th CALL (Tier-0) Contributing sites and the corresponding computer systems for this call are: GCS@Jülich, Germany IBM Blue Gene/Q GENCI@CEA, France Bull Bullx
STUDY AND SIMULATION OF A DISTRIBUTED REAL-TIME FAULT-TOLERANCE WEB MONITORING SYSTEM
STUDY AND SIMULATION OF A DISTRIBUTED REAL-TIME FAULT-TOLERANCE WEB MONITORING SYSTEM Albert M. K. Cheng, Shaohong Fang Department of Computer Science University of Houston Houston, TX, 77204, USA http://www.cs.uh.edu
Optimizing Shared Resource Contention in HPC Clusters
Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs
MAQAO Performance Analysis and Optimization Tool
MAQAO Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL [email protected] Performance Evaluation Team, University of Versailles S-Q-Y http://www.maqao.org VI-HPS 18 th Grenoble 18/22
High Performance Matrix Inversion with Several GPUs
High Performance Matrix Inversion on a Multi-core Platform with Several GPUs Pablo Ezzatti 1, Enrique S. Quintana-Ortí 2 and Alfredo Remón 2 1 Centro de Cálculo-Instituto de Computación, Univ. de la República
Performance Analysis and Optimization Tool
Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL [email protected] Performance Analysis Team, University of Versailles http://www.maqao.org Introduction Performance Analysis Develop
Overlapping Data Transfer With Application Execution on Clusters
Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm [email protected] [email protected] Department of Computer Science Department of Electrical and Computer
DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION
DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION A DIABLO WHITE PAPER AUGUST 2014 Ricky Trigalo Director of Business Development Virtualization, Diablo Technologies
Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage
White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage
BSC vision on Big Data and extreme scale computing
BSC vision on Big Data and extreme scale computing Jesus Labarta, Eduard Ayguade,, Fabrizio Gagliardi, Rosa M. Badia, Toni Cortes, Jordi Torres, Adrian Cristal, Osman Unsal, David Carrera, Yolanda Becerra,
A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS
A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 [email protected] THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833
~ Greetings from WSU CAPPLab ~
~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)
Intel s SL Enhanced Intel486(TM) Microprocessor Family
Intel s SL Enhanced Intel486(TM) Microprocessor Family June 1993 Intel's SL Enhanced Intel486 Microprocessor Family Technical Backgrounder Intel's SL Enhanced Intel486 Microprocessor Family With the announcement
Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis
White Paper Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis White Paper March 2014 2014 Cisco and/or its affiliates. All rights reserved. This document
Evoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca
Evoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca Carlo Cavazzoni CINECA Supercomputing Application & Innovation www.cineca.it 21 Aprile 2015 FERMI Name: Fermi Architecture: BlueGene/Q
Innovativste XEON Prozessortechnik für Cisco UCS
Innovativste XEON Prozessortechnik für Cisco UCS Stefanie Döhler Wien, 17. November 2010 1 Tick-Tock Development Model Sustained Microprocessor Leadership Tick Tock Tick 65nm Tock Tick 45nm Tock Tick 32nm
FLOW-3D Performance Benchmark and Profiling. September 2012
FLOW-3D Performance Benchmark and Profiling September 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: FLOW-3D, Dell, Intel, Mellanox Compute
International Journal of Computer & Organization Trends Volume20 Number1 May 2015
Performance Analysis of Various Guest Operating Systems on Ubuntu 14.04 Prof. (Dr.) Viabhakar Pathak 1, Pramod Kumar Ram 2 1 Computer Science and Engineering, Arya College of Engineering, Jaipur, India.
Trends in High-Performance Computing for Power Grid Applications
Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views
Power Management in the Linux Kernel
Power Management in the Linux Kernel Tate Hornbeck, Peter Hokanson 7 April 2011 Intel Open Source Technology Center Venkatesh Pallipadi Senior Staff Software Engineer 2001 - Joined Intel Processor and
3 Red Hat Enterprise Linux 6 Consolidation
Whitepaper Consolidation EXECUTIVE SUMMARY At this time of massive and disruptive technological changes where applications must be nimbly deployed on physical, virtual, and cloud infrastructure, Red Hat
Intel Power Gadget 2.0 Monitoring Processor Energy Usage
Intel Power Gadget 2.0 Monitoring Processor Energy Usage Introduction Intel Power Gadget 2.0 is enabled for 2nd generation Intel Core Processor based platforms is a set of Microsoft Windows* gadget, driver,
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
Learning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design
Learning Outcomes Simple CPU Operation and Buses Dr Eddie Edwards [email protected] At the end of this lecture you will Understand how a CPU might be put together Be able to name the basic components
Full-System Power Analysis and Modeling for Server Environments
Full-System Power Analysis and Modeling for Server Environments Dimitris Economou, Suzanne Rivoire, Christos Kozyrakis Department of Electrical Engineering Stanford University {dimeco,rivoire}@stanford.edu,
Intel Xeon Processor E5-2600
Intel Xeon Processor E5-2600 Best combination of performance, power efficiency, and cost. Platform Microarchitecture Processor Socket Chipset Intel Xeon E5 Series Processors and the Intel C600 Chipset
Delivering Quality in Software Performance and Scalability Testing
Delivering Quality in Software Performance and Scalability Testing Abstract Khun Ban, Robert Scott, Kingsum Chow, and Huijun Yan Software and Services Group, Intel Corporation {khun.ban, robert.l.scott,
Parallel Firewalls on General-Purpose Graphics Processing Units
Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering
Perfmon2: A leap forward in Performance Monitoring
Perfmon2: A leap forward in Performance Monitoring Sverre Jarp, Ryszard Jurga, Andrzej Nowak CERN, Geneva, Switzerland [email protected] Abstract. This paper describes the software component, perfmon2,
GPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles [email protected] Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
Resource Scheduling Best Practice in Hybrid Clusters
Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Resource Scheduling Best Practice in Hybrid Clusters C. Cavazzoni a, A. Federico b, D. Galetti a, G. Morelli b, A. Pieretti
VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5
Performance Study VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 VMware VirtualCenter uses a database to store metadata on the state of a VMware Infrastructure environment.
Software Distributed Shared Memory Scalability and New Applications
Software Distributed Shared Memory Scalability and New Applications Mats Brorsson Department of Information Technology, Lund University P.O. Box 118, S-221 00 LUND, Sweden email: [email protected]
Data Centric Systems (DCS)
Data Centric Systems (DCS) Architecture and Solutions for High Performance Computing, Big Data and High Performance Analytics High Performance Computing with Data Centric Systems 1 Data Centric Systems
:Introducing Star-P. The Open Platform for Parallel Application Development. Yoel Jacobsen E&M Computing LTD [email protected]
:Introducing Star-P The Open Platform for Parallel Application Development Yoel Jacobsen E&M Computing LTD [email protected] The case for VHLLs Functional / applicative / very high-level languages allow
ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU
Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents
Seeking Opportunities for Hardware Acceleration in Big Data Analytics
Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who
Self-monitoring Overhead of the Linux perf event Performance Counter Interface
Paper Appears in ISPASS 215, IEEE Copyright Rules Apply Self-monitoring Overhead of the Linux perf event Performance Counter Interface Vincent M. Weaver Electrical and Computer Engineering University of
