PAPI - PERFORMANCE API. ANDRÉ PEREIRA ampereira@di.uminho.pt

Similar documents
Hardware Performance Monitoring: Current and Future. Shirley Moore VI-HPS Inauguration 4July 2007

Hardware Performance Monitoring with PAPI

Performance Application Programming Interface

A Brief Survery of Linux Performance Engineering. Philip J. Mucci University of Tennessee, Knoxville

Performance Counter. Non-Uniform Memory Access Seminar Karsten Tausche

Performance Analysis Tools and PAPI

End-user Tools for Application Performance Analysis Using Hardware Counters

NVIDIA Tools For Profiling And Monitoring. David Goodwin

Multi-Threading Performance on Commodity Multi-Core Processors

Performance Counter Monitoring for the Blue Gene/Q Architecture

A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters

BG/Q Performance Tools. Sco$ Parker BG/Q Early Science Workshop: March 19-21, 2012 Argonne Leadership CompuGng Facility

How To Monitor Performance On A Microsoft Powerbook (Powerbook) On A Network (Powerbus) On An Uniden (Powergen) With A Microsatellite) On The Microsonde (Powerstation) On Your Computer (Power

D5.6 Prototype demonstration of performance monitoring tools on a system with multiple ARM boards Version 1.0

Program Grid and HPC5+ workshop

Experiences and Lessons Learned with a Portable Interface to Hardware Performance Counters

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

Using PAPI for hardware performance monitoring on Linux systems

BG/Q Performance Tools. Sco$ Parker Leap to Petascale Workshop: May 22-25, 2012 Argonne Leadership CompuCng Facility

Building an energy dashboard. Energy measurement and visualization in current HPC systems

Performance Measurement and monitoring in TSUBAME2.5 towards next generation supercomputers

Le langage OCaml et la programmation des GPU

COM 444 Cloud Computing

Full and Para Virtualization

Performance Tools for Parallel Java Environments

PAPI-V: Performance Monitoring for Virtual Machines

Data Structure Oriented Monitoring for OpenMP Programs

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

STUDY OF PERFORMANCE COUNTERS AND PROFILING TOOLS TO MONITOR PERFORMANCE OF APPLICATION

An Implementation Of Multiprocessor Linux

Measuring Energy and Power with PAPI

Performance analysis with Periscope

Parallel Algorithm Engineering

Design Cycle for Microprocessors

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o

Performance Analysis for GPU Accelerated Applications

GPU Performance Analysis and Optimisation

Parallel Programming Survey

PerfExpert: An Easy-to-Use Performance Diagnosis Tool for HPC Applications

CUDA Tools for Debugging and Profiling. Jiri Kraus (NVIDIA)

Virtual Machines.

Introduction to Virtual Machines

Embedded Systems: map to FPGA, GPU, CPU?

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

Perfmon2: A leap forward in Performance Monitoring

Sequential Performance Analysis with Callgrind and KCachegrind

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

GPU Profiling with AMD CodeXL

Multi-core Programming System Overview

Power Monitoring with PAPI for Extreme Scale Architectures and Dataflow-based Programming Models

BLM 413E - Parallel Programming Lecture 3

Hardware performance monitoring. Zoltán Majó

Toward Accurate Performance Evaluation using Hardware Counters

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff

HPC Wales Skills Academy Course Catalogue 2015

MAQAO Performance Analysis and Optimization Tool

Intel DPDK Boosts Server Appliance Performance White Paper

Analyzing PAPI Performance on Virtual Machines. John Nelson

A quick tutorial on Intel's Xeon Phi Coprocessor

Massimo Bernaschi Istituto Applicazioni del Calcolo Consiglio Nazionale delle Ricerche.

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

THE BASICS OF PERFORMANCE- MONITORING HARDWARE

Part I Courses Syllabus

Advanced MPI. Hybrid programming, profiling and debugging of MPI applications. Hristo Iliev RZ. Rechen- und Kommunikationszentrum (RZ)

Learning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design

Linux Performance Optimizations for Big Data Environments

How To Write Portable Programs In C

Metrics for Success: Performance Analysis 101

An Approach to Application Performance Tuning

1 Bull, 2011 Bull Extreme Computing

Beyond the CPU: Hardware Performance Counter Monitoring on Blue Gene/Q

GPU Computing - CUDA

Application Performance Analysis Tools and Techniques

Building an Inexpensive Parallel Computer

CS2101a Foundations of Programming for High Performance Computing

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

A Practical Approach to Compiling Multipleiversion and Embracing Failure

Improving System Scalability of OpenMP Applications Using Large Page Support

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing

TEGRA X1 DEVELOPER TOOLS SEBASTIEN DOMINE, SR. DIRECTOR SW ENGINEERING

Retour d expérience : portage d une application haute-performance vers un langage de haut niveau

PikeOS: Multi-Core RTOS for IMA. Dr. Sergey Tverdyshev SYSGO AG , Moscow

Transactional Memory

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

Big Data Visualization on the MIC

Chapter 3 Operating-System Structures

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

An OS-oriented performance monitoring tool for multicore systems

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

Sequential Performance Analysis with Callgrind and KCachegrind

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

ST810 Advanced Computing

PSE Molekulardynamik

Next Generation Operating Systems

Libmonitor: A Tool for First-Party Monitoring

Optimization tools. 1) Improving Overall I/O

5x in 5 hours Porting SEISMIC_CPML using the PGI Accelerator Model

Turbomachinery CFD on many-core platforms experiences and strategies

Performance Evaluation of the XDEM framework on the OpenStack Cloud Computing Middleware

Transcription:

1 PAPI - PERFORMANCE API ANDRÉ PEREIRA ampereira@di.uminho.pt

2 Motivation Application and functions execution time is easy to measure time gprof valgrind (callgrind) It is enough to identify bottlenecks, but Why is is it slow? How does the code behaves?

3 Motivation Efficient algorithms should take into account Cache behaviour Memory and resource contention Floating point efficiency Branch behaviour

4 HW Performance Counters Hardware designers added specialised registers o measure various aspects of a microprocessor Generally, they provide an insight into Timings Cache and branch behaviour Memory access patterns Pipeline behaviour FP performance IPC

5 What is PAPI? Interface to interact with performance counters With minimal overhead Portable across several platforms Provides utility tools, C, and Fortran API Platform and counters information

6 PAPI Organisation 3 rd Party Tools Low Level API High Level API PAPI Portable Layer PAPI Hardware Specific Layer Kernel Extension Operating System Perf Counter Hardware

7 Supported Platforms Mainstream platforms (Linux) x86, x86_64 Intel and AMD ARM, MIPS Intel Itanium II IBM PowerPC

8 Utilities papi_avail

8 Utilities papi_avail papi_native_avail

8 Utilities papi_avail papi_native_avail papi_event_chooser

9 PAPI Performance Counters Preset events Events implemented on all platforms PAPI_TOT_INS Native events Platform dependent events L3_CACHE_MISS Derived events Preset events that are derived from multiple native events PAPI_L1_TCM may be L1 data misses + L1 instruction misses

10 PAPI High-level Interface Calls the low-level API Easier to use Enough for coarse grain measurements You will not optimise code based on the amount of L2 TLB flushes per thread For preset events only!

11 The Basics PAPI_start_counters PAPI_stop_counters

12 The Basics #include "papi.h #define NUM_EVENTS 2 long long values[num_events]; unsigned int Events[NUM_EVENTS]={PAPI_TOT_INS,PAPI_TOT_CYC}; /* Start the counters */ PAPI_start_counters((int*)Events,NUM_EVENTS); /* What we are monitoring */ do_work(); /* Stop counters and store results in values */ retval = PAPI_stop_counters(values,NUM_EVENTS);

13 PAPI Low-level Interface Increased efficiency and functionality More information about the environment Concepts to check EventSet Multiplexing

14 The Basics create_eventset library_init PREFIX: PAPI_ Initialised create_eventset EventSet Setup add_event Counter Setup add_event Measurement Storage stop Measuring start start

The Basics #include "papi.h #define NUM_EVENTS 2 int Events[NUM_EVENTS]={PAPI_FP_INS,PAPI_TOT_CYC}; int EventSet; long long values[num_events]; /* Initialize the Library */ retval = PAPI_library_init(PAPI_VER_CURRENT); /* Allocate space for the new eventset and do setup */ retval = PAPI_create_eventset(&EventSet); /* Add Flops and total cycles to the eventset */ retval = PAPI_add_events(EventSet,Events,NUM_EVENTS); /* Start the counters */ retval = PAPI_start(EventSet); /* What we want to monitor*/ do_work(); /*Stop counters and store results in values */ retval = PAPI_stop(EventSet,values); André Pereira, UMinho, 2015/2016 15

16 PAPI CUDA Component PAPI is also available for CUDA GPUs Uses the CUPTI Which counters can be directly accessed Define a file with the counters and an environment variable Gives useful information about the GPU usage IPC Memory load/stores/throughput Branch divergences SM(X) occupancy

17 What to Measure? The whole application? PAPI usefulness is limited when used alone Combine it with other profilers Bottleneck identification + characterisation

18 A Practical Example for (int i = 0; i < SIZE; i++) for (int j = 0; j < SIZE; j++) for (int k = 0; k < SIZE; k++) c[i][j] += a[i][k] * b[k][j];

19 A Practical Example SGEMM int sum; for (int i = 0; i < SIZE; i++) for (int j = 0; j < SIZE; j++) { sum = 0; for (int k = 0; k < SIZE; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; }

Execution Time André Pereira, UMinho, 2015/2016 @ 2x Intel Xeon E5-2695v2, 12C with 24t each, 2.4GHz 20

FLOP s André Pereira, UMinho, 2015/2016 @ 2x Intel Xeon E5-2695v2, 12C with 24t each, 2.4GHz 21

Cache Miss Rate André Pereira, UMinho, 2015/2016 @ 2x Intel Xeon E5-2695v2, 12C with 24t each, 2.4GHz 22

Arithmetic Intensity André Pereira, UMinho, 2015/2016 @ 2x Intel Xeon E5-2695v2, 12C with 24t each, 2.4GHz 23

24 Useful Counters Instruction mix PAPI_FP_INS PAPI_SR/LD_INS PAPI_BR_INS PAPI_SP/DP_VEC

24 Useful Counters Instruction mix PAPI_FP_INS PAPI_SR/LD_INS PAPI_BR_INS PAPI_SP/DP_VEC FLOPS and operational intensity PAPI_FP_OPS PAPI_SP/DP_OPS PAPI_TOT_INS Cache behaviour and bytes transferred PAPI_L1/2/3_TCM PAPI_L1_TCA

25 Useful Hints Be careful choosing a measurement heuristic Q: Why? Average? Median? Best measurement? Automatise the measurement process With scripting/c++ coding Using 3rd party tools that resort to PAPI PerfSuite HPCToolkit TAU Available for Java and on virtual machines

26 Compiling and Running the Code Use the same GCC/G++ version as The PAPI compilation on your home (preferably) The PAPI available at the cluster Code compilation g++ -L$PAPI_DIR/lib -I$PAPI_DIR/include c.cpp -lpapi Code execution export LD_LIBRARY_PATH=$PAPI_DIR/lib: $LD_LIBRARY_PATH (dynamic library dependencies are resolved at runtime; you can have it on your.bashrc) Run the code!

27 Hands-on Assess the available counters Perform the FLOPs and miss rate measurements https://bitbucket.org/ampereira/papi/downloads

28 References Dongarra, J., London, K., Moore, S., Mucci, P., Terpstra, D. "Using PAPI for Hardware Performance Monitoring on Linux Systems," Conference on Linux Clusters: The HPC Revolution, Linux Clusters Institute, Urbana, Illinois, June 25-27, 2001. Weaver, V., Johnson, M., Kasichayanula, K., Ralph, J., Luszczek, P., Terpstra, D., Moore, S. "Measuring Energy and Power with PAPI, International Workshop on Power-Aware Systems and Architectures, Pittsburgh, PA, September 10, 2012. Malony, A., Biersdorff, S., Shende, S., Jagode, H., Tomov, S., Juckeland, G., Dietrich, R., Duncan Poole, P., Lamb, C. "Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs," International Conference on Parallel Processing (ICPP'11), Taipei, Taiwan, September 13-16, 2011. Weaver, V., Dongarra, J. "Can Hardware Performance Counters Produce Expected, Deterministic Results?," 3rd Workshop on Functionality of Hardware Performance Monitoring, Atlanta, GA, December 4, 2010.