Application Performance Analysis Tools and Techniques



Similar documents
Application Performance Analysis Tools and Techniques

CRESTA DPI OpenMPI Performance Optimisation and Analysis

End-user Tools for Application Performance Analysis Using Hardware Counters

MAQAO Performance Analysis and Optimization Tool

Analysis report examination with CUBE

Performance Analysis for GPU Accelerated Applications

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1

HPC Wales Skills Academy Course Catalogue 2015

Software and the Concurrency Revolution

How To Visualize Performance Data In A Computer Program

Basics of VTune Performance Analyzer. Intel Software College. Objectives. VTune Performance Analyzer. Agenda

Data Structure Oriented Monitoring for OpenMP Programs

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Parallelization of video compressing with FFmpeg and OpenMP in supercomputing environment

Performance Tools for Parallel Java Environments

Profiling and Tracing in Linux

A Pattern-Based Approach to. Automated Application Performance Analysis

LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS Hermann Härtig

Improving Time to Solution with Automated Performance Analysis

Advanced MPI. Hybrid programming, profiling and debugging of MPI applications. Hristo Iliev RZ. Rechen- und Kommunikationszentrum (RZ)

Performance Monitoring of Parallel Scientific Applications

A highly configurable and efficient simulator for job schedulers on supercomputers

Tools for Performance Debugging HPC Applications. David Skinner

A Brief Survery of Linux Performance Engineering. Philip J. Mucci University of Tennessee, Knoxville

An Implementation of the POMP Performance Monitoring Interface for OpenMP Based on Dynamic Probes

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Introduction to Cloud Computing

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

Introduction to Embedded Systems. Software Update Problem

Overlapping Data Transfer With Application Execution on Clusters

Petascale Software Challenges. William Gropp

Parallel Computing. Parallel shared memory computing with OpenMP

HOPSA Project. Technical Report. A Workflow for Holistic Performance System Analysis

Full and Para Virtualization

INTEL PARALLEL STUDIO XE EVALUATION GUIDE

Real Time Programming: Concepts

A Case Study - Scaling Legacy Code on Next Generation Platforms

MPI and Hybrid Programming Models. William Gropp

Lecture 10: Dynamic Memory Allocation 1: Into the jaws of malloc()

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

Multicore Parallel Computing with OpenMP

Petascale Software Challenges. Piyush Chaudhary High Performance Computing

Chapter 3 Operating-System Structures

gprof: a Call Graph Execution Profiler 1

Multi-Threading Performance on Commodity Multi-Core Processors

Sequential Performance Analysis with Callgrind and KCachegrind

Review of Performance Analysis Tools for MPI Parallel Programs

How To Monitor Performance On A Microsoft Powerbook (Powerbook) On A Network (Powerbus) On An Uniden (Powergen) With A Microsatellite) On The Microsonde (Powerstation) On Your Computer (Power

Computer-System Architecture

Why Threads Are A Bad Idea (for most purposes)

Load Imbalance Analysis

Parallel I/O on JUQUEEN

Real-Time Component Software. slide credits: H. Kopetz, P. Puschner

NVIDIA Tools For Profiling And Monitoring. David Goodwin

A Comparison Of Shared Memory Parallel Programming Models. Jace A Mogill David Haglin

Libmonitor: A Tool for First-Party Monitoring

Recent Advances in Periscope for Performance Analysis and Tuning

MOSIX: High performance Linux farm

Matlab on a Supercomputer

Understanding applications using the BSC performance tools

11.1 inspectit inspectit

Tools Page 1 of 13 ON PROGRAM TRANSLATION. A priori, we have two translation mechanisms available:

Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

Zing Vision. Answering your toughest production Java performance questions

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

Spring 2011 Prof. Hyesoon Kim

Optimization tools. 1) Improving Overall I/O

Performance Analysis and Optimization Tool

Sequential Performance Analysis with Callgrind and KCachegrind

CS 377: Operating Systems. Outline. A review of what you ve learned, and how it applies to a real operating system. Lecture 25 - Linux Case Study

Scalability evaluation of barrier algorithms for OpenMP

Debugging with TotalView

CSE 403. Performance Profiling Marty Stepp

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

ScaAnalyzer: A Tool to Identify Memory Scalability Bottlenecks in Parallel Programs

Operating System for the K computer

Practical Performance Understanding the Performance of Your Application

An Introduction to Parallel Computing/ Programming

OpenMP Tools API (OMPT) and HPCToolkit

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems

Compute Cluster Server Lab 3: Debugging the parallel MPI programs in Microsoft Visual Studio 2005

Transcription:

Mitglied der Helmholtz-Gemeinschaft Application Performance Analysis Tools and Techniques 2012-06-27 Christian Rössel Jülich Supercomputing Centre c.roessel@fz-juelich.de EU-US HPC Summer School Dublin

Performance analysis: an old problem The most constant difficulty in contriving the engine has arisen from the desire to reduce the time in which the calculations were executed to the shortest which is possible. Charles Babbage 1791 1871 Christian Rössel Slide 2

Today: the free lunch is over Moore's law is still in charge, but Clock rates do no longer increase Performance gains only through increased parallelism Christian Rössel Slide 3

Today: the free lunch is over Optimizations of applications more difficult Increasing application complexity Multi-physics Multi-scale Increasing machine complexity Hierarchical networks / memory More CPUs / multi-core Every doubling of scale reveals a new bottleneck! Christian Rössel Slide 4

Example: XNS CFD simulation of unsteady flows Developed by CATS / RWTH Aachen Exploits finite-element techniques, unstructured 3D meshes, iterative solution strategies MPI parallel version >40,000 lines of Fortran & C DeBakey blood-pump data set (3,714,611 elements) Partitioned finite-element mesh Hæmodynamic flow pressure distribution Christian Rössel Slide 5

XNS wait-state analysis on BG/L (2007) Christian Rössel Slide 6

Outline Motivation Performance analysis basics Terminology Techniques Performance tools overview Christian Rössel Slide 7

Performance factors of parallel applications Sequential factors Computation Choose right algorithm, use optimizing compiler Cache and memory Tough! Only limited tool support, hope compiler gets it right Input / output Often not given enough attention Parallel factors Communication (i.e., message passing) Threading Synchronization More or less understood, good tool support Christian Rössel Slide 8

Tuning basics Successful tuning is a combination of The right algorithms and libraries Compiler flags and directives Thinking, but Measurement is better than guessing To determine performance bottlenecks To validate tuning decisions and optimizations After each step! Courtesy of Bjarne Stroustrup Christian Rössel Slide 9

However... We should forget about small efficiencies, say 97% of the time: premature optimization is the root of all evil. Donald Knuth It's easier to optimize a slow correct program than to debug a fast incorrect one Nobody cares how fast you can compute the wrong answer... Christian Rössel Slide 10

Performance analysis workflow Preparation Prepare application, insert extra code (probes/hooks) Measurement Collection of data relevant to performance analysis Analysis Calculation of metrics, identification of performance metrics Presentation Visualization of results in an intuitive/understandable form Optimization Elimination of performance problems Christian Rössel Slide 11

The 80/20 rule Programs typically spend 80% of their time in 20% of the code Programmers typically spend 20% of their effort to get 80% of the total speedup possible for the application Know when to stop! Don't optimize what does not matter Make the common case fast! If you optimize everything, you will always be unhappy. Donald E. Knuth Christian Rössel Slide 12

Metrics of performance What can be measured? A count of how many times an event occurs E.g., the number of MPI point-to-point messages sent The duration of some time interval E.g., the time spent in these send calls The size of some parameter E.g., the number of bytes transmitted by these calls Derived metrics E.g., rates / throughput Needed for normalization Christian Rössel Slide 13

Example metrics Execution time Number of function calls CPI MFLOPS Clock ticks per instruction Millions of floating-point operations executed per second math Operations? HW Operations? HW Instructions? 32-/64-bit? Christian Rössel Slide 14

Execution time Wall-clock time CPU time Includes waiting time: I/O, memory, other system activities In time-sharing environments also the time consumed by other applications Time spent by the CPU to execute the application Does not include time the program was context-switched out Problem: Does not include inherent waiting time (e.g., I/O) Problem: Portability? What is user, what is system time? Problem: Execution time is non-deterministic Use mean or minimum of several runs Christian Rössel Slide 15

Inclusive vs. Exclusive values Inclusive Exclusive Information of all sub-elements aggregated into single value Information cannot be split up further Inclusive Exclusive int foo() { int a; a = a + 1; bar(); } a = a + 1; return a; Christian Rössel Slide 16

Classification of measurement techniques When are performance measurements triggered? Sampling Code instrumentation How is performance data recorded? Profiling / Runtime summarization Tracing Christian Rössel Slide 17

Sampling t 1 t 2 t 3 t 4 t 5 t 6 t 7 t 8 t 9 Time main foo(0) foo(1) foo(2) int main() { int i; } for (i=0; i < 3; i++) foo(i); return 0; void foo(int i) { } if (i > 0) foo(i 1); Running program is interrupted periodically Timer interrupt, OS signal, or HWC overflow Service routine examines return-address stack Addresses are mapped to routines using symbol table information Statistical inference of program behavior Not very detailed information on highly volatile metrics Requires long-running applications Works with unmodified executable, but symbol table information is generally recommended Christian Rössel Slide 18

Instrumentation t 1 t 2 t 3 t 4 t 5 t 6 t 7 t 8 t 9 t 10 t 11 t 12 t 13 t 14 Time main foo(0) foo(1) foo(2) Measurement int main() { int i; Enter( main ); for (i=0; i < 3; i++) foo(i); Leave( main ); return 0; } void foo(int i) { Enter( foo ); if (i > 0) foo(i 1); Leave( foo ); } Measurement code is inserted such that every interesting event is directly captured Can be done in various ways Advantage: Much more detailed information Disadvantage: Processing of source-code / executable necessary Large relative overheads for small functions Christian Rössel Slide 19

Instrumentation techniques Static instrumentation Program is instrumented prior to execution Dynamic instrumentation Program is instrumented at runtime Code is inserted Manually Automatically By a preprocessor / source-to-source translation tool By a compiler By linking against a pre-instrumented library / runtime system By binary-rewrite / dynamic instrumentation tool Christian Rössel Slide 20

Critical issues Accuracy Perturbation Measurement alters program behavior E.g., memory access pattern Intrusion overhead Measurement itself needs time and thus lowers performance Accuracy of timers & counters Granularity How many measurements? How much information / work during each measurement? Tradeoff: Accuracy vs. Expressiveness of data Christian Rössel Slide 21

Classification of measurement techniques When are performance measurements triggered? Sampling Code instrumentation How is performance data recorded? Profiling / Runtime summarization Tracing Christian Rössel Slide 22

Profiling / Runtime summarization Recording of aggregated information Time Counts Function calls Hardware counters Over program and system entities Functions, call sites, basic blocks, loops, Processes, threads Profile = summation of events over time Christian Rössel Slide 23

Types of profiles Flat profile Shows distribution of metrics per function / instrumented region Calling context is not taken into account Call-path profile Shows distribution of metrics per call path Sometimes only distinguished by partial calling context (e.g., two levels) Special-purpose profiles Focus on specific aspects, e.g., MPI calls or OpenMP constructs Christian Rössel Slide 24

Tracing Recording information about significant points (events) during execution of the program Enter / leave of a region (function, loop, ) Send / receive a message, Save information in event record Timestamp, location, event type Plus event-specific information (e.g., communicator, sender / receiver, ) Abstract execution model on level of defined events Event trace = Chronologically ordered sequence of event records Christian Rössel Slide 25

Event tracing Process A void foo() { trc_enter("foo");... trc_send(b); send(b, tag, buf);... trc_exit("foo"); } MONITOR Local trace A... 58 ENTER 1 62 SEND B 64 EXIT 1... 1 foo... Global trace instrument Process B synchronize(d) Local trace B... 60 ENTER 1 merge void bar() { trc_enter("bar");... recv(a, tag, buf); trc_recv(a);... trc_exit("bar"); } MONITOR 68 RECV A 69 EXIT 1... 1 bar... unify

Example: Time-line visualization 1 foo 2 bar 3... main foo bar... 58 A ENTER 1 60 B ENTER 2 62 A SEND B 64 A EXIT 1 68 B RECV A 69 B EXIT 2 A B... 58 60 62 64 66 68 70 Christian Rössel Slide 27

Tracing vs. Profiling Tracing advantages Event traces preserve the temporal and spatial relationships among individual events ( context) Allows reconstruction of dynamic application behavior on any required level of abstraction Most general measurement technique Profile data can be reconstructed from event traces Disadvantages Traces can become very large Writing events to file at runtime causes perturbation Writing tracing software is complicated Event buffering, clock synchronization,... Christian Rössel Slide 28

No single solution is sufficient! A combination of different methods, tools and techniques is typically needed! Measurement Sampling / instrumentation, profiling / tracing,... Instrumentation Source code / binary, manual / automatic,... Analysis Statistics, visualization, automatic analysis, data mining,... Christian Rössel Slide 29

Outline Motivation Performance analysis basics Terminology Techniques Performance tools overview Christian Rössel Slide 31

Open-source tools (1) GNU gprof ompp mpip Easy to use profiling tool for gcc-based applications. Combination of sampling and instrumentation. Light-weight OpenMP profiling tool Web site: http://www.ompp-tool.com Scalable, light-weight MPI profiling library Web site: http://mpip.sourceforge.net Christian Rössel Slide 54

Open-source tools (2) Scalasca Scalable performance-analysis toolkit for parallel codes Supports profiling and tracing Web site: http://www.scalasca.org Vampir/VampirTrace TAU Interactive trace visualizer with powerful statistics (commercial) Tool set and runtime library to generate OTF traces Web site: http://www.vampir.eu, http://www.tu-dresden.de/zih/vampirtrace Very portable tool set for instrumentation, measurement and analysis of parallel applications Web site: http://tau.uoregon.edu Christian Rössel Slide 55

Open-source tools (3) Open SpeedShop Modular and extensible performance analysis tool set Web site: http://www.openspeedshop.org HPCToolkit Paraver Multi-platform statistical profiling package Web site: http://hpctoolkit.org Trace-based performance-analysis and visualization framework Web site: http://www.bsc.es/paraver PerfSuite, Periscope, IPM,... Christian Rössel Slide 56

Commercial / vendor-specific tools... Intel VTune (serial/multi-threaded) Intel Trace Analyzer and Collector (MPI) Cray Performance Toolkit: CrayPat / Apprentice2 IBM HPC Toolkit Oracle Performance Analyzer Rogue Wave SlowSpotter / ThreadSpotter Christian Rössel Slide 57

Acknowledgments Bernd Mohr, Brian Wylie, Markus Geimer (JSC) Andreas Knüpfer, Jens Doleschal (TU Dresden) Sameer Shende (University of Oregon) Luiz DeRose (Cray) Christian Rössel Slide 59