Application Performance Analysis Tools and Techniques

Application Performance Analysis Tools and Techniques 2011-07-15 Jülich Supercomputing Centre c.roessel@fz-juelich.de

Performance analysis: an old problem "The most constant difficulty in contriving the engine has arisen from the desire to reduce the time in which the calculations were executed to the shortest which is possible." Charles Babbage (1791-1871) Slide 2

Today: the free lunch is over Moore's law is still in charge, but Clock rates no longer increase Performance gains only through increased parallelism Optimization of applications becomes more difficult Increasing application complexity Multi-physics Multi-scale Increasing machine complexity Hierarchical networks / memory More CPUs / multi-core Every doubling of scale reveals a new bottleneck! Slide 3

Example: XNS CFD simulation of unsteady flows Developed by CATS / RWTH Aachen Exploits finite-element techniques, unstructured 3D meshes, iterative solution strategies MPI parallel version >40,000 lines of Fortran & C DeBakey blood-pump data set (3,714,611 elements) Partitioned finite-element mesh Hæmodynamic flow pressure distribution Slide 4

XNS wait-state analysis on BG/L (2007) Slide 5

Outline Motivation Performance analysis basics Terminology Techniques Performance tools overview Outlook Slide 6

Performance factors of parallel applications Sequential factors Computation Choose right algorithm, use optimizing compiler Cache and memory Tough! Only limited tool support, hope compiler gets it right Input / output Often not given enough attention Parallel factors Communication (i.e., message passing) Threading Synchronization More or less understood, good tool support Slide 7

Tuning basics Successful tuning is a combination of The right algorithms and libraries Compiler flags and directives Thinking!!! Measurement is better than guessing To determine performance bottlenecks To validate tuning decisions and optimizations After each step! Slide 8

However... "We should forget about small efficiencies, say 97% of the time: premature optimization is the root of all evil." Donald Knuth It's easier to optimize a slow correct program than to debug a fast incorrect one: nobody cares how fast you can compute the wrong answer... Slide 9

Performance analysis workflow Preparation Prepare application, insert extra code (probes/hooks) Measurement Collection of data relevant to performance analysis Analysis Calculation of metrics, identification of performance problems Presentation Visualization of results in an intuitive/understandable form Optimization Elimination of performance problems Slide 10

The 80/20 rule Programs typically spend 80% of their time in 20% of the code Programmers typically spend 20% of their effort to get 80% of the total speedup possible for the application Know when to stop! Don't optimize what does not matter Make the common case fast! If you optimize everything, you will always be unhappy. Donald E. Knuth Slide 11

Metrics of performance What can be measured? A count of how many times an event occurs E.g., the number of MPI point-to-point messages sent The duration of some time interval E.g., the time spent in these send calls The size of some parameter E.g., the number of bytes transmitted by these calls Derived metrics E.g., rates / throughput Needed for normalization Slide 12

Example metrics Execution time Number of function calls CPI: clock ticks (cycles) per instruction MFLOPS: millions of floating-point operations executed per second But what counts as an operation? Math operations? HW operations? HW instructions? 32-/64-bit? Slide 13
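
As an illustration of such a derived metric, here is a minimal sketch of measuring CPI with the PAPI high-level counter API (assumptions: a PAPI installation whose platform supports the PAPI_TOT_CYC and PAPI_TOT_INS presets; do_work() is just a stand-in compute kernel):

    #include <stdio.h>
    #include <papi.h>

    static volatile double x = 0.0;

    static void do_work(void) {          /* stand-in compute kernel */
        int i;
        for (i = 0; i < 10000000; i++)
            x += i * 0.5;
    }

    int main(void) {
        int events[2] = { PAPI_TOT_CYC, PAPI_TOT_INS };
        long long values[2];

        if (PAPI_start_counters(events, 2) != PAPI_OK)
            return 1;
        do_work();
        if (PAPI_stop_counters(values, 2) != PAPI_OK)
            return 1;

        /* CPI = total cycles / total instructions retired */
        printf("CPI = %.2f\n", (double)values[0] / (double)values[1]);
        return 0;
    }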

Execution time Wall-clock time Includes waiting time: I/O, memory, other system activities In time-sharing environments also the time consumed by other applications CPU time Time spent by the CPU to execute the application Does not include time the program was context-switched out Problem: Does not include inherent waiting time (e.g., I/O) Problem: Portability? What is user time, what is system time? Problem: Execution time is non-deterministic Use mean or minimum of several runs Slide 14

Inclusive vs. exclusive values Inclusive: information of all sub-elements aggregated into a single value Exclusive: information that cannot be split up further

    int foo() {
      int a;
      a = a + 1;   /* counts toward exclusive time of foo */
      bar();       /* included in inclusive time of foo only */
      a = a + 1;   /* counts toward exclusive time of foo */
      return a;
    }

The inclusive time of foo() covers the entire body including the call to bar(); the exclusive time excludes the time spent inside bar(). Slide 15

Classification of measurement techniques When are performance measurements triggered? Sampling Code instrumentation How is performance data recorded? Profiling / Runtime summarization Tracing Slide 16

Sampling

    int main() {
      int i;
      for (i = 0; i < 3; i++)
        foo(i);
      return 0;
    }

    void foo(int i) {
      if (i > 0)
        foo(i - 1);
    }

[Figure: time line with samples t1...t9 taken while main, foo(0), foo(1), foo(2) execute]

The running program is interrupted periodically: by a timer interrupt, an OS signal, or a hardware counter (HWC) overflow. A service routine examines the return-address stack; addresses are mapped to routines using symbol-table information, yielding a statistical inference of program behavior. Not very detailed information on highly volatile metrics; requires long-running applications. Works with unmodified executables, but symbol-table information is generally recommended. Slide 17

Instrumentation

    int main() {
      int i;
      Enter("main");
      for (i = 0; i < 3; i++)
        foo(i);
      Leave("main");
      return 0;
    }

    void foo(int i) {
      Enter("foo");
      if (i > 0)
        foo(i - 1);
      Leave("foo");
    }

[Figure: time line with measurement points t1...t14 at every enter/leave of main, foo(0), foo(1), foo(2)]

Measurement code is inserted such that every interesting event is captured directly; this can be done in various ways. Advantage: much more detailed information. Disadvantages: processing of the source code / executable is necessary, and relative overheads are large for small functions. Slide 18

Instrumentation techniques Static instrumentation: the program is instrumented prior to execution Dynamic instrumentation: the program is instrumented at runtime Code is inserted manually or automatically: by a preprocessor / source-to-source translation tool, by a compiler, by linking against a pre-instrumented library / runtime system, or by a binary-rewrite / dynamic-instrumentation tool Slide 19

Critical issues Accuracy Perturbation: measurement alters program behavior, e.g., the memory access pattern Intrusion overhead: measurement itself needs time and thus lowers performance Accuracy of timers & counters Granularity How many measurements? How much information / work during each measurement? Tradeoff: accuracy vs. expressiveness of data Slide 20

Classification of measurement techniques When are performance measurements triggered? Sampling Code instrumentation How is performance data recorded? Profiling / Runtime summarization Tracing Slide 21

Profiling / Runtime summarization Recording of aggregated information Time Counts Function calls Hardware counters Over program and system entities Functions, call sites, basic blocks, loops, ... Processes, threads Profile = summation of events over time Slide 22

Types of profiles Flat profile Shows distribution of metrics per function / instrumented region Calling context is not taken into account Call-path profile Shows distribution of metrics per call path Sometimes only distinguished by partial calling context (e.g., two levels) Special-purpose profiles Focus on specific aspects, e.g., MPI calls or OpenMP constructs Slide 23

Tracing Recording information about significant points (events) during execution of the program Enter / leave of a region (function, loop, ...) Send / receive a message, ... Save information in event record Timestamp, location, event type Plus event-specific information (e.g., communicator, sender / receiver, ...) Abstract execution model on the level of defined events Event trace = chronologically ordered sequence of event records Slide 24
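
A sketch of what such an event record might look like in C; the layout and field names are illustrative only, not taken from any particular tracing library:

    #include <stdint.h>

    typedef enum { EV_ENTER, EV_LEAVE, EV_SEND, EV_RECV } event_type_t;

    typedef struct {
        uint64_t timestamp;        /* when the event occurred */
        uint32_t location;         /* process / thread that recorded it */
        event_type_t type;         /* enter / leave / send / receive */
        union {                    /* event-specific information */
            uint32_t region;       /* enter/leave: region identifier */
            struct {               /* send/recv: message details */
                uint32_t peer;     /* sender resp. receiver */
                uint32_t comm;     /* communicator */
                uint32_t tag;
                uint32_t bytes;
            } msg;
        } data;
    } event_record_t;

An event trace is then simply a sequence of such records, ordered by timestamp.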

Event tracing

    Process A:
    void foo() {
      trc_enter("foo");
      ...
      trc_send(B);
      send(B, tag, buf);
      ...
      trc_exit("foo");
    }

    Process B:
    void bar() {
      trc_enter("bar");
      ...
      recv(A, tag, buf);
      trc_recv(A);
      ...
      trc_exit("bar");
    }

Each instrumented process writes, via its monitor, a local trace:

    Local trace A (region 1 = foo):   58 ENTER 1   62 SEND B   64 EXIT 1
    Local trace B (region 1 = bar):   60 ENTER 1   68 RECV A   69 EXIT 1

The local traces are then merged and unified, with synchronized timestamps, into a global trace. Slide 25

Example: time-line visualization After merging, the global trace (1 = foo, 2 = bar) reads:

    58 A ENTER 1
    60 B ENTER 2
    62 A SEND B
    64 A EXIT 1
    68 B RECV A
    69 B EXIT 2

[Figure: time-line display, processes A and B over timestamps 58-70, showing the main/foo/bar regions and the message from A to B] Slide 26

Tracing vs. Profiling Tracing advantages Event traces preserve the temporal and spatial relationships among individual events ( context) Allows reconstruction of dynamic application behavior on any required level of abstraction Most general measurement technique Profile data can be reconstructed from event traces Disadvantages Traces can become very large Writing events to file at runtime causes perturbation Writing tracing software is complicated Event buffering, clock synchronization,... Slide 27

No single solution is sufficient! A combination of different methods, tools and techniques is typically needed: Instrumentation: source code / binary, manual / automatic, ... Measurement: sampling / instrumentation, profiling / tracing, ... Analysis: statistics, visualization, automatic analysis, data mining, ... Slide 28

Typical performance analysis procedure Do I have a performance problem at all? Time / speedup / scalability measurements What is the main bottleneck (computation / communication)? MPI / OpenMP / flat profiling Where is the main bottleneck? Call-path profiling, detailed basic-block profiling Why is it there? Hardware counter analysis; trace selected parts to keep trace size manageable Does my code have scalability problems? Load imbalance analysis; compare profiles at various sizes function-by-function Slide 29

Outline Motivation Performance analysis basics Terminology Techniques Performance tools overview Outlook Slide 30

Fall-back: Home-grown performance tool Simple measurements can always be performed C/C++: times(), clock(), gettimeofday(), clock_gettime(), getrusage() Fortran: etime, cpu_time, system_clock However,... Use these functions rarely and only for coarse-grained timings Avoid do-it-yourself solutions for detailed measurements Use dedicated tools instead Typically more powerful Might use platform-specific timers with better resolution and/or lower overhead Slide 31
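
For example, a coarse-grained wall-clock measurement with clock_gettime() from the list above might look like this minimal sketch (POSIX; on older glibc, link with -lrt):

    #include <stdio.h>
    #include <time.h>

    int main(void) {
        struct timespec t0, t1;
        double elapsed;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* ... code region to be timed ... */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        elapsed = (t1.tv_sec - t0.tv_sec)
                + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("elapsed: %.6f s\n", elapsed);
        return 0;
    }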

GNU gprof Shows where the program spends its time Indicates which functions are candidates for tuning Identifies which functions are being called Can be used on large and complex programs Method Compiler inserts code to count each function call (compile with -pg); shows up in the profile as mcount Time information is gathered using sampling The current function and its parents (two levels) are determined Program execution generates a gmon.out file To dump a human-readable profile to stdout, call gprof <executable> <gmon.out file> Slide 32
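
A typical session, as a sketch (compiler and program names are placeholders):

    gcc -pg -O2 -o myprog myprog.c   # instrument: compiler inserts mcount calls
    ./myprog                         # run: writes gmon.out
    gprof myprog gmon.out            # dump flat profile and call graph to stdout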

GNU gprof: flat profile Each sample counts as 0.01 seconds.

      %   cumulative     self                self     total
    time     seconds  seconds    calls    ms/call   ms/call  name
    82.90      31.55    31.55    27648       1.14      1.14  pc_jac2d_blk3_
     7.54      34.42     2.87        1    2870.00  37910.00  cg3_blk_
     5.96      36.69     2.27    28672       0.08      0.08  matxvec2d_
     2.13      37.50     0.81    55296       0.01      0.01  dot_prod2d_
     0.87      37.83     0.33       54       6.11      6.11  add_exchange2d_
     0.34      37.96     0.13                                main
     0.21      38.04     0.08       27       2.96      2.96  cs_jac2d_blk3_
    ...

Columns: percentage of total time, time sum (cumulative), exclusive time (the sorting criterion), number of calls, average exclusive time per call, average inclusive time per call. Slide 33

GNU gprof: (partial) call-path profile

    index  %time    self  descendents  called/total     name
                    0.17     2.83           1/1         .start [2]
    [1]     99.3    0.17     2.83           1           .main [1]
                    2.83     0.00          10/10        .relax [3]
                    0.00     0.00           3/13        .printf [15]
                    0.00     0.00           1/1         .saveOutput [27]
                    0.00     0.00           2/1523      .malloc [5]
                    0.00     0.00           1/1         .readInput [40]
                    0.00     0.00          10/20        .update [68]
    ...

For each entry ([1] = .main): percentage of total time, exclusive time, and time spent in children. The line above shows the parent (call site) with its number of calls out of the total number of calls; the lines below list the called children with the number of calls from this call site (the sorting criterion) out of the total number of calls. Slide 34

OpenMP profiling: ompp Light-weight OpenMP profiling tool Reports execution times and counts for various OpenMP constructs Presented in terms of the user model, not the system model Provides additional analyses Relies on source-code instrumentation Re-compilation necessary License: GPL-2 Web site: http://www.ompp-tool.com Slide 35

ompp: advanced analyses Overhead analysis Four well-defined overhead categories of parallel execution: imbalance, synchronization, limited parallelism, thread management Scalability analysis Analyze overheads for increasing thread counts Performance properties detection Automatically detect common inefficiency situations Analysis for individual parallel regions or the whole program Incremental profiling Profiling-over-time adds a temporal dimension Slide 36

ompp: text output example Flat OpenMP construct profile:

    R00002 main.c (34-37) (default) CRITICAL
     TID      execT    execC      bodyT     enterT      exitT  PAPI_TOT_INS
       0       3.00        1       1.00       2.00       0.00           159
       1       1.00        1       1.00       0.00       0.00          6347
       2       2.00        1       1.00       1.00       0.00          1595
       3       4.00        1       1.00       3.00       0.00          1595
     SUM      10.01        4       4.00       6.00       0.00         11132

Call-path profile:

    Incl. time
    32.22 (100.0%)            [APP 4 threads]
    32.06 (99.50%)  PARALLEL  +-R00004 main.c (42-46)
    10.02 (31.10%)  USERREG     -R00001 main.c (19-21) ('foo1')
    10.02 (31.10%)  CRITICAL      +-R00003 main.c (33-36) (unnamed)
    16.03 (49.74%)  USERREG     +-R00002 main.c (26-28) ('foo2')
    16.03 (49.74%)  CRITICAL      +-R00003 main.c (33-36) (unnamed)

Slide 37
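
A toy program of the kind that yields such a CRITICAL entry might look like the following sketch once re-compiled with ompp's source instrumentation (enterT is the time threads wait to enter the serialized body, bodyT the time spent inside it):

    #include <omp.h>
    #include <unistd.h>

    int main(void) {
    #pragma omp parallel
        {
    #pragma omp critical
            {
                sleep(1);   /* serialized body: later threads accumulate enterT */
            }
        }
        return 0;
    }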

MPI profiling: mpiP Scalable, light-weight MPI profiling library Generates a detailed text summary of MPI behavior Time spent at each MPI function call site Bytes sent by each MPI function call site (if applicable) MPI I/O statistics Configurable traceback depth for function call sites Controllable from the program using MPI_Pcontrol() Allows profiling of just one code module or a single iteration Uses the standard PMPI interface: only re-linking of the application is necessary License: new BSD Web site: http://mpip.sourceforge.net Slide 38
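
For instance, limiting measurement to a single iteration via the standard MPI_Pcontrol() interface could look like this sketch (assumptions: mpiP's convention of treating argument 0 as "off" and 1 as "on"; do_step() is a hypothetical solver step):

    #include <mpi.h>

    extern void do_step(void);          /* hypothetical solver step */

    void time_loop(int nsteps) {
        int step;
        MPI_Pcontrol(0);                /* profiling off by default */
        for (step = 0; step < nsteps; step++) {
            if (step == 5)
                MPI_Pcontrol(1);        /* profile iteration 5 only */
            do_step();
            if (step == 5)
                MPI_Pcontrol(0);
        }
    }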

mpiP: text output example

    @ mpiP
    @ Version: 3.1.1
    // 10 lines of mpiP and experiment configuration options
    // 8192 lines of task assignment to BlueGene topology information
    @--- MPI Time (seconds) ------------------------------------------
    Task     AppTime    MPITime     MPI%
       0        37.7       25.2    66.89
    //...
    8191        37.6         26    69.21
       *    3.09e+05   2.04e+05    65.88
    @--- Callsites: 26 -----------------------------------------------
    ID  Lev  File/Address  Line  Parent_Funct         MPI_Call
     1    0  coarsen.c      542  hypre_structcoarsen  Waitall
    // 25 similar lines
    @--- Aggregate Time (top twenty, descending, milliseconds) -------
    Call      Site       Time    App%   MPI%    COV
    Waitall     21   1.03e+08   33.27  50.49   0.11
    Waitall      1   2.88e+07    9.34  14.17   0.26
    // 18 similar lines

Slide 39

mpiP: text output example (cont.)

    @--- Aggregate Sent Message Size (top twenty, descending, bytes) -
    Call       Site      Count      Total     Avrg   Sent%
    Isend        11  845594460   7.71e+11      912   59.92
    Allreduce    10      49152   3.93e+05        8    0.00
    // 6 similar lines
    @--- Callsite Time statistics (all, milliseconds): 212992 --------
    Name     Site  Rank      Count  Max   Mean       Min   App%   MPI%
    Waitall    21     0     111096  275    0.1  0.000707  29.61  44.27
    //...
    Waitall    21  8191      65799  882   0.24  0.000707  41.98  60.66
    Waitall    21     *  577806664  882  0.178  0.000703  33.27  50.49
    // 213,042 similar lines
    @--- Callsite Message Sent statistics (all, sent bytes) ----------
    Name   Site  Rank      Count        Max   Mean  Min        Sum
    Isend    11     0      72917  2.621e+05  851.1    8  6.206e+07
    //...
    Isend    11  8191      46651  2.621e+05   1029    8  4.801e+07
    Isend    11     *  845594460  2.621e+05  911.5    8  7.708e+11
    // 65,550 similar lines

Slide 40

Scalasca Scalable performance-analysis toolkit for parallel codes Specifically targets large-scale applications running on 10,000s to 100,000s of cores Integrated performance-analysis process Performance overview via call-path profiles In-depth study of application behavior via event tracing Automatic trace analysis identifying wait states Switching between both options without re-compilation or re-linking Supports MPI 2.2 and basic OpenMP License: new BSD Web site: http://www.scalasca.org Slide 41
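
A typical Scalasca 1.x workflow, as a sketch (compiler, program name, and the experiment-directory name are illustrative):

    scalasca -instrument mpif90 -O2 -o xns *.f   # build with instrumentation
    scalasca -analyze mpirun -np 1024 ./xns      # measure: profile (or trace)
    scalasca -examine epik_xns_1024_sum          # explore the analysis report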

Scalasca architecture [Diagram] Source modules pass through the instrumenter (compiler / linker) to produce an instrumented executable. Running the instrumented target application with the measurement library (optionally reading hardware counters, HWC) yields either a summary report ("Which problem?") or local event traces; a parallel wait-state search over the traces produces a wait-state report ("Where in the program?", "Which process?"). Both reports can be explored and combined via report manipulation. Slide 42

Example pattern: Late Sender A blocking receive is called earlier than the corresponding send operation; the time the receiver spends blocked in MPI_Recv until the matching MPI_Send starts is waiting time. [Time-line diagrams: MPI_Send / MPI_Recv with the waiting time marked; the same situation with non-blocking calls MPI_Isend / MPI_Irecv, where the waiting time appears inside MPI_Wait] Also applies to non-blocking communication. Slide 43
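
A minimal program that provokes this pattern, as a sketch: rank 0 posts its receive immediately while rank 1 computes first, so rank 0's blocking time shows up as Late Sender waiting time.

    #include <mpi.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        int rank, buf = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* blocks here until rank 1 finally sends: Late Sender wait */
            MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            sleep(2);            /* stand-in for computation */
            buf = 42;
            MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }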

Example: Sweep3D trace analysis on BG/P with 288k cores Slide 44

VampirTrace / Vampir VampirTrace Tool set and runtime library to generate OTF traces Supports MPI, OpenMP, POSIX threads, Java, CUDA,... Also distributed as part of Open MPI since v1.3 License: new BSD Web site: http://www.tu-dresden.de/zih/vampirtrace Vampir / VampirServer Interactive trace visualizer with powerful statistics License: commercial Web site: http://www.vampir.eu Slide 45
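
Basic usage, as a sketch (vtcc is VampirTrace's C compiler wrapper; file names are illustrative):

    vtcc -o myprog myprog.c      # compile and link with instrumentation
    ./myprog                     # run: writes an OTF trace (myprog.otf + streams)
    vampir myprog.otf            # open the trace in the visualizer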

Vampir toolset architecture [Diagram] Multi-core programs instrumented with VampirTrace write a trace file (OTF) that is visualized with Vampir 7; many-core programs write a trace bundle that is analyzed remotely with VampirServer. Slide 46

Vampir displays The main displays of Vampir are Master time-line Process and counter time-line Function summary Message summary Process summary Communication matrix Call tree All displays are coupled Always show the data for the current time interval Interactive zooming updates the displays automatically Slide 47

Vampir displays overview Slide 48

TAU Performance System Very portable tool set for instrumentation, measurement and analysis of parallel applications The Swiss army knife of performance analysis Instrumentation API supports choice between profiling and tracing of metrics (e.g., time, HW counters, ...) Supports C, C++, Fortran, HPF, HPC++, Java, Python; MPI, OpenMP, POSIX threads, Java, Win32, ... License: open source (historical permission notice and disclaimer) Web site: http://tau.uoregon.edu Slide 49
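
A typical TAU profiling run, as a sketch (the TAU_MAKEFILE path and configuration name depend on the installation):

    export TAU_MAKEFILE=$TAU/x86_64/lib/Makefile.tau-mpi-pdt
    tau_cc.sh -o myprog myprog.c    # compiler wrapper instruments via PDT
    mpirun -np 4 ./myprog           # run: writes profile.* files
    paraprof                        # browse the profiles in the GUI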

TAU components [Architecture diagram] Program analysis: PDT Parallel profile analysis: ParaProf Performance data management: PerfDMF Performance data mining: PerfExplorer Performance monitoring: TAUoverSupermon Slide 50

TAU profile analysis: ParaProf Slide 51

TAU data mining: PerfExplorer Strong negative linear correlation between CALC_CUT_BLOCK_CONTRIBUTIONS and MPI_Barrier Slide 52

Other open-source tools worth mentioning Open SpeedShop Modular and extensible performance analysis tool set Web site: http://www.openspeedshop.org HPCToolkit Multi-platform statistical profiling package Web site: http://hpctoolkit.org Paraver Trace-based performance-analysis and visualization framework Web site: http://www.bsc.es/paraver PerfSuite, Periscope, IPM,... Slide 53

Commercial / vendor-specific tools Intel VTune (serial/multi-threaded) Intel Trace Analyzer and Collector (MPI) Cray Performance Toolkit: CrayPat / Apprentice2 IBM HPC Toolkit Oracle Performance Analyzer Rogue Wave SlowSpotter / ThreadSpotter... Slide 54

Upcoming workshops / tutorials 4th Workshop on Productivity and Performance (PROPER 2011) at EuroPar 2011 Conference, Bordeaux/France, August 30th, 2011 Tools for HPC Application Development See http://www.vi-hps.org/proper2011ws/ PRACE Summer School: Taking the most out of supercomputers CSC, Espoo, Finland, September 1st, 2011 Performance analysis tools for massively parallel applications 8th VI-HPS Tuning Workshop GRS, Aachen, Germany, September 5 9, 2011 Five-day bring-your-own-code workshop See http://www.vi-hps.org/vi-hps-tw8/ Slide 55

Acknowledgments Bernd Mohr, Brian Wylie, Markus Geimer (JSC) Andreas Knüpfer, Jens Doleschal (TU Dresden) Sameer Shende (University of Oregon) Luiz DeRose (Cray) Slide 56