Performance analysis with Periscope M. Gerndt, V. Petkov, Y. Oleynik, S. Benedict Technische Universität München September 2010
Outline Motivation Periscope architecture Periscope performance analysis model Performance analysis strategies in Periscope Periscope GUI
Motivation Common performance analysis procedure on Power6 systems Use Tprof to pinpoint time-consuming subroutines Use Xprofiler to understand call graph; mpitrace for MPI comm Use hpmcount (libhpm) to measure HW Counters Problem Routine, error-prone and time-consuming Requires deep HW knowledge Mostly post-development process Hard to map bottlenecks to their source code location Solution Automate the performance analysis Integrate parallel application development and performance analysis within the same IDE
Periscope Iterative online analysis Measurements are configured, obtained and evaluated on the fly no need to store trace files Distributed architecture Reduced network overhead Analysis performed by multiple distributed hierarchical agents Automatic bottlenecks search Based on performance optimization experts' knowledge Single-node Performance on Intel Itanium6, IBM Power6, x86-s MPI Communication OpenMP Performance Enhanced Eclipse-based GUI Instrumentation: Fortran, C/C++; MPI / OpenMP / Hybrid
Distributed architecture Graphical User Interface Interactive frontend Eclipse-based GUI Analysis control Agents network Monitoring Request Interface Application
GUI Analysis Agents Instrumented Application Start Location Candidate Properties Monitoring Requests Refinement Precision Performance Measurements Final Properties Report Proven Properties Raw performance data Analysis
Automatic search for bottlenecks Automation based on formalized expert knowledge Efficient search algorithms Performance property Condition Confidence Severity strategies Performance analysis strategies Itanium2 Stall Cycle Analysis IBM POWER6 Single Core Performance Analysis MPI Communication Pattern Analysis Generic Memory Strategy OpenMP-based Performance Analysis Scalability Analysis OpenMP codes
POWER6 Single Core Performance Properties Hot spot of the application Cycles lost due to cache misses Average amount of cycles lost per L1 miss High L1 demand load miss rate High L2 demand load miss rate High L3 demand load miss rate Cycles lost due to address translation misses Cycles lost due to store instructions Cycles lost due to Floating Point instructions inefficiencies Cycles lost due to Integer multiplications and divisions Cycles lost due to no instruction to dispatch
Itanium2 Stall Cycle Properties IA64 Pipeline Stall Cycles Stalls due to pipeline flush Stalls due to branch misprediction flush Stalls due to exception flush Stalls due to floating point exceptions or L1D TLB misses Stalls due to Flush to zero or SIR stalls Stalls due to L1D TLB misses... Stalls due to waiting for data delivery to register Stalls due to waiting for integer register Stalls due to waiting for integer results Stalls due to waiting for FP register Stalls due to waiting for integer loads L3 misses dominate data access L2 misses L3 misses Stalls due to register stack engine
MPI Communication Patterns Analysis Automatic detection of wait patterns Measurement on the fly No tracing required! p1 MPI_Recv p2 MPI_Send
MPI Performance Properties Excessive MPI time in receive due to late sender Excessive MPI time due to late root in broadcast Excessive MPI time in root due to late process in reduce Excessive MPI time in... (1xN, Nx1, 1x1, NxN) Excessive MPI time due to many small messages Excessive MPI communication time
OpenMP-based Performance Properties Searches OpenMP-based perf. problems in a single step Properties are divided into four major domains Startup and Shutdown Overhead Load Imbalance in OpenMP regions: Parallel region Parallel loop Explicit barrier Parallel sections Not enough sections Uneven sections Seq. Computation in parallel regions: Master region Single region Ordered loop OpenMP Synchronization properties Critical section overhead property Frequent atomic property OpenMP Tasking analysis under development
Scalability Analysis OpenMP codes Identifies the OpenMP code regions that do not scale well Scalability Analysis is done by the frontend No need to manually configure the runs and find the speedup! Frontend initialization Frontend.run() i. Starts application ii.starts analysis agents iii.receives found properties n After 2 n runs Extracts information from the found properties Does Scalability Analysis Exports the Properties GUI-based Analysis
Scalability Analysis Properties Meta-Properties Properties occurring in all configurations Property with increasing severity across the configurations Speedup-based Prop. Linear Speedup Super linear Speedup Linear Speedup failed for the first time Speedup Decreasing Exp. specific properties Code region with the lowest speedup Low Speedup based on a threshold
Graphical User Interface Integrates with the Eclipse Development Platform Open-source, extensible and very popular IDE Supports different programming languages: C/C++, Fortran, etc. Uses the Eclipse Parallel Tools Platform (PTP) which provides a higher-level abstraction of the underlying parallel system Designed to combine: Performance measurement functionality of Periscope Advanced IDE functions like code indexing, refactoring, etc. Features Multi-functional table to display the detected bottlenecks Outline of the instrumented code regions Clustering techniques to get classes of similarly behaving processes Supports both local and remote projects Higher-level configuration and execution of performance experiments
Graphical User Interface Source code view Project view SIR outline view Properties view
Periscope GUI: Properties Table Simple and clean tree-based overview Multi-level grouping Complex data filtering Multiple criteria sorting algorithm Navigation from the properties to their source code location
Periscope GUI: Instrumentation Outline Resembles the code outline view of the Eclipse C/C++ Development Tooling Outlines the instrumented code regions and their nesting Shows the number of properties in each region Assists code navigation Filters the displayed properties
Periscope GUI: RDT and EFS Eclipse File System (EFS) Abstracts the underlying file system details Any supported file system can be used: Remote projects using SSH/FTP/DStore, Local, Zip, etc. Source files of the analyzed application reside only on the remote no need for synchronization Remote Development Tools (RDT) Part of Eclipse Parallel Tools Platform (PTP) Project Remote Compilation Remote Indexing Currently supports only C/C++ applications
Periscope GUI: Experiment Configuration External Tools Framework (ETFw) Part of Eclipse Parallel Tools Platform (PTP) Project More convenient environment using ETFw's Profile launch configuration no terminal access needed higher level configuration and automation possible
Clustering support Properties summarization Metaproperties Property 1 Cluster 2 Identify hidden behavior Weka workbench in the GUI: Waikato Environment for Knowledge Analysis Uses K-Means algorithm Groups properties based on CPU distribution and code region Results shown in a table view similar to the properties view Done also on the fly in the hierarchy Cluster 1 Cluster 3 Property 3 Cluster 1 CPUs: 7-10,16 Cluster 2 CPUs: 2-3,5,11,13-14 Cluster 3 CPUs:1,4,6,12,15 Property 2
Thank you for your attention! Current version 1.3 (New BSD License) Available under: http://www.lrr.in.tum.de/periscope/download Supported architectures SGI Altix 4700 Itanium2 IBM Power575 POWER6 x86-based architectures BlueGene/P under development Further information: Periscope web page: http://www.lrr.in.tum.de/periscope