Performance analysis with Periscope

Similar documents
Search Strategies for Automatic Performance Analysis Tools

Recent Advances in Periscope for Performance Analysis and Tuning

A Pattern-Based Approach to. Automated Application Performance Analysis

How To Visualize Performance Data In A Computer Program

End-user Tools for Application Performance Analysis Using Hardware Counters

Program Grid and HPC5+ workshop

Scalability evaluation of barrier algorithms for OpenMP

Spring 2011 Prof. Hyesoon Kim

Introduction to application performance analysis

Debugging with TotalView

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach

Developing Parallel Applications with the Eclipse Parallel Tools Platform

Prototype Visualization Tools for Multi- Experiment Multiprocessor Performance Analysis of DoD Applications

A Pattern-Based Approach to Automated Application Performance Analysis

Data Structure Oriented Monitoring for OpenMP Programs

Basics of VTune Performance Analyzer. Intel Software College. Objectives. VTune Performance Analyzer. Agenda

Introducing the IBM Software Development Kit for PowerLinux

LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp.

Multicore Parallel Computing with OpenMP

PERFORMANCE TOOLS DEVELOPMENTS

Full and Para Virtualization

UTS: An Unbalanced Tree Search Benchmark

HPC Wales Skills Academy Course Catalogue 2015

Simple Solution for a Location Service. Naming vs. Locating Entities. Forwarding Pointers (2) Forwarding Pointers (1)

The Design and Implementation of Scalable Parallel Haskell

OpenMP & MPI CISC 879. Tristan Vanderbruggen & John Cavazos Dept of Computer & Information Sciences University of Delaware

Unified Performance Data Collection with Score-P

Application Performance Analysis Tools and Techniques

BLM 413E - Parallel Programming Lecture 3

Operating System Impact on SMT Architecture

Architecture of Hitachi SR-8000

Next Generation GPU Architecture Code-named Fermi

Instruction Set Architecture (ISA)

Advanced Operating Systems CS428

Profiling and Testing with Test and Performance Tools Platform (TPTP)

NVIDIA Tools For Profiling And Monitoring. David Goodwin

Naming vs. Locating Entities

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Tool - 1: Health Center

Multilevel Load Balancing in NUMA Computers

Analyzing Overheads and Scalability Characteristics of OpenMP Applications

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS Hermann Härtig

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

2: Computer Performance

Perfmon2: A leap forward in Performance Monitoring

Scheduling Task Parallelism" on Multi-Socket Multicore Systems"

Data-Flow Awareness in Parallel Data Processing

for High Performance Computing

POWER8 Performance Analysis

VLIW Processors. VLIW Processors

FPGA area allocation for parallel C applications

A Brief Survery of Linux Performance Engineering. Philip J. Mucci University of Tennessee, Knoxville

Improving Time to Solution with Automated Performance Analysis

An Implementation Of Multiprocessor Linux

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

Basic Concepts in Parallelization

Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two

Optimization on Huygens

Petascale Software Challenges. Piyush Chaudhary High Performance Computing

IA-64 Application Developer s Architecture Guide

MAQAO Performance Analysis and Optimization Tool

Improving System Scalability of OpenMP Applications Using Large Page Support


Equalizer. Parallel OpenGL Application Framework. Stefan Eilemann, Eyescale Software GmbH

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup

OpenMP Programming on ScaleMP

Mitglied der Helmholtz-Gemeinschaft. System monitoring with LLview and the Parallel Tools Platform

A Scalable Approach to MPI Application Performance Analysis

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

RevoScaleR Speed and Scalability

GPU Tools Sandra Wienke

MPI and Hybrid Programming Models. William Gropp

Software and the Concurrency Revolution

ENEA BARE METAL PERFORMANCE TOOLS FOR NETLOGIC XLP AND CAVIUM OCTEON PLUS

ANDROID DEVELOPER TOOLS TRAINING GTC Sébastien Dominé, NVIDIA

Integrating TAU With Eclipse: A Performance Analysis System in an Integrated Development Environment

Introduction to GPU Programming Languages

Tools for Performance Debugging HPC Applications. David Skinner

Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters

Software of the Earth Simulator

Instrumentation Software Profiling

The Complete Performance Solution for Microsoft SQL Server

Memory Characterization to Analyze and Predict Multimedia Performance and Power in an Application Processor

Obtaining Hardware Performance Metrics for the BlueGene/L Supercomputer

Parallel Computing. Parallel shared memory computing with OpenMP

PAPI - PERFORMANCE API. ANDRÉ PEREIRA ampereira@di.uminho.pt

Performance Characteristics of Large SMP Machines

Pipelining Review and Its Limitations

Binary search tree with SIMD bandwidth optimization using SSE

Towards an Implementation of the OpenMP Collector API

Big Data Router for Real-Time Analytics

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

CS5460: Operating Systems. Lecture: Virtualization 2. Anton Burtsev March, 2013

Performance Application Programming Interface

NetVue Integrated Management System

Performance Monitoring of Parallel Scientific Applications

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

Transcription:

Performance analysis with Periscope M. Gerndt, V. Petkov, Y. Oleynik, S. Benedict Technische Universität München September 2010

Outline Motivation Periscope architecture Periscope performance analysis model Performance analysis strategies in Periscope Periscope GUI

Motivation Common performance analysis procedure on Power6 systems Use Tprof to pinpoint time-consuming subroutines Use Xprofiler to understand call graph; mpitrace for MPI comm Use hpmcount (libhpm) to measure HW Counters Problem Routine, error-prone and time-consuming Requires deep HW knowledge Mostly post-development process Hard to map bottlenecks to their source code location Solution Automate the performance analysis Integrate parallel application development and performance analysis within the same IDE

Periscope Iterative online analysis Measurements are configured, obtained and evaluated on the fly no need to store trace files Distributed architecture Reduced network overhead Analysis performed by multiple distributed hierarchical agents Automatic bottlenecks search Based on performance optimization experts' knowledge Single-node Performance on Intel Itanium6, IBM Power6, x86-s MPI Communication OpenMP Performance Enhanced Eclipse-based GUI Instrumentation: Fortran, C/C++; MPI / OpenMP / Hybrid

Distributed architecture Graphical User Interface Interactive frontend Eclipse-based GUI Analysis control Agents network Monitoring Request Interface Application

GUI Analysis Agents Instrumented Application Start Location Candidate Properties Monitoring Requests Refinement Precision Performance Measurements Final Properties Report Proven Properties Raw performance data Analysis

Automatic search for bottlenecks Automation based on formalized expert knowledge Efficient search algorithms Performance property Condition Confidence Severity strategies Performance analysis strategies Itanium2 Stall Cycle Analysis IBM POWER6 Single Core Performance Analysis MPI Communication Pattern Analysis Generic Memory Strategy OpenMP-based Performance Analysis Scalability Analysis OpenMP codes

POWER6 Single Core Performance Properties Hot spot of the application Cycles lost due to cache misses Average amount of cycles lost per L1 miss High L1 demand load miss rate High L2 demand load miss rate High L3 demand load miss rate Cycles lost due to address translation misses Cycles lost due to store instructions Cycles lost due to Floating Point instructions inefficiencies Cycles lost due to Integer multiplications and divisions Cycles lost due to no instruction to dispatch

Itanium2 Stall Cycle Properties IA64 Pipeline Stall Cycles Stalls due to pipeline flush Stalls due to branch misprediction flush Stalls due to exception flush Stalls due to floating point exceptions or L1D TLB misses Stalls due to Flush to zero or SIR stalls Stalls due to L1D TLB misses... Stalls due to waiting for data delivery to register Stalls due to waiting for integer register Stalls due to waiting for integer results Stalls due to waiting for FP register Stalls due to waiting for integer loads L3 misses dominate data access L2 misses L3 misses Stalls due to register stack engine

MPI Communication Patterns Analysis Automatic detection of wait patterns Measurement on the fly No tracing required! p1 MPI_Recv p2 MPI_Send

MPI Performance Properties Excessive MPI time in receive due to late sender Excessive MPI time due to late root in broadcast Excessive MPI time in root due to late process in reduce Excessive MPI time in... (1xN, Nx1, 1x1, NxN) Excessive MPI time due to many small messages Excessive MPI communication time

OpenMP-based Performance Properties Searches OpenMP-based perf. problems in a single step Properties are divided into four major domains Startup and Shutdown Overhead Load Imbalance in OpenMP regions: Parallel region Parallel loop Explicit barrier Parallel sections Not enough sections Uneven sections Seq. Computation in parallel regions: Master region Single region Ordered loop OpenMP Synchronization properties Critical section overhead property Frequent atomic property OpenMP Tasking analysis under development

Scalability Analysis OpenMP codes Identifies the OpenMP code regions that do not scale well Scalability Analysis is done by the frontend No need to manually configure the runs and find the speedup! Frontend initialization Frontend.run() i. Starts application ii.starts analysis agents iii.receives found properties n After 2 n runs Extracts information from the found properties Does Scalability Analysis Exports the Properties GUI-based Analysis

Scalability Analysis Properties Meta-Properties Properties occurring in all configurations Property with increasing severity across the configurations Speedup-based Prop. Linear Speedup Super linear Speedup Linear Speedup failed for the first time Speedup Decreasing Exp. specific properties Code region with the lowest speedup Low Speedup based on a threshold

Graphical User Interface Integrates with the Eclipse Development Platform Open-source, extensible and very popular IDE Supports different programming languages: C/C++, Fortran, etc. Uses the Eclipse Parallel Tools Platform (PTP) which provides a higher-level abstraction of the underlying parallel system Designed to combine: Performance measurement functionality of Periscope Advanced IDE functions like code indexing, refactoring, etc. Features Multi-functional table to display the detected bottlenecks Outline of the instrumented code regions Clustering techniques to get classes of similarly behaving processes Supports both local and remote projects Higher-level configuration and execution of performance experiments

Graphical User Interface Source code view Project view SIR outline view Properties view

Periscope GUI: Properties Table Simple and clean tree-based overview Multi-level grouping Complex data filtering Multiple criteria sorting algorithm Navigation from the properties to their source code location

Periscope GUI: Instrumentation Outline Resembles the code outline view of the Eclipse C/C++ Development Tooling Outlines the instrumented code regions and their nesting Shows the number of properties in each region Assists code navigation Filters the displayed properties

Periscope GUI: RDT and EFS Eclipse File System (EFS) Abstracts the underlying file system details Any supported file system can be used: Remote projects using SSH/FTP/DStore, Local, Zip, etc. Source files of the analyzed application reside only on the remote no need for synchronization Remote Development Tools (RDT) Part of Eclipse Parallel Tools Platform (PTP) Project Remote Compilation Remote Indexing Currently supports only C/C++ applications

Periscope GUI: Experiment Configuration External Tools Framework (ETFw) Part of Eclipse Parallel Tools Platform (PTP) Project More convenient environment using ETFw's Profile launch configuration no terminal access needed higher level configuration and automation possible

Clustering support Properties summarization Metaproperties Property 1 Cluster 2 Identify hidden behavior Weka workbench in the GUI: Waikato Environment for Knowledge Analysis Uses K-Means algorithm Groups properties based on CPU distribution and code region Results shown in a table view similar to the properties view Done also on the fly in the hierarchy Cluster 1 Cluster 3 Property 3 Cluster 1 CPUs: 7-10,16 Cluster 2 CPUs: 2-3,5,11,13-14 Cluster 3 CPUs:1,4,6,12,15 Property 2

Thank you for your attention! Current version 1.3 (New BSD License) Available under: http://www.lrr.in.tum.de/periscope/download Supported architectures SGI Altix 4700 Itanium2 IBM Power575 POWER6 x86-based architectures BlueGene/P under development Further information: Periscope web page: http://www.lrr.in.tum.de/periscope