OpenMP Tools API (OMPT) and HPCToolkit

John Mellor-Crummey
Department of Computer Science, Rice University
johnmc@rice.edu
SC13 OpenMP Birds of a Feather Session, November 19, 2013

OpenMP Tools Subcommittee
- Executive lead: Martin Schulz (LLNL)
- Technical leads: Alexandre Eichenberger (IBM), John Mellor-Crummey (Rice)
- Active subcommittee members: Nawal Copty (Oracle), Jim Cownie (Intel), Robert Dietrich (TU Dresden), Christian Iwainsky (TU Darmstadt), Xu Liu (Rice), Eugene Loh (Oracle), Daniel Lorenz (Juelich), Harald Servat (Barcelona Supercomputing Center)

Motivation
- Large gap between OpenMP source and implementation
  - calling context for code in parallel regions and tasks executed by worker threads is not readily available
  - ...
- Tools must bridge this gap to explain program performance

Calling Context is Distributed across Threads
[Figure: developer view vs. implementation view]

Obstacles for Runtime-independent Tools
- No standard API for OpenMP tools
- Principal prior efforts
  - POMP: Mohr, Malony, Shende, Wolf
  - Collector API: Itzkowitz, Mazurov, Copty, Lin
- Differences in OpenMP implementations
  - support for static linking
  - Intel: extra monitor thread
  - PGI: cactus stack
  - IBM: neither

Outline
- OMPT: emerging performance tool API for OpenMP
  - overview and goals
  - state tracking
  - event notification
  - API
- Status and next steps

OMPT Design Objectives
- Enable tools to gather information and associate costs with application source and runtime system
  - construct low-overhead tools based on asynchronous sampling
  - identify application stack frames vs. runtime frames
  - associate a thread's activity at any point with a descriptive state: parallel work, idle, lock wait, ...
- Negligible overhead if the OMPT interface is not in use
- Define support for trace-based performance tools
- Don't impose an unreasonable development burden on
  - runtime implementers
  - tool developers

OMPT Performance Tools API Overview and Goals
- Focus on minimal set of functionality
  - provide essential support for sampling-based tools
  - only support tools attached at link-time or program launch
- Minimize runtime cost
  - reduce cost in runtime and tool where possible
  - encourage integration into optimized runtimes
  - make support for higher-overhead features optional
    - callbacks for blame shifting
    - callbacks for full-featured tracing tools

Major OMPT Functionality
- State tracking
  - have the runtime keep track of thread states (e.g., idle, parallel)
  - allow tools to query this state at any time (async signal safe)
  - provide (limited) persistent storage for tool data in the runtime system
- Call stack interpretation
  - provide hooks that enable tools to reconstruct application-level call stacks from implementation-level information
- Event notification
  - provide callbacks for predefined events
  - few mandatory notifications
  - many optional ones

Runtime State Tracking
- OpenMP runtime keeps track of its own state
  - predefined states on next slide
- Query routine
  - ompt_state_t ompt_get_state(ompt_wait_id_t *wait_id)
  - async signal safe
- Wait IDs
  - only returned for states that signify waiting
  - identify the cause for waiting (lock, critical region, ...)
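
A minimal sketch of how a sampling-based tool might use the state query. Only the signature of ompt_get_state comes from the slide. The enum values (demo_state_*), the mock runtime variables, and the helper names on_sample and run_state_demo are hypothetical stand-ins so the example runs without an OMPT-enabled runtime.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in types and state values: hypothetical, for this sketch only. */
typedef uint64_t ompt_wait_id_t;
typedef enum {
    demo_state_work_parallel = 1,
    demo_state_idle          = 2,
    demo_state_wait_lock     = 3,
    demo_state_count
} ompt_state_t;

/* Mock runtime state; a real runtime would track this per thread. */
static ompt_state_t   mock_state   = demo_state_work_parallel;
static ompt_wait_id_t mock_wait_id = 0;

/* Mock with the signature shown on the slide. The wait ID is only
   written for a state that signifies waiting. */
ompt_state_t ompt_get_state(ompt_wait_id_t *wait_id)
{
    if (wait_id && mock_state == demo_state_wait_lock)
        *wait_id = mock_wait_id;
    return mock_state;
}

/* Tool side: a histogram of samples per state. */
static unsigned long  samples_per_state[demo_state_count];
static ompt_wait_id_t last_wait_id;

/* What a sample handler might do: query the state, count the sample,
   and remember what the thread was waiting on. */
void on_sample(void)
{
    ompt_wait_id_t wait_id = 0;
    ompt_state_t s = ompt_get_state(&wait_id);
    samples_per_state[s]++;
    if (s == demo_state_wait_lock)
        last_wait_id = wait_id;
}

/* Drives the mock through two states and checks the histogram. */
int run_state_demo(void)
{
    on_sample();                        /* thread is doing parallel work */
    mock_state   = demo_state_wait_lock; /* thread now blocks on a lock */
    mock_wait_id = 0xABCD;
    on_sample();
    on_sample();
    assert(samples_per_state[demo_state_work_parallel] == 1);
    assert(samples_per_state[demo_state_wait_lock] == 2);
    assert(last_wait_id == 0xABCD);
    return 0;
}
```

In a real tool, on_sample would run inside a signal handler, which is why the slide stresses that the query is async signal safe. The counters here are plain globals for brevity; a real handler would likely use per-thread storage.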

OMPT Runtime States
- Work: serial, parallel, reduction
- Idle
- Barrier: implicit, explicit, either
- Task wait: task wait, task group wait
- Mutual exclusion: wait lock, wait nest lock, wait critical, wait atomic, wait ordered
- Overhead
- Miscellaneous: first, undefined

OMPT Event Notifications
- Mandatory events
- Blame-shifting events (optional)
- Trace events (optional)

Mandatory Events
- Essential support for any performance tool
  - Threads
  - Parallel regions
  - Tasks
  - Runtime shutdown
  - User-level control API (e.g., support tool start/stop)
- Event kinds: create/exit event pairs, singleton events

Blame-shifting Events (Optional)
- Support designed for sampling-based performance tools
- Idle
- Wait: barrier, taskwait, taskgroup wait
- Release: lock, nest lock, critical, atomic, ordered section
- Event kinds: begin/end event pairs, singleton events

Directed Blame Shifting
- Example: threads waiting at a lock are the symptom; the cause is the lock holder
- Approach: blame lock waiting on the lock holder
  - accumulate samples in a global hash table indexed by the lock address
  - lock holder accepts these samples when it releases the lock
[Figure: timeline labeled fork, lockwait, join, acquire lock, release lock]

Example: Directed Blame Shifting for
- Blame a lock holder for delaying waiting threads
  - charge all samples that threads receive while awaiting a lock to the lock itself
  - when releasing a lock, accept all of the blame at the lock
[Figure annotations: "the waiting occurs here (symptom)"; "almost all blame for the waiting is attributed here (cause)"]
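
The blame-shifting scheme described on the two slides above can be sketched as a small table keyed by lock address. This is an illustrative, single-threaded sketch and not HPCToolkit's implementation: the table size, the hash, and the names sample_while_waiting, accept_blame_on_release, and run_blame_demo are all assumptions. A real tool would need atomic updates, because waiting threads and the lock holder touch the table concurrently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 64   /* power of two, sized for the sketch only */

typedef struct {
    const void   *lock;   /* key: lock address, NULL if the slot is empty */
    unsigned long blame;  /* samples received by threads waiting on it */
} blame_slot_t;

static blame_slot_t table[TABLE_SIZE];

/* Open addressing with linear probing; the lock address must be non-NULL. */
static blame_slot_t *find_slot(const void *lock)
{
    size_t start = ((uintptr_t)lock >> 4) & (TABLE_SIZE - 1);
    for (size_t probes = 0; probes < TABLE_SIZE; probes++) {
        blame_slot_t *s = &table[(start + probes) & (TABLE_SIZE - 1)];
        if (s->lock == lock)
            return s;
        if (s->lock == NULL) {      /* claim an empty slot for this lock */
            s->lock = lock;
            return s;
        }
    }
    return NULL;                    /* table full */
}

/* Called when a sample lands in a thread that is waiting for `lock`:
   charge the sample to the lock instead of the waiting thread. */
void sample_while_waiting(const void *lock)
{
    blame_slot_t *s = find_slot(lock);
    if (s)
        s->blame++;
}

/* Called by the lock holder when it releases `lock`: take over all
   blame accumulated at the lock and reset the counter. */
unsigned long accept_blame_on_release(const void *lock)
{
    blame_slot_t *s = find_slot(lock);
    if (!s)
        return 0;
    unsigned long accepted = s->blame;
    s->blame = 0;
    return accepted;
}

/* Two locks: three samples arrive while waiting on one, one on the other. */
int run_blame_demo(void)
{
    int lock_a = 0, lock_b = 0;     /* only their addresses matter */
    sample_while_waiting(&lock_a);
    sample_while_waiting(&lock_a);
    sample_while_waiting(&lock_b);
    sample_while_waiting(&lock_a);
    assert(accept_blame_on_release(&lock_a) == 3);
    assert(accept_blame_on_release(&lock_a) == 0);  /* already accepted */
    assert(accept_blame_on_release(&lock_b) == 1);
    return 0;
}
```

The holder would attribute the returned count to its own calling context at the release point, which is how the cost of waiting ends up charged to the cause rather than the symptom.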

Trace Events (Optional)

Parallel Region and Task IDs
- Unique ID for each parallel region instance and task instance
- Ability to query parallel region and task IDs
  - ompt_parallel_id_t ompt_get_parallel_id(int ancestor_level)
  - ompt_task_id_t ompt_get_task_id(int ancestor_level)
  - current task: ancestor_level = 0
  - query IDs of ancestor task/region using higher ancestor levels
- async signal safe
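
A sketch of walking outward through enclosing parallel regions with increasing ancestor levels. The signature of ompt_get_parallel_id is from the slide; the mock region IDs, the helper collect_region_chain, and the convention that an ID of 0 means "no region at this level" are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t ompt_parallel_id_t;

/* Mock nesting, innermost region first. The IDs are made up. */
static const ompt_parallel_id_t mock_regions[] = { 42, 17, 3 };
#define MOCK_DEPTH ((int)(sizeof mock_regions / sizeof mock_regions[0]))

/* Mock with the signature shown on the slide.
   Assumption: returns 0 when there is no region at the requested level. */
ompt_parallel_id_t ompt_get_parallel_id(int ancestor_level)
{
    if (ancestor_level < 0 || ancestor_level >= MOCK_DEPTH)
        return 0;
    return mock_regions[ancestor_level];
}

/* Collect the IDs of the current region (level 0) and its ancestors
   into `out`, writing at most `max` entries. Returns the number found. */
int collect_region_chain(ompt_parallel_id_t *out, int max)
{
    int depth = 0;
    while (depth < max) {
        ompt_parallel_id_t id = ompt_get_parallel_id(depth);
        if (id == 0)
            break;              /* walked past the outermost region */
        out[depth++] = id;
    }
    return depth;
}

int run_region_demo(void)
{
    ompt_parallel_id_t chain[8];
    int n = collect_region_chain(chain, 8);
    assert(n == 3);
    assert(chain[0] == 42);     /* current region */
    assert(chain[2] == 3);      /* outermost region */
    return 0;
}
```

The same loop shape would apply to ompt_get_task_id for the chain of ancestor tasks.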

Call Stack Interpretation

Miscellaneous API Features
- Tool-facing API functions
  - initialization
    - int ompt_initialize(ompt_function_lookup_t ompt_fn_lookup, const char *version, int ompt_version)
    - int ompt_set_callback(ompt_event_t e, ompt_callback_t cb)
  - state enumeration
    - int ompt_enumerate_state(int current_state, int *next_state, const char **next_state_name)
- Tool control
  - void ompt_control(uint64_t command, uint64_t modifier)
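
A sketch of how a tool might discover the states a runtime supports through the enumeration call. The signature of ompt_enumerate_state is from the slide. The starting sentinel DEMO_STATE_FIRST, the state table, and the return convention (nonzero while another state was produced, 0 when the list is exhausted) are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_STATE_FIRST 0   /* hypothetical sentinel that starts enumeration */

/* Mock list of states a runtime might report. IDs and names are made up. */
static const struct { int id; const char *name; } mock_states[] = {
    { 1, "work_serial" },
    { 2, "work_parallel" },
    { 3, "idle" },
    { 4, "wait_lock" },
};
#define MOCK_COUNT (sizeof mock_states / sizeof mock_states[0])

/* Mock with the signature shown on the slide.
   Assumption: returns 1 and fills the outputs while a next state exists,
   returns 0 once the list is exhausted or the current state is unknown. */
int ompt_enumerate_state(int current_state, int *next_state,
                         const char **next_state_name)
{
    size_t i = 0;
    if (current_state != DEMO_STATE_FIRST) {
        while (i < MOCK_COUNT && mock_states[i].id != current_state)
            i++;
        i++;                    /* step past the current state */
    }
    if (i >= MOCK_COUNT)
        return 0;
    *next_state      = mock_states[i].id;
    *next_state_name = mock_states[i].name;
    return 1;
}

/* Tool side: collect the names of all states, up to `max` entries. */
int list_states(const char **names, int max)
{
    int state = DEMO_STATE_FIRST;
    int n = 0;
    const char *name;
    while (n < max && ompt_enumerate_state(state, &state, &name))
        names[n++] = name;
    return n;
}

int run_enumerate_demo(void)
{
    const char *names[16];
    int n = list_states(names, 16);
    assert(n == 4);
    assert(strcmp(names[0], "work_serial") == 0);
    assert(strcmp(names[3], "wait_lock") == 0);
    return 0;
}
```

Enumerating at startup lets a tool build its state-name table without hard-coding values that may differ between runtimes.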

OMPT Status and Next Steps
- Subcommittee approved the document a few weeks ago
  - a few recent text changes based on late comments
  - available on the WWW for comment: http://bit.ly/ompt-tr
- Nov 7 ARB meeting
  - tools group approved as subcommittee of language committee
- Submit it to OpenMP language committee for official comment
  - turn it into an official OpenMP TR
- Runtime implementations
  - IBM implementation of OMPT interface for BG/Q and Power
  - Rice and Oregon prototype implementation of OMPT in Intel runtime
- Tools
  - Rice University's HPCToolkit (development branch)
  - University of Oregon's TAU