Performance Analysis for GPU Accelerated Applications

Center for Information Services and High Performance Computing (ZIH)
Performance Analysis for GPU Accelerated Applications: Working Together for More Insight
Guido Juckeland (guido.juckeland@tu-dresden.de), Willersbau, Room A218, Tel. +49 351-463-39871

Outline: Motivation; Performance Analysis 101; Using Performance Tools for Accelerators; Examples; Summary & Outlook. Guido Juckeland Slide 2

MOTIVATION

Many High-Noon Situations. User: "I know what my code does!" System Provider: "Use my system efficiently!" Performance tools can provide an objective view.

Many High-Noon Situations (2). Tool Developer: "I need more information!" Hardware Vendor: "Why do you care?"

Reaching Higher with Cooperation

PERFORMANCE ANALYSIS 101

What do you want to know?

How to get it?
Analysis Layer: Analysis Technique
Data Presentation: Profile, Timeline
Data Recording: Summary, Log
Data Acquisition: Sample, Events

Sampling vs. Tracing
Sampling: Foo: Total Time 0.0815; Bar: Total Time 0.4711
Timeline (t): foo, bar, foo, bar, foo
Tracing: 2011/06/30 10:15:12.672865 Enter foo; 2011/06/30 10:15:12.894341 Leave foo
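The difference between the two acquisition techniques can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the timeline, its millisecond values, and the helper names `trace`, `sample`, and `summarize` are hypothetical and are not taken from the slides or from any tool.

```python
from collections import Counter

# Hypothetical execution: (function, enter time, leave time) in milliseconds.
TIMELINE = [("foo", 0, 20), ("bar", 20, 70), ("foo", 70, 85), ("bar", 85, 100)]

def trace(timeline):
    """Tracing: one timestamped record per enter and leave event."""
    log = []
    for name, enter, leave in timeline:
        log.append((enter, "Enter", name))
        log.append((leave, "Leave", name))
    return log

def sample(timeline, interval):
    """Sampling: probe the active function every `interval` ms and count hits."""
    end = timeline[-1][2]
    hits = Counter()
    t = 0
    while t < end:
        for name, enter, leave in timeline:
            if enter <= t < leave:
                hits[name] += 1
                break
        t += interval
    # Each hit is taken to stand for one full interval of run time.
    return {name: count * interval for name, count in hits.items()}

def summarize(log):
    """Reduce a trace to a profile: total time per function."""
    totals = Counter()
    open_since = {}
    for timestamp, kind, name in log:
        if kind == "Enter":
            open_since[name] = timestamp
        else:
            totals[name] += timestamp - open_since.pop(name)
    return dict(totals)

trace_log = trace(TIMELINE)
exact = summarize(trace_log)             # derived from the full event log
estimate = sample(TIMELINE, interval=10)  # statistical approximation
```

With this timeline the trace yields exact totals (foo 35 ms, bar 65 ms), while sampling every 10 ms estimates foo at 40 ms and bar at 60 ms. The trace keeps every event and grows with the number of calls. The sample stays small but only approximates the totals.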

Using Real Tools on Real Applications

Score-P as a Collaboration Between Tool Providers
Analysis tools: Vampir, Scalasca, TAU, Periscope
Interfaces to the tools: event traces (OTF2), call-path profiles (CUBE4, TAU), online interface, TAU adaptor
Score-P measurement infrastructure, with hardware counters (PAPI, rusage)
Instrumentation wrapper: MPI wrapper, compiler, TAU instrumentor, OPARI 2, user
Application (MPI, OpenMP, hybrid, serial)

Vampir 8 as an Example for Performance Data Visualization. Displays: Toolbars, Master Timeline, Function Summary, Process Timeline, Counter Data Timeline, Communication Matrix View, Function Legend, Process Summary, Context View.

USING PERFORMANCE TOOLS FOR ACCELERATORS

The Accelerator Challenge: Asynchronicity. Timeline (t): Host: main, start, main, synchronize. Accelerator Execution Queue: kernel. Accelerator: kernel.
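Why asynchronous execution is a problem for measurement can be shown with a small Python analogy. It is a sketch under assumptions, not accelerator code: a single worker thread stands in for the accelerator execution queue, `time.sleep` stands in for the kernel, and all names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def kernel(duration):
    """Stand-in for device work; sleeps instead of computing."""
    start = time.perf_counter()
    time.sleep(duration)
    return start, time.perf_counter()

# A single worker thread plays the role of the accelerator execution queue.
with ThreadPoolExecutor(max_workers=1) as queue:
    t0 = time.perf_counter()
    future = queue.submit(kernel, 0.05)   # "start": returns almost immediately
    t1 = time.perf_counter()
    dev_start, dev_end = future.result()  # "synchronize": blocks until done
    t2 = time.perf_counter()

launch_time = t1 - t0               # what host-side enter/leave around the start sees
kernel_time = dev_end - dev_start   # what actually ran on the stand-in device
sync_wait = t2 - t1                 # host time spent waiting in synchronize
```

Host-side timestamps around the start call capture only the enqueue cost. The kernel's real duration shows up either as waiting time in synchronize or not at all, unless the device itself reports when the kernel began and ended.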

Working Together with the Vendor: CUPTI. Diagram rows: CPU callback (register, start callback, sync, read back); CPU (test, test, test); CUPTI (reg. start, reg. sync); Accelerator (hardware counter event, kernel event). Similar things are possible for OpenCL.
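The register, start, sync, and read back steps on this slide can be mimicked with a toy callback mechanism in Python. Everything here is an assumption made for illustration: the `Runtime` and `Tool` classes and their methods are hypothetical and do not reproduce the CUPTI API, which is a C interface.

```python
class Runtime:
    """Hypothetical accelerator runtime exposing callback hooks (not CUPTI)."""

    def __init__(self):
        self.callbacks = {"launch": [], "sync": []}
        self.pending = []  # device-side records not yet read back

    def register(self, event, callback):
        self.callbacks[event].append(callback)

    def launch(self, name, duration):
        for callback in self.callbacks["launch"]:
            callback(name)
        # Pretend the device records begin/end timestamps for each kernel,
        # running kernels back to back in submission order.
        begin = self.pending[-1][2] if self.pending else 0.0
        self.pending.append((name, begin, begin + duration))

    def synchronize(self):
        records, self.pending = self.pending, []
        for callback in self.callbacks["sync"]:
            callback(records)


class Tool:
    """Measurement tool: logs host events and reads device records at sync."""

    def __init__(self, runtime):
        self.host_events = []
        self.device_events = []
        runtime.register("launch", self.on_launch)
        runtime.register("sync", self.on_sync)

    def on_launch(self, name):
        self.host_events.append(("launch", name))

    def on_sync(self, records):
        self.host_events.append(("sync", len(records)))
        self.device_events.extend(records)  # the "read back" step


runtime = Runtime()
tool = Tool(runtime)
runtime.launch("kernel_a", 2.0)
runtime.launch("kernel_b", 3.0)
runtime.synchronize()
```

After the synchronize call, `tool.host_events` holds the two launches and the sync, and `tool.device_events` holds the kernel intervals as the stand-in device recorded them. The point of the design is that the tool never has to guess device timing from host timestamps.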

What About Directive Based Approaches?
Analysis tools: Vampir, Scalasca, TAU, Periscope
Interfaces to the tools: event traces (OTF2), call-path profiles (CUBE4, TAU), online interface, TAU adaptor
Score-P measurement infrastructure, with hardware counters (PAPI, rusage)
Instrumentation wrapper: MPI wrapper, compiler, TAU instrumentor, OPARI 2, user, plus ACCT callbacks and OMPT callbacks
Application (MPI, OpenMP, hybrid, serial)

Comparing Monitoring Tool Capabilities
Tools compared: Vendor Tools, VampirTrace / Score-P, TAU, HPCtoolkit, IPM, CEPBA MPItrace, PAPI, GPU Ocelot
Monitoring method: Event + Sample, Summary and Log; Event + Sample, Log; Event + Sample, Log; Event + Sample, Summary and Log; Event, Summary; Event, Log; Sample, Summary; Event, Summary
Further criteria: MPI, Threads, Accelerator, Scalability

Looking at Overhead: PIConGPU using 512 GPUs. Chart categories: Simulation, Host Instrumentation, CUDA Instrumentation, MPI Instrumentation (labeled values: 2%, 7%, 14%).

EXAMPLES

Looking at Multi-hybrid Application

Single GPU Implementation (5 years ago)

Inter-GPU-Communication with synchronous MPI

Impact of vectorized Kernels and asynchronous Communication

Concurrent Kernel Execution and Communication

Going to Large GPU Counts

Filtering

Only GPU activity

Now What About Directives?

SUMMARY & OUTLOOK

Summary: All levels of parallelism are visible: inter-node (MPI, SHMEM), intra-node (OpenMP, pthreads), accelerators (CUDA, OpenCL). Multiple highly scalable analysis tools are available: Scalasca, Vampir. Experts are available on-site. You are running out of excuses ;-)

Outlook: Committee work: OpenACC (profiler interface), OpenMP (OMPT), Score-P group (finding usable solutions). Critical path analysis: blaming the right application parts.

Questions