Performance Analysis for GPU Accelerated Applications

Center for Information Services and High Performance Computing (ZIH). Performance Analysis for GPU Accelerated Applications: Working Together for More Insight. Guido Juckeland (guido.juckeland@tu-dresden.de), Willersbau, Room A218, Tel. +49 351 463-39871

Outline: Motivation; Performance Analysis 101; Using Performance Tools for Accelerators; Examples; Summary & Outlook

MOTIVATION

Many High-Noon Situations. User: "I know what my code does!" System Provider: "Use my system efficiently!" Performance tools can provide an objective view.

Many High-Noon Situations (2). Tool Developer: "I need more information!" Hardware Vendor: "Why do you care?"

Reaching Higher with Cooperation

PERFORMANCE ANALYSIS 101

What do you want to know?

How to get it? Analysis layers and their analysis techniques: Data Presentation (Profile, Timeline); Data Recording (Summary, Log); Data Acquisition (Sample, Events).

Sampling vs. Tracing. Figure: timeline (t) of the call sequence foo, bar, foo, bar, foo, observed once by sampling and once by tracing. Summary: Foo: Total Time 0.0815; Bar: Total Time 0.4711. Event log: 2011/06/30 10:15:12.672865 Enter foo; 2011/06/30 10:15:12.894341 Leave foo.
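
The difference between the two acquisition techniques on the slide above can be sketched in a few lines of Python. This is a toy model, not how any of the tools named in this talk work internally: the call sequence, its durations, and the function names `trace`, `profile_from_trace`, and `sample` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical call sequence on a virtual clock: (function, start, end).
# The durations are made up for illustration.
CALLS = [("foo", 0.0, 1.0), ("bar", 1.0, 4.0), ("foo", 4.0, 5.0),
         ("bar", 5.0, 8.0), ("foo", 8.0, 9.0)]

def trace(calls):
    """Tracing: log one timestamped record per enter and per leave event."""
    log = []
    for name, start, end in calls:
        log.append((start, "Enter", name))
        log.append((end, "Leave", name))
    return log

def profile_from_trace(log):
    """Reduce an event log to a per-function total-time summary."""
    totals, open_calls = Counter(), {}
    for timestamp, kind, name in log:
        if kind == "Enter":
            open_calls[name] = timestamp
        else:
            totals[name] += timestamp - open_calls.pop(name)
    return dict(totals)

def sample(calls, period):
    """Sampling: check which function is running every `period` time units."""
    hits = Counter()
    t = period / 2  # first sample in the middle of the first period
    end_of_run = calls[-1][2]
    while t < end_of_run:
        for name, start, end in calls:
            if start <= t < end:
                hits[name] += 1
        t += period
    # Each hit stands for one period of estimated run time.
    return {name: count * period for name, count in hits.items()}
```

Calling `profile_from_trace(trace(CALLS))` recovers exact totals from the log, while `sample(CALLS, 4.0)` never observes `foo` at all: the log grows with every call, and the sample summary stays small but can miss short functions.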

Using Real Tools on Real Applications

Score-P as a Collaboration Between Tool Providers. Figure: Score-P architecture diagram. Analysis tools: Vampir, Scalasca, TAU, Periscope. Data interfaces: event traces (OTF2), call-path profiles (CUBE4, TAU), online interface, TAU adaptor. Core: Score-P measurement infrastructure with hardware counters (PAPI, rusage). Instrumentation wrapper: MPI wrapper, compiler, TAU instrumentor, OPARI 2, user. Base: application (MPI, OpenMP, hybrid, serial).

Vampir 8 as an Example for Performance Data Visualization. Figure: Vampir 8 screenshot with labeled displays: Toolbars, Master Timeline, Function Summary, Process Timeline, Counter Data Timeline, Communication Matrix View, Function Legend, Process Summary, Context View.

USING PERFORMANCE TOOLS FOR ACCELERATORS

The Accelerator Challenge: Asynchronicity. Figure: timeline (t) with three rows: Host (main, start, main, synchronize), Accelerator Execution Queue (kernel), Accelerator (kernel).
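
The problem shown on the slide above can be reproduced with a toy model in Python. The class `ToyAccelerator` and the function `run_demo` are hypothetical stand-ins: a worker thread plays the accelerator, and a `threading.Event` decides when the kernel finishes. No real GPU runtime is involved.

```python
import queue
import threading

class ToyAccelerator:
    """Hypothetical stand-in for a device with an in-order execution queue."""

    def __init__(self):
        self._queue = queue.Queue()
        self.events = []  # (who, what) records in the order they happened
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # Device side: execute queued kernels one after another.
        while True:
            name, gate = self._queue.get()
            gate.wait()  # the kernel "runs" until the gate opens
            self.events.append(("accelerator", f"{name} done"))
            self._queue.task_done()

    def launch(self, name, gate):
        # Host side: enqueue only, then return immediately.
        self._queue.put((name, gate))
        self.events.append(("host", f"{name} launched"))

    def synchronize(self):
        # Host side: block until the execution queue is drained.
        self._queue.join()
        self.events.append(("host", "synchronized"))

def run_demo():
    device = ToyAccelerator()
    gate = threading.Event()
    device.launch("kernel", gate)
    device.events.append(("host", "main continues"))
    gate.set()  # let the kernel finish
    device.synchronize()
    return device.events
```

In `run_demo()` the host records "main continues" before the kernel is done, so a timer placed around `launch` on the host measures only the enqueue, not the kernel. A measurement tool therefore needs information from the device side.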

Working Together with the Vendor: CUPTI. Figure: timeline with rows CPU (register callback, start callback, sync callback, read back; test, test, test), CUPTI (reg. start, reg. sync), Accelerator (kernel), Hardware Counter (event, event). Similar things are possible for OpenCL.
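
The callback pattern on the slide above can be illustrated with a small Python model. This is explicitly not CUPTI's API: `ToyRuntime`, `ToyTool`, the event names "launch" and "sync", and the tick-based device clock are all invented for this sketch. It only shows the general idea of a runtime that notifies a registered tool at kernel launch and hands over device-side records at synchronization.

```python
class ToyRuntime:
    """Hypothetical accelerator runtime that exposes callback hooks."""

    def __init__(self):
        self._callbacks = {"launch": [], "sync": []}
        self._pending = []      # kernels launched but not yet synchronized
        self._device_clock = 0  # made-up device time in ticks

    def register(self, site, callback):
        # A tool subscribes to a runtime event ("launch" or "sync").
        self._callbacks[site].append(callback)

    def launch(self, name, ticks):
        for callback in self._callbacks["launch"]:
            callback(name)
        start = self._device_clock
        self._device_clock += ticks
        self._pending.append((name, start, self._device_clock))

    def synchronize(self):
        finished, self._pending = self._pending, []
        for callback in self._callbacks["sync"]:
            callback(finished)  # device-side records are read back here


class ToyTool:
    """Measurement tool that only sees what the callbacks report."""

    def __init__(self, runtime):
        self.records = []
        runtime.register("launch", self.on_launch)
        runtime.register("sync", self.on_sync)

    def on_launch(self, name):
        self.records.append(("launch", name))

    def on_sync(self, finished):
        for name, start, end in finished:
            self.records.append(("kernel", name, start, end))
```

After `runtime = ToyRuntime()`, `tool = ToyTool(runtime)`, two launches, and one `runtime.synchronize()`, `tool.records` holds the host-side launch events followed by the device-side kernel intervals. The application code itself stays unchanged, which is the appeal of a vendor-provided callback interface.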

What About Directive-Based Approaches? Figure: Score-P architecture diagram extended with ACCT and OMPT callbacks. Analysis tools: Vampir, Scalasca, TAU, Periscope. Data interfaces: event traces (OTF2), call-path profiles (CUBE4, TAU), online interface, TAU adaptor. Core: Score-P measurement infrastructure with hardware counters (PAPI, rusage). Instrumentation wrapper: ACCT callbacks, MPI wrapper, OMPT callbacks, compiler, TAU instrumentor, OPARI 2, user. Base: application (MPI, OpenMP, hybrid, serial).

Comparing Monitoring Tool Capabilities. Monitoring method per tool: Vendor Tools: Event + Sample, Summary and Log; VampirTrace / Score-P: Event + Sample, Log; TAU: Event + Sample, Log; HPCtoolkit: Event + Sample, Summary and Log; IPM: Event, Summary; CEPBA MPItrace: Event, Log; PAPI GPU: Sample, Summary; Ocelot: Event, Summary. Further comparison criteria: MPI, Threads, Accelerator, Scalability.

Looking at Overhead: PIConGPU using 512 GPUs. Figure: chart of runtime shares for Simulation, Host Instrumentation, CUDA Instrumentation, and MPI Instrumentation; labeled values: 2%, 7%, 14%.

EXAMPLES

Looking at a Multi-hybrid Application

Single GPU Implementation (5 years ago)

Inter-GPU-Communication with synchronous MPI

Impact of vectorized Kernels and asynchronous Communication

Concurrent Kernel Execution and Communication

Going to Large GPU Counts

Filtering

Only GPU activity

Now What About Directives?

SUMMARY & OUTLOOK

Summary: All levels of parallelism are visible: inter-node (MPI, SHMEM), intra-node (OpenMP, pthreads), accelerators (CUDA, OpenCL). Multiple highly scalable analysis tools are available: Scalasca, Vampir. Experts are available on-site. You are running out of excuses ;-)

Outlook: Committee work: OpenACC (profiler interface), OpenMP (OMPT), Score-P group (finding usable solutions). Critical Path Analysis: blaming the right application parts.

Questions