Center for Information Services and High Performance Computing (ZIH). Performance Analysis for GPU Accelerated Applications: Working Together for More Insight. Willersbau, Room A218, Tel. +49 351-463-39871. Guido Juckeland (guido.juckeland@tu-dresden.de)
Outline: Motivation; Performance Analysis 101; Using Performance Tools for Accelerators; Examples; Summary & Outlook
MOTIVATION
Many High-Noon Situations. User: "I know what my code does!" System Provider: "Use my system efficiently!" Performance tools can provide an objective view.
Many High-Noon Situations (2). Tool Developer: "I need more information!" Hardware Vendor: "Why do you care?"
Reaching Higher with Cooperation
PERFORMANCE ANALYSIS 101
What do you want to know?
How to get it? Analysis Layer and Analysis Technique: Data Presentation (Profile, Timeline); Data Recording (Summary, Log); Data Acquisition (Sample, Events).
Sampling vs. Tracing. Timeline t: foo, bar, foo, bar, foo. Sampling: Foo: Total Time 0.0815; Bar: Total Time 0.4711. Tracing: 2011/06/30 10:15:12.672865 Enter foo; 2011/06/30 10:15:12.894341 Leave foo.
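The difference between the two acquisition techniques can be illustrated with a small simulation. This is only a sketch under stated assumptions: the timeline values, the function names `trace`, `totals_from_trace`, and `sample`, and the probing period are all hypothetical and do not come from any real tool. Tracing logs every enter and leave event, so per-function totals are exact; sampling probes the active function at a fixed period and can miss short calls.

```python
from collections import defaultdict

# Hypothetical timeline: (function name, start, end) in seconds.
TIMELINE = [
    ("foo", 0.0, 0.2),
    ("bar", 0.2, 0.5),
    ("foo", 0.5, 0.6),
    ("bar", 0.6, 1.0),
]


def trace(timeline):
    """Tracing: record an Enter and a Leave event for every call."""
    events = []
    for name, start, end in timeline:
        events.append((start, "Enter", name))
        events.append((end, "Leave", name))
    return events


def totals_from_trace(events):
    """Derive the exact time per function from the event log."""
    totals = defaultdict(float)
    entered_at = {}
    for timestamp, kind, name in events:
        if kind == "Enter":
            entered_at[name] = timestamp
        else:
            totals[name] += timestamp - entered_at.pop(name)
    return dict(totals)


def sample(timeline, period):
    """Sampling: probe the active function once per period and
    attribute the whole period to whatever was running."""
    end_time = timeline[-1][2]
    totals = defaultdict(float)
    for i in range(int(round(end_time / period))):
        probe = (i + 0.5) * period  # probe in the middle of each period
        for name, start, end in timeline:
            if start <= probe < end:
                totals[name] += period
                break
    return dict(totals)


exact = totals_from_trace(trace(TIMELINE))
estimate = sample(TIMELINE, period=0.25)
```

With this coarse period, no probe falls into the short second call to `foo`, so the sampled estimate attributes that time to `bar`. A shorter period reduces the error at the cost of more probes, while the trace grows with the number of calls instead.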
Using Real Tools on Real Applications
Score-P as a Collaboration Between Tool Providers. Analysis tools: Vampir, Scalasca, TAU (via TAU adaptor), Periscope. Interfaces: Event traces (OTF2); Call-path profiles (CUBE4, TAU); Online interface. Core: Score-P measurement infrastructure with Hardware counter (PAPI, rusage). Instrumentation wrapper: MPI wrapper, Compiler, TAU instrumentor, OPARI 2, User. Application (MPI, OpenMP, hybrid, serial).
Vampir 8 as an Example for Performance Data Visualization. Displays: Toolbars, Master Timeline, Function Summary, Process Timeline, Counter Data Timeline, Communication Matrix View, Function Legend, Process Summary, Context View.
USING PERFORMANCE TOOLS FOR ACCELERATORS
The Accelerator Challenge: Asynchronicity. Host: main, start, main, synchronize. Accelerator Execution Queue: kernel. Accelerator: kernel. Time axis: t.
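The asynchronous launch pattern on this slide can be imitated on the host alone. The following is a minimal sketch, not accelerator code: a single-worker thread pool stands in for the accelerator execution queue, `time.sleep` stands in for device work, and the names `kernel` and `run` are hypothetical. It shows why timing the launch call on the host says little about how long the kernel actually runs.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def kernel(data):
    """Stand-in for device work; returns its result and its own duration."""
    start = time.perf_counter()
    time.sleep(0.05)  # pretend the accelerator is busy
    result = [2 * x for x in data]
    return result, time.perf_counter() - start


def run():
    # One worker thread plays the role of the accelerator execution queue.
    with ThreadPoolExecutor(max_workers=1) as queue:
        t0 = time.perf_counter()
        pending = queue.submit(kernel, [1, 2, 3])  # "start": returns at once
        launch_time = time.perf_counter() - t0

        host_result = sum(range(1000))  # host keeps executing "main"

        result, kernel_time = pending.result()  # "synchronize": wait for kernel
    return launch_time, kernel_time, result, host_result


launch_time, kernel_time, result, host_result = run()
```

Here `launch_time` only covers enqueueing the work, while `kernel_time` is measured on the worker side. A monitoring tool that observes only host calls would see the first value, which is why accelerator-side information is needed.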
Working Together with the Vendor: CUPTI. Diagram labels: CPU (callback register, start, callback, sync, read back); CPU (test, test, test); CUPTI (reg. start, reg. sync); Accelerator (kernel); Hardware Counter (event, event). Similar things are possible for OpenCL.
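The callback idea behind this slide can be sketched independently of any vendor library. The code below is an illustration only and is not the CUPTI API: `ToyRuntime`, `TraceTool`, `subscribe`, `launch`, and `synchronize` are hypothetical names, and a counter replaces real timestamps. The tool registers a callback once, and the runtime then notifies it on entry to and exit from each API call, so the application itself needs no changes.

```python
import itertools


class ToyRuntime:
    """Hypothetical accelerator runtime that notifies subscribed tools."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        """Register a tool callback taking (site, api_name)."""
        self._subscribers.append(callback)

    def _notify(self, site, api_name):
        for callback in self._subscribers:
            callback(site, api_name)

    def _call(self, api_name, action):
        # Wrap every API call in an enter/exit notification pair.
        self._notify("enter", api_name)
        try:
            return action()
        finally:
            self._notify("exit", api_name)

    def launch(self, fn, *args):
        return self._call("launch", lambda: fn(*args))

    def synchronize(self):
        return self._call("synchronize", lambda: None)


class TraceTool:
    """Hypothetical measurement tool that logs one event per callback."""

    def __init__(self, runtime):
        self._clock = itertools.count()  # stand-in for a real timestamp source
        self.events = []
        runtime.subscribe(self._on_api)

    def _on_api(self, site, api_name):
        self.events.append((next(self._clock), site, api_name))


runtime = ToyRuntime()
tool = TraceTool(runtime)
answer = runtime.launch(lambda x: x + 1, 41)
runtime.synchronize()
```

After these calls, `tool.events` holds an enter and an exit record for both the launch and the synchronize call. A real tool would store timestamps and, where the vendor interface offers it, read hardware counters at the same points.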
What About Directive-Based Approaches? Analysis tools: Vampir, Scalasca, TAU (via TAU adaptor), Periscope. Interfaces: Event traces (OTF2); Call-path profiles (CUBE4, TAU); Online interface. Core: Score-P measurement infrastructure with Hardware counter (PAPI, rusage). Instrumentation wrapper: ACCT callbacks, MPI wrapper, OMPT callbacks, Compiler, TAU instrumentor, OPARI 2, User. Application (MPI, OpenMP, hybrid, serial).
Comparing Monitoring Tool Capabilities. Monitoring method per tool:
Vendor Tools: Event + Sample; Summary and Log
VampirTrace / Score-P: Event + Sample; Log
TAU: Event + Sample; Log
HPCtoolkit: Event + Sample; Summary and Log
IPM: Event; Summary
CEPBA MPItrace: Event; Log
PAPI: Sample; Summary
GPU Ocelot: Event; Summary
Further criteria compared: MPI, Threads, Accelerator, Scalability.
Looking at Overhead: PIConGPU using 512 GPUs. Chart categories: Simulation, Host Instrumentation, CUDA Instrumentation, MPI Instrumentation. Values shown: 2%, 7%, 14%.
EXAMPLES
Looking at a Multi-hybrid Application
Single GPU Implementation (5 years ago)
Inter-GPU-Communication with synchronous MPI
Impact of vectorized Kernels and asynchronous Communication
Concurrent Kernel Execution and Communication
Going to Large GPU Counts
Filtering
Only GPU activity
Now What About Directives?
SUMMARY & OUTLOOK
Summary. All levels of parallelism visible: Inter-node (MPI, SHMEM); Intra-node (OpenMP, pthreads); Accelerators (CUDA, OpenCL). Multiple highly scalable analysis tools available: Scalasca, Vampir. Experts available on-site. You are running out of excuses ;-)
Outlook. Committee work: OpenACC (profiler interface); OpenMP (OMPT); Score-P group (finding usable solutions). Critical Path Analysis: blaming the right application parts.
Questions