Performance Analysis for GPU Accelerated Applications

1 Center for Information Services and High Performance Computing (ZIH). Performance Analysis for GPU Accelerated Applications: Working Together for More Insight. Willersbau, Room A218, Tel. Guido Juckeland (guido.juckeland@tu-dresden.de)

2 Outline: Motivation; Performance Analysis 101; Using Performance Tools for Accelerators; Examples; Summary & Outlook. Guido Juckeland Slide 2

3 MOTIVATION

4 Many High-Noon Situations. User: "I know what my code does!" System provider: "Use my system efficiently!" Performance tools can provide an objective view.

5 Many High-Noon Situations (2). Tool developer: "I need more information!" Hardware vendor: "Why do you care?"

6 Reaching Higher with Cooperation

7 PERFORMANCE ANALYSIS 101

8 What do you want to know?

9 How to get it? Analysis layers and their analysis techniques: Data Presentation (Profile, Timeline); Data Recording (Summary, Log); Data Acquisition (Sample, Events).

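10 Sampling vs. Tracing. Sampling: periodic samples along the time axis t (foo, bar, foo, bar, foo) yield a total time per function (Foo: Total Time; Bar: Total Time). Tracing: every event is logged with a timestamp (2011/06/30 10:15: Enter foo; 2011/06/30 10:15: Leave foo).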
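
The difference between the two acquisition techniques can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the synthetic `TIMELINE` (in milliseconds) and the helper names `trace`, `profile_from_trace`, and `sample` are hypothetical and do not come from the slides or from any tool. Tracing logs every Enter/Leave event and can reconstruct exact times; sampling probes at a fixed period and only estimates them.

```python
from collections import Counter

# Hypothetical ground truth: (start_ms, end_ms, function) for a run of foo and bar.
TIMELINE = [(0, 30, "foo"), (30, 40, "bar"), (40, 70, "foo"),
            (70, 80, "bar"), (80, 100, "foo")]


def trace(timeline):
    """Tracing: log every Enter/Leave event together with its timestamp."""
    events = []
    for start, end, name in timeline:
        events.append((start, "Enter", name))
        events.append((end, "Leave", name))
    return events


def profile_from_trace(events):
    """Reduce the event log to the exact total time per function."""
    totals = Counter()
    entered_at = {}
    for timestamp, kind, name in events:
        if kind == "Enter":
            entered_at[name] = timestamp
        else:
            totals[name] += timestamp - entered_at.pop(name)
    return dict(totals)


def sample(timeline, period):
    """Sampling: probe every `period` ms and credit one period to the running function."""
    estimate = Counter()
    end_of_run = timeline[-1][1]
    for t in range(0, end_of_run, period):
        for start, end, name in timeline:
            if start <= t < end:
                estimate[name] += period
                break
    return dict(estimate)


exact = profile_from_trace(trace(TIMELINE))   # built from 10 logged events
coarse = sample(TIMELINE, period=25)          # built from 4 probes
print("tracing :", exact)
print("sampling:", coarse)
```

With this timeline, tracing reports 80 ms for foo and 20 ms for bar, while sampling every 25 ms reports 75 ms and 25 ms. The trade-off is the usual one: the trace is exact but grows with the number of events, whereas the sample count is fixed by the period.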

11 Using Real Tools on Real Applications

12 Score-P as a Collaboration Between Tool Providers. [Figure: Score-P architecture] Components shown: analysis tools Vampir, Scalasca, TAU, Periscope; event traces (OTF2), TAU adaptor, call-path profiles (CUBE4, TAU), online interface; hardware counter (PAPI, rusage); Score-P measurement infrastructure; instrumentation wrapper with MPI wrapper, compiler, TAU instrumentor, OPARI 2, user; application (MPI, OpenMP, hybrid, serial).

13 Vampir 8 as an Example for Performance Data Visualization. [Figure: Vampir 8 window] Displays shown: Toolbars, Master Timeline, Function Summary, Process Timeline, Counter Data Timeline, Communication Matrix View, Function Legend, Process Summary, Context View.

14 USING PERFORMANCE TOOLS FOR ACCELERATORS

15 The Accelerator Challenge: Asynchronicity. [Figure: timelines over t] Host: main starts a kernel and later synchronizes. Accelerator execution queue: the kernel is enqueued. Accelerator: the kernel executes.
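
Why asynchronicity is a challenge for measurement can be sketched on the host side alone. In the following illustration a single worker thread stands in for the accelerator execution queue; the names `kernel` and `timed_launch` are hypothetical, and no GPU API is involved. A timer that stops right after the launch sees only the enqueue cost, not the execution.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def kernel(seconds):
    """Hypothetical stand-in for device work; it only sleeps."""
    time.sleep(seconds)
    return seconds


def timed_launch(kernel_seconds=0.2):
    # One worker thread plays the role of the accelerator execution queue.
    with ThreadPoolExecutor(max_workers=1) as queue:
        t0 = time.perf_counter()
        future = queue.submit(kernel, kernel_seconds)  # "start": returns at once
        t_launch = time.perf_counter() - t0            # what a host-only timer sees
        future.result()                                # "synchronize": wait for the kernel
        t_total = time.perf_counter() - t0             # launch plus execution
    return t_launch, t_total


launch_time, total_time = timed_launch()
print(f"launch: {launch_time:.4f} s, until synchronize: {total_time:.4f} s")
```

The gap between the two numbers is the time the host is free to do other work, and it is also the time a host-only measurement would attribute to the wrong place.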

16 Working Together with the Vendor: CUPTI. [Figure: CPU, CUPTI, and accelerator timelines. Labels: register callback, start callback, sync callback, read back (CPU); reg., start, reg., sync (CUPTI); event, kernel, event, hardware counter (accelerator).] Similar things are possible for OpenCL.
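
The callback pattern on this slide can be sketched with a toy runtime. This is not the CUPTI API, which is a C interface; `ToyRuntime`, `subscribe`, `launch`, and `synchronize` are hypothetical names used only to show the control flow. A tool registers a callback once, the runtime invokes it at launch and at synchronization, and the records for finished kernels are read back at the synchronization point.

```python
class ToyRuntime:
    """Hypothetical stand-in for an accelerator runtime with a callback interface."""

    def __init__(self):
        self._callbacks = []
        self._pending = []
        self._tick = 0  # logical clock, so the example is deterministic

    def subscribe(self, callback):
        """Register a tool callback; it is invoked on every runtime event."""
        self._callbacks.append(callback)

    def _notify(self, what, **info):
        self._tick += 1
        for callback in self._callbacks:
            callback(self._tick, what, info)

    def launch(self, kernel_name):
        """Enqueue a kernel; the callback fires on the host at launch time."""
        self._notify("launch", kernel=kernel_name)
        self._pending.append(kernel_name)

    def synchronize(self):
        """Wait for the queue; records of finished kernels are read back here."""
        finished, self._pending = self._pending, []
        self._notify("sync", finished=finished)


log = []
runtime = ToyRuntime()
runtime.subscribe(lambda tick, what, info: log.append((tick, what, info)))
runtime.launch("kernel_a")
runtime.launch("kernel_b")
runtime.synchronize()
for entry in log:
    print(entry)
```

The design point the sketch mirrors is that the measurement tool never polls the device: it learns about device activity only when the runtime calls back, so device-side information becomes visible at synchronization at the earliest.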

17 What About Directive Based Approaches? [Figure: Score-P architecture with ACCT callbacks and OMPT callbacks added] Components shown: analysis tools Vampir, Scalasca, TAU, Periscope; event traces (OTF2), TAU adaptor, call-path profiles (CUBE4, TAU), online interface; hardware counter (PAPI, rusage); Score-P measurement infrastructure; ACCT callbacks; OMPT callbacks; instrumentation wrapper with MPI wrapper, compiler, TAU instrumentor, OPARI 2, user; application (MPI, OpenMP, hybrid, serial).

18 Comparing Monitoring Tool Capabilities
Tool | Monitoring Method
Vendor Tools | Event + Sample; Summary and Log
VampirTrace / Score-P | Event + Sample; Log
TAU | Event + Sample; Recording
HPCtoolkit | Event + Sample; Summary and Log
IPM | Event; Summary
CEPBA MPItrace | Event; Log
PAPI | Sample; Summary
GPU Ocelot | Event; Summary
Further rows (values not preserved): MPI, Threads, Accelerator, Scalability.

19 Looking at Overhead: PIConGPU using 512 GPUs. [Figure: runtime shares of Simulation, Host Instrumentation, CUDA Instrumentation, and MPI Instrumentation; labeled values: 2%, 7%, 14%]

20 EXAMPLES

21 Looking at Multi-hybrid Application

22 Single GPU Implementation (5 years ago)

23 Inter-GPU-Communication with synchronous MPI

24 Impact of vectorized Kernels and asynchronous Communication

25 Concurrent Kernel Execution and Communication

26 Going to Large GPU Counts

27 Filtering

28 Only GPU activity

29 Now What About Directives?

30 SUMMARY & OUTLOOK

31 Summary. All levels of parallelism are visible: inter-node (MPI, SHMEM), intra-node (OpenMP, pthreads), accelerators (CUDA, OpenCL). Multiple highly scalable analysis tools are available: Scalasca, Vampir. Experts are available on-site. You are running out of excuses ;-)

32 Outlook. Committee work: OpenACC (profiler interface), OpenMP (OMPT), Score-P group (finding usable solutions). Critical path analysis: blaming the right application parts.

33 Questions
