Performance Debugging: Methods and Tools. David Skinner deskinner@lbl.gov


Performance Debugging: Methods and Tools. Principles: topics in performance scalability; examples of areas where tools can help. Practice: where to find tools; specifics to NERSC and Hopper. Scope & Audience: budding simulation scientists and app devs; compiler/middleware devs, YMMV.

One Slide about NERSC: serving all of DOE Office of Science (domain breadth, range of scales); science-driven sustained performance; lots of users (~4K active, ~500 logged in, ~300 projects); architecture-aware procurements driven by workload needs.

Big Picture of Performance and Scalability 4

What is performance? Big fast computers are not enough: no output or incorrect output means no performance; one misconfiguration (July 2012) cost -$400M in 30 min.

App Performance: Dimensions. Code, input deck, computer, concurrency, workload, person.

Performance is Relative. To your goals: time to solution (T_queue + T_wall), your research agenda, efficient use of your allocation. To the application: code, input deck, machine type/state. Suggestion: focus on specific use cases as opposed to making everything perform well. Bottlenecks can shift.

Performance, more than a single number. Plan where to put effort: the cycle runs formulate research problem, coding, debug, performance debug, understand & publish, with queue waits, jobs, UQ, and V&V along the way. Optimization in one area can de-optimize another. Timings come from timers and also from your calendar: time spent coding counts. Sometimes a slower algorithm is simpler to verify for correctness.

Different Facets of Performance. Serial: leverage ILP on the processor, feed the pipelines, exploit data locality, reuse data in cache. Parallel: expose concurrency, minimize latency effects, maximize work vs. communication.

Performance is Hierarchical. Registers: instructions & operands. Caches: lines. Local memory: pages. Remote memory: messages. Disk / filesystem: blocks, files.

On to specifics about HPC tools: mostly at NERSC, but fairly general.

Tools are Hierarchical. Registers and caches: PAPI. Local and remote memory: valgrind, PMPI, CrayPat, IPM, TAU. Disk / filesystem: SAR/LMT.

HPC Perf Tool Mechanisms. Sampling: regularly interrupt the program and record where it is; build up a statistical profile. Tracing / instrumenting: insert hooks into the program to record and time events; document everything. Hardware event counters: special registers count events on the processor, e.g. floating point instructions; many possible events, but only a few (~4) counters at a time.

Typical Tool Use Requirements. (Sometimes) modify your code with macros, API calls, or timers. Compile your code. Transform your binary for profiling/tracing with a tool. Run the transformed binary; a data file is produced. Interpret the results with a tool.

Performance Tools @ NERSC. Vendor tools: CrayPat. Community tools: TAU (U. Oregon via ACTS), PAPI (Performance Application Programming Interface), gprof, IPM (Integrated Performance Monitoring).

What can HPC tools tell us? CPU and memory usage: FLOP rate, memory high-water mark. OpenMP: OMP overhead, OMP scalability (finding the right # of threads). MPI: % wall time in communication, detecting load imbalance, analyzing message sizes.

Using the right tool. Tools can add overhead to code execution: what level can you tolerate? Tools can add overhead to scientists: what level can you tolerate? Scenarios: debugging a code that is slow; detailed performance debugging; performance monitoring in production.

Introduction to CrayPat. A suite of tools that provides a wide range of performance-related information. Can be used for both sampling and tracing user codes, with or without hardware or network performance counters. Built on PAPI. Supports Fortran, C, C++, UPC, MPI, Coarray Fortran, OpenMP, Pthreads, SHMEM. Man pages: intro_craypat(1), intro_app2(1), intro_papi(1).

Using CrayPat @ Hopper
1. Access the tools: module load perftools
2. Build your application, keeping the .o files: make clean; make
3. Instrument the application: pat_build ... a.out. The result is a new file, a.out+pat.
4. Run the instrumented application to find the top time-consuming routines: aprun ... a.out+pat. The result is a new file XXXXX.xf (or a directory containing .xf files).
5. Run pat_report on that new file and view the results: pat_report XXXXX.xf > my_profile; vi my_profile. This also produces a new file, XXXXX.ap2.

Guidelines for Optimization (suggested by Cray). Derived metric: optimization needed when (PAT_RT_HWPC group):
Computational intensity: < 0.5 ops/ref (0, 1)
L1 cache hit ratio: < 90% (0, 1, 2)
L1 cache utilization (misses): < 1 avg hit (0, 1, 2)
L1+L2 cache hit ratio: < 92% (2)
L1+L2 cache utilization (misses): < 1 avg hit (2)
TLB utilization: < 0.9 avg use (1)
(FP Multiply / FP Ops) or (FP Add / FP Ops): < 25% (5)
Vectorization: < 1.5 for dp; < 3 for sp (12, 13, 14)

Perf Debug and Production Tools. IPM (Integrated Performance Monitoring): MPI profiling, hardware counter metrics, POSIX I/O profiling. IPM requires no code modification and no instrumented binary: just module load ipm before running your program on systems that support dynamic libraries; otherwise link with the IPM library. IPM uses hooks already in the MPI library to intercept your MPI calls and wrap them with timers and counters.

IPM: Let's See
1) Do module load ipm, link with $IPM, then run normally.
2) Upon completion you get:
##IPM2v0.xx##################################################
#
# command   : ./fish -n 10000
# start     : Tue Feb 10 11:05:21 2012    host      : nid06027
# stop      : Tue Feb 10 11:08:19 2012    wallclock : 177.71
# mpi_tasks : 25 on 2 nodes               %comm     : 1.62
# mem [GB]  : 0.24                        gflop/sec : 5.06
Maybe that's enough. If so, you're done. Have a nice day!

IPM: IPM_PROFILE=full
# host    : s05601/006035314c00_aix     mpi_tasks : 32 on 2 nodes
# start   : 11/30/04/14:35:34           wallclock : 29.975184 sec
# stop    : 11/30/04/14:36:00           %comm     : 27.72
# gbytes  : 6.65863e-01 total           gflop/sec : 2.33478e+00 total
#                 [total]     <avg>       min         max
# wallclock       953.272     29.7897     29.6092     29.9752
# user            837.25      26.1641     25.71       26.92
# system          60.6        1.89375     1.52        2.59
# mpi             264.267     8.25834     7.73025     8.70985
# %comm                       27.7234     25.8873     29.3705
# gflop/sec       2.33478     0.0729619   0.072204    0.0745817
# gbytes          0.665863    0.0208082   0.0195503   0.0237541
# PM_FPU0_CMPL    2.28827e+10 7.15084e+08 7.07373e+08 7.30171e+08
# PM_FPU1_CMPL    1.70657e+10 5.33304e+08 5.28487e+08 5.42882e+08
# PM_FPU_FMA      3.00371e+10 9.3866e+08  9.27762e+08 9.62547e+08
# PM_INST_CMPL    2.78819e+11 8.71309e+09 8.20981e+09 9.21761e+09
# PM_LD_CMPL      1.25478e+11 3.92118e+09 3.74541e+09 4.11658e+09
# PM_ST_CMPL      7.45961e+10 2.33113e+09 2.21164e+09 2.46327e+09
# PM_TLB_MISS     2.45894e+08 7.68418e+06 6.98733e+06 2.05724e+07
# PM_CYC          3.0575e+11  9.55467e+09 9.36585e+09 9.62227e+09
#                 [time]      [calls]     <%mpi>      <%wall>
# MPI_Send        188.386     639616      71.29       19.76
# MPI_Wait        69.5032     639616      26.30       7.29
# MPI_Irecv       6.34936     639616      2.40        0.67
# MPI_Barrier     0.0177442   32          0.01        0.00
# MPI_Reduce      0.00540609  32          0.00        0.00
# MPI_Comm_rank   0.00465156  32          0.00        0.00
# MPI_Comm_size   0.000145341 32          0.00        0.00

Analyzing IPM Data. Communication time per type of MPI call; CDF of time per MPI call over message sizes; pairwise communication volume (communication topology).

Tracing: Hard but Thorough. (Trace figure: per-MPI-rank timeline of flops, exchange, and sync vs. time.)

Advice: develop (some) portable approaches to app optimization. There is a tradeoff between vendor-specific and vendor-neutral tools; each has its role, and vendor tools can often dive deeper. Portable approaches allow apples-to-apples comparisons, though events, counters, and metrics may be incomparable across vendors. You can find printf most places. Put a few timers in your code?

So you run your code with a perf tool and get some numbers... what do they mean?

Performance: Definitions. How do we measure performance? Speedup = T_s / T_p(n). Efficiency = T_s / (n * T_p(n)), where n = cores, s = serial, p = parallel. Be aware that there are multiple definitions for these terms. Isoefficiency: contours of constant efficiency among all problem sizes and concurrencies.

Parallel speedups for x#x: how much? (Plot: speedup vs. # CPUs, threads via OpenMP.)

Systematic Perf Measurement. Scaling studies involve changing the degree of parallelism. Will we be changing the problem as well? Strong scaling: fixed problem size. Weak scaling: problem size grows with additional resources. Optimization: is the concurrency choosing the problem size, or vice versa?

Sharks and Fish: Cartoon. Data: n_fish is global, my_fish is local, fish_i = {x, y, ...}. Dynamics: F = ma, V_ij ~ 1/r_ij, H = K + V, dq/dt = dH/dp, dp/dt = -dH/dq.
MPI_Allgatherv(myfish_buf, len[rank], ...);
for (i = 0; i < my_fish; ++i) {
  for (j = 0; j < n_fish; ++j) {  /* i != j */
    a_i += g * mass_j * (fish_i - fish_j) / r_ij;
  }
}
Move fish. See a glimpse here: http://www.leinweb.com/snackbar/wator/

Sharks and Fish Results. Running on a NERSC machine: 100 fish can move 1000 steps in 5.459 s on 1 task vs. 2.756 s on 32 tasks (1.98x speedup); 1000 fish can move 1000 steps in 511.14 s on 1 task vs. 20.815 s on 32 tasks (24.6x speedup). What's the best way to run? How many fish do we really have? How large a computer do we have? How much computer time, i.e. allocation, do we have? How quickly, in real wall time, do we need the answer?

Good 1st Step: Do the runtimes make sense? Running fish_sim for 100-1000 fish on 1-32 CPUs. (Plot: runtimes for 1 task and 32 tasks.)

Isoefficiencies

Too much communication

Scaling studies are not always so simple

How many perf measurements? With a particular goal in mind, we systematically vary concurrency and/or problem size. Example: how large a 3D (n^3) FFT can I efficiently run on 1024 CPUs? Looks good?

The scalability landscape. Whoa! Why so bumpy? Algorithm complexity or switching; communication protocol switching; inter-job contention; ~bugs in vendor software.

Not always so tricky. Main loop in jacobi_omp.f90; ngrid=6144 and maxiter=20.

Scaling Studies at Scale. Gyrokinetic Toroidal Code (fusion simulation), OpenMP enabled, 4/6 threads, scaling up to 49,152 cores on 3 machines (Kraken, Carver, Ranger). (Plots: wallclock scaling, time in MPI calls, and time in OpenMP parallel regions vs. number of cores, each shown against ideal scaling.)

Let's look at scaling performance in more depth. A key impediment is often load imbalance.

Load Imbalance: Pitfall 101. Communication time: 64 tasks show 200 s, 960 tasks show 230 s. (Plot: MPI ranks sorted by total communication time.)

Load imbalance is pernicious: keep checking it.

Load Balance: cartoon. Unbalanced: the universal app. Balanced: time saved by load balance.

Load Balance: Summary. Imbalance is often a byproduct of (1) data decomposition or (2) multi-core concurrency quirks. It must be addressed before further MPI tuning can help. For regular grids, consider padding or contracting. Good software exists for graph partitioning / remeshing. Dynamic load balancing may be required for adaptive codes.

Other performance scenarios

Simple Stuff: What's wrong here? (IPM profile.) This is why we need perf tools that are easy to use.

Not so simple: communication topology. MILC, MAESTRO, GTC, PARATEC, IMPACT-T, CAM.

Application Topology

Performance in Batch Queue Space

A few notes on queue optimization. Consider your schedule and the queue constraints: charge factor (regular vs. low), scavenger queues, xfer queues, downshifting concurrency, run limits, queue limits, wall limits (soft, if you can checkpoint). Jobs can submit other jobs.

Marshalling your own workflow. Lots of choices in general: Hadoop, Condor-G, MySGE. On Hopper it's easy:
#PBS -l mppwidth=4096
aprun -n 512 ./cmd &
aprun -n 512 ./cmd &
aprun -n 512 ./cmd &
wait
or, schematically:
#PBS -l mppwidth=4096
while (work_left) {
  if (nodes_avail) {
    aprun -n X next_job &
  }
  wait
}

Thanks! (The cycle again: formulate research problem, coding, debug, perf debug, understand & publish; queue wait, jobs, UQ, V&V.) Contacts: help@nersc.gov, deskinner@lbl.gov