1 Performance Measurement Tools and Techniques
Pekka Manninen, CSC, the Finnish IT Center for Science
PRACE Petascale Summer School, Stockholm, Sweden, Aug

2 Part I: Introduction
Motivation
Traditional and petascale optimization
Performance data and optimization

3 Motivation
It is all about software: there is no point in building petascale machines without suitable software.
Single-core efficiency is no longer the key issue; scalability is.
Large-scale parallel application development is still frontier work with novel challenges: how to utilize thousands of cores, and how to deal with I/O.
Good performance analysis tools are mandatory for parallel program development.

4 Traditional optimization process overview
Application development stage: choose algorithms, choose data structures, develop or port the application.
Serial optimization stage: perform serial optimization.
Parallel optimization stage: apply parallelization, perform parallel optimization.

5 Petascale optimization flowchart
Choose the algorithms, data structures and parallelization strategy; develop the code and apply parallelization, yielding unoptimized but correct parallel code.
Measure parallel performance and assess scalability; if it is not sufficient, reduce the overhead from communication and load imbalance.
Measure single-core performance; if it is not sufficient, apply compiler optimization, tune for the processor, link optimized libraries, optimize I/O and identify the performance bottlenecks.
Assess scalability again and iterate until converged: the result is the optimized code.

6 Optimization considerations
1. Load balance
2. Minimal dedicated time for communication: minimize communication, overlap computation and communication (see the sketch below)
3. CPU utilization: optimal memory access (cache utilization), pipeline performance (branch prediction, prefetching), SIMD operations
Efficient I/O
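To make the overlap idea concrete, here is a minimal sketch (not from the original slides; the compute_* routines are hypothetical placeholders for application work) of posting nonblocking MPI calls, computing independently, and synchronizing only when the data is needed:

#include <mpi.h>

static void compute_interior(void) { /* work that needs no halo data (placeholder) */ }
static void compute_boundary(void) { /* work that needs the received halo (placeholder) */ }

void exchange_halo_overlapped(double *sendbuf, double *recvbuf, int n,
                              int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* Post the communication early... */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    /* ...overlap it with computation that does not depend on the incoming data... */
    compute_interior();

    /* ...and wait only when the halo data is actually required. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    compute_boundary();
}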

7 Examples of relevant measurements
Execution time across CPUs: in order for an application to scale, all tasks should be kept equally loaded.
MPI trace: how and when the communication is carried out, hinting at how to optimize communication; communication bandwidth.
Function call tree and execution-time profile: pinpoint the execution hotspots, i.e. where to spend most of the effort in serial optimization.
Hardware counters (e.g. cache utilization ratio, instruction usage, computational intensity, flop rate): provide insight into the potential inefficiencies of a given routine.
I/O statistics.

8 Performance data collection
Two dimensions:
When collection is triggered: asynchronously (sampling) or synchronously (code instrumentation).
How data is recorded: as a profile or as a trace file.
In other words, acquisition is either sampling or instrumentation, and presentation is either a profile or a timeline.

9 Things to keep in mind
The objective of performance analysis is to understand the behavior of the whole system and to use that understanding to improve performance.
Instrumentation always causes overhead; there are artifacts in all measurements.
A performance analyst should:
Have an understanding of the different levels of the system architecture
Be able to communicate with users as well as developers
Be patient enough to explore a broad range of hypotheses and double-check them
Be open-minded as to where the performance bottleneck could be

10 Part II: Cray performance analysis tools as an example
Overview
Usage

11 Cray performance analysis infrastructure
CrayPat:
pat_build - a utility for application instrumentation without any need for source-code modification
Transparent run-time library for measurements
pat_report - generates performance reports and the visualization file
pat_help - interactive help utility
Cray Apprentice2:
An advanced graphical performance analysis and visualization tool

12 Instrumentation with pat_build
No source or makefile modification needed; link-time instrumentation.
Requires object files; instruments compiler-optimized code.
Generates a stand-alone instrumented executable and preserves the original binary.
Automatic instrumentation at group level.
Supports both asynchronous and synchronous instrumentation.
Basic usage:
% module load xt-craypat
% make clean; make
% pat_build -g <trace groups> a.out
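For example, tracing both MPI and I/O (an illustrative combination, not prescribed by the slides):
% pat_build -g mpi,io a.out
This produces a.out+pat next to the preserved original a.out.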

13 Instrumentation with pat_build
Trace groups:
biolibs - Cray bioinformatics library routines
blas - BLAS subroutines
heap - dynamic heap
io - stdio + sysio trace groups
lapack - LAPACK subroutines
math - ANSI math
mpi - MPI statistics
omp - OpenMP API
omp-rtl - OpenMP runtime library
pthreads - POSIX threads
shmem - SHMEM API
stdio - all functions that accept or return the FILE* construct
sysio - I/O system calls
system - system calls

14 Instrumentation with pat_build
Automatic profiling analysis (APA):
% module load xt-craypat/4.2
% make clean; make
% pat_build -O apa a.out
Execution of the instrumented executable will produce a report for pat_report and an .apa file that allows fine-tuning the analysis.
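The complete APA cycle then looks roughly as follows (a sketch; the launcher call is an assumption and the generated file names are placeholders):
% aprun -n 512 ./a.out+pat          (sampling run, writes an .xf file)
% pat_report <xf-file>.xf           (writes a text profile and an .apa file)
% pat_build -O <apa-file>.apa       (reinstruments the executable to trace the suggested functions)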

15 Fine-grained instrumentation
Fortran:
  include 'pat_apif.h'
  ...
  call PAT_region_begin(id, label, ierr)
  <code segment>
  call PAT_region_end(id, ierr)
C:
  #include <pat_api.h>
  ...
  ierr = PAT_region_begin(id, label);
  <code segment>
  ierr = PAT_region_end(id);
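A self-contained sketch of the C interface in use (the region id, label and loop are hypothetical; pat_api.h becomes available once the CrayPat module is loaded):

#include <pat_api.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    int ierr;

    /* Everything between begin and end is reported as its own region. */
    ierr = PAT_region_begin(1, "accumulate_loop");
    for (int i = 0; i < 1000000; ++i)
        sum += (double)i * 1.0e-6;
    ierr = PAT_region_end(1);

    printf("sum = %f (ierr = %d)\n", sum, ierr);
    return 0;
}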

16 Collecting data
Run the instrumented application (a.out+pat) as usual; the measurement data (an .xf file) is then created.
Must be run on a Lustre file system.
Optional runtime variables:
Get an optional timeline view of the program by setting PAT_RT_SUMMARY=0
Request hardware performance counter information by setting PAT_RT_HWPC=<HWPC group id>
The number of files used to store the raw data can be customized with PAT_RT_EXPFILE_MAX
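For example (illustrative values; aprun is assumed to be the launcher on the Cray XT systems discussed here):
% export PAT_RT_HWPC=2          (collect the L1 and L2 cache metric group)
% export PAT_RT_SUMMARY=0       (optional: keep full data for a timeline view)
% aprun -n 512 ./a.out+pat      (writes the .xf measurement file)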

17 Collecting data
Hardware performance counter groups:
1 - Summary with translation lookaside buffer metrics
2 - L1 and L2 cache metrics
3 - Bandwidth information
4 - HyperTransport information
5 - Floating-point instruction (including SSE) information
6 - Cycles stalled and resources empty
7 - Cycles stalled and resources full
8 - Instructions and branches
9 - Instruction cache values

18 Analysis with pat_report
Combines the information from the binary with the raw performance measurement data.
Generates a text report of the performance results and/or formats the data to be visualized with Apprentice2.
Basic usage:
pat_report -O <keywords> data_file.xf
Useful keywords:
profile - subroutine-level data
callers - function callers
calltree - call tree
heap - heap information, instrument with -g heap
mpi - MPI statistics, instrument with -g mpi
load_balance - load balance information
help - available options
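For example (the data file name is a placeholder):
% pat_report -O profile,callers,load_balance <data_file>.xf > report.txt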

19 Analysis with pat_report
-O keywords and the tables they produce:
ct - shorthand for -O calltree
defaults - the tables that would appear by default
heap - shorthand for -O heap_program,heap_hiwater,heap_leaks
io - shorthand for -O read_stats,write_stats
lb - shorthand for -O load_balance
load_balance - shorthand for -O lb_program,lb_group,lb_function
mpi - shorthand for -O mpi_callers
callers - Profile by Function and Callers
callers+src - Profile by Function and Callers, with Line Numbers
calltree - Function Calltree View
calltree+src - Calltree View with Callsite Line Numbers
heap_hiwater - Heap Stats during Main Program
heap_leaks - Heap Leaks during Main Program
heap_program - Heap Usage at Start and End of Main Program
hwpc - HW Performance Counter Data
load_balance_function - Load Balance across PE's by Function
load_balance_group - Load Balance across PE's by FunctionGroup
load_balance_program - Load Balance across PE's

20 Analysis with pat_report
load_balance_sm - Load Balance with MPI Sent Message Stats
loops - Loop Stats from -hprofile_generate
mpi_callers - MPI Sent Message Stats by Caller
mpi_dest_bytes - MPI Sent Message Stats by Destination PE
mpi_dest_counts - MPI Sent Message Stats by Destination PE
mpi_rank_order - Suggested MPI Rank Order
mpi_sm_rank_order - Sent Message Stats and Suggested MPI Rank Order
pgo_details - Loop Stats detail from -hprofile_generate
profile - Profile by Function Group and Function
profile+src - Profile by Group, Function, and Line
profile_pe.th - Profile by Function Group and Function
profile_pe_th - Profile by Function Group and Function
profile_th_pe - Profile by Function Group and Function
program_time - Program Wall Clock Time
read_stats - File Input Stats by Filename
samp_profile - Profile by Function
samp_profile+src - Profile by Group, Function, and Line
thread_times - Program Wall Clock Time
write_stats - File Output Stats by Filename

21 Part III: Case study
Workplan
Demonstration of tools
Analysis

22 Case study: CP2K performance analysis
A code for ab initio molecular dynamics simulations, written in Fortran 95; uses MPI for parallelization.
Test job: a DFT simulation of 512 water molecules over a few time steps.
We wish to know the following:
Function profile - where are the hotspots?
Single-core efficiency
Load balance and communication analysis - what are the scalability bottlenecks?
Peak performance around 19 TFlop/s - still quite far away from petascale.
Runtimes (s) on Cray XT4 by number of CPUs, with the FFTSG and FFTW FFT libraries.
At the largest core count there is not enough to calculate per task, and so much communication overhead, that the execution is slower than with 1024 cores!

23 Case study: CP2K performance analysis
Workplan:
Instrument the code (built with the CrayPat module loaded): pat_build -O apa cp2k.pat (gives cp2k.pat+pat)
Execute cp2k.pat+pat with 512 cores (gives an .xf file)
Obtain a sampling profile: pat_report -O samp_profile+src xf-file.xf > samp_profile. This produces the profile as well as an .apa file.
Run a more exhaustive analysis by editing the .apa file and rebuilding the executable: pat_build -O apa-file.apa. Taking this longer route reduces the instrumentation overhead.
Executing this will produce another .xf file; visualize the analysis with Apprentice2.
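Collected as a command sequence (a sketch; the launcher invocation, input file and generated file names are assumptions):
% pat_build -O apa cp2k.pat                        (gives cp2k.pat+pat)
% aprun -n 512 ./cp2k.pat+pat <input>              (gives an .xf file)
% pat_report -O samp_profile+src <xf-file>.xf > samp_profile
% pat_build -O <apa-file>.apa                      (reinstrument for tracing)
% aprun -n 512 ./cp2k.pat+pat <input>              (gives the second .xf file for Apprentice2)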

24 Sampling profile output
Table 1: Profile by Group, Function, and Line (Samp %, Samp, Imb. Samp and Imb. Samp % per Group / Function / Source / Line, PE='HIDE'), 100.0% Total:
MPI group - the largest share, dominated by MPI_Bcast, MPI_Recv, mpi_allreduce_, mpi_alltoallv_, MPI_Reduce, mpi_waitall_, MPI_Send and mpi_alltoall_, with individual shares roughly between 1% and 15% of all samples
ETC group - 23.4%, dominated by dgemm_kernel, dgemm_otcopy and dgemm_oncopy
USER group - 13.5%, dominated by UPDATE_COST_CPU_ROW in distribution_optimize (around source line 358)

25 Sampling profile
The corresponding MPI / ETC / USER fractions with other processor counts (a table of #cores versus group shares).

26 APA file
# You can edit this file, if desired, and use it
# to reinstrument the program for tracing like this:
#
#   pat_build -O cp2k.pat+pat sdt.apa
#
# These suggested trace options are based on data from:
#
#   /wrk/pmannin/prace/cp2k/cp2k/tests/scala/512/sg/cp2k.pat+pat sdt.ap2,
#   /wrk/pmannin/prace/cp2k/cp2k/tests/scala/512/sg/cp2k.pat+pat sdt.xf
#
# HWPC group to collect by default.
#   -Drtenv=PAT_RT_HWPC=0 # Summary with instructions metrics.
-Drtenv=PAT_RT_HWPC=2
#
# Libraries to trace.
-g mpi
Annotations on the slide: insert the desired trace groups here (the -g line); select the hardware counter group here, or give the PAPI counters explicitly - here group 2 is chosen.

27 APA file continued
# User-defined functions to trace, sorted by % of samples.
# Limited to top 200. A function is commented out if it has < 1%
# of samples, or if a cumulative threshold of 90% has been reached,
# or if it has size < 200 bytes.
-w # Enable tracing of user-defined functions.
# Note: -u should NOT be specified as an additional option.
# 3.21% 756 bytes
-T UPDATE_COST_CPU_ROW.in.DISTRIBUTION_OPTIMIZE
# 0.74% 5757 bytes
-T PW_NN_COMPOSE_R_WORK.in.PW_SPLINE_UTILS
# 0.66% bytes
-T RS_PW_TRANSFER_DISTRIBUTED.in.REALSPACE_GRID_TYPES
# 0.64% 9426 bytes
-T RS_PW_TRANSFER_REPLICATED.in.REALSPACE_GRID_TYPES
# 0.56% 2172 bytes
-T PW_COMPOSE_STRIPE.in.PW_SPLINE_UTILS
# 0.49% bytes
# -T CP_SM_FM_MULTIPLY_2D.in.CP_SM_FM_INTERACTIONS
We may now select the functions that we want to trace, and reduce the instrumentation overhead by ignoring functions with little significance for the execution time; here a few of them have been uncommented.

28 Performance analysis
Let us get the most out of the analysis data:
pat_report -O defaults,mpi,io,lb,heap,mpi_dest_bytes,hwpc,mpi_sm_rank_order xffile.xf

29 Table 1: Profile by Function Group and Function (Experiment=1 / Group / Function / PE='HIDE'), totals for the program: Time% 100.0%, with hardware counter derived metrics including a utilization rate of 86.6%, a D1 cache hit ratio of 99.2%, a D2 cache hit ratio of 72.9%, a combined D1+D2 hit ratio of 99.8% and an effective D1+D2 reuse of 7.57 refs/byte, plus the REQUESTS_TO_L2:DATA, DATA_CACHE_REFILLS and PAPI_L1_DCA rates and the System-to-D1 and L2-to-Dcache bandwidths. Note how the instrumentation affects the performance.

30 MPI_SYNC group: Time% 51.9%. This is all load imbalance! Hardware counter metrics for the group: utilization rate 99.4%, D1 cache hit ratio 100.0%, D2 cache hit ratio 93.8%, and very little memory traffic (System-to-D1 bandwidth 1.497 MB/sec).

31 USER group: Time% 31.1%. Cache statistics: there is room to improve both L1 and L2 utilization. Utilization rate 65.0%, D1 cache hit ratio 97.7%, D2 cache hit ratio 74.7%, combined D1+D2 hit ratio 99.4%, effective D1+D2 reuse 2.74 refs/byte.
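As a generic illustration of the kind of restructuring that improves L1/L2 reuse (not taken from CP2K), a blocked loop nest revisits data while it still resides in cache:

#include <stddef.h>

#define BLOCK 64   /* tuning parameter, chosen so a tile fits in L1 */

/* Transpose-accumulate a += b^T, traversed tile by tile so that the
 * strided accesses to b stay within a cache-sized block. */
void transpose_add_blocked(double *a, const double *b, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            for (size_t i = ii; i < ii + BLOCK && i < n; ++i)
                for (size_t j = jj; j < jj + BLOCK && j < n; ++j)
                    a[i * n + j] += b[j * n + i];
}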

32 Table 2: Load Balance with MPI Sent Message Stats (Time %, Time, Sent Msg Count, Sent Msg Total Bytes, Avg Sent Msg Size, per Group and PE[mmm]): of the 100.0% total, MPI_SYNC accounts for the largest share, USER for 31.1% and MPI for 15.3%, with the minimum, median and maximum PEs listed for each group. MPI data transfer without load imbalance.

33 Table 3: MPI Sent Message Stats by Caller - the message passing profile. Message counts are binned by size (<16B, 16B-256B, 256B-4KB, 4KB-64KB, 64KB-1MB) per function and caller. The dominant chain is mpi_isend_ called from MP_ISENDRECV_RM2 (MESSAGE_PASSING), reached via CP_SM_FM_MULTIPLY_2D and CP_SM_FM_MULTIPLY (CP_SM_FM_INTERACTIONS), QS_KS_BUILD_KOHN_SHAM_MATRIX and QS_KS_UPDATE_QS_ENV (QS_KS_METHODS), SCF_ENV_DO_SCF and SCF (QS_SCF), QS_ENERGIES and QS_FORCES, and FORCE_ENV_CALC_ENERGY_FORCE (FORCE_ENV_METHODS). We could go into a routine and see if we could aggregate small messages.
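If such a routine turned out to send many tiny messages, one common remedy is to pack them into a single buffer and send once; a minimal sketch (hypothetical code, not from CP2K):

#include <mpi.h>
#include <string.h>

/* Replace nmsg separate sends of len doubles each with a single send of the
 * packed buffer, amortizing the per-message latency and MPI overhead. */
void send_aggregated(double **pieces, int nmsg, int len,
                     double *packbuf, int dest, MPI_Comm comm)
{
    for (int m = 0; m < nmsg; ++m)
        memcpy(packbuf + (size_t)m * len, pieces[m], (size_t)len * sizeof(double));

    MPI_Send(packbuf, nmsg * len, MPI_DOUBLE, dest, 0, comm);
}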

34 Table 5: Heap Stats during Main Program (tracked heap high-water MBytes, total allocations, allocations not tracked, total frees, frees not tracked, tracked objects not freed, tracked MBytes not freed, per PE[mmm]).

35 Table 7: File Input Stats by Filename (read time, read MB, read rate MB/sec, number of reads, bytes per call, per file name, PE[mmm] and file descriptor).

36 Table 8: File Output Stats by Filename (write time, write MB, write rate MB/sec, number of writes, bytes per call, per file name, PE[mmm] and file descriptor), for out-w512-restart.wfn, <NA> and stdout. Rather modest I/O, but check the I/O bandwidth!

37 Table 13: Load Balance across PE's, entry for pe.491: Time% 0.2%, utilization rate 96.9%, D1 cache hit ratio 99.5%, D2 cache hit ratio 73.5%, D1+D2 cache hit ratio 99.9%. Compare the number of calls and the time spent.

38 Table 13: Load Balance across PE's, entry for pe.2: Time% 0.2%, utilization rate 72.0%, D1 cache hit ratio 98.8%, D2 cache hit ratio 74.1%, D1+D2 cache hit ratio 99.7%, effective D1+D2 reuse 5.03 refs/byte. Compare the number of calls and the time spent.

39 Table 15: Load Balance across PE's by Function, USER / PW_NN_COMPOSE_R_WORK.in.PW_SPLINE_UTILS on two different PEs: both show Time% 0.0% and 576 calls, but on one of the PEs the hardware counter values (requests to L2, cache refills, L1 data accesses, user time in cycles) are essentially zero. Why doesn't this routine employ all of the processes?

40 Table 15: Load Balance across PE's by Function, USER / PW_COMPOSE_STRIPE.in.PW_SPLINE_UTILS on several PEs: again Time% 0.0%, with the hardware counter activity concentrated on only some of the PEs. Or this one?

41 Table 20: HW Performance Counter Data (Experiment=1 / PE='HIDE'), totals for the program: utilization rate 86.6%, D1 cache hit ratio 99.2%, D2 cache hit ratio 72.9%, D1+D2 cache hit ratio 99.8%, effective D1+D2 reuse 7.57 refs/byte, together with the REQUESTS_TO_L2:DATA, DATA_CACHE_REFILLS and PAPI_L1_DCA rates and the System-to-D1 and L2-to-Dcache bandwidths.

42 Table 21: Sent Message Stats and Suggested MPI Rank Order - sent message total bytes per MPI rank (max, avg, min) and per node for the rank orders d and u and for a suggested custom ordering, given separately for dual-core and quad-core nodes. According to CrayPat, the default SMP-like placement of ranks is the worst choice; however, this custom placement of ranks did not do much in practice.

43 Visualization with Apprentice2
The previous statistics can be viewed graphically with Apprentice2.
pat_report version 4.2 produces the .ap2 file; with older versions you get it with the -f ap2 switch to pat_report.
Just launch it with the app2 command.
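For example (the data file name is a placeholder):
% app2 <data_file>.ap2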

44 You will get more info by holding the mouse cursor over a slice. Clicking will show the load balance of the function

45 Smallest, average and largest individual time

46 Same profile as a list; clicking a function will show its HW counter data

47 HW counter overview. It would be useful if cache-miss or cycle-stall counters had been recorded.

48 Routine call flow window (who calls who) and how the execution is divided

49 The largest execution time on the left, smallest on the right

50 Final remarks on CP2K analysis
Load balance is a major concern, and the largest optimization efforts should be put there.
Thanks to the performance analysis, we know in which routines the most severe imbalances occur and can start looking into the issue from there.
Messages are usually not too small, so there is no need for tedious aggregation.
I/O seems to be written efficiently.
Single-core efficiency should be investigated in the most intense routines; L1 and L2 access could be improved.
For true optimization work we would also need other HW counter data, e.g. SSE instruction utilization and memory stalls.
