1 Center for Information Services and High Performance Computing (ZIH)
IUmd: Performance Analysis of a Molecular Dynamics Code
Thomas William, Dresden, 5/27/13

2 Overview
- IUmd Introduction
- First Look with Vampir
- Comparison of Code Versions
- Hardware Counters
- Source Code Analysis
- Source Code Optimization
- Porting IUmd to GPGPU

3 Initial Motivation
- The code had been in development for many years
- Lots of different code versions doing similar science
- The developers wanted to compare these versions and search for performance optimization opportunities
- They (IU) provide the domain-specific expertise; we (TUD) provide the performance optimization expertise

4 Introduction: IUmd Code
- Classical molecular dynamics simulation of dense nuclear matter consisting of either fully ionized atoms or free neutrons and protons
- Main targets are studies of the dense matter in white dwarfs and neutron stars
- Interaction potentials between particles are treated as classical two-particle central potentials
- No complicated particle geometry or orientation
- Electrons can be treated as a uniform background charge (not explicitly modeled)

5 Visualization of Sample Data (VTK)

6 Oh my god, it's FORTRAN. DIGGING INTO THE CODE

7 IUmd Code Details
Particle-particle interactions have been implemented in multiple ways (nucleon, pure-ion, ion-mixture), located in the different files PP01, PP02, and PP03. The code blocks are selectable using preprocessor macros.
- PP01 is the original implementation, with no division into the Ax, Bx, or NBS blocks
- PP02 implements the versions in use by the physicists today: 3 different implementations for the Ax block, 3 implementations for the Bx block, no manual blocking (NBS)
- PP03 includes new routines: 3 Ax blocks, 8 Bx blocks, can be blocked using the NBS value

8 IUmd Code Details
Two sections of code each have two or more variations; one section is labelled A and the other B, and the variations are numbered. The code can be compiled as serial, OpenMP, MPI, or MPI+OpenMP (hybrid):
   make MDEF=XRay md_mpi ALG=PP02 BLKA=A0 BLKB=B2 NBS="NBSX=0"

9 IUmd Workflow
The structure of the program is simple:
- It first reads in a parameter file (runmd.in) and an initial particle configuration file (md.in); this allows checkpointing
- The program then calculates the initial set of accelerations
- It then enters a time-stepping loop, a triply nested do loop

10 Runtime Parameters
   !Parameters:
   sim_type   = 'ion mixture', !simulation type
   tstart     = 0.00,          !start time
   dt         = 25.00,         !time step (fm/c)
   !Warmup:
   nwgroup    = 2,             !groups
   nwsteps    = 50,            !steps per group
   !Measurement:
   ngroup     = 2,             !groups
   ntot       = 2,             !per group
   nind       = 25,            !steps between
   tnormalize = 50,            !temp normalization
   ncom       = 50,            !center of mass motion cancel.
Figure: Runtime parameters (excerpt from runmd.in)

11 The Main Loop
   do 100 ig = 1, ngroup
      ! initialize group ig statistics
      do 40 j = 1, ntot
         do i = 1, nind
            call newton   ! computes forces, updates x and v
         enddo
         call vtot
   40 continue
      ! compute group ig statistics
   100 continue
Figure: Simplified version of the main loop

12 MD Implementation Details: the newton module
- Forces are calculated in the newton module in a pair of nested do-loops
- The outer loop goes over target particles, the inner loop over source particles
- Targets are assigned to MPI processes in a round-robin fashion
- Within each MPI process, the work is shared among OpenMP threads
A minimal sketch of this decomposition follows below.
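The sketch is illustrative only, not the IUmd source: the rank and size are stand-ins for their MPI counterparts, and a 1/r^2 central force stands in for the actual interaction potentials.

   program newton_sketch
      implicit none
      integer, parameter :: n = 512
      integer :: myrank, nprocs, i, j
      real(8) :: x(3, n), f(3, n), fi(3), dx(3), r2
      myrank = 0; nprocs = 1           ! stand-ins for the MPI rank and size
      call random_number(x)
      do i = myrank + 1, n, nprocs     ! outer loop: targets, round-robin over ranks
         fi = 0.0d0
   !$omp parallel do private(dx, r2) reduction(+:fi)
         do j = 1, n                   ! inner loop: sources, shared among threads
            if (j /= i) then
               dx = x(:, j) - x(:, i)
               r2 = sum(dx * dx)
               fi = fi + dx / (r2 * sqrt(r2))   ! placeholder central force
            end if
         end do
   !$omp end parallel do
         f(:, i) = fi
      end do
      print *, 'f(:,1) =', f(:, 1)
   end program newton_sketch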

13 The cobbler should stick to his last. FIRST LOOK WITH VAMPIR

14 Pure MPI

15 HUGE OpenMP Overhead in Vampir

16 Burnin' CPU hours for good. COMPARISON OF CODE VERSIONS

17 Cray XT5m
- The XT5m is a 2D mesh of nodes
- Each node has two sockets, each with four cores: AMD Opteron 23 Shanghai (45 nm) running at 2.4 GHz
- 84 compute nodes, 672 cores in total
- pgi/9.0.4, using the XT PE driver xtpe-shanghai

18 Time Constraints
First runs showed that a measurement takes:
- 5 minutes with 5k particles
- 1 hour with 27k particles
- 10 hours with 55k particles
Scientifically useful data requires 55k particles or more.

19 Overview PP01, PP02, and PP03
Figure: Overview of the runtimes for all 143 code-block combinations with an ion-mixture input dataset of 55k particles. The naming scheme is source-code-file, A-block, B-block, blocking factor.

20 PP02 Code Blocks

21 Livin' in a hardware meritocracy. HARDWARE COUNTERS

22 Floating Point Instructions

23 The end of the world is nigh! SOURCE CODE ANALYSIS

24 Block A

25 Block B: the cut-off sphere reduces interactions to 19%
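A back-of-the-envelope reading, not taken from the slides: for a roughly uniform particle distribution, a cut-off sphere of radius r_c keeps only the sources inside it, so the surviving fraction of pair interactions is approximately the volume ratio (4/3) * pi * r_c^3 / V_box; the quoted 19% corresponds to that ratio for this input set.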

26 Recap: PP02 Code Blocks

27 Why is -O2 as fast as -O3, and why is -fast only marginally better?

28 Finally, progress... SOURCE CODE OPTIMIZATION

29 Recap: OpenMP Overhead in Vampir

30 Code Changes
- The omp parallel do for the j-loop is removed
- An omp parallel region is opened around the i-loop
- The j-loop is parallelized explicitly
- The entire ij-loop section sits in a third loop over threads; each iteration of this outer loop is assigned to an OpenMP thread
A self-contained sketch of the partitioning arithmetic follows the two excerpts.
Before:
   allocate(fj(3,0:n-1))
   do 100 i = myrank, n-2, nprocs
      fi(:) = 0.0d0
!$omp parallel do private(r2,k,xx,r,fc) &
!$omp reduction(+:fi) schedule(runtime)
      do 90 j = i+1, n-1   ! A Block
After:
!$omp parallel private(nthrd,nj,nr,j0,j1,xx,r2,r,fc,fi)
   nthrd = omp_get_num_threads()
   allocate(fi(3,0:n-1))
!$omp do schedule(static,1)
   do 110 ithrd = 0, nthrd-1
      do 100 i = myrank, n-2, nprocs
         fi(:,i) = 0.0d0
         nj = (n-i-1)/nthrd
         nr = mod(n-i-1,nthrd)
         j0 = (i+1) + ithrd*nj + min(ithrd,nr)
         j1 = (i+1) + (ithrd+1)*nj + min(ithrd+1,nr)
         do 90 j = j0, j1
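To make the partitioning arithmetic concrete, here is a minimal, self-contained sketch (illustrative values, not from IUmd) that prints which contiguous range of source indices each OpenMP thread receives; the half-open ranges [j0, j1) tile the indices i+1 .. n-1 exactly, with the remainder spread over the first nr threads.

   program partition_demo
      use omp_lib
      implicit none
      integer :: n, i, ithrd, nthrd, nj, nr, j0, j1
      n = 100                           ! example: 100 particles
      i = 0                             ! one example target particle
   !$omp parallel private(ithrd, nthrd, nj, nr, j0, j1)
      nthrd = omp_get_num_threads()
      ithrd = omp_get_thread_num()
      nj = (n - i - 1) / nthrd          ! base chunk of sources per thread
      nr = mod(n - i - 1, nthrd)        ! leftover sources
      j0 = (i + 1) + ithrd * nj + min(ithrd, nr)
      j1 = (i + 1) + (ithrd + 1) * nj + min(ithrd + 1, nr)
   !$omp critical
      print '(a,i3,a,i4,a,i4)', 'thread ', ithrd, ' handles j = ', j0, ' to ', j1 - 1
   !$omp end critical
   !$omp end parallel
   end program partition_demo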

31 New MPI/OpenMP Version in Vampir

32 Parallel Efficiency of the OpenMP Application

33 AMD Interlagos OpenMP Efficiency: 97.6% parallel efficiency

34 The next level: PORTING IUMD TO GPGPU

35 3 Different Approaches
- HMPP (only a proof of concept so far)
- OpenACC
- CUDA (PGI CUDA Fortran)
Both CUDA and OpenACC can be combined with MPI.

36 Code Changes
Goal: combine the MPI version with CUDA. This needs new parameters to define:
- the dimensions of the MPI process grid
- the number of source chunks
- the thread block dimensions for the CUDA kernel
Constraints: [1] must exactly divide the number of particles N and must be a multiple of the warp size (32); [2]*[1]/4 must exactly divide N/nschunk. A small consistency-check sketch follows below.
   !MPI and CUDA parameters
   procdim    = 1, 1,    !MPI process grid dimensions
   nschunk    = 4,       !number of source chunks
   thrdblkdim = 128, 1   !CUDA thread block and block grid dim
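Reading [1] and [2] as the thread block and block grid dimensions, the constraints amount to a few divisibility tests. A hedged sketch; the variable names and the particle count are illustrative, not from IUmd:

   program check_params
      implicit none
      integer, parameter :: warp = 32
      integer :: n, nschunk, blkdim, grddim
      n = 55296; nschunk = 4; blkdim = 128; grddim = 216   ! example values
      if (mod(n, blkdim) /= 0)    stop 'blkdim must exactly divide N'
      if (mod(blkdim, warp) /= 0) stop 'blkdim must be a multiple of the warp size'
      if (mod(n / nschunk, grddim * blkdim / 4) /= 0) &
         stop 'grddim*blkdim/4 must exactly divide N/nschunk'
      print *, 'parameters are consistent'
   end program check_params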

37 Changes to MPI
- Split MPI_COMM_WORLD into a Cartesian grid
- Example: calculating the kinetic energy (ttot)
- The master column adds up the kinetic energies (Allreduce)
- The new values are then broadcast along each row
A minimal sketch of this pattern follows below.
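The sketch below shows the communicator layout and the Allreduce-then-Bcast pattern described above; the 2x2 grid size and the variable names are assumptions, not the IUmd source, and it expects to run with 4 MPI ranks:

   program cart_sketch
      use mpi
      implicit none
      integer :: ierr, rank, comm_cart, comm_col, comm_row
      integer :: dims(2), coords(2)
      logical :: periods(2), remain(2)
      real(8) :: tloc, ttot
      call MPI_Init(ierr)
      dims = (/ 2, 2 /)                          ! example 2x2 process grid
      periods = .false.
      call MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, .false., &
                           comm_cart, ierr)
      remain = (/ .true., .false. /)             ! vary dim 1: column communicators
      call MPI_Cart_sub(comm_cart, remain, comm_col, ierr)
      remain = (/ .false., .true. /)             ! vary dim 2: row communicators
      call MPI_Cart_sub(comm_cart, remain, comm_row, ierr)
      call MPI_Comm_rank(comm_cart, rank, ierr)
      call MPI_Cart_coords(comm_cart, rank, 2, coords, ierr)
      tloc = real(rank + 1, 8)                   ! stand-in for a partial kinetic energy
      ttot = 0.0d0
      if (coords(2) == 0) then                   ! master column sums the energies
         call MPI_Allreduce(tloc, ttot, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                            comm_col, ierr)
      end if
      ! broadcast the new value along each row, from the master-column member
      call MPI_Bcast(ttot, 1, MPI_DOUBLE_PRECISION, 0, comm_row, ierr)
      call MPI_Finalize(ierr)
   end program cart_sketch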

38 Example: int_ion_mix (original)

39 OpenACC
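For comparison with the CUDA port described on the following slides, here is a hedged illustration of how a pair-force loop might be annotated with OpenACC; the variable names and the placeholder central force are assumptions, not the actual int_ion_mix kernel:

   program acc_forces
      implicit none
      integer, parameter :: n = 1024
      integer :: i, j
      real(8) :: x(3, n), f(3, n)
      real(8) :: dx, dy, dz, r2, fx, fy, fz
      call random_number(x)
   !$acc data copyin(x) copyout(f)
   !$acc parallel loop gang vector private(dx, dy, dz, r2, fx, fy, fz)
      do i = 1, n                        ! one lane per target particle
         fx = 0.0d0; fy = 0.0d0; fz = 0.0d0
   !$acc loop seq
         do j = 1, n                     ! sources applied sequentially per target
            if (j /= i) then
               dx = x(1, j) - x(1, i)
               dy = x(2, j) - x(2, i)
               dz = x(3, j) - x(3, i)
               r2 = dx*dx + dy*dy + dz*dz
               fx = fx + dx / (r2 * sqrt(r2))   ! placeholder central force
               fy = fy + dy / (r2 * sqrt(r2))
               fz = fz + dz / (r2 * sqrt(r2))
            end if
         end do
         f(1, i) = fx; f(2, i) = fy; f(3, i) = fz
      end do
   !$acc end data
      print *, 'f(:,1) =', f(:, 1)
   end program acc_forces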

40 CUDA
Step A:
- Declare arrays in shared memory (xss, ffs)
- Define grid, block, and thread parameters
- Copy target data from global memory to registers
- Initialize the force
- Copy a chunk of source data from global to shared memory

41 CUDA
Step B: thread (i,j) applies its sources (xss) to its target
Step C: sum the forces (ffs) from all threads of my block
Step D: copy the shared memory array (ffs) to global memory; each block copies to its own array section, and thread columns share the work

42 CUDA
Step E: the last block in a row that copies its results then sums the contributions from each block
- Threads share the work, accumulating results into registers
- Threads take turns adding their contributions to shared memory
- Thread columns write the final results to global memory
A condensed CUDA Fortran sketch of these steps follows below.
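This sketch is a hedged illustration, not the IUmd kernel: it uses a placeholder central force, assumed names (pair_forces, x, f, BLK), and one thread per target, so the block-level reduction of steps C to E collapses into a direct write-back. A launch would look like call pair_forces<<<(n+BLK-1)/BLK, BLK>>>(x_d, f_d, n).

   module force_kernel
      use cudafor
      implicit none
      integer, parameter :: BLK = 128             ! assumed thread block size
   contains
      attributes(global) subroutine pair_forces(x, f, n)
         integer, value :: n
         real(8) :: x(3, n), f(3, n)              ! device arrays (kernel arguments)
         real(8), shared :: xss(3, BLK)           ! Step A: source tile in shared memory
         real(8) :: xi(3), fi(3), dx(3), r2
         integer :: i, j, jt, tile, k
         i  = (blockIdx%x - 1) * blockDim%x + threadIdx%x
         fi = 0.0d0
         if (i <= n) xi = x(:, i)                 ! Step A: target coords into registers
         do tile = 0, (n - 1) / BLK               ! loop over source chunks
            jt = tile * BLK + threadIdx%x
            if (jt <= n) xss(:, threadIdx%x) = x(:, jt)   ! global to shared copy
            call syncthreads()
            if (i <= n) then
               do k = 1, min(BLK, n - tile * BLK) ! Step B: apply the tile's sources
                  j = tile * BLK + k
                  if (j /= i) then
                     dx = xss(:, k) - xi
                     r2 = dx(1)**2 + dx(2)**2 + dx(3)**2
                     fi = fi + dx / (r2 * sqrt(r2))       ! placeholder central force
                  end if
               end do
            end if
            call syncthreads()
         end do
         if (i <= n) f(:, i) = fi                 ! Steps C-E collapsed: direct write-back
      end subroutine pair_forces
   end module force_kernel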

43 Fermi vs Kepler (1 GPU, pgfortran CUDA)
- Baseline: 8 OpenMP threads (icc)
- Pure-ion simulation
- Weak scaling (increasing the number of ions)
Figure: speedup over the baseline vs. number of ions, for Fermi and Kepler

44 Comparison OpenACC vs CUDA
Baseline: OpenMP version on 2 Intel Xeon CPUs (E5620, 2.4 GHz), Intel 12.1 compiler, 8 OpenMP threads
GPU setup: 1 Kepler K20, CUDA version 5.0; CAPS for OpenACC, PGI 13.1 for OpenACC/CUDA
IUmd parameters: nschunk = 4, blkdim = 128, grddim = 216

45 Comparison OpenACC vs CUDA
Simulation type: ion-mixture. Note that the input data set changed!
Figure: speedup of the OpenACC (CAPS), OpenACC (PGI), and CUDA versions

46 Combined Benchmark: first results
Figure: runtime in seconds of the benchmark (IUmd 6.3.0) for OpenMP, MPI+OpenMP, and CUDA configurations

47 Note of Thanks to Contributors (in alphabetical order)
Robert Dietrich, Jonas Stolle, Frank Winkler; Donald Berry, Robert Henschel, Joseph Schuchart

48 General approach to optimization

49 FPU Idle Times
Figure: FPU idle times as a percentage of the PAPI-measured total cycles

50 Branch Mispredictions
