Accelerating classical MD for multi-core CPUs and GPUs
Dr. Axel Kohlmeyer
Associate Dean for Scientific Computing, College of Science and Technology, Temple University, Philadelphia
http://sites.google.com/site/akohlmey/  a.kohlmeyer@temple.edu
LAMMPS Users and Developers Workshop and Symposium, March 24th-28th, 2014
Standard LAMMPS Parallelization

- MPI based (with an MPI emulator library for serial execution)
- Uses domain decomposition with one domain per MPI task (= processor)
- Each MPI task looks after the atoms in its domain
- Atoms move from MPI task to MPI task as they move through the system
- Assumes the same amount of work (force computations) in each domain; a minimal launch example is sketched below
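For illustration (not from the slides): the number of MPI tasks at launch determines the number of domains; the binary and input names are placeholders, and the optional processors command fixes the domain grid explicitly.

    # 8 MPI tasks => 8 spatial domains, e.g. a 2x2x2 grid
    mpirun -np 8 lmp_mpi -in in.melt

    # optionally request the grid layout in the input script:
    processors 2 2 2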
Why Bother Adding OpenMP?

1. Why not do it?
   a) LAMMPS is already very parallel
   b) Even more run-time settings to optimize
   c) OpenMP is often less effective than MPI (for MD)
2. Why do it anyway?
   a) On multi-core machines (e.g. Cray XT5) LAMMPS can run faster with MPI when some CPU cores are left idle
   b) Parallelization over particles, not domains
   c) PPPM has scaling limitations; at high node counts it would be better to run it on only a subset of tasks
OpenMP Parallelization

- OpenMP is directive based => well-written code works with or without OpenMP (a minimal illustration follows below)
- Can be added incrementally
- OpenMP only works in shared memory => multi-core processors are now ubiquitous
- OpenMP hides the calls to a threads library => less flexible, more overhead, but less effort
- Caution: need to worry about race conditions, memory corruption, false sharing, and Amdahl's law
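A minimal sketch (not from the slides) of why directives are non-invasive: without -fopenmp the pragma is ignored and the same source compiles as an ordinary serial program.

    #include <stdio.h>

    int main(void)
    {
        const int n = 1000000;
        double sum = 0.0;

        /* with OpenMP enabled, iterations are split across threads and
           'sum' is combined via the reduction clause; without OpenMP the
           pragma is skipped and this is a plain serial loop */
    #if defined(_OPENMP)
    #pragma omp parallel for reduction(+:sum)
    #endif
        for (int i = 0; i < n; ++i) sum += 1.0 / (double)(i + 1);

        printf("harmonic sum = %f\n", sum);
        return 0;
    }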
How to add OpenMP to LAMMPS

- LAMMPS is very modular: just add new classes derived from the non-threaded implementations
- Pairwise interactions (consume most time): the nested i,j loop over neighbors can be parallelized; each thread processes different i atoms
- Neighbor list build (binning still serial): nested i,j loop over atoms and neighboring bins
- Dihedrals and other bonded interactions
- Replace selected function(s) in the derived class
Threading Class Relations

- PairLJ: serial implementation; all non-threaded code
- ThrOMP: thread-safe utility functions; reduction of per-thread forces
- PairLJOMP: derived from PairLJ and ThrOMP; replaces ::compute() with a threaded version; gets access to a ThrData instance from FixOMP
- ThrData: per-thread accumulators; one instance per thread
- FixOMP: regularly called during the MD loop; determines when to reduce forces; manages ThrData instances; toggles thread-related features

A structural sketch of these relations follows.
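A minimal C++ sketch of the class layout described above; member names and signatures are illustrative assumptions, not the actual LAMMPS source.

    // hypothetical sketch of the class relations, not actual LAMMPS code
    class ThrData {
    public:
        double *f;            // per-thread force accumulator
    };

    class PairLJ {
    public:
        virtual ~PairLJ() {}
        virtual void compute() { /* serial force kernel */ }
    };

    class ThrOMP {
    protected:
        // combine per-thread force arrays into the global array
        void reduce_thr(ThrData *thr) { (void)thr; /* thread-safe reduction */ }
    };

    class FixOMP {
    public:
        ThrData *get_thr(int tid);  // hands each thread its accumulators
    };

    class PairLJOMP : public PairLJ, public ThrOMP {
    public:
        void compute() override {
            // each thread computes forces for its subset of i atoms into
            // its own ThrData instance obtained from FixOMP; FixOMP then
            // decides when the per-thread forces are reduced
        }
    };

    int main() { PairLJOMP p; p.compute(); return 0; }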
Naive OpenMP LJ Kernel

    #if defined(_OPENMP)
    #pragma omp parallel for default(shared) \
            private(i) reduction(+:epot)
    #endif
    /* each thread will work on different values of i */
    for (i = 0; i < (sys->natoms) - 1; ++i) {
        double rx1 = sys->rx[i];
        double ry1 = sys->ry[i];
        double rz1 = sys->rz[i];
        [...]
        /* race condition: i is unique for each thread, but not j,
           and some j may be an i of another thread => multiple
           threads may update the same location */
    #if defined(_OPENMP)
    #pragma omp critical
    #endif
        /* the critical directive lets only one thread execute
           this block at a time */
        {
            sys->fx[i] += rx * ffac;
            sys->fy[i] += ry * ffac;
            sys->fz[i] += rz * ffac;
            sys->fx[j] -= rx * ffac;
            sys->fy[j] -= ry * ffac;
            sys->fz[j] -= rz * ffac;
        }
    }

Timings (108 atoms): serial: 4.0s | 1 thread: 4.2s | 2 threads: 7.1s | 4 threads: 7.7s | 8 threads: 8.6s
Alternatives to omp critical

- Use omp atomic to protect each force addition => requires hardware support (modern x86)
  1 Thr: 6.3s, 2 Thr: 5.0s, 4 Thr: 4.4s, 8 Thr: 4.2s
  => faster than omp critical for multiple threads, but still slower than the serial code (4.0s)
- Don't use Newton's 3rd law => no race condition
  1 Thr: 6.5s, 2 Thr: 3.7s, 4 Thr: 2.3s, 8 Thr: 2.1s
  => better scaling, but 2 threads ~= serial speed
  => this is what is done on GPUs (many threads)

A sketch of the omp atomic variant follows.
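A minimal fragment sketching the omp atomic variant, assuming the same sys structure and rx/ffac values as the kernel on the previous slide; each update is protected individually instead of serializing the whole block (the fy and fz updates are analogous):

    /* replacing the critical section: protect each force update
       on its own; only this single memory update is made atomic */
    #if defined(_OPENMP)
    #pragma omp atomic
    #endif
        sys->fx[i] += rx * ffac;

    #if defined(_OPENMP)
    #pragma omp atomic
    #endif
        sys->fx[j] -= rx * ffac;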
MPI-like Approach with OpenMP

    #if defined(_OPENMP)
    #pragma omp parallel reduction(+:epot)
    #endif
    {
        double *fx, *fy, *fz;
    #if defined(_OPENMP)
        /* the thread number is like an MPI rank */
        int tid = omp_get_thread_num();
    #else
        int tid = 0;
    #endif
        /* sys->fx holds storage for one full fx array per thread
           => the race condition is avoided */
        fx = sys->fx + (tid * sys->natoms); azzero(fx, sys->natoms);
        fy = sys->fy + (tid * sys->natoms); azzero(fy, sys->natoms);
        fz = sys->fz + (tid * sys->natoms); azzero(fz, sys->natoms);

        for (int i = 0; i < (sys->natoms - 1); i += sys->nthreads) {
            int ii = i + tid;
            if (ii >= (sys->natoms - 1)) break;
            rx1 = sys->rx[ii];
            ry1 = sys->ry[ii];
            rz1 = sys->rz[ii];
            [...]
MPI-like Approach with OpenMP (2)

We need to write our own reduction and can use the threads to parallelize it:

        /* need to make certain all threads are done computing forces */
    #if defined(_OPENMP)
    #pragma omp barrier
    #endif
        /* each thread reduces its own contiguous chunk of the
           master force array */
        i = 1 + (sys->natoms / sys->nthreads);
        fromidx = tid * i;
        toidx = fromidx + i;
        if (toidx > sys->natoms) toidx = sys->natoms;
        for (i = 1; i < sys->nthreads; ++i) {
            int offs = i * sys->natoms;
            for (int j = fromidx; j < toidx; ++j) {
                sys->fx[j] += sys->fx[offs + j];
                sys->fy[j] += sys->fy[offs + j];
                sys->fz[j] += sys->fz[offs + j];
            }
        }
    }
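Design note: because every thread sums a disjoint index range [fromidx, toidx) of the master arrays, the reduction itself needs no locking beyond the initial barrier. The price is nthreads full copies of the force arrays, which is why the cost of this scheme grows with the thread count (see the timings on the next slide).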
OpenMP Timings Comparison

    Variant                          1 Thr   2 Thr   4 Thr   8 Thr
    omp critical                     4.2s    7.1s    7.7s    8.6s
    omp atomic                       6.3s    5.0s    4.4s    4.2s
    omp parallel region (MPI-like)   4.0s    2.5s    2.2s    2.5s
    no Newton's 3rd law              6.5s    3.7s    2.3s    2.1s

=> the omp parallel variant is best for few threads; the no-Newton's-3rd variant is better for more threads
=> the cost of the force reduction grows with the number of threads
[Figure: 2x Intel Xeon 2.66 GHz (Harpertown) w/ DDR InfiniBand; CHARMM benchmark (lj/charmm/coul/long + pppm), 32,000 atoms. Left panel: time for 1000 MD steps (s) vs. number of nodes (1-32); right panel: parallel efficiency (0-80%) vs. number of nodes. Curves: 8 MPI/node, 4 MPI + 2 OpenMP/node, 4 MPI/node, 2 MPI + 4 OpenMP/node, 1 MPI + 8 OpenMP/node.]
Running Big

- Vesicle fusion study: impact of the lipid ratio in a binary mixture
- cg/cmm/coul/long
- Experimental size => 4M CG-beads for one vesicle plus solvent
- 30,000,000 CPU-hour INCITE project
[Figure: Strong Scaling (Cray XT5). 1-vesicle CG system / 3,862,854 CG-beads. Time per MD step (sec, 0.04-0.36, log scale) vs. number of nodes (27-1878) for 12 MPI / 1 OpenMP, 6 MPI / 2 OpenMP, 4 MPI / 3 OpenMP, and 2 MPI / 6 OpenMP per node.]
[Figure: Strong Scaling (2) (Cray XT5). 8-vesicle CG system / 30,902,832 CG-beads. Time per MD step (sec, 0.1-0.61) vs. number of nodes (256-6231) for the same 12/1, 6/2, 4/3, and 2/6 MPI/OpenMP mixes per node.]
The Curse of the k-space (1)

[Figure: Rhodopsin benchmark, 860k atoms, 64 nodes, Cray XT5. Stacked time in seconds (0-25) broken down into Pair, Bond, Kspace, Comm, Neighbor, Other vs. number of PEs (128-768) for 12, 6, and 4 MPI tasks/node, 2 MPI + 6 OpenMP/node, and 1 MPI/node + OpenMP.]
The Curse of the k-space (2)

[Figure: Rhodopsin benchmark, 860k atoms, 128 nodes, Cray XT5. Same breakdown (Pair, Bond, Kspace, Comm, Neighbor, Other; 0-25 s) vs. number of PEs (256-1536) for 12, 6, and 4 MPI tasks/node, 2 MPI + 6 OpenMP/node, and 1 MPI/node + OpenMP.]
The Curse of the k-space (3)

[Figure: Rhodopsin benchmark, 860k atoms, 512 nodes, Cray XT5. Same breakdown (Pair, Bond, Kspace, Comm, Neighbor, Other; 0-25 s) vs. number of PEs (1024-6144) for 12, 6, and 4 MPI tasks/node, 2 MPI + 6 OpenMP/node, and 1 MPI/node + OpenMP.]
Additional Improvements

- OpenMP threading added to the charge density accumulation and the force application in PPPM
- Force reduction only done on the last /omp style
- Integration style verlet/split contributed by the Voth group, which runs k-space on a separate partition (compatible with the OpenMP version of PPPM)
- Added threading to selected fixes, e.g. charge equilibration for the COMB many-body potential
- Added threading to the fix nve/sphere integrator

A sketch of why the PPPM force application threads well follows the list.
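The force application (interpolating grid forces back onto particles) is a gather per particle, so it threads naturally. A heavily simplified sketch under assumed names (the arrays and stencil layout are illustrative, not the actual PPPM code):

    /* hypothetical sketch: each particle gathers from the grid, so no
       two threads write to the same particle => no race condition */
    void field2force(int natoms, int nstencil, const double *q,
                     const int *idx, const double *w,
                     const double *gridfield, double *fx)
    {
    #if defined(_OPENMP)
    #pragma omp parallel for default(shared)
    #endif
        for (int i = 0; i < natoms; ++i) {
            double ek = 0.0;
            for (int k = 0; k < nstencil; ++k)
                ek += w[i*nstencil + k] * gridfield[idx[i*nstencil + k]];
            fx[i] += q[i] * ek;   /* analogous for y and z components */
        }
    }

The charge density accumulation is the opposite (a scatter onto the grid) and needs the per-thread storage tricks discussed earlier.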
Current GPU Support in LAMMPS

- Multiple developments from different groups converged to two efforts with two philosophies
- GPU package (minimalistic):
  - pair styles, neighbor lists, and k-space (optional)
  - download coordinates, retrieve forces
  - runs asynchronously to bonded (and k-space) computations
- USER-CUDA package (see next talk):
  - replaces all classes that touch atom data
  - data transfer between host and GPU only as needed
Special Features of the GPU Package

- Can be compiled for CUDA or OpenCL thanks to the Geryon preprocessor macros
- Can attach multiple MPI tasks to one GPU for improved GPU utilization (up to 4x oversubscription on Fermi, up to 15x on Kepler)
- Uses a fix to manage GPUs and compute kernel dispatch: styles dispatch kernels asynchronously, and the fix then retrieves the forces after all other force computations are completed
- Tuned for good scaling with fewer atoms per GPU
[Figure: 1x GPU Performance in LAMMPS. Bulk water, LJ + long-range electrostatics, PPPM on GPU; 5,376 waters / 5000 steps and 21,504 waters / 1000 steps. Time in seconds (0-140) for: 8 cores @ 2.8 GHz (CPU reference), GeForce GTX 480 (sp/mp/dp), Tesla C2050 (sp/mp/dp), ATI FirePro V8800 (sp/mp/dp).]
[Figure: Multiple GPUs per Node. Bulk water, LJ + long-range electrostatics, PPPM on GPU; 5,376 waters / 5000 steps and 21,504 waters / 1000 steps. Time in seconds (0-100) for: 8 cores @ 2.8 GHz (CPU reference), 1x/2x/4x Tesla C2050 (mp), 1x/2x/4x ATI FirePro V8800 (mp).]
Comments on GPU Acceleration

- Mixed precision (force computation in single, force accumulation in double precision) is a good compromise: little overhead, good accuracy for forces; stress/pressure less so (see the sketch below)
- GPU acceleration is larger for models that require more computation in the force kernel
- Acceleration drops with a lower number of atoms per GPU => limited strong scaling on big iron
- The amount of acceleration depends on both the host and the GPU
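A minimal sketch of the mixed-precision pattern with hypothetical names (not the actual GPU package code): individual pair terms are evaluated in float, but each contribution is accumulated into a double, so rounding errors do not grow with the number of neighbors:

    /* hypothetical sketch: compute in single, accumulate in double */
    double accumulate_fx(float xi, const float *xj,
                         const float *fpair, int numneigh)
    {
        double fxsum = 0.0;  /* double accumulator limits rounding drift */
        for (int j = 0; j < numneigh; ++j)
            fxsum += (double)(fpair[j] * (xi - xj[j]));  /* float terms */
        return fxsum;
    }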
Installation of USER-OMP and GPU

- USER-OMP package:
  - make yes-user-omp to install the sources
  - add -fopenmp (GNU) or -openmp (Intel) to the CC and LINK definitions in your makefile to enable OpenMP
  - compilation without OpenMP => similar to the OPT package
- GPU package:
  - compile the library in lib/gpu for CUDA or OpenCL
  - make yes-gpu to install the style sources, which are wrappers for the GPU library
  - tweak lib/gpu/Makefile.lammps.??? as needed

An example build sequence is sketched below.
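For illustration, a typical build sequence starting from the src directory; the machine target and makefile names are placeholders that depend on your installation:

    # USER-OMP: install sources, then build with OpenMP flags enabled
    make yes-user-omp
    make linux            # assumes -fopenmp added to CC/LINK in MAKE/Makefile.linux

    # GPU package: build the support library first, then the wrappers
    cd ../lib/gpu && make -f Makefile.linux    # or an OpenCL makefile
    cd ../../src && make yes-gpu && make linux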
Using Accelerated Code

- All accelerated styles are optional and need to be activated in the input or from the command line
- Naming convention: lj/cut -> lj/cut/omp, lj/cut/gpu
- From the command line: -sf omp or -sf gpu
- Inside the script: suffix omp or suffix gpu, plus suffix on or suffix off
- Use the package omp/gpu command to adjust acceleration settings and select GPUs
- The -sf command line flag implies the default package settings

An example invocation is sketched below.
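For illustration, assuming an executable named lmp and an input file in.melt (both placeholder names), the OpenMP styles can be enabled like this; the GPU case is analogous with -sf gpu and a package gpu command:

    # command line: switch all available styles to their /omp variants
    lmp -sf omp -in in.melt

    # or equivalently inside the input script:
    package omp 4           # use 4 OpenMP threads per MPI task
    suffix omp
    pair_style lj/cut 2.5   # resolved to lj/cut/omp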
Conclusions and Outlook: OpenMP

- OpenMP+MPI is almost always a win, especially at large node counts (=> capability computing)
- USER-OMP also contains serial optimizations and is thus useful even without OpenMP compiled in
- Minimal changes to the LAMMPS core code
- USER-OMP is only a transitional implementation, since it is efficient only with a small number of threads
- A longer-term solution also needs to consider vectorization, and thus be more GPU-like and benefit from a different data layout (see next talk)