Accelerating a Particle-in-Cell Code for Space Plasma Simulations with OpenACC Ivy Bo Peng 1, Stefano Markidis 1, Andris Vaivads 2, Juris Vencels 1, Jan Deca 3, Giovanni Lapenta 3, Alistair Hart 4 and Erwin Laure 1 1 HPCViz Department, KTH Royal Institute of Technology 2 Swedish Institute of Space Physics, Uppsala, Sweden 3 Department of Mathematics, Centre for Mathematical Plasma Astrophysics (CmPA), KU Leuven 4 Cray Exascale Research Initiative Europe, UK
Top Supercomputers are Accelerated. 15% of the supercomputers in the Top500 list are accelerated; accelerated supercomputers provide 35% of the total Top500 performance. NVIDIA GPUs dominate the accelerator market. Source: www.top500.org, Nov 2014 list
Exascale Simulations Need to be Accelerated. Formation of a magnetosphere:
- Grid: 384 x 384 x 384
- Particles initially: 3.01x10^9
- Time steps: 15,000
- No. of MPI processes: 2,048
- FLOPs in mover: 10^15
- Simulation time: 24 hours
The simulation was run on the Lindgren supercomputer at KTH (Cray XE6 system, AMD Opteron processors and Cray Gemini interconnect).
Porting Code to GPU with OpenACC. OpenACC is an accelerator programming API standard supported by multiple vendors; we used the Cray and PGI compilers in our work. It aims at incremental porting of C, C++ and Fortran programs to multi-GPU systems using compiler directives: computationally intensive work is offloaded to the GPU, with implicit data movement between the CPU (host) and GPU (device) memory spaces. It is very similar to OpenMP and allows porting applications not initially designed for GPUs, like ipic3d.
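As a minimal sketch of this directive-based model (a toy loop, not ipic3d code), the pragma below asks the compiler to build a GPU kernel for the loop and to perform the transfers named in the data clauses:

    // Toy OpenACC offload example: the compiler generates a GPU kernel for
    // the loop and moves x and y between host and device as declared.
    void scale_add(int n, double a, const double *x, double *y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; i++) {
            y[i] = a * x[i] + y[i];   // executed on the GPU when built with OpenACC
        }
    }

Without an OpenACC-capable compiler the directive is simply ignored and the loop runs on the CPU, which is what makes the porting incremental.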
Particle-in-Cell Code ipic3d. ipic3d is a Particle-in-Cell code used by the space physics community to study the interactions between the solar wind and the Earth's magnetosphere. ipic3d is a parallel code implemented in C++ using hybrid MPI and OpenMP, about 20,000 LOC, with 80% parallel efficiency on 16,000 cores. In this work, we study the porting of ipic3d to multi-GPU systems with OpenACC.
Particle-in-Cell (PIC) Method. It solves the Vlasov equation (the transport equation without the collision term) and Maxwell's equations with computational particles (on the order of a billion particles in our simulations). The computational cycle is formed by three basic steps: particle mover, interpolation and field solver. In our work we use OpenACC in the mover and interpolation stages. Cycle: MOVER -> INTERPOLATION -> FIELD SOLVER.
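Since each particle is advanced independently, the mover maps naturally to one GPU thread per particle. The following is a heavily simplified sketch (constant electric field, plain explicit update; not the actual mover used in ipic3d, and the array names are illustrative):

    // Simplified particle mover: update the velocity from a constant
    // electric field, then advance the position. Each iteration is independent.
    void move_particles(int np, double dt, double qom,
                        double *x, double *y, double *z,
                        double *u, double *v, double *w,
                        double Ex, double Ey, double Ez)
    {
        #pragma acc parallel loop \
            copy(x[0:np], y[0:np], z[0:np], u[0:np], v[0:np], w[0:np])
        for (int p = 0; p < np; p++) {
            u[p] += qom * Ex * dt;   // accelerate
            v[p] += qom * Ey * dt;
            w[p] += qom * Ez * dt;
            x[p] += u[p] * dt;       // advance position
            y[p] += v[p] * dt;
            z[p] += w[p] * dt;
        }
    }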
Computational Time Spent in ipic3d. We focus on the most time-consuming parts (other than communication): the mover (41.1%) and the interpolation (13.2%). The CrayPAT profiling results are based on a typical magnetic reconnection simulation running on 256 processes on the Beskow Cray XC40 supercomputer at KTH.
Challenges in Porting ipic3d. We identified two challenges. Deep-copy issue: OpenACC works well with 1D arrays but requires more hacking for multidimensional arrays, structures, classes and template classes; both compiler support and programmer input are required. Atomic capture issue.
The Deep-Copy Issue. OpenACC supports a flat object model, while C++ pointer indirection requires non-contiguous transfers and pointer translation. We solve it with a workaround: cast the particle information to 1D arrays and use these 1D arrays to move data back and forth from/to the GPU. Source: http://on-demand.gputechconf.com/gtc/2013/presentations/s3084-openacc-openmp-directives-cce.pdf
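A sketch of this workaround (illustrative struct and names, not the ipic3d classes): the particle data is staged in flat 1D buffers that plain data clauses can transfer, instead of a class with nested pointers that OpenACC cannot deep-copy automatically.

    // Flattened particle storage: raw 1D arrays are OpenACC-friendly.
    struct ParticlesFlat {
        int np;
        double *x, *u;   // one position and one velocity component shown
    };

    void push_flat(ParticlesFlat &part, double dt)
    {
        double *x = part.x, *u = part.u;   // raw pointers for the data clauses
        int np = part.np;
        #pragma acc data copy(x[0:np]) copyin(u[0:np])
        {
            #pragma acc parallel loop
            for (int i = 0; i < np; i++)
                x[i] += u[i] * dt;
        }
    }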
Atomic Capture Issue. The OpenACC atomic directive guarantees that a variable is read and updated atomically. It is essential for correctly accelerating the interpolation stage, because multiple particles can map to the same grid point. It does not work in PGI compilers before version 15.1.
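A sketch of why atomics are needed in the interpolation stage (a nearest-grid-point charge deposit with illustrative names; the simpler atomic update form is shown, whereas atomic capture additionally reads back the updated value):

    // Particle-to-grid deposit: several particles can hit the same cell,
    // so the accumulation must be serialized with an atomic update.
    void deposit_density(int np, const int *cell, const double *q,
                         double *rho, int ncells)
    {
        #pragma acc parallel loop copyin(cell[0:np], q[0:np]) copy(rho[0:ncells])
        for (int p = 0; p < np; p++) {
            #pragma acc atomic update
            rho[cell[p]] += q[p];   // concurrent updates to the same node are safe
        }
    }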
GPU Porting Results. The ipic3d GPU port was tested against the 2D magnetic reconnection problem, which is important for the Earth's magnetosphere. Figure: evolution of the out-of-plane magnetic field component.
GPU Performance Results. Test environment: Cray XC30 system Swan, with K20X GPUs, Aries interconnect and Intel CPUs.
Conclusions. We successfully ported the ipic3d code to multi-GPU systems with OpenACC. C++ is not the best match for OpenACC: compiler support for deep copy in C++ is lacking; the Cray and PGI compiler teams are working on this, but programmer input will still be needed. More control over memory management and threads would facilitate more aggressive optimizations, like those that can be achieved with CUDA. The preliminary OpenACC port is 42%, 15%, 16% and 18% faster on 1, 2, 4 and 6 nodes compared to the original version. GPUDirect will simplify communication in the mover.
Thanks! This work was funded by the Swedish VR grant D621-2013-4309 and by the European Commission through the EPiGRAM project (grant agreement no. 610598, epigram-project.eu). The work used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. The work also used resources provided by the Cray Marketing Partner Network.
Future Optimization.
- Particle communication: copy only exiting particles to the CPU for MPI communication (use atomic updates to flag exiting particles); a sketch follows this list.
- GPU-direct communication.
- Cache-coherent particle sorting: sort particles within subdomains to minimize cache misses during particle-to-field interpolation.
- Multiple MPI processes on one node.
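A sketch of the planned particle-communication optimization (illustrative names and a 1D exit test only): the indices of leaving particles are gathered into a compact list on the GPU, so that only those particles need to be copied back to the host for the MPI exchange; atomic capture reserves a slot in the output list.

    // Collect the indices of particles that leave the local subdomain.
    // Only the first nexit entries of exit_idx are meaningful on return.
    int collect_exiting(int np, const double *x, double xmin, double xmax,
                        int *exit_idx)
    {
        int nexit = 0;
        #pragma acc parallel loop copyin(x[0:np]) copyout(exit_idx[0:np]) copy(nexit)
        for (int p = 0; p < np; p++) {
            if (x[p] < xmin || x[p] > xmax) {   // particle exits the subdomain
                int slot;
                #pragma acc atomic capture
                { slot = nexit; nexit += 1; }   // reserve a unique slot
                exit_idx[slot] = p;
            }
        }
        return nexit;
    }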
The Good and Bad Things about GPUs. GPUs are good for the computation-intensive parts of the application. The connection between the host and device memory spaces is the bottleneck: minimize data movement between host and GPU (see the sketch below) and use pinned memory. Indicative bandwidths: GPU device memory 250 GB/s (GDDR5), host memory 32 GB/s (DDR3 x 4 channels), PCIe 2.0 x16 = 8 GB/s, PCIe 3.0 x16 = 15.75 GB/s. Source: https://www.olcf.ornl.gov/support/system-user-guides/accelerated-computing-guide/
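A minimal sketch of the "minimize data movement" advice (toy loops, illustrative names): an enclosing data region keeps the arrays resident on the GPU across several kernels, so the PCIe link is crossed only once in each direction.

    // Two kernels share one data region: a and b stay on the device
    // between the loops instead of being transferred for each one.
    void two_passes(int n, double *a, const double *b)
    {
        #pragma acc data copy(a[0:n]) copyin(b[0:n])
        {
            #pragma acc parallel loop
            for (int i = 0; i < n; i++)
                a[i] += b[i];

            #pragma acc parallel loop
            for (int i = 0; i < n; i++)
                a[i] *= 2.0;
        }   // a is copied back to the host only once, here
    }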
Porting ipic3d to GPU.
Pros:
- about 100 FLOPs to update the velocity and location of each particle in each computational cycle
- different particle data structures, and the conversion between them, are supported in the code
Cons:
- advanced use of C++ templates requires deep copy -> linearize to an array structure
- the conversion between the two data structures is time-consuming
- field data is required for updating the particles
- possible race condition when interpolating from particles to the field -> atomic updates are needed (compiler issue; high collision rates are inefficient)
Porting to GPU with OpenACC.
- Add the field components to the particle class and linearize the particle structures.
- Copy in the field data in global memory.
- Create a data region on the GPU upon the initialization of each particle species in the class constructor and free the data region in the class destructor (a sketch follows this list).
- Use asynchronous data movement for different particle species.
- Two compilers, Cray and PGI, on two GPU-accelerated supercomputers: Titan and Swan.
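A sketch of that data-management scheme (illustrative class and member names, not the actual ipic3d classes): device buffers for a species are created in the constructor, released in the destructor, and updated asynchronously on a per-species queue.

    #include <cstdlib>

    class Species {
    public:
        Species(int np_, int queue_) : np(np_), queue(queue_) {
            x = (double *) std::malloc(np * sizeof(double));
            u = (double *) std::malloc(np * sizeof(double));
            double *xl = x, *ul = u;   // local aliases keep the clauses simple
            // device copies live for the lifetime of the species
            #pragma acc enter data create(xl[0:np], ul[0:np])
        }
        ~Species() {
            #pragma acc wait(queue)    // finish pending async transfers first
            double *xl = x, *ul = u;
            #pragma acc exit data delete(xl[0:np], ul[0:np])
            std::free(x);
            std::free(u);
        }
        void upload() {                // asynchronous host -> device transfer
            double *xl = x, *ul = u;
            #pragma acc update device(xl[0:np], ul[0:np]) async(queue)
        }
        void sync() {                  // wait for this species' queue
            #pragma acc wait(queue)
        }
    private:
        int np, queue;                 // queue id doubles as the async queue number
        double *x, *u;                 // flattened particle data (two arrays shown)
    };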