Robust Algorithms for Current Deposition and Dynamic Load-balancing in a GPU Particle-in-Cell Code
F. Rossi, S. Sinigardi, P. Londrillo & G. Turchetti (University of Bologna & INFN)
GPU2014, Rome, Sept 17th 2014
Laser-driven, plasma-based acceleration is a breakthrough particle acceleration technology. In LWFA electron acceleration, a laser pulse propagating in an underdense plasma generates a wakefield that can be used as an accelerating structure: a micron-scale accelerating cavity whose fields can reach amplitudes above 100 GV/m. The BELLA laser at Berkeley Lab accelerates ultra-short electron bunches up to a few GeV over very short distances! In Bologna we perform modeling work for experimental groups at INFN Frascati (FLAME) and CNR INO.
Modeling the complex physics of laser-plasma accelerators requires advanced numerical models.
[Figure: example simulation regimes — electron acceleration (bubble regime); ion acceleration (TNSA) with post-acceleration and nanostructured targets; ion acceleration (RPA). Panels show electron/ion densities ne, ni, np in units of nc over x, y, z in laser wavelengths λ.]
Outline: Introduction to GPUs · GPU PIC current deposition algorithm · Multi-GPU parallelization · Simulation of radiation emission · Benchmarks and simulations · Conclusions

Particle-in-Cell simulations 101
Physical model describing laser-plasma interaction:
- Maxwell equations for the e.m. fields
- Relativistic Lorentz force for the plasma
- Vlasov equation: $(\partial_t + \dot{x}\,\partial_x + \dot{p}\,\partial_p)\, f_j(x, p, t) = 0$

Fluid model: integrate the Vlasov equation over the first two moments,
$n_j(x) = \int f_j(x, p, t)\, dp$, \quad $n_j u_j(x) = \int v\, f_j(x, p, t)\, dp$.
One momentum per position: cannot describe wave-breaking.

Particle-in-Cell (PIC): sample phase space as a sum of discrete, finite-sized macro-particles,
$f_j(x, p, t) = f_0 \sum_i g(x - x_i)\, \delta(p - p_i)$.
More accurate; describes wave-breaking; introduces numerical noise.
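The macro-particle sampling above can be sketched in a few lines. This toy 1D setup (two counter-streaming beams, nearest-grid-point weighting) is purely illustrative and not taken from the codes discussed here; it shows why the particle sampling retains information that a single-momentum fluid model loses.

```python
import numpy as np

# Sample a 1D phase space with macro-particles and recover the first
# two fluid moments n(x) and n(x)u(x) on a grid.
rng = np.random.default_rng(0)
L, nx = 10.0, 50
dx = L / nx
x = rng.uniform(0, L, 10000)          # macro-particle positions
p = np.where(x < L / 2, 1.0, -1.0)    # two counter-streaming beams
w = L / x.size                        # weight f0 per macro-particle

cells = (x / dx).astype(int)
n = np.bincount(cells, weights=np.full(x.size, w), minlength=nx) / dx
nu = np.bincount(cells, weights=w * p, minlength=nx) / dx

# A single fluid momentum per cell would average the two beams toward 0,
# while the particle sampling keeps both populations (wave-breaking).
u = np.divide(nu, n, out=np.zeros(nx), where=n > 0)
```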
Particle-in-cell method, algorithmically
Particle-grid interactions:
- The force on particles is interpolated (averaged) from the field grids.
- Particle current/density is deposited to the grid (scatter operation): a particle adds its density value to the cells that it overlaps.
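The gather/scatter pair can be sketched with linear (cloud-in-cell) shape functions in 1D. Function names and the periodic 1D setup are illustrative assumptions, not JASMINE's API:

```python
import numpy as np

def deposit(x, q, nx, dx):
    """Scatter: each particle adds its charge to the two cells it overlaps."""
    rho = np.zeros(nx)
    i = np.floor(x / dx).astype(int)
    frac = x / dx - i                        # fractional position in the cell
    np.add.at(rho, i % nx, q * (1 - frac))   # np.add.at accumulates duplicate
    np.add.at(rho, (i + 1) % nx, q * frac)   # indices correctly (like atomics)
    return rho / dx

def gather(x, field, dx):
    """Interpolate (average) a grid field back to the particle positions."""
    nx = field.size
    i = np.floor(x / dx).astype(int)
    frac = x / dx - i
    return field[i % nx] * (1 - frac) + field[(i + 1) % nx] * frac
```

Depositing conserves total charge (the two weights sum to 1), and gathering a uniform field returns that constant at every particle.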
Modeling the complex physics of laser-plasma accelerators requires advanced numerical codes. ALaDyn code suite: http://www.physycom.unibo.it/aladyn_pic/

CPU code ALaDyn: high-order schemes, for memory-intensive ion acceleration simulations. Typical run: 3D, fully electromagnetic TNSA simulation (n/nc = 40, λsd = 20 nm), 22x30x30 µm @ 100x25x25 points/µm, ~1 TB RAM (4000 CPUs), interaction length 30 µm.

GPU code JASMINE: particle tracking (radiation simulation), tunneling ionization simulation. Typical run: 3D, fully electromagnetic bubble-regime simulation (n = 10^18 ~ 10^19 e-/cm^3), 70x70x70 µm @ 40x10x10 points/µm, ~100 GB of GPU RAM, interaction length >4000 µm.
Current/Density Deposition on GPUs
PIC deposition algorithm on massively parallel architectures
1 Naive (1 particle → 1 thread) parallelization: race conditions on the same memory cells give wrong results; atomic operations or other synchronization methods are required.
2 With N particles per cell and a particle shape function of total order K (4~27), the density grid data is accessed K · N_ppc times, so it is worth caching in the GPU multiprocessor's shared memory.
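The race condition in point 1 can be demonstrated on the host with NumPy (an illustrative analogy, not GPU code): fancy-indexed `+=` collapses duplicate cell indices, like unsynchronized threads overwriting each other, while `np.add.at` accumulates every contribution, like an atomic add.

```python
import numpy as np

cells = np.array([3, 3, 3, 7])     # three particles land in the same cell
rho_naive = np.zeros(8)
rho_naive[cells] += 1.0            # duplicates overwrite: cell 3 gets 1, not 3
rho_atomic = np.zeros(8)
np.add.at(rho_atomic, cells, 1.0)  # unbuffered add: cell 3 correctly gets 3
```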
Caching using shared memory
Many accesses to the same memory location → worth caching: density grid data is accessed 2 · K · N_ppc times.
- Particles are organized by grid block and then by grid cell; one CUDA block processes the particles whose centers lie inside its grid block.
- Cache the electrical-current grid block (cells + a halo of about half the shape-function order) in the multiprocessor's shared memory: saves ~N_ppc · K in global memory traffic.
- Use segmented reduction for the sums in shared memory rather than shared-memory atomic operations, which are not natively available in double precision and can be slower than a reduction.
Deposition algorithm without atomics
Algorithm:
- Sort particles by cell block and cell
- Assign a CUDA block to a cell block
- Perform a per-block, shared-memory, segmented scan to compute the density sum for each cell
- Sum the cached copy into the global grid
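The sort-then-reduce idea can be sketched serially (a 1D NumPy simplification under assumed names, not the CUDA kernel): once contributions are sorted by cell key, a segmented reduction sums every run of equal keys, so each cell receives exactly one write and no atomics are needed.

```python
import numpy as np

def deposit_sorted(cell, val, ncells):
    """Atomics-free deposition: sort by cell, segmented-reduce, one write/cell."""
    order = np.argsort(cell, kind='stable')           # 1) sort by cell index
    cell, val = cell[order], val[order]
    # 2) segmented reduction: one partial sum per run of equal keys
    heads = np.flatnonzero(np.r_[True, cell[1:] != cell[:-1]])
    sums = np.add.reduceat(val, heads)
    grid = np.zeros(ncells)
    grid[cell[heads]] = sums                          # 3) single write per cell
    return grid
```

The result matches a direct accumulation, but every grid location is written once, which is what makes the shared-memory version race-free.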
Performance
LASER: a=7.7, waist=9.0 µm, FWHM=24.0 fs. PLASMA: density=1.00e+19 1/cm^3. Simulation run for ct = 60 µm, double precision. Note: the 3D test with the Esirkepov method runs with the stretched-grid optimization, while the 2D Esirkepov test runs without it.
F. Rossi, P. Londrillo, S. Sinigardi, A. Sgattoni and G. Turchetti — Robust algorithms for current deposition and efficient memory usage in a GPU
Performance gain
JASMINE vs ALaDyn (our CPU code), same exact simulation: the performance of a single NVIDIA Fermi GPU equates to ~200 BlueGene cores or ~45 IBM SP6 cores. Moreover, since subdomains are much larger, load balancing and scalability are much easier. In the simulation setups shown above (fair resolution), JASMINE can simulate ~4 mm of plasma per day on a ~24-GPU cluster. Kepler K40 boards at CNAF gave us an extra ~1.5x in performance per GPU.
PIC parallelization on multiple GPUs
Scaling to multiple GPUs: hiding network transfer
Exchange: 1 field halos, 2 particles leaving the subdomain. Cluster nodes communicate using standard MPI. Particles are transferred concurrently with current deposition, so communication can be hidden almost completely. Scalability test: warm-plasma simulation on the INFN APE cluster at La Sapienza and the PLX machine at CINECA.
Scaling
Simple algorithm for load balancing
Particle motion easily leads to an inhomogeneous distribution of the load. Shrink the volume of heavily loaded nodes: every few timesteps, select a subdomain (and its row) and the direction in which to shrink. The subdomain topology remains intact (vertices are conserved). The choice is made by trying to minimize the cost function:
$k_1\,\mathrm{Max(Load)}/\mathrm{Average(Load)} + k_2\,\mathrm{Variance(Load)}/\mathrm{Average(Load)}$
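The balancing criterion can be sketched as follows. The weights `k1`, `k2` are tunable, and the candidate-move enumeration is an illustrative assumption (the slide does not specify how candidates are generated); each candidate move is represented here simply by the per-subdomain load vector it would produce.

```python
import numpy as np

def cost(load, k1=1.0, k2=1.0):
    """Cost of a load distribution: penalize both the peak and the spread."""
    avg = load.mean()
    return k1 * load.max() / avg + k2 * load.var() / avg

def best_shrink(moves):
    """Pick the boundary move minimizing the cost.

    `moves` maps a candidate-move label to the load vector it would yield
    (the no-op move can be included as one of the candidates).
    """
    return min(moves, key=lambda m: cost(moves[m]))
```

For example, moving work off an overloaded subdomain lowers both terms: with loads [4, 1, 1], shrinking the first subdomain toward [3, 2, 1] reduces Max/Average from 2 to 1.5 and the variance term from 1 to 1/3.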
Load balancer test case
Setup (coarse test): LASER: a=5.8, waist=13.2 µm, FWHM=24.0 fs. PLASMA: density=3.80e+18 1/cm^3. GRID: n=[729, 96, 96], dx=[6.25e-02, 5.00e-01, 5.00e-01] µm.
Test case: scaling with load balancing
With 72 subdomains:
Memory-intensive simulations
Total memory availability is a constraint for many simulations (for example, ion acceleration). In a node, GPU memory is often much smaller than the total host memory available; using host memory to store simulation data makes larger simulations possible on a cluster of fixed size. Particle chunks stored in main CPU memory are streamed asynchronously and overlapped with computation using CUDA streams: slower, but no longer bound by the GPU device memory. Currently in the testing stage.
Meta-programming improves the code's flexibility, and GPU scan primitives are used in a wide range of auxiliary functions.
Meta-programming via a Python-based code template engine: flexibility/extensibility, porting of GPU kernels to other frameworks, optimization tweaks.
Auxiliary PIC functionalities require parallel scan primitives: radiation emission spectra computed from tracked trajectories; stream compaction for particle selection in tracking; scan to dynamically spawn a variable number of electrons from atoms in the field tunneling ionization model, with ionization levels from the ADK model.
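The scan-based spawning pattern can be sketched serially (a simplified, assumed illustration of the technique, not the kernel itself): an exclusive prefix sum of the per-atom electron counts gives each atom a unique write offset into the output particle array, so a variable number of new electrons can be appended in parallel without atomics.

```python
import numpy as np

def spawn_offsets(n_new):
    """Exclusive scan of per-atom counts -> per-atom write offsets + total."""
    total = int(n_new.sum())
    offsets = np.cumsum(n_new) - n_new   # exclusive prefix sum
    return offsets, total

n_new = np.array([2, 0, 1, 3])           # electrons freed per atom this step
offsets, total = spawn_offsets(n_new)
# atom i writes its electrons at positions offsets[i] .. offsets[i]+n_new[i]-1
```

The same primitive (with a predicate instead of a count) implements the stream compaction used for particle selection in tracking.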
Conclusions
GPUs allow simulating few-millimeter-long, fully 3D, laser-wakefield electron acceleration stages on small clusters (~10 K40s). For 10 GeV, >10 cm acceleration stages, reduced numerical methods (envelope approximation, cylindrical symmetry) are usually much more efficient: our objective is to implement them efficiently on GPUs.