Robust Algorithms for Current Deposition and Dynamic Load-balancing in a GPU Particle-in-Cell Code
F. Rossi, S. Sinigardi, P. Londrillo & G. Turchetti (University of Bologna & INFN)
GPU2014, Rome, Sept 17th 2014
Laser-driven, plasma-based acceleration is a breakthrough particle acceleration technology. In LWFA electron acceleration, a laser pulse propagating in an underdense plasma generates a wakefield that can be used as an accelerating structure: a micron-scale accelerating cavity whose fields can reach amplitudes above 100 GV/m. The BELLA laser at Berkeley Lab accelerates ultra-short electron bunches up to a few GeV over very short distances! In Bologna we perform modeling work for experimental groups at INFN Frascati (FLAME) and CNR INO.
Modeling the complex physics of laser-plasma accelerators requires advanced numerical models.
[Figure: example simulation regimes — electron acceleration (bubble regime); ion acceleration (TNSA) with post-acceleration and nanostructured targets; ion acceleration (RPA). Panels show electron/ion densities ne, ni, np in units of nc over x, y, z in laser wavelengths λ.]
Outline: Introduction to GPUs · GPU PIC current deposition algorithm · Multi-GPU parallelization · Simulation of radiation emission · Benchmarks and simulations · Conclusions

Particle-in-Cell simulations 101
Physical model describing laser-plasma interaction:
- Maxwell equations for the e.m. fields
- Relativistic Lorentz force for the plasma
- Vlasov equation: $(\partial_t + \dot{x}\,\partial_x + \dot{p}\,\partial_p)\, f_j(x, p, t) = 0$

Fluid model: integrate the Vlasov equation over the first two moments,
$n_j(x) = \int f_j(x, p, t)\, dp$, \quad $n_j u_j(x) = \int v\, f_j(x, p, t)\, dp$.
One momentum per position: cannot describe wave-breaking.

Particle-in-Cell (PIC): sample phase space as a sum of discrete, finite-sized macro-particles,
$f_j(x, p, t) = f_0 \sum_i g(x - x_i)\, \delta(p - p_i)$.
More accurate; describes wave-breaking; introduces numerical noise.
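The macro-particle sampling above can be sketched in a few lines. This toy 1D setup (two counter-streaming beams, nearest-grid-point weighting) is purely illustrative and not taken from the codes discussed here; it shows why the particle sampling retains information that a single-momentum fluid model loses.

```python
import numpy as np

# Sample a 1D phase space with macro-particles and recover the first
# two fluid moments n(x) and n(x)u(x) on a grid.
rng = np.random.default_rng(0)
L, nx = 10.0, 50
dx = L / nx
x = rng.uniform(0, L, 10000)          # macro-particle positions
p = np.where(x < L / 2, 1.0, -1.0)    # two counter-streaming beams
w = L / x.size                        # weight f0 per macro-particle

cells = (x / dx).astype(int)
n = np.bincount(cells, weights=np.full(x.size, w), minlength=nx) / dx
nu = np.bincount(cells, weights=w * p, minlength=nx) / dx

# A single fluid momentum per cell would average the two beams toward 0,
# while the particle sampling keeps both populations (wave-breaking).
u = np.divide(nu, n, out=np.zeros(nx), where=n > 0)
```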
Particle-in-cell method, algorithmically
Particle-grid interactions:
- The force on particles is interpolated (averaged) from the field grids.
- Particle current/density is deposited to the grid (scatter operation): a particle adds its density value to the cells that it overlaps.
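The gather/scatter pair can be sketched with linear (cloud-in-cell) shape functions in 1D. Function names and the periodic 1D setup are illustrative assumptions, not JASMINE's API:

```python
import numpy as np

def deposit(x, q, nx, dx):
    """Scatter: each particle adds its charge to the two cells it overlaps."""
    rho = np.zeros(nx)
    i = np.floor(x / dx).astype(int)
    frac = x / dx - i                        # fractional position in the cell
    np.add.at(rho, i % nx, q * (1 - frac))   # np.add.at accumulates duplicate
    np.add.at(rho, (i + 1) % nx, q * frac)   # indices correctly (like atomics)
    return rho / dx

def gather(x, field, dx):
    """Interpolate (average) a grid field back to the particle positions."""
    nx = field.size
    i = np.floor(x / dx).astype(int)
    frac = x / dx - i
    return field[i % nx] * (1 - frac) + field[(i + 1) % nx] * frac
```

Depositing conserves total charge (the two weights sum to 1), and gathering a uniform field returns that constant at every particle.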
Modeling the complex physics of laser-plasma accelerators requires advanced numerical codes. ALaDyn code suite: http://www.physycom.unibo.it/aladyn_pic/

CPU code ALaDyn: high-order schemes, for memory-intensive ion acceleration simulations. Typical run: 3D, fully electromagnetic TNSA simulation (n/nc = 40, λsd = 20 nm), 22x30x30 µm @ 100x25x25 points/µm, ~1 TB RAM (4000 CPUs), interaction length 30 µm.

GPU code JASMINE: particle tracking (radiation simulation), tunneling ionization simulation. Typical run: 3D, fully electromagnetic bubble-regime simulation (n = 10^18 ~ 10^19 e-/cm^3), 70x70x70 µm @ 40x10x10 points/µm, ~100 GB of GPU RAM, interaction length >4000 µm.
Current/Density Deposition on GPUs
PIC deposition algorithm on massively parallel architectures
1 Naive (1 particle → 1 thread) parallelization: race conditions on the same memory cells give wrong results; atomic operations or other synchronization methods are required.
2 With N particles per cell and a particle shape function of total order K (4~27), the density grid data is accessed K · N_ppc times, so it is worth caching in the GPU multiprocessor's shared memory.
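The race condition in point 1 can be demonstrated on the host with NumPy (an illustrative analogy, not GPU code): fancy-indexed `+=` collapses duplicate cell indices, like unsynchronized threads overwriting each other, while `np.add.at` accumulates every contribution, like an atomic add.

```python
import numpy as np

cells = np.array([3, 3, 3, 7])     # three particles land in the same cell
rho_naive = np.zeros(8)
rho_naive[cells] += 1.0            # duplicates overwrite: cell 3 gets 1, not 3
rho_atomic = np.zeros(8)
np.add.at(rho_atomic, cells, 1.0)  # unbuffered add: cell 3 correctly gets 3
```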
Caching using shared memory
Many accesses to the same memory location → worth caching: density grid data is accessed 2 · K · N_ppc times.
- Particles are organized by grid block and then by grid cell; one CUDA block processes the particles whose centers lie inside its grid block.
- Cache the electrical-current grid block (cells + a halo of about half the shape-function order) in the multiprocessor's shared memory: saves ~N_ppc · K in global memory traffic.
- Use segmented reduction for the sums in shared memory rather than shared-memory atomic operations, which are not natively available in double precision and can be slower than a reduction.
Deposition algorithm without atomics
Algorithm:
- Sort particles by cell block and cell
- Assign a CUDA block to a cell block
- Perform a per-block, shared-memory, segmented scan to compute the density sum for each cell
- Sum the cached copy into the global grid
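The sort-then-reduce idea can be sketched serially (a 1D NumPy simplification under assumed names, not the CUDA kernel): once contributions are sorted by cell key, a segmented reduction sums every run of equal keys, so each cell receives exactly one write and no atomics are needed.

```python
import numpy as np

def deposit_sorted(cell, val, ncells):
    """Atomics-free deposition: sort by cell, segmented-reduce, one write/cell."""
    order = np.argsort(cell, kind='stable')           # 1) sort by cell index
    cell, val = cell[order], val[order]
    # 2) segmented reduction: one partial sum per run of equal keys
    heads = np.flatnonzero(np.r_[True, cell[1:] != cell[:-1]])
    sums = np.add.reduceat(val, heads)
    grid = np.zeros(ncells)
    grid[cell[heads]] = sums                          # 3) single write per cell
    return grid
```

The result matches a direct accumulation, but every grid location is written once, which is what makes the shared-memory version race-free.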
Performance
LASER: a=7.7, waist=9.0 µm, FWHM=24.0 fs. PLASMA: density=1.00e+19 1/cm^3. Simulation run for ct = 60 µm, double precision. Note: the 3D test with the Esirkepov method runs with the stretched-grid optimization, while the 2D Esirkepov test runs without it.
F. Rossi, P. Londrillo, S. Sinigardi, A. Sgattoni and G. Turchetti — Robust algorithms for current deposition and efficient memory usage in a GPU
Performance gain
JASMINE vs ALaDyn (our CPU code), same exact simulation: the performance of a single NVIDIA Fermi GPU equates to ~200 BlueGene cores or ~45 IBM SP6 cores. Moreover, since subdomains are much larger, load balancing and scalability are much easier. In the simulation setups shown above (fair resolution), JASMINE can simulate ~4 mm of plasma per day on a ~24-GPU cluster. Kepler K40 boards at CNAF gave us an extra ~1.5x in performance per GPU.
PIC parallelization on multiple GPUs
Scaling to multiple GPUs: hiding network transfer
Exchange: 1 field halos, 2 particles leaving the subdomain. Cluster nodes communicate using standard MPI. Particles are transferred concurrently with current deposition, so communication can be hidden almost completely. Scalability test: warm-plasma simulation on the INFN APE cluster at La Sapienza and the PLX machine at CINECA.
Scaling
Simple algorithm for load balancing
Particle motion easily leads to an inhomogeneous distribution of the load. Shrink the volume of heavily loaded nodes: every few timesteps, select a subdomain (and its row) and the direction in which to shrink. The subdomain topology remains intact (vertices are conserved). The choice is made by trying to minimize the cost function:
$k_1\,\mathrm{Max(Load)}/\mathrm{Average(Load)} + k_2\,\mathrm{Variance(Load)}/\mathrm{Average(Load)}$
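The balancing criterion can be sketched as follows. The weights `k1`, `k2` are tunable, and the candidate-move enumeration is an illustrative assumption (the slide does not specify how candidates are generated); each candidate move is represented here simply by the per-subdomain load vector it would produce.

```python
import numpy as np

def cost(load, k1=1.0, k2=1.0):
    """Cost of a load distribution: penalize both the peak and the spread."""
    avg = load.mean()
    return k1 * load.max() / avg + k2 * load.var() / avg

def best_shrink(moves):
    """Pick the boundary move minimizing the cost.

    `moves` maps a candidate-move label to the load vector it would yield
    (the no-op move can be included as one of the candidates).
    """
    return min(moves, key=lambda m: cost(moves[m]))
```

For example, moving work off an overloaded subdomain lowers both terms: with loads [4, 1, 1], shrinking the first subdomain toward [3, 2, 1] reduces Max/Average from 2 to 1.5 and the variance term from 1 to 1/3.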
Load balancer test case
Setup (coarse test): LASER: a=5.8, waist=13.2 µm, FWHM=24.0 fs. PLASMA: density=3.80e+18 1/cm^3. GRID: n=[729, 96, 96], dx=[6.25e-02, 5.00e-01, 5.00e-01] µm.
Test case: scaling with load balancing
With 72 subdomains:
Memory-intensive simulations
Total memory availability is a constraint for many simulations (for example, ion acceleration). In a node, GPU memory is often much smaller than the total host memory available; using host memory to store simulation data makes larger simulations possible on a cluster of fixed size. Particle chunks stored in main CPU memory are streamed asynchronously and overlapped with computation using CUDA streams: slower, but no longer bound by the GPU device memory. Currently in the testing stage.
Meta-programming improves the code's flexibility, and GPU scan primitives are used in a wide range of auxiliary functions.
Meta-programming via a Python-based code template engine: flexibility/extensibility, porting of GPU kernels to other frameworks, optimization tweaks.
Auxiliary PIC functionalities require parallel scan primitives: radiation emission spectra computed from tracked trajectories; stream compaction for particle selection in tracking; scan to dynamically spawn a variable number of electrons from atoms in the field tunneling ionization model, with ionization levels from the ADK model.
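The scan-based spawning pattern can be sketched serially (a simplified, assumed illustration of the technique, not the kernel itself): an exclusive prefix sum of the per-atom electron counts gives each atom a unique write offset into the output particle array, so a variable number of new electrons can be appended in parallel without atomics.

```python
import numpy as np

def spawn_offsets(n_new):
    """Exclusive scan of per-atom counts -> per-atom write offsets + total."""
    total = int(n_new.sum())
    offsets = np.cumsum(n_new) - n_new   # exclusive prefix sum
    return offsets, total

n_new = np.array([2, 0, 1, 3])           # electrons freed per atom this step
offsets, total = spawn_offsets(n_new)
# atom i writes its electrons at positions offsets[i] .. offsets[i]+n_new[i]-1
```

The same primitive (with a predicate instead of a count) implements the stream compaction used for particle selection in tracking.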
Conclusions
GPUs allow simulating few-millimeter-long, fully 3D, laser-wakefield electron acceleration stages on small clusters (~10 K40s). For 10 GeV, >10 cm acceleration stages, reduced numerical methods (envelope approximation, cylindrical symmetry) are usually much more efficient: our objective is to implement them efficiently on GPUs.