Porting the Plasma Simulation PIConGPU to Heterogeneous Architectures with Alpaka
René Widera¹, Erik Zenker¹,², Guido Juckeland¹, Benjamin Worpitz¹,², Axel Huebl¹,², Andreas Knüpfer², Wolfgang E. Nagel², Michael Bussmann¹
¹ Helmholtz-Zentrum Dresden-Rossendorf, ² Technische Universität Dresden
PIConGPU
Application areas: electron acceleration with lasers, ion acceleration with lasers, plasma instabilities, compact X-ray sources, tumor therapy, astrophysics.
Domain Decomposition: Field and Particle Domain
Moving particles create fields, particles change cells, and fields act back on particles.
Creating Vectorized Data Structures for Particles and Fields
Field domain: chunked into supercells, line-wise aligned in memory.
Particle domain: particles are stored per supercell in fixed-size frames, laid out as a struct of aligned arrays.
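To picture this layout, here is a minimal C++ sketch of a supercell with fixed-size particle frames stored as a struct of aligned arrays. The frame size of 256 and the attribute names (posX, momX, cellIdx, ...) are illustrative assumptions, not the actual PIConGPU types.

    #include <array>
    #include <cstdint>
    #include <vector>

    // One frame holds the attributes of up to FrameSize particles of a single
    // supercell as a struct of aligned arrays (one contiguous array per attribute).
    constexpr std::size_t FrameSize = 256; // hypothetical frame size

    struct alignas(64) ParticleFrame
    {
        std::array<float, FrameSize> posX, posY, posZ;   // in-cell position
        std::array<float, FrameSize> momX, momY, momZ;   // momentum
        std::array<std::uint32_t, FrameSize> cellIdx;    // cell within the supercell
        std::size_t numParticles = 0;                    // filled entries in this frame
    };

    // A supercell owns a list of fixed-size frames; particles that change cells
    // are moved between frames and supercells instead of reallocating per particle.
    struct Supercell
    {
        std::vector<ParticleFrame> frames;
    };

The fixed frame size keeps every attribute array contiguous and vectorization-friendly, and moving a particle to a neighboring supercell only means copying its slot into a frame of that supercell.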
Algorithm Driven Cache Strategy
[figure: the cells of a supercell and their particles are staged from Global Memory into Shared Memory]
High Utilization of Threads
[figure: a thread block with threads 1 to 4 works cooperatively on the supercell data cached in Shared Memory]
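A rough plain-C++ sketch of this cache strategy, not the PIConGPU/Alpaka implementation: the field values of one supercell are staged once into a small local buffer that stands in for shared memory, and all particle updates of that supercell then read from this buffer. The structure names, the supercell size, and the placeholder particle push are assumptions for illustration.

    #include <array>
    #include <cstddef>
    #include <vector>

    constexpr std::size_t CellsPerSupercell = 8 * 8; // hypothetical supercell size

    struct FieldValue { float ex, ey, ez, bx, by, bz; };
    struct Particle   { std::size_t cellIdx; float momX, momY, momZ; };

    // Process one supercell: stage its field cells once, then update all particles.
    void processSupercell(
        std::vector<FieldValue> const & globalField,  // "global memory"
        std::size_t firstCellOfSupercell,
        std::vector<Particle> & particles)            // particles of this supercell
    {
        // Stage: copy the supercell's cells into a small buffer ("shared memory").
        std::array<FieldValue, CellsPerSupercell> cached;
        for(std::size_t c = 0; c < CellsPerSupercell; ++c)
            cached[c] = globalField[firstCellOfSupercell + c];

        // Compute: every particle of the supercell reads its fields from the buffer,
        // so each global field value is loaded only once per supercell.
        for(auto & p : particles)
        {
            FieldValue const & f = cached[p.cellIdx];
            p.momX += f.ex; p.momY += f.ey; p.momZ += f.ez; // placeholder field push
        }
    }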
Task-Parallel Execution of Kernels: asynchronous communication.
PIConGPU Scales up to 16,384 GPUs
[plot: strong scaling, speedup vs. number of GPUs, curves for ideal and for runs scaled 1 to 32, 8 to 256, 64 to 2048, 512 to 16384, and 4096 to 16384 GPUs]
[plot: weak scaling, efficiency [%] vs. number of GPUs, ideal vs. PIConGPU]
Efficiency > 95%, 6.9 PFlop/s (SP)
More Physics, More Computations, More Power!
Atom-physical effects map the old atom state to the new atom state: with the state vector s = (s_{1,1}, s_{1,2}, s_{1,3}, ..., s_{n,m}) and the transition matrix T = (t_{i,j}), the update is s_new = T · s_old.
Really Big Data Task: random access on large amounts of data (> 100 GB); a good job for powerful CPUs; requires efficient CPU/GPU cooperation.
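As a worked sketch of this update in plain C++ (dimensions, types, and the dense storage are illustrative assumptions, not the actual atomic-physics module):

    #include <cstddef>
    #include <vector>

    // New atom state = transition matrix applied to the old atom state: s_new = T * s_old.
    std::vector<double> applyTransitions(
        std::vector<std::vector<double>> const & T,  // n x n atom-physical transition matrix
        std::vector<double> const & sOld)            // old atom state (n entries)
    {
        std::size_t const n = sOld.size();
        std::vector<double> sNew(n, 0.0);
        for(std::size_t i = 0; i < n; ++i)
            for(std::size_t j = 0; j < n; ++j)
                sNew[i] += T[i][j] * sOld[j];        // random access over a large matrix
        return sNew;
    }

For realistic state spaces this means random access on more than 100 GB of data, which is why the slide calls for powerful CPUs and efficient CPU/GPU cooperation.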
Small Open Source Communities need Maintainable Codes
Heterogeneity: write once, execute everywhere.
Testability: validate once, get correct results everywhere.
Sustainability: porting implies minimal code changes.
Optimizability: tune for good performance at minimum coding effort.
Openness: open source and open standards.
Single source.
Alpaka
Good News: there are Alpakas on the Compute Meadow
Single, zero-overhead interface to existing parallelism models (threads, fibers, and whatever parallelism model comes next).
Single-source C++11 kernels.
Data-structure-agnostic memory model.
Abstract Hierarchical Redundant Parallelism Model
[figure: hierarchy of parallelization levels Grid, Block, Thread, Element, with parallel execution, synchronization, and sequential execution]
The Element level is an explicit sequential layer.
Data Structure Agnostic Memory Model
Explicit deep copy between Host Memory and Device Global Memory.
[figure: the Grid works on Device Global Memory, each Block has Shared Memory, each Thread has Register Memory]
Map the Abstraction Model to your Desired Acceleration Back-End
[figure: the Grid/Block/Thread/Element levels and Global/Shared/Register Memory are mapped onto CPU hardware: RAM, packages, L3 cache, cores, L1/L2 caches, registers, and AVX units]
Explicit mapping of parallelization levels to hardware.
Levels of the model that a specific back-end does not support can be ignored; the abstract interface allows extending the set of mappings.
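To make the mapping concrete, here is a minimal plain-C++ sketch of how a purely serial CPU back-end could run the four-level hierarchy as nested loops. This illustrates the idea only, not Alpaka's actual implementation; the function and parameter names are made up.

    #include <cstddef>
    #include <functional>

    // Execute a kernel body over the Grid > Block > Thread > Element hierarchy
    // sequentially, as a serial CPU back-end might do it.
    void runSerial(
        std::size_t blocksPerGrid,
        std::size_t threadsPerBlock,
        std::size_t elemsPerThread,
        std::function<void(std::size_t /*globalThreadIdx*/, std::size_t /*elem*/)> kernelBody)
    {
        for(std::size_t b = 0; b < blocksPerGrid; ++b)          // Grid level: blocks
            for(std::size_t t = 0; t < threadsPerBlock; ++t)    // Block level: threads
                for(std::size_t e = 0; e < elemsPerThread; ++e) // Thread level: elements (sequential)
                    kernelBody(b * threadsPerBlock + t, e);
    }

A GPU back-end would instead hand blocks and threads to the hardware scheduler, typically with one element per thread, while a CPU back-end can map the inner element loop onto SIMD lanes.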
Alpaka: Vector Addition Kernel

struct VectorAdd
{
    template<typename TAcc, typename TElem, typename TSize>
    ALPAKA_FN_ACC auto operator()(
        TAcc const & acc,
        TSize const & numElements,
        TElem const * const X,
        TElem * const Y) const
    -> void
    {
        // Global thread index and number of elements each thread processes.
        auto const globalIdx =
            alpaka::idx::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0u];
        auto const elemsPerThread =
            alpaka::workdiv::getWorkDiv<alpaka::Thread, alpaka::Elems>(acc)[0u];

        // Each thread handles its own contiguous chunk of the vectors.
        auto const begin = globalIdx * elemsPerThread;
        auto const end = min(begin + elemsPerThread, numElements);

        for(TSize i = begin; i < end; ++i)
        {
            Y[i] = X[i] + Y[i];
        }
    }
};
Alpaka: Initialization

// Configure Alpaka
using Dim = alpaka::dim::DimInt<3u>;
using Size = std::size_t;
using Acc = alpaka::acc::AccCpuSerial<Dim, Size>;
using Host = alpaka::acc::AccCpuSerial<Dim, Size>;
using Stream = alpaka::stream::StreamCpuSync;
using WorkDiv = alpaka::workdiv::WorkDivMembers<Dim, Size>;
using Elem = float;

// Retrieve devices and stream
auto devHost( alpaka::dev::DevMan<Host>::getDevByIdx(0) );
auto devAcc( alpaka::dev::DevMan<Acc>::getDevByIdx(0) );
Stream stream( devAcc );

// Specify work division
auto elementsPerThread( alpaka::Vec<Dim, Size>::ones() );
auto threadsPerBlock( alpaka::Vec<Dim, Size>::all(2u) );
auto blocksPerGrid( alpaka::Vec<Dim, Size>(4u, 8u, 16u) );
WorkDiv workDiv( alpaka::workdiv::WorkDivMembers<Dim, Size>(blocksPerGrid, threadsPerBlock, elementsPerThread) );
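With this work division the grid holds 4 · 8 · 16 = 512 blocks of 2 · 2 · 2 = 8 threads each, 4096 threads in total, and since elementsPerThread is all ones, every thread processes exactly one element, covering an index domain of 8 × 16 × 32 elements.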
Alpaka: Call the Kernel

// Memory allocation and host to device memory copy
auto X_h = alpaka::mem::buf::alloc<Elem, Size>(devHost, extent);
auto Y_h = alpaka::mem::buf::alloc<Elem, Size>(devHost, extent);
auto X_d = alpaka::mem::buf::alloc<Elem, Size>(devAcc, extent);
auto Y_d = alpaka::mem::buf::alloc<Elem, Size>(devAcc, extent);

alpaka::mem::view::copy(stream, X_d, X_h, extent);
alpaka::mem::view::copy(stream, Y_d, Y_h, extent);

// Kernel creation and execution
VectorAdd kernel;
auto const exec( alpaka::exec::create<Acc>(
    workDiv,
    kernel,
    numElements,
    alpaka::mem::view::getPtrNative(X_d),
    alpaka::mem::view::getPtrNative(Y_d)) );
alpaka::stream::enqueue(stream, exec);

// Copy memory back to host
alpaka::mem::view::copy(stream, Y_h, Y_d, extent);
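Continuing the listing above, one would typically synchronize and then read the result on the host; a short sketch, assuming alpaka::wait::wait is the blocking synchronization call (the other calls already appear in the listing):

    // Block until the kernel and the copy back to the host have completed.
    // (The synchronous CPU stream used above returns immediately; this matters
    // for asynchronous streams.)
    alpaka::wait::wait(stream);

    // The result can now be read on the host through the native pointer of Y_h.
    Elem const * const result = alpaka::mem::view::getPtrNative(Y_h);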
Cupla to the rescue: very fast porting!
Cupla provides a CUDA-like interface on top of Alpaka and is fast enough for a live hack session!
Live Cupla Hack Session
Live port of a CUDA application to Cupla.
Starting point: the matrix multiplication example from the CUDA samples.
Aim: a single source code executed on GPU and CPU hardware.
Algorithm: tiled matrix-matrix multiplication.
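As a rough sketch of what such a port can look like (a hedged example, assuming cupla's CUDA-compatibility header cuda_to_cupla.hpp and its CUPLA_KERNEL launch macro; the kernel and variable names are illustrative, this is not the code from the live session): the CUDA __global__ function becomes a functor that receives the accelerator as its first argument, while CUDA-style indexing and the launch configuration keep their familiar shape.

    #include <cuda_to_cupla.hpp> // cupla's CUDA-compatibility header

    // CUDA: __global__ void vectorAdd(int n, float const * x, float * y)
    // Cupla: the kernel becomes a functor templated on the accelerator.
    struct VectorAddKernel
    {
        template<typename T_Acc>
        ALPAKA_FN_ACC void operator()(
            T_Acc const & acc, int n, float const * x, float * y) const
        {
            int const i = blockIdx.x * blockDim.x + threadIdx.x; // CUDA-style indexing
            if(i < n)
                y[i] = x[i] + y[i];
        }
    };

    void launch(int n, float const * x_d, float * y_d)
    {
        dim3 const blockSize(256);
        dim3 const gridSize((n + 255) / 256);
        // CUDA: vectorAdd<<<gridSize, blockSize>>>(n, x_d, y_d);
        CUPLA_KERNEL(VectorAddKernel)(gridSize, blockSize, 0, 0)(n, x_d, y_d);
    }

The same source can then be compiled for the CUDA back-end or for CPU back-ends by selecting the corresponding cupla/Alpaka accelerator at build time.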
Single Source Alpaka DGEMM Kernel on Various Architectures
DGEMM: C = αAB + βC
[chart: achieved DGEMM performance per architecture of roughly 480, 560, 540, 150, and 1450 GFLOPS, compared against the theoretical peak performance of each architecture]
What happened so far...
PIConGPU Runtime on Various Architectures
PIConGPU Efficiency on Various Architectures
Clone us from GitHub: https://github.com/ComputationalRadiationPhysics

git clone https://github.com/ComputationalRadiationPhysics/picongpu
git clone https://github.com/ComputationalRadiationPhysics/alpaka
git clone https://github.com/ComputationalRadiationPhysics/cupla

Alpaka paper pre-print: http://arxiv.org/abs/1602.08477