Porting the Plasma Simulation PIConGPU to Heterogeneous Architectures with Alpaka
1 Porting the Plasma Simulation PIConGPU to Heterogeneous Architectures with Alpaka. René Widera [1], Erik Zenker [1,2], Guido Juckeland [1], Benjamin Worpitz [1,2], Axel Huebl [1,2], Andreas Knüpfer [2], Wolfgang E. Nagel [2], Michael Bussmann [1]. [1] Helmholtz-Zentrum Dresden-Rossendorf, [2] Technische Universität Dresden
2 PIConGPU: Electron Acceleration with Lasers, Ion Acceleration with Lasers, Plasma Instabilities, Compact X-Ray Sources, Tumor Therapy, Astrophysics
3-4 Domain Decomposition: Field and Particle Domain. Moving particles create fields, particles change cells, and fields act back on particles.
5-9 Creating Vectorized Data Structures for Particles and Fields. Field domain: chunked in supercells, line-wise aligned. Particle domain: fixed-size frames, struct of aligned arrays.
10-13 Algorithm Driven Cache Strategy: the cells of a supercell are staged from global memory into shared memory before the particle update.
14 High Utilization of Threads: each thread of a thread block works on one cell of the supercell, operating on the shared-memory copy.
15 Task-Parallel Execution of Kernels, Asynchronous Communication
16-19 PIConGPU Scales up to 16,384 GPUs. Strong scaling: speedup close to ideal. Weak scaling: efficiency >95%. 6.9 PFlop/s (SP). [Plots: strong-scaling speedup and weak-scaling efficiency [%] over the number of GPUs, ideal vs. PIConGPU.]
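The two plotted quantities are the standard definitions (notation added here for clarity):

```latex
S(N) = \frac{T_{\text{strong}}(1)}{T_{\text{strong}}(N)}
\qquad \text{(speedup, total problem size fixed; ideal: } S(N) = N\text{)}

E(N) = \frac{T_{\text{weak}}(1)}{T_{\text{weak}}(N)} \cdot 100\,\%
\qquad \text{(efficiency, problem size per GPU fixed; ideal: } E(N) = 100\,\%\text{)}
```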
20-24 More Physics, More Computations, More Power! Atomic-physics effects enter as a transition matrix (t_{1,1} ... t_{n,n}) that maps the old atom state (s_{1,1} ... s_{n,m}) to the new atom state. Really Big Data Task: random access on big amounts of data (>100 GB), a good job for powerful CPUs, requires efficient CPU/GPU cooperation.
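The state update sketched on these slides is a matrix-vector product (notation reconstructed from the slide fragments): the transition matrix built from the atomic-physics effects maps the old state populations to the new ones.

```latex
s'_{i} = \sum_{j} t_{i,j}\, s_{j}
\qquad \Longleftrightarrow \qquad
s' = T\, s
```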
25 Small Open Source Communities need Maintainable Codes.
Heterogeneity: porting implies minimal code changes.
Testability: validate once, get correct results everywhere.
Sustainability: write once, execute everywhere.
Optimizability: tune for good performance at minimum coding effort.
Openness: open source and open standards.
Single Source.
26 Alpaka
27 Good News: there are Alpakas on the Compute Meadow. A single zero-overhead interface to existing parallelism models (threads, fibers, and whatever parallelism model comes next). Single-source C++11 kernels. Data-agnostic memory model.
28-31 Abstract Hierarchical Redundant Parallelism Model. Four nested levels, Grid > Block > Thread > Element, each distinguishing parallel and sequential execution and synchronization. The element level is an explicit sequential layer.
32-34 Data Structure Agnostic Memory Model. Explicit deep copy between host memory and device global memory. The grid sees global memory, each block owns shared memory, and each thread owns register memory.
35-39 Map the Abstraction Model to your Desired Acceleration Back-End. Example CPU mapping: grid → whole CPU with RAM as global memory, block → package with L3 cache, thread → core with L1/L2 caches as shared memory, element → AVX lanes with registers as register memory. Explicit mapping of parallelization levels to hardware.
40 Map the Abstraction Model to your Desired Acceleration Back-End. Specific unsupported levels of the model can be ignored; the abstract interface allows extending the set of mappings.
41-43 Alpaka: Vector Addition Kernel

struct VectorAdd
{
    template<typename TAcc, typename TElem, typename TSize>
    ALPAKA_FN_ACC auto operator()(
        TAcc const & acc,
        TSize const & numElements,
        TElem const * const X,
        TElem * const Y) const
    -> void
    {
        // Global thread index and number of elements per thread
        auto globalIdx = alpaka::idx::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0u];
        auto elemsPerThread = alpaka::workdiv::getWorkDiv<alpaka::Thread, alpaka::Elems>(acc)[0u];

        // Each thread processes its own contiguous chunk of elements
        auto begin = globalIdx * elemsPerThread;
        auto end = min(begin + elemsPerThread, numElements);
        for(TSize i = begin; i < end; ++i)
        {
            Y[i] = X[i] + Y[i];
        }
    }
};
44-46 Alpaka: Initialization

// Configure Alpaka
using Dim = alpaka::dim::DimInt<3u>;
using Size = std::size_t;
using Acc = alpaka::acc::AccCpuSerial<Dim, Size>;
using Host = alpaka::acc::AccCpuSerial<Dim, Size>;
using Stream = alpaka::stream::StreamCpuSync;
using WorkDiv = alpaka::workdiv::WorkDivMembers<Dim, Size>;
using Elem = float;

// Retrieve devices and stream
DevHost devHost( alpaka::dev::DevMan<Host>::getDevByIdx(0) );
DevAcc devAcc( alpaka::dev::DevMan<Acc>::getDevByIdx(0) );
Stream stream( devAcc );

// Specify work division
auto elementsPerThread( alpaka::Vec<Dim, Size>::ones() );
auto threadsPerBlock( alpaka::Vec<Dim, Size>::all(2u) );
auto blocksPerGrid( alpaka::Vec<Dim, Size>(4u, 8u, 16u) );
WorkDiv workDiv( alpaka::workdiv::WorkDivMembers<Dim, Size>(blocksPerGrid, threadsPerBlock, elementsPerThread) );
47-49 Alpaka: Call the Kernel

// Memory allocation and host-to-device memory copy
auto X_h = alpaka::mem::buf::alloc<Elem, Size>(devHost, extent);
auto Y_h = alpaka::mem::buf::alloc<Elem, Size>(devHost, extent);
auto X_d = alpaka::mem::buf::alloc<Elem, Size>(devAcc, extent);
auto Y_d = alpaka::mem::buf::alloc<Elem, Size>(devAcc, extent);
alpaka::mem::view::copy(stream, X_d, X_h, extent);
alpaka::mem::view::copy(stream, Y_d, Y_h, extent);

// Kernel creation and execution
VectorAdd kernel;
auto const exec( alpaka::exec::create<Acc>(
    workDiv,
    kernel,
    numElements,
    alpaka::mem::view::getPtrNative(X_d),
    alpaka::mem::view::getPtrNative(Y_d) ) );
alpaka::stream::enqueue(stream, exec);

// Copy memory back to host
alpaka::mem::view::copy(stream, Y_h, Y_d, extent);
51-52 Cupla to the rescue: very fast porting, with Alpaka underneath! Fast enough for a live hack session!
53 Live Cupla Hack Session. Live port of a CUDA application to Cupla. Starting point: the matrix multiplication algorithm from the CUDA samples. Aim: a single source code executed on GPU and CPU hardware. Uses the tiling matrix-matrix multiplication algorithm.
54 Single Source Alpaka DGEMM Kernel on Various Architectures. DGEMM: C = αAB + βC. [Bar chart: measured performance of 480, 560, 540, 150, and 1450 GFLOPS against theoretical peak performance on five architectures.]
55-59 What happened so far...
60 PIConGPU Runtime on Various Architectures
61 PIConGPU Efficiency on Various Architectures
62 Clone us from GitHub
git clone
git clone
git clone
Alpaka paper pre-print:
Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center
GPU Accelerated Monte Carlo Simulations and Time Series Analysis
GPU Accelerated Monte Carlo Simulations and Time Series Analysis Institute of Physics, Johannes Gutenberg-University of Mainz Center for Polymer Studies, Department of Physics, Boston University Artemis
Speeding Up RSA Encryption Using GPU Parallelization
2014 Fifth International Conference on Intelligent Systems, Modelling and Simulation Speeding Up RSA Encryption Using GPU Parallelization Chu-Hsing Lin, Jung-Chun Liu, and Cheng-Chieh Li Department of
GPU Computing with CUDA Lecture 3 - Efficient Shared Memory Use. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 3 - Efficient Shared Memory Use Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 2 Shared memory in detail
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
FPGA-based MapReduce Framework for Machine Learning
FPGA-based MapReduce Framework for Machine Learning Bo WANG 1, Yi SHAN 1, Jing YAN 2, Yu WANG 1, Ningyi XU 2, Huangzhong YANG 1 1 Department of Electronic Engineering Tsinghua University, Beijing, China
Introduction to Cluster Computing
Introduction to Cluster Computing Brian Vinter [email protected] Overview Introduction Goal/Idea Phases Mandatory Assignments Tools Timeline/Exam General info Introduction Supercomputers are expensive Workstations
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
CUDA Debugging. GPGPU Workshop, August 2012. Sandra Wienke Center for Computing and Communication, RWTH Aachen University
CUDA Debugging GPGPU Workshop, August 2012 Sandra Wienke Center for Computing and Communication, RWTH Aachen University Nikolay Piskun, Chris Gottbrath Rogue Wave Software Rechen- und Kommunikationszentrum
High Performance Computing. Course Notes 2007-2008. HPC Fundamentals
High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs
GPU ACCELERATED DATABASES Database Driven OpenCL Programming. Tim Child 3DMashUp CEO
GPU ACCELERATED DATABASES Database Driven OpenCL Programming Tim Child 3DMashUp CEO SPEAKERS BIO Tim Child 35 years experience of software development Formerly VP Engineering, Oracle Corporation VP Engineering,
Accelerating sequential computer vision algorithms using OpenMP and OpenCL on commodity parallel hardware
Accelerating sequential computer vision algorithms using OpenMP and OpenCL on commodity parallel hardware 25 August 2014 Copyright 2001 2014 by NHL Hogeschool and Van de Loosdrecht Machine Vision BV All
PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)
PARALLEL JAVASCRIPT Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology) JAVASCRIPT Not connected with Java Scheme and self (dressed in c clothing) Lots of design errors (like automatic semicolon
Parallel Computing with MATLAB
Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best
Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration. Sina Meraji [email protected]
Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration Sina Meraji [email protected] Please Note IBM s statements regarding its plans, directions, and intent are subject to
Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as
