Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures

Transcription

1 Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures
E. Calore, S. F. Schifano, R. Tripiccione
Enrico Calore, INFN Ferrara, Italy
Perspectives of GPU Computing in Physics and Astrophysics, September 17, 2014, Rome, Italy
E. Calore (INFN of Ferrara), Portable LBM for Heterogeneous HPC, GPU Comp., Sep. 17, 2014

2 Outline
1. Introduction: hardware trends, software tools
2. The Lattice Boltzmann Method at a glance: the D2Q37 model, propagate, boundary conditions, collide
3. Implementation details
4. Results and conclusion

4 GPU and MIC performance is growing
[Figure courtesy of Dr. Karl Rupp, Technische Universität Wien]

5 Accelerator use in HPC is growing
[Figure: accelerator architectures in the Top500 supercomputers]

7 OpenCL (Open Computing Language):
- The same code can run on CPUs, GPUs, MICs, etc.
- Functions to be offloaded to the accelerator have to be explicitly programmed (as in CUDA)
- Data movements between host and accelerator have to be explicitly programmed (as in CUDA)
- NVIDIA no longer actively supports it (as of 2014)
OpenACC (for Open Accelerators):
- The same code will probably run on CPUs, GPUs, MICs, etc.
- Functions to be offloaded are annotated with #pragma directives
- Data movements between host and accelerator can be managed automatically or manually
- Support is still limited, but seems to be growing quickly
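
As an illustration of the OpenACC points above, here is a minimal sketch (not taken from the talk) of a loop offloaded by annotation, with the host/accelerator data movement managed manually through a data region. The function name scale_on_device and its parameters are hypothetical. A compiler without OpenACC support ignores the pragmas and runs the loop on the host.

```c
#include <stddef.h>

/* Hypothetical helper (not from the talk): scales n doubles in place. */
void scale_on_device(double *restrict v, size_t n, double factor)
{
    /* Manual data management: copy v to the accelerator on entry to the
       region and back to the host on exit. */
    #pragma acc data copy(v[0:n])
    {
        /* The loop is offloaded by annotation only; the body is plain C. */
        #pragma acc parallel loop present(v[0:n])
        for (size_t i = 0; i < n; i++) {
            v[i] *= factor;
        }
    }
}
```

Dropping the data region would leave the transfers to the compiler, which corresponds to the automatic option mentioned on the slide.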

9 The D2Q37 Lattice Boltzmann Model
The Lattice Boltzmann method (LBM) is a class of computational fluid dynamics (CFD) methods:
- it simulates a synthetic dynamics described by the discrete Boltzmann equation, instead of the Navier-Stokes equations
- a set of virtual particles, called populations, is arranged at the sites of a discrete and regular grid
- the populations interact by propagation and collision and, after appropriate averaging, reproduce the dynamics of fluids
D2Q37 is a two-dimensional (D2) model with 37 velocity components (populations):
- suitable to study the behaviour of compressible gases and fluids, optionally in the presence of combustion effects (chemical reactions turning a cold mixture of reactants into a hot mixture of burnt products)
- correct treatment of the Navier-Stokes, heat transport and perfect-gas ($P = \rho T$) equations
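
The "appropriate averaging" above can be sketched as the usual lattice Boltzmann moments: density as the sum of the populations at a site, velocity as the momentum divided by the density. The sketch below is an assumption-laden illustration, not the authors' code: the velocity components cx and cy are passed in by the caller because the D2Q37 velocity set is not listed in the slides, and the type moments_t and function site_moments are hypothetical names. The full model may add further terms (for example for forcing or temperature).

```c
#include <stddef.h>

#define NPOP 37  /* number of populations in D2Q37 */

/* Hypothetical container for the local macroscopic quantities. */
typedef struct {
    double rho;  /* density: sum of the populations */
    double ux;   /* x velocity: x momentum divided by density */
    double uy;   /* y velocity: y momentum divided by density */
} moments_t;

/* Computes density and velocity at one site from its NPOP populations.
   cx[i], cy[i] are the lattice velocity components of population i
   (supplied by the caller; the real D2Q37 set is not given here). */
moments_t site_moments(const double f[NPOP], const int cx[NPOP], const int cy[NPOP])
{
    moments_t m = {0.0, 0.0, 0.0};
    for (size_t i = 0; i < NPOP; i++) {
        m.rho += f[i];
        m.ux  += f[i] * cx[i];
        m.uy  += f[i] * cy[i];
    }
    m.ux /= m.rho;
    m.uy /= m.rho;
    return m;
}
```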

10 Computational Scheme of LBM
    foreach time step
        foreach lattice point
            propagate();
        endfor
        foreach lattice point
            collide();
        endfor
    endfor
Embarrassing parallelism: all sites can be processed in parallel, applying propagate and collide in sequence.
Challenge: design an efficient implementation able to exploit a large fraction of the available peak performance.
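
The scheme above can be sketched in plain C on a deliberately tiny toy model. This is an illustration under stated assumptions, not the D2Q37 code: the lattice is 1D and periodic, there are only two populations (one moving right, one moving left), and the collide step is a simple relaxation towards the local mean. All names and sizes (NSITES, NPOPS, run_toy_lbm, total_mass, omega) are hypothetical.

```c
#include <stddef.h>

enum { NSITES = 8, NPOPS = 2 };  /* toy sizes, not the D2Q37 values */

/* Propagate: each site gathers the populations moving towards it
   (population 0 moves right, population 1 moves left; periodic lattice). */
static void propagate(const double *prv, double *nxt)
{
    for (size_t i = 0; i < NSITES; i++) {
        nxt[0 * NSITES + i] = prv[0 * NSITES + (i + NSITES - 1) % NSITES];
        nxt[1 * NSITES + i] = prv[1 * NSITES + (i + 1) % NSITES];
    }
}

/* Toy collide: relax both populations of a site towards their mean.
   Purely local, and it conserves the density of the site. */
static void collide(double *f, double omega)
{
    for (size_t i = 0; i < NSITES; i++) {
        double mean = 0.5 * (f[i] + f[NSITES + i]);
        f[i]          += omega * (mean - f[i]);
        f[NSITES + i] += omega * (mean - f[NSITES + i]);
    }
}

/* Time loop: propagate from one buffer into the other, collide in place,
   then swap the buffers. Returns the buffer holding the latest state. */
double *run_toy_lbm(double *a, double *b, int nsteps, double omega)
{
    double *prv = a, *nxt = b;
    for (int t = 0; t < nsteps; t++) {
        propagate(prv, nxt);
        collide(nxt, omega);
        double *tmp = prv;
        prv = nxt;
        nxt = tmp;
    }
    return prv;
}

/* Sum of all populations on the lattice, useful as a conservation check. */
double total_mass(const double *f)
{
    double m = 0.0;
    for (size_t i = 0; i < (size_t)(NPOPS * NSITES); i++) {
        m += f[i];
    }
    return m;
}
```

Both inner loops have independent iterations, which is the property the slide refers to as embarrassing parallelism.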

12 D2Q37: propagation scheme
- performs accesses to neighbour cells at distance 1, 2 and 3
- generates memory accesses with sparse addressing patterns

14 D2Q37: boundary conditions
After propagation, boundary conditions are enforced at the top and bottom edges of the lattice.
- 2D lattice with periodic boundaries along the X direction
- at the top and the bottom, boundary conditions are enforced to adjust some values at sites $y = 0 \ldots 2$ and $y = N_y - 3 \ldots N_y - 1$, e.g. to set the vertical velocity to zero
- at the left and right edges we apply periodic boundary conditions
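
One common way to realize periodic boundaries along X is to surround the lattice with halo columns and refresh them before each propagate. The sketch below assumes that approach; it is not confirmed as the authors' implementation. It also assumes a halo width of 3 (the maximum propagation distance), a layout in which each population is stored contiguously with Y as the fastest index, and hypothetical names (HALO, fill_periodic_halos, lx for the number of physical columns).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { HALO = 3 };  /* halo width: populations travel at most 3 sites */

/* Refreshes the left and right halo columns of every population.
   f holds npop arrays of (lx + 2*HALO) columns, each column of ny sites. */
void fill_periodic_halos(double *f, size_t npop, size_t lx, size_t ny)
{
    assert(lx >= HALO);  /* source and destination columns must not overlap */

    size_t nx = lx + 2 * HALO;                   /* columns including halos */
    size_t bytes = HALO * ny * sizeof(double);   /* three contiguous columns */

    for (size_t p = 0; p < npop; p++) {
        double *pop = f + p * nx * ny;
        /* Left halo (columns 0..2) <- last three physical columns. */
        memcpy(pop, pop + lx * ny, bytes);
        /* Right halo <- first three physical columns (columns 3..5). */
        memcpy(pop + (HALO + lx) * ny, pop + HALO * ny, bytes);
    }
}
```

With the halos filled, the propagate step can read neighbours at distance up to 3 near the left and right edges without any special-case indexing.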

16 D2Q37 collision
- collision is computed at each lattice cell, after the computation of boundary conditions
- computationally intensive: for the D2Q37 model it requires 7500 double-precision (DP) floating-point operations per lattice site
- completely local: arithmetic operations require only the populations associated with the site
- the propagate and collide kernels are kept separate: after propagate but before collide we may need to perform collective operations (e.g. the divergence of the velocity field) if we include the computation of combustion effects
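
To make "computationally intensive" concrete, the arithmetic intensity of collide can be estimated with a small helper. This is a back-of-the-envelope sketch: it assumes that each population of a site is read once and written once from memory, and the function name collide_flops_per_byte is hypothetical.

```c
#include <stddef.h>

/* Estimated floating-point operations per byte of memory traffic,
   assuming one read and one write of each population per site. */
double collide_flops_per_byte(size_t npop, double flops_per_site)
{
    double bytes_per_site = 2.0 * (double)npop * sizeof(double);
    return flops_per_site / bytes_per_site;
}
```

For 37 populations stored as 8-byte doubles the traffic is 592 bytes per site, so 7500 operations give roughly 12.7 operations per byte under these assumptions.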

18 Grid and Memory Layout
- Uni-dimensional array of NTHREADS, each thread processing one lattice site.
- $L_y = \alpha \times N_{wi}$, $\alpha \in \mathbb{N}$; $(L_y \times L_x) / N_{wi} = N_{wg}$
- Data stored as a Structure of Arrays (SoA)
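
The layout above can be sketched with two small helpers. The SoA index follows the addressing pattern used by the propagate code in this talk (one contiguous array per population, Y as the fastest index); the function names soa_index and num_work_groups, and the convention of returning 0 for an invalid work-group size, are assumptions made for illustration.

```c
#include <stddef.h>

/* SoA index of population p at site (ix, iy) on an nx-by-ny lattice:
   all sites of population 0 first, then all sites of population 1, ... */
size_t soa_index(size_t p, size_t ix, size_t iy, size_t nx, size_t ny)
{
    return p * nx * ny + ix * ny + iy;
}

/* Number of work-groups N_wg = (L_y * L_x) / N_wi.
   Returns 0 when L_y is not a multiple of N_wi (the alpha condition). */
size_t num_work_groups(size_t lx, size_t ly, size_t nwi)
{
    if (nwi == 0 || ly % nwi != 0) {
        return 0;
    }
    return (ly * lx) / nwi;
}
```

With this layout, consecutive work-items (consecutive iy) touch consecutive memory addresses of the same population, which is the reason for preferring SoA on wide-vector devices.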

20 OpenCL Propagate device function
    __kernel void prop(__global const data_t *prv, __global data_t *nxt)
    {
        int ix,      // Work-item index along the X dimension.
            iy,      // Work-item index along the Y dimension.
            site_i;  // Index of current site.

        // Sets the work-item indices (Y is used as the fastest dimension).
        ix = (int) get_global_id(1);
        iy = (int) get_global_id(0);

        site_i = (HX + 3 + ix) * NY + (HY + iy);

        nxt[          site_i] = prv[          site_i - 3*NY + 1];
        nxt[  NX*NY + site_i] = prv[  NX*NY + site_i - 3*NY    ];
        nxt[2*NX*NY + site_i] = prv[2*NX*NY + site_i - 3*NY - 1];
        nxt[3*NX*NY + site_i] = prv[3*NX*NY + site_i - 2*NY + 2];
        nxt[4*NX*NY + site_i] = prv[4*NX*NY + site_i - 2*NY + 1];
        nxt[5*NX*NY + site_i] = prv[5*NX*NY + site_i - 2*NY    ];
        nxt[6*NX*NY + site_i] = prv[6*NX*NY + site_i - 2*NY - 1];
        ...
    }

21 OpenACC Propagate function
    inline void propagate(const data_t * restrict const prv, data_t * restrict const nxt)
    {
        int ix, iy, site_i;

        #pragma acc kernels present(prv) present(nxt)
        #pragma acc loop independent gang
        for (ix = HX; ix < (HX + SIZEX); ix++) {
            #pragma acc loop independent vector(BLKSIZE)
            for (iy = HY; iy < (HY + SIZEY); iy++) {
                ...
                site_i = (ix * NY) + iy;

                nxt[          site_i] = prv[          site_i - 3*NY + 1];
                nxt[  NX*NY + site_i] = prv[  NX*NY + site_i - 3*NY    ];
                nxt[2*NX*NY + site_i] = prv[2*NX*NY + site_i - 3*NY - 1];
                nxt[3*NX*NY + site_i] = prv[3*NX*NY + site_i - 2*NY + 2];
                nxt[4*NX*NY + site_i] = prv[4*NX*NY + site_i - 2*NY + 1];
                nxt[5*NX*NY + site_i] = prv[5*NX*NY + site_i - 2*NY    ];
                nxt[6*NX*NY + site_i] = prv[6*NX*NY + site_i - 2*NY - 1];
                ...
            }
        }
    }

23 Hardware used: Eurora prototype
Eurora (Eurotech and Cineca):
- Hot-water cooling system
- Delivers 3,209 MFLOPS per Watt of sustained performance
- 1st in the Green500 of June 2013
- Computing nodes: 64
- Processor types: Intel Xeon 2.10GHz, Intel Xeon 3.10GHz
- Accelerator types: MIC (Intel Xeon-Phi 5120D), GPU (NVIDIA Tesla K20x)

24 OpenCL WG size selection for Propagate (Xeon-Phi)
[Figure: performance of propagate as a function of the number of work-items $N_{wi}$ per work-group and the number of work-groups $N_{wg}$]

25 OpenCL WG size selection for Collide (Xeon-Phi)
[Figure: performance of collide as a function of the number of work-items $N_{wi}$ per work-group and the number of work-groups $N_{wg}$]

26 2 x NVIDIA K20s GPU
[Figure: run time on 2 x GPU (NVIDIA K20s), in msec per iteration, for Propagate, BC and Collide; implementations: CUDA, OpenCL, OpenACC]

27 2 x Intel Xeon Phi MIC
[Figure: run time on 2 x MIC (Intel Xeon Phi), in msec per iteration, for Propagate, BC and Collide; implementations: C, OpenCL]

28 Propagate
[Figure: run time (Propagate, x2048 lattice), in msec per iteration; implementations: C, C Opt., CUDA, OpenCL, OpenACC; devices: MIC, GPU, CPU2, CPU3]

29 Collide
[Figure: run time (Collide, x2048 lattice), in msec per iteration; implementations: C, C Opt., CUDA, OpenCL, OpenACC; devices: MIC, GPU, CPU2, CPU3]

30 Scalability on Eurora Nodes (OpenCL code)
Weak regime lattice size: No_devices.
Strong regime lattice size:
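
For reference, the two regimes are usually summarized by a parallel efficiency. The sketch below uses the standard textbook definitions, not formulas quoted from the talk: in the strong regime the total lattice is fixed and the ideal run time on n devices is t1 / n, while in the weak regime the lattice grows with n and the ideal run time stays at t1. The function names are hypothetical.

```c
/* Strong scaling: fixed total lattice size.
   t1 = run time on one device, tn = run time on n devices. */
double strong_scaling_efficiency(double t1, double tn, int n)
{
    return t1 / ((double)n * tn);
}

/* Weak scaling: lattice size proportional to the number of devices.
   t1 = run time on one device, tn = run time on n devices. */
double weak_scaling_efficiency(double t1, double tn)
{
    return t1 / tn;
}
```

An efficiency of 1.0 corresponds to ideal scaling in either regime.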

31 Simulation of the Rayleigh-Taylor (RT) Instability
An instability at the interface of two fluids of different densities, triggered by gravity. A cold, dense fluid lying over a warmer, less dense fluid triggers an instability that mixes the two fluid regions (until equilibrium is reached).

32 Conclusions
1. We have presented an OpenCL and an OpenACC implementation of a fluid-dynamics simulation based on Lattice Boltzmann methods.
2. Code portability: both have been successfully ported and run on several computing architectures, including CPU, GPU and MIC systems.
3. Performance portability: results are at a level similar to that of codes written using more native programming frameworks, such as CUDA or C.
4. OpenCL is easily portable across several architectures while preserving performance, but not all vendors are today committed to supporting this standard.
5. OpenACC is easy to use with little coding effort, but compilers are not yet available for all architectures.

33 Acknowledgments
- Luca Biferale, Mauro Sbragaglia, Patrizio Ripesi: University of Tor Vergata and INFN Roma, Italy
- Andrea Scagliarini: University of Barcelona, Spain
- Filippo Mantovani: BSC institute, Spain
- Enrico Calore, Sebastiano Fabio Schifano, Raffaele Tripiccione: University and INFN of Ferrara, Italy
- Federico Toschi: Eindhoven University of Technology, The Netherlands, and CNR-IAC, Roma, Italy
This work has been performed in the framework of the INFN COKA and SUMA projects. We would like to thank the CINECA (Italy) and JSC (Germany) institutes for access to their systems.

34 Thanks for your attention