Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures




Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures
E. Calore, S. F. Schifano, R. Tripiccione
Enrico Calore, INFN Ferrara, Italy
Perspectives of GPU Computing in Physics and Astrophysics, September 17, 2014, Rome, Italy
E. Calore (INFN of Ferrara) Portable LBM for Heterogeneous HPC GPU Comp., Sep. 17, 2014

Outline
1 Introduction: Hardware trends, Software tools
2 The Lattice Boltzmann Method at a glance: The D2Q37 model, Propagate, Boundary Conditions, Collide
3 Implementation details
4 Results and Conclusion

GPU and MIC performance is growing
Courtesy of Dr. Karl Rupp, Technische Universität Wien

Accelerator use in HPC is growing
Accelerator architectures in the Top500 supercomputers

OpenCL (Open Computing Language):
- The same code can be run on CPUs, GPUs, MICs, etc.
- Functions to be offloaded to the accelerator have to be explicitly programmed (as in CUDA)
- Data movements between host and accelerator have to be explicitly programmed (as in CUDA)
- NVIDIA no longer actively supports it

OpenACC (for Open Accelerators):
- The same code will (probably) run on CPUs, GPUs, MICs, etc.
- Functions to be offloaded are annotated with #pragma directives
- Data movements between host and accelerator can be managed automatically or manually
- Support is still limited, but seems to be growing quickly

The D2Q37 Lattice Boltzmann Model
The Lattice Boltzmann method (LBM) is a class of computational fluid dynamics (CFD) methods:
- it simulates a synthetic dynamics described by the discrete Boltzmann equation, instead of the Navier-Stokes equations
- a set of virtual particles called populations, arranged at the edges of a discrete and regular grid, interact by propagation and collision
- after appropriate averaging, it reproduces the dynamics of fluids

D2Q37 is a 2D model with 37 velocity components (populations):
- suitable to study the behaviour of compressible gases and fluids, optionally in the presence of combustion effects (1)
- correct treatment of the Navier-Stokes, heat transport and perfect-gas (P = ρT) equations
(1) chemical reactions turning a cold mixture of reactants into a hot mixture of burnt products.

Computational Scheme of LBM

foreach time step
  foreach lattice point
    propagate();
  endfor
  foreach lattice point
    collide();
  endfor
endfor

Embarrassing parallelism: all sites can be processed in parallel, applying propagate and collide in sequence.
Challenge: design an efficient implementation able to exploit a large fraction of the available peak performance.

D2Q37: propagation scheme
- performs accesses to neighbour cells at distance 1, 2 and 3
- generates memory accesses with sparse addressing patterns

D2Q37: boundary conditions
After propagation, boundary conditions are enforced at the top and bottom edges of the lattice.
- 2D lattice with periodic boundaries along the X direction
- at the top and bottom, boundary conditions are enforced to adjust some values at sites y = 0 ... 2 and y = Ny-3 ... Ny-1, e.g. setting the vertical velocity to zero
- at the left and right edges we apply periodic boundary conditions.

D2Q37 collision
- collision is computed at each lattice cell after the boundary conditions have been applied
- computationally intensive: for the D2Q37 model it requires about 7500 double-precision floating-point operations
- completely local: the arithmetic operations require only the populations associated with the site
- the propagate and collide kernels are kept separate: after propagate but before collide we may need to perform collective operations (e.g. the divergence of the velocity field) if we include combustion effects.

Grid and Memory Layout
One-dimensional array of NTHREADS threads, each thread processing one lattice site.
L_y = α · N_wi, with α ∈ ℕ; (L_y · L_x) / N_wi = N_wg
Data stored as Structure-of-Arrays (SoA).

OpenCL Propagate device function

kernel void prop ( global const data_t *prv, global data_t *nxt )
{
    int ix;      // Work-item index along the X dimension.
    int iy;      // Work-item index along the Y dimension.
    int site_i;  // Index of current site.

    // Set the work-item indices (Y is used as the fastest dimension).
    ix = (int) get_global_id(1);
    iy = (int) get_global_id(0);
    site_i = (HX + 3 + ix) * NY + (HY + iy);

    nxt[            site_i] = prv[            site_i - 3*NY + 1];
    nxt[    NX*NY + site_i] = prv[    NX*NY + site_i - 3*NY    ];
    nxt[2 * NX*NY + site_i] = prv[2 * NX*NY + site_i - 3*NY - 1];
    nxt[3 * NX*NY + site_i] = prv[3 * NX*NY + site_i - 2*NY + 2];
    nxt[4 * NX*NY + site_i] = prv[4 * NX*NY + site_i - 2*NY + 1];
    nxt[5 * NX*NY + site_i] = prv[5 * NX*NY + site_i - 2*NY    ];
    nxt[6 * NX*NY + site_i] = prv[6 * NX*NY + site_i - 2*NY - 1];
    ...

OpenACC Propagate function

inline void propagate ( const data_t * restrict prv, data_t * restrict nxt )
{
    int ix, iy, site_i;
    #pragma acc kernels present(prv) present(nxt)
    #pragma acc loop independent gang
    for ( ix = HX; ix < (HX + SIZEX); ix++ ) {
        #pragma acc loop independent vector(BLKSIZE)
        for ( iy = HY; iy < (HY + SIZEY); iy++ ) {
            ...
            site_i = (ix * NY) + iy;

            nxt[            site_i] = prv[            site_i - 3*NY + 1];
            nxt[    NX*NY + site_i] = prv[    NX*NY + site_i - 3*NY    ];
            nxt[2 * NX*NY + site_i] = prv[2 * NX*NY + site_i - 3*NY - 1];
            nxt[3 * NX*NY + site_i] = prv[3 * NX*NY + site_i - 2*NY + 2];
            nxt[4 * NX*NY + site_i] = prv[4 * NX*NY + site_i - 2*NY + 1];
            nxt[5 * NX*NY + site_i] = prv[5 * NX*NY + site_i - 2*NY    ];
            nxt[6 * NX*NY + site_i] = prv[6 * NX*NY + site_i - 2*NY - 1];
            ...

Hardware used: the Eurora prototype
Eurora (Eurotech and Cineca):
- hot-water cooling system
- delivers 3,209 MFLOPS per watt of sustained performance
- 1st in the Green500 list of June 2013
Computing nodes: 64
Processor types: Intel Xeon E5-2658 @ 2.10 GHz, Intel Xeon E5-2687W @ 3.10 GHz
Accelerator types: MIC (Intel Xeon Phi 5120D), GPU (NVIDIA Tesla K20x)

OpenCL work-group size selection for Propagate (Xeon Phi)
Performance of propagate as a function of the number of work-items N_wi per work-group and the number of work-groups N_wg.

OpenCL work-group size selection for Collide (Xeon Phi)
Performance of collide as a function of the number of work-items N_wi per work-group and the number of work-groups N_wg.

2 x NVIDIA K20s GPU
[Chart: run time per iteration (msec) for Propagate, BC and Collide on 2 x GPU (NVIDIA K20s), comparing the CUDA, OpenCL and OpenACC implementations.]

2 x Intel Xeon Phi MIC
[Chart: run time per iteration (msec) for Propagate, BC and Collide on 2 x MIC (Intel Xeon Phi), comparing the C and OpenCL implementations.]

Propagate
[Chart: run time per iteration (msec) for Propagate on a 1920 x 2048 lattice, on MIC, GPU, CPU2 and CPU3, comparing the C, optimized C, CUDA, OpenCL and OpenACC implementations.]

Collide
[Chart: run time per iteration (msec) for Collide on a 1920 x 2048 lattice, on MIC, GPU, CPU2 and CPU3, comparing the C, optimized C, CUDA, OpenCL and OpenACC implementations.]

Scalability on Eurora nodes (OpenCL code)
Weak scaling regime, lattice size: 256 x 8192 x N_devices. Strong scaling regime, lattice size: 1024 x 8192.

Simulation of the Rayleigh-Taylor (RT) Instability
Instability at the interface of two fluids of different densities, triggered by gravity: a cold, dense fluid over a less dense, warmer fluid triggers an instability that mixes the two fluid regions (until equilibrium is reached).

Conclusions
1 We have presented an OpenCL and an OpenACC implementation of a fluid-dynamics simulation based on Lattice Boltzmann methods.
2 Code portability: they have been successfully ported to and run on several computing architectures, including CPU, GPU and MIC systems.
3 Performance portability: performance is comparable to that of codes written in more native programming frameworks, such as CUDA or C.
4 OpenCL is easily portable across several architectures while preserving performance, but not all vendors are committed to supporting this standard today.
5 OpenACC is usable with little coding effort, but compilers are not yet available for all architectures.

Acknowledgments
Luca Biferale, Mauro Sbragaglia, Patrizio Ripesi - University of Tor Vergata and INFN Roma, Italy
Andrea Scagliarini - University of Barcelona, Spain
Filippo Mantovani - BSC, Spain
Enrico Calore, Sebastiano Fabio Schifano, Raffaele Tripiccione - University and INFN of Ferrara, Italy
Federico Toschi - Eindhoven University of Technology, The Netherlands, and CNR-IAC, Roma, Italy
This work has been performed in the framework of the INFN COKA and SUMA projects. We would like to thank the CINECA (Italy) and JSC (Germany) institutes for access to their systems.

Thanks for your attention