Porting the Plasma Simulation PIConGPU to Heterogeneous Architectures with Alpaka
René Widera¹, Erik Zenker¹,², Guido Juckeland¹, Benjamin Worpitz¹,², Axel Huebl¹,², Andreas Knüpfer², Wolfgang E. Nagel², Michael Bussmann¹
¹ Helmholtz-Zentrum Dresden-Rossendorf, ² Technische Universität Dresden
PIConGPU
Application areas: electron acceleration with lasers, ion acceleration with lasers, plasma instabilities, compact X-ray sources, tumor therapy, astrophysics.
Domain Decomposition: Field and Particle Domain
Moving particles create fields, particles change cells, and fields act back on particles.
Creating Vectorized Data Structures for Particles and Fields
Field domain: chunked into supercells, line-wise aligned in memory.
Particle domain: particles are stored per supercell in fixed-size frames, laid out as a struct of aligned arrays.
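To picture this layout, here is a minimal C++ sketch of a supercell with fixed-size particle frames stored as a struct of aligned arrays. The frame size of 256 and the attribute names (posX, momX, cellIdx, ...) are illustrative assumptions, not the actual PIConGPU types.

    #include <array>
    #include <cstdint>
    #include <vector>

    // One frame holds the attributes of up to FrameSize particles of a single
    // supercell as a struct of aligned arrays (one contiguous array per attribute).
    constexpr std::size_t FrameSize = 256; // hypothetical frame size

    struct alignas(64) ParticleFrame
    {
        std::array<float, FrameSize> posX, posY, posZ;   // in-cell position
        std::array<float, FrameSize> momX, momY, momZ;   // momentum
        std::array<std::uint32_t, FrameSize> cellIdx;    // cell within the supercell
        std::size_t numParticles = 0;                    // filled entries in this frame
    };

    // A supercell owns a list of fixed-size frames; particles that change cells
    // are moved between frames and supercells instead of reallocating per particle.
    struct Supercell
    {
        std::vector<ParticleFrame> frames;
    };

The fixed frame size keeps every attribute array contiguous and vectorization-friendly, and moving a particle to a neighboring supercell only means copying its slot into a frame of that supercell.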
Algorithm Driven Cache Strategy
[figure: the cells of a supercell and their particles are staged from Global Memory into Shared Memory]
High Utilization of Threads
[figure: a thread block with threads 1 to 4 works cooperatively on the supercell data cached in Shared Memory]
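A rough plain-C++ sketch of this cache strategy, not the PIConGPU/Alpaka implementation: the field values of one supercell are staged once into a small local buffer that stands in for shared memory, and all particle updates of that supercell then read from this buffer. The structure names, the supercell size, and the placeholder particle push are assumptions for illustration.

    #include <array>
    #include <cstddef>
    #include <vector>

    constexpr std::size_t CellsPerSupercell = 8 * 8; // hypothetical supercell size

    struct FieldValue { float ex, ey, ez, bx, by, bz; };
    struct Particle   { std::size_t cellIdx; float momX, momY, momZ; };

    // Process one supercell: stage its field cells once, then update all particles.
    void processSupercell(
        std::vector<FieldValue> const & globalField,  // "global memory"
        std::size_t firstCellOfSupercell,
        std::vector<Particle> & particles)            // particles of this supercell
    {
        // Stage: copy the supercell's cells into a small buffer ("shared memory").
        std::array<FieldValue, CellsPerSupercell> cached;
        for(std::size_t c = 0; c < CellsPerSupercell; ++c)
            cached[c] = globalField[firstCellOfSupercell + c];

        // Compute: every particle of the supercell reads its fields from the buffer,
        // so each global field value is loaded only once per supercell.
        for(auto & p : particles)
        {
            FieldValue const & f = cached[p.cellIdx];
            p.momX += f.ex; p.momY += f.ey; p.momZ += f.ez; // placeholder field push
        }
    }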
Task-Parallel Execution of Kernels: asynchronous communication.
PIConGPU Scales up to 16,384 GPUs
[plot: strong scaling, speedup vs. number of GPUs, curves for ideal and for runs scaled 1 to 32, 8 to 256, 64 to 2048, 512 to 16384, and 4096 to 16384 GPUs]
[plot: weak scaling, efficiency [%] vs. number of GPUs, ideal vs. PIConGPU]
Efficiency > 95%, 6.9 PFlop/s (SP)
More Physics, More Computations, More Power!
Atom-physical effects map the old atom state to the new atom state: with the state vector s = (s_{1,1}, s_{1,2}, s_{1,3}, ..., s_{n,m}) and the transition matrix T = (t_{i,j}), the update is s_new = T · s_old.
Really Big Data Task: random access on large amounts of data (> 100 GB); a good job for powerful CPUs; requires efficient CPU/GPU cooperation.
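As a worked sketch of this update in plain C++ (dimensions, types, and the dense storage are illustrative assumptions, not the actual atomic-physics module):

    #include <cstddef>
    #include <vector>

    // New atom state = transition matrix applied to the old atom state: s_new = T * s_old.
    std::vector<double> applyTransitions(
        std::vector<std::vector<double>> const & T,  // n x n atom-physical transition matrix
        std::vector<double> const & sOld)            // old atom state (n entries)
    {
        std::size_t const n = sOld.size();
        std::vector<double> sNew(n, 0.0);
        for(std::size_t i = 0; i < n; ++i)
            for(std::size_t j = 0; j < n; ++j)
                sNew[i] += T[i][j] * sOld[j];        // random access over a large matrix
        return sNew;
    }

For realistic state spaces this means random access on more than 100 GB of data, which is why the slide calls for powerful CPUs and efficient CPU/GPU cooperation.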
Small Open Source Communities need Maintainable Codes
Heterogeneity: write once, execute everywhere.
Testability: validate once, get correct results everywhere.
Sustainability: porting implies minimal code changes.
Optimizability: tune for good performance at minimum coding effort.
Openness: open source and open standards.
Single source.
Alpaka
Good News: there are Alpakas on the Compute Meadow
Single, zero-overhead interface to existing parallelism models (threads, fibers, and whatever parallelism model comes next).
Single-source C++11 kernels.
Data-structure-agnostic memory model.
Abstract Hierarchical Redundant Parallelism Model
[figure: hierarchy of parallelization levels Grid, Block, Thread, Element, with parallel execution, synchronization, and sequential execution]
The Element level is an explicit sequential layer.
Data Structure Agnostic Memory Model
Explicit deep copy between Host Memory and Device Global Memory.
[figure: the Grid works on Device Global Memory, each Block has Shared Memory, each Thread has Register Memory]
Map the Abstraction Model to your Desired Acceleration Back-End
[figure: the Grid/Block/Thread/Element levels and Global/Shared/Register Memory are mapped onto CPU hardware: RAM, packages, L3 cache, cores, L1/L2 caches, registers, and AVX units]
Explicit mapping of parallelization levels to hardware.
Levels of the model that a specific back-end does not support can be ignored; the abstract interface allows extending the set of mappings.
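To make the mapping concrete, here is a minimal plain-C++ sketch of how a purely serial CPU back-end could run the four-level hierarchy as nested loops. This illustrates the idea only, not Alpaka's actual implementation; the function and parameter names are made up.

    #include <cstddef>
    #include <functional>

    // Execute a kernel body over the Grid > Block > Thread > Element hierarchy
    // sequentially, as a serial CPU back-end might do it.
    void runSerial(
        std::size_t blocksPerGrid,
        std::size_t threadsPerBlock,
        std::size_t elemsPerThread,
        std::function<void(std::size_t /*globalThreadIdx*/, std::size_t /*elem*/)> kernelBody)
    {
        for(std::size_t b = 0; b < blocksPerGrid; ++b)          // Grid level: blocks
            for(std::size_t t = 0; t < threadsPerBlock; ++t)    // Block level: threads
                for(std::size_t e = 0; e < elemsPerThread; ++e) // Thread level: elements (sequential)
                    kernelBody(b * threadsPerBlock + t, e);
    }

A GPU back-end would instead hand blocks and threads to the hardware scheduler, typically with one element per thread, while a CPU back-end can map the inner element loop onto SIMD lanes.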
Alpaka: Vector Addition Kernel

struct VectorAdd
{
    template<typename TAcc, typename TElem, typename TSize>
    ALPAKA_FN_ACC auto operator()(
        TAcc const & acc,
        TSize const & numElements,
        TElem const * const X,
        TElem * const Y) const
    -> void
    {
        // Global thread index and number of elements each thread processes.
        auto const globalIdx =
            alpaka::idx::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0u];
        auto const elemsPerThread =
            alpaka::workdiv::getWorkDiv<alpaka::Thread, alpaka::Elems>(acc)[0u];

        // Each thread handles its own contiguous chunk of the vectors.
        auto const begin = globalIdx * elemsPerThread;
        auto const end = min(begin + elemsPerThread, numElements);

        for(TSize i = begin; i < end; ++i)
        {
            Y[i] = X[i] + Y[i];
        }
    }
};
Alpaka: Initialization

// Configure Alpaka
using Dim = alpaka::dim::DimInt<3u>;
using Size = std::size_t;
using Acc = alpaka::acc::AccCpuSerial<Dim, Size>;
using Host = alpaka::acc::AccCpuSerial<Dim, Size>;
using Stream = alpaka::stream::StreamCpuSync;
using WorkDiv = alpaka::workdiv::WorkDivMembers<Dim, Size>;
using Elem = float;

// Retrieve devices and stream
auto devHost( alpaka::dev::DevMan<Host>::getDevByIdx(0) );
auto devAcc( alpaka::dev::DevMan<Acc>::getDevByIdx(0) );
Stream stream( devAcc );

// Specify work division
auto elementsPerThread( alpaka::Vec<Dim, Size>::ones() );
auto threadsPerBlock( alpaka::Vec<Dim, Size>::all(2u) );
auto blocksPerGrid( alpaka::Vec<Dim, Size>(4u, 8u, 16u) );
WorkDiv workDiv( alpaka::workdiv::WorkDivMembers<Dim, Size>(blocksPerGrid, threadsPerBlock, elementsPerThread) );
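With this work division the grid holds 4 · 8 · 16 = 512 blocks of 2 · 2 · 2 = 8 threads each, 4096 threads in total, and since elementsPerThread is all ones, every thread processes exactly one element, covering an index domain of 8 × 16 × 32 elements.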
Alpaka: Call the Kernel

// Memory allocation and host to device memory copy
auto X_h = alpaka::mem::buf::alloc<Elem, Size>(devHost, extent);
auto Y_h = alpaka::mem::buf::alloc<Elem, Size>(devHost, extent);
auto X_d = alpaka::mem::buf::alloc<Elem, Size>(devAcc, extent);
auto Y_d = alpaka::mem::buf::alloc<Elem, Size>(devAcc, extent);

alpaka::mem::view::copy(stream, X_d, X_h, extent);
alpaka::mem::view::copy(stream, Y_d, Y_h, extent);

// Kernel creation and execution
VectorAdd kernel;
auto const exec( alpaka::exec::create<Acc>(
    workDiv,
    kernel,
    numElements,
    alpaka::mem::view::getPtrNative(X_d),
    alpaka::mem::view::getPtrNative(Y_d)) );
alpaka::stream::enqueue(stream, exec);

// Copy memory back to host
alpaka::mem::view::copy(stream, Y_h, Y_d, extent);
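Continuing the listing above, one would typically synchronize and then read the result on the host; a short sketch, assuming alpaka::wait::wait is the blocking synchronization call (the other calls already appear in the listing):

    // Block until the kernel and the copy back to the host have completed.
    // (The synchronous CPU stream used above returns immediately; this matters
    // for asynchronous streams.)
    alpaka::wait::wait(stream);

    // The result can now be read on the host through the native pointer of Y_h.
    Elem const * const result = alpaka::mem::view::getPtrNative(Y_h);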
Cupla to the rescue: very fast porting!
Cupla provides a CUDA-like interface on top of Alpaka and is fast enough for a live hack session!
Live Cupla Hack Session
Live port of a CUDA application to Cupla.
Starting point: the matrix multiplication example from the CUDA samples.
Aim: a single source code executed on GPU and CPU hardware.
Algorithm: tiled matrix-matrix multiplication.
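As a rough sketch of what such a port can look like (a hedged example, assuming cupla's CUDA-compatibility header cuda_to_cupla.hpp and its CUPLA_KERNEL launch macro; the kernel and variable names are illustrative, this is not the code from the live session): the CUDA __global__ function becomes a functor that receives the accelerator as its first argument, while CUDA-style indexing and the launch configuration keep their familiar shape.

    #include <cuda_to_cupla.hpp> // cupla's CUDA-compatibility header

    // CUDA: __global__ void vectorAdd(int n, float const * x, float * y)
    // Cupla: the kernel becomes a functor templated on the accelerator.
    struct VectorAddKernel
    {
        template<typename T_Acc>
        ALPAKA_FN_ACC void operator()(
            T_Acc const & acc, int n, float const * x, float * y) const
        {
            int const i = blockIdx.x * blockDim.x + threadIdx.x; // CUDA-style indexing
            if(i < n)
                y[i] = x[i] + y[i];
        }
    };

    void launch(int n, float const * x_d, float * y_d)
    {
        dim3 const blockSize(256);
        dim3 const gridSize((n + 255) / 256);
        // CUDA: vectorAdd<<<gridSize, blockSize>>>(n, x_d, y_d);
        CUPLA_KERNEL(VectorAddKernel)(gridSize, blockSize, 0, 0)(n, x_d, y_d);
    }

The same source can then be compiled for the CUDA back-end or for CPU back-ends by selecting the corresponding cupla/Alpaka accelerator at build time.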
Single Source Alpaka DGEMM Kernel on Various Architectures
DGEMM: C = αAB + βC
[chart: achieved DGEMM performance per architecture of roughly 480, 560, 540, 150, and 1450 GFLOPS, compared against the theoretical peak performance of each architecture]
What happened so far...
PIConGPU Runtime on Various Architectures
PIConGPU Efficiency on Various Architectures
Clone us from GitHub: https://github.com/ComputationalRadiationPhysics

git clone https://github.com/ComputationalRadiationPhysics/picongpu
git clone https://github.com/ComputationalRadiationPhysics/alpaka
git clone https://github.com/ComputationalRadiationPhysics/cupla

Alpaka paper pre-print: http://arxiv.org/abs/1602.08477