Accelerating a Particle-in-Cell Code for Space Plasma Simulations with OpenACC


Ivy Bo Peng (1), Stefano Markidis (1), Andris Vaivads (2), Juris Vencels (1), Jan Deca (3), Giovanni Lapenta (3), Alistair Hart (4) and Erwin Laure (1)

(1) HPCViz Department, KTH Royal Institute of Technology
(2) Swedish Institute of Space Physics, Uppsala, Sweden
(3) Department of Mathematics, Centre for Mathematical Plasma Astrophysics (CmPA), KU Leuven
(4) Cray Exascale Research Initiative Europe, UK

Top Supercomputers are Accelerated
15% of the supercomputers in the Top500 list are accelerated, and these accelerated systems provide 35% of the total Top500 performance. NVIDIA GPUs dominate the accelerator market. Source: www.top500.org, November 2014 list.

Exascale Simulations Need to be Accelerated
Example: formation of a magnetosphere.
- Grid: 384 x 384 x 384
- Particles (initially): 3.01 x 10^9
- Time steps: 15,000
- No. of MPI processes: 2,048
- FLOPs in mover: 10^15
- Simulation time: 24 hours
The simulation was run on the Lindgren supercomputer at KTH (a Cray XE6 system with AMD Opteron processors and the Cray Gemini interconnect).

Porting Code to GPU with OpenACC
OpenACC is an accelerator programming API standard supported by multiple vendors; we used the Cray and PGI compilers in our work. It aims at incremental porting of C, C++ and Fortran programs to multi-GPU systems using compiler directives:
- Offload computationally intensive work to the GPU
- Implicit data movement between CPU (host) and GPU (device) memory spaces
OpenACC is very similar to OpenMP and allows porting applications not initially designed for GPUs, like ipic3d (a minimal directive example is sketched below).
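The sketch below illustrates the directive-based offloading described above on a simple particle-position update; the function, array names and loop body are hypothetical and are not taken from ipic3d.

```c++
// Minimal OpenACC offload sketch (hypothetical arrays, not ipic3d code).
// 'parallel loop' asks the compiler to run the loop on the GPU; the data
// clauses express the implicit host<->device movement mentioned above.
#include <vector>

void push_positions(std::vector<double>& x, const std::vector<double>& v, double dt)
{
    const int n = static_cast<int>(x.size());
    double* xp = x.data();
    const double* vp = v.data();

    #pragma acc parallel loop copy(xp[0:n]) copyin(vp[0:n])
    for (int i = 0; i < n; ++i) {
        xp[i] += vp[i] * dt;   // each particle position advances independently
    }
}
```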

Particle-in-Cell Code ipic3d
ipic3d is a Particle-in-Cell code for the space physics community, used to study the interaction between the solar wind and the Earth's magnetosphere. It is a parallel code implemented in C++ using hybrid MPI and OpenMP (about 20,000 lines of code), with 80% parallel efficiency on 16,000 cores. In this work, we study the porting of ipic3d to multi-GPU systems with OpenACC.

Particle-in-Cell (PIC) Method
The PIC method solves the Vlasov equation (the transport equation without the collision term) and Maxwell's equations using computational particles (of the order of a billion particles in our simulations). The computational cycle consists of three basic steps: particle mover, interpolation and field solver. In our work we apply OpenACC to the mover and interpolation stages.
[Diagram: MOVER -> INTERPOLATION -> FIELD SOLVER cycle]
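For reference, these governing equations in their standard textbook form (notation not taken from the slides) are:

```latex
% Vlasov equation for species s (collisionless transport)
\frac{\partial f_s}{\partial t}
  + \mathbf{v}\cdot\nabla_{\mathbf{x}} f_s
  + \frac{q_s}{m_s}\left(\mathbf{E} + \mathbf{v}\times\mathbf{B}\right)\cdot\nabla_{\mathbf{v}} f_s = 0

% Maxwell's equations (SI units)
\nabla\times\mathbf{E} = -\frac{\partial \mathbf{B}}{\partial t}, \qquad
\nabla\times\mathbf{B} = \mu_0\mathbf{J} + \mu_0\varepsilon_0\frac{\partial \mathbf{E}}{\partial t}, \qquad
\nabla\cdot\mathbf{E} = \frac{\rho}{\varepsilon_0}, \qquad
\nabla\cdot\mathbf{B} = 0
```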

Computational Time Spent in ipic3d
We focus on the most time-consuming parts (other than communication):
- Mover: 41.1%
- Interpolation: 13.2%
The CrayPAT profiling results are based on a typical magnetic reconnection simulation running on 256 processes on the Beskow Cray XC40 supercomputer at KTH.

Challenges in Porting ipic3d
We identified two challenges:
- The deep-copy issue: OpenACC handles 1D arrays well but requires extra work for multidimensional arrays, structures, classes and template classes. Both compiler support and programmer input are required.
- The atomic capture issue.

The Deep-Copy Issue
OpenACC supports a flat object model, whereas C++ pointer indirection requires non-contiguous transfers and pointer translation. We solve it with a workaround: the particle information is cast to 1D arrays, and these 1D arrays are used to move the data back and forth between CPU and GPU (see the sketch below). Source: http://on-demand.gputechconf.com/gtc/2013/presentations/s3084-openacc-openmp-directives-cce.pdf
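The following sketch shows the shape of this flattening workaround; the struct, field names and loop body are hypothetical and do not reproduce the actual ipic3d classes.

```c++
// Sketch of the flattening workaround (hypothetical names, not ipic3d code).
// Instead of moving a C++ object with internal pointers (which OpenACC cannot
// deep-copy automatically), particle data live in plain 1D arrays that can be
// transferred with simple data clauses.
struct ParticlesFlat {
    int     n;          // number of particles
    double *x, *y, *z;  // positions, one entry per particle
    double *u, *v, *w;  // velocities
};

void move_particles(ParticlesFlat& p, double dt)
{
    const int n = p.n;
    double *x = p.x, *y = p.y, *z = p.z;
    double *u = p.u, *v = p.v, *w = p.w;

    // Each 1D array is transferred explicitly; no deep copy is needed.
    #pragma acc parallel loop copy(x[0:n], y[0:n], z[0:n]) copyin(u[0:n], v[0:n], w[0:n])
    for (int i = 0; i < n; ++i) {
        x[i] += u[i] * dt;
        y[i] += v[i] * dt;
        z[i] += w[i] * dt;
    }
}
```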

The Atomic Capture Issue
atomic capture is an OpenACC atomic directive that guarantees a variable is read and updated atomically. It is essential for correctly accelerating the interpolation stage, where multiple particles can map to the same grid point (see the sketch below). It did not work in PGI compiler versions earlier than 15.1.
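The slide names the atomic capture directive; the sketch below uses the closely related atomic update form to show the kind of protected accumulation needed when many particles scatter contributions onto the same grid node. The grid, weighting and variable names are hypothetical.

```c++
// Sketch of a protected scatter in the interpolation stage (hypothetical names).
// Several particles may deposit charge into the same grid node, so the
// accumulation must be atomic to avoid a race condition.
void deposit_charge(const double* xp, const double* qp, int np,
                    double* rho, int nx, double dx)
{
    #pragma acc parallel loop copyin(xp[0:np], qp[0:np]) copy(rho[0:nx])
    for (int i = 0; i < np; ++i) {
        int node = static_cast<int>(xp[i] / dx);   // nearest-grid-point weighting
        if (node >= 0 && node < nx) {
            #pragma acc atomic update
            rho[node] += qp[i];
        }
    }
}
```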

GPU Porting Results
The ipic3d GPU porting was tested against the 2D magnetic reconnection problem, which is important for the Earth's magnetosphere.
[Figure: evolution of the out-of-plane magnetic field component]

GPU Performance Results
Test environment: Cray XC30 system (Swan), NVIDIA K20X GPUs, Aries interconnect and Intel CPUs.

Conclusions
- We successfully ported the ipic3d code to multi-GPU systems with OpenACC.
- C++ is not the best match for OpenACC: compiler support for deep copy in C++ is lacking. The Cray and PGI compiler teams are working on this, but programmer input will still be needed.
- Partial control of memory management and threads could facilitate more aggressive optimization, as can be achieved with CUDA.
- The preliminary OpenACC port is 42%, 15%, 16% and 18% faster on 1, 2, 4 and 6 nodes, respectively, compared with the original version.
- GPUDirect will simplify the communication in the mover.

Thanks! This work was funded by the Swedish VR grant D621-2013-4309 and by the European Commission through the EPiGRAM project (grant agreement no. 610598, epigram-project.eu). The work used resources of the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. The work also used resources provided by the Cray Marketing Partner Network.

Future Optimization
- Particle communication: copy only the exiting particles to the CPU for MPI communication (use an atomic update to flag the exiting particles), as sketched below.
- GPU-direct communication.
- Cache-coherent particle sorting: sort particles within subdomains to minimize cache misses during particle-to-field interpolation.
- Multiple MPI processes on one node.
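One possible shape of the first item, assuming a compact per-subdomain exit list built on the device (all names hypothetical; this is not ipic3d code), is:

```c++
// Sketch: collect only the particles leaving the local subdomain (hypothetical names).
// 'atomic capture' reserves a unique slot in the exit list for each leaving particle,
// so only that (usually small) list needs to be exchanged over MPI.
void collect_exiting(const double* x, int np, double xmin, double xmax,
                     int* exit_idx, int* n_exit)
{
    *n_exit = 0;
    #pragma acc parallel loop copyin(x[0:np]) copyout(exit_idx[0:np]) copy(n_exit[0:1])
    for (int i = 0; i < np; ++i) {
        if (x[i] < xmin || x[i] >= xmax) {         // particle leaves this subdomain
            int slot;
            #pragma acc atomic capture
            { slot = *n_exit; *n_exit += 1; }
            exit_idx[slot] = i;
        }
    }
    // In a real code only the first *n_exit entries of exit_idx would be
    // transferred back, e.g. with a ranged 'acc update self'.
}
```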

The Good and Bad Things about GPUs
GPUs are good at the computation-intensive parts of the application. The connection between the host and device memory spaces is the bottleneck, so: minimize data movement between host and GPU, and use pinned memory.
[Diagram: GPU device memory ~250 GB/s (GDDR5); host memory 32 GB/s (DDR3, 4 channels); PCIe 2.0 x16 = 8 GB/s; PCIe 3.0 x16 = 15.75 GB/s]
Source: https://www.olcf.ornl.gov/support/system-user-guides/accelerated-computing-guide/
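A common way to cut the host-device traffic described above is to keep the arrays resident on the GPU across several kernel launches with a structured data region; the sketch below uses hypothetical arrays and a trivial loop body.

```c++
// Sketch: keep arrays resident on the GPU across several steps instead of
// re-transferring them for every loop (hypothetical arrays and loop body).
void run_steps(double* a, double* b, int n, int nsteps)
{
    #pragma acc data copy(a[0:n]) copyin(b[0:n])    // one transfer in, one out
    {
        for (int step = 0; step < nsteps; ++step) {
            #pragma acc parallel loop present(a[0:n], b[0:n])
            for (int i = 0; i < n; ++i) {
                a[i] += b[i];                       // no per-step host<->device copies
            }
        }
    }
}
```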

Porting ipic3d to GPU
Favourable aspects:
- ~100x FLOPs to update the velocity and location of each particle in each computation cycle
- Different data structures for the particles, and the conversion between them, are supported in the code
Difficulties:
- Advanced use of C++ templates with compiler directives requires deep copy -> linearize to 1D array structures
- The conversion between the two data structures is time-consuming
- Field data is required for updating the particles
- Possible race condition when interpolating from particles to the field -> need atomic updates (compiler issue; a high collision rate is inefficient)

Porting to GPU with OpenACC
- Add the field components to the particle class and linearize the particle structures
- Copy the field data into GPU global memory
- Create the data region on the GPU when a particle species is initialized in the class constructor, and free the data region in the class destructor (see the sketch below)
- Asynchronous data movement for the different particle species
- Two compilers (Cray and PGI) on two GPU-accelerated supercomputers: Titan and Swan
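A minimal sketch of the constructor/destructor data-region pattern with a per-species asynchronous queue; the class and member names are hypothetical and do not match the actual ipic3d interfaces.

```c++
// Sketch: tie the GPU data region to the lifetime of a particle species
// (hypothetical class; the real ipic3d classes differ).
#include <vector>

class ParticleSpecies {
public:
    ParticleSpecies(int n, int queue) : x_(n), u_(n), queue_(queue) {
        double* x = x_.data();
        double* u = u_.data();
        const int np = n;
        const int q  = queue_;
        // Allocate and fill the device copies once, when the species is created.
        #pragma acc enter data copyin(x[0:np], u[0:np]) async(q)
    }

    ~ParticleSpecies() {
        double* x = x_.data();
        double* u = u_.data();
        const int np = static_cast<int>(x_.size());
        const int q  = queue_;
        // Release the device memory when the species is destroyed.
        #pragma acc exit data delete(x[0:np], u[0:np]) async(q)
        #pragma acc wait(q)
    }

private:
    std::vector<double> x_, u_;   // flattened 1D particle arrays
    int queue_;                   // per-species async queue, so species move concurrently
};
```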