OpenACC Programming on GPUs




OpenACC Programming on GPUs: Directive-based GPGPU Programming
Sandra Wienke, M.Sc., wienke@rz.rwth-aachen.de
Center for Computing and Communication, RWTH Aachen University
Rechen- und Kommunikationszentrum (RZ)
HiPerCH, April 2013

Agenda
- Motivation & Overview
- Execution & Memory Model
- Offload Regions: kernels, parallel & loop
- Loop Schedule
- Data Movement
- Optimizing Memory Accesses: Global Memory Throughput, Shared Memory
- Heterogeneous Computing: Asynchronous Operations, Multiple GPUs (& Runtime Library Routines)
- Development Tips
- Summary

Example SAXPY: CPU

void saxpycpu(int n, float a, float *x, float *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    for (int i = 0; i < n; ++i) {
        x[i] = i;
        y[i] = 5.0*i - 1.0;
    }
    // Invoke serial SAXPY kernel
    saxpycpu(n, a, x, y);
    free(x); free(y);
    return 0;
}

SAXPY = Single-precision real Alpha X Plus Y (y = a*x + y)

Example SAXPY: OpenACC

void saxpyopenacc(int n, float a, float *x, float *y) {
    #pragma acc parallel loop vector_length(256)
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    for (int i = 0; i < n; ++i) {
        x[i] = i;
        y[i] = 5.0*i - 1.0;
    }
    // Invoke OpenACC SAXPY kernel
    saxpyopenacc(n, a, x, y);
    free(x); free(y);
    return 0;
}

Example SAXPY: CUDA

__global__ void saxpy_parallel(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

int main(int argc, char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float *h_x, *h_y;   // Pointers to CPU memory
    h_x = (float*) malloc(n * sizeof(float));
    h_y = (float*) malloc(n * sizeof(float));
    // Initialize h_x, h_y
    for (int i = 0; i < n; ++i) {
        h_x[i] = i;
        h_y[i] = 5.0*i - 1.0;
    }
    float *d_x, *d_y;   // Pointers to GPU memory
    // 1. Allocate data on GPU + transfer data to GPU
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);
    // 2. Launch parallel SAXPY kernel
    dim3 threadsPerBlock(128);
    dim3 blocksPerGrid(n / threadsPerBlock.x);
    saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, a, d_x, d_y);
    // 3. Transfer data to CPU + free data on GPU
    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}

Introduction

Nowadays, low-level GPU APIs (like CUDA or OpenCL) are often used, which makes for an unproductive development process. A directive-based programming model delegates responsibility for low-level GPU programming tasks to the compiler:
- Data movement
- Kernel execution
- Awareness of the particular GPU type

OpenACC is a directive-based model to offload compute-intensive loops to an attached accelerator.

Overview of OpenACC

- Open industry standard, aiming at portability
- Introduced by CAPS, Cray, NVIDIA and PGI (Nov. 2011)
- Supports C, C++ and Fortran
- Targets NVIDIA GPUs, AMD GPUs & Intel MIC (now/near future)

Timeline
- Nov 11: Specification 1.0
- Q1/Q2 12: Cray, PGI & CAPS compilers understand parts of the OpenACC API
- Nov 12: Proposed additions for OpenACC 2.0; more supporters, e.g. TUD, RogueWave


Execution & Memory Model (Recap)

- Host-directed execution model: the CPU controls the GPU over PCIe
- Separate host and device memories
- Many tasks can be done by the compiler/runtime; programming is user-directed

[Figure: host-directed execution with separate CPU and GPU memories connected via PCIe]

Directive Syntax

C/C++:   #pragma acc directive-name [clauses]
Fortran: !$acc directive-name [clauses]

Iterative development process; the compiler feedback is helpful:
- Whether an accelerator kernel could be generated
- Which loop schedule is used
- Where/which data is copied

Compiling (PGI)

pgcc -acc -ta=nvidia,cc30,5.0 -Minfo=accel saxpy.c

- pgcc: PGI C compiler (pgf90 for Fortran)
- -acc: tells the compiler to recognize OpenACC directives
- -ta=nvidia: specifies the target architecture, here NVIDIA GPUs
- cc30: optional; targets compute capability 3.0
- 5.0: optional; uses CUDA Toolkit 5.0 for code generation
- -Minfo=accel: optional; enables compiler feedback for accelerator code

We focus on PGI's OpenACC compiler here. For other compilers, compilation and feedback will look different.


Offload Regions (in examples): serial (host)

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }
    free(x); free(y);
    return 0;
}

Offload Regions (in examples): acc kernels

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc kernels
    {
        for (int i = 0; i < n; ++i) {   // Kernel 1
            y[i] = a*x[i] + y[i];
        }
        for (int i = 0; i < n; ++i) {   // Kernel 2
            y[i] = b*x[i] + y[i];
        }
    }
    free(x); free(y);
    return 0;
}

PGI compiler feedback:
main:
  29, Generating copyin(x[:10240])
      Generating copy(y[:10240])
      Generating compute capability 2.0 binary
  31, Loop is parallelizable
      Accelerator kernel generated
      31, #pragma acc loop gang, vector(32)
  37, Loop is parallelizable
      Accelerator kernel generated
      37, #pragma acc loop gang, vector(32)

Offload Regions (in examples): acc parallel, acc loop, data clauses

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }
    free(x); free(y);
    return 0;
}

Directives: Offload Regions

An offload region maps to a CUDA kernel function.

parallel construct
  C/C++:   #pragma acc parallel [clauses]
  Fortran: !$acc parallel [clauses] ... !$acc end parallel
- The user is responsible for finding parallelism (loops)
- acc loop is needed for worksharing
- No automatic synchronization between several loops

kernels construct
  C/C++:   #pragma acc kernels [clauses]
  Fortran: !$acc kernels [clauses] ... !$acc end kernels
- The compiler is responsible for finding parallelism (loops)
- The acc loop directive is only needed for tuning
- Automatic synchronization between loops within the kernels region
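The contrast between the two constructs can be sketched in C; the helper names below are illustrative, and with a non-OpenACC compiler the pragmas are simply ignored and the loops run serially, so the sketch stays correct either way:

```c
/* kernels: the compiler analyzes the region and decides what to
   parallelize; loops inside are synchronized automatically. */
void saxpy_kernels_v(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
}

/* parallel + loop: the programmer asserts the loop is parallel
   and requests worksharing explicitly. */
void saxpy_parallel_v(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

With parallel, a loop the compiler cannot prove independent is still run in parallel as written; with kernels, such a loop falls back to sequential device code.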

Clauses for compute constructs (parallel, kernels), C/C++ and Fortran:

- if(condition): if the condition is true, the accelerator version is executed
- async[(int-expr)]: executes asynchronously, see Tuning slides
- num_gangs(int-expr): defines the number of parallel gangs (parallel only)
- num_workers(int-expr): defines the number of workers within a gang (parallel only)
- vector_length(int-expr): defines the length for vector operations (parallel only)
- reduction(op:list): reduction with op at the end of the region (parallel only)
- copy(list): H2D copy at region start + D2H copy at region end
- copyin(list): only H2D copy at region start
- copyout(list): only D2H copy at region end
- create(list): allocates data on the device, no copy to/from host
- present(list): data is already on the device
- present_or_*(list): tests whether data is on the device; if not, transfers it
- deviceptr(list): see Tuning slides
- private(list): a copy of each list item for each parallel gang (parallel only)
- firstprivate(list): as private + copy initialization from host (parallel only)

Directives: Loops

Share the work of loops: loop work gets distributed among threads on the GPU (in a certain schedule).

C/C++:   #pragma acc loop [clauses]
Fortran: !$acc loop [clauses]

kernels loop defines the loop schedule by an int-expr in gang, worker or vector (instead of num_gangs etc. with parallel).

Loop clauses, C/C++ and Fortran:
- gang: distributes work into thread blocks
- worker: distributes work into warps
- vector: distributes work into threads within a warp/thread block
- seq: executes the loop sequentially on the device
- collapse(n): collapses n tightly nested loops
- independent: asserts independent loop iterations (kernels loop only)
- reduction(op:list): reduction with op
- private(list): a private copy for each loop iteration

Misc on Loops

Combined directives:

#pragma acc kernels loop
for (int i = 0; i < n; ++i) { /*...*/ }

#pragma acc parallel loop
for (int i = 0; i < n; ++i) { /*...*/ }

Reductions:

#pragma acc parallel
#pragma acc loop reduction(+:sum)
for (int i = 0; i < n; ++i) {
    sum += i;
}

The PGI compiler can often recognize reductions on its own. See the compiler feedback, e.g.: "Sum reduction generated for var"
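The reduction snippet above can be completed into a self-contained routine (illustrative name; with a non-OpenACC compiler the pragmas are ignored and the loop runs serially with the same result):

```c
/* Sums 0 + 1 + ... + (n-1). Each gang/vector lane keeps a private
   partial sum; reduction(+:sum) combines them at the end of the region. */
long sum_first(int n)
{
    long sum = 0;
    #pragma acc parallel
    #pragma acc loop reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += i;
    return sum;
}
```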


Mapping OpenACC onto the CUDA Model

A possible mapping to CUDA terminology (GPUs); the exact mapping is compiler dependent:
- gang = thread block
- worker = warp
- vector = threads within a block (if omitting worker) or within a warp (if specifying worker)

[Figure: execution model mapping vector/gang onto core/SM(X), and the grid (kernel) onto the device]

Loop schedule (in examples)

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc kernels
    {
        for (int i = 0; i < n; ++i) {   // Kernel 1
            y[i] = a*x[i] + y[i];
        }
        for (int i = 0; i < n; ++i) {   // Kernel 2
            y[i] = b*x[i] + y[i];
        }
    }
    free(x); free(y);
    return 0;
}

Loop schedule (in examples)

Without a loop schedule, the compiler picks one (PGI feedback):
  33, Loop is parallelizable
      Accelerator kernel generated
      33, #pragma acc loop gang, vector(128)
  39, Loop is parallelizable
      Accelerator kernel generated
      39, #pragma acc loop gang, vector(128)

With an explicit loop schedule:

#pragma acc kernels copy(y[0:n]) copyin(x[0:n])
#pragma acc loop gang vector(256)
for (int i = 0; i < n; ++i) {
    y[i] = a*x[i] + y[i];
}

  33, Loop is parallelizable
      Accelerator kernel generated
      33, #pragma acc loop gang, vector(256)

#pragma acc kernels copy(y[0:n]) copyin(x[0:n])
#pragma acc loop gang(64) vector
for (int i = 0; i < n; ++i) {
    y[i] = b*x[i] + y[i];
}

  39, Loop is parallelizable
      Accelerator kernel generated
      39, #pragma acc loop gang(64), vector(128)

Sometimes the compiler feedback is buggy. Check once with ACC_NOTIFY.

Loop schedule (in examples)

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n]) vector_length(256)
    #pragma acc loop gang vector
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n]) num_gangs(64)
    #pragma acc loop gang vector
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }
    free(x); free(y);
    return 0;
}

vector_length: optional; specifies the number of threads in a thread block.
num_gangs: optional; specifies the number of thread blocks in the grid.

Loop schedule (in examples)

#pragma acc parallel vector_length(256)
#pragma acc loop gang vector
for (int i = 0; i < n; ++i) {
    // do something
}
Distributes the loop to n threads on the GPU: 256 threads per thread block, usually ceil(n/256) blocks in the grid.

#pragma acc parallel vector_length(256)
#pragma acc loop gang vector
for (int i = 0; i < n; ++i) {
    for (int j = 0; j < m; ++j) {
        // do something
    }
}
Distributes the outer loop to n threads on the GPU; each thread executes the inner loop sequentially. 256 threads per thread block, usually ceil(n/256) blocks in the grid.

#pragma acc parallel vector_length(256) num_gangs(16)
#pragma acc loop gang vector
for (int i = 0; i < n; ++i) {
    // do something
}
Distributes the loop to threads on the GPU (see above). If 16*256 < n, each thread gets multiple elements. 256 threads per thread block, 16 blocks in the grid.

Loop schedule (in examples)

#pragma acc parallel vector_length(256)
#pragma acc loop gang
for (int i = 0; i < n; ++i) {
    #pragma acc loop vector
    for (int j = 0; j < m; ++j) {
        // do something
    }
}
Distributes the outer loop to GPU multiprocessors (block-wise) and the inner loop to threads within thread blocks. 256 threads per thread block, usually n blocks in the grid.

#pragma acc kernels
#pragma acc loop gang(100) vector(8)
for (int i = 0; i < n; ++i) {
    #pragma acc loop gang(200) vector(32)
    for (int j = 0; j < m; ++j) {
        // do something
    }
}
Multidimensional partitioning is not yet possible with the parallel construct (see the proposed tile clause in v2.0). With nested loops under kernels, specification of multidimensional blocks and grids is possible, using the same resources for outer and inner loop: 100 blocks in Y-direction (rows) and 200 blocks in X-direction (columns); 8 threads in the Y-dimension and 32 threads in the X-dimension of one block. Total: 8*32 = 256 threads per thread block; 100*200 = 20,000 blocks in the grid.


Data Movement (in examples)

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = a*x[i] + y[i];
    }
    #pragma acc parallel copy(y[0:n]) copyin(x[0:n])
    #pragma acc loop
    for (int i = 0; i < n; ++i) {
        y[i] = b*x[i] + y[i];
    }
    free(x); free(y);
    return 0;
}

Data Movement (in examples): acc data

int main(int argc, const char* argv[]) {
    int n = 10240;
    float a = 2.0f;
    float b = 3.0f;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    // Initialize x, y
    // Run SAXPY TWICE
    #pragma acc data copyin(x[0:n])
    {
        #pragma acc parallel copy(y[0:n]) present(x[0:n])
        #pragma acc loop
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
        #pragma acc parallel copy(y[0:n]) present(x[0:n])
        #pragma acc loop
        for (int i = 0; i < n; ++i) {
            y[i] = b*x[i] + y[i];
        }
    }
    free(x); free(y);
    return 0;
}

Directives: Data Regions & Clauses

A data region decouples data movement from offload regions.

C/C++:   #pragma acc data [clauses]
Fortran: !$acc data [clauses] ... !$acc end data

Data clauses trigger data movement of the denoted arrays, C/C++ and Fortran:
- if(cond): if cond is true, move data to the accelerator
- copy(list): H2D copy at region start + D2H copy at region end
- copyin(list): only H2D copy at region start
- copyout(list): only D2H copy at region end
- create(list): allocates data on the device, no copy to/from host
- present(list): data is already on the device
- present_or_*(list): tests whether data is on the device; if not, transfers it
- deviceptr(list): see Tuning slides

Data Movement

Data clauses can be used on data, kernels or parallel constructs: copy, copyin, copyout, present, present_or_copy, create, deviceptr.

Array shaping: the compiler sometimes cannot determine the size of arrays, so specify it explicitly using data clauses and an array shape.

C/C++ ([lower bound : size]):
#pragma acc data copyin(a[0:length]) copyout(b[s/2:s/2])

Fortran ([lower bound : upper bound]):
!$acc data copyin(a(1:length)) copyout(b(s/2+1:s))
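The C shape [lower bound : number of elements] can be exercised in a small sketch (the routine name is hypothetical); it writes only the second half of an array, which is exactly the part the copyout shape names. Serially the pragma is a no-op and the loop does the same work:

```c
/* copyout(b[s/2 : s/2]) means: starting at index s/2, s/2 elements,
   i.e. only the upper half of b is transferred back to the host. */
void fill_upper_half(int s, float *restrict b)
{
    #pragma acc parallel loop copyout(b[s/2 : s/2])
    for (int i = s/2; i < s; ++i)
        b[i] = 2.0f * i;
}
```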

Data Movement (in examples): acc update

Halo exchange between two MPI ranks (rank 0 and rank 1):

// Host-only version
for (t = 0; t < T; t++) {
    // Compute x
    MPI_SENDRECV(x, ...)
    // Adjust x
}

// OpenACC version
#pragma acc data copy(x[0:n])
{
    for (t = 0; t < T; t++) {
        // Compute x on device
        #pragma acc update host(x[0:n])
        MPI_SENDRECV(x, ...)
        #pragma acc update device(x[0:n])
        // Adjust x on device
    }
}

Directives: Update

The update executable directive moves data from GPU to host, or from host to GPU. It is used to update existing data after it has changed in its corresponding copy.

C/C++:   #pragma acc update host|device [clauses]
Fortran: !$acc update host|device [clauses]

Data movement can be conditional or asynchronous. Update clauses, C/C++ and Fortran:
- host(list): list variables are copied from the accelerator to the host
- device(list): list variables are copied from the host to the accelerator
- if(cond): if cond is true, move the data
- async[(int-expr)]: executes asynchronously, see Tuning slides
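The exchange pattern above can be completed into a self-contained sketch. The MPI call is replaced by a hypothetical host-side stand-in, and with a non-OpenACC compiler the pragmas are no-ops, so the routine computes the same result either way:

```c
/* Keep x resident on the device across iterations, but refresh the
   host copy around a host-only step (a stand-in for MPI_SENDRECV). */
void iterate(int n, int steps, float *restrict x)
{
    #pragma acc data copy(x[0:n])
    for (int t = 0; t < steps; ++t) {
        #pragma acc parallel loop present(x[0:n])
        for (int i = 0; i < n; ++i)
            x[i] += 1.0f;                  /* compute x on device */
        #pragma acc update host(x[0:n])    /* device -> host      */
        x[0] = x[n-1];                     /* host-side exchange  */
        #pragma acc update device(x[0:n])  /* host -> device      */
    }
}
```

Without the two update directives, the host-side step would read a stale copy of x under a real OpenACC compiler.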


Global Memory Throughput (Recap)

- Strive for perfect coalescing: a warp should access a contiguous region; align the starting address (may require padding); the data layout matters (e.g. SoA over AoS)
- Have enough concurrent accesses to saturate the bus: process several elements per thread; launch enough threads to cover access latency
- Try caching or non-caching memory loads. PGI: caching is the default; non-caching with the experimental compiler option -Mx,180,8

[Figure: a caching load, addresses 0-448 in 32-byte steps accessed by one warp]
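The SoA-over-AoS advice can be made concrete in C (the types and names below are illustrative, not from the slides). With a struct of arrays, consecutive loop iterations, i.e. consecutive threads of a warp, read floats 4 bytes apart, which coalesces; with an array of structs the same reads are 12 bytes apart:

```c
typedef struct { float x, y, z; } PointAoS;      /* array of structs */
typedef struct { float *x, *y, *z; } PointsSoA;  /* struct of arrays */

/* AoS: thread i reads p[i].x, stride 12 bytes -> poor coalescing */
float sum_x_aos(int n, const PointAoS *restrict p)
{
    float s = 0.0f;
    #pragma acc parallel loop reduction(+:s)
    for (int i = 0; i < n; ++i) s += p[i].x;
    return s;
}

/* SoA: thread i reads p.x[i], stride 4 bytes -> coalesced */
float sum_x_soa(int n, PointsSoA p)
{
    float s = 0.0f;
    #pragma acc parallel loop reduction(+:s)
    for (int i = 0; i < n; ++i) s += p.x[i];
    return s;
}
```

Both variants compute the same sum; only the memory access pattern on the device differs.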


Shared Memory

- Shared memory (smem) per SM(X): tens of KB
- Inter-thread communication within a block; needs synchronization to avoid RAW/WAR/WAW hazards
- Low-latency, high-throughput memory
- Cache data in smem to reduce gmem accesses

#pragma acc kernels copyout(b[0:n][0:n]) copyin(a[0:n][0:n])
#pragma acc loop gang vector
for (int i = 1; i < N-1; ++i) {
    #pragma acc loop gang vector
    for (int j = 1; j < N-1; ++j) {
        #pragma acc cache(a[i-1:i+1][j-1:j+1])
        b[i][j] = (a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j]) / 4;
    }
}

PGI feedback:
  Accelerator kernel generated
  65, #pragma acc loop gang, vector(2)
  67, #pragma acc loop gang, vector(128)
  Cached references to size [(y+2)x(x+2)] block of 'a'

Directives: Caching

The cache construct prioritizes data for placement in the highest level of the data cache on the GPU.

C/C++:   #pragma acc cache(list)
Fortran: !$acc cache(list)

Use it at the beginning of the loop body or in front of the loop. Sometimes the PGI compiler ignores the cache construct; check the compiler feedback.
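The stencil from the previous slide, completed into a runnable routine over a flat array (a 1-D indexing variant of the slide's 2-D code; the cache shape is a sketch, and serially all pragmas are no-ops):

```c
/* 4-point stencil on an N x N grid stored row-major in a flat array.
   acc cache asks for the three rows of a around row i to be staged
   in on-chip (shared) memory. Interior points only; borders untouched. */
void stencil4(int N, float *restrict b, const float *restrict a)
{
    #pragma acc parallel loop copyin(a[0:N*N]) copyout(b[0:N*N])
    for (int i = 1; i < N - 1; ++i) {
        #pragma acc loop vector
        for (int j = 1; j < N - 1; ++j) {
            #pragma acc cache(a[(i-1)*N : 3*N])
            b[i*N + j] = (a[i*N + j - 1] + a[i*N + j + 1]
                        + a[(i-1)*N + j] + a[(i+1)*N + j]) / 4.0f;
        }
    }
}
```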


Asynchronous Operations

Definitions:
- Synchronous: control does not return until the accelerator action is complete
- Asynchronous: control returns immediately, which allows heterogeneous computing (CPU + GPU)

Asynchronous operations with the async clause:
- Kernel execution: kernels, parallel
- Data movement: update (PGI only: also the data construct)

#pragma acc kernels async
for (int i = 0; i < n; i++) {
    // ...
}
// do work on host

Directives: the async Clause

The async clause executes a parallel, kernels or update operation asynchronously while the host process continues with the code.

C/C++:   #pragma acc parallel|kernels|update ... async[(scalar-int-expr)]
Fortran: !$acc parallel|kernels|update ... async[(scalar-int-expr)]

- The optional integer argument can be seen as a CUDA stream number
- The integer argument can be used in a wait directive
- Async activities with the same argument are executed in order
- Async activities with different arguments are executed in any order relative to each other

Streams

Streams are a CUDA concept: a stream is a sequence of operations that execute in issue order on the GPU.
- The same int-expr in the async clause means one stream
- Different streams/int-exprs execute in any order relative to each other

Tips for debugging: disable async (PGI only)
- Use int-expr -1 in the async clause (PGI 13.x)
- Set ACC_SYNCHRONOUS=1

Wait on Asynchronous Operations

Synchronize async operations with the wait directive: wait for completion of an asynchronous activity (all of them, or a certain stream).

#pragma acc kernels loop async
for (int i = 0; i < n/2; i++) { /* do work on device: x[0:n/2] */ }
// do work on host: x[n/2:n]
#pragma acc update host(x[0:n/2]) async
#pragma acc wait

#pragma acc kernels loop async(1)
for (int i = 0; i < n; i++) { /* do work on device: x[0:n] */ }
#pragma acc update device(y[0:m]) async(2)
#pragma acc kernels loop async(2)
for (int i = 0; i < m; i++) { /* do work on device: y[0:m] */ }
#pragma acc update host(x[0:n]) async(1)
// do work on host: z[0:l]
#pragma acc wait(1)
// do work on host: x[0:n]
#pragma acc wait(2)

[Figure: timeline of kernels (K) and memory copies (M) on streams 1 and 2]

Directives: wait

The host thread waits for completion of an asynchronous activity. If an integer expression is specified, it waits for all async activities with the same value; otherwise, the host waits until all async activities have completed.

C/C++:   #pragma acc wait [(scalar-int-expr)]
Fortran: !$acc wait [(scalar-int-expr)]

Runtime routines:
- int acc_async_test(int): tests for completion of all associated async activities; returns a nonzero value/.true. if all have completed
- int acc_async_test_all(): tests for completion of all async activities
- acc_async_wait(int): waits for completion of all associated async activities; the routine will not return until the latest async activity has completed
- acc_async_wait_all(): waits for completion of all async activities
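A minimal sketch of the two-queue pattern (the routine name is hypothetical; serially the async/wait pragmas are ignored and the loops run in order, producing the same result):

```c
/* Two independent updates issued into async queues 1 and 2; the host
   could do other work here before blocking in the wait directives. */
void scale_both(int n, float *restrict x, float *restrict y)
{
    #pragma acc parallel loop async(1) copy(x[0:n])
    for (int i = 0; i < n; ++i) x[i] *= 2.0f;

    #pragma acc parallel loop async(2) copy(y[0:n])
    for (int i = 0; i < n; ++i) y[i] *= 3.0f;

    #pragma acc wait(1)  /* block until queue 1 has drained */
    #pragma acc wait(2)  /* block until queue 2 has drained */
}
```

Because the two loops touch disjoint arrays, placing them in different queues is safe; with a single queue they would serialize on the device.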


Multiple GPUs

Several GPUs within one compute node can be used within one program (or use MPI), e.g. by assigning one device to each CPU thread/process.

#include <openacc.h>

acc_set_device_num(0, acc_device_nvidia);
#pragma acc kernels async
// ...
acc_set_device_num(1, acc_device_nvidia);
#pragma acc kernels async
// ...

Multiple GPUs: OpenMP or MPI

// OpenMP
#include <openacc.h>
#include <omp.h>

#pragma omp parallel num_threads(12)
{
    int numdev = acc_get_num_devices(acc_device_nvidia);
    int devid = omp_get_thread_num() % numdev;
    acc_set_device_num(devid, acc_device_nvidia);
}

// MPI
#include <openacc.h>
#include <mpi.h>

int myrank;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
int numdev = acc_get_num_devices(acc_device_nvidia);
int devid = myrank % numdev;
acc_set_device_num(devid, acc_device_nvidia);

Runtime Library Routines

Interface to runtime routines & data types:
- C/C++: #include "openacc.h"
- Fortran: use openacc

Initialization of the device, e.g. to exclude initialization time from computation time:
- C/C++, Fortran: acc_init(devicetype)


Development Tips

- Favorable for parallelization: (nested) loops with large loop counts, to compensate (at least) the data movement overhead, and with independent loop iterations
- Think about data availability on host/GPU: use data regions to avoid unnecessary data transfers; specify array shaping explicitly (may adjust for alignment)
- Inline function calls in directive regions: PGI compiler flag -Minline or -Minline=levels:<N>; call support is intended for OpenACC 2.0
- Conditional compilation for OpenACC: macro _OPENACC
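Conditional compilation with the _OPENACC macro can be sketched as follows; the macro is defined by OpenACC-aware compilers, so with a plain host compiler the fallback branch is taken (function name is illustrative):

```c
/* Returns a label for the active programming model; with a plain C
   compiler _OPENACC is undefined and the host branch is compiled. */
const char *compute_mode(void)
{
#ifdef _OPENACC
    return "openacc";   /* compiled by an OpenACC compiler */
#else
    return "host";      /* plain host compilation */
#endif
}
```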

Development Tips

Using pointers (C/C++):
- Problem: the compiler cannot determine whether loop iterations that access pointers are independent, so no parallelization is possible
- Solutions (if the independence is actually there): use the restrict keyword (float *restrict ptr), use the compiler flag -alias=ansi, or use the OpenACC loop clause independent

Verifying execution on the GPU:
- See the PGI compiler feedback; the term "Accelerator kernel generated" is needed
- See information at runtime (more details in the Tools slides): export ACC_NOTIFY=1
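The restrict remedy can be shown concretely (hypothetical names): without restrict, or the independent clause, the compiler must assume out and in may alias and cannot prove the iterations independent.

```c
/* restrict promises the compiler that out and in do not overlap, so
   the iterations are provably independent and can be parallelized. */
void add_one(int n, float *restrict out, const float *restrict in)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        out[i] = in[i] + 1.0f;
}
```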


Summary

- Constructs for basic parallelization on the GPU: kernels, parallel, loop, data
- Optimizations are possible: loop schedules (gang, worker, vector), software caching (cache), asynchronous operations (async, wait), and interoperability with low-level kernels & libraries (not covered here)
- The proposal for OpenACC 2.0 brings new features
- Often a productive development process, but knowledge about GPU programming concepts and hardware is still needed; productivity may come at the cost of performance
- An important step towards (OpenMP) standardization: the OpenMP 4.0 Release Candidate includes accelerator directives

German Heterogeneous Computing Group (GHCG)
- Usergroup meeting: 23.5.-24.5.2013 (2 half days)
- Workshop on programming accelerators: 22.5.-23.5.2013 (2 half days)
- Venue: Aachen, RZ
- www.ghc-group.org