OpenACC Basics: Directive-based GPGPU Programming


OpenACC Basics: Directive-based GPGPU Programming
Sandra Wienke, M.Sc., wienke@rz.rwth-aachen.de
Center for Computing and Communication, RWTH Aachen University
Rechen- und Kommunikationszentrum (RZ)
PPCES, March 2013

Agenda
- Motivation & Overview
- Execution & Memory Model
- Directives (in examples)
  - Offload regions (kernels, parallel & loops)
  - Data movement (data region, data clauses, array shaping, update data)
- Runtime Library Routines
- Development Tips
- Summary

Example SAXPY: CPU

SAXPY = Single-precision real Alpha X Plus Y, i.e. y = a*x + y.

    void saxpycpu(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main(int argc, const char* argv[])
    {
        int n = 10240;
        float a = 2.0f;
        float *x = (float*) malloc(n * sizeof(float));
        float *y = (float*) malloc(n * sizeof(float));

        // Initialize x, y
        for (int i = 0; i < n; ++i) {
            x[i] = i;
            y[i] = 5.0*i - 1.0;
        }

        // Invoke serial SAXPY kernel
        saxpycpu(n, a, x, y);

        free(x); free(y);
        return 0;
    }

Example SAXPY: OpenACC

    void saxpyopenacc(int n, float a, float *x, float *y)
    {
        #pragma acc parallel loop vector_length(256)
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main(int argc, const char* argv[])
    {
        int n = 10240;
        float a = 2.0f;
        float *x = (float*) malloc(n * sizeof(float));
        float *y = (float*) malloc(n * sizeof(float));

        // Initialize x, y
        for (int i = 0; i < n; ++i) {
            x[i] = i;
            y[i] = 5.0*i - 1.0;
        }

        // Invoke OpenACC SAXPY kernel
        saxpyopenacc(n, a, x, y);

        free(x); free(y);
        return 0;
    }

Example SAXPY: CUDA

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] = a * x[i] + y[i];
        }
    }

    int main(int argc, char* argv[])
    {
        int n = 10240;
        float a = 2.0f;
        float *h_x, *h_y;   // Pointers to CPU memory
        h_x = (float*) malloc(n * sizeof(float));
        h_y = (float*) malloc(n * sizeof(float));

        // Initialize h_x, h_y
        for (int i = 0; i < n; ++i) {
            h_x[i] = i;
            h_y[i] = 5.0*i - 1.0;
        }

        float *d_x, *d_y;   // Pointers to GPU memory

        // 1. Allocate data on GPU + transfer data to GPU
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

        // 2. Launch parallel SAXPY kernel
        dim3 threadsPerBlock(128);
        dim3 blocksPerGrid(n / threadsPerBlock.x);
        saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, a, d_x, d_y);

        // 3. Transfer result to CPU + free data on GPU
        cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x); cudaFree(d_y);

        free(h_x); free(h_y);
        return 0;
    }

Introduction

Nowadays, low-level GPU APIs (like CUDA or OpenCL) are often used, which makes for an unproductive development process. A directive-based programming model delegates responsibility for the low-level GPU programming tasks to the compiler:
- Data movement
- Kernel execution
- Awareness of the particular GPU type

OpenACC is such a directive-based model for offloading compute-intensive loops to an attached accelerator.

Overview: OpenACC
- Open industry standard, aiming at portability
- Introduced by CAPS, Cray, NVIDIA and PGI (Nov. 2011)
- Supports C, C++ and Fortran
- Targets NVIDIA GPUs, AMD GPUs & Intel MIC (now or in the near future)

Timeline
- Nov. 2011: Specification 1.0
- Q1/Q2 2012: the Cray, PGI & CAPS compilers understand parts of the OpenACC API
- Nov. 2012: Proposed additions for OpenACC 2.0; more supporters, e.g. TUD, RogueWave

Execution & Memory Model (Recap)
- Host-directed execution model: the CPU controls the program and offloads regions to the GPU, which is connected via PCIe
- Separate host and device memories
- Many tasks can be done by the compiler/runtime; the programming is user-directed

(Original slide figure: CPU and GPU with PCIe link and numbered data/execution steps.)

Directives: Syntax

    C/C++:   #pragma acc directive-name [clauses]
    Fortran: !$acc directive-name [clauses]

Iterative development process; the compiler feedback is helpful to see:
- Whether an accelerator kernel could be generated
- Which loop schedule is used
- Where/which data is copied

Directives (in examples): SAXPY serial (host)

    int main(int argc, const char* argv[])
    {
        int n = 10240;
        float a = 2.0f;
        float b = 3.0f;
        float *x = (float*) malloc(n * sizeof(float));
        float *y = (float*) malloc(n * sizeof(float));

        // Initialize x, y

        // Run SAXPY TWICE
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
        for (int i = 0; i < n; ++i) {
            y[i] = b * x[i] + y[i];
        }

        free(x); free(y);
        return 0;
    }

Directives (in examples): SAXPY acc kernels

    int main(int argc, const char* argv[])
    {
        int n = 10240;
        float a = 2.0f;
        float b = 3.0f;
        float *x = (float*) malloc(n * sizeof(float));
        float *y = (float*) malloc(n * sizeof(float));

        // Initialize x, y

        // Run SAXPY TWICE
        #pragma acc kernels
        {
            for (int i = 0; i < n; ++i) {   // Kernel 1
                y[i] = a * x[i] + y[i];
            }
            for (int i = 0; i < n; ++i) {   // Kernel 2
                y[i] = b * x[i] + y[i];
            }
        }

        free(x); free(y);
        return 0;
    }

PGI compiler feedback:

    main:
        29, Generating copyin(x[:10240])
            Generating copy(y[:10240])
            Generating compute capability 2.0 binary
        31, Loop is parallelizable
            Accelerator kernel generated
            31, #pragma acc loop gang, vector(32)
        37, Loop is parallelizable
            Accelerator kernel generated
            37, #pragma acc loop gang, vector(32)

Directives (in examples): SAXPY acc parallel, acc loop, data clauses

    int main(int argc, const char* argv[])
    {
        int n = 10240;
        float a = 2.0f;
        float b = 3.0f;
        float *x = (float*) malloc(n * sizeof(float));
        float *y = (float*) malloc(n * sizeof(float));

        // Initialize x, y

        // Run SAXPY TWICE
        #pragma acc parallel copy(y[0:n]) copyin(x[0:n])
        #pragma acc loop
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }

        #pragma acc parallel copy(y[0:n]) copyin(x[0:n])
        #pragma acc loop
        for (int i = 0; i < n; ++i) {
            y[i] = b * x[i] + y[i];
        }

        free(x); free(y);
        return 0;
    }

Directives (in examples): SAXPY acc data

    int main(int argc, const char* argv[])
    {
        int n = 10240;
        float a = 2.0f;
        float b = 3.0f;
        float *x = (float*) malloc(n * sizeof(float));
        float *y = (float*) malloc(n * sizeof(float));

        // Initialize x, y

        // Run SAXPY TWICE
        #pragma acc data copyin(x[0:n])
        {
            #pragma acc parallel copy(y[0:n])   // x is present(x[0:n]) here
            #pragma acc loop
            for (int i = 0; i < n; ++i) {
                y[i] = a * x[i] + y[i];
            }

            #pragma acc parallel copy(y[0:n])   // x is present(x[0:n]) here
            #pragma acc loop
            for (int i = 0; i < n; ++i) {
                y[i] = b * x[i] + y[i];
            }
        }

        free(x); free(y);
        return 0;
    }

OpenACC: PGI Compiler

The OpenACC compiler in the RWTH Cluster.

Environment: PGI compiler

    module switch intel pgi    # switch default Intel compiler to PGI compiler

Compiling:

    $CC -acc -ta=nvidia,cc20,5.0 -Minfo=accel saxpy.c

- $CC: if the PGI module is loaded, this maps to pgcc
- -acc: tells the compiler to recognize OpenACC directives
- -ta=nvidia: specifies the target architecture, here NVIDIA GPUs
- cc20: optional; specifies target compute capability 2.0
- 5.0: optional; uses CUDA Toolkit 5.0 for code generation
- -Minfo=accel: optional; compiler feedback for accelerator code

Directives: Offload regions

An offload region maps to a CUDA kernel function.

acc parallel

    C/C++:   #pragma acc parallel [clauses]
    Fortran: !$acc parallel [clauses] ... !$acc end parallel

- The user is responsible for finding parallelism (loops)
- acc loop is needed for worksharing
- No automatic synchronization between several loops

acc kernels

    C/C++:   #pragma acc kernels [clauses]
    Fortran: !$acc kernels [clauses] ... !$acc end kernels

- The compiler is responsible for finding parallelism (loops)
- The acc loop directive is only needed for tuning
- Automatic synchronization between loops within a kernels region
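To illustrate the difference, a minimal sketch (the arrays a and b and the length n are hypothetical):

    // kernels: the compiler analyzes the region, generates one kernel per
    // loop, and synchronizes between the two loops automatically.
    #pragma acc kernels
    {
        for (int i = 0; i < n; ++i) a[i] = 2.0f * a[i];
        for (int i = 0; i < n; ++i) b[i] = a[i] + 1.0f;
    }

    // parallel: the user asserts the parallelism and adds acc loop
    // explicitly for worksharing.
    #pragma acc parallel
    #pragma acc loop
    for (int i = 0; i < n; ++i) a[i] = 2.0f * a[i];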

Directives: Offload regions, clauses for compute constructs (parallel, kernels)

    if(condition)             If condition is true, the accelerator version is executed
    async[(int-expr)]         Executes asynchronously (see Tuning slides)
    num_gangs(int-expr)       Defines the number of parallel gangs (parallel only)
    num_workers(int-expr)     Defines the number of workers within a gang (parallel only)
    vector_length(int-expr)   Defines the length for vector operations (parallel only)
    reduction(op:list)        Reduction with op at the end of the region (parallel only)
    copy(list)                H2D copy at region start + D2H copy at region end
    copyin(list)              Only H2D copy at region start
    copyout(list)             Only D2H copy at region end
    create(list)              Allocates data on the device, no copy to/from the host
    present(list)             Data is already on the device
    present_or_*(list)        Tests whether data is on the device; if not, transfers it
    deviceptr(list)           See Tuning slides
    private(list)             Copy of each list item for each parallel gang (parallel only)
    firstprivate(list)        As private, plus copy initialization from the host (parallel only)
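A sketch of how several of these clauses combine on one parallel construct (the gang/vector sizes are arbitrary; x and n are hypothetical):

    float sum = 0.0f;
    // 64 gangs with vector length 128; x is copied to the device once;
    // the partial sums are reduced at the end of the region.
    #pragma acc parallel num_gangs(64) vector_length(128) copyin(x[0:n]) reduction(+:sum)
    #pragma acc loop reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += x[i];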

Directives: Loops

Share the work of loops: the loop work gets distributed among the threads on the GPU (in a certain loop schedule).

    C/C++:   #pragma acc loop [clauses]
    Fortran: !$acc loop [clauses]

With kernels loop, the loop schedule is defined by an int-expr in gang, worker or vector (instead of num_gangs etc. as with parallel).

Loop clauses:

    gang                 Distributes work into thread blocks
    worker               Distributes work into warps
    vector               Distributes work into threads within a warp/thread block
    seq                  Executes the loop sequentially on the device
    collapse(n)          Collapses n tightly nested loops
    independent          Asserts independent loop iterations (kernels loop only)
    reduction(op:list)   Reduction with op
    private(list)        Private copy for each loop iteration
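A minimal sketch of these loop clauses on a 2D initialization (a, n and m are hypothetical):

    // collapse(2) merges both loops into one iteration space; independent
    // asserts that the iterations do not depend on each other.
    #pragma acc kernels loop collapse(2) independent
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            a[i*m + j] = 0.0f;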

Directives: Data regions & clauses

A data region decouples data movement from offload regions.

    C/C++:   #pragma acc data [clauses]
    Fortran: !$acc data [clauses] ... !$acc end data

Data clauses trigger the data movement of the denoted arrays:

    if(cond)             If cond is true, move data to the accelerator
    copy(list)           H2D copy at region start + D2H copy at region end
    copyin(list)         Only H2D copy at region start
    copyout(list)        Only D2H copy at region end
    create(list)         Allocates data on the device, no copy to/from the host
    present(list)        Data is already on the device
    present_or_*(list)   Tests whether data is on the device; if not, transfers it
    deviceptr(list)      See Tuning slides
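A small sketch combining three of these clauses (in, tmp, res and n are hypothetical): tmp lives only on the device, in is copied in once, res is copied back once.

    #pragma acc data copyin(in[0:n]) create(tmp[0:n]) copyout(res[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) tmp[i] = 2.0f * in[i];

        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) res[i] = tmp[i] + 1.0f;
    }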

Misc on Loops

Combined directives:

    #pragma acc kernels loop
    for (int i = 0; i < n; ++i) { /* ... */ }

    #pragma acc parallel loop
    for (int i = 0; i < n; ++i) { /* ... */ }

Reductions:

    #pragma acc parallel
    #pragma acc loop reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += i;
    }

The PGI compiler can often recognize reductions on its own; see the compiler feedback, e.g.: "Sum reduction generated for var".

Data Movement

Data clauses can be used on data, kernels or parallel constructs: copy, copyin, copyout, present, present_or_copy, create, deviceptr.

Array shaping
- The compiler sometimes cannot determine the size of arrays.
- Specify the size explicitly using data clauses and an array shape.
- C/C++ uses [lower bound : size]:

    #pragma acc data copyin(a[0:length]) copyout(b[s/2:s/2])

- Fortran uses (lower bound : upper bound):

    !$acc data copyin(a(0:length-1)) copyout(b(s/2:s))

Directives (in examples): acc update

Halo exchange between two MPI ranks: x is computed on the device, but the MPI communication happens with host buffers.

CPU-only version (per rank):

    for (t = 0; t < T; t++) {
        // Compute x
        MPI_Sendrecv(x, ...);
        // Adjust x
    }

OpenACC version (per rank):

    #pragma acc data copy(x[0:n])
    {
        for (t = 0; t < T; t++) {
            // Compute x on device
            #pragma acc update host(x[0:n])
            MPI_Sendrecv(x, ...);
            #pragma acc update device(x[0:n])
            // Adjust x on device
        }
    }

Directives: Update

update is an executable directive:
- Moves data from GPU to host, or from host to GPU
- Used to update existing data after it has changed in its corresponding copy

    C/C++:   #pragma acc update host(list) | device(list) [clauses]
    Fortran: !$acc update host(list) | device(list) [clauses]

Data movement can be conditional or asynchronous.

Update clauses:

    host(list)          The list variables are copied from the accelerator to the host
    device(list)        The list variables are copied from the host to the accelerator
    if(cond)            If cond is true, move the data
    async[(int-expr)]   Executes asynchronously (see Tuning slides)

Runtime library routines

Interface to runtime routines & data types:

    C/C++:   #include "openacc.h"
    Fortran: use openacc

Initialization of the device, e.g. to exclude the initialization time from the computation time:

    C/C++, Fortran: acc_init(devicetype)

Further routines specify the device to run on.
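A minimal C sketch of device initialization and selection, assuming an NVIDIA device (acc_device_nvidia, acc_init and acc_set_device_num come from the OpenACC runtime API):

    #include <openacc.h>

    int main(void)
    {
        // Initialize the NVIDIA device up front, so that later timings do
        // not include the one-time device initialization cost.
        acc_init(acc_device_nvidia);

        // Select which device of this type to run on (device 0 here).
        acc_set_device_num(0, acc_device_nvidia);

        /* ... offload regions ... */

        return 0;
    }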

Development Tips

- Favorable for parallelization: (nested) loops
  - Large loop counts to compensate (at least) the data movement overhead
  - Independent loop iterations
- Think about data availability on host/GPU
  - Use data regions to avoid unnecessary data transfers
  - Specify data array shaping (may adjust for alignment)
- Inline function calls in directive regions
  - PGI compiler flag: -Minline or -Minline=levels:<N>
  - Support for calls is intended for OpenACC 2.0
- Conditional compilation for OpenACC via the macro _OPENACC (see the sketch below)
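A small sketch of conditional compilation with _OPENACC (init_device is a hypothetical helper):

    #ifdef _OPENACC
    #include <openacc.h>
    #endif

    void init_device(void)
    {
    #ifdef _OPENACC
        // Only compiled when an OpenACC-aware compiler is used.
        acc_init(acc_device_nvidia);
    #endif
    }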

Development Tips

Using pointers (C/C++)
- Problem: the compiler cannot determine whether loop iterations that access pointers are independent, so no parallelization is possible.
- Solutions (if the independence is actually there):
  - Use the restrict keyword: float *restrict ptr;
  - Use the compiler flag -ansi-alias
  - Use the OpenACC loop clause independent

Verifying execution on the GPU
- Check the PGI compiler feedback: the phrase "Accelerator kernel generated" must appear.
- Check the runtime information (more details in the Tools slides): export ACC_NOTIFY=1
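A minimal sketch of the restrict fix (scale is a hypothetical function):

    // Without restrict, x and y could alias, so the compiler must assume
    // dependent iterations; restrict promises they do not overlap.
    void scale(int n, float a, const float *restrict x, float *restrict y)
    {
        #pragma acc kernels loop independent
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i];
    }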

Summary
- Constructs for basic parallelization on the GPU
- Often a productive development process, but knowledge about GPU programming concepts and hardware is still needed
- Productivity may come at the cost of performance, but the ratio of effort to performance is potentially good
- Important step towards (OpenMP) standardization:
  - "OpenMP for Accelerators" technical report in Nov. 2012
  - OpenMP 4.0 specification intended for ISC 2013
  - Supported by many vendors/architectures: NVIDIA and AMD GPUs, Intel Many Integrated Core (MIC) architecture