A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators




A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators
Sandra Wienke (1,2), Christian Terboven (1,2), James C. Beyer (3), Matthias S. Müller (1,2)
(1) IT Center, RWTH Aachen University  (2) JARA-HPC, Aachen  (3) Cray Inc.
Euro-Par Conference (Porto, Portugal), August 2014

Agenda
- Motivation
- Overview on OpenACC & OpenMP Device Capabilities
- Expressiveness by Pattern-Based Comparison: Map, Stencil, Fork-Join, Superscalar Sequence, Parallel Update
- Summary
- Conclusion & Outlook

Motivation
- Today and in the future: heterogeneity & specialized accelerators. Consequence: parallel programming gets more complex.
- Directive-based programming models may simplify software design. Two well-promoted approaches: OpenACC and OpenMP.
- User question: which programming model to use? Criteria: power/expressiveness, performance, long-term perspective. Targeted hardware: GPU, Xeon Phi, ...
- Expressiveness is assessed by a pattern-based approach. Patterns are basic structural entities of algorithms. Used here: structured parallel patterns [1], e.g. map, stencil, reduction, fork-join, superscalar sequence, nesting, parallel update and geometric decomposition.

[1] McCool, M., Reinders, J., Robison, A.: Structured Parallel Programming: Patterns for Efficient Computation. Morgan Kaufmann, first edn. (2012)

Overview on OpenACC & OpenMP Device Capabilities
- Directive-based accelerator programming for C/C++ and Fortran.
- OpenACC: initiated by Cray, CAPS, PGI, NVIDIA; specification v1.0 in 2011, v2.0 in 2013.
- OpenMP: broad (shared-memory) standard; specification v4.0 in 2013.

Comparison of constructs (OpenACC | OpenMP | description):
- parallel | target | offload work
- parallel | teams, parallel | create threads running in parallel
- kernels | - | compiler finds parallelism
- loop | distribute, do, for, simd | worksharing across parallel units
- data | target data | manage data transfers (block)
- enter data, exit data | - | unstructured data lifetime: to and from the device
- update | target update | data movement in the data environment
- host_data | - | interoperability with CUDA/libraries
- cache | - | advice to put objects into closer memory
- tile | - | strip-mining of data
- declare | declare target | declare global, static data
- routine | declare target | for function calls
- async(int) | task depend | asynchronous execution with dependencies
- wait | taskwait | synchronization of streams/tasks
- async wait | - | asynchronous waiting
- parallel in parallel | parallel in parallel/team | nested parallelism
- device_type | - | device-specific clause tuning
- atomic | atomic | atomic operations
- - | sections | non-iterative worksharing
- - | critical | block executed by one thread at a time (mutual exclusion)
- - | master | block executed by the master thread
- - | single | block executed by one thread
- - | barrier | synchronization of threads

We refer to the following specifications: OpenACC 2.0 and OpenMP 4.0.
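
To make the construct correspondence concrete, here is a minimal sketch (not from the slides; the function and variable names are mine) of the same scaling loop offloaded once with OpenACC and once with the matching OpenMP 4.0 constructs from the table:

    // Hypothetical example: the same loop offloaded with both models.
    void scale_acc(int n, double *x, double s) {
        // OpenACC: offload + worksharing in one combined construct
        #pragma acc parallel loop copy(x[0:n])
        for (int i = 0; i < n; i++)
            x[i] *= s;
    }

    void scale_omp(int n, double *x, double s) {
        // OpenMP 4.0: target offloads, teams/distribute/parallel for workshare
        #pragma omp target map(tofrom: x[0:n])
        #pragma omp teams distribute parallel for
        for (int i = 0; i < n; i++)
            x[i] *= s;
    }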

Overview on OpenACC & OpenMP Device Capabilities: Execution & Memory Model
- Host-directed execution.
- Weak device memory model (no synchronization at gang/team level).
- Separate or shared memory between host and device.

[Slide figure: a host with cores, L1/L2 caches and memory, connected to a device with compute units and processing elements (OpenCL terminology).]
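
A small sketch of the host-directed model with separate memories (my own illustration, assuming an axpy-style kernel): the host stays in control, and the data clauses make the host-to-device and device-to-host transfers explicit.

    // Hypothetical illustration of host-directed execution with separate memories.
    void axpy(int n, double a, const double *x, double *y) {
        // Allocate device copies: x is copied in, y is copied in and back out.
        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            // The host launches the compute region on the device.
            #pragma acc parallel loop
            for (int i = 0; i < n; i++)
                y[i] += a * x[i];
        } // y is transferred back to host memory when the data region ends
    }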

Map
Maps different elements of the input data in parallel to an output collection. Elemental function here: B = p * A^T.

OpenACC:

    #pragma acc routine seq
    double f(double p, double aij) { return (p * aij); }
    // [..]
    #pragma acc parallel
    #pragma acc loop gang vector
    for (i = 0; i < n; i++) {
        #pragma acc loop vector
        for (j = 0; j < m; j++) {
            b[j][i] = f(5.0, a[i][j]);
        }
    }

- Parallelism + worksharing: kernels, parallel, loop.
- Levels of parallelism: automatic; tuning via gang/worker/vector. Performance portability possible.
- routine directive for function calls: denotes parallelism (tuning), but calls only from the same context.

OpenMP:

    #pragma omp declare target
    double f(double p, double aij) { return (p * aij); }
    #pragma omp end declare target
    // [..]
    #pragma omp target
    #pragma omp teams distribute
    for (i = 0; i < n; i++) {
        #pragma omp parallel for
        for (j = 0; j < m; j++) {
            b[j][i] = f(5.0, a[i][j]);
        }
    }

- Parallelism + worksharing: target, parallel, for, teams, distribute.
- Levels of parallelism: teams distribute, parallel for. Performance portability difficult.
- declare target for function calls: no parallelism denoted, calls from different contexts, less tuning.
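
For reference, a self-contained variant of the OpenACC map example (my own sketch: the VLA signature, the data clauses and the plain gang/vector split are assumptions; the slide shows only the loop nest):

    // Hypothetical, self-contained variant of the map example above.
    #pragma acc routine seq
    static double f(double p, double aij) { return p * aij; }

    // Computes B = p * A^T with p = 5.0 (C99 variable-length array parameters).
    void map_scale_transpose(int n, int m, double a[n][m], double b[m][n]) {
        #pragma acc parallel copyin(a[0:n][0:m]) copyout(b[0:m][0:n])
        #pragma acc loop gang
        for (int i = 0; i < n; i++) {
            #pragma acc loop vector
            for (int j = 0; j < m; j++)
                b[j][i] = f(5.0, a[i][j]);
        }
    }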

Stencil
Accesses neighboring input elements with fixed offsets. Offers potential for data reuse & cache optimization.

OpenACC:

    #pragma acc parallel
    #pragma acc loop tile(64,4) gang vector
    for (i = 1; i < n-1; i++) {
        for (j = 1; j < m-1; j++) {
            #pragma acc cache(a[i-1:3][j-1:3])
            anew[i][j] = (a[i-1][j] + a[i+1][j] +
                          a[i][j-1] + a[i][j+1]) * 0.25;
        }
    }

- tile: strip-mining; the first number is the block size of the innermost loop.
- cache: data caching; just a hint, the compiler can ignore it. (Performance-wise) needed especially for software-managed memories (GPUs).

OpenMP:

    #pragma omp target
    #pragma omp teams distribute collapse(2)
    for (i = 1; i < n-1; i += 4) {
        for (j = 1; j < m-1; j += 64) {
            #pragma omp parallel for collapse(2)
            for (k = i; k < min(n-1, i+4); k++) {
                for (l = j; l < min(m-1, j+64); l++) {
                    anew[k][l] = (a[k-1][l] + a[k+1][l] +
                                  a[k][l-1] + a[k][l+1]) * 0.25;
                }
            }
        }
    }

- Tiling must be expressed explicitly: more development effort than with the tile clause, performance expected to be similar.
- No caching hints possible: maybe a performance loss on GPUs.
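
One small gap in the OpenMP variant above: plain C has no integer min(), so the tiled loops assume a helper such as the following macro (my addition, not part of the slide):

    // Assumed helper for the tiled OpenMP stencil above; C has no built-in integer min().
    #define MIN(a, b) (((a) < (b)) ? (a) : (b))
    // e.g.: for (k = i; k < MIN(n-1, i+4); k++) ...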

Fork-Join
The workflow is split (forked) into parallel/independent flows and merged (joined) again later.

OpenACC:
- No tasking concept.
- Data parallelism possible.
- Task parallelism: only an approximation via host-directed, nested parallel constructs (see the sketch after the OpenMP example).

OpenMP:

    #pragma omp declare target
    int fib(int n) {
        int x, y;
        if (n < 2) { return n; }
        #pragma omp task shared(x)
        x = fib(n - 1);
        #pragma omp task shared(y)
        y = fib(n - 2);
        #pragma omp taskwait
        return (x + y);
    }
    #pragma omp end declare target

    #pragma omp target
    #pragma omp parallel
    #pragma omp single
    result = fib(n);

- Data parallelism possible.
- Task parallelism possible: sections, tasks.
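
Since OpenACC lacks tasks, the host-directed approximation mentioned above could look roughly like this sketch (my own; work_a/work_b and the data clauses are placeholders): two independent compute regions are launched asynchronously ("fork") and the host waits for both ("join").

    // Hypothetical sketch: approximating fork-join in OpenACC with async queues.
    #pragma acc routine seq
    static double work_a(double v) { return v * 2.0; }   /* placeholder work */
    #pragma acc routine seq
    static double work_b(double v) { return v + 1.0; }   /* placeholder work */

    void fork_join_acc(int n, double *restrict x, double *restrict y) {
        // "fork": two independent branches on different async queues
        #pragma acc parallel loop async(1) copy(x[0:n])
        for (int i = 0; i < n; i++) x[i] = work_a(x[i]);

        #pragma acc parallel loop async(2) copy(y[0:n])
        for (int i = 0; i < n; i++) y[i] = work_b(y[i]);

        // "join": the host waits until both branches have completed
        #pragma acc wait(1, 2)
    }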

Superscalar Sequence
Parallelization of an ordered list of tasks while satisfying their data dependencies. Additional interpretation here: heterogeneity.
Task graph: F = f(A), G = g(B), H = h(F,G), S = s(F).

OpenACC:

    #pragma acc parallel loop async(1)
    // F = f(A)
    #pragma acc parallel loop async(2)
    // G = g(B)
    #pragma acc wait(1,2) async(3)
    #pragma acc parallel loop async(3)
    // H = h(F,G)
    #pragma acc wait(1)
    // S = s(F)
    #pragma acc wait

- Streaming concept: async(int) + wait.
- async: the calling thread asynchronously continues.

OpenMP:

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out:f)
        #pragma omp target teams distribute parallel for
        // F = f(A)
        #pragma omp task depend(out:g)
        #pragma omp target teams distribute parallel for
        // G = g(B)
        #pragma omp task depend(in:f,g) depend(inout:h)
        #pragma omp target teams distribute parallel for
        // H = h(F,G)
        #pragma omp task depend(in:f)
        // S = s(F)
        #pragma omp taskwait
    }

- Tasking concept with task dependencies.
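
For concreteness, a sketch that fills in the OpenACC skeleton above with trivial element-wise operations (entirely my own choices for f, g, h, s and the data handling; the slide leaves them abstract):

    // Hypothetical, filled-in version of the OpenACC skeleton above.
    void superscalar_acc(int n, const double *A, const double *B,
                         double *F, double *G, double *H, double *S) {
        #pragma acc data copyin(A[0:n], B[0:n]) copyout(F[0:n], G[0:n], H[0:n])
        {
            #pragma acc parallel loop async(1)   // F = f(A), stream 1
            for (int i = 0; i < n; i++) F[i] = 2.0 * A[i];

            #pragma acc parallel loop async(2)   // G = g(B), stream 2
            for (int i = 0; i < n; i++) G[i] = B[i] + 1.0;

            #pragma acc wait(1, 2) async(3)      // stream 3 waits for F and G
            #pragma acc parallel loop async(3)   // H = h(F,G)
            for (int i = 0; i < n; i++) H[i] = F[i] * G[i];

            #pragma acc wait(1)                  // the host only needs F for S
            #pragma acc update host(F[0:n])
            for (int i = 0; i < n; i++) S[i] = F[i] - 1.0;   // S = s(F) on the host

            #pragma acc wait                     // join all streams before copy-out
        }
    }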

Parallel Update (1/3)
Our own extension of the pattern definitions: synchronization of data between (separate) memories.

OpenACC:

    void stencilonacc(double **a, double **anew, int n, int m)
    {
        #pragma acc parallel present_or_copy(a[1:n-2][0:m], anew[1:n-2][0:m])
        #pragma acc loop
        // stencil comp.
    }
    // [..]
    #pragma acc data create(anew[0:n][0:m]) copyin(a[0:n][0:m])
    {
        while (iter < iter_max) {
            stencilonacc(a, anew, n, m);
            #pragma acc update host(anew[1:n-2][0:m])
            swaponhost(a, anew, n, m);
            #pragma acc update device(a[1:n-2][0:m])
            iter++;
        }
    }

OpenMP:

    void stencilonacc(double **a, double **anew, int n, int m)
    {
        #pragma omp target map(tofrom: a[1:n-2][0:m], anew[1:n-2][0:m])
        #pragma omp teams distribute parallel for
        // stencil comp.
    }
    // [..]
    #pragma omp target data map(alloc: anew[0:n][0:m]) map(to: a[0:n][0:m])
    {
        while (iter < iter_max) {
            stencilonacc(a, anew, n, m);
            #pragma omp target update from(anew[1:n-2][0:m])
            swaponhost(a, anew, n, m);
            #pragma omp target update to(a[1:n-2][0:m])
            iter++;
        }
    }

Both models cover the required data movement: mapping/allocating data to and from the device via data regions, plus explicit update constructs.
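
swaponhost() is not shown on the slides. One plausible reading, consistent with the surrounding update directives (an assumption on my part), is a host-side copy of the freshly computed values from anew back into a, which is then pushed to the device with update device:

    // Hypothetical sketch of the helper used above (not shown on the slides):
    // copy the updated interior rows of anew back into a on the host.
    void swaponhost(double **a, double **anew, int n, int m) {
        for (int i = 1; i < n - 1; i++)
            for (int j = 0; j < m; j++)
                a[i][j] = anew[i][j];
    }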

Parallel Update (2/3)
Our own extension of the pattern definitions: synchronization of data between (separate) memories.

OpenACC:

    void stencilonacc(double **a, double **anew, int n, int m)
    {
        #pragma acc parallel present(a[1:n-2][0:m], anew[1:n-2][0:m])
        #pragma acc loop
        // stencil comp.
    }
    // [..]
    #pragma acc data create(anew[0:n][0:m]) copyin(a[0:n][0:m]) if(test)
    {
        while (iter < iter_max) {
            stencilonacc(a, anew, n, m);
            #pragma acc update host(anew[1:n-2][0:m])
            swaponhost(a, anew, n, m);
            #pragma acc update device(a[1:n-2][0:m])
            iter++;
        }
    }

- Missing data: error. present causes no data movement; it is an error if the data is not on the device.

OpenMP:

    void stencilonacc(double **a, double **anew, int n, int m)
    {
        #pragma omp target map(tofrom: a[1:n-2][0:m], anew[1:n-2][0:m])
        #pragma omp teams distribute parallel for
        // stencil comp.
    }
    // [..]
    #pragma omp target data map(alloc: anew[0:n][0:m]) map(to: a[0:n][0:m]) if(test)
    {
        while (iter < iter_max) {
            stencilonacc(a, anew, n, m);
            #pragma omp target update from(anew[1:n-2][0:m])
            swaponhost(a, anew, n, m);
            #pragma omp target update to(a[1:n-2][0:m])
            iter++;
        }
    }

- Missing data: fix-up. map implicitly checks for the data and moves it if it is not there.

Parallel Update (3/3)
Our own extension of the pattern definitions: synchronization of data between (separate) memories.

OpenACC:

    class CArray {
    public:
        CArray(int n) {
            a = new double[n];
            #pragma acc enter data create(a[0:n])
        }
        ~CArray() {
            #pragma acc exit data delete(a[0:n])
            delete[] a;
        }
        void fillarray(int n) {
            #pragma acc parallel loop
            for (int i = 0; i < n; i++) { a[i] = i; }
        }
    private:
        double *a;
    };

- Unstructured data lifetimes: enter data / exit data; API functions also exist.
- Use cases: init/fini functions or constructors/destructors.

OpenMP:
- Requires manual deep copies of pointer structures; the OpenMP committee is working on this.
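
To illustrate the last point: since OpenMP 4.0 cannot map the CArray object together with its payload, a manual "deep copy" might look like the following sketch (my own, hypothetical): the member pointer is aliased into a local variable and only the array section is mapped.

    // Hypothetical OpenMP 4.0 counterpart of fillarray(): manual "deep copy"
    // of the payload via a local alias of the member pointer.
    void CArray::fillarray(int n) {
        double *buf = a;                           // host alias of the member pointer
        #pragma omp target map(tofrom: buf[0:n])   // map only the array section
        #pragma omp teams distribute parallel for
        for (int i = 0; i < n; i++)
            buf[i] = i;
    }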

Summary

Parallel pattern | OpenACC | OpenMP
- Map | F (kernels, parallel, loop, gang/worker/vector, routine) | F (target, parallel for, teams distribute, declare target)
- Stencil | I (tile, cache) | -
- Reduction | F (reduction) | F (reduction)
- Fork-Join | - | F (tasks)
- Workpile | - | F (tasks)
- Superscalar Sequence | F (streams) | F (tasks)
- Nesting | F (parallel in parallel) | F- (parallel in target/teams/parallel)
- Parallel Update* | F (copy, data region, update, enter/exit data, error concept) | F- (map, target data, target update, fix-up concept)
- Geometric Decomposition | F (acc_set_device, device_type) | F- (device)

Legend:
- F : supported directly, with a special feature/language construct
- F-: base aspects (but not all) supported directly
- I : implementable easily and efficiently using other features
- - : not implementable easily
(*) new pattern
Following the concept of McCool, M., Reinders, J., Robison, A.: Structured Parallel Programming: Patterns for Efficient Computation. Morgan Kaufmann, first edn. (2012).

Conclusion & Outlook (1/2)
- Currently, OpenACC is one step ahead of OpenMP: fewer general concepts, but more features. Today's recommendation for GPUs: OpenACC. A later port to OpenMP will be easy (using OpenACC 1.0 features, or new OpenMP features).
- Long-term perspective on the development of the standards: our assumption (based on OpenACC and OpenMP committee work) is co-existence. OpenMP might be advantageous in the long term due to its broader user and vendor base (portability). However, the implementation effort is significant, which might imply limited OpenMP functionality within offload regions.

Conclusion & Outlook (2/2)
- Long-term perspective on architectures: shared memory between host and device is supported by both models. But are the prospects for the offload mode promising in general? OpenMP might take the lead with its non-offload features.
- Long-term perspective on performance: we don't expect differences between equivalent approaches (same target hardware and compiler). Dissimilarities might occur where the general concepts differ (e.g. streams vs. tasks). A performance evaluation is future work (including a comparison to other models).

Thank you for your attention!