Introduction to OpenACC Directives. Duncan Poole, NVIDIA Thomas Bradley, NVIDIA


GPUs Reaching a Broader Set of Developers
[Timeline chart, 2004 to present: GPU developer counts grow from the 100,000s to the 1,000,000s, expanding from early adopters (research universities, supercomputing centers, oil & gas) into CAE, CFD, finance, rendering, data analytics, life sciences, defense, weather, climate, and plasma physics.]

3 Ways to Accelerate Applications
- Libraries: drop-in acceleration
- OpenACC Directives: easily accelerate applications
- Programming Languages: maximum flexibility
CUDA libraries are interoperable with OpenACC. CUDA languages are interoperable with OpenACC, too!

GPU Accelerated Libraries: Drop-in Acceleration for Your Applications
- NVIDIA cuBLAS, NVIDIA cuRAND, NVIDIA cuSPARSE, NVIDIA cuFFT
- NVIDIA NPP: vector signal image processing
- GPU-accelerated linear algebra; matrix algebra on GPU and multicore; sparse linear algebra
- IMSL Library; building-block algorithms for CUDA; C++ STL features for CUDA

OpenACC Directives: Simple Compiler Hints
Starting from your original Fortran or C code, a compiler hint marks a region for the GPU while the serial code stays on the CPU. The compiler parallelizes the code, and it works on many-core GPUs and multicore CPUs:

    Program myscience
      ... serial code ...
    !$acc kernels
      do k = 1, n1
        do i = 1, n2
          ... parallel code ...
        enddo
      enddo
    !$acc end kernels
      ...
    End Program myscience

Familiar to OpenMP Programmers

OpenMP (CPU):

    main() {
        double pi = 0.0; long i;

    #pragma omp parallel for reduction(+:pi)
        for (i = 0; i < n; i++) {
            double t = (double)((i + 0.05) / n);
            pi += 4.0 / (1.0 + t*t);
        }

        printf("pi = %f\n", pi/n);
    }

OpenACC (CPU + GPU):

    main() {
        double pi = 0.0; long i;

    #pragma acc kernels
        for (i = 0; i < n; i++) {
            double t = (double)((i + 0.05) / n);
            pi += 4.0 / (1.0 + t*t);
        }

        printf("pi = %f\n", pi/n);
    }

OpenACC: Open Programming Standard for Parallel Computing
"OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan." -- Buddy Bland, Titan Project Director, Oak Ridge National Lab
"OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP." -- Michael Wong, CEO OpenMP Directives Board

OpenACC: The Standard for GPU Directives
- Simple: directives are the easy path to accelerate compute-intensive applications
- Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors
- Powerful: GPU directives allow complete access to the massive parallel power of a GPU

High-Level
- Compiler directives to specify parallel regions in C & Fortran
- Offload parallel regions
- Portable across OSes, host CPUs, accelerators, and compilers
- Create high-level heterogeneous programs without explicit accelerator initialization and without explicit data or program transfers between host and accelerator

High-Level... with Low-Level Access
- The programming model allows programmers to start simple
- The compiler gives additional guidance: loop mappings, data location, and other performance details
- Compatible with other GPU languages and libraries: interoperates with CUDA C/Fortran and GPU libraries, e.g. cuFFT, cuBLAS, cuSPARSE, etc.

Directives: Easy & Powerful
- Real-Time Object Detection (global manufacturer of navigation systems): 5x in 40 hours
- Valuation of Stock Portfolios using Monte Carlo (global technology consulting company): 2x in 4 hours
- Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 hours
"Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications." -- Developer at the global manufacturer of navigation systems

Focus on Exposing Parallelism
With directives, tuning work focuses on exposing parallelism, which makes codes inherently better. Example: application tuning work using directives for the new Titan system at ORNL.
- S3D (research more efficient combustion with next-generation fuels): tuning the top 3 kernels (90% of runtime) gave 3 to 6x speedup on CPU+GPU vs. CPU+CPU, but also improved the all-CPU version by 50%
- CAM-SE (answer questions about specific climate change adaptation and mitigation scenarios): tuning the top key kernel (50% of runtime) gave 6.5x speedup on CPU+GPU vs. CPU+CPU, and improved performance of the CPU version by 100%

A Very Simple Example: SAXPY

SAXPY in C:

    void saxpy(int n, float a,
               float *x, float *restrict y)
    {
    #pragma acc kernels
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    ...
    // Perform SAXPY on 1M elements
    saxpy(1<<20, 2.0, x, y);
    ...

SAXPY in Fortran:

    subroutine saxpy(n, a, x, y)
      real :: x(:), y(:), a
      integer :: n, i
    !$acc kernels
      do i = 1, n
        y(i) = a*x(i) + y(i)
      enddo
    !$acc end kernels
    end subroutine saxpy
    ...
    ! Perform SAXPY on 1M elements
    call saxpy(2**20, 2.0, x_d, y_d)
    ...

Jacobi Iteration: C Code

    // Iterate until converged
    while ( err > tol && iter < iter_max ) {
        err = 0.0;

        // Iterate across matrix elements,
        // calculating each new value from its neighbors
        for (int j = 1; j < n-1; j++) {
            for (int i = 1; i < m-1; i++) {
                Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                     A[j-1][i] + A[j+1][i]);

                // Compute max error for convergence
                err = max(err, abs(Anew[j][i] - A[j][i]));
            }
        }

        // Swap input/output arrays (naive)
        for (int j = 1; j < n-1; j++) {
            for (int i = 1; i < m-1; i++) {
                A[j][i] = Anew[j][i];
            }
        }

        iter++;
    }

Jacobi Iteration: OpenMP C Code

    while ( err > tol && iter < iter_max ) {
        err = 0.0;

        // Parallelize loop across CPU threads
        #pragma omp parallel for shared(m, n, Anew, A) reduction(max:err)
        for (int j = 1; j < n-1; j++) {
            for (int i = 1; i < m-1; i++) {
                Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                     A[j-1][i] + A[j+1][i]);
                err = max(err, abs(Anew[j][i] - A[j][i]));
            }
        }

        // Parallelize loop across CPU threads
        #pragma omp parallel for shared(m, n, Anew, A)
        for (int j = 1; j < n-1; j++) {
            for (int i = 1; i < m-1; i++) {
                A[j][i] = Anew[j][i];
            }
        }

        iter++;
    }

Jacobi Iteration: OpenACC C Code

    // Copy A in at beginning of the region, out at end;
    // allocate Anew on the accelerator
    #pragma acc data copy(A), create(Anew)
    while ( err > tol && iter < iter_max ) {
        err = 0.0;

        #pragma acc kernels reduction(max:err)
        for (int j = 1; j < n-1; j++) {
            for (int i = 1; i < m-1; i++) {
                Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                     A[j-1][i] + A[j+1][i]);
                err = max(err, abs(Anew[j][i] - A[j][i]));
            }
        }

        #pragma acc kernels
        for (int j = 1; j < n-1; j++) {
            for (int i = 1; i < m-1; i++) {
                A[j][i] = Anew[j][i];
            }
        }

        iter++;
    }

Performance
CPU: Intel Xeon X5680, 6 cores @ 3.33 GHz. GPU: NVIDIA Tesla M2090.

    Configuration           Execution Time (s)   Speedup
    CPU, 1 OpenMP thread    69.80                --
    CPU, 2 OpenMP threads   44.76                1.56x
    CPU, 4 OpenMP threads   39.59                1.76x
    CPU, 6 OpenMP threads   39.71                1.76x
    OpenACC GPU              9.78                4.06x vs. 6 CPU cores
                                                 (7.14x vs. 1 CPU core)

CPU speedups are measured vs. 1 CPU core.

Further Speedups
OpenACC allows more detailed control over parallelization using the gang, worker, and vector clauses. By understanding more about the OpenACC execution model and GPU hardware organization, we can get higher speedups on this code; by understanding bottlenecks in the code via profiling and compiler feedback, we can reorganize the code for higher performance.

OpenACC Specification and Website
The full OpenACC 1.0 specification is available online at www.openacc.org, along with a quick reference card. Beta implementations are available now from PGI, Cray, and CAPS.

Start Now with OpenACC Directives
Sign up now for a free trial license to the PGI Accelerator tools for a quick ramp-up with the directives compiler: www.nvidia.com/gpudirectives

OpenACC Talks on GTC-On-Demand
Tutorials: Monday (CAPS, NVIDIA & PGI)
Talks:
- Mark Harris: Getting Started with OpenACC
- Cliff Wooley: Profiling
- Michael Wolfe: Advanced Topics
- Francois Bodin: Programming Many-core Using Directives
- CAPS (Francois Bodin), Cray (James Beyer, Luiz DeRose), PGI (Brent Leback)

Thank you dpoole@nvidia.com tbradley@nvidia.com