OpenACC Parallelization and Optimization of NAS Parallel Benchmarks




OpenACC Parallelization and Optimization of NAS Parallel Benchmarks
Presented by Rengan Xu
GTC 2014, S4340, 03/26/2014
Rengan Xu, Xiaonan Tian, Sunita Chandrasekaran, Yonghong Yan, Barbara Chapman
HPC Tools group (http://web.cs.uh.edu/~hpctools/)
Department of Computer Science, University of Houston

Outline
- Motivation
- Overview of OpenACC and NPB benchmarks
- Parallelization and Optimization Techniques
- Parallelizing NPB Benchmarks
- Performance Evaluation
- Conclusion and Future Work

Motivation
- NPB is a benchmark suite that is close to real applications
- Parallelization techniques for improving performance

Overview of OpenACC
- A standard, high-level, directive-based programming model for accelerators
- OpenACC 2.0 released in late 2013
- Data directives: copy/copyin/copyout/...
- Data synchronization directive: update
- Compute directives:
  - parallel: more control to the user
  - kernels: more control to the compiler
- Three levels of parallelism: gang, worker, vector
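
As a minimal sketch of how these directives combine (an illustrative function, not taken from the benchmarks; a compiler without OpenACC support simply ignores the pragmas and runs the loop serially on the host):

```c
/* Minimal OpenACC sketch: data directives manage movement, a compute
   directive marks the loop, and gang/vector name two of the three
   levels of parallelism the compiler can map it onto. */
void scale_by_two(const float *a, float *b, int n) {
    /* Data directives: copyin moves a to the device, copyout brings b back. */
    #pragma acc data copyin(a[0:n]) copyout(b[0:n])
    {
        /* kernels leaves the exact mapping to the compiler. */
        #pragma acc kernels loop gang vector
        for (int i = 0; i < n; i++)
            b[i] = 2.0f * a[i];
    }
}
```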

OpenUH: An Open Source OpenACC Compiler
Link: http://web.cs.uh.edu/~openuh/

NAS Parallel Benchmarks (NPB)
- Well recognized for evaluating current and emerging multicore/many-core hardware architectures
- 5 parallel kernels: IS, EP, CG, MG and FT
- 3 simulated computational fluid dynamics (CFD) applications: LU, SP and BT
- Different problem sizes:
  - Class S: small, for quick test purposes
  - Class W: workstation size
  - Class A: standard test problem
  - Class E: largest test problem

Steps to parallelize an application
1. Profile to find the hotspots
2. Analyze compute-intensive loops to make them parallelizable
3. Add compute directives to these loops
4. Add data directives to manage data motion and synchronization
5. Optimize data structures and array access patterns
6. Apply loop scheduling tuning
7. Apply other optimizations, e.g. async and cache

Parallelization and Optimization Techniques
- Array privatization
- Loop scheduling tuning
- Memory coalescing optimization
- Scalar substitution
- Loop fission and unrolling
- Data motion optimization
- New algorithm adoption

Array Privatization

Before array privatization (has a data race):

#pragma acc kernels
for (k = 0; k <= grid_points[2]-1; k++) {
  for (j = 0; j < grid_points[1]-1; j++) {
    for (i = 0; i < grid_points[0]-1; i++) {
      for (m = 0; m < 5; m++) {
        rhs[j][i][m] = forcing[k][j][i][m];
      }
    }
  }
}

After array privatization (no data race, increased memory use):

#pragma acc kernels
for (k = 0; k <= grid_points[2]-1; k++) {
  for (j = 0; j < grid_points[1]-1; j++) {
    for (i = 0; i < grid_points[0]-1; i++) {
      for (m = 0; m < 5; m++) {
        rhs[k][j][i][m] = forcing[k][j][i][m];
      }
    }
  }
}
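
The same expansion can be shown on a self-contained toy (illustrative names and sizes, not the benchmark code): a scratch array shared across the parallel k loop races, while an array expanded by the k dimension does not.

```c
#define NK 4
#define NM 5

/* Before privatization: every k iteration writes the same shared
   scratch row, so the writes race as soon as k runs in parallel. */
static float scratch[NM];
static float out_racy[NK][NM];

void fill_racy(void) {
    #pragma acc kernels
    for (int k = 0; k < NK; k++) {
        for (int m = 0; m < NM; m++)
            scratch[m] = (float)(k + m);     /* data race across k */
        for (int m = 0; m < NM; m++)
            out_racy[k][m] = scratch[m];
    }
}

/* After privatization: the scratch array is expanded by the k
   dimension, so each k iteration owns its own row. The race is gone
   at the cost of NK times the memory, the trade-off the slide notes. */
static float scratch_priv[NK][NM];
static float out_priv[NK][NM];

void fill_private(void) {
    #pragma acc kernels
    for (int k = 0; k < NK; k++) {
        for (int m = 0; m < NM; m++)
            scratch_priv[k][m] = (float)(k + m);
        for (int m = 0; m < NM; m++)
            out_priv[k][m] = scratch_priv[k][m];
    }
}
```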

Loop Scheduling Tuning

Before tuning:

#pragma acc kernels
for (k = 0; k <= grid_points[2]-1; k++) {
  for (j = 0; j < grid_points[1]-1; j++) {
    for (i = 0; i < grid_points[0]-1; i++) {
      for (m = 0; m < 5; m++) {
        rhs[k][j][i][m] = forcing[k][j][i][m];
      }
    }
  }
}

After tuning:

#pragma acc kernels loop gang
for (k = 0; k <= grid_points[2]-1; k++) {
  #pragma acc loop worker
  for (j = 0; j < grid_points[1]-1; j++) {
    #pragma acc loop vector
    for (i = 0; i < grid_points[0]-1; i++) {
      for (m = 0; m < 5; m++) {
        rhs[k][j][i][m] = forcing[k][j][i][m];
      }
    }
  }
}

Memory Coalescing Optimization

Non-coalesced memory access:

#pragma acc kernels loop gang
for (j = 1; j <= gp12; j++) {
  #pragma acc loop worker
  for (i = 1; i <= gp02; i++) {
    #pragma acc loop vector
    for (k = 0; k <= ksize; k++) {
      fjacz[0][0][k][i][j] = 0.0;
    }
  }
}

Coalesced memory access (loop interchange):

#pragma acc kernels loop gang
for (k = 0; k <= ksize; k++) {
  #pragma acc loop worker
  for (i = 1; i <= gp02; i++) {
    #pragma acc loop vector
    for (j = 1; j <= gp12; j++) {
      fjacz[0][0][k][i][j] = 0.0;
    }
  }
}

Memory Coalescing Optimization

Non-coalesced memory access:

#pragma acc kernels loop gang
for (k = 0; k <= grid_points[2]-1; k++) {
  #pragma acc loop worker
  for (j = 0; j < grid_points[1]-1; j++) {
    #pragma acc loop vector
    for (i = 0; i < grid_points[0]-1; i++) {
      for (m = 0; m < 5; m++) {
        rhs[k][j][i][m] = forcing[k][j][i][m];
      }
    }
  }
}

Coalesced memory access (change the data layout):

#pragma acc kernels loop gang
for (k = 0; k <= grid_points[2]-1; k++) {
  #pragma acc loop worker
  for (j = 0; j < grid_points[1]-1; j++) {
    #pragma acc loop vector
    for (i = 0; i < grid_points[0]-1; i++) {
      for (m = 0; m < 5; m++) {
        rhs[m][k][j][i] = forcing[m][k][j][i];
      }
    }
  }
}

Scalar Substitution

Before scalar substitution:

#pragma acc kernels loop gang
for (k = 0; k <= grid_points[2]-1; k++) {
  #pragma acc loop worker
  for (j = 0; j < grid_points[1]-1; j++) {
    #pragma acc loop vector
    for (i = 0; i < grid_points[0]-1; i++) {
      for (m = 0; m < 5; m++) {
        rhs[m][k][j][i] = forcing[m][k][j][i];
      }
    }
  }
}

After scalar substitution (gp2 = grid_points[2], etc.):

#pragma acc kernels loop gang
for (k = 0; k <= gp2-1; k++) {
  #pragma acc loop worker
  for (j = 0; j < gp1-1; j++) {
    #pragma acc loop vector
    for (i = 0; i < gp0-1; i++) {
      for (m = 0; m < 5; m++) {
        rhs[m][k][j][i] = forcing[m][k][j][i];
      }
    }
  }
}
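
A compact, self-contained version of the same idea (illustrative 1-D kernel, not the benchmark code): the loop bound grid_points[2] is an array element that may have to be re-loaded from (device) memory every iteration, whereas a scalar copy can live in a register.

```c
#define NG 8

static int grid_points[3] = {NG, NG, NG};
static float rhs[NG], forcing[NG];

/* Before: the bound grid_points[2] is re-read on every iteration. */
void copy_rhs(void) {
    #pragma acc kernels loop
    for (int k = 0; k <= grid_points[2] - 1; k++)
        rhs[k] = forcing[k];
}

/* After scalar substitution: the bound is read once into gp2. */
void copy_rhs_scalar(void) {
    int gp2 = grid_points[2];
    #pragma acc kernels loop
    for (int k = 0; k <= gp2 - 1; k++)
        rhs[k] = forcing[k];
}
```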

Loop Unrolling

Before loop unrolling:

#pragma acc kernels loop gang
for (k = 0; k <= gp2-1; k++) {
  #pragma acc loop worker
  for (j = 0; j < gp1-1; j++) {
    #pragma acc loop vector
    for (i = 0; i < gp0-1; i++) {
      for (m = 0; m < 5; m++) {
        rhs[m][k][j][i] = forcing[m][k][j][i];
      }
    }
  }
}

After loop unrolling:

#pragma acc kernels loop gang
for (k = 0; k <= gp2-1; k++) {
  #pragma acc loop worker
  for (j = 0; j < gp1-1; j++) {
    #pragma acc loop vector
    for (i = 0; i < gp0-1; i++) {
      rhs[0][k][j][i] = forcing[0][k][j][i];
      rhs[1][k][j][i] = forcing[1][k][j][i];
      rhs[2][k][j][i] = forcing[2][k][j][i];
      rhs[3][k][j][i] = forcing[3][k][j][i];
      rhs[4][k][j][i] = forcing[4][k][j][i];
    }
  }
}

Loop Fission

Before loop fission:

#pragma acc parallel loop gang
for (i3 = 1; i3 < n3-1; i3++) {
  #pragma acc loop worker
  for (i2 = 1; i2 < n2-1; i2++) {
    #pragma acc loop vector
    for (i1 = 0; i1 < n1; i1++) {
      ...
    }
    #pragma acc loop vector
    for (i1 = 1; i1 < n1-1; i1++) {
      ...
    }
  }
}

After loop fission:

#pragma acc parallel loop gang
for (i3 = 1; i3 < n3-1; i3++) {
  #pragma acc loop worker
  for (i2 = 1; i2 < n2-1; i2++) {
    #pragma acc loop vector
    for (i1 = 0; i1 < n1; i1++) {
      ...
    }
  }
}
#pragma acc parallel loop gang
for (i3 = 1; i3 < n3-1; i3++) {
  #pragma acc loop worker
  for (i2 = 1; i2 < n2-1; i2++) {
    #pragma acc loop vector
    for (i1 = 1; i1 < n1-1; i1++) {
      ...
    }
  }
}

Data Movement Optimization
- In NPB, most benchmarks contain many global arrays that live throughout the entire program
- Allocate device memory for them once, at the beginning
- Use the update directive to synchronize data between host and device
- Use async to overlap communication and computation
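
The three points can be sketched together (an illustrative stencil with made-up names, not a specific benchmark; field_a and field_b stand in for NPB's long-lived global arrays):

```c
#define NF 1024

static float field_a[NF], field_b[NF];

void smooth(void) {
    /* Kernel queued on async queue 1, so the host thread is free to
       enqueue further kernels or transfers that overlap with it. */
    #pragma acc kernels loop gang vector present(field_a, field_b) async(1)
    for (int i = 1; i < NF - 1; i++)
        field_b[i] = 0.5f * (field_a[i-1] + field_a[i+1]);
}

void run(void) {
    for (int i = 0; i < NF; i++)
        field_a[i] = (float)(i * i);
    /* Allocate both arrays on the device once, for the whole lifetime. */
    #pragma acc data copyin(field_a) create(field_b)
    {
        smooth();
        #pragma acc wait(1)              /* drain the async queue */
        #pragma acc update host(field_b) /* device-to-host synchronization */
    }
}
```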

New Algorithm Adoption
- To exploit parallelism: adopt the hyperplane algorithm [1] for LU
- To overcome the GPU memory size limit: blocking in EP

[1] Leslie Lamport. The Parallel Execution of DO Loops. Communications of the ACM, Volume 17, Issue 2, pp. 83-93.
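
The hyperplane idea can be illustrated on a toy recurrence (not the actual LU kernels): w[i][j] depends on w[i-1][j] and w[i][j-1], so neither loop is parallel as written, but all points on one anti-diagonal i + j = d are mutually independent. Walking the hyperplanes sequentially while parallelizing within each plane exposes the parallelism.

```c
#define HN 6

static float w[HN][HN];

/* Wavefront sweep: outer loop walks hyperplanes in dependence order,
   inner loop over one hyperplane has no intra-plane dependences. */
void sweep_hyperplane(void) {
    for (int i = 0; i < HN; i++)
        for (int j = 0; j < HN; j++)
            w[i][j] = 1.0f;                    /* boundary/initial values */
    for (int d = 2; d <= 2 * (HN - 1); d++) {  /* sequential over planes */
        #pragma acc kernels loop independent   /* parallel within a plane */
        for (int i = 1; i < HN; i++) {
            int j = d - i;
            if (j >= 1 && j < HN)
                w[i][j] = w[i-1][j] + w[i][j-1];
        }
    }
}
```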

NAS Parallel Benchmarks (NPB)
- Well recognized for evaluating current and emerging multi-core/many-core hardware architectures
- 5 parallel kernels: IS, EP, CG, MG and FT
- 3 simulated computational fluid dynamics (CFD) applications: LU, SP and BT
- Different problem sizes:
  - Class S: small, for quick test purposes
  - Class W: workstation size
  - Classes A, B, C: standard test problems
  - Classes D, E, F: large test problems
- OpenACC 1.0 was used to create the benchmark suite

Parallelizing NPB Benchmarks - EP
- Array reduction issue: every element of an array needs a reduction
- The slide contrasts (a) the OpenMP solution, (b) OpenACC solution 1 and (c) OpenACC solution 2 (figures not transcribed)
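
One way such an array reduction can be expressed (an illustrative sketch with a stand-in bin test, not EP's actual Gaussian-pair logic or necessarily the slide's solution): since an OpenACC reduction clause targets a scalar, each array element can get its own scalar reduction pass.

```c
#define EP_NQ 4
#define EP_ITER 1000

static double q[EP_NQ];

/* Every element of q[] is a reduction target; run one scalar
   reduction per element as a workaround. */
void bin_counts(void) {
    for (int b = 0; b < EP_NQ; b++) {
        double qb = 0.0;
        #pragma acc parallel loop reduction(+:qb)
        for (int i = 0; i < EP_ITER; i++)
            if (i % EP_NQ == b)   /* stand-in for EP's bin test */
                qb += 1.0;
        q[b] = qb;
    }
}
```

The price is EP_NQ passes over the data, which is acceptable when the array of reduction targets is small, as it is in EP.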

Parallelizing NPB Benchmarks - FT and IS
- FFT: complex data type and too many function calls
- Solution for FFT:
  - Convert the array of structures (AoS) to a structure of arrays (SoA)
  - Manually inline function calls in all compute regions
- IS: irregular memory access
  - Atomic operations
  - Inclusive scan and exclusive scan
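
The AoS-to-SoA conversion can be sketched as follows (illustrative layout and names, not the benchmark's actual arrays): in the AoS form, consecutive threads load real parts that are a whole struct apart, while the SoA form lets them read consecutive doubles, which coalesces on the GPU.

```c
#define NC 8

/* AoS: one struct per complex number; fields interleave in memory. */
typedef struct { double re, im; } dcomplex;
static dcomplex aos[NC];

/* SoA: all real parts contiguous, all imaginary parts contiguous. */
static double re_part[NC], im_part[NC];

void aos_to_soa(void) {
    #pragma acc kernels loop
    for (int i = 0; i < NC; i++) {
        re_part[i] = aos[i].re;
        im_part[i] = aos[i].im;
    }
}
```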

Parallelizing NPB Benchmarks - MG
- Partial array copy is used extensively
- In the resid() routine, ov is read-only and or is write-only
- They may point to the same memory (aliasing)
- Using copyin for ov and copyout for or is implementation-defined
- Solution: put the two arrays into one copy clause
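
A sketch of the aliasing issue (the names v_in/r_out and the stencil body are illustrative, not MG's actual code): if the read-only and write-only pointers may alias, copyin for one plus copyout for the other would give the device two separate copies with implementation-defined results, whereas listing both in a single copy clause keeps one consistent device image.

```c
/* Both arrays appear in one copy clause so a possible alias between
   v_in and r_out cannot split them into two device copies. */
void residual(const double *v_in, double *r_out, int n) {
    #pragma acc kernels loop copy(v_in[0:n], r_out[0:n])
    for (int i = 1; i < n - 1; i++)
        r_out[i] = v_in[i-1] - 2.0 * v_in[i] + v_in[i+1];
}
```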

Parallelizing NPB Benchmarks - LU
- OpenMP uses a pipeline [1]; OpenACC uses the hyperplane algorithm
- Array privatization and memory coalescing optimizations

[1] Haoqiang Jin, Michael Frumkin and Jerry Yan. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. Technical Report NAS-99-011, NASA Ames Research Center, 1999.

Performance Evaluation
- CPU: 16-core Intel Xeon x86_64 with 32 GB memory
- GPU: NVIDIA Kepler K20 with 5 GB memory
- NPB 3.3 C version [1]
- Compilers: GCC 4.4.7, OpenUH
- Compared against the serial and CUDA versions

[1] http://aces.snu.ac.kr/center_for_manycore_programming/snu_npb_suite.html

Performance Evaluation of OpenUH OpenACC NPB - compared to serial (chart not transcribed)

Performance Evaluation of OpenUH OpenACC NPB - effectiveness of the optimizations (chart not transcribed)

Performance Evaluation of OpenUH OpenACC NPB - OpenACC vs. CUDA [1]: LU-HP, BT, SP (charts not transcribed)

[1] http://www.tu-chemnitz.de/informatik/pi/forschung/download/npb-gpu

Conclusion and Future Work

Conclusion:
- Discussed different parallelization techniques for OpenACC
- Demonstrated the speedup of OpenUH OpenACC over serial code
- Compared the performance of OpenUH OpenACC and CUDA
- Contributed 4 NPB benchmarks to SPEC ACCEL V1.0 (released on March 18, 2014): http://www.hpcwire.com/off-the-wire/spechpg-releases-new-hpc-benchmark-suite/
- Looking forward to making more contributions to future SPEC ACCEL suites
- Performance comparison between different compilers: session S4343 (Wednesday, 03/26, 17:00-17:25, Room LL20C)

Future Work:
- Explore other optimizations
- Automate some of the optimizations in the OpenUH compiler