OpenCL: Portability and Performance


OpenCL: Portability and Performance
Bernd Dammann
Associate Professor, Scientific Computing, DTU Informatics
HPC architect & consultant, DTU Computing Center
Technical University of Denmark


Outline
- GPUlab @ DTU Informatics: Motivation, Goal, People, Equipment, Projects
- OpenCL Case Study
- Future

Why do we work with GPUs?
- We could not neglect the development
- Our customers asked for it
- DTU Informatics has the expertise in-house: Scientific Computing & HPC, Computer Graphics, Embedded Systems Engineering
- We were able to attract new students
- GPU computing is a hot topic

GPUlab @ DTU Informatics
Research proposal: Desktop Scientific Computing on Consumer Graphics Cards
- Proposal was accepted by The Danish Council for Independent Research, Technology and Production Sciences, December 2009
- Project: May 2010 - May 2013
- 2 PhD positions & 1 Postdoc

GPUlab @ DTU Informatics
People (covering Algorithms, HPC, Control, PDEs, Graphics):
- Prof. Per Chr. Hansen
- Assoc. Prof. Bernd Dammann
- Assoc. Prof. John B. Jørgensen
- Assoc. Prof. Allan Peter Engsig-Karup
- Assoc. Prof. Jeppe Revall Frisvad
- Postdoc Hans Henrik Brandenborg Sørensen
- PhDs: Nicolai Fog Gade-Nielsen, Stefan Glimberg
- MSc & BSc students

GPUlab @ DTU Informatics
Goal:
- Industry: FORCE Technology, QuantumWise A/S, Brüel & Kjær, DHI Group, MOSEK ApS, NVIDIA(?), ...
- Academia: Brown University, Rice University, INRIA, RWTH Aachen, Univ. of Antwerp, Copenhagen Univ., Aalborg Univ., ...

GPUlab @ DTU Informatics
Our equipment:
- Intel Core 2 Q9450 @ 2.66 GHz, 4 GB, NVIDIA GeForce GTX 580 (1.5 GB)
- Intel Core i7 920 @ 2.67 GHz, 24 GB, NVIDIA Tesla C2070 (6 GB) / GeForce 9500 GT
- Intel Core i7 930 @ 2.80 GHz, 12 GB, NVIDIA Tesla C2050 (3 GB) / GeForce GT 240
- Intel Xeon E5620 @ 2.40 GHz, 12 GB, 2x NVIDIA GeForce GTX 590 (3 GB)
- Intel Xeon E5620 @ 2.40 GHz, 12 GB, 2x AMD Radeon 6990 (4 GB)

GPUlab @ DTU Informatics
- All our machines are homebuilt
- Challenges: finding the right PSUs and getting enough of the right cables
- We have access to external resources as well, e.g. an 8-GPU cluster (4 nodes, 2x NVIDIA M2050) at DTU

Projects @ GPUlab
Finished, on-going and planned projects:
- Solver for non-linear water waves
- Fast computational methods for high-resolution ODF problems on many-core systems
- Auto-tuning of Dense Linear Algebra on GPUs
- GPUlab Library: a high-performance GPU-based library for the development of scientific applications
- ...

Non-linear water waves
- Wave loads on ships and offshore platforms
- Influence of bottom interaction in coastal regions
- Seakeeping
- Ship & maneuvering simulator

Non-linear water waves
- A system with close to 100,000,000 degrees of freedom can be solved on a GPU (4 GB RAM)
- One iteration of the solver takes less than 1 sec
- Complete re-write of the code

ODF Reconstruction
- Orientation Distribution Functions (ODF) are used in X-ray analysis of material properties
- Collaboration with the Material Physics group

ODF Reconstruction
- Reconstruction of the ODF from CCD images means solving a linear system Ax = b
- A is a sparse matrix: of the 4*N^3 elements per row, only 4(3N - 2) are non-zero
- With a desired ODF resolution of 4*1000^3, the x vector would take up to 16 GB, and the A matrix up to 44 TB in sparse (CRS) format!
- Solution: continuous re-calculation of the rows (by ray-tracing) and use of a CGLS solver
- Hardware: dual-CPU, quad-GPU workstation
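The slide's key trick is never storing A: its rows are re-traced on demand inside the solver. A matrix-free CGLS maps naturally onto two callbacks for the products with A and A^T. The following is a minimal sketch in plain C, not the GPUlab code; the callback names and signatures are assumptions, standing in for the ray-tracing step.

// Matrix-free CGLS for min ||A x - b||, x of length n, b of length m.
// Ax computes A*in (length m), ATx computes A^T*in (length n);
// both stand in for the on-the-fly ray-tracing described above.
#include <stddef.h>

typedef void (*matvec_fn)(const float *in, float *out, void *ctx);

static float dot(const float *a, const float *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += (double)a[i] * b[i];
    return (float)s;
}

// Caller provides workspace: r, q of length m; p, s of length n.
void cgls(matvec_fn Ax, matvec_fn ATx, void *ctx,
          const float *b, float *x, float *r, float *p, float *s, float *q,
          size_t m, size_t n, int iters)
{
    Ax(x, r, ctx);                                  // r = b - A x
    for (size_t i = 0; i < m; i++) r[i] = b[i] - r[i];
    ATx(r, s, ctx);                                 // s = A^T r
    for (size_t i = 0; i < n; i++) p[i] = s[i];
    float gamma = dot(s, s, n);

    for (int it = 0; it < iters && gamma > 0.0f; it++) {
        Ax(p, q, ctx);                              // rows are re-traced here
        float alpha = gamma / dot(q, q, m);
        for (size_t i = 0; i < n; i++) x[i] += alpha * p[i];
        for (size_t i = 0; i < m; i++) r[i] -= alpha * q[i];
        ATx(r, s, ctx);
        float gnew = dot(s, s, n);
        for (size_t i = 0; i < n; i++) p[i] = s[i] + (gnew / gamma) * p[i];
        gamma = gnew;
    }
}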

ODF Reconstruction
- PCGLS: parallel CGLS on the CPU, ray-tracing on the CPU or GPU
- CUDA CGLS: both CGLS and ray-tracing on the GPU
- In the plots: full lines are single precision, dashed lines are double precision

So far, everything we have done has been based on CUDA. So why change to OpenCL?

Why OpenCL?
- By request from students in our HPC courses: OpenCL instead of CUDA! "I don't care that it's harder, I care that it's something I will be able to use on every graphics card."
- Our collaborators don't want to lock themselves in to one vendor, or they might have AMD GPUs already.
- Keyword: Portability!
- But what about performance?

Why OpenCL? Ask Google...

OpenCL: case study
Scope:
- investigate portability and performance
- test on different GPU architectures: NVIDIA & AMD
- compare implementations: CUDA vs OpenCL (NVIDIA GPUs only)
Full report will be available on-line: A. Svejstrup Nielsen, A.P. Engsig-Karup & B. Dammann: Parallel Programming using OpenCL on Modern Architectures, IMM Technical Report 2012-05, Technical University of Denmark


OpenCL: case study
The test case: matrix multiplication, C = A x B
C[i][j] = Sum_k A[i][k] * B[k][j]
- data type float (single precision)
- only square matrices considered here

OpenCL: case study
The naïve CPU version:

void MatMulCPU(float* A, float* B, float* C, int M, int N, int K)
{
    int n, m, k;
    for (m = 0; m < M; m++)
        for (n = 0; n < N; n++)
            C[n*M + m] = 0.0f;
    for (m = 0; m < M; m++)
        for (k = 0; k < K; k++)
            for (n = 0; n < N; n++)
                C[n*M + m] += A[n*K + k] * B[k*M + m];
}

OpenCL: case study
The naïve OpenCL kernel:

__kernel void MatMulNaive(__global float *A, __global float *B,
                          __global float *C, int size)
{
    // Retrieve work-item global index
    int i = get_global_id(0);
    int j = get_global_id(1);

    // If the work-item is within the matrix dimensions, do the dot product
    if ((i < size) && (j < size)) {
        float tmp = 0.0f;
        // loop over the inner dimension -- every access goes to global memory
        for (int k = 0; k < size; k++)
            tmp = tmp + A[j*size + k] * B[k*size + i];
        C[j*size + i] = tmp;
    }
}
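The slides do not show the host side; for completeness, here is a minimal sketch of OpenCL 1.x host code that builds and launches the kernel above on square N x N matrices. Error checking is omitted and all names are illustrative, not taken from the report.

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

// kernel_src must hold the MatMulNaive source from the previous slide
static const char *kernel_src = "/* paste the MatMulNaive source here */";

int main(void)
{
    const int    N     = 1024;
    const size_t bytes = (size_t)N * N * sizeof(float);
    float *A = malloc(bytes), *B = malloc(bytes), *C = malloc(bytes);
    for (int i = 0; i < N * N; i++) { A[i] = 1.0f; B[i] = 2.0f; }

    cl_platform_id   plat;  clGetPlatformIDs(1, &plat, NULL);
    cl_device_id     dev;   clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context       ctx   = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel krn = clCreateKernel(prog, "MatMulNaive", NULL);

    cl_mem dA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, A, NULL);
    cl_mem dB = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, B, NULL);
    cl_mem dC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    clSetKernelArg(krn, 0, sizeof(cl_mem), &dA);
    clSetKernelArg(krn, 1, sizeof(cl_mem), &dB);
    clSetKernelArg(krn, 2, sizeof(cl_mem), &dC);
    clSetKernelArg(krn, 3, sizeof(int),    &N);

    size_t g = ((size_t)N + 15) / 16 * 16;     // global size rounded up to group size
    size_t global[2] = {g, g}, local[2] = {16, 16};
    clEnqueueNDRangeKernel(queue, krn, 2, NULL, global, local, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, dC, CL_TRUE, 0, bytes, C, 0, NULL, NULL);

    printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0f * N);
    return 0;
}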

OpenCL: case study
Using local memory (BLOCK_SIZE is defined at build time):

__kernel void MatMulLocal(__global float *A, __global float *B,
                          __global float *C, int size)
{
    // Retrieve global & local work-item indices
    int iG = get_global_id(0);
    int jG = get_global_id(1);
    int iL = get_local_id(0);
    int jL = get_local_id(1);

    float tmp = 0.0f;
    for (int n = 0; n < get_num_groups(0); n++) {
        // Declare local memory space & load data from global memory
        __local float localA[BLOCK_SIZE][BLOCK_SIZE];
        __local float localB[BLOCK_SIZE][BLOCK_SIZE];
        localA[jL][iL] = A[jG*size + n*BLOCK_SIZE + iL];
        localB[jL][iL] = B[(n*BLOCK_SIZE + jL)*size + iG];
        // ensure synchronization between all work-items in the work-group
        barrier(CLK_LOCAL_MEM_FENCE);

        // Dot product -- the inner loop only accesses local memory
        for (int k = 0; k < BLOCK_SIZE; k++)
            tmp += localA[jL][k] * localB[k][iL];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // Transfer result back to global memory
    C[jG*size + iG] = tmp;
}

OpenCL: case study
First results on the AMD HD 6990 [performance plot]

OpenCL: case study
More improvements (a sketch of the FAST1 idea follows below):
- FAST1: using a 16x16 tile of A in local memory and a 1x16 tile of B in registers; (16,1) work-groups/thread-blocks
- FAST2: increasing occupancy; like FAST1, but with (64,1) work-groups
- FAST3: reducing communication + loop unrolling (by the compiler)
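The report's FAST1 kernel is not reproduced in this transcript. The following is a hedged sketch of the idea just described: a 16x16 tile of A staged in local memory and a 16-element register strip per work-item, with (16,1) work-groups. Here the register strip holds partial C values while B is streamed one element at a time; the actual kernel may organize the register tile around B instead, and all naming and indexing conventions are assumptions.

#define TS 16   // tile size; launch with global = (N, N/TS), local = (16, 1)

__kernel void MatMulFast1(__global const float *A, __global const float *B,
                          __global float *C, int size)
{
    int iL = get_local_id(0);                    // 0..15 within the group
    int iG = get_global_id(0);                   // column of C
    int jT = get_group_id(1) * TS;               // first row of this C tile

    float acc[TS];                               // 16 partial C values in registers
    for (int r = 0; r < TS; r++) acc[r] = 0.0f;

    __local float tileA[TS][TS];
    for (int t = 0; t < size; t += TS) {
        // each work-item loads one column of the A tile (coalesced per row)
        for (int r = 0; r < TS; r++)
            tileA[r][iL] = A[(jT + r) * size + t + iL];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TS; k++) {
            float b = B[(t + k) * size + iG];    // one element of B per step
            for (int r = 0; r < TS; r++)         // update the whole register strip
                acc[r] += tileA[r][k] * b;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    for (int r = 0; r < TS; r++)
        C[(jT + r) * size + iG] = acc[r];
}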

OpenCL: case study
Comments:
- The tuning techniques and ideas shown are based on our experiences with the NVIDIA Tesla (aka GT200) architecture
- Since the NVIDIA Fermi architecture has more features, e.g. more caches, we would need to apply other tricks as well
- All timings include data transfers, i.e. host to device and device to host

OpenCL: case study
OpenCL vs CUDA on the NVIDIA GTX 280 [performance plot]

OpenCL: case study
OpenCL vs CUDA on the NVIDIA GTX 590 [performance plot]: CUDA slower than OpenCL?

OpenCL: case study
OpenCL vs CUDA on the NVIDIA GTX 590, after removing bank conflicts: now the same performance (a common padding fix is sketched below)
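The transcript does not show how the bank conflicts were removed. A common fix, given here as a hedged sketch rather than the report's actual change, is to pad one dimension of the local-memory tiles so that work-items reading down a column no longer hit the same memory bank. This repeats MatMulLocal from above with only the tile declarations changed (BLOCK_SIZE again supplied at build time, e.g. -D BLOCK_SIZE=16):

__kernel void MatMulLocalPadded(__global float *A, __global float *B,
                                __global float *C, int size)
{
    int iG = get_global_id(0), jG = get_global_id(1);
    int iL = get_local_id(0),  jL = get_local_id(1);

    float tmp = 0.0f;
    for (int n = 0; n < get_num_groups(0); n++) {
        // +1 word of padding per row shifts each row into a different bank
        __local float localA[BLOCK_SIZE][BLOCK_SIZE + 1];
        __local float localB[BLOCK_SIZE][BLOCK_SIZE + 1];
        localA[jL][iL] = A[jG*size + n*BLOCK_SIZE + iL];
        localB[jL][iL] = B[(n*BLOCK_SIZE + jL)*size + iG];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < BLOCK_SIZE; k++)
            tmp += localA[jL][k] * localB[k][iL];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[jG*size + iG] = tmp;
}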

OpenCL: case study
Results on the AMD HD 6990 [performance plot]

OpenCL: case study
Comparison of the 3 GPUs in this study [performance plot]

OpenCL: case study
Conclusions:
- it is possible to write OpenCL code that performs as well as CUDA code
- advantage: the code is portable
- CUBLAS is faster, but maximum performance wasn't the goal of this study
- the tuning tricks were Tesla-specific, not for Fermi!
- we could probably tune the code more, but might lose portability

What others found...
Scalable HeterOgeneous Computing (SHOC) benchmark suite from ORNL
http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2012-02-20/13-shoc.pdf

More possibilities
- Extend or replace the GPUlab library with an OpenCL version of the kernels, thus making the library more portable
- Apply our auto-tuning framework to OpenCL kernels
- Make OpenCL the default for new projects

Auto-tuning GPU kernels
Motivation:
- Many of the BLAS2 functions (xGEMV, xSYMV, ...) in libraries like CUBLAS, MAGMA, etc. are optimized for square matrices only
- Those functions are memory bound, so memory access and work distribution are important
- Auto-tuning: parametrize the kernels and tune them for optimal performance at particular input shapes and sizes (a host-side sketch follows below)
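In OpenCL, kernel parameters can be swept without touching the source by rebuilding the program with different -D defines. The following is a hedged sketch of such a brute-force tuning loop, not the GPUlab framework: it times each variant of the local-memory kernel above (here called MatMulLocal) using OpenCL profiling events. The queue is assumed to be created with CL_QUEUE_PROFILING_ENABLE, and launch_once() is a hypothetical helper.

#include <CL/cl.h>
#include <stdio.h>

// hypothetical helper: sets kernel arguments and enqueues one NDRange
// launch sized for the given block, returning the launch event in *ev
void launch_once(cl_command_queue q, cl_kernel k, int block, cl_event *ev);

double time_variant(cl_context ctx, cl_device_id dev, cl_command_queue q,
                    const char *src, int block)
{
    char opts[64];
    snprintf(opts, sizeof opts, "-D BLOCK_SIZE=%d", block);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    if (clBuildProgram(prog, 1, &dev, opts, NULL, NULL) != CL_SUCCESS)
        return -1.0;                       // variant does not build: skip it
    cl_kernel krn = clCreateKernel(prog, "MatMulLocal", NULL);

    cl_event ev;
    launch_once(q, krn, block, &ev);
    clWaitForEvents(1, &ev);

    cl_ulong t0, t1;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof t0, &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof t1, &t1, NULL);
    return (t1 - t0) * 1e-9;               // kernel time in seconds
}

/* typical driver loop:
     int best = 0; double best_t = 1e30;
     for (int b = 4; b <= 32; b *= 2) {
         double t = time_variant(ctx, dev, queue, kernel_src, b);
         if (t > 0 && t < best_t) { best_t = t; best = b; }
     }
*/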

Auto-tuning GPU kernels [three slides of performance plots]

Auto-tuning GPU kernels
Comparison with CUBLAS and MAGMA: [performance plot]

The future...
"Prediction is very difficult, especially if it is about the future." -- Niels Bohr (1885-1962)
- In the past we have had accelerators before: transputers, etc. They were specialized hardware, not a big market.
- GPUs are based on a mass-market product.
- GPU computing will (probably) not go away... but it will develop/change.
- Keyword: Heterogeneous Computing. What is the language of HC?

"If parallel programming is hard, heterogeneous programming is that hard, squared." -- Michael Wolfe, The Portland Group, Inc.
From: The Heterogeneous Programming Jungle, http://www.hpcwire.com/hpcwire/2012-03-19/the_heterogeneous_programming_jungle.html

Acknowledgements
Thanks to...
- Allan Svejstrup for the OpenCL case study
- Allan Engsig-Karup and Morten Gorm Madsen for the water waves work
- Nicolai Fog Gade-Nielsen, Martin Høstergaard and Søren Schmidt for the ODF work
- Hans Henrik Sørensen for the auto-tuning work
- Nikolaj and Toke, the two students who got us into all that back in 2007

Thank you!
http://gpulab.imm.dtu.dk/