OpenCL: Portability and Performance. Bernd Dammann, Associate Professor, Scientific Computing, DTU Informatics; HPC architect & consultant, DTU Computing Center, Technical University of Denmark.
Outline: GPUlab @ DTU Informatics (motivation, goal, people, equipment, projects); OpenCL case study; the future.
Why do we work with GPUs? We could not neglect the development. Our customers asked for it. DTU Informatics has the expertise in-house: Scientific Computing & HPC, Computer Graphics, Embedded Systems Engineering. We were able to attract new students. GPU computing is a hot topic.
GPUlab @ DTU Informatics. Research proposal: Desktop Scientific Computing on Consumer Graphics Cards. The proposal was accepted by The Danish Council for Independent Research, Technology and Production Sciences, December 2009. Project: May 2010 to May 2013; 2 PhD positions & 1 Postdoc.
GPUlab @ DTU Informatics. People: Prof. Per Chr. Hansen, Assoc. Prof. Bernd Dammann, Assoc. Prof. John B. Jørgensen, Assoc. Prof. Allan Peter Engsig-Karup, Assoc. Prof. Jeppe Revall Frisvad, Postdoc Hans Henrik Brandenborg Sørensen; PhDs: Nicolai Fog Gade-Nielsen, Stefan Glimberg; MSc & BSc students. Topic areas: algorithms, HPC, control, PDEs, graphics.
GPUlab @ DTU Informatics. Goal: Industry: FORCE Technology, QuantumWise A/S, Brüel & Kjær, DHI Group, MOSEK ApS, NVIDIA(?), ...; Academics: Brown University, Rice University, INRIA, RWTH Aachen, Univ. of Antwerp, Copenhagen Univ., Aalborg Univ., ...
GPUlab @ DTU Informatics. Our equipment: Intel Core 2 Q9450 @ 2.66 GHz, 4 GB, NVIDIA GeForce GTX 580 (1.5 GB); Intel Core i7 920 @ 2.67 GHz, 24 GB, NVIDIA Tesla C2070 (6 GB) / GeForce 9500 GT; Intel Core i7 930 @ 2.80 GHz, 12 GB, NVIDIA Tesla C2050 (3 GB) / GeForce GT 240; Intel Xeon E5620 @ 2.40 GHz, 12 GB, 2x NVIDIA GeForce GTX 590 (3 GB); Intel Xeon E5620 @ 2.40 GHz, 12 GB, 2x AMD Radeon 6990 (4 GB).
GPUlab @ DTU Informatics. All our machines are home-built; the challenge is to find the right PSUs and to get enough of the right cables. We have access to external resources as well, e.g. an 8-GPU cluster (4 nodes, 2x NVIDIA M2050) at DTU.
Projects @ GPUlab. Finished, on-going and planned projects: solver for non-linear water waves; fast computational methods for high-resolution ODF problems on many-core systems; auto-tuning of dense linear algebra on GPUs; GPUlab Library, a high-performance GPU-based library for the development of scientific applications; ...
Non-linear water waves. Wave loads on ships and offshore platforms; influence of the bottom interaction in coastal regions; seakeeping; ship & maneuvering simulator.
Non-linear water waves. A system with close to 100,000,000 degrees of freedom can be solved on a GPU (4 GB RAM); one iteration of the solver takes less than 1 sec; complete re-write of the code.
ODF Reconstruction. Orientation Distribution Functions (ODF) are used in X-ray analysis of material properties. Collaboration with the Material Physics group.
ODF Reconstruction. Reconstruction of the ODF from CCD images amounts to solving a linear system Ax = b. A is sparse: of the 4N^3 elements per row, only 4(3N - 2) are non-zero. With a desired ODF resolution of 4 x 1000^3, the x vector would take up to 16 GB, and the A matrix up to 44 TB in sparse format (CRS)! Solution: continuous re-calculation of the rows (by ray-tracing) combined with a CGLS solver. HW: dual-CPU, quad-GPU workstation.
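For context, the reason the full A never has to be stored is that CGLS touches A only through one product with A and one with A^T per iteration, so the required rows can be regenerated by ray-tracing on the fly. The standard CGLS recurrences for min_x ||Ax - b||_2 (textbook form, not quoted from the report) are:

\begin{align*}
 & x_0 \text{ given}, \quad r_0 = b - A x_0, \quad p_0 = s_0 = A^T r_0, \quad \gamma_0 = \|s_0\|_2^2, \\
 & \text{for } k = 0, 1, 2, \dots: \\
 & \quad q_k = A p_k, \qquad \alpha_k = \gamma_k / \|q_k\|_2^2, \\
 & \quad x_{k+1} = x_k + \alpha_k p_k, \qquad r_{k+1} = r_k - \alpha_k q_k, \\
 & \quad s_{k+1} = A^T r_{k+1}, \qquad \gamma_{k+1} = \|s_{k+1}\|_2^2, \\
 & \quad \beta_k = \gamma_{k+1} / \gamma_k, \qquad p_{k+1} = s_{k+1} + \beta_k p_k.
\end{align*}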
ODF Reconstruction. PCGLS: parallel CGLS on the CPU, ray-tracing on the CPU or GPU. CUDA CGLS: both CGLS and ray-tracing on the GPU. (Plot legend: full lines: single precision; dashed lines: double precision.)
So far, everything we have done has been based on CUDA. So why change to OpenCL?
Why OpenCL? By request from students in our HPC courses: OpenCL instead of CUDA! "I don't care that it's harder, I care that it's something I will be able to use on every graphics card." Our collaborators don't want to lock themselves to one vendor, or they might have AMD GPUs already. Keyword: Portability! But what about performance?
Why OpenCL? Ask Google...
OpenCL: case study. Scope: investigate portability and performance; test on different GPU architectures: NVIDIA & AMD; compare implementations, CUDA vs OpenCL (NVIDIA GPUs only). The full report will be available on-line: A. Svejstrup Nielsen, A. P. Engsig-Karup & B. Dammann: Parallel Programming using OpenCL on Modern Architectures, IMM Technical Report 2012-05, Technical University of Denmark.
OpenCL: case study. The test case: matrix multiplication, C = A x B, i.e. C[i][j] = sum_k A[i][k] * B[k][j]; data type: float (single precision); only square matrices are considered here.
OpenCL: case study. Naïve CPU version:

void MatMulCPU(float *A, float *B, float *C, int M, int N, int K)
{
    int n, m, k;
    // clear C
    for (m = 0; m < M; m++)
        for (n = 0; n < N; n++)
            C[n*M + m] = 0.0f;
    // triple loop, same data layout as the OpenCL kernels
    for (m = 0; m < M; m++)
        for (k = 0; k < K; k++)
            for (n = 0; n < N; n++)
                C[n*M + m] += A[n*K + k]*B[k*M + m];
}
OpenCL: case study. Naïve OpenCL kernel (every access goes to global memory):

__kernel void MatMulNaive(__global float *A, __global float *B,
                          __global float *C, int size)
{
    // Retrieve work item global index
    int i = get_global_id(0);
    int j = get_global_id(1);
    // If the work item is within the matrix dimensions, do the dot product
    if ((i < size) && (j < size)) {
        float tmp = 0.0f;
        int k;
        // loop over the inner dimension (access to global memory)
        for (k = 0; k < size; k++)
            tmp = tmp + A[j*size + k]*B[k*size + i];
        C[j*size + i] = tmp;
    }
}
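For completeness, a minimal host-side sketch of how such a kernel is typically launched with one work item per element of C (not taken from the report; error handling is omitted, and context, queue, program, the device buffers dA/dB/dC, the host arrays A/B/C, bytes and size are assumed to be set up already):

cl_kernel kernel = clCreateKernel(program, "MatMulNaive", NULL);

// kernel arguments: the three buffers and the matrix size
clSetKernelArg(kernel, 0, sizeof(cl_mem), &dA);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &dB);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &dC);
clSetKernelArg(kernel, 3, sizeof(int),    &size);

// copy the input matrices to the device (blocking writes)
clEnqueueWriteBuffer(queue, dA, CL_TRUE, 0, bytes, A, 0, NULL, NULL);
clEnqueueWriteBuffer(queue, dB, CL_TRUE, 0, bytes, B, 0, NULL, NULL);

// 2D NDRange of size x size work items; the runtime picks the work-group size
size_t global[2] = { (size_t)size, (size_t)size };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);

// read the result back (blocking)
clEnqueueReadBuffer(queue, dC, CL_TRUE, 0, bytes, C, 0, NULL, NULL);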
OpenCL: case study. Using local memory (kernel body; the dot product works on local memory inside the loop over tiles):

// Retrieve global & local work item index
int iG = get_global_id(0);
int jG = get_global_id(1);
int iL = get_local_id(0);
int jL = get_local_id(1);

// Declare local memory space for one tile of A and one tile of B
__local float localA[BLOCK_SIZE][BLOCK_SIZE];
__local float localB[BLOCK_SIZE][BLOCK_SIZE];

float tmp = 0.0f;
for (int n = 0; n < get_num_groups(0); n++) {
    // load data from global into local memory
    localA[jL][iL] = A[jG*size + n*BLOCK_SIZE + iL];
    localB[jL][iL] = B[(n*BLOCK_SIZE + jL)*size + iG];
    // ensure synchronization between all work items in the work-group
    barrier(CLK_LOCAL_MEM_FENCE);
    // Dot product: work on local memory in the inner loop
    for (int k = 0; k < BLOCK_SIZE; k++)
        tmp += localA[jL][k]*localB[k][iL];
    barrier(CLK_LOCAL_MEM_FENCE);
}
// Transfer the result back to global memory
C[jG*size + iG] = tmp;
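BLOCK_SIZE is not defined inside the kernel; one common way to supply it (an assumption here, not necessarily what the case study did) is as a compile-time define, with the work-group size chosen to match:

// build the kernel source with a fixed tile size
clBuildProgram(program, 1, &device, "-DBLOCK_SIZE=16", NULL, NULL);

// launch with 16x16 work-groups (size must be a multiple of 16 in this sketch)
size_t global[2] = { (size_t)size, (size_t)size };
size_t local[2]  = { 16, 16 };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);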
OpenCL: case study. First results on the AMD HD 6990.
OpenCL: case study. More improvements. FAST1: a 16x16 tile of A in local memory and a 1x16 tile of B in registers, with (16,1) work-groups/thread-blocks (a sketch of this idea follows below). FAST2: increasing occupancy, like FAST1 but with (64,1) work-groups. FAST3: reducing communication + loop unrolling (by the compiler).
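To make the FAST1 description concrete, here is a rough sketch of that kind of kernel: each (16,1) work-group computes a 16x16 block of C, the A tile sits in local memory, and each work item keeps a 1x16 strip of B plus its 16 partial sums in registers. This is an illustration written for this text, not the kernel from the report; it assumes the row-major layout of the naïve kernel, a matrix size divisible by 16, and a launch with global size (size, size/16) and local size (16, 1):

#define TILE 16

__kernel void MatMulFast1(__global const float *A,
                          __global const float *B,
                          __global float *C, int size)
{
    __local float Atile[TILE][TILE];

    int iL = get_local_id(0);              // 0..15
    int j  = get_group_id(0)*TILE + iL;    // column of C handled by this work item
    int i0 = get_group_id(1)*TILE;         // first row of this 16x16 block of C

    float acc[TILE];                       // one partial sum per row of the block
    for (int r = 0; r < TILE; r++) acc[r] = 0.0f;

    for (int kk = 0; kk < size; kk += TILE) {
        // the 16 work items cooperatively load a 16x16 tile of A:
        // work item iL loads column iL of the tile
        for (int r = 0; r < TILE; r++)
            Atile[r][iL] = A[(i0 + r)*size + kk + iL];
        barrier(CLK_LOCAL_MEM_FENCE);

        // 1x16 strip of B in registers
        float b[TILE];
        for (int k = 0; k < TILE; k++)
            b[k] = B[(kk + k)*size + j];

        // 16x16 multiply-adds against the shared A tile
        for (int k = 0; k < TILE; k++)
            for (int r = 0; r < TILE; r++)
                acc[r] += Atile[r][k]*b[k];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    for (int r = 0; r < TILE; r++)
        C[(i0 + r)*size + j] = acc[r];
}

Each work item then writes 16 elements of C instead of one, which raises the amount of arithmetic done per global memory access.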
OpenCL: case study. Comments: the tuning techniques and ideas shown are based on our experience with the NVIDIA Tesla (aka GT200) architecture. Since the NVIDIA Fermi architecture has more features, e.g. more caches, we would need to apply other tricks as well. All timings include data transfers, i.e. host to device and device to host.
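One way to obtain such end-to-end timings (an assumption about the setup, not how the report necessarily measured) is to create the command queue with profiling enabled and read the event timestamps of the first transfer and the final read-back:

// queue with profiling enabled
cl_command_queue queue = clCreateCommandQueue(context, device,
                                              CL_QUEUE_PROFILING_ENABLE, NULL);

cl_event first, last;
clEnqueueWriteBuffer(queue, dA, CL_FALSE, 0, bytes, A, 0, NULL, &first);
// ... remaining transfer(s) and the kernel launch ...
clEnqueueReadBuffer(queue, dC, CL_TRUE, 0, bytes, C, 0, NULL, &last);

cl_ulong t0, t1;
clGetEventProfilingInfo(first, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(last,  CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
double seconds = (t1 - t0)*1.0e-9;   // timestamps are in nanoseconds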
OpenCL: case study. OpenCL vs CUDA on the NVIDIA GTX 280.
OpenCL: case study. OpenCL vs CUDA on the NVIDIA GTX 590: CUDA slower than OpenCL?
OpenCL: case study. OpenCL vs CUDA on the NVIDIA GTX 590, after removing bank conflicts: now the same performance.
OpenCL: case study. Results on the AMD HD 6990.
OpenCL: case study. Comparison of the 3 GPUs in this study.
OpenCL: case study. Conclusions: it is possible to write OpenCL code that performs as well as CUDA code, with the advantage that the code is portable. CUBLAS is faster, but maximum performance wasn't the goal of this study, and the tuning tricks used were Tesla specific, not for Fermi! We could probably tune the code further, but might lose portability.
What others found... Scalable HeterOgeneous Computing (SHOC) Benchmark from ORNL: http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2012-02-20/13-shoc.pdf
More possibilities. Extend or replace the GPUlab library with an OpenCL version of the kernels, thus making the library more portable. Apply our auto-tuning framework to OpenCL kernels. Make OpenCL the default for new projects.
Auto-tuning GPU kernels. Motivation: many of the BLAS2 functions (xGEMV, xSYMV, ...) in libraries like CUBLAS, MAGMA, etc. are optimized for square matrices only; these functions are memory bound, so memory access patterns and work distribution are important. Auto-tuning: parametrize the kernels and tune them for optimal performance at the particular input shapes and sizes, as sketched below.
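A minimal sketch of how such a parametrized search can look in plain OpenCL host code (illustrative only, not the GPUlab framework; context, device, a profiling-enabled queue, the kernel source string src, the device buffers dA/dB/dC and size are assumed to exist, and MatMulLocal is a hypothetical kernel name):

// try a few tile sizes and keep the fastest one for this matrix shape
// (candidate values must respect the device's maximum work-group size)
int tiles[] = { 4, 8, 16 };
int best_tile = tiles[0];
cl_ulong best_time = (cl_ulong)-1;

for (int t = 0; t < 3; t++) {
    char opts[64];
    sprintf(opts, "-DBLOCK_SIZE=%d", tiles[t]);

    // build a fresh variant of the kernel with this tile size
    cl_program prog = clCreateProgramWithSource(context, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, opts, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "MatMulLocal", NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &dA);
    clSetKernelArg(k, 1, sizeof(cl_mem), &dB);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dC);
    clSetKernelArg(k, 3, sizeof(int),    &size);

    size_t global[2] = { (size_t)size, (size_t)size };
    size_t local[2]  = { (size_t)tiles[t], (size_t)tiles[t] };

    // time this variant with an OpenCL event
    cl_event ev;
    clEnqueueNDRangeKernel(queue, k, 2, NULL, global, local, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    cl_ulong t0, t1;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
    if (t1 - t0 < best_time) { best_time = t1 - t0; best_tile = tiles[t]; }

    clReleaseEvent(ev);
    clReleaseKernel(k);
    clReleaseProgram(prog);
}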
Auto-tuning GPU kernels. Comparison with CUBLAS and MAGMA.
The future... "Prediction is very difficult, especially if it is about the future." -- Niels Bohr (1885-1962). In the past we have had accelerators before: transputers, etc.: specialized hardware, not a big market. GPUs are based on a mass-market product. GPU computing will (probably) not go away... but it will develop/change. Keyword: Heterogeneous Computing. What is the language of HC?
"If parallel programming is hard, heterogeneous programming is that hard, squared." -- Michael Wolfe, The Portland Group, Inc. From: The Heterogeneous Programming Jungle, http://www.hpcwire.com/hpcwire/2012-03-19/the_heterogeneous_programming_jungle.html
Acknowledgements. Thanks to... Allan Svejstrup for the OpenCL case study; Allan Engsig-Karup and Morten Gorm Madsen for the water waves work; Nicolai Fog Gade-Nielsen, Martin Høstergaard and Søren Schmidt for the ODF work; Hans Henrik Sørensen for the auto-tuning work; and Nikolaj and Toke, the two students who got us into all this back in 2007.
Thank you! http://gpulab.imm.dtu.dk/