OpenCL: Portability and Performance

1 OpenCL: Portability and Performance. Bernd Dammann, Associate Professor, Scientific Computing, DTU Informatics; HPC architect & consultant, DTU Computing Center, Technical University of Denmark

2 Technical University of Denmark

3 Outline: DTU Informatics (Motivation, Goal, People, Equipment, Projects); OpenCL Case Study; Future

4 Why do we work with GPUs? We could not neglect the development. Our customers asked for it. DTU Informatics has the expertise in-house: Scientific Computing & HPC, Computer Graphics, Embedded Systems Engineering. We were able to attract new students. GPU computing is a hot topic.

5 DTU Informatics Research Proposal: Desktop Scientific Computing on Consumer Graphics Cards. Proposal was accepted by The Danish Council for Independent Research, Technology and Production Sciences, December 2009. Project: May 2010 to May 2013; 2 PhD positions & 1 Postdoc.

6 DTU Informatics People: Prof. Per Chr. Hansen (Algorithms); Assoc. Prof. Bernd Dammann (HPC); Assoc. Prof. John B. Jørgensen (Control); Assoc. Prof. Allan Peter Engsig-Karup (PDEs); Assoc. Prof. Jeppe Revall Frisvad (Graphics); Postdoc Hans Henrik Brandenborg Sørensen; PhDs: Nicolai Fog Gade-Nielsen, Stefan Glimberg; MSc & BSc students

7 DTU Informatics Goal: Industry: FORCE Technology, QuantumWise A/S, Brüel & Kjær, DHI Group, MOSEK ApS, NVIDIA(?)... Academics: Brown University, Rice University, INRIA, RWTH Aachen, Univ. of Antwerp, Copenhagen Univ., Aalborg Univ....

8 DTU Informatics Our equipment:
    Intel Core GHz, 4 GB, NVIDIA GeForce GTX 580 (1.5 GB)
    Intel Core i GHz, 24 GB, NVIDIA Tesla C2070 (6 GB) / GeForce 9500 GT
    Intel Core i GHz, 12 GB, NVIDIA Tesla C2050 (3 GB) / GeForce GT 240
    Intel Xeon 2.40 GHz, 12 GB, 2x NVIDIA GeForce GTX 590 (3 GB)
    Intel Xeon 2.40 GHz, 12 GB, 2x AMD Radeon 6990 (4 GB)

9 DTU Informatics: All our machines are homebuilt. Challenges: finding the right PSUs and getting enough of the right cables. We have access to external resources as well, e.g. an 8-GPU cluster (4 nodes, 2x Nvidia M2050) at DTU.

10 GPUlab: Finished, on-going and planned projects: Solver for non-linear water waves; Fast computational methods for high-resolution ODF problems on many-core systems; Auto-tuning of Dense Linear Algebra on GPUs; GPUlab Library, a high-performance GPU-based library for the development of scientific applications; ...

11 Non-linear water waves: Wave loadings on ships, offshore platforms; Influence of the bottom interaction in coastal regions; Seakeeping & maneuvering; Ship simulator

12 Non-linear water waves: system with close to degrees of freedom can be solved on GPU (4 GB RAM); one iteration of the solver in less than 1 sec; complete re-write of code

13 ODF Reconstruction: Orientation Distribution Functions (ODF) are used in X-ray analysis of material properties. Collaboration with Material Physics group.

14 ODF Reconstruction: Reconstruction of the ODF from CCD images means solving a linear system Ax = b. A is a sparse matrix of the size 4 N^3 elements/row, of which only 4 (3N - 2) are non-zero. With a desired ODF resolution of , the x vector would take up to 16 GB, and the A matrix up to 44 TB in sparse format (CRS)! Solution: continuous re-calculation of the rows (by ray-tracing) and using a CGLS solver. HW: dual-CPU, quad-GPU workstation.
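The "re-calculate the rows, never store A" idea on this slide can be sketched in plain C. This is a minimal illustration, not the GPUlab code: the names `row_fn`, `cgls_rowwise` and `demo_row` are hypothetical, the row callback stands in for the ray tracer, and rows are returned dense for brevity although the real rows are sparse. CGLS only needs the products A p and A^T r, and both can be formed one row at a time.

```c
#include <assert.h>
#include <stdlib.h>

/* Callback that recomputes row i of A into out[0..n-1] on demand. */
typedef void (*row_fn)(int i, int n, double *out);

/* q = A p, generating one row at a time. */
static void apply_A(row_fn make_row, int m, int n, double *row,
                    const double *p, double *q)
{
    for (int i = 0; i < m; i++) {
        make_row(i, n, row);
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += row[j] * p[j];
        q[i] = sum;
    }
}

/* s = A^T r, again touching only one row of A at a time. */
static void apply_At(row_fn make_row, int m, int n, double *row,
                     const double *r, double *s)
{
    for (int j = 0; j < n; j++)
        s[j] = 0.0;
    for (int i = 0; i < m; i++) {
        make_row(i, n, row);
        for (int j = 0; j < n; j++)
            s[j] += r[i] * row[j];
    }
}

static double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* CGLS for min ||A x - b||_2, starting from x = 0.
   Stops after max_iter steps or when ||A^T r||^2 <= tol2.
   Returns 0 on success, -1 if memory allocation fails. */
int cgls_rowwise(row_fn make_row, int m, int n, const double *b,
                 double *x, int max_iter, double tol2)
{
    assert(make_row != NULL && b != NULL && x != NULL && m > 0 && n > 0);
    double *row = malloc((size_t)n * sizeof *row);
    double *s = malloc((size_t)n * sizeof *s);
    double *p = malloc((size_t)n * sizeof *p);
    double *r = malloc((size_t)m * sizeof *r);
    double *q = malloc((size_t)m * sizeof *q);
    if (!row || !s || !p || !r || !q) {
        free(row); free(s); free(p); free(r); free(q);
        return -1;
    }
    for (int j = 0; j < n; j++) x[j] = 0.0;
    for (int i = 0; i < m; i++) r[i] = b[i];      /* r = b - A*0 */
    apply_At(make_row, m, n, row, r, s);
    for (int j = 0; j < n; j++) p[j] = s[j];
    double gamma = dot(s, s, n);

    for (int it = 0; it < max_iter && gamma > tol2; it++) {
        apply_A(make_row, m, n, row, p, q);
        double qq = dot(q, q, m);
        if (qq == 0.0) break;                     /* no further progress */
        double alpha = gamma / qq;
        for (int j = 0; j < n; j++) x[j] += alpha * p[j];
        for (int i = 0; i < m; i++) r[i] -= alpha * q[i];
        apply_At(make_row, m, n, row, r, s);
        double gamma_new = dot(s, s, n);
        double beta = gamma_new / gamma;
        for (int j = 0; j < n; j++) p[j] = s[j] + beta * p[j];
        gamma = gamma_new;
    }
    free(row); free(s); free(p); free(r); free(q);
    return 0;
}

/* Toy stand-in for the ray tracer: rows of A = [[1,0],[0,1],[1,1]]. */
void demo_row(int i, int n, double *out)
{
    assert(n == 2 && i >= 0 && i < 3);
    out[0] = (i == 0 || i == 2) ? 1.0 : 0.0;
    out[1] = (i == 1 || i == 2) ? 1.0 : 0.0;
}
```

Calling `cgls_rowwise(demo_row, 3, 2, b, x, 10, 1e-24)` with b = (1, 2, 3) solves the toy system. Each iteration regenerates every row twice, which trades extra computation for never holding A in memory.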

15 ODF Reconstruction: PCGLS: parallel CGLS on CPU, ray-tracing on CPU or GPU. CUDA CGLS: both CGLS and ray-tracing on GPU. Full lines: single precision; dashed lines: double precision.

16 So far, everything we have done was based on CUDA. So why change to OpenCL?

17 Why OpenCL? By request from students in our HPC courses: "OpenCL instead of CUDA! I don't care it's harder, I care that it's something I will be able to use on every graphics card." Our collaborators don't want to lock themselves to one vendor, or they might have AMD GPUs already. Keyword: Portability! But what about performance?

18 Why OpenCL? Ask Google...

19 OpenCL: case study. Scope: investigate portability and performance; test on different GPU architectures: Nvidia & AMD; compare implementations: CUDA vs OpenCL (Nvidia GPUs only). Full report will be available on-line: A. Svejstrup Nielsen, A.P. Engsig-Karup & BD: Parallel Programming using OpenCL on Modern Architectures, IMM Technical Report , Technical University of Denmark

20 OpenCL: case study

21 OpenCL: case study. The test case: matrix multiplication C = A x B, i.e. C[i][j] = Sum_k A[i][k] * B[k][j]; data type float (single precision); only square matrices considered here.

22 OpenCL: case study, naïve CPU version

    void MatMulCPU(float *A, float *B, float *C, int M, int N, int K)
    {
        int n, m, k;
        /* Row-major storage: C is N x M, A is N x K, B is K x M */
        for (m = 0; m < M; m++)
            for (n = 0; n < N; n++)
                C[n*M + m] = 0.0f;
        for (m = 0; m < M; m++)
            for (k = 0; k < K; k++)
                for (n = 0; n < N; n++)
                    C[n*M + m] += A[n*K + k] * B[k*M + m];
    }

23 OpenCL: case study, naïve OpenCL kernel

    __kernel void MatMulNaive(__global float *A, __global float *B,
                              __global float *C, int size)
    {
        // Retrieve work item global index
        int i = get_global_id(0);
        int j = get_global_id(1);
        // If work item within matrix dimension, do dot product
        if ((i < size) && (j < size)) {
            float tmp = 0.0f;
            int k;
            // Loop inner dimension: every read is an access to global memory
            for (k = 0; k < size; k++)
                tmp = tmp + A[j*size + k] * B[k*size + i];
            C[j*size + i] = tmp;
        }
    }

24 OpenCL: case study, using local memory:

    // Retrieve global & local work item index
    int iG = get_global_id(0);
    int jG = get_global_id(1);
    int iL = get_local_id(0);
    int jL = get_local_id(1);
    // Declare local memory space (must be at kernel function scope)
    __local float localA[BLOCK_SIZE][BLOCK_SIZE];
    __local float localB[BLOCK_SIZE][BLOCK_SIZE];
    float tmp = 0.0f;
    for (int n = 0; n < get_num_groups(0); n++) {
        // Load one tile of A and B from global memory
        localA[jL][iL] = A[jG*size + n*BLOCK_SIZE + iL];
        localB[jL][iL] = B[(n*BLOCK_SIZE + jL)*size + iG];
        // Ensure synchronization between all work items in work-group
        barrier(CLK_LOCAL_MEM_FENCE);
        // Dot product: the work loop accesses local memory only
        for (int k = 0; k < BLOCK_SIZE; k++)
            tmp += localA[jL][k] * localB[k][iL];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // Transfer result back to global memory
    C[jG*size + iG] = tmp;
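To see that the tiled loop structure computes the same product as the naïve version, the following host-side C sketch emulates one work-group per output tile. It is an illustration only: `TILE`, `matmul_ref` and `matmul_tiled` are hypothetical names, the two small arrays play the role of local memory, and the load and compute phases correspond to the code before and after the first barrier.

```c
#include <assert.h>

#define TILE 4  /* hypothetical tile width, plays the role of the block size */

/* Reference: C = A * B for row-major size x size matrices. */
void matmul_ref(const float *A, const float *B, float *C, int size)
{
    for (int j = 0; j < size; j++)
        for (int i = 0; i < size; i++) {
            float tmp = 0.0f;
            for (int k = 0; k < size; k++)
                tmp += A[j * size + k] * B[k * size + i];
            C[j * size + i] = tmp;
        }
}

/* Tiled version: each (jb, ib) pair plays the role of one work-group.
   Assumes size is a positive multiple of TILE. */
void matmul_tiled(const float *A, const float *B, float *C, int size)
{
    assert(size > 0 && size % TILE == 0);
    float tileA[TILE][TILE], tileB[TILE][TILE], acc[TILE][TILE];

    for (int jb = 0; jb < size; jb += TILE)
        for (int ib = 0; ib < size; ib += TILE) {
            for (int jl = 0; jl < TILE; jl++)
                for (int il = 0; il < TILE; il++)
                    acc[jl][il] = 0.0f;

            for (int n = 0; n < size / TILE; n++) {
                /* Load phase: copy one tile of A and one tile of B. */
                for (int jl = 0; jl < TILE; jl++)
                    for (int il = 0; il < TILE; il++) {
                        tileA[jl][il] = A[(jb + jl) * size + n * TILE + il];
                        tileB[jl][il] = B[(n * TILE + jl) * size + ib + il];
                    }
                /* Compute phase: partial dot products from the tiles only. */
                for (int jl = 0; jl < TILE; jl++)
                    for (int il = 0; il < TILE; il++)
                        for (int k = 0; k < TILE; k++)
                            acc[jl][il] += tileA[jl][k] * tileB[k][il];
            }

            /* Write the finished tile back. */
            for (int jl = 0; jl < TILE; jl++)
                for (int il = 0; il < TILE; il++)
                    C[(jb + jl) * size + ib + il] = acc[jl][il];
        }
}
```

Each element of A and B is read from the large arrays once per tile pass instead of once per multiply, which is the saving the local-memory kernel aims for on the GPU.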

25 OpenCL: case study First results AMD HD

26 OpenCL: case study. More improvements: FAST1: using a 16x16 tile of A in local memory and a 1x16 tile of B in registers, (16,1) work-groups/thread-blocks. FAST2: increasing occupancy, like FAST1 but with (64,1) work-groups. FAST3: reducing communication + loop unrolling (by the compiler).

27 OpenCL: case study. Comments: The tuning techniques and ideas shown are based on our experiences with the Nvidia Tesla (aka GT200) architecture. Since the Nvidia Fermi architecture has more features, e.g. more caches, we would need to apply other tricks as well. All timings include data transfers, i.e. host to device and device to host.

28 OpenCL: case study OpenCL vs CUDA on Nvidia GTX

29 OpenCL: case study. OpenCL vs CUDA on Nvidia GTX 590: CUDA slower than OpenCL?

30 OpenCL: case study. OpenCL vs CUDA on Nvidia GTX 590 after removing bank conflicts: now the same performance.
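The slides do not say how the bank conflicts were removed. One common technique, shown here purely as an assumption, is to pad each row of a local-memory tile by one word. The small C model below (the function name `max_bank_conflict` is hypothetical) counts how many work-items land in the same bank when they read one column of a row-major float tile, assuming successive 32-bit words map to successive banks (32 banks on Fermi, 16 on GT200).

```c
#include <assert.h>

/* Worst-case number of work-items hitting the same bank when `lanes`
   work-items each read element [lane][col] of a row-major float tile
   with `row_stride` words per row. A result of 1 means conflict-free. */
int max_bank_conflict(int lanes, int row_stride, int col, int num_banks)
{
    assert(lanes > 0 && row_stride > 0 && col >= 0);
    assert(num_banks > 0 && num_banks <= 64);
    int hits[64] = {0};
    int worst = 0;
    for (int lane = 0; lane < lanes; lane++) {
        int word = lane * row_stride + col;   /* word offset in the tile */
        int bank = word % num_banks;          /* bank serving that word */
        hits[bank]++;
        if (hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

In this model a 32-wide tile read column-wise serializes all 32 accesses, while the same tile declared with 33 words per row spreads them over 32 different banks.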

31 OpenCL: case study Results on AMD HD

32 OpenCL: case study. Comparison of the 3 GPUs in this study

33 OpenCL: case study. Conclusions: It is possible to write OpenCL code that performs equally well as CUDA code. Advantage: code is portable. CUBLAS is faster, but max. performance wasn't the goal of this study. Tuning tricks were Tesla specific, not for Fermi! We could probably tune the code more, but might lose portability.

34 What others found... Scalable HeterOgeneous Computing (SHOC) Benchmark from ORNL

35 More possibilities: Extend or replace the GPUlab library with an OpenCL version of the kernels, thus making the library more portable. Apply our auto-tuning framework to OpenCL kernels. Make OpenCL the default for new projects.

36 Auto-tuning GPU kernels. Motivation: many of the BLAS2 functions (xgemv, xsymv,...) in libraries like CUBLAS, MAGMA, etc. are optimized for square matrices only. Those functions are memory bound, thus memory access and work distribution are important. Auto-tuning: parametrize and tune kernels for optimal performance at particular input shapes and sizes.
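The core of such an auto-tuner can be sketched as an exhaustive search over one kernel parameter for a given input shape. This is a minimal sketch and not the framework from the talk: `cost_fn`, `autotune_block_size` and `toy_cost` are hypothetical names, and `toy_cost` is a synthetic stand-in for actually launching and timing a kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the measured cost (e.g. run time) of one parameter choice
   for a rows x cols input. Lower is better. */
typedef double (*cost_fn)(int block_size, int rows, int cols);

/* Try every candidate block size and keep the cheapest one. */
int autotune_block_size(const int *candidates, size_t count,
                        int rows, int cols, cost_fn measure)
{
    assert(candidates != NULL && count > 0 && measure != NULL);
    int best = candidates[0];
    double best_cost = measure(best, rows, cols);
    for (size_t i = 1; i < count; i++) {
        double cost = measure(candidates[i], rows, cols);
        if (cost < best_cost) {
            best_cost = cost;
            best = candidates[i];
        }
    }
    return best;
}

/* Synthetic cost model (an assumption, not measured data): a fixed
   charge per block plus a charge for idle threads in the last block. */
double toy_cost(int block_size, int rows, int cols)
{
    int blocks = (rows + block_size - 1) / block_size;
    int idle = blocks * block_size - rows;
    return (double)cols * (blocks * 8.0 + idle);
}
```

With this model a short, wide matrix and a tall one select different block sizes, which is the shape dependence the slide argues for. In practice `measure` would wrap a timed kernel launch and the search would usually cover several parameters.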

37 Auto-tuning GPU kernels

38 Auto-tuning GPU kernels

39 Auto-tuning GPU kernels

40 Auto-tuning GPU kernels. Comparison with CUBLAS and MAGMA:

41 The future... "Prediction is very difficult, especially if it is about the future." -- Niels Bohr ( ). In the past: we have had accelerators before: transputers, etc.; specialized hardware, not a big market. GPUs are based on a mass market product. GPU computing will (probably) not go away... but it will develop/change. Keyword: Heterogeneous Computing. What is the language of HC?

42 "If parallel programming is hard, heterogeneous programming is that hard, squared." Michael Wolfe, The Portland Group, Inc. From: The Heterogeneous Programming Jungle

43 Acknowledgements. Thanks to... Allan Svejstrup for the OpenCL case study; Allan Engsig-Karup and Morten Gorm Madsen for the water waves work; Nicolai Fog Gade-Nielsen, Martin Høstergaard and Søren Schmidt for the ODF work; Hans Henrik Sørensen for the Auto-Tuning work; Nikolaj and Toke, the two students who got us into all that back in

44 Thank you!
