GPGPUs, CUDA and OpenCL
Timo Lilja
January 21, 2010

Course arrangements
- Course code: T-106.5800 Seminar on Software Techniques
- Credits: 3
- Thursdays 15-16 at A232, lecture period III only
- Attendance is mandatory, but you can skip 1 session
- Presentation
  - One-hour presentation
  - Two presentations per session
- Programming project
  - Small programming project on a given topic, or on your own topic if you haven't received credits for it from some other course
  - The goal is to parallelize the given program
  - You can choose whether you want to use Cuda or OpenCL
  - We provide a development environment for this programming project. More information will be announced later; check the wiki page
- Check the course wiki page: http://wiki.tkk.fi/display/gpgpuk2010/home

Contents
1. Introduction
2. NVidia
   - Hardware
   - Cuda
3. OpenCL
4. Conclusion

Why GPGPU?
- In many scientific computing efforts, GPGPU can offer a hundredfold increase in performance, a tenfold decrease in price and a threefold increase in power efficiency over a traditional CPU.
- Business opportunities in various fields: medical technology, data mining, ...

What is a GPGPU?
- Original application in computer graphics and games
- General-Purpose Computing on Graphics Processing Units
- Origins in programmable vertex and fragment shaders
- The first GPGPU programs were written using normal graphics APIs in the late 90s
- In the early 2000s: first programmable shaders, fully programmable GPU cores
- Ca. 2005: first fully programmable shaders

Parallel Computing Architectures
- According to Flynn's taxonomy, defined in 1966 by Michael J. Flynn.
- Pictures by Colin M.L. Burnett

Stream Processing
- Programming paradigm related to SIMD
- Given a stream of data and a series of operations, called kernel functions
- The kernel function is applied to all elements of a stream concurrently
- Memory is very hierarchical: local memory is easily accessible, global memory is much more expensive
- Memory accesses usually happen in bulk, so memory is optimized for high bandwidth and not for low latency

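As a rough illustration of the kernel idea, the plain C sketch below applies one kernel function to every element of a stream. The names `scale_kernel` and `apply_kernel` are hypothetical and not from the slides, and the loop runs serially here; on a stream processor each element would be handled by its own processing element concurrently.

```c
#include <stddef.h>

/* Hypothetical kernel: works on one stream element, independent of all others. */
float scale_kernel(float x)
{
    return 2.0f * x + 1.0f;
}

/* Apply the kernel to every element of the input stream.
 * No iteration depends on another one, so a stream processor could
 * execute all of them concurrently; this sketch runs them in order. */
void apply_kernel(const float *in, float *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = scale_kernel(in[i]);
}
```

The important property is the absence of dependencies between iterations: that is what lets the hardware trade control logic and caches for more ALUs.
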
GPU vs. CPU
- To support SIMD parallelism, ALUs must be abundant, whereas control logic and data caches are not needed that much

NVIDIA GPU
- Implementation of a stream processor system
- Unified architecture: vertex, pixel and other shaders use the same GPU facilities
- Highly hierarchical hardware
  - Streaming processor core (SP)
  - Streaming multiprocessor (SM)
  - Texture/processor cluster (TPC)
  - Streaming processor array (SPA)
- Limitations and differences when compared to CPU

Streaming Multiprocessor (SM)
- 8 Streaming Processor (SP) cores
  - scalar multiply-add (MAD) and ALU units
  - single-precision floats and ALU operations in 4 cycles
- Fused Multiply-Add unit (FMAD)
  - IEEE 754R double-precision floating point
  - 1 per multiprocessor: double-precision floats are slow
- 2 special function units (SFU)
  - provide transcendental functions
  - other complex functions: reciprocal
  - high latencies: 16-32 cycles or more
- Low-latency interconnect network between SPs and shared-memory banks
- Multi-threaded instruction fetch and issue unit
- Caches: instruction cache and read-only constant cache
- 16K read/write shared memory

Texture/Processor Cluster (TPC)
- Geometry controller: maps the operations onto Streaming Multiprocessors
- Provides a 2-dimensional texture cache that uses (x, y) spatial locality
- Streaming multiprocessor (SM) controller
- Older NVidia cards (G80) have 2 SMs per TPC, newer ones (GT200) have 3 SMs per TPC

Streaming Processor Array
[Figure]

Memory and other features
- Memory is highly hierarchical and cached
  - Thread-local memory
  - Shared memory, which is shared inside a Streaming Multiprocessor (SM)
  - Global memory, which is accessible to all threads
- Other units are mainly used for computer graphics
  - Texture unit
  - Rasterization: raster operations processor (ROP)

Die micrograph
[Figure]

Hardware limitations
- Branching can cause the program to run fully sequentially (threads that take different branches are serialized)
- Double-precision floating point numbers are slow
- Bus bandwidth between CPU and GPU can become a bottleneck

Current hardware specifications
[Table]

Cuda
- Compute Unified Device Architecture
- NVidia's proprietary stream programming language
- Available for Linux, Mac OS X and Windows
- Current release 2.3, first release in 2007
- C for Cuda
  - Compiled through Pathscale's Open64 C compiler
  - Standard C with kernel extensions
- Cuda driver API
  - Standard C API interface
  - Kernels are explicitly loaded
- Cuda toolkit
  - Includes compiler, profiler, debugger, manual pages, runtime libraries
- Cuda SDK
  - Various code examples, some extra libraries

Programming Cuda (1/2)
Consider adding two vectors A and B and storing the result in C.

In ordinary C (N is the vector length, defined elsewhere):

    void VecAdd(float *A, float *B, float *C)
    {
        int i;
        for (i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }

In Cuda:

    __global__ void VecAdd(float *A, float *B, float *C)
    {
        int i = threadIdx.x;  /* one thread per element */
        C[i] = A[i] + B[i];
    }

Programming Cuda (2/2)
- In order to run a parallel program
  1. Data must be copied to the GPU
  2. The kernel must be invoked from the CPU code with special syntax
  3. The data must be copied back to the CPU
- The language used in Cuda kernels is limited
  - recursion is not supported
  - function pointers cannot be used
  - a few other restrictions are documented in the Cuda programming manual
- See example/cuda/vec.cu

Processing flow on CUDA
[Figure: processing flow on CUDA. Picture by Tosaka]

Threads, Blocks and Grids (1/2)
- Threads
  - each thread performs a single scalar operation per cycle
- Thread blocks
  - can be 1-, 2- or 3-dimensional
  - can communicate through shared memory
  - can synchronize through __syncthreads()
  - at most 512 threads per block
  - are executed as 32-thread warps on a single SM
- Grids
  - a kernel can be executed by multiple thread blocks
  - thread blocks are organized into a 1- or 2-dimensional grid, which can be used for indexing the blocks
- Kernel invocation syntax is kernel<<<dimGrid, dimBlock>>>(args);
  - dimGrid can be 1- or 2-dimensional
  - dimBlock can be 1D, 2D or 3D

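The arithmetic behind this hierarchy can be sketched in plain C. The helper names below are hypothetical and only mirror the index computation a kernel would do with its built-in block and thread indices; the warp size of 32 and the rounding-up are taken from the limits listed above.

```c
/* Warp size on the hardware discussed in these slides. */
#define WARP_SIZE 32

/* Global 1D index of a thread: block index times block size plus thread index. */
int global_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* Number of warps needed to execute one block (rounded up). */
int warps_per_block(int block_dim)
{
    return (block_dim + WARP_SIZE - 1) / WARP_SIZE;
}

/* Number of blocks needed so that blocks * block_dim >= n (rounded up). */
int blocks_for(int n, int block_dim)
{
    return (n + block_dim - 1) / block_dim;
}
```

For example, a block of 100 threads still occupies 4 warps, so block sizes that are multiples of 32 avoid partially filled warps.
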
Threads, blocks and grids (2/2)
[Figure]

Example: matrix addition (1/2)
In normal C:

    void addmatrix(float *a, float *b, float *c, int N)
    {
        int i, j, idx;
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                idx = i + j*N;
                c[idx] = a[idx] + b[idx];
            }
        }
    }

    int main(void)
    {
        ...
        addmatrix(a, b, c, N);
    }

Example: matrix addition (2/2)
In Cuda:

    __global__ void addmatrixg(float *a, float *b, float *c, int N)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        int j = blockIdx.y*blockDim.y + threadIdx.y;
        int idx = i + j*N;
        if (i < N && j < N)
            c[idx] = a[idx] + b[idx];
    }

    int main(void)
    {
        dim3 dimBlock(blocksize, blocksize);
        /* assumes N is a multiple of blocksize */
        dim3 dimGrid(N/dimBlock.x, N/dimBlock.y);
        addmatrixg<<<dimGrid, dimBlock>>>(a, b, c, N);
    }

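The guard `i < N && j < N` in the kernel only matters when the grid covers more threads than there are matrix elements, which happens if the grid size is rounded up for an N that is not a multiple of the block size. The serial C emulation below is a sketch of that case; `emulate_addmatrix` is a hypothetical name, and the four nested loops merely stand in for the blocks and threads that the GPU would run in parallel.

```c
/* Serial emulation of a 2D kernel launch for N x N matrix addition.
 * Returns the number of matrix elements that were written. */
int emulate_addmatrix(const float *a, const float *b, float *c,
                      int N, int blocksize)
{
    /* Round the grid size up so that a partial last block is still launched. */
    int grid = (N + blocksize - 1) / blocksize;
    int written = 0;
    int bx, by, tx, ty;

    for (by = 0; by < grid; by++)
    for (bx = 0; bx < grid; bx++)
    for (ty = 0; ty < blocksize; ty++)
    for (tx = 0; tx < blocksize; tx++) {
        int i = bx * blocksize + tx;   /* column index of this "thread" */
        int j = by * blocksize + ty;   /* row index of this "thread" */
        if (i < N && j < N) {          /* threads outside the matrix do nothing */
            int idx = i + j * N;
            c[idx] = a[idx] + b[idx];
            written++;
        }
    }
    return written;
}
```

With N = 5 and a block size of 4, a truncating division would launch a single block and leave the last row and column untouched, whereas the rounded-up grid launches 2 x 2 blocks and relies on the guard to skip the surplus threads.
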
Compiling Cuda
- C for Cuda code is compiled with the nvcc compiler and its file extension is .cu
- The host code is compiled to native x86
- The device code is first compiled to Parallel Thread Execution (PTX) assembler and then to the cubin binary format
  - PTX is a pseudo-assembler with an arbitrarily large register set
  - almost entirely in SSA form
- The NVidia graphics card driver loads cubin code, or compiles the PTX code, and executes it
- With the Cuda C driver API it is possible to upload your own, non-nvcc-generated cubin code to the driver
  - Used in some HLLs to provide Cuda support

Other features
- Asynchronous execution
- Memory hierarchy: Device Memory, Shared Memory, Page-Locked Host Memory
- Error handling
- Multiple devices
- Debugger, profiler and the device emulation mode
- Performance tuning
- Check the SDK examples!

OpenCL
- Open Computing Language was initially developed by Apple
- Now developed by the Khronos Group; OpenCL 1.0 was published on December 8, 2008
- Both AMD and NVidia support OpenCL 1.0 as of late 2009
- Apple's implementation is based on the LLVM compiler framework
- OpenCL is a fully open standard
- The goal is to support GPGPUs, Cells, DSPs
- The OpenCL language is based on C99

Terminology
- An OpenCL host is the machine controlling one or more OpenCL devices
- A device consists of one or more computing cores
- A computing core consists of one or more processing elements
- Processing elements execute code as SIMD or SPMD (Single Program, Multiple Data; ordinary OpenMP-style multitasking)
- A program consists of one or more kernels
- Computation domains can be 1-, 2- or 3-dimensional
- Work-items execute kernels in parallel and are grouped into local workgroups
- Synchronization can be done only within a workgroup

Memory Model
- Like in Cuda, memory is hierarchical
  - Private memory is per work-item
  - Local memory is shared within a workgroup
  - Global/Constant memory: unsynchronized
  - Host memory
- Memory must be copied between host, global and local memory

Objects and Running OpenCL
- Setup
  1. Choose the device (GPU, CPU, Cell)
  2. Create a context, which is a collection of devices
  3. Create and submit work into a queue
- Memory consists of
  - buffers, which can be accessed freely (read/write)
  - images, which can be either read or written in a kernel, not both, and can be accessed only by specific functions
- Work is run asynchronously; synchronous access requires blocking API calls

OpenCL kernel language
- Based on ISO C99
  - No function pointers, recursion, variable-length arrays, bit fields
- Syntax and other additions
  - work-items and workgroups
  - vector types and operations
  - synchronization
  - address space qualifiers

OpenCL, CUDA and Linux
- Compiled with ordinary GCC and linked against Cuda's libOpenCL
- Kernels must be embedded into C strings or loaded from external files through the OpenCL API
  - In Cuda, kernels are precompiled and linked to the binary
- The NVidia Cuda SDK (at least in 3.0) has lots of OpenCL examples
- Kernel syntax is different!

Conclusion
- GPGPUs provide one form of parallelism, namely SIMD
- Multi-core CPUs provide MIMD parallelism
- Will the future merge these two into a single platform?
- NVidia Cuda is a strongly stream-processing, SIMD implementation, whereas OpenCL is far more generic, supporting both SIMD and SPMD/MIMD
- What kind of applications will benefit from GPGPU stream processing, and who?
  - Will it make Office applications run faster?
  - Will it benefit the average user? The average programmer? The average scientist?
  - At least it will benefit the average gamer

References
- GPUs and CUDA. http://www.cis.temple.edu/~ingargio/cis307/readings/cuda.html
- NVIDIA's GT200: Inside a Parallel Processor. http://www.realworldtech.com/page.cfm?articleid=rwt090808195242&p=1
- NVidia Cuda Programming Guide 2.3
- Building NVIDIA's GT200. http://www.anandtech.com/video/showdoc.aspx?i=3334&p=2
- ixbt Labs: NVIDIA CUDA. http://ixbtlabs.com/articles3/video/cuda-1-p6.html
- Lindholm et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, vol. 28, no. 2, March/April 2008
- NVidia OpenCL JumpStart Guide
- Khronos Group's OpenCL overview

Possible topics (1/2)
- How to optimize matrix multiplications in Cuda/OpenCL
  - Section 3.2.2 in the Cuda Programming Guide
  - Starting from CPU multiplication and ending up in GPGPU
  - Benchmarking after each optimization step
- Performance tuning and best practices
  - Cuda/OpenCL Best Practices Guide
- Cuda and OpenCL
  - API comparison
  - Performance evaluation
- User experiences and example applications
  - NVidia's Cuda/OpenCL SDK
  - Other applications
  - Libraries and tools already ported to Cuda/OpenCL

Possible topics (2/2)
- AMD
  - Hardware overview and comparison to NVidia
  - Overview of AMD's implementation of OpenCL
  - AMD is currently leading in GPU performance
- High-level languages and GPGPU
  - Python bindings for both Cuda and OpenCL
  - C++, FP languages, Matlab
- GPGPU IDEs and development tools
- Future GPGPU trends
  - Merging CPU and GPGPU: will it happen?

Arrangements once more
- Need two presentations for the next session
  - Next session on Jan 28th or Feb 4th?
- For the rest: e-mail me (timo.lilja@tkk.fi) suitable times and topic suggestions ASAP
- You can also suggest your own programming project topic by e-mailing me
- Check the wiki pages; I will add instructions on how to use Cuda/OpenCL in the course server environment: http://wiki.tkk.fi/display/gpgpuk2010/running+cuda+and+OpenCL+in+course+server
- It would be interesting to get some instructions on running AMD's OpenCL stack into the course wiki as well

Helmholtz Differential Equation (1/2)
- An elliptic partial DE, in general form:
  \[ \nabla^2 \psi + k^2 \psi = 0 \]
- The height of the wave V at coordinates (x, y) accelerates towards the wave height of the adjacent places (x - d, y), (x + d, y), (x, y - d), (x, y + d):
  \[ D_t^2 V(x, y) = C \left( \frac{V(x - d, y) + V(x + d, y) + V(x, y - d) + V(x, y + d)}{4} - V(x, y) \right) \]
- Add a little friction, \( -F\, D_t V(x, y) \), and an impulse, \( +I(t, x, y) \):
  \[ D_t^2 V(x, y) = C \left( \frac{V(x - d, y) + V(x + d, y) + V(x, y - d) + V(x, y + d)}{4} - V(x, y) \right) - F\, D_t V(x, y) + I(t, x, y) \]

Helmholtz Differential Equation (2/2)
- The coefficient C corresponds to the conductivity of the material
  - C = 0: the wave can't penetrate this material
- The coefficient F corresponds to the friction of the material
  - F = 0: no friction, the wave continues forever
- d is the distance between two points in the discretized space
  - We set d = 1 and adjust C and F correspondingly, and store V(x, y) in a two-dimensional table
- For a numerical solution to be at least somewhat accurate, C < 1 and the wavelength > 4

Solving the DE Numerically (1/3)
- Given a (set of) DE \( y' = y'(t, y) \)
- Euler's algorithm:
  \[ y_{t+h} = y_t + h\, y'(t, y_t) \]
  - follow the tangent for a step h
  - very inaccurate

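A minimal C sketch of Euler's algorithm follows. The names `deriv_fn`, `euler_step`, `euler_integrate` and the test problem `growth` (y' = y, exact solution e^t) are assumptions made for illustration; they do not come from the course material.

```c
/* Right-hand side of the ODE: returns y'(t, y). */
typedef double (*deriv_fn)(double t, double y);

/* One Euler step: follow the tangent at (t, y) for a step of length h. */
double euler_step(deriv_fn f, double t, double y, double h)
{
    return y + h * f(t, y);
}

/* Take `steps` Euler steps of fixed length h, starting from (t0, y0). */
double euler_integrate(deriv_fn f, double t0, double y0, double h, int steps)
{
    double t = t0, y = y0;
    int i;
    for (i = 0; i < steps; i++) {
        y = euler_step(f, t, y, h);
        t += h;
    }
    return y;
}

/* Illustrative test problem (an assumption): y' = y with y(0) = 1. */
double growth(double t, double y)
{
    (void)t;   /* this right-hand side does not depend on t */
    return y;
}
```

Calling `euler_integrate(growth, 0.0, 1.0, 0.1, 10)` gives 1.1^10, about 2.594, while the exact value at t = 1 is e, about 2.718, which shows how quickly the error accumulates.
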
Solving the DE Numerically (2/3)
- Runge-Kutta algorithm, one variation:
  \[ y_{t+h} = y_t + h\, \frac{\alpha + 2\beta + 2\gamma + \delta}{6} \]
  where
  \[ \alpha = y'(t, y_t), \quad \beta = y'\!\left(t + \tfrac{h}{2},\; y_t + \tfrac{h}{2}\alpha\right), \quad \gamma = y'\!\left(t + \tfrac{h}{2},\; y_t + \tfrac{h}{2}\beta\right), \quad \delta = y'(t + h,\; y_t + h\gamma) \]
- (Even better algorithms exist, especially for stiff problems like Helmholtz, but they are ignored here.)

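The Runge-Kutta variation above can be sketched in C as follows. As before, `deriv_fn`, `rk4_step` and the test problem `growth` (y' = y) are hypothetical names chosen for illustration; the four slopes `ka` to `kd` correspond to α, β, γ and δ.

```c
/* Right-hand side of the ODE: returns y'(t, y). */
typedef double (*deriv_fn)(double t, double y);

/* One step of the fourth-order Runge-Kutta variation with step length h. */
double rk4_step(deriv_fn f, double t, double y, double h)
{
    double ka = f(t, y);                            /* alpha: slope at the start */
    double kb = f(t + h / 2.0, y + h / 2.0 * ka);   /* beta: slope at the midpoint, using alpha */
    double kc = f(t + h / 2.0, y + h / 2.0 * kb);   /* gamma: slope at the midpoint, using beta */
    double kd = f(t + h, y + h * kc);               /* delta: slope at the end, using gamma */
    return y + h * (ka + 2.0 * kb + 2.0 * kc + kd) / 6.0;
}

/* Illustrative test problem (an assumption): y' = y with y(0) = 1. */
double growth(double t, double y)
{
    (void)t;   /* this right-hand side does not depend on t */
    return y;
}
```

For y' = y a single step reproduces the Taylor series of e^h up to and including the h^4 term, so with h = 0.1 the result stays far closer to the exact solution than Euler's algorithm with the same step.
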
Solving the DE Numerically (3/3)
- If the DE is of a higher degree, we can normalize it into a first-order system by introducing \( V' \) for the time derivative of V:
  \[ D_t V(x, y) = V'(x, y) \]
  \[ D_t V'(x, y) = C \left( \frac{V(x - d, y) + \ldots}{4} - V(x, y) \right) - F\, D_t V(x, y) \]

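One way to turn the normalized system into code is sketched below in plain C, with d = 1 and V stored in a two-dimensional table laid out row by row. Several choices here are assumptions rather than part of the slides: the names `cell` and `wave_step`, scalar C and F for the whole grid instead of per-material values, zero values outside the grid, no impulse term, and plain Euler time stepping with step `dt` instead of Runge-Kutta.

```c
/* Read V at (x, y); cells outside the w x h grid are treated as 0 (an assumption). */
double cell(const double *V, int w, int h, int x, int y)
{
    if (x < 0 || x >= w || y < 0 || y >= h)
        return 0.0;
    return V[x + y * w];
}

/* One explicit Euler step of the first-order system
 *   D_t V  = Vp
 *   D_t Vp = C * (average of the 4 neighbours - V) - F * Vp
 * with grid spacing d = 1. Results are written to V_out and Vp_out,
 * which must not alias the input arrays. */
void wave_step(const double *V, const double *Vp,
               double *V_out, double *Vp_out,
               int w, int h, double C, double F, double dt)
{
    int x, y;
    for (y = 0; y < h; y++) {
        for (x = 0; x < w; x++) {
            int idx = x + y * w;
            double avg = (cell(V, w, h, x - 1, y) + cell(V, w, h, x + 1, y) +
                          cell(V, w, h, x, y - 1) + cell(V, w, h, x, y + 1)) / 4.0;
            double accel = C * (avg - V[idx]) - F * Vp[idx];
            V_out[idx]  = V[idx]  + dt * Vp[idx];
            Vp_out[idx] = Vp[idx] + dt * accel;
        }
    }
}
```

Every cell reads only the old arrays and writes only its own output element, so the body of the double loop is a natural candidate for one GPU thread per cell when parallelizing.
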