Lecture 1: an introduction to OpenCL
Mike Giles
mike.giles@maths.ox.ac.uk
Oxford University Mathematical Institute
Oxford e-Research Centre
Edited from the CUDA originals by Tom Deakin
Overview
- hardware view
- software view
- OpenCL programming
Hardware view
At the top level, a PCIe graphics card with a many-core GPU and high-speed graphics (device) memory sits inside a standard PC/server with one or two multicore CPUs:
(diagram: motherboard with CPUs and DDR3 memory, connected over PCIe to a graphics card with GDDR5 memory)
Hardware view
There are multiple GPU products out at the moment.
Consumer graphics cards (GeForce):
- GTX 680: 1536 cores, 2/4 GB (£360/440)
- GTX 690: 2x1536 cores, 2x2 GB (£800)
- AMD Radeon HD 7970 GHz Ed.: 2048 stream proc., 3 GB (£420)
Dedicated HPC cards (no graphics output):
- K10 module: 2x1536 cores, 2x4 GB
- K20 card: 2496 cores, 5 GB
- K20X module: 2688 cores, 6 GB
Hardware view
We take a brief look at the NVIDIA GPU design:
- building block is a streaming multiprocessor (SMX):
  - 192 cores and 64K registers
  - 64 KB of shared memory / L1 cache
  - 8 KB cache for constants
  - 48 KB texture cache for read-only arrays
  - up to 2K threads per SMX
- different chips have different numbers of these SMXs:

    product       SMXs   bandwidth   memory   power
    GTX 650 Ti      4     86 GB/s    1/2 GB   110W
    GTX 680         8    190 GB/s    2/4 GB   195W
    K10 (x2)        8    160 GB/s    4 GB     110W
    K20X           14    250 GB/s    6 GB     235W
Hardware View
(diagram: Kepler GPU with multiple SMX units sharing an L2 cache; each SMX has its own L1 cache / shared memory)
Hardware View
(diagram: Fermi GPU with multiple SM units sharing an L2 cache; each SM has its own L1 cache / shared memory)
Multithreading
Key hardware feature is that the cores in an SMX are SIMT (Single Instruction Multiple Threads) cores:
- all cores execute the same instructions simultaneously, but with different data
- similar to vector computing on CRAY supercomputers
- minimum of 32 threads all doing the same thing at (almost) the same time
- natural for graphics processing and much scientific computing
- SIMT is also a natural choice for many-core chips to simplify each core
Multithreading
Lots of active threads is the key to high performance:
- no "context switching"; each thread has its own registers, which limits the number of active threads
- threads become inactive whilst waiting for data, or when part of the compute group takes a divergent path (if statements)
Multithreading
- for each thread, one operation completes long before the next starts
- avoids the complexity of pipeline overlaps which can limit the performance of modern processors
(diagram: operations 1-5 from successive threads interleaved in time)
- memory access from device memory has a delay of 400-600 cycles; with 40 threads this is equivalent to 10-15 operations, so hopefully there's enough computation to hide the latency
Software view
At the top level, we have a master process which runs on the CPU and performs the following steps:
1. initialises compute device
2. defines problem domain
3. allocates memory in host and on device
4. copies data from host to device memory
5. launches execution kernel on device
6. copies data from device memory to host
7. repeats 4-6 as needed
8. de-allocates all memory and terminates
Software view
At a lower level, within the compute device:
- each compute device is composed of multiple work-groups
- each work-group is composed of multiple work-items
- all work-items execute an instance of the kernel simultaneously
- all work-items within one work-group can access shared local memory, but can't see what work-items in other work-groups are doing
- relaxed consistency memory model, i.e. the state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times
Platform model
(diagram: OpenCL platform model)
OpenCL
OpenCL (Open Computing Language) is a program development environment maintained by Khronos:
- based on C functions to set up and communicate with the compute device
- compute kernels are written in C; OpenCL provides some common functions as part of the runtime API
- alternative C++ API available for host code
- open royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors
Installing OpenCL
2 components:
- driver
  - low-level software that controls the graphics card
- implementation
  - often packaged as an SDK, containing tools and examples
  - vendor-specific OpenCL library and include files
- vendors include AMD, NVIDIA, Intel, Apple, etc.
OpenCL programming
Already explained that an OpenCL program has two pieces:
- host code on the CPU which interfaces to the device
- kernel code which runs on the device
At the host level, there is a choice of 2 APIs (Application Programming Interfaces):
- C: the original API
- C++: built on top of the C API
We will mostly use the C API in this course; the latest C++ API looks promising and simplifies the host code, but it is useful to know what is really going on.
OpenCL programming
At the host code level, there are library routines for:
- device initialisation
- memory allocation on the graphics card
- data transfer to/from device memory
  - constants
  - images
  - ordinary data
- programs and kernels
- command queues
OpenCL programming
(diagram)
OpenCL programming
The boilerplate is similar between each program you write. It looks complicated, but DON'T PANIC!
1. Define the platform
- Get platform: clGetPlatformIDs()
- Discover devices within platform: clGetDeviceIDs()
- Create context for device: clCreateContext()
- Create command queue to feed device: clCreateCommandQueue()
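A minimal C sketch of step 1 (not from the original slides), assuming a single platform with one GPU device; error checking omitted for brevity:

    #include <CL/cl.h>

    cl_platform_id   platform;
    cl_device_id     device;
    cl_context       context;
    cl_command_queue queue;
    cl_int           err;

    // get the first available platform and a GPU device on it
    err = clGetPlatformIDs(1, &platform, NULL);
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    // create a context for that device, then a command queue to feed it
    context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    queue   = clCreateCommandQueue(context, device, 0, &err);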
OpenCL programming
2. Create and build the program
- Build program object: clCreateProgramWithSource()
- Compile program to build library of kernels: clBuildProgram()
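A sketch of step 2, assuming the kernel source is held in a C string kernel_source (an illustrative name, not from the slides):

    cl_program program;

    // create a program object from the kernel source string
    program = clCreateProgramWithSource(context, 1,
                      (const char **) &kernel_source, NULL, &err);

    // compile it for the chosen device; on failure, clGetProgramBuildInfo()
    // can be used to retrieve the build log
    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);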
OpenCL programming
3. Set up memory objects
- Allocate and initialise input vectors on host
- Define OpenCL memory objects: clCreateBuffer()
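A sketch of step 3 for a single array of n floats (n, h_x and d_x are illustrative names, following the h_/d_ convention used later in these slides):

    int    n      = 128000;
    size_t nbytes = n * sizeof(float);

    // host array (initialise as needed)
    float *h_x = (float *) malloc(nbytes);

    // device buffer in global memory
    cl_mem d_x = clCreateBuffer(context, CL_MEM_READ_WRITE, nbytes, NULL, &err);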
OpenCL programming
4. Define the kernel
- Create kernel object from the program: clCreateKernel()
- Attach arguments to the kernel: clSetKernelArg()
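A sketch of step 4, continuing the same example (the kernel name matches the my_first_kernel example later in this lecture):

    cl_kernel kernel;

    // look up the kernel by name within the compiled program
    kernel = clCreateKernel(program, "my_first_kernel", &err);

    // argument 0 is the __global float* buffer
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_x);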
OpenCL programming
5. Submit commands
- Write buffers from host into global memory: clEnqueueWriteBuffer()
- Enqueue kernel for execution: clEnqueueNDRangeKernel()
- Read back result: clEnqueueReadBuffer()
Note the command queue is in-order, so as long as the reading back is a blocking call the others do not need to be.
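A 1D sketch of step 5, reusing the names from the sketches above (sizes are illustrative):

    size_t global = 128000;   // total number of work-items
    size_t local  = 128;      // work-items per work-group

    // copy input to device (non-blocking is fine on an in-order queue)
    err = clEnqueueWriteBuffer(queue, d_x, CL_FALSE, 0, nbytes, h_x,
                               0, NULL, NULL);

    // launch a 1D kernel
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                                 0, NULL, NULL);

    // blocking read: ensures all preceding commands have completed
    err = clEnqueueReadBuffer(queue, d_x, CL_TRUE, 0, nbytes, h_x,
                              0, NULL, NULL);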
OpenCL programming
At the lower level, when one instance of the kernel is started on a device it is executed by a number of work-items, each of which knows about:
- some variables passed as arguments
- memory buffers in global or local memory
- global constants in global memory
- local memory and private registers/variables
- some special functions:
  - get_global_id()    index in domain
  - get_local_id()     index in work-group
  - get_group_id()     index of work-group
  - get_local_size()   size of work-group
  - etc...
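As an illustration (a sketch, not from the original slides), for a 1D range with zero offset these indexing functions are related as follows:

    __kernel void index_demo(__global int *out)   // hypothetical kernel
    {
        int gid   = get_global_id(0);    // index in the whole domain
        int lid   = get_local_id(0);     // index within the work-group
        int group = get_group_id(0);     // which work-group this is
        int lsize = get_local_size(0);   // work-items per work-group

        // for a 1D range with zero offset: gid == group*lsize + lid
        out[gid] = group * lsize + lid;
    }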
OpenCL programming
The kernel code looks fairly normal once you get used to two things:
- code is written from the point of view of a single thread
  - quite different to OpenMP multithreading
  - similar to MPI, where you use the MPI rank to identify the MPI process
  - all private variables are private to that thread
- need to think about where each variable lives (more on this in the next lecture)
  - any operation involving data in the device memory forces its transfer to/from registers in the GPU
  - often better to copy the value into a private register variable
Kernel code

    __kernel void my_first_kernel(__global float *x)
    {
        int tid = get_global_id(0);
        x[tid] = (float) get_local_id(0);
    }

- __kernel identifier says it's a kernel function
- each work-item sets one element of the x array
- within each work-group, get_local_id(0) ranges from 0 to get_local_size(0)-1, so each thread has a unique value for tid
OpenCL programming
Suppose we have 1000 work-groups, each with 128 work-items. In this simple case we have a 1D grid, and a 1D set of work-items making up each work-group. Then the global size is 128000, and the local size is 128.
If we want to use a 2D grid, we would set our global (and local) work size arrays with two elements:

    const size_t global[] = {nx, ny};

We specify the problem dimension when we enqueue the kernel. Problems can be 1 (like an array), 2 (like a grid) or 3 dimensional (like a cube):

    clEnqueueNDRangeKernel(queue, kernel, work_dim, offset,
                           global, local, 0, NULL, NULL);
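A sketch of a 2D launch under these assumptions (nx, ny assumed divisible by the local sizes; the 16x16 work-group size is an illustrative choice):

    const size_t global[2] = {nx, ny};   // total work-items in each dimension
    const size_t local[2]  = {16, 16};   // 16x16 = 256 work-items per work-group

    // work_dim = 2, no offset; global/local are 2-element arrays
    err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local,
                                 0, NULL, NULL);

Inside the kernel, get_global_id(0) and get_global_id(1) then give the indices in the two dimensions.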
Practical 1
- start from code shown above (but with comments)
- test error-checking and printing from kernel functions
- modify code to add two vectors together (including sending them over from the host to the device)
- if time permits, look at OpenCL examples in the CUDA SDK
Practical 1
Things to note:
- memory allocation

    cl_mem d_x = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                nbytes, NULL, NULL);

- data copying

    clEnqueueReadBuffer(queue, d_x, CL_TRUE, 0, nbytes, h_x,
                        0, NULL, NULL);

- reminder: the prefixes h_ and d_ to distinguish between arrays on the host and on the device are not mandatory, just helpful labelling
- kernel routine is declared by the __kernel prefix, and is written from the point of view of a single thread
Practical 1
Second version of the code is very similar to the first, but uses a header file for various safety checks; this gives useful feedback in the event of errors.
- check for error return codes

    clUtilSafeCall( clApiCall(...) );

- check for errors passed as an API variable

    clApiCall(..., &err);
    clUtilSafeCall(err);
Practical 1
One thing to experiment with is the use of printf within a kernel function:
- requires an OpenCL v1.2 SDK (i.e. AMD)
- essentially the same as standard printf; minor difference in the integer return code
- each thread generates its own output; use conditional code if you want output from only one thread
- output from printf is flushed to an implementation-defined output stream at synchronisation points
- need to use clFinish(queue); at the end of the main code to flush all pending output
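For example (a sketch; the kernel and its arguments are illustrative), restricting output to a single work-item:

    __kernel void hello(__global float *x)
    {
        int tid = get_global_id(0);

        // only the first work-item prints, to avoid a flood of output
        if (tid == 0)
            printf("first element of x = %f\n", x[0]);
    }

On the host, call clFinish(queue); after the kernel has been enqueued so the output actually appears.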
Key reading
OpenCL Specification, version 1.2:
- Chapter 3: The OpenCL Architecture
- Chapter 4: The OpenCL Platform Layer
OpenCL Programming Guide:
Aaftab Munshi, Benedict Gaster, Timothy G. Mattson and James Fung, 2011
Heterogeneous Computing with OpenCL:
Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry and Dana Schaa, 2011