GPU Computing. Jochen Gerhard, Institut für Informatik, Frankfurt Institute for Advanced Studies
Overview How is a GPU structured (roughly)? How does manycore programming work compared to multicore? How can one access the GPU from Python? Some details about the structure of OpenCL programs. How to make the GPU do what you want?
Hardware A modern computer has more than just a CPU: more than one socket, and more than one core anyway. But also graphics cards, sometimes even more than one. Wouldn't it be nice to harvest all that computing power?
HPC for the poor man: GPUs As an example, let's take my MacBook Pro 5.1: 1x Intel Core 2 Duo @ 2.4 GHz, Galaxy Benchmark ~12 GFLOPs; NVIDIA GeForce 9600M GT, Galaxy Benchmark ~40 GFLOPs. Why not harvest both of them, with just one program?
The CPU n cores, n rather small (fewer than 32), each with a private L1 cache; pairwise/quadwise shared L2 cache; shared L3 cache. Slow access to system memory. Good at many different things.
Multicore Compute different, rather complicated tasks. Each core may even run a different program. Complicated hardware. Cores sometimes share information (messages).
The GPU Lots of cores. Not so much memory. Pretty simple. Good at number crunching.
Manycore Compute (almost) always the same task. Different groups can work on slightly different branches. Simpler hardware. Coordination within groups only.
Hardware overview from AMD's Programming Guide (*): an ultra-threaded dispatch processor feeds the compute units; each stream core consists of four processing elements plus a T-processing element, a branch execution unit, and general-purpose registers. To use the former picture: 5 chickens always stay together (processing elements), and each coop contains 16 of these cliques (compute units). Much of this is transparent to the programmer.
The GPU's hierarchy Lots of cores (e.g. ATI Radeon 5870): 20 compute units, each with 16 stream cores of (4+1) processing elements; in total 1600 SP units, 320 DP units, 320 SF units.
GPU / CPU
GPU <-> Mainboard
Compute Unit <-> CPU / Socket
Stream Core <-> Core
Processing Element <-> FPU
The GPU Performance Processing at 850 MHz => a theoretical peak performance of 1.36 TFLOPs (for $299). Not so much memory (1/2 GB accessible without tricks). Fast global memory on the GPU. Very fast local memory (32 kB with almost no latency per compute unit). 8 kB L1 (read-only) cache per compute unit.
The OpenCL Platform One host and various compute devices (GPUs and CPUs), each consisting of compute units (cores / SIMD engines), which are again divided into processing elements (FPUs). Platform overview from AMD's Programming Guide (*).
Organizing OpenCL First a platform has to be chosen: a platform is an implementation of OpenCL, and there may be more than one (like Apple + NVIDIA on my laptop). Then you query the devices that can be accessed by means of this platform. In a context, devices are tied together; the context is used to manage buffers, programs, and kernels. You perform actions on these objects in queues.
Common usage Take the first platform you get! Put your GPU as the only DEVICE into the CONTEXT. Have one command QUEUE connected with your GPU.
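A minimal sketch of this common usage in PyOpenCL (variable names are my choice, not from the talk):

    import pyopencl as cl

    platform = cl.get_platforms()[0]                    # the first platform you get
    gpu = platform.get_devices(cl.device_type.GPU)[0]   # pick a GPU from it
    ctx = cl.Context(devices=[gpu])                     # GPU as the only device
    queue = cl.CommandQueue(ctx, gpu)                   # one queue for the GPU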
Organizing OpenCL II Though possible, I would not recommend using more than one platform. If you want more than one GPU to work on the same memory, they have to share a context! When different devices share a context, the buffers share the device constraints.
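A sketch of such a shared context, assuming two or more GPUs on the same platform (reusing platform from the sketch above):

    gpus = platform.get_devices(cl.device_type.GPU)
    shared_ctx = cl.Context(devices=gpus)   # all these GPUs can now share buffers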
Memory Memory is managed in so-called buffers. Buffers are bound to a context. They have to be declared (size and specifiers). You get your data in and out via a copy command in the queue. You may also just pass a pointer to host memory.
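A hedged sketch of buffer handling, reusing ctx and queue from the sketch above (array and buffer names are illustrative):

    import numpy as np
    import pyopencl as cl

    host_a = np.arange(32, dtype=np.int32)
    # declared with size and specifiers; COPY_HOST_PTR passes a pointer to host memory
    buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                    hostbuf=host_a)
    cl.enqueue_copy(queue, host_a, buf)   # copy command in the queue: device -> host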
Execution model OpenCL programs are sets of functions written in a C99 derivative. The functions that are executed directly from the queue are called kernels. Kernels operate on every element of an input stream independently. This is orchestrated by the NDRange argument.
Kernels Kernels are the functions you put into the command queue. Essentially, within a kernel you explain what each chicken (work item) has to do! All work items will do the same thing, written in the kernel!
Orchestrating the kernels Kernels are put into the command queue. Before enqueueing a kernel, one has to specify where the kernel parameters point to. Kernels are enqueued with an NDRange argument, which gives an N-dimensional range.
The NDRange Gives the number of work items for the kernel. Can be organized geometrically, e.g. 1024x1024 work items, suited to the problem size. Can be subdivided into workgroups, e.g. 128x128 workgroups, each having 8x8 work items.
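In PyOpenCL, the NDRange shows up as the global and local size tuples at kernel launch; a fragment, assuming a built program prg, a queue, and a buffer buf (the kernel name is hypothetical):

    # 1024x1024 work items in total, subdivided into 8x8 workgroups
    # (giving 128x128 workgroups)
    prg.some_kernel(queue, (1024, 1024), (8, 8), buf)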
The NDRange: Chicken Version NDRange specifies how many chickens you want to work. You can organize them geometrically. (16 = 4x4) You can also group them together.
Why workgroups? All work items of a workgroup are executed on the same compute unit. They share the local memory, which is tremendously fast (chickens within a coop). Only within a workgroup can you synchronize. The next finer granularity is the wavefront: the execution stream within a wavefront is uniform, so branching within one is extremely expensive.
Why workgroups: chicken version All chickens within the same group reside in the same coop. They share the same bowl, which is much nearer than the global bowl for everyone. They wait for each other when going to the local or global bowl (synchronization). The next finer granularity: chickens will all do the same! So if, in the same wavefront, one chicken has to add and another has to subtract, they will all do both!
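To illustrate the branching cost, a hedged kernel sketch (name and logic are my own): within one wavefront, both sides of the if are effectively executed by everyone.

    branch_src = """
    __kernel void diverge(__global float *x)
    {
        int gid = get_global_id(0);
        if (gid % 2 == 0)     /* neighbouring work items take different branches, */
            x[gid] += 1.0f;   /* so the wavefront runs through both of them */
        else
            x[gid] -= 1.0f;
    }
    """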
Hands on
The OpenCL part Is a Python string and contains only one function, which is a kernel. The kernel has one parameter, data, which is a globally reachable array of int. It first gets its global ID in the x-direction (dimension 0). Each work item sets its entry to its GID.
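A sketch of such a kernel string (the kernel name set_gid is my choice):

    kernel_src = """
    __kernel void set_gid(__global int *data)
    {
        int gid = get_global_id(0);  /* global id in x-direction (dimension 0) */
        data[gid] = gid;             /* each work item sets its entry to its GID */
    }
    """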
The Python part I Platform / device / context are all managed by magic: create_some_context(). The queue is initialized with the given context. Declare how many work items you want; here we use 32 x 1 work items. We need representations of the data on the host and on the device.
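A sketch of this part:

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()   # platform / device / context by magic
    queue = cl.CommandQueue(ctx)     # queue initialized with the given context
    data = np.zeros(32, dtype=np.int32)                    # host representation
    data_buffer = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY,  # device representation
                            size=data.nbytes)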
The Python part II First we build the program from the source and according to its context. From the context, the compiler knows the device architecture. One can also pass compiler options here (e.g. include files!). Every kernel becomes a method of the program object.
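Continuing the sketch:

    # the context tells the compiler the device architecture;
    # compiler options could be passed too, e.g. build(options="-I .")
    prg = cl.Program(ctx, kernel_src).build()
    # every kernel becomes a method: prg.set_gid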
The Python part III We pass the queue, the NDRange, and the kernel parameters. The .wait() ensures we wait for completion. The last step is getting the data out of data_buffer into the NumPy array data. We .wait() until this is finished, too.
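The final piece of the sketch:

    evt = prg.set_gid(queue, (32, 1), None, data_buffer)  # queue, NDRange, parameters
    evt.wait()                                            # wait for completion
    cl.enqueue_copy(queue, data, data_buffer).wait()      # data_buffer -> data,
                                                          # and wait again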
Backup
Synchronization Within a workgroup: barrier(CLK_LOCAL_MEM_FENCE) or barrier(CLK_GLOBAL_MEM_FENCE). In a queue: .wait() waits for the event to be completed.
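A hedged kernel sketch of a local-memory barrier (name and logic are my own): each workgroup reverses its block, which is only safe once every work item has finished writing local memory.

    sync_src = """
    __kernel void reverse_block(__global float *data, __local float *tmp)
    {
        int lid = get_local_id(0);
        int lsz = get_local_size(0);
        tmp[lid] = data[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);   /* the whole workgroup waits here */
        data[get_global_id(0)] = tmp[lsz - 1 - lid];
    }
    """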
Synchronization There is no global synchronization between work items. Chickens never wait for chickens in other groups.
A template
A practical example Naive matrix multiplication (using only global memory). Still approximately 300 times faster than numpy.dot(A, B) for A, B 1024 x 1024 single-precision matrices (on an ATI Radeon 5870).
Global Matrix Multiplication
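The kernel from the slide is not reproduced here; a minimal sketch of such a naive global-memory matrix multiplication (names are my choice; n x n row-major matrices assumed):

    matmul_src = """
    __kernel void matmul(__global const float *A, __global const float *B,
                         __global float *C, const int n)
    {
        int row = get_global_id(1);
        int col = get_global_id(0);
        float acc = 0.0f;
        for (int k = 0; k < n; k++)   /* every access goes to global memory */
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
    """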
Local Matrix Multiplication
1st step Copy data from global memory to local memory. Each work item copies one entry per matrix (A, B) per round (k++) from global to local memory.
2nd step Now all memory accesses are within local memory. Each work item in the workgroup computes as in the global example. A sketch of both steps follows.
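A hedged sketch of the tiled, local-memory version (my own reconstruction; LDIM is the workgroup edge length, and n is assumed to be a multiple of LDIM):

    matmul_local_src = """
    #define LDIM 16   /* can be injected from Python, see Metaprogramming */

    __kernel void matmul_local(__global const float *A, __global const float *B,
                               __global float *C, const int n)
    {
        int row = get_global_id(1), col = get_global_id(0);
        int lrow = get_local_id(1), lcol = get_local_id(0);
        __local float Atile[LDIM][LDIM], Btile[LDIM][LDIM];
        float acc = 0.0f;
        for (int t = 0; t < n / LDIM; t++) {
            /* 1st step: each work item copies one entry per matrix per round */
            Atile[lrow][lcol] = A[row * n + t * LDIM + lcol];
            Btile[lrow][lcol] = B[(t * LDIM + lrow) * n + col];
            barrier(CLK_LOCAL_MEM_FENCE);
            /* 2nd step: all accesses now hit local memory */
            for (int k = 0; k < LDIM; k++)
                acc += Atile[lrow][k] * Btile[k][lcol];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        C[row * n + col] = acc;
    }
    """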
Metaprogramming We can use Python to modify the OpenCL source before compiling:
    src = "#define LDIM 16\n"
    src += open("matmul.cl").read()
Better, with ldim set in Python beforehand:
    src = "#define LDIM %i\n" % ldim
    src += open("matmul.cl").read()
Summary Accessing the GPU from Python is quite easy. PyOpenCL works perfectly with NumPy. If you are considering porting some slow routines to C (e.g. using Cython), you should probably also consider OpenCL. First (even practical!) routines are easily implemented.
Introductory Documents
(*) Programming Guide: AMD Accelerated Parallel Processing OpenCL
OpenCL Overview: http://www.khronos.org/developers/library/overview/opencl_overview.pdf
PyOpenCL: http://mathema.tician.de/software/pyopencl
OpenCL Registry: http://www.khronos.org/registry/cl/