OpenCL: Parallel Computing on the GPU and CPU
Aaftab Munshi
Opportunity: Processor Parallelism
- Today's processors are increasingly parallel
- CPUs: multiple cores are driving performance increases
- GPUs: transforming into general-purpose data-parallel computational coprocessors, with improving numerical precision (single and double)
Challenge: Processor Parallelism
- Writing parallel programs is different for the CPU and GPU
- Differing domain-specific techniques
- Vendor-specific technologies
- A graphics API is not an ideal abstraction for general-purpose compute
Introducing OpenCL
- OpenCL (Open Computing Language): an approachable language for accessing heterogeneous computational resources
- Supports parallel execution on single or multiple processors: GPU, CPU, GPU + CPU, or multiple GPUs
- Desktop and Handheld profiles
- Designed to work with graphics APIs such as OpenGL
OpenCL = Open Standard
- Specification under review
- Royalty free, cross-platform, vendor neutral
- Khronos OpenCL working group (www.khronos.org)
- Based on a proposal by Apple; developed in collaboration with industry leaders
- Performance-enhancing technology in Mac OS X Snow Leopard
OpenCL Working Group Members
- Broad industry support
OpenCL: A Sneak Preview
Design Goals of OpenCL
- Use all computational resources in the system: GPUs and CPUs as peers
- Data- and task-parallel compute model
- Efficient parallel programming model based on C
- Abstract the specifics of the underlying hardware
- Specify accuracy of floating-point computations: IEEE 754 compliant rounding behavior, with defined maximum allowable error for math functions
- Drive future hardware requirements
OpenCL Software Stack
- Platform layer: query and select compute devices in the system, initialize compute device(s), create compute contexts and work-queues
- Runtime: resource management, executing compute kernels
- Compiler: a subset of ISO C99 with appropriate language additions; compile and build compute program executables online or offline
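As a concrete illustration of the platform layer, here is a minimal sketch using the entry points that shipped in OpenCL 1.x; the preview API in this talk uses slightly different names, so treat the exact calls as assumptions:

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    char name[256];

    // query and select a compute device in the system
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("Compute device: %s\n", name);

    // create a compute context for the selected device
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    clReleaseContext(ctx);
    return 0;
}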
OpenCL Execution Model
- Compute kernel: basic unit of executable code, similar to a C function; data-parallel or task-parallel
- Compute program: collection of compute kernels and internal functions, analogous to a dynamic library
- Applications queue compute kernel execution instances: queued in-order, executed in-order or out-of-order; events are used to implement appropriate synchronization of execution instances (see the sketch below)
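A minimal sketch of event-based synchronization between two queued kernel instances, again spelled with the shipped OpenCL 1.x names (clCreateCommandQueue, clEnqueueNDRangeKernel, and clWaitForEvents are assumptions relative to the preview API, and the kernels k_produce and k_consume are hypothetical):

// assumes ctx, device, and two kernels (k_produce, k_consume) already exist
cl_command_queue q = clCreateCommandQueue(ctx, device,
                         CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, NULL);
size_t global = 1024;
cl_event produced;

// the first kernel signals an event on completion...
clEnqueueNDRangeKernel(q, k_produce, 1, NULL, &global, NULL, 0, NULL, &produced);
// ...which orders the second kernel after it, even in an out-of-order queue
clEnqueueNDRangeKernel(q, k_consume, 1, NULL, &global, NULL, 1, &produced, NULL);

clWaitForEvents(1, &produced);  // the host can also wait on events directly
clFinish(q);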
OpenCL Data-Parallel Execution
- Define an N-dimensional computation domain; each independent element of execution in the N-D domain is called a work-item
- The N-D domain defines the total number of work-items that execute in parallel: the global work size
- Work-items can be grouped together into a work-group; work-items in a group can communicate with each other and can synchronize execution among themselves to coordinate memory access
- Multiple work-groups execute in parallel; the global work size is mapped onto work-groups (a minimal kernel is sketched below)
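To make the work-item model concrete, a minimal data-parallel kernel (a hypothetical vector add, not from the talk) in which each work-item computes one element of the domain:

__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    // each work-item handles the element at its global index
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}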
OpenCL Task-Parallel Execution
- The data-parallel execution model must be implemented by all OpenCL compute devices
- Some compute devices, such as CPUs, can also execute task-parallel compute kernels
- A task-parallel kernel executes as a single work-item and can be either a compute kernel written in OpenCL or a native C / C++ function
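In the OpenCL 1.x API as shipped, a task-parallel kernel is queued with clEnqueueTask, which the specification defines as equivalent to an N-D range of a single work-item (a sketch; the preview API in this talk may differ):

// run the kernel as a single work-item (task-parallel)
cl_event done;
clEnqueueTask(queue, kernel, 0, NULL, &done);
clWaitForEvents(1, &done);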
OpenCL Memory Model
- Implements a relaxed consistency, shared memory model
- Multiple distinct address spaces; address spaces can be collapsed
- Private memory: per work-item (address qualifier: private)
- Local memory: shared by the work-items of a work-group on a compute unit (address qualifier: local)
- Global / constant memory: visible to all work-items, with a data cache on the compute device (address qualifiers: global and constant); example: global float4 *p;
- Global memory resides in compute device memory
The qualifiers are shown in use in the sketch below.
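A minimal sketch showing the address space qualifiers in use (a hypothetical per-work-group sum, not from the talk):

__kernel void partial_sum(__global const float *in,
                          __global float *out,
                          __local float *scratch)  // local: shared per work-group
{
    int lid = get_local_id(0);
    float x = in[get_global_id(0)];  // private: per-work-item variables

    scratch[lid] = x;
    barrier(CLK_LOCAL_MEM_FENCE);    // synchronize the work-group

    // work-item 0 writes the group's sum back to global memory
    if (lid == 0) {
        float sum = 0.0f;
        for (uint i = 0; i < get_local_size(0); i++)
            sum += scratch[i];
        out[get_group_id(0)] = sum;
    }
}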
Language for Writing Compute Kernels
- Derived from ISO C99, with a few restrictions: recursion, function pointers, functions in C99 standard headers...
- Preprocessing directives defined by C99 are supported
- Built-in data types: scalar and vector data types, structs, pointers
- Data-type conversion functions: convert_type<_sat><_roundingmode>
- Image types
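A short sketch of the vector types and the conversion family (convert_int4_sat_rte is one instance of convert_type<_sat><_roundingmode>, with saturation and round-to-nearest-even; the kernel itself is hypothetical):

__kernel void convert_demo(__global const float4 *src, __global int4 *dst)
{
    int i = get_global_id(0);
    float4 f = src[i] * (float4)(255.0f);
    // saturating conversion with round-to-nearest-even
    dst[i] = convert_int4_sat_rte(f);
}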
Language for Writing Compute Kernels
- Required built-in functions: work-item functions, math.h, read and write image, relational functions, geometric functions, synchronization functions
- Optional built-in functions: double precision, atomics to global and local memory, selection of rounding mode
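A small sketch exercising a few of the required built-ins (work-item, math, and geometric functions; the kernel itself is hypothetical):

__kernel void normalize_points(__global float4 *pts)
{
    int i = get_global_id(0);  // work-item function
    float4 p = pts[i];
    p.w = sqrt(fabs(p.w));     // math functions
    if (length(p) > 0.0f)      // geometric function
        p = normalize(p);      // geometric function
    pts[i] = p;
}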
OpenCL FFT Example - Host API

// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context,
                            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(float)*2*num_entries, srcA);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            sizeof(float)*2*num_entries, NULL);
OpenCL FFT Example - Host API

// create the compute program
program = clCreateProgramFromSource(context, 1,
                                    &fft1D_1024_kernel_src, NULL);

// build the compute program executable
clBuildProgramExecutable(program, false, NULL, NULL);

// create the compute kernel
kernel = clCreateKernel(program, "fft1D_1024");
OpenCL FFT Example - Host API

// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
                                 global_work_size, local_work_size);

// set the args values
clSetKernelArg(kernel, 0, (void *)&memobjs[0], sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1], sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 2, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);

// execute kernel
clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);
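For readers following along with OpenCL as it eventually shipped, the same sequence looks roughly like this in the 1.x API (a sketch with error handling omitted; clCreateCommandQueue, clCreateProgramWithSource, clBuildProgram, and clEnqueueNDRangeKernel replaced the preview's clCreateWorkQueue, clCreateProgramFromSource, clBuildProgramExecutable, and clExecuteKernel):

// shipped OpenCL 1.x spelling of the preview sequence above (sketch)
cl_context ctx = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU,
                                         NULL, NULL, NULL);
cl_device_id dev;
clGetContextInfo(ctx, CL_CONTEXT_DEVICES, sizeof(dev), &dev, NULL);
cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(float)*2*num_entries, srcA, NULL);
cl_mem out = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                            sizeof(float)*2*num_entries, NULL, NULL);

cl_program prog = clCreateProgramWithSource(ctx, 1, &fft1D_1024_kernel_src,
                                            NULL, NULL);
clBuildProgram(prog, 0, NULL, NULL, NULL, NULL);
cl_kernel k = clCreateKernel(prog, "fft1D_1024", NULL);

size_t global = n, local = 64;
clSetKernelArg(k, 0, sizeof(cl_mem), &in);
clSetKernelArg(k, 1, sizeof(cl_mem), &out);
clSetKernelArg(k, 2, sizeof(float)*(local+1)*16, NULL);  // __local scratch
clSetKernelArg(k, 3, sizeof(float)*(local+1)*16, NULL);
clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
clFinish(q);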
OpenCL FFT Example - Compute Kernel

// This kernel computes an FFT of length 1024. The 1024-point FFT is decomposed into
// calls to a radix-16 function, another radix-16 function, and then a radix-4 function.
// Based on "Fitting FFT onto the G80 Architecture", Vasily Volkov & Brian Kazian,
// UC Berkeley CS258 project report, May 2008
__kernel void fft1D_1024(__global float2 *in, __global float2 *out,
                         __local float *sMemx, __local float *sMemy)
{
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];

    // starting index of data to/from global memory
    in = in + blockIdx;
    out = out + blockIdx;

    globalLoads(data, in, 64);             // coalesced global reads
    fftRadix16Pass(data);                  // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);

    // local shuffle using local memory
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
    fftRadix16Pass(data);                  // in-place radix-16 pass
    twiddleFactorMul(data, tid, 64, 4);    // twiddle factor multiplication
    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    // four radix-4 function calls
    fftRadix4Pass(data);
    fftRadix4Pass(data + 4);
    fftRadix4Pass(data + 8);
    fftRadix4Pass(data + 12);

    // coalesced global writes
    globalStores(data, out, 64);
}
OpenCL and OpenGL
- Sharing OpenGL resources: OpenCL is designed to efficiently share with OpenGL textures, buffer objects, and renderbuffers
- Data is shared, not copied
- Efficient queuing of OpenCL and OpenGL commands
- Apps can select the compute device(s) that will run OpenGL and OpenCL
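In the interop API that shipped with OpenCL 1.x (an assumption relative to this preview), sharing works by wrapping a GL object as a cl_mem and acquiring it before compute kernels touch it:

// wrap an existing OpenGL buffer object as an OpenCL memory object (sketch;
// assumes the context was created with GL-sharing properties and that
// glBufferId, q, kernel, and global are set up elsewhere)
cl_mem shared = clCreateFromGLBuffer(ctx, CL_MEM_READ_WRITE, glBufferId, NULL);

glFinish();  // pending GL work must complete before CL acquires the object
clEnqueueAcquireGLObjects(q, 1, &shared, 0, NULL, NULL);
clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
clEnqueueReleaseGLObjects(q, 1, &shared, 0, NULL, NULL);
clFinish(q);  // CL work is done before GL reuses the buffer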
Summary
- A new compute language that works across GPUs and CPUs
- C99 with extensions: familiar to developers, with a rich set of built-in functions
- Makes it easy to develop data- and task-parallel compute programs
- Defines hardware and numerical precision requirements
- An open standard for heterogeneous parallel computing