Slide 1: Programming with CUDA

Jens K. Mueller
Department of Mathematics and Computer Science
Friedrich-Schiller-University Jena
Monday 23rd May, 2011

Slide 2: Today's lecture: OpenCL

Slide 3 / 23: OpenCL

- Heterogeneous (GPU, CPU, etc.) standard for general-purpose programming
  - Efficiency
  - Portability
- Ranging from compute servers to handheld devices
- Based on C99
  - Language for writing kernels and runtime APIs
- OpenCL 1.1 (14th June 2010)
- Developed by the Khronos Group consortium [9]
- Implementations by AMD, Nvidia, IBM, Apple, ...

Slide 4: Computing Platform and Terminology

- 1 host and 1+ compute devices
- The host submits work to a device's work queue

Terminology
- Work item: basic unit of work
- Kernel: describes the work of a work item (C function)
- Program: collection of kernels
- Context: environment for working with devices

Slide 5: OpenCL Header File

C header files
1. Include CL/opencl.h

There are also C++ bindings at the Khronos OpenCL API Registry.
1. Download cl.hpp
2. Include cl.hpp

    #include <CL/opencl.h>

Listing 1: Including OpenCL header files

Slide 6: OpenCL Platform Layer API

- Underlying hardware abstraction
- Query OpenCL devices
- Device configuration information
- Create an OpenCL context for one or more devices
  - clCreateContext
  - clCreateContextFromType
    - CL_DEVICE_TYPE_CPU
    - CL_DEVICE_TYPE_GPU
    - CL_DEVICE_TYPE_ACCELERATOR
    - CL_DEVICE_TYPE_DEFAULT
    - CL_DEVICE_TYPE_ALL

Slide 7: Creating a Context

    // create context
    // (NULL properties: the platform chosen is implementation-defined)
    cl_int clerror;
    cl_context context;
    context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU, NULL, NULL, &clerror);

Listing 2: Creating an OpenCL context

Slide 8: Check for Devices in the Context

    // query all devices available to the context
    // first call: ask only for the size in bytes of the device list
    size_t ncontextdescriptorsize;
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &ncontextdescriptorsize);
    // second call: fetch the device list into a buffer of that size
    cl_device_id *devices = (cl_device_id *) malloc(ncontextdescriptorsize);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, ncontextdescriptorsize, devices, NULL);

Listing 3: Query devices within the context

Slide 9: Command Queues and Events

- Queues belong to a device
- Enqueuing kernels
- Events to synchronize between queues

    // create a command queue for first device of the context
    cl_command_queue cmdqueue;
    cmdqueue = clCreateCommandQueue(context, devices[0], 0, &clerror);

Listing 4: Create a command queue

- CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
- CL_QUEUE_PROFILING_ENABLE

Slide 10: OpenCL Programs

Kernels:
- Derived from C99
  - No function pointers, recursion, variable-length arrays, bit fields, variadic functions, ...
- Work items/work groups
- Vector types
- Synchronization
- Address space qualifiers
- Built-in functions (image manipulation, work-item manipulation, math functions, ...)

Kernels are built by the runtime (not by an external compiler).

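Slide 11: Vector Types

    int4 vi0 = (int4) -7;                     // vi0 = (-7, -7, -7, -7)
    int4 vi1 = (int4)(0, 1, 2, 3);            // vi1 = (0, 1, 2, 3)
    vi0.lo = vi1.hi;                          // vi0 = (2, 3, -7, -7)
    int8 v8 = (int8)(vi0, vi1.s01, vi1.odd);  // v8 = (2, 3, -7, -7, 0, 1, 1, 3)

- Vector operations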
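The component selectors in the vector listing above (.lo, .hi, .s01, .odd) can be traced on the host with ordinary arrays. The following is a minimal plain-C sketch, not OpenCL code: the function name emulate_vector_listing is hypothetical, and the arrays merely stand in for int4 and int8 values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: replays the slide's vector listing with plain arrays
 * and writes the eight components of v8 into out[0..7]. */
void emulate_vector_listing(int out[8])
{
    assert(out != NULL);

    int vi0[4] = {-7, -7, -7, -7};  /* (int4) -7 replicates the scalar */
    int vi1[4] = {0, 1, 2, 3};      /* (int4)(0, 1, 2, 3) */

    /* vi0.lo = vi1.hi: the lower half of vi0 takes the upper half of vi1 */
    vi0[0] = vi1[2];
    vi0[1] = vi1[3];

    /* (int8)(vi0, vi1.s01, vi1.odd): concatenate 4 + 2 + 2 components */
    memcpy(out, vi0, sizeof vi0);   /* vi0     -> out[0..3] */
    out[4] = vi1[0];                /* vi1.s01 -> components 0 and 1 */
    out[5] = vi1[1];
    out[6] = vi1[1];                /* vi1.odd -> components 1 and 3 */
    out[7] = vi1[3];
}
```

Calling emulate_vector_listing with an int[8] array fills it with the components (2, 3, -7, -7, 0, 1, 1, 3).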

Slide 12: Kernel Configuration

- Global domain of work items (global dimension)
- Local domain of work groups (local dimension)
- No synchronization between work groups
- Synchronization within a work group is possible
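How the global and local domains relate can be sketched on the host in plain C. This is an illustration only, not OpenCL code: the helper names ndrange_num_groups and ndrange_global_id are hypothetical. The formula follows the index mapping the OpenCL 1.1 specification gives for one dimension, assuming the global size is a multiple of the local size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of work groups along one dimension. */
size_t ndrange_num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Hypothetical helper: global ID of a work item along one dimension,
 * computed from its group ID, its local ID, and the global offset. */
size_t ndrange_global_id(size_t group_id, size_t local_size,
                         size_t local_id, size_t global_offset)
{
    assert(local_id < local_size);
    return global_offset + group_id * local_size + local_id;
}
```

For example, with a global size of 64 and a local size of 16 there are 4 work groups, and the work item with local ID 5 in group 2 has global ID 37 when the offset is 0.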

Slide 13: OpenCL Memories

- Private memory (per work item): private
- Local memory (per work group): local
- Global/constant memory (all work groups): global and constant
- Host memory

Slide 14: Building Programs

- clCreateProgramWithSource and clCreateProgramWithBinary
- clBuildProgram and ...

    cl_program program;
    program = clCreateProgramWithSource(context, 1, &kernelsource, NULL, &clerror);
    CHECK_EQ(CL_SUCCESS, clerror);

Listing 5: Create a program

    clerror = clBuildProgram(program, 1, devices, NULL, NULL, NULL);

Listing 6: Build a program

Slide 15: Program Build Info

- clGetProgramBuildInfo

    // query the log size first so that a long log still fits the buffer
    size_t sizebuildlog = 0;
    clerror = clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, 0, NULL, &sizebuildlog);
    CHECK_EQ(CL_SUCCESS, clerror);
    char *result = (char *) malloc(sizebuildlog);
    // fetch the log itself
    clerror = clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, sizebuildlog, result, NULL);
    CHECK_EQ(CL_SUCCESS, clerror);
    LOG(INFO) << "Build log: " << result;
    free(result);

Listing 7: Build Information

Slide 16: Create a Kernel

- clCreateKernel and clCreateKernelsInProgram

    cl_kernel kernel;
    kernel = clCreateKernel(program, "kernel", &clerror);

Listing 8: Create kernel

Slide 17: Execute a Kernel

- clSetKernelArg

    // arguments are set by position; arg counts up from 0
    int arg = 0;
    clerror = clSetKernelArg(kernel, arg++, sizeof(cl_mem), (void *) &image);
    clerror = clSetKernelArg(kernel, arg++, sizeof(val), (void *) &val);
    clerror = clSetKernelArg(kernel, arg++, sizeof(val1), (void *) &val1);
    clerror = clSetKernelArg(kernel, arg++, sizeof(val2), (void *) &val2);

Listing 9: Specify kernel arguments

Slide 18: Execute a Kernel (cont.)

- clEnqueueNDRangeKernel

    const cl_uint dim = 2;
    // size_t localworksize[dim] = {16, 16};
    size_t globalworksize[dim] = {width, height};
    // execute kernel (local work size NULL: the implementation chooses it)
    clerror = clEnqueueNDRangeKernel(cmdqueue, kernel, dim, NULL, globalworksize, NULL, 0, NULL, NULL);

Listing 10: Execute a kernel
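If an explicit local work size such as the commented-out {16, 16} is passed instead of NULL, OpenCL 1.1 requires each global work size to be evenly divisible by the corresponding local work size. A common way to satisfy this is to round the global size up, sketched here in plain C. The helper name round_up_global_size is hypothetical, and the kernel would then have to skip work items whose IDs fall outside the real width and height.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest multiple of local_size that is
 * greater than or equal to global_size. */
size_t round_up_global_size(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    return ((global_size + local_size - 1) / local_size) * local_size;
}
```

For a width of 1000 and a local size of 16 this yields 1008, so the last 8 work items in that dimension have no pixel to process.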

Slide 19: Kernel

    __kernel void kernel(write_only image2d_t dst, float zoom, float to_x, float to_y)
    {
        uint dimensions = get_work_dim();
        // query the work-item functions for every dimension of the index space
        for (uint d = 0; d < dimensions; ++d) {
            size_t globalsize = get_global_size(d);
            size_t globalid = get_global_id(d);
            size_t localsize = get_local_size(d);
            size_t localid = get_local_id(d);
            size_t numgroups = get_num_groups(d);
            size_t groupid = get_group_id(d);
            size_t globaloffset = get_global_offset(d);
        }
    }

Listing 11: A Kernel

Slide 20: Memory Objects

Buffer Objects and Image Objects
- Manage memory
- Sub-buffer objects to distribute to multiple devices

Buffer Objects
- clCreateBuffer, clCreateSubBuffer
- clEnqueueReadBuffer, clEnqueueWriteBuffer, clEnqueueCopyBuffer
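Distributing one buffer to several devices through sub-buffers comes down to choosing an origin and a size per device. The plain-C sketch below shows one possible split; it is not taken from the lecture. The type region_t is a hypothetical stand-in that mirrors the origin and size fields of cl_buffer_region, and the align_bytes parameter is an assumption: on a real device the sub-buffer origin has to respect CL_DEVICE_MEM_BASE_ADDR_ALIGN, which is reported in bits and would need converting to bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in mirroring the fields of cl_buffer_region. */
typedef struct {
    size_t origin;  /* byte offset of the region within the parent buffer */
    size_t size;    /* region length in bytes */
} region_t;

/* Hypothetical helper: region of a total_bytes-sized buffer assigned to one
 * device when splitting into aligned, nearly equal chunks. */
region_t device_region(size_t total_bytes, size_t num_devices,
                       size_t device_index, size_t align_bytes)
{
    assert(num_devices > 0 && device_index < num_devices && align_bytes > 0);

    /* per-device chunk: ceil(total / devices), rounded up to the alignment */
    size_t chunk = (total_bytes + num_devices - 1) / num_devices;
    chunk = ((chunk + align_bytes - 1) / align_bytes) * align_bytes;

    region_t r;
    r.origin = device_index * chunk;
    if (r.origin > total_bytes)
        r.origin = total_bytes;         /* nothing left for this device */

    size_t remaining = total_bytes - r.origin;
    r.size = remaining < chunk ? remaining : chunk;
    return r;
}
```

With 1000 bytes, 3 devices, and a 128-byte alignment, the regions are (0, 384), (384, 384), and (768, 232). A region of size 0 means the device gets no data, and the caller would skip creating a sub-buffer for it.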

Slide 21: Memory Objects (cont.)

Buffer Objects and Image Objects

Image Objects
- clCreateImage{2,3}D
- clGetSupportedImageFormats
- clEnqueueCopyImageToBuffer and clEnqueueCopyBufferToImage
- clEnqueueReadImage, clEnqueueCopyImage, and clEnqueueWriteImage
- CL_RGBA, CL_BGRA (optional: CL_R, CL_A, ...)
- Kernel: {read,write}_image{f,i,ui}

Slide 22: Image Example

    // allocate an image
    cl_image_format format;
    format.image_channel_order = CL_RGBA;
    format.image_channel_data_type = CL_UNSIGNED_INT8;
    // row pitch 0 and host pointer NULL: no host data is supplied
    cl_mem image = clCreateImage2D(context, CL_MEM_WRITE_ONLY, &format, width, height, 0, NULL, &clerror);

Listing 12: Allocate an image

Slide 23: References

[6] OpenCL 1.1 Quick Reference card. Version URL:
[7] OpenCL 1.1 Reference Pages. Khronos Group. URL: docs/man/xhtml/.
[8] OpenCL 1.1 Specification. Version 36. Sept. 30, URL: http: // 1.1.pdf.
[9] OpenCL. The open standard for parallel programming of heterogeneous systems. URL: (cit. on p. 3).
