HPC Practical Course Part 4.1. Open Computing Language (OpenCL)

1 HPC Practical Course Part 4.1 Open Computing Language (OpenCL) V. Akishina, I. Kisel, I. Kulakov, M. Zyzak Goethe University of Frankfurt am Main 11 June 2014

2 Computer Architectures
 - Single Instruction, Single Data (SISD)
 - Single Instruction, Multiple Data (SIMD)
 - Multiple Instruction, Multiple Data (MIMD)
[Figure: diagrams of the three architectures; source attribution not preserved]

3 OpenCL Architecture
OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), and other processors.
Platform Layer API
 - A hardware abstraction layer over diverse computational resources
 - Query, select and initialise compute devices
 - Create compute contexts and work-queues
Runtime API
 - Execute compute kernels
 - Manage scheduling, compute, and memory resources
Language Specification
 - C-based cross-platform programming interface
 - Subset of ISO C99 with language extensions, familiar to developers
 - Defined numerical accuracy: IEEE 754 rounding with specified maximum error
 - Online or offline compilation and build of compute kernel executables
 - Rich set of built-in functions
Practicality, flexibility and retargetability

4 Programming Model
Host
 - runs the main program
 - runs the compilation
 - distributes tasks between compute devices
 - tasks are distributed via queues
Compute devices
 - example: a CPU or a GPU
 - consist of one or more compute units
Compute units
 - example: a set of CPU cores, a streaming multiprocessor of a GPU
 - consist of one or more processing elements
Processing elements
 - example: one core of a CPU, one core of a GPU

5 Memory Model
 - Global memory: the largest memory space available to the device.
 - Local memory: each compute unit on the device has a local memory, which is typically on the processor die and therefore has much higher bandwidth and lower latency than global memory. Local memory can be read and written by any work-item in a work-group, and thus allows communication between the work-items of that work-group.
 - Private memory: attached to each processing element. It is typically not used directly by programmers, but holds the data of each work-item that does not fit in the processing element's registers.
[Figure: memory hierarchy. Inside the compute device, each work-item has its own private memory, each work-group has a local memory shared by its work-items, and all work-groups share the global/constant memory. The host has its own host memory.]

6 Execution Model
An OpenCL application runs on a host, which submits work to the compute devices.
 - Context: the environment within which work-items execute; includes the devices, their memories and the command queues
 - Program: a collection of kernels and other functions (analogous to a dynamic library)
 - Kernel: the code for a work-item. Basically a C function
 - Work-item: the basic unit of work on an OpenCL device
   - Each processing element works on one work-item
   - Work-items are combined into work-groups
   - Each work-group is assigned to a compute unit
Applications queue kernel executions
 - Executed in-order or out-of-order
[Figure: the global size is divided into work-groups, each containing "work-group size" work-items]
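The split of the global size into work-groups can be illustrated on the host with a small sketch. It uses no OpenCL calls; the struct WorkItemIndex and the function enumerate_work_items are hypothetical names introduced only for this illustration, and the sketch assumes a 1D index space without a global offset, where the global size is a multiple of the work-group size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record of where one work-item sits in a 1D index space.
struct WorkItemIndex {
    std::size_t global_id;  // value get_global_id(0) would return in a kernel
    std::size_t group_id;   // value get_group_id(0) would return
    std::size_t local_id;   // value get_local_id(0) would return
};

// Enumerate all work-items of a 1D range on the host.
// Assumption: global_size is a multiple of local_size and there is no offset.
std::vector<WorkItemIndex> enumerate_work_items(std::size_t global_size,
                                                std::size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    std::vector<WorkItemIndex> items;
    items.reserve(global_size);
    const std::size_t num_groups = global_size / local_size;
    for (std::size_t group = 0; group < num_groups; ++group) {
        for (std::size_t local = 0; local < local_size; ++local) {
            // The global index follows from the group index and the local index.
            items.push_back({group * local_size + local, group, local});
        }
    }
    return items;
}
```

For example, enumerate_work_items(8, 4) describes two work-groups of four work-items each; the work-item with global index 5 is the second work-item of the second group.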

7 Host-program structure
[Figure: OpenCL objects inside a context that spans a CPU and a GPU: programs, kernels, memory objects (images, buffers) and command queues (in-order and out-of-order). The dp_mul program is built into a CPU program binary and a GPU program binary; the dp_mul kernel holds the values of arg[0], arg[1] and arg[2].]
Example kernel:
    kernel void dp_mul(global const float *a,
                       global const float *b,
                       global float *c)
    {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    }
Steps of the host program:
 1. Get a platform
 2. Get a device
 3. Set a context
 4. Create a command-queue
 5. Create memory buffer
 6. Write the buffer
 7. Create a program
 8. Compile the program
 9. Create a kernel
 10. Set the kernel arguments
 11. Call the kernel
 12. Read the buffer
 13. Clean the memory

8 Host Program Structure
Platform, Device, Context

    // Returns the error code
    // (oclGetPlatformID is a helper function, not part of the core OpenCL API)
    cl_int oclGetPlatformID(cl_platform_id *platforms)  // Pointer to the platform object

    // Returns the error code
    cl_int clGetDeviceIDs(cl_platform_id platform,
                          cl_device_type device_type,   // Bitfield identifying the type. For the GPU we use CL_DEVICE_TYPE_GPU
                          cl_uint num_entries,          // Number of entries that fit in devices, typically 1
                          cl_device_id *devices,        // Pointer to the device object
                          cl_uint *num_devices)         // Receives the number of devices matching device_type

    // Returns the context
    cl_context clCreateContext(const cl_context_properties *properties,  // Zero-terminated list of properties (see specification)
                               cl_uint num_devices,                      // Number of devices
                               const cl_device_id *devices,              // Pointer to the devices object
                               void (*pfn_notify)(const char *errinfo, const void *private_info, size_t cb, void *user_data),  // (don't worry about this)
                               void *user_data,                          // (don't worry about this)
                               cl_int *errcode_ret)                      // Error code result

9 Host Program Structure
Command queue, Buffer: create, Buffer: write

    cl_command_queue clCreateCommandQueue(cl_context context,
                                          cl_device_id device,
                                          cl_command_queue_properties properties,  // Bitfield with the queue properties
                                          cl_int *errcode_ret)                     // Error code result

    // Returns the cl_mem object referencing the memory allocated on the device
    cl_mem clCreateBuffer(cl_context context,   // The context where the memory will be allocated
                          cl_mem_flags flags,   // See the list of flags below
                          size_t size,          // The size in bytes
                          void *host_ptr,
                          cl_int *errcode_ret)

Values of flags:
 - CL_MEM_READ_WRITE
 - CL_MEM_WRITE_ONLY
 - CL_MEM_READ_ONLY
 - CL_MEM_USE_HOST_PTR
 - CL_MEM_ALLOC_HOST_PTR
 - CL_MEM_COPY_HOST_PTR

10 Host Program Structure
Program: create, Program: build, Error log, Kernel: create

    // Returns the OpenCL program
    cl_program clCreateProgramWithSource(cl_context context,
                                         cl_uint count,          // Number of files
                                         const char **strings,   // Array of strings, each one is a file
                                         const size_t *lengths,  // Array specifying the file lengths
                                         cl_int *errcode_ret)    // Error code to be returned

    cl_int clBuildProgram(cl_program program,
                          cl_uint num_devices,
                          const cl_device_id *device_list,
                          const char *options,  // Compiler options, see the specification for more details
                          void (*pfn_notify)(cl_program, void *user_data),
                          void *user_data)

    cl_int clGetProgramBuildInfo(cl_program program,
                                 cl_device_id device,
                                 cl_program_build_info param_name,  // The parameter we want to know
                                 size_t param_value_size,
                                 void *param_value,                 // The answer
                                 size_t *param_value_size_ret)

Values of param_name:
 - CL_PROGRAM_BUILD_STATUS
 - CL_PROGRAM_BUILD_OPTIONS
 - CL_PROGRAM_BUILD_LOG

    cl_kernel clCreateKernel(cl_program program,       // The program where the kernel is
                             const char *kernel_name,  // The name of the kernel, i.e. the name of the kernel function as it is declared in the code
                             cl_int *errcode_ret)

11 Host Program Structure
Kernel: arguments, Kernel: call, Profile

    cl_int clSetKernelArg(cl_kernel kernel,       // Which kernel
                          cl_uint arg_index,      // Which argument
                          size_t arg_size,        // Size of the next argument (not of the value pointed to by it)
                          const void *arg_value)  // Value

    cl_int clEnqueueNDRangeKernel(cl_command_queue command_queue,
                                  cl_kernel kernel,
                                  cl_uint work_dim,                  // Choose whether we use 1D, 2D or 3D work-items and work-groups
                                  const size_t *global_work_offset,
                                  const size_t *global_work_size,    // The total number of work-items (must have work_dim dimensions)
                                  const size_t *local_work_size,     // The number of work-items per work-group (must have work_dim dimensions)
                                  cl_uint num_events_in_wait_list,
                                  const cl_event *event_wait_list,
                                  cl_event *event)
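In OpenCL 1.x, when local_work_size is given explicitly, global_work_size has to be a multiple of it in every dimension. A common way to handle array sizes such as 1023 is to round the global size up and let the surplus work-items do nothing. The following is a minimal host-side sketch of that arithmetic; the helper names round_up_global_size and work_item_is_active are hypothetical and not part of the OpenCL API.

```cpp
#include <cassert>
#include <cstddef>

// Smallest multiple of local_size that is >= n.
// The result is a candidate value for global_work_size.
std::size_t round_up_global_size(std::size_t n, std::size_t local_size)
{
    assert(local_size > 0);
    return ((n + local_size - 1) / local_size) * local_size;
}

// Host-side mirror of the guard a kernel would need after rounding up,
// i.e. the condition in `if (id < n) { ... }` at the top of the kernel body.
bool work_item_is_active(std::size_t global_id, std::size_t n)
{
    return global_id < n;
}
```

With n = 1023 and a work-group size of 64, the rounded global size is 1024, and the work-item with global index 1023 is the only inactive one. The kernel then needs the array length as an additional argument so that it can evaluate the guard.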

12 Host Program Structure
Buffer: read, Clean

    cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
                               cl_mem buffer,          // From which buffer
                               cl_bool blocking_read,  // Whether it is a blocking or a non-blocking read
                               size_t offset,          // Offset from the beginning
                               size_t cb,              // Size to be read (in bytes)
                               void *ptr,              // Pointer to the host memory
                               cl_uint num_events_in_wait_list,
                               const cl_event *event_wait_list,
                               cl_event *event)

    // Cleaning up
    delete[] src_a_h;
    delete[] src_b_h;
    delete[] res_h;
    delete[] check;
    clReleaseKernel(vector_add_k);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    clReleaseMemObject(src_a_d);
    clReleaseMemObject(src_b_d);
    clReleaseMemObject(res_d);

13 Vectorization with OpenCL

    Vc         OpenCL
    int_v      int4
    float_v    float4
    store()    vstore4()
    load()     vload4()

Example: increase the 12th, 13th, 14th and 15th elements of array A by one

    int A[1000];
    int4 a = vload4( 3, A );
    a++;
    vstore4( a, 3, A );
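The offset argument of vload4 and vstore4 counts in units of four elements, so offset 3 addresses A[12] to A[15]. The following host-side sketch emulates that addressing in plain C++ without OpenCL; the names Int4, emulate_vload4, emulate_vstore4 and increment_block are hypothetical stand-ins chosen for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host stand-in for the OpenCL int4 type.
using Int4 = std::array<int, 4>;

// Emulates vload4(offset, p): reads p[4*offset] .. p[4*offset + 3].
Int4 emulate_vload4(std::size_t offset, const int* p)
{
    const std::size_t base = 4 * offset;
    return Int4{{p[base], p[base + 1], p[base + 2], p[base + 3]}};
}

// Emulates vstore4(v, offset, p): writes v to p[4*offset] .. p[4*offset + 3].
void emulate_vstore4(const Int4& v, std::size_t offset, int* p)
{
    const std::size_t base = 4 * offset;
    for (std::size_t i = 0; i < 4; ++i) {
        p[base + i] = v[i];
    }
}

// Host version of the slide example: load one block of four, add one, store.
void increment_block(std::size_t offset, int* data)
{
    assert(data != nullptr);
    Int4 a = emulate_vload4(offset, data);
    for (int& x : a) {
        ++x;  // component-wise, like a++ on an int4
    }
    emulate_vstore4(a, offset, data);
}
```

Calling increment_block(3, A) changes exactly A[12], A[13], A[14] and A[15] and leaves the neighbouring elements untouched.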

14 GPU Memory
Global memory
 - very large, typically gigabytes
 - readable and writable by all work-items
 - state only well defined after the kernel has finished
 - slow, but much faster with streaming access (coalescing)
 - sometimes cached
Constant memory
 - read-only part of the global memory (writable from the host)
 - often cached
 - prefer constant memory for constant values
Local memory
 - very fast on-chip memory
 - shared among work-items within the same work-group
 - versatile (explicit global memory cache, etc.)
Private memory
 - private to a single work-item
 - usually physically a part of the global memory, slow
 - will be used to store the work-item's registers if the register file is exhausted (must be avoided)

15 Structure of AMD Radeon HD compute units
 - Up to 925 MHz engine clock
 - 3 GB GDDR5 memory
 - 3.79 TFLOPS single-precision compute power

16 GCN Compute Unit
[Figure 3: GCN Compute Unit]
 - 16 floats per SIMD unit
 - 64 floats in total per compute unit

17 Exercise 0
An example code is given. It computes the vector sum C = A + B.
1. Part 1
   1. Run and understand
   2. Check the error codes returned by each function (they should be equal to CL_SUCCESS == 0)
   3. Play: try to change the size of the arrays (try 128, 64, 16, 1023), the type (try float), etc.
   4. Solution is main1.cpp
2. Part 2
   1. Display the build log
   2. Measure the execution time
      1. for comparison, implement a scalar version
      2. try more complicated computations (log, sqrt)
   3. Solution is main2.cpp
3. Part 3: SIMDize
   1. Solution is main3.cpp
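For the comparison asked for in Part 2, a scalar host version of C = A + B and a simple timer could look like the sketch below. It is not one of the course solutions; the names vector_add_scalar and time_in_milliseconds are hypothetical, and the element type int is an assumption that can be swapped for float.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Scalar reference implementation of C = A + B on the host.
std::vector<int> vector_add_scalar(const std::vector<int>& a,
                                   const std::vector<int>& b)
{
    assert(a.size() == b.size());
    std::vector<int> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double time_in_milliseconds(F&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

The scalar result can also serve as the reference against which the buffer read back from the device is checked. When timing the OpenCL version in the same way, the measured region should include a blocking read or another synchronisation point, because enqueueing a kernel returns before the kernel has finished.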

18 Exercise 0 (continued)
4. Part 4: Create sub-devices
   1. Create sub-devices with
        cl_int clCreateSubDevices(cl_device_id in_device,
                                  const cl_device_partition_property *properties,
                                  cl_uint num_devices,
                                  cl_device_id *out_devices,
                                  cl_uint *num_devices_ret)
   2. Try the CL_DEVICE_PARTITION_EQUALLY, CL_DEVICE_PARTITION_BY_COUNTS and CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN properties (more information you can find here: )
   3. Solution is main4.cpp
5. Part 5:
   1. Write a separate function for the sum calculation and call it from the kernel function
   2. We suggest building the program in a C++-like style:
        clBuildProgram(program, 1, &out_devices[0], "-x clc++", NULL, NULL);
   3. Solution is main5.cpp

19 Exercise 0 (continued)
6. Part 6: Run on GPU
   1. Try the SIMD and scalar versions
   2. Try different sizes of work-groups
   3. Solution is main6.cpp

20 Exercise 1. SIMD KF
Implement the SIMD KF package with OpenCL:
 - Implement the host part in Fit.cxx (find TODO)
 - Finish the Fit.cl file: implement the kernel function, describe the data structures. Functionality is already there
Measure the scalability with OpenCL:
    cd TimeHisto
    . ~/pandaroot/trunk/build/config.sh
    root -l make_timehisto_stat_complex_opencl.c

