Programming GPUs. Lecture 1 GPU Programs and Introduction to OpenCL (I) Juan J. Durillo. Executing Programs in GPU Introduction to OpenCL

1 Programming GPUs. Lecture 1: GPU Programs and Introduction to OpenCL (I). Juan J. Durillo

2 Section 1: Executing Programs in GPU

3 Classical Desktop Architecture
- CPU (Host)
- Memory
- GPU (Device)
  - Which also has its own memory space
- Figure labels: CPU, GPU, GPU-mem, DRAM

4 Abstraction GPU Architecture
- For us, a GPU will be a collection of processing elements
  - Every processing element is a fully operational core
  - Floating-point and integer ALUs, ...
- Processing elements are grouped in processing units
- Figure labels: processing element; caches and control unit; processing unit

5 What does a GPU program look like?
- We will consider that a program is always composed of two parts:
  - A part that runs entirely on the host, which in our case is the CPU
  - A part that runs entirely on the device, which in our case is the GPU
    - In CUDA the device is always a GPU
    - In OpenCL the device can be a CPU, a GPU, etc. We consider only the GPU case though.
- The code that is going to be executed on the GPU is written as functions, which are referred to as kernels

6 How does a program execute in the GPU?
- A program is usually composed of data structures and instructions
- Instructions should be encoded within kernel functions
- Kernel functions are normal functions annotated with some modifiers
  - __kernel in OpenCL
  - __global__ in CUDA
- The program that runs on the host produces the commands to run the kernel functions on the GPU:
  - Copying the data structures to the device memory
  - Indicating how many instances of the kernel are going to be executed
  - Executing the kernels
  - Copying the data back from the device memory to the host memory

7 Kernel Execution
- When a kernel is going to execute in the GPU:
  - The GPU runtime creates an integer index space [1]
  - An instance of the kernel function is run for each point in the index space
[1] We will elaborate on the index space later in this lecture

8 Kernel Execution
- In OpenCL each instance of a kernel is known as a work-item
  - We will use this terminology in the rest of this class
  - In CUDA these instances are referred to as threads
- All the work-items execute the same sequence of instructions
  - They are all instances of the same kernel function!
  - However, the behaviour of different work-items may vary depending on the evaluation of branch conditions [2]
[2] We will also elaborate more on this issue in future lectures

9 Kernel Execution
- Work-items are grouped in work-groups
  - The same concept is called blocks in CUDA
- A work-group is entirely executed in one processing unit
  - The same is valid for the blocks in CUDA
  - This means that all the work-items within the work-group are executed in the processing elements of that processing unit
  - Every kernel instance is executed in a different processing element
- The only assumption we can make about the execution of a GPU program is that all the work-items within a work-group are executed concurrently
  - Other than this, no assumption can be made
  - In practice, different work-groups are run in parallel using more processing units (or concurrently using the same processing unit)

10 Summing up until now...
1. A GPU consists of several processing units
2. Every processing unit contains several processing elements
3. GPU programs contain some functions to be executed in the GPU. These functions are called kernels
4. When a kernel is executed, an index space is created
5. A kernel instance (work-item) is run for every point within the index space
6. Work-items are grouped into work-groups; each work-group is executed in a single processing unit
7. Single kernel instances are run in processing elements
8. All the work-items within a work-group are executed concurrently (the only assumption we can make)

11 Index space and kernel execution
- An index space is an N-dimensional range of values
  - In OpenCL it is known as an NDRange
  - N can be 1, 2, or 3: a vector, a matrix, or a cube
  - It is up to the programmer how to define the index space
- An NDRange is defined by specifying the sizes of its N dimensions
  - e.g., (5,1,1) specifies an index space consisting of a vector of size 5
  - e.g., (2,2,1) specifies an index space consisting of a matrix of size 2x2
- Work-items are uniquely identified within the index space (from 0 to size-1 in each dimension)
- The index space is divided into work-groups
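As a rough illustration of the sizes mentioned on this slide, the plain C sketch below counts the work-items in an NDRange given as three sizes. The helper names `ndrange_work_items` and `ndrange_linear_index` are hypothetical and not part of OpenCL, and the flattening order (dimension 0 varying fastest) is an assumption chosen only for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Total number of work-items in an NDRange given as (x, y, z) sizes.
 * Unused dimensions are expected to have size 1, as in (5,1,1). */
size_t ndrange_work_items(const size_t dims[3]) {
    return dims[0] * dims[1] * dims[2];
}

/* Hypothetical helper: flattens a 3-component id into a single number,
 * with dimension 0 varying fastest. The order is an assumption made for
 * this sketch; OpenCL itself only hands out the per-dimension ids. */
size_t ndrange_linear_index(const size_t dims[3], const size_t id[3]) {
    /* Each id component runs from 0 to size-1 in its dimension. */
    assert(id[0] < dims[0] && id[1] < dims[1] && id[2] < dims[2]);
    return (id[2] * dims[1] + id[1]) * dims[0] + id[0];
}
```

With these helpers, (5,1,1) yields 5 work-items and (2,2,1) yields 4, matching the vector and matrix examples on the slide.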

12 Index space and kernel execution
- Work-groups are also defined by specifying the sizes of their dimensions
  - An index space with N dimensions requires work-groups with N dimensions
  - A three-dimensional index space requires three-dimensional work-groups
- In OpenCL the size of every dimension of the work-group must evenly divide the size of the same dimension of the index space
  - This ensures that all the work-groups are full
  - All the work-groups have the same number of work-items
- Every work-group is uniquely identified within the index space
- Every work-item is also uniquely identified within its work-group by its local id
- The global id of each work-item can be obtained by combining its work-group id and its local id
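The relation between work-group id, local id, and global id can be sketched in plain C for a single dimension. This is a minimal sketch assuming no global offset is used; the function names `num_groups` and `global_id` are illustrative and not OpenCL built-ins (in a kernel, the corresponding values come from `get_group_id`, `get_local_size`, `get_local_id`, and `get_global_id`).

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups along one dimension. The work-group size must
 * evenly divide the index-space size, so every work-group is full. */
size_t num_groups(size_t global_size, size_t local_size) {
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Global id of a work-item along one dimension, built from the id of its
 * work-group and its local id inside that group (no global offset assumed). */
size_t global_id(size_t group_id, size_t local_size, size_t local_id) {
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}
```

For example, an index space of 8 work-items split into work-groups of 4 gives 2 work-groups, and the work-item with local id 2 in work-group 1 has global id 6.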

13 Index space and kernel execution
- Example of a 2D NDRange on blackboard

14 Index space and kernel execution
- Vector addition: given two vectors, A and B, we need to compute C = A + B

    /* N is the vector length, assumed to be defined elsewhere. */
    __kernel void add(__global const int *a, __global const int *b, __global int *c) {
        int index = 0;
        /* A single kernel instance walks over the whole vector. */
        while (index < N) {
            c[index] = a[index] + b[index];
            index++;
        }
    }

- How many kernel instances do we need for computing the sum of a and b?

15 Index space and kernel execution
- The problem with the previous code is that only one processing element is used
  - The rest of the resources (other processing units, processing elements) are idle
- How can we improve the previous code?
  - Defining an index space consisting of a vector of N components (e.g., (N,1,1))
  - Using the global id of every work-item as the index of the element it accesses in memory

    __kernel void add(__global const int *a, __global const int *b, __global int *c) {
        /* Each work-item handles exactly one element. */
        int index = get_global_id(0);
        c[index] = a[index] + b[index];
    }

- This code will execute in parallel using several processing elements!
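To make the "one kernel instance per point of the index space" idea concrete, here is a sequential plain C stand-in for launching the kernel above over an (n,1,1) index space. It is only an emulation under stated assumptions: `add_work_item` and `launch_add` are hypothetical names, the loop variable `gid` plays the role of `get_global_id(0)`, and a real GPU would run these instances in parallel rather than one after another.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the kernel for a single work-item.
 * gid stands in for the value get_global_id(0) would return. */
static void add_work_item(size_t gid, const int *a, const int *b, int *c) {
    c[gid] = a[gid] + b[gid];
}

/* Sequential emulation of an (n,1,1) index space: one call of the
 * kernel body per point, with global ids running from 0 to n-1. */
void launch_add(size_t n, const int *a, const int *b, int *c) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t gid = 0; gid < n; ++gid) {
        add_work_item(gid, a, b, c);
    }
}
```

Because no work-item reads what another one writes, the order in which the instances run does not matter, which is what lets the real runtime spread them over several processing elements.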

16 Section 2: Introduction to OpenCL

17 Platform Model
- Several vendors provide OpenCL implementations
  - Examples: AMD, Nvidia
- Each OpenCL implementation defines a different platform
  - A platform enables the host to interact with different OpenCL-capable devices
- OpenCL uses what is called the "Installable Client Driver" (ICD) model
  - The goal is to allow different platforms (i.e., implementations of different vendors) to co-exist

18 Platform Model
(slide content not captured in the transcription)

19
- Every program in OpenCL should select the platform on which the code is going to be executed
- The function clGetPlatformIDs retrieves the platforms available in a machine
- This function should be called twice
  - The first call is used to get the number of available platforms
  - Space is then allocated for the platforms
  - The second call is used to retrieve the information about the platforms
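The call-twice idiom can be sketched without an OpenCL installation by using a stand-in function. In the sketch below, `fake_get_ids` is hypothetical and not part of OpenCL; it only mirrors the argument shape of `clGetPlatformIDs(num_entries, platforms, num_platforms)`, and the three ids it reports are made up. The helper `list_ids` is also illustrative.

```c
#include <assert.h>
#include <stdlib.h>

#define FAKE_AVAILABLE 3u  /* made-up number of available ids */

/* Hypothetical stand-in with the same argument shape as clGetPlatformIDs:
 * fills at most num_entries ids and reports how many exist in total.
 * Returns 0 on success and -1 on invalid arguments. */
int fake_get_ids(unsigned num_entries, int *ids, unsigned *num_ids) {
    static const int available[FAKE_AVAILABLE] = {10, 20, 30};
    if (ids == NULL && num_ids == NULL) return -1;
    if (ids != NULL && num_entries == 0) return -1;
    if (ids != NULL) {
        unsigned n = num_entries < FAKE_AVAILABLE ? num_entries : FAKE_AVAILABLE;
        for (unsigned i = 0; i < n; ++i) ids[i] = available[i];
    }
    if (num_ids != NULL) *num_ids = FAKE_AVAILABLE;
    return 0;
}

/* The two-call idiom: query the count, allocate, then fetch the ids.
 * Returns a malloc'd array that the caller must free, or NULL on failure. */
int *list_ids(unsigned *count) {
    assert(count != NULL);
    /* First call: no output buffer, only ask how many ids exist. */
    if (fake_get_ids(0, NULL, count) != 0 || *count == 0) return NULL;
    /* Allocate space for exactly that many ids. */
    int *ids = malloc(*count * sizeof *ids);
    if (ids == NULL) return NULL;
    /* Second call: retrieve the ids into the allocated buffer. */
    if (fake_get_ids(*count, ids, NULL) != 0) {
        free(ids);
        return NULL;
    }
    return ids;
}
```

In real host code the same three steps would be written against `clGetPlatformIDs`, with `cl_platform_id` elements instead of `int` and the returned `cl_int` compared with `CL_SUCCESS`.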

20
- Each platform defines different devices that can execute kernel functions
  - e.g., the AMD implementation can use x86 CPUs and ATI graphics cards
  - e.g., the Nvidia implementation can use Nvidia GPUs
- Once a platform is selected, the function clGetDeviceIDs returns the devices associated with that platform
  - When using this function we can specify which type of device we want
- As before, this function is called twice
  - The first call determines the number of devices
  - The second call retrieves the device information

21 Contexts
- OpenCL defines the term context
- A context defines the environment within which the kernels are defined and executed
- A context is defined in terms of the following elements:
  - Devices: the collection of OpenCL devices to be used by the host
  - Kernels: the OpenCL functions that will run on OpenCL devices
  - Program objects: the program source code and executables that implement the kernels
  - Memory objects: a set of memory objects that are visible to OpenCL devices and can be operated on and modified by kernels

22 Contexts
- OpenCL allows us to define and manipulate a context
- When a context is created, it is required to indicate the list of devices to associate with it
- The function for creating a context is clCreateContext
  - The function creates a context given a list of devices
  - The properties argument specifies which platform should be used

23 Command Queues
- A command queue is the mechanism for the host to request an action to be performed by a device
- There are three types of commands:
  - Kernel execution commands execute a kernel on an OpenCL device
  - Memory commands transfer data between the host and devices
  - Synchronization commands put constraints on the order in which commands execute
- A different command queue is required for each different device
- The commands within the queue can be synchronous or asynchronous
- Commands can execute in-order or out-of-order

24 Command Queues
- A command queue establishes a relationship between a context and a device
  - I.e., given a context with several devices, the queues allow the host to operate with each of the devices
- The clCreateCommandQueue function creates a queue on a given device of a context

25 Summing up
- After this lecture you should be able to:
  - Check how many OpenCL platforms are available
  - Obtain the information of a given platform
  - Obtain the devices on a platform and their info
  - Create contexts
  - Create queues
- Still a long way to go:
  - Defining kernels
  - Transferring information from the host to the device
  - Compiling kernels
  - Executing kernels
  - and much more!