PDC Summer School Introduction to High-Performance Computing: OpenCL Lab

Instructor: David Black-Schaffer <david.black-schaffer@it.uu.se>

Introduction

This lab assignment is designed to give you experience leveraging OpenCL to convert a standard C program for parallel execution on multiple CPU and GPU cores. You will start by profiling a model PDE solver. Using those performance results you will move portions of the computation over to an OpenCL GPU device to accelerate the computation. At each step you will analyze the resulting performance and answer a series of questions to provoke you to think about what is going on.

Learning Objectives

1. Write and understand OpenCL code
2. Understand the OpenCL compute and memory models
3. Profile and analyze application performance
4. Understand the impact of local work group size on performance
5. Write your own OpenCL kernel
6. Optimize a simple kernel for coalesced memory accesses

Assumed Background

1. Basic knowledge of C programming, compilation, and debugging on Linux
2. Basic understanding of GPU architectures (from the lecture earlier today)
3. Some exposure to program performance analysis

If you are concerned about your level of background, please make sure you choose a partner whose skills complement your own.

Materials

Lecture Notes: GPU Architectures for Non-Graphics People
Lecture Notes: Introduction to OpenCL
Source code starter files provided on the summer school website
The OpenCL Specification v. 1.0 (http://www.khronos.org/registry/cl/specs/opencl-1.0.pdf)
OpenCL Quick Reference Card (http://www.khronos.org/files/opencl-quick-reference-card.pdf)
Nvidia CUDA C Programming Guide v. 3.2 (http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_C_Programming_Guide.pdf)

The last three pieces of documentation can be found by simply googling their titles. You may need them for the later parts of the lab.
Hardware

The lab machines are equipped with a GeForce 8400GS graphics card with 512MB of memory. These are Nvidia Compute version 1.1 cards with 1 streaming multiprocessor of 8 hardware cores, processing 32 threads in a group (warp) and running at up to 1.3GHz. The CPUs on the lab machines are Intel Core2Quad Q9550 (Yorkfield) processors at 2.83GHz. These are 2 dual-core dies in one package sharing a northbridge, with 6MB of cache on each die.

The cluster machines are equipped with four Tesla C2050 graphics cards with 3GB of memory. These are Nvidia Compute version 2.0 cards with 14 streaming multiprocessors of 32 hardware cores each, processing 32 threads in a group (warp) and running at up to 1.1GHz. The CPUs on the cluster machines are

dual Intel Xeon E5620s (Westmere-EP) at 2.4GHz. These are 4-core (8-thread) processors with 12MB of L3 cache each, connected by QPI.

The data used in this tutorial document is from an Apple MacBook Pro with a GeForce 9400M with 256MB of memory. It is an Nvidia Compute 1.1 device with 2 streaming multiprocessors of 16 cores each, processing 32 threads in a group (warp) and running at up to 1.1GHz. The CPU is an Intel Core2Duo P8800 at 2.66GHz with 3MB of L2 cache.

A: Getting Started

To get started you will run the OpenCL Hello World code that was discussed in the lecture.

1. Download the zip file of the code from the PDC website.
2. Unzip it in your home directory on one of the lab machines.
3. Type make hello
4. Run ./hello_world.run

If all goes well you should get one thousand sine calculations printed out to the terminal. To make sure you understand what is going on, go to the code and make and verify the following changes:

1. Change the code to run on the CPU or GPU (make sure it works on both). Note: if your code crashes when you change it to run on the CPU, you probably need to do some error checking. The hello_world program has none. Try looking at the error codes returned by clGetPlatformIDs, clGetDeviceIDs, and clCreateCommandQueue. (You can find out how to get the error codes back by looking at the OpenCL documentation link.) What's going on? (You can look up the error IDs in the cl.h file, which can be found online or at the end of this document.)
2. Change the kernel to calculate the cosine of the numbers instead of the sine.
3. Change the size of the array processed to be 1024 times larger, remove the printout of the results, and time (using the time command on the command line) the difference in speed executing on the CPU vs. GPU.
4. Think about the results. Can you explain why one is faster than the other?
5. Now change the program to calculate the cosine without using OpenCL and compare the performance. What do you notice?
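Exercise 1 asks you to check the error codes returned by OpenCL calls. As a rough sketch (not part of the lab's starter code), a small lookup helper can make those codes readable without digging through cl.h each time; the numeric values below are the ones defined in cl.h, and only a few common codes are shown:

```c
#include <stdio.h>

/* Sketch of an error-code-to-name helper (illustrative, not from the lab
 * starter files). The numeric values are the ones defined in cl.h; unknown
 * codes fall through to a generic string. */
const char *cl_error_name(int err)
{
    switch (err) {
    case 0:   return "CL_SUCCESS";
    case -1:  return "CL_DEVICE_NOT_FOUND";
    case -2:  return "CL_DEVICE_NOT_AVAILABLE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    case -30: return "CL_INVALID_VALUE";
    case -33: return "CL_INVALID_DEVICE";
    case -34: return "CL_INVALID_CONTEXT";
    default:  return "(unknown OpenCL error)";
    }
}

/* Usage after any OpenCL call that returns a cl_int, for example:
 *   cl_int error = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
 *   if (error != 0)
 *       fprintf(stderr, "clGetDeviceIDs failed: %s\n", cl_error_name(error));
 */
```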
B: Introduction to the PDE Solver

Program Overview

This tutorial uses a very simple PDE-solver-like program as a demonstration. The program is provided in a standard C version in the file main_c_1.c. The program works as follows:

int main (int argc, const char * argv[]) {
    float range = BIG_RANGE;
    float *in, *out;
    // ======== Initialize
    create_data(&in, &out);
    // ======== Compute
    while (range > LIMIT) {
        // Calculation
        update(in, out);
        // Compute Range
        range = find_range(out, SIZE*SIZE);
        swap(&in, &out);
    }

}

Basic program flow. (Profiling measurements have been removed for simplicity.)

The program creates two 4096x4096 arrays (in and out) and then performs an update by processing the in array to generate new values for the out array. The out array is then processed to determine the range (maximum value minus minimum value). If the range is too large, the program swaps the arrays (in becomes out and out becomes in) and repeats until the range converges below the limit.

The update calculation takes in the four neighbors (a, b, c, d) along with the central point (e) to calculate the next central value as a weighted sum. This is a basic 5-point stencil operation.

[Figure: 5-point stencil diagram showing the neighbors a, b, c, d around the central point e.]

The update iterates through the whole in matrix, looking at the neighbors to calculate the new value, which is written into the out matrix. For the next update the in and out matrices are swapped, thereby overwriting the old values with the new ones.

void update(float *in, float *out) {
    for (int y=1; y<SIZE-1; y++) {
        for (int x=1; x<SIZE-1; x++) {
            float a = in[SIZE*(y-1)+(x)];
            float b = in[SIZE*(y)+(x-1)];
            float c = in[SIZE*(y+1)+(x)];
            float d = in[SIZE*(y)+(x+1)];
            float e = in[SIZE*y+x];
            out[SIZE*y+x] = (0.1*a+0.2*b+0.2*c+0.1*d+0.4*e);
        }
    }
}

The update calculation is simply a 5-point stencil.

After the update, the program runs a range calculation on the output matrix that simply finds the minimum and maximum values and returns the difference between them as the range.

float find_range(float *data, int size) {
    float max, min;
    max = min = 0.0f;
    // Iterate over the data and find the min/max
    for (int i=0; i<size; i++) {
        if (data[i] < min) min = data[i];
        else if (data[i] > max) max = data[i];
    }
    // Report the range
    return (max-min);
}

Finding the range simply iterates over all the data and keeps track of the minimum and maximum values, and then returns the difference. (This function is in util.c, as it is shared by all

the sample programs.)

The program repeats the loop of update/range until the range is below the LIMIT specified in the parameters.h file. Don't change anything in the parameters.h file, as it is used by all programs.

Running the Program

1. Type make 1 to build the C version of the program.
2. Run the program by typing ./1_c.run

When you run the program you will see the output converge after approximately 42 iterations. The program will print out the range at convergence. You should use this value to check that the optimized versions of the code later on in the lab are producing the same results.

In addition to the range results, the program will print out profiling information. From this you can see how long the program took in total (Total) and how long the update (Update) and range (Range) calculations took. It also reports the standard deviation, which is a good sanity check as to how reliable your performance measurements are.

1. C
Allocating data 2x (4096x4096 float, 64.0MB).
Range starts as: 400.000000.
Iteration 1, range=390.086487.
...
Iteration 42, range=99.828613.
Total:  Total: 14564.748000 ms (Avg: 14564.748000 ms, Std. dev.: 0.000000 ms, 1 samples)
Update: Total: 12275.922000 ms (Avg: 306.898050 ms, Std. dev.: 39.567676 ms, 42 samples) [ignoring first/last samples]
Range:  Total: 1619.951000 ms (Avg: 40.498775 ms, Std. dev.: 1.242616 ms, 42 samples) [ignoring first/last samples]

Output of the C program showing the final range and the profiling information. Note the large standard deviation on the Update measurement (13%), indicating that something else may have been happening on the computer at the same time.

Performance Measurements

Take a look at the code in the main_c_1.c file to see how the application is implemented. The only significant differences from the code listed above are for the performance measurements. As you can see, performance measurements are made using perf variables. These must be initialized first, which is done in init_all_perfs().
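The perf mechanism is provided by the lab's starter code, but as a rough illustration (not the lab's actual implementation), a start/stop timer that accumulates elapsed-time samples might look like this in plain C, assuming POSIX clock_gettime() is available:

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Illustrative sketch of a start/stop perf timer (not the lab's actual
 * implementation). Each start/stop pair records one elapsed-time sample;
 * total_ms accumulates across samples, as in the printed profiling output. */
typedef struct {
    struct timespec started;
    double total_ms;
    int samples;
} perf_t;

void init_perf(perf_t *p)
{
    p->total_ms = 0.0;
    p->samples = 0;
}

void start_perf(perf_t *p)
{
    clock_gettime(CLOCK_MONOTONIC, &p->started);
}

void stop_perf(perf_t *p)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    p->total_ms += (now.tv_sec - p->started.tv_sec) * 1000.0
                 + (now.tv_nsec - p->started.tv_nsec) / 1.0e6;
    p->samples++;
}
```

Averages and standard deviations like those in the output above can then be computed from the accumulated samples.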
The perfs are timers which are started by calling start_perf_measurement() and stopped with stop_perf_measurement(). Each time a start-stop pair is called, the perf records the elapsed time. At the end, the results are printed out with the print_perfs() call.

Now take a look at the performance numbers you got and decide which part of your code should be optimized with OpenCL first.

C. Accelerating (Decelerating?) with OpenCL

Now that you understand the basic algorithm and have seen the source code, it is time to accelerate your program by using OpenCL. As you can see from the performance data for the C code, the best place to start is by replacing the update() function with an OpenCL one.

Ready? If the lectures were any good this shouldn't take more than a few minutes. You can use the hello world file as a starting point.

I'm just kidding. Starting from nothing is a major pain. So I've done this for you in the main_opencl_1.c file, which is summarized below:

int main (int argc, const char * argv[]) {
    float range = BIG_RANGE;
    float *in, *out;
    // ======== Initialize
    init_all_perfs();

    create_data(&in, &out);
    // ======== Setup OpenCL
    setup_cl(argc, argv, &opencl_device, &opencl_context, &opencl_queue);
    // ======== Compute
    while (range > LIMIT) {
        // Calculation
        update_cl(in, out);
        // Compute Range
        range = find_range(out, SIZE*SIZE);
        iterations++;
        swap(&in, &out);
        printf("Iteration %d, range=%f.\n", iterations, range);
    }
}

main_opencl_1.c

Note that the only two changes to the high-level algorithm are to set up OpenCL at the beginning (setup_cl) and to call update_cl() instead of update(). We're going to skip the contents of setup_cl() for now. You can take a look at it in opencl_utils.c if you're interested, but all it does is look for the right type of device, based on whether you passed in CPU or GPU on the command line, and create a context and queue for you.

Instead, take a look at update_cl(). This is where the work is being done.

void update_cl(float *in, float *out) {
    cl_int error;
    // Load the program source
    char* program_text = load_source_file("kernel.cl");
    // Create the program
    cl_program program;
    program = clCreateProgramWithSource(opencl_context, 1, (const char**)&program_text, NULL, &error);
    // Compile the program and check for errors
    error = clBuildProgram(program, 1, &opencl_device, NULL, NULL, NULL);
    // Create the computation kernel
    cl_kernel kernel = clCreateKernel(program, "update", &error);
    // Create the data objects
    cl_mem in_buffer, out_buffer;
    in_buffer = clCreateBuffer(opencl_context, CL_MEM_READ_ONLY, SIZE_BYTES, NULL, &error);
    out_buffer = clCreateBuffer(opencl_context, CL_MEM_WRITE_ONLY, SIZE_BYTES, NULL, &error);
    // Copy data to the device
    error = clEnqueueWriteBuffer(opencl_queue, in_buffer, CL_FALSE, 0, SIZE_BYTES, in, 0, NULL, NULL);
    error = clEnqueueWriteBuffer(opencl_queue, out_buffer, CL_FALSE, 0, SIZE_BYTES, out, 0, NULL, NULL);
    // Set the kernel arguments
    error = clSetKernelArg(kernel, 0, sizeof(in_buffer), &in_buffer);
    error = clSetKernelArg(kernel, 1, sizeof(out_buffer), &out_buffer);
    // Enqueue the kernel
    size_t
    global_dimensions[] = {SIZE,SIZE,0};
    error = clEnqueueNDRangeKernel(opencl_queue, kernel, 2, NULL, global_dimensions, NULL, 0, NULL, NULL);
    // Enqueue a read to get the data back
    error = clEnqueueReadBuffer(opencl_queue, out_buffer, CL_FALSE, 0, SIZE_BYTES, out, 0, NULL, NULL);
    // Wait for it to finish
    error = clFinish(opencl_queue);
    // Cleanup
    clReleaseMemObject(out_buffer);
    clReleaseMemObject(in_buffer);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    free(program_text);

}

update_cl() with all the error-checking code removed to make it easier to read.

As you can see, this code is shockingly similar to the code in the Hello World OpenCL program. All it does is:

1. Create a program (loaded from the text file kernel.cl)
2. Build it
3. Create a kernel (update)
4. Create two memory objects (in_buffer and out_buffer)
5. Enqueue a write to write the data to the memory objects (why would we copy data to the out buffer?)
6. Set the kernel arguments
7. Enqueue the kernel execution
8. Enqueue a read to get the data back
9. Wait for the execution to finish
10. Clean up all the resources we allocated

There's really nothing more to it than that. You can open the kernel.cl file and see the update kernel. It is very similar to the update() C code, except that it has to find out which update it should do (get_global_id()) and needs to avoid processing if it is on the edge. (The C version uses loops from 1 to SIZE-1 to avoid the edges.)

kernel void update(global float *in, global float *out) {
    int WIDTH = get_global_size(0);
    int HEIGHT = get_global_size(1);
    // Don't do anything if we are on the edge.
    if (get_global_id(0) == 0 || get_global_id(1) == 0) return;
    if (get_global_id(0) == (WIDTH-1) || get_global_id(1) == (HEIGHT-1)) return;
    int y = get_global_id(1);
    int x = get_global_id(0);
    // Load the data
    float a = in[WIDTH*(y-1)+(x)];
    float b = in[WIDTH*(y)+(x-1)];
    float c = in[WIDTH*(y+1)+(x)];
    float d = in[WIDTH*(y)+(x+1)];
    float e = in[WIDTH*y+x];
    // Do the computation and write back the results
    out[WIDTH*y+x] = (0.1*a+0.2*b+0.2*c+0.1*d+0.4*e);
}

The OpenCL kernel update().

Running the OpenCL version

1. Type make 1 (as before)
2. To run the CPU version type ./1_opencl.run CPU
3. To run the GPU version type ./1_opencl.run GPU

So what did you get? How do the multicore OpenCL CPU version and the OpenCL GPU version compare to the standard C version for speed?

D. Tutorial Worksheet

Well, it's kind of hard to make sense of all those performance numbers, so now is the time to open the Tutorial Worksheet file and start filling in some numbers. Indeed, for the rest of the tutorial you'll be using this to track and analyze your performance. Once you've finished implementing the optimizations you'll then submit your code to the cluster. This will give you a second set of performance numbers

which you will then compare to the ones you get on the lab machine to see the impact of different GPUs on performance and optimizations.

Don't Cheat

For you to learn the most you should read the appropriate part of the text here, write and debug the code, fill in the worksheet, and answer the questions in the worksheet, in that order. So when the tutorial says "don't turn the page until you've finished filling in the worksheet", please don't. If you cheat and read ahead you will bias yourself and will learn a lot less. If you get stuck, ask for help before you read ahead.

Benchmarking is hard

You'll be getting timing information by running on the lab machines. If you're doing something else at the same time (such as web browsing, filling in a worksheet, etc.) it will impact your measurements. You can use the standard deviation in the measurements in the output to evaluate how much you can trust your measurements. But it would be best to fill in the worksheet on your own laptop using Excel, rather than in OpenOffice on the lab machines.

Get started

Go ahead and fill in the numbers from your first three runs in the tutorial worksheet. They should go under the "1. Baseline" section. Be sure to fill in the Final Range Values to make sure that you're actually computing the same thing on all versions of the code. (It's very easy to get amazing speedups by computing nothing without realizing it.)

Now answer the questions for part 1 of the worksheet.

Don't turn the page until you've finished filling in the worksheet.

1. Baseline

Note: the graphs and data shown here are for running on my MacBook Pro. Depending on your hardware and operating system you may see different results, so you're going to have to think for yourself to figure this stuff out. The performance discussion below is meant as a guideline.

[Figure: worksheet screenshot with the baseline timing data and a performance graph comparing the C-CPU, CL-CPU, and CL-GPU runs, normalized to the C-CPU time.]

The tutorial worksheet uses light blue to indicate places you need to fill in data, formulas, or analysis. The performance graphs are all normalized to the C-CPU speed (on the left), so bigger bars are worse. From this data the GPU version is 1.8 times slower than the C version, and the OpenCL CPU version is 25% slower. Not so good.

Remember, the questions are there to help you learn, so try to think about them rather than just putting something down.

From the graph in the worksheet we can clearly see that we're spending a huge amount of time calculating the update. However, the OpenCL code for update is pretty complicated, so this level of detail is not very helpful.
What we need to do is get more detailed performance data. Open the main_opencl_2_profile.c file. This file is similar to the file from part 1 above, except that it defines a bunch more performance counters. These are:

program_perf for loading and compiling the program
create_perf for creating the buffers
write_perf for writing to the buffers
read_perf for reading from the buffers
finish_perf for timing the finish
cleanup_perf for timing the cleanup

Your job is now to go into the update_cl() method and insert appropriate perf measurement calls to measure the performance of these parts of the program. Go ahead and do this, and then run the program with make 2 and ./2_profile.run CPU and ./2_profile.run GPU. Fill in the data in section two of the worksheet and answer the questions.

Note: for each section of this tutorial you use make N and ./N_ GPU and ./N_ CPU to run the code. It is important that you use the file names mentioned here and listed in the Makefile so that you will be able to submit everything to the cluster in the end.

Don't turn the page until you've finished filling in the worksheet.

2. Baseline with Profiling

So what did you find? Is it what you expected? Here's what I got:

[Figure: worksheet screenshot with the profiled baseline data, breaking the total time into Update Kernel, Range Compute, Read Data, Compile Programs, Create Buffers, Write Data, Finish, and Cleanup for the C-CPU, CL-CPU, and CL-GPU runs.]

I've removed the answers to the questions to make it less tempting to look ahead and cheat. But there's a discussion of them in the following text.

A few things jump out at me immediately. The first is that I'm spending a huge amount of time in Finish, particularly on the CPU. That's strange, since Finish doesn't actually do any work other than call clFinish(). Of course, if you answered question D (which was discussed in the lectures) you'll remember that OpenCL submits work asynchronously. That is, the work isn't done when you call a clEnqueue function; it's just scheduled. So when you call clFinish(), the runtime will stall your program until all the work is done. Therefore, the timers you put around clEnqueueNDRangeKernel() and clEnqueueReadBuffer(), etc., don't record much (if any) time, because they are just enqueueing the kernel.

Another strange thing is that I see almost no time for compiling the kernel, despite the fact that I'm creating and calling clBuildProgram() 40 times in this program. Your results are probably a bit different if you're not running on a Mac. The reason is that Mac OS X caches programs when you compile them the first time, so each subsequent compilation of the same source code is really fast. If they weren't cached, the program would spend a huge amount of time re-compiling.
Now make a copy of your source file, call it main_opencl_3_profile_finish.c, and add in clFinish() calls as needed for the performance counters you added before. This will make sure that the host waits for the OpenCL commands to finish before continuing with your program. By doing this you will synchronize with the OpenCL device, and your performance measurements on the host side will be accurate. After you have done this, run your program, fill in part 3 of the worksheet, and answer the questions.

Note: OpenCL provides the ability to request events from commands and use those events to get information on when they were submitted to OpenCL, when they started, and when they finished. This is another way (indeed the preferred way) to get this information, but it is trickier.

Don't turn the page until you've finished filling in the worksheet.
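As a rough sketch of the event-based approach mentioned in the note above (an illustration, not part of the lab's starter files): the command queue must be created with the CL_QUEUE_PROFILING_ENABLE property, and clGetEventProfilingInfo() then returns device timestamps in nanoseconds. The conversion helper below is plain C; the OpenCL calls shown in the comment require CL/cl.h and a real device.

```c
/* Sketch of event-based kernel timing (illustrative, not from the lab's
 * starter files). OpenCL profiling events report cl_ulong device
 * timestamps in nanoseconds; this helper converts a start/end pair to
 * milliseconds. */
double elapsed_ms(unsigned long long start_ns, unsigned long long end_ns)
{
    return (double)(end_ns - start_ns) / 1.0e6;
}

/* Host-side usage, assuming a queue created with CL_QUEUE_PROFILING_ENABLE
 * and the kernel/queue/dimensions from the lab code:
 *
 *   cl_event evt;
 *   clEnqueueNDRangeKernel(opencl_queue, kernel, 2, NULL,
 *                          global_dimensions, NULL, 0, NULL, &evt);
 *   clWaitForEvents(1, &evt);
 *   cl_ulong t0, t1;
 *   clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
 *                           sizeof(t0), &t0, NULL);
 *   clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
 *                           sizeof(t1), &t1, NULL);
 *   printf("kernel: %.3f ms on the device\n", elapsed_ms(t0, t1));
 */
```

Unlike host-side timers around clFinish(), this measures only the time the command actually spent executing on the device, excluding queueing and submission delays.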

3. Baseline with Profiling (and clFinish)

Now the results should make a lot more sense. Finish is now a tiny part of the overall time (nearly 0) and we can see where we are spending time.

[Figure: worksheet screenshot with the profiled data after adding clFinish() calls, showing most of the time going to Update, Write Data, Read Data, and Cleanup.]

With clFinish() forcing the runtime to complete each operation, we can now accurately time what is going on. Clearly we are spending a lot of time moving data (write and read) and a good deal of time in cleanup. (Your results are likely to be different depending on your GPU and system!)

For my system, I'm wasting a lot of time in cleanup and moving data (write and read). The update calculation is actually pretty fast, and compiling the program is not too bad. Note that the ratios of these times will depend on the relative speeds of the CPU and GPU. If you have a much slower GPU (like the lab machines; roughly half as fast) and a faster CPU (like the lab machines; roughly 50% faster) then you may see that the update time is a lot larger on the GPU.

Now with this data we can start to see some results. The speedup for the update calculation on the OpenCL CPU is 1.82x vs. the C version. This is decent (although not great) considering that we have 2 CPU cores running in OpenCL. The reason it's not 2.0 is that the kernel code has a lot of extra overhead to deal with the edge cases.
The C code deals with them once in the loops and is therefore more efficient.

If we look at where our time is being spent, we see that we're spending only 1/3 of our time doing the update and nearly 50% of our time doing data movement. (You may see different numbers depending on things like how much of your time is spent on compilation.) This kind of information lets us know what optimization we need to do: we need to get the compilation, cleanup, and writing of data out of the main loop, since these things only need to be done once. Then we can keep the data on the OpenCL device and just read it back for the range calculation.

To do this, open the file main_opencl_4_profile_nooverhead.c and fill in the missing parts. You can get most of the code you'll need from your current file. All this file does is add two calls at the beginning (setup_cl_compute() and copy_data_to_device()), change the main loop to call update_cl() with the appropriate cl_mem buffer objects instead of the data pointers, and then call read_back_data() to get the data back for the range calculation. At the end it calls cleanup_cl() to clean up.

The program uses the iterations count to determine which buffer to read from and write to (see the get_in_buffer() and get_out_buffer() functions). This way the data can be kept on the device and we just change the kernel arguments to swap buffers.
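The parity-based buffer selection described above might look something like this (an illustrative sketch, not the lab's actual implementation; void* stands in for the cl_mem objects so the selection logic is visible without an OpenCL context):

```c
#include <stddef.h>

/* Illustrative sketch of ping-pong buffer selection (not the lab's actual
 * code). On even iterations buffer_a is the input and buffer_b the output;
 * on odd iterations the roles swap. In the real program these would be
 * cl_mem objects passed to clSetKernelArg(). */
static void *buffer_a;
static void *buffer_b;
static int iterations = 0;

void *get_in_buffer(void)
{
    return (iterations % 2 == 0) ? buffer_a : buffer_b;
}

void *get_out_buffer(void)
{
    return (iterations % 2 == 0) ? buffer_b : buffer_a;
}
```

Because only the kernel arguments change between iterations, no data ever has to leave the device for the update step itself.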

// ======== Setup the computation
setup_cl_compute();

start_perf_measurement(&write_perf);
copy_data_to_device(in, out);
stop_perf_measurement(&write_perf);

// ======== Compute
while (range > LIMIT) {
    // Calculation
    start_perf_measurement(&update_perf);
    update_cl(get_in_buffer(), get_out_buffer());
    stop_perf_measurement(&update_perf);

    // Read back the data
    start_perf_measurement(&read_perf);
    read_back_data(get_out_buffer(), out);
    stop_perf_measurement(&read_perf);

    // Compute Range
    start_perf_measurement(&range_perf);
    range = find_range(out, SIZE*SIZE);
    stop_perf_measurement(&range_perf);

    iterations++;
    printf("iteration %d, range=%f.\n", iterations, range);
}

// ======== Finish and cleanup OpenCL
start_perf_measurement(&finish_perf);
clFinish(opencl_queue);
stop_perf_measurement(&finish_perf);

start_perf_measurement(&cleanup_perf);
cleanup_cl();
stop_perf_measurement(&cleanup_perf);

The new main loop with the OpenCL setup code moved out. The functions you need to define are setup_cl_compute(), copy_data_to_device(), update_cl(), read_back_data(), and cleanup_cl(). Make the changes to get this program working, then get your performance numbers and fill in the next section of the worksheet. Remember to make sure your Final Range Values match so you can be (reasonably) sure that your code is correct! Don't turn the page until you've finished filling in the worksheet. Page 12 of 26

4. Overhead Outside of Loop
Now that we've removed the obvious overheads (compilation, setup, cleanup, and copying data to the device) from the main loop we can start seeing some performance improvements.

[Chart: per-phase times with the setup, compilation, buffer creation, initial write, and cleanup moved outside the main loop, for the C-CPU, CL-CPU, and CL-GPU versions.]

With the overhead removed from the loop we have now actually succeeded in accelerating our code using OpenCL. The CPU version now takes 75% as long as the C version (not so good given that we should be at 50% with two CPUs) and the GPU version is faster. The CL-CPU version is now 1.8x faster than the initial version and the CL-GPU version is 3.4x faster. Again, depending on the relative speeds of the GPUs and CPUs you may see different improvements, or even none at all. But the most relevant part is that our data movement is now 13% of our total time, down from 47% before. If you were spending a lot of time on compilation you will see a smaller relative decrease. Looking at the change in just data movement time (read plus write time), we are now spending only 15% (CPU) and 19% (GPU) of the time we were spending on data movement before. This is a more than 5x reduction in data movement time! So why is the data movement reduced? Simply because we're not constantly copying the data to and from the device.
We are copying it back each time we compute the range, but that's a lot less movement than before. Looking at the data above, the next optimization step for my machine is to move the range computation onto the OpenCL device so we can eliminate the data read and accelerate the range calculation. (Of course it would be great to get the update kernel to run even faster, but we're not going to touch it for now.) But before we do that we're going to take a look at the impact of local dimensions on performance. To do this, open the file main_opencl_5_explore_local.c. This file simply has a new main function that walks through a set of local dimensions (stored in locals[][]) and then calls run with them set. You need to copy your code from the last section into here, rename its main function to run, and update the code to use the local dimension chosen by the new main loop. Do this and answer the questions in the worksheet. Don't turn the page until you've finished filling in the worksheet. Page 13 of 26
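The structure of the new driver can be sketched like this. It is only an illustration: the dimension values in the table and the empty run() body are placeholders, not the lab's actual locals[][] contents:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of main_opencl_5_explore_local.c: the new main walks a
   table of local dimensions, sets the pair that the rest of the
   code reads, and calls run() (the renamed old main) per entry. */
static const size_t locals[][2] = {
    {1, 1}, {2, 2}, {4, 4}, {8, 8}, {16, 16},   /* placeholder values */
};

static size_t local_x, local_y;  /* used when enqueueing the kernel */

static void run(void) {
    /* ... the old main body, enqueueing with {local_x, local_y} ... */
}

static int explore(void) {
    int runs = 0;
    for (size_t i = 0; i < sizeof(locals) / sizeof(locals[0]); i++) {
        local_x = locals[i][0];
        local_y = locals[i][1];
        run();
        runs++;
    }
    return runs;
}
```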

5. Exploring Local Dimensions
When we explicitly set the local dimensions we need to choose values that evenly divide the global dimensions. So for this program, with global dimensions of 4096x4096, we can choose just about any power of 2 by power of 2. However, the total size of the local dimensions is limited by the hardware resources of the device. For the lab machines and my machine this limits the maximum work-group size to 512, and for the cluster to 1024. While any local dimensions that evenly divide the global dimensions and whose product is less than that maximum are supported, the performance may vary dramatically.

Remember that the hardware compute-units execute multiple threads from their work-groups at the same time. If the size of the work-group is less than the minimum number of threads the compute-unit executes at once, the additional hardware will be wasted. In the case of Nvidia hardware, each streaming multiprocessor (compute-unit) has 8 hardware cores which execute one thread on every other cycle (16 physical threads at once) and the architecture executes threads in groups of 32. This means that if the compute-unit has fewer than 32 threads at any given time it will be wasting processor cores. (Take a look at the slide in the OpenCL lecture.)

[Table: total and per-phase times for each tested local dimension size on the GPU.]
[Chart: run time and device utilization for each local dimension size.] Performance differences for different local dimension sizes along with device utilization.

Trying out different local dimensions gives us a feeling for how important this is. When we plot them together with utilization we see that we need full utilization to get the highest performance. (Note that our performance may not be limited by utilization, so it may not scale directly. In this case we are limited by non-coalesced accesses at some point.) But on my machine the performance increases all the way, with the NULL version clearly not giving the best performance. The bottom line is that if you care about the last bit of performance you should not let the runtime try to guess the best local size. Page 14 of 26
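The legality rules from this section can be captured in a small host-side predicate (the function name is mine, not part of the lab code): each local dimension must evenly divide the 4096 global dimension, and the work-group size must not exceed the device maximum.

```c
#include <assert.h>
#include <stddef.h>

#define GLOBAL_DIM 4096

/* Returns 1 if {lx, ly} is a legal local size for this program:
   both dimensions divide the global dimension evenly, and the
   work-group size (the product) fits the device limit (512 on the
   lab machines and my machine, 1024 on the cluster). */
static int local_size_ok(size_t lx, size_t ly, size_t max_wg) {
    if (lx == 0 || ly == 0) return 0;
    if (GLOBAL_DIM % lx != 0 || GLOBAL_DIM % ly != 0) return 0;
    return lx * ly <= max_wg;
}
```

For example, 16x16 (256 work-items) is legal everywhere, while 32x32 (1024 work-items) is legal on the cluster but not on the lab machines.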

6. Range in a Kernel
Now it is time to put the range calculation into a kernel. The tricky part about calculating the range is that it is a reduction, and, as you remember from the lectures, reductions are limited by synchronization. To make this simple we're going to do the last stage of the reduction on the host. That is, you will write an OpenCL kernel where each work-item will calculate the min and max for part of the data and write those to an OpenCL buffer. This (smaller) buffer will then be read back to the host where the final reduction will be done.

[Figure: the range kernel.] Each work-item processes some number of items from the update kernel's output and stores the minimum and maximum across those values in the range buffer. (E.g., work-item 0 processes the first 7 items in the data and work-item 1 processes the next 7.) The number of items each work-item processes is determined by the number of work-items and the size of the data to process. The range buffer is then read back to the host, which does the final reduction.

The range kernel therefore takes the output from the update kernel as its input and produces a range_buffer output. To implement this, open the main_opencl_6_range_kernel.c file. This file is similar to what you had in step 4 (before adding the local size experiment) but it has an additional kernel variable (range_kernel), an additional cl_mem variable (range_buffer), and an additional host buffer (range_data) for interacting with the range kernel. You will need to add your range kernel to the kernel.cl file and update the setup_cl_compute() method to create the appropriate range_buffer, range_kernel, and range_data. (Also update cleanup_cl() to clean up after them.) I've provided a skeleton range kernel below for you to start with.
The number of work-items to use for the range kernel is defined in the RANGE_SIZE #define.

// The number of work items to use to calculate the range
#define RANGE_SIZE 1024*4

// ======== Compute
while (range > LIMIT) {
    // Calculation
    start_perf_measurement(&update_perf);
    update_cl(get_in_buffer(), get_out_buffer());
    stop_perf_measurement(&update_perf);

    // Range
    start_perf_measurement(&range_perf);
    range_cl(get_out_buffer());
    stop_perf_measurement(&range_perf);

    // Read back the data
    start_perf_measurement(&read_perf);
    read_back_data(range_buffer, range_data);
    stop_perf_measurement(&read_perf);

    // Compute Range
    start_perf_measurement(&reduction_perf);
    range = find_range(range_data, RANGE_SIZE*2);
    stop_perf_measurement(&reduction_perf);

    iterations++;
    printf("iteration %d, range=%f.\n", iterations, range);
}

The changes to the program for running the range kernel. Instead of reading back the data and then calling range, we call range_cl() (which enqueues the range kernel) and then read back the range_buffer to do the final reduction on the

host. Note that you need to make sure the read_back_data() function reads the right amount of data.

kernel void range(global float *data, int total_size, global float *range) {
    float max, min;

    // Find out which items this work-item processes
    int size_per_workitem = ...
    int start_index = ...
    int stop_index = ...

    // Find the min/max for our chunk of the data
    min = max = 0.0f;
    for (int i=start_index; i<stop_index; i++) {
        if (...) min = ...
        else if (...) max = ...
    }

    // Write the min and max back to the range we will return to the host
    range[...] = min;
    range[...] = max;
}

The range kernel skeleton. The input is the data to process (the output from the last update kernel, set by clSetKernelArg()) and the total size of the data to process; the results are written to the range output. The work-item functions section of the Khronos OpenCL quick reference card will be helpful in filling in the kernel code above.

Now implement the range kernel and change the program to use it. Test the code by verifying that the number of iterations and the final range value are the same as in the previous versions. If they are not, you have a bug and will have to fix your code. Once you've got it working, fill in the worksheet and continue.

Don't turn the page until you've finished filling in the worksheet. Page 16 of 26
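The chunk arithmetic the skeleton asks for can be illustrated with ordinary host-side C. This shows only the partitioning idea, not the kernel solution, and it assumes total_size is a multiple of the number of work-items (which holds here: 4096x4096 data and RANGE_SIZE = 4096):

```c
#include <assert.h>

/* Work-item `gid` of `num_items` covers one contiguous slice of the
   data: elements [gid*chunk, (gid+1)*chunk), where
   chunk = total_size / num_items. Together the slices tile the
   whole array with no gaps and no overlap. */
static int chunk_start(int gid, int num_items, int total_size) {
    return gid * (total_size / num_items);
}

static int chunk_stop(int gid, int num_items, int total_size) {
    return (gid + 1) * (total_size / num_items);
}
```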

7. Coalescing the Range Accesses
So what happened? The performance got worse when you ran the range kernel on the GPU. Why is that? Well, as the questions in part 6 hinted, the way we wrote the range kernel resulted in a lot of uncoalesced memory accesses. (Go back to the lecture notes if you're not certain what this means.)

[Chart: per-phase times with the range computation in a kernel, for the C-CPU, CL-CPU, and CL-GPU versions.]

Putting the range kernel on the OpenCL device actually slowed down the GPU. It did give a small speedup on the CPU, however. Let's first take a look at the CPU. The CPU code is running faster because we've eliminated the time spent reading the data back. We're still not at a 2x speedup, however. The GPU code is more disappointing. The resulting application is nearly twice as slow as when we ran the range code on the CPU, even though we've eliminated the data movement. The reason for this is that the way we wrote the range kernel results in uncoalesced memory accesses, which dramatically reduce the available bandwidth. To understand this, it's important to remember that when you read memory from DRAM you get a large chunk of memory. In the case of a GPU, the memory interface is typically at least 384 bits wide, which is 12 float values per access. If you only use one of those 12, then you are throwing away 11/12ths (92%) of your bandwidth.
The wider the DRAM interface, the worse a problem uncoalesced access can be. Unfortunately coalescing is tricky. Older GPUs have very specific rules for coalescing (each thread has to access the next item in an array and they have to do it at the same time and in order) while newer GPUs have more relaxed rules (any threads that access neighboring data at the same time). So getting good coalescing behavior is tricky. Luckily our range kernel is simple enough that we can understand what is going on and how to fix it. You now need to change your range kernel so that each work-item in a work-group accesses an element in the input array that is next to the one accessed by the next work-item in the work-group. Take a look at the picture below to understand. Page 17 of 26
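The difference between the two layouts is just index arithmetic, written here as host-side C for illustration (the function names are mine): in the chunked pattern neighboring work-items are a whole chunk apart on each step, while in the interleaved pattern they touch adjacent elements.

```c
#include <assert.h>

/* Chunked (uncoalesced) access: work-item gid reads its own
   contiguous block, so on a given step i the work-items hit
   addresses `chunk` elements apart. */
static int chunked_index(int gid, int i, int chunk) {
    return gid * chunk + i;
}

/* Interleaved (coalesced) access: on a given step i the work-items
   hit consecutive addresses, so one wide DRAM access serves many
   work-items at once. */
static int interleaved_index(int gid, int i, int num_items) {
    return i * num_items + gid;
}
```

Both patterns visit every element exactly once; only the order, and therefore the coalescing behavior, changes.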

[Figure: example of our range kernel's memory accesses, original order (top) and re-ordered (bottom).] In the original accesses (top) we assume we read 4 values from DRAM on every access, but only use one of them. (The real hardware is far worse in this regard.) In the re-ordered version (bottom), we read 8 values from DRAM in every two accesses, and use 7 of them. This gives us a 7x increase in effective bandwidth, and can improve our performance by requiring only 2 memory accesses in place of 7.

Now go and write a second kernel (range_coalesced) that works this way and fill in the next part of the tutorial. Make sure you validate your results by looking at the final range value. Copy your main_opencl_6_range_kernel.c file to a new main_opencl_7_range_coalesced.c file and change it to use the new kernel. Run it and fill in the worksheet. (You can add the new kernel to the same kernel.cl file by calling it range_coalesced instead.) Don't turn the page until you've finished filling in the worksheet. Page 18 of 26

After fixing the range kernel to be coalescing we see that the GPU performance is vastly improved (13x faster range compared to uncoalesced, plus reduced data movement). The CPU performance is about the same, but the range kernel itself is actually 10% slower. The reason for this is that the CPU has a cache and a hardware prefetcher, so it doesn't benefit much from changing the order of the accesses. In fact, straight linear accesses (the uncoalesced version) are faster on the CPU due to this hardware.

[Chart: per-phase times after coalescing the range accesses, for the C-CPU, CL-CPU, and CL-GPU versions.]

After coalescing the range accesses we see that we've basically eliminated data movement from our application and achieved a reasonable 3x speedup on a slow GPU.

Overall Speedup
If you click on the Overall Speedup tab at the bottom of the tutorial worksheet you can see the impact of the various optimizations on the application performance. Clearly the biggest improvements came from removing the overhead from inside the loop (not surprising) and coalescing the range kernel. It's important to have this kind of information for your optimizations so you can see the benefit of each step.
[Chart: overall speedups from the changes in the tutorial, for CL-CPU and CL-GPU.]

CPU vs. GPU
We've now encountered two cases where the CPU runs more slowly with a GPU-optimized kernel: (1) the GPU update kernel has the overhead of checking the borders each time vs. the C code, which skips the borders in the loop; and (2) the coalesced GPU range kernel accesses data in a bad order for the CPU. Page 19 of 26

This points out that the particular kernel you want to use will depend on the hardware. Now you might think you could just change the global dimensions of the GPU kernel to skip the border and then have a kernel that runs great on both architectures. However, while doing this will get you a 2.0x speedup on the CPU, it will give you a huge slowdown on the v1.1 GPUs (lab and my laptop) because the memory accesses will no longer be coalesced. But on the v2.0 GPU (in the cluster) you'd still get great performance because it has more relaxed coalescing rules. So while OpenCL does offer portability, it clearly does not offer performance portability.

Going Faster
How could you make this code faster? Well, the biggest performance issue now is the update kernel. Since there is a lot of data reuse in this kernel we could use the local shared memory to manually cache the input data before we process it. This could potentially speed up the kernel a lot, but would make it far more complicated, and would slow it down on the CPU. (CPUs don't have local memory, so any code that manages it is wasted time.) On newer GPUs this would be almost a waste of effort since they have caches, which do this for us. Other than that, the data clearly indicates that we're still spending a good deal of time on overhead. To minimize this we need to process larger problem sizes (if they fit on the GPU) and run them for longer (e.g., more iterations). If our problem is larger and takes longer to run, the percentage of time spent on overhead will decrease and we'll see a corresponding speedup.

An Aside: How slow is Java?
I re-wrote the simple C version in Java, keeping the code as similar as possible to the C. This was far easier than writing in an ancient, primitive language like C or Fortran. Now most people will tell you that Java is about half as fast as C code, but in this case it was actually 3.5x faster, thereby beating the GPU implementation. Why?
I honestly don't know, and this is a real problem for performance optimization. Java has the potential to do better compilation due to the runtime nature of its JIT (e.g., at runtime it can determine which parts of the code to optimize based on how they are being used). So perhaps the Java runtime is unrolling the loop, or maybe automatically vectorizing? But on the other hand it has the overhead of array bounds checking and memory management. So who knows what is going on. However, on top of good performance, Java has very efficient libraries for parallelizing code, which would enable the code to run even faster with a bit more effort. (It would take far less effort to use Java's parallel libraries than to use OpenCL. But there are of course OpenCL interfaces for Java as well.) Regardless, this is a clear indication that you shouldn't discredit Java's ability to run fast. Indeed, combined with the fantastic productivity improvements for the developer, Java is an excellent choice for a large range of applications.

[Chart: performance of the optimized OpenCL implementations (C-CPU, CL-CPU, CL-GPU) vs. the Java version.] Comparing the performance of the optimized OpenCL implementations vs. Java. Surprised? Java is really quite good these days. Page 20 of 26

An Aside to the Aside: Scala
If you're looking for a productive way to write programs I highly suggest you take a look at the Scala language. Scala is an industrial-quality, well-supported language that adds most of the performance and productivity enhancements of functional languages to a very elegant object-oriented framework. And since it is based on the Java virtual machine, it has similar performance. In addition, you can access all Java libraries from within Scala, and the support for parallel collections, operations, and tasks is excellent. However, the coolest feature of Scala is the ability to embed domain-specific languages within Scala. This means you can easily define new language constructs that handle specific operations for your domain (for example, a new for-loop that knows how to walk over a graph or mesh) in a way that is far more productive for the developer than libraries. The downside of Scala is that it suffers from the same performance issues/uncertainties as Java. The bottom line is that unless you will spend two (three? four?) orders of magnitude more time running the code than developing it, you should use a modern high-level language. CPU cycles are cheap; developer time is expensive.

Compare to a fast GPU
Now it's time to re-do all of these experiments on a really fast GPU to see how they compare. To do so, simply submit a job to the GPU cluster that executes the command make run > cluster.out. This will run all the experiments on the cluster GPU. Once you've done this, make a copy of the tutorial worksheet and fill it in with the cluster GPU data. Before you submit your job, make sure everything works. Type make run > lab.out and verify that the resulting file has all the data needed to fill in the worksheet. To submit your job on the cluster you need to:
1. Login to zorn: ssh username@zorn.pdc.kth.se
2. Change to your working directory: cd /cfs/zorn/nobackup/u/username (replace u/username with your username, of course)
3.
Copy your code to this directory: cp -r ~/yourcode ./
4. Go into that directory: cd yourcode
5. Compile the code: make clean; make all
6. Create a job script: nano job.sh
a. Put in it:
#!/bin/bash
cd /cfs/zorn/nobackup/u/username/yourcode
make run
7. Make your job script executable: chmod +x job.sh
8. Submit your job: qsub -l walltime=10:00 ./job.sh
9. You can check the status with: qstat
10. When your job is done, your output will be in job.sh.e and job.sh.o in the same directory.
While you're waiting for your data you can start answering the questions on the last page of the tutorial. Note: if this doesn't work we can take turns sshing into zorn, and from there manually running the script with make run. If we do this we need to make sure only one person uses the machine at a time. Don't turn the page until you've finished filling in the worksheet. Page 21 of 26

Comparing to the Cluster GPU
The first thing you should notice is the difference in CPU speed. The simple C program runs more than twice as fast in absolute time on the cluster machine than on the lab machine, despite the fact that the cluster machines have a slower clock frequency. This is because they are newer machines and have much higher memory bandwidth. The next thing you'll notice is the huge amount of overhead for the cluster: about 8 seconds. This time accounts for getting the device and creating the context and queue. This is really shocking. But if you watch the program while it is running you'll see it's really there. This is a bug in something, but it's not clear what. Since we don't see it on the lab machines it could be in the drivers on the cluster machines or even possibly the filesystem. The same thing happens for both the CPU and the GPU. You would have to run a program for a lot longer than we are to see a good speedup with this kind of overhead.

Next take a look at the answers. The CPUs for both the lab and the cluster generate identical results, but the GPUs do not. They are both very close, but different. What is going on here? Well, OpenCL guarantees a certain amount of precision per operation, but when you do lots of operations your precision degrades over time. Further, with any floating-point calculation, the accuracy you get depends on the order of execution. So if the GPU is executing mathematical operations in a different order (which is hardly surprising given that it's a completely different architecture) you may get different results. Now if we take a look at graph 3, we see that the update kernel is taking virtually no time, and the execution is dominated first by the initialization and then by the compilation.
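The order-dependence of floating-point arithmetic mentioned above is easy to demonstrate in a couple of lines of C: summing the same three values in two different groupings gives two different answers, because the small term is lost when it is first absorbed into the large one.

```c
#include <assert.h>

/* float has a 24-bit significand, so near 1e8 the spacing between
   representable values is 8; adding 1.0f to 1e8f therefore rounds
   back to 1e8f, and the two groupings below disagree. */
static float sum_small_first(void) { return (1.0f + 1e8f) - 1e8f; }
static float sum_large_first(void) { return 1.0f + (1e8f - 1e8f); }
```

A reduction that changes the order of its additions (as a GPU inevitably does) can shift the result in exactly this way.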
Because the ratio of CPU to GPU performance is different here, we see different ratios than we did before.

[Charts: time breakdown for the cluster GPU (left) and the lab GPU (right).] Cluster GPU (left) with its ridiculous overhead and the lab GPU (right) with a much lower overhead, but a much higher percentage of time spent on the update calculation. The higher percentage is due to the slower GPU relative to the CPU.

It's worth comparing the absolute performance (wall clock time) of the cluster and lab machines. How big a difference is there? If we look at the local dimensions experiment, we can get some understanding of where the overhead is coming from. (See below.) The overhead is huge the first time we initialize OpenCL, but on subsequent runs it goes down from nearly 8 seconds to only 1 second. Page 22 of 26

[Charts: time spent in each phase for different local dimensions on the cluster (left) and the lab (right).] Note that the overhead for the cluster is about 7x greater for the first run. This suggests that most of the overhead is due to initialization of the runtime on a per-application basis (probably when the library is loaded). Further, the overhead seems so much larger on the cluster GPU because everything else is far faster; in absolute terms the overheads are approximately the same. Note that on the lab machines NVIDIA's OpenCL failed to run a number of the local sizes for me. This says something about their commitment to making OpenCL work.

However, because the GPU is so fast, this 1 second of overhead (which is actually comparable to what we've seen on the other machines) is a much larger percentage of the execution time. This is an example of Amdahl's law, which states that the relative impact of the serial part of your code (in this case the initialization) increases as you get more and more parallel hardware to execute the parallel part. For this problem, then, we would need to run for more than 80 seconds to get a 10x speedup with such a large overhead. This is probably the most important lesson of this whole tutorial, by the way.
[Charts: speedup for the really fancy cluster GPU. Left: wall clock time including the absurdly high 8 seconds of setup overhead. Right: speedup for just the computation.] If we ignore the overhead (which really isn't fair unless your program runs for a very long time) we can claim nearly a 34x speedup for the GPU and 7.1x (best) for the CPU.

Discussion
The end result is that on a very high-end machine you sped up your computation vs. straight, single-threaded C code by 10%. Whoopdeedoo. What's going on? Page 23 of 26

[Figure: stacked bar chart of where the total time goes (overhead, reduction, cleanup, finish, write data, create buffers, compile program, read data, range, update) for the C/CPU, CL/CPU, and CL/GPU versions of the coalesced-range step.] Final performance results for the cluster GPU. Note how little of that total time is actually spent doing something useful.

Well, it's that setup overhead that is killing you. It's taking 90% of the execution time. Now if we run 10x longer the overhead goes down to 53%. If we run 100x longer it's 10%. If we run 1000x longer (only 11 minutes on the GPU, but 2.3 hours on the single CPU) then our overhead gets down to 1%. Now think about how you would report this speedup in an academic paper.

If you look at the CPU results, the best CPU speedup is 7.1x before we did the range coalescing. It went down to 5.4x with that in place. This makes it clear that you need to specialize your kernels for your device. This is what I meant by OpenCL not being performance portable.

Remember the data on how fast GPUs are relative to CPUs from the lecture? Well, if you want to put GPUs in the best light here you would say they are 34x faster than the CPU by comparing to single-threaded code. To put CPUs in the best light you would say they are virtually the same by comparing wall-clock time, or you could say the GPU was only 4.8x faster by comparing the best OpenCL CPU and GPU versions. You could actually do better by hand-coding the CPU in OpenMP or pthreads and getting nearly 8x speedup, which would make the GPU only 4.3x better. What would you report in an academic paper?

Concluding Questions (Short Answer)

1. How would you write this program to use the CPU and GPU together?
   a. How would you divide up the work?
   b. How would you minimize data movement?
   c. How would you do synchronization?
   d. How would you decide how much work each device does?
   e. What if you had 4 GPUs and a multicore CPU? (E.g., the cluster machines.)
2. How would you operate on large data sets?
   a. The lab machines only have 512MB of memory, which means you can have at maximum an input size of 8192x8192. The GPU has 3GB of memory, but the largest you can do is 12000x12000.
   b. How would you design your algorithm to process data that was 1M x 1M?
   c. How would you do it on multiple GPUs?

Feedback

Please answer the following questions and either hand them in to me on paper or email them to me so I can improve the lab session.

1. Overall, this lab session was:
   a. Much too hard
   b. A bit too hard
   c. Just right
   d. A bit too easy
   e. Much too easy
2. Overall, the amount of work in this lab session was:
   a. Much too much
   b. A bit too much
   c. Just right
   d. A bit too little
   e. Not nearly enough
3. The tutorial format of this document, the worksheet, and the code samples was:
   a. Really bad
   b. Annoying
   c. Ok
   d. Helpful
   e. Excellent
4. How much do you feel you understand about the issues involved in using OpenCL and GPUs after this tutorial?
   a. None. I'm more confused than before.
   b. A little bit, but I'm still mostly confused
   c. A good deal, but I couldn't start from scratch yet
   d. I'm ready to work on my own project today.
5. How well did this tutorial work together with the lectures?
   a. Not at all
   b. Poorly
   c. Ok
   d. Pretty well, but could have been better
   e. Excellent
6. Overall, how good was this tutorial?
   a. Terrible
   b. Poor
   c. Okay
   d. Good
   e. Excellent
7. Any other feedback?

/* Error Codes */
#define CL_SUCCESS 0
#define CL_DEVICE_NOT_FOUND -1
#define CL_DEVICE_NOT_AVAILABLE -2
#define CL_COMPILER_NOT_AVAILABLE -3
#define CL_MEM_OBJECT_ALLOCATION_FAILURE -4
#define CL_OUT_OF_RESOURCES -5
#define CL_OUT_OF_HOST_MEMORY -6
#define CL_PROFILING_INFO_NOT_AVAILABLE -7
#define CL_MEM_COPY_OVERLAP -8
#define CL_IMAGE_FORMAT_MISMATCH -9
#define CL_IMAGE_FORMAT_NOT_SUPPORTED -10
#define CL_BUILD_PROGRAM_FAILURE -11
#define CL_MAP_FAILURE -12
#define CL_MISALIGNED_SUB_BUFFER_OFFSET -13
#define CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST -14
#define CL_INVALID_VALUE -30
#define CL_INVALID_DEVICE_TYPE -31
#define CL_INVALID_PLATFORM -32
#define CL_INVALID_DEVICE -33
#define CL_INVALID_CONTEXT -34
#define CL_INVALID_QUEUE_PROPERTIES -35
#define CL_INVALID_COMMAND_QUEUE -36
#define CL_INVALID_HOST_PTR -37
#define CL_INVALID_MEM_OBJECT -38
#define CL_INVALID_IMAGE_FORMAT_DESCRIPTOR -39
#define CL_INVALID_IMAGE_SIZE -40
#define CL_INVALID_SAMPLER -41
#define CL_INVALID_BINARY -42
#define CL_INVALID_BUILD_OPTIONS -43
#define CL_INVALID_PROGRAM -44
#define CL_INVALID_PROGRAM_EXECUTABLE -45
#define CL_INVALID_KERNEL_NAME -46
#define CL_INVALID_KERNEL_DEFINITION -47
#define CL_INVALID_KERNEL -48
#define CL_INVALID_ARG_INDEX -49
#define CL_INVALID_ARG_VALUE -50
#define CL_INVALID_ARG_SIZE -51
#define CL_INVALID_KERNEL_ARGS -52
#define CL_INVALID_WORK_DIMENSION -53
#define CL_INVALID_WORK_GROUP_SIZE -54
#define CL_INVALID_WORK_ITEM_SIZE -55
#define CL_INVALID_GLOBAL_OFFSET -56
#define CL_INVALID_EVENT_WAIT_LIST -57
#define CL_INVALID_EVENT -58
#define CL_INVALID_OPERATION -59
#define CL_INVALID_GL_OBJECT -60
#define CL_INVALID_BUFFER_SIZE -61
#define CL_INVALID_MIP_LEVEL -62
#define CL_INVALID_GLOBAL_WORK_SIZE -63
#define CL_INVALID_PROPERTY -64