PDC Summer School Introduction to High- Performance Computing: OpenCL Lab

Transcription

1 PDC Summer School Introduction to High- Performance Computing: OpenCL Lab Instructor: David Black-Schaffer Introduction This lab assignment is designed to give you experience leveraging OpenCL to convert a standard C program for parallel execution on multiple CPU and GPU cores. You will start by profiling a model PDE solver. Using those performance results you will move portions of the computation over to an OpenCL GPU device to accelerate the computation. At each step you will analyze the resulting performance and answer a series of questions to provoke you to think about what is going on. Learning Objectives 1. Write and understand OpenCL code 2. Understand the OpenCL compute and memory models 3. Profile and analyze application performance 4. Understand the impact of local work group size on performance 5. Write your own OpenCL kernel 6. Optimize a simple kernel for coalesced memory accesses Assumed Background 1. Basic knowledge of C programming, compilation, and debugging on linux 2. Basic understanding of GPU architectures (from the lecture earlier today) 3. Some exposure to program performance analysis If you are concerned about your level of background please make sure you choose a partner whose skills complement your own. Materials Lecture Notes: GPU Architectures for Non-Graphics People Lecture Notes: Introduction to OpenCL Source code starter files provided on the summer school website The OpenCL Specification v. 1.0 ( OpenCL Quick Reference Card ( Nvidia CUDA C Programming Guide v. 3.2 ( Guide.pdf) The last three pieces of documentation can be found by simply googling their titles. You may need them for the later parts of the lab. Hardware The lab machines are equipped with a GeForce 8400GS graphics card with 512MB of memory. These are Nvidia Compute version 1.1 cards with 1 streaming multiprocessor with 8 hardware cores processing 32 threads in a group (warp) running at up to 1.3GHz. The CPUs on the lab machines are Intel Core2Quad Q9550 (Yorkfield) at 2.83GHz. These are 2 dual-core dies in one package sharing a northbridge with 6MB of cache on each die. The cluster machines are equipped with four Tesla C2050 graphics cards with 3GB of memory. These are Nvidia Compute version 2.0 cards with 14 streaming multiprocessors with 32 hardware cores each processing 32 threads in a group (warp) running at up to 1.1GHz. The CPUs on the cluster machines are Page 1 of 26

2 dual Intel Xeon E5620s (Westmere-EP) at 2.4GHz. These are 4-core (8-thread) processors with 12MB of L3 cache each connected by QPI. The data used in this tutorial document is from an Apple MacBook Pro with a GeForce 9400M with 256MB of memory. It is an Nvidia Compute 1.1 device with 2 streaming multiprocessors with 16 cores each processing 32 threads in a group (warp) running at up to 1.1GHz. The CPU is an Intel Core2Duo P8800 at 2.66GHz with 3MB of L2 cache. A: Getting Started To get started you will run the OpenCL Hello World code that was discussed in the lecture. 1. Download the zip file of the code from the PDC website. 2. Unzip it in your home directory on one of the lab machines. 3. Type make hello 4. Run./hello_world.run If all goes well you should get one thousand sine calculations printed out to the terminal. To make sure you understand what is going on, go to the code and make and verify the following changes: 1. Change the code to run on the CPU or GPU (make sure it works on both). Note: if your code crashes when you change it to run on the CPU, you probably need to do some error checking. The hello_world program has none. Try looking at the error codes returned by clgetplatformids, clgetdeviceids, and clcreatecommandqueue. (You can find out how to get the error codes back by looking at the OpenCL documentation link.) What s going on? (You can look up the error IDs in the cl.h file which can be found by online or at the end of this document.) 2. Change the kernel to calculate the cosine of the numbers instead of the sine. 3. Change the size of the array processed to be 1024 times larger, remove the printout of the results, and time (using the time command on the command line) the difference in speed executing on the CPU vs. GPU. 4. Think about the results. Can you explain why one is faster than the other? 5. Now change the program to calculate the cosine without using OpenCL and compare the performance. What do you notice? B: Introduction to the PDE Solver Program Overview This tutorial uses a very simple PDE-solver-like program as a demonstration. The program is provided in a standard C version in the file main_c_1.c The program works as follows: int main (int argc, const char * argv[]) { float range = BIG_RANGE; float *in, *out; // ======== Initialize create_data(&in, &out); // ======== Compute while (range > LIMIT) { // Calculation update(in, out); // Compute Range range = find_range(out, SIZE*SIZE); swap(&in, &out); } Page 2 of 26

3 } Basic program flow. (Profiling measurements have been removed for simplicity.) The program creates two 4096x4096 arrays (in and out) and then performs an update by processing the in array to generate new values for the out array. The out array is then processed to determine the range (maximum value minus minimum value). If the range is too large, the program swaps the arrays (in becomes out and out becomes in) and repeats until the range converges below the limit. The update calculation takes in the four neighbors (a, b, c, d) along with the central point (e) to calculate the next central value as a weighted sum. This is a basic 5-point stencil operation, which can be visualized as:!"# $%&#! " # $ %! " # $ %! " # $ % The update iterates through the whole in matrix looking at the neighbors to calculate the new value, which is written into the out matrix. For the next update the in and out matrices are swapped, thereby overwriting the old values with the new ones. void update(float *in, float *out) { for (int y=1; y<size-1; y++) { for (int x=1; x<size-1; x++) { float a = in[size*(y-1)+(x)]; float b = in[size*(y)+(x-1)]; float c = in[size*(y+1)+(x)]; float d = in[size*(y)+(x+1)]; float e = in[size*y+x]; out[size*y+x] = (0.1*a+0.2*b+0.2*c+0.1*d+0.4*e); } } } The update calculation is simply a 5-point stencil. After the update, the program runs a range calculation on the output matrix that simply finds the minimum and maximum values and returns the difference between them as the range. float find_range(float *data, int size) { float max, min; max = min = 0.0f; // Iterate over the data and find the min/max for (int i=0; i<size; i++) { if (data[i] < min) min = data[i]; else if (data[i] > max) max = data[i]; } // Report the range return (max-min); } Finding the range simply iterates over all the data and keeps track of the minimum and maximum values, and then returns the difference. (This function is in util.c, as it is shared by all Page 3 of 26

4 the sample programs.) The program repeats the loop of update/range until the range is below the LIMIT specified in the parameters.h file. Don t change anything in the parameters.h file as it is used by all programs. Running the Program 1. Type make 1 to build the C-version of the program. 2. Run the program by typing./1_c.run When you run the program you will see the output converge after approximately 42 iterations. The program will print out the range at convergence. You should use this value to check that the optimized versions of the code later on in the lab are producing the same results. In addition to the range results, the program will print out profiling information. From this you can see how long the program took in total (Total) and how long the update (Update) and range (Range) calculations took. It also reports the standard deviation, which is a good sanity check as to how reliable your performance measurements are. 1. C Allocating data 2x (4096x4096 float, 64.0MB). Range starts as: Iteration 1, range= Iteration 42, range= Total: Total: ms (Avg: ms, Std. dev.: ms, 1 samples) Update: Total: ms (Avg: ms, Std. dev.: ms, 42 samples) [ignoring first/last samples] Range: Total: ms (Avg: ms, Std. dev.: ms, 42 samples) [ignoring first/last samples] Output of the C program showing the final range and the profiling information. Note the large standard deviation on the Update measurement (13%) indicating that something else may have been happening on the computer at the same time. Performance Measurements Take a look at the code in the main_c_1.c file to see how the application is implemented. The only significant differences from the code listed above are for the performance measurements. As you can see performance measurements are made using perf variables. These must be initialized first, which is done in init_all_perfs(). The perfs are timers which are started by calling start_perf_measurement() and stopped with stop_perf_measurement(). Each time a start-stop pair is called the perf records the elapsed time. At the end the results are printed out with the print_perfs() call. Now take a look at the performance numbers you got and decide which part of your code should be optimized with OpenCL first. C. Accelerating (Decelerating?) with OpenCL Now that you understand the basic algorithm and have seen the source code it is time to accelerate your program by using OpenCL. As you can see from the performance data for the C code, the best place to start is by replacing the update() function with an OpenCL one. Ready? If the lectures were any good this shouldn t take more than a few minutes. You can use the hello world file as a starting point. I m just kidding. Starting from nothing is a major pain. So I ve done this for you in the main_opencl_1.c file, which is summarized below: int main (int argc, const char * argv[]) { float range = BIG_RANGE; float *in, *out; // ======== Initialize init_all_perfs(); Page 4 of 26

5 create_data(&in, &out); // ======== Setup OpenCL setup_cl(argc, argv, &opencl_device, &opencl_context, &opencl_queue); // ======== Compute while (range > LIMIT) { // Calculation update_cl(in, out); // Compute Range range = find_range(out, SIZE*SIZE); iterations++; swap(&in, &out); printf("iteration %d, range=%f.\n", iterations, range); } } main_opencl_1.c Note that the only two changes to the high-level algorithm are to setup OpenCL at the beginning (setup_cl) and to call update_cl() instead of update. We re going to skip the contents of setup_cl() for now. You can take a look at it in opencl_utils.c if you re interested, but all it does is look for the right type of device based on whether you passed in CPU or GPU on the command line and creates a context and queue for you. Instead, take a look at update_cl(). This is where the work is being done. void update_cl(float *in, float *out) { cl_int error; // Load the program source char* program_text = load_source_file("kernel.cl"); // Create the program cl_program program; program = clcreateprogramwithsource(opencl_context, 1, (const char**)&program_text, NULL, &error); // Compile the program and check for errors error = clbuildprogram(program, 1, &opencl_device, NULL, NULL, NULL); // Create the computation kernel cl_kernel kernel = clcreatekernel(program, "update", &error); // Create the data objects cl_mem in_buffer, out_buffer; in_buffer = clcreatebuffer(opencl_context, CL_MEM_READ_ONLY, SIZE_BYTES, NULL, &error); out_buffer = clcreatebuffer(opencl_context, CL_MEM_WRITE_ONLY, SIZE_BYTES, NULL, &error); // Copy data to the device error = clenqueuewritebuffer(opencl_queue, in_buffer, CL_FALSE, 0, SIZE_BYTES, in, 0, NULL, NULL); error = clenqueuewritebuffer(opencl_queue, out_buffer, CL_FALSE, 0, SIZE_BYTES, out, 0, NULL, NULL); // Set the kernel arguments error = clsetkernelarg(kernel, 0, sizeof(in_buffer), &in_buffer); error = clsetkernelarg(kernel, 1, sizeof(out_buffer), &out_buffer); // Enqueue the kernel size_t global_dimensions[] = {SIZE,SIZE,0}; error = clenqueuendrangekernel(opencl_queue, kernel, 2, NULL, global_dimensions, NULL, 0, NULL, NULL); // Enqueue a read to get the data back error = clenqueuereadbuffer(opencl_queue, out_buffer, CL_FALSE, 0, SIZE_BYTES, out, 0, NULL, NULL); // Wait for it to finish error = clfinish(opencl_queue); // Cleanup clreleasememobject(out_buffer); clreleasememobject(in_buffer); clreleasekernel(kernel); clreleaseprogram(program); free(program_text); Page 5 of 26

6 } update_cl() with all the error-checking code removed to make it easier to read. As you can see, this code is shockingly similar to the code in the Hello World OpenCL program. All it does is: 1. Create a program (loaded from the text file kernel.cl ) 2. Build it 3. Create a kernel ( update ) 4. Create two memory objects ( in_buffer and out_buffer ) 5. Enqueue a write to write the data to the memory objects (why would we copy data to the out buffer?) 6. Set the kernel arguments 7. Enqueue the kernel execution 8. Enqueue a read to get the data back 9. Wait for the execution to finish 10. Cleanup all the resources we allocated There s really nothing more to it than that. You can open the kernel.cl file and see the update kernel. It is very similar to the update() C code, except that it has to find out which update it should do (get_global_id()) and needs to avoid processing if it is on the edge. (The C version uses loops from 1 to SIZE-1 to avoid the edges.) kernel void update(global float *in, global float *out) { int WIDTH = get_global_size(0); int HEIGHT = get_global_size(1); // Don't do anything if we are on the edge. if (get_global_id(0) == 0 get_global_id(1) == 0) return; if (get_global_id(0) == (WIDTH-1) get_global_id(1) == (HEIGHT-1)) return; int y = get_global_id(1); int x = get_global_id(0); // Load the data float a = in[width*(y-1)+(x)]; float b = in[width*(y)+(x-1)]; float c = in[width*(y+1)+(x)]; float d = in[width*(y)+(x+1)]; float e = in[width*y+x]; // Do the computation and write back the results out[width*y+x] = (0.1*a+0.2*b+0.2*c+0.1*d+0.4*e); } The OpenCL kernel update(). Running the OpenCL version 1. Type make 1 (as before) 2. To run the CPU version type./1_opencl.run CPU 3. To run the GPU version type./1_opencl.run GPU So what did you get? How does the multicore OpenCL CPU version and the OpenCL GPU version compare to the standard C version for speed? D. Tutorial Worksheet Well, it s kind of hard to make sense of all those performance numbers, so now is the time to open the Tutorial Worksheet file and start filling in some numbers. Indeed, for the rest of the tutorial you ll be using this to track and analyze your performance. Once you ve finished implementing the optimizations you ll then submit your code to the cluster. This will give you a second set of performance numbers Page 6 of 26

7 which you will then compare to the ones you get on the lab machine to see the impact of different GPUs on performance and optimizations. Don t Cheat For you to learn the most you should read the appropriate part of the text here, write and debug the code, fill in the worksheet, and answer the questions in the worksheet, in that order. So when the tutorial says don t turn the page until you ve finished filling in the worksheet please don t. If you cheat and read ahead you will bias yourself and will learn a lot less. If you get stuck ask for help before you read ahead. Benchmarking is hard You ll be getting timing information running on the lab machines. If you re doing something else at the same time (such as web browsing, filling in a worksheet, etc.) it will impact your measurements. You can see from the standard deviation in the measurements in the output to evaluate how much you can trust your measurements. But it would be best to fill in the worksheet on your own laptop using Excel, rather than open office on the lab machines. Get started Go ahead and fill in the numbers from your first three runs in the tutorial worksheet. They should go under the 1. Baseline section. Be sure to fill in the Final Range Values to make sure that you re actually computing the same thing on all versions of the code. (It s very easy to get amazing speedups by computing nothing without realizing it.) Now answer the questions for part 1 of the worksheet. Don t turn the page until you ve finished filling in the worksheet. Page 7 of 26

8 1. Baseline Note: the graphs and data shown here are for running on my MacBook Pro. Depending on your hardware and operating system you may see different results. So you re going to have to think for yourself to figure this stuff out. The performance discussion below is meant as a guideline.!"#$%&'()#*+,-./01!")*+$)%&%,"-$. /"012%34"567859"(($) :-"14. 2"012;<4"567=59"(($)>1*;??;=$!")*+$)%@%,"-$. :-"14. A!B%CDE$. K!B%CDE$. 2"#304&1/'& A$F')5$%GHIIJ K@/?'%!LLII%>%@;MMANO (5(67 ()5(67 ()5867 *-,01 &PLQ& 7%90,& &&RLL :0';& &MIR &RGR &MPI </'01#:0';&#=01+& GG;L@LM&P GG;L@LM&P GG;L@LML@ $>&.?&09 MQL LMH &;L% &;M% &;H% 2"#304&1/'&# &;@% &;I% I;L% U0$)9$"2% ^"+#$% BE2"*$% I;M% I;H% I;@% I;I% K7K!B% KV7K!B% 4"#5'*&#%0..2-0#),-62#+,-#./0.1&#7,8#&'.#9:#9;<#=%#&'.#9#9;<3#>'(?&@#',)#A*?+#1,8.%#*8.#+,-#-%(?B3C JD%K!B%9"=%@%5')$=%='%T%W'?42%9"0$%$XE$5*$2%*'%)?+%*W15$%"=%("=*;%YZ*%4$"=*%(')%*9$%?E2"*$%E")*;[ 9"#D(2#E0.?9:#B(=.#&'.#%*A.#>?-A.8(1C#*?%).8#*%#&'.#9#=.8%(,?3#5'+#,8#)'+#?,&3,'\%<?*%*9$D])$%)$"44D%54'=$;%C9$%K!B%1=%*9$%="-$%<?*%*9$%A!B%1=%21(($)$+*;%C91=%1=%E)'<"<4D%2?$%*'%21(($)$+*%E)$51=1'+%"+2%')2$)%'(%'E$)"*1'+=; The tutorial worksheet uses light blue to indicate places you need to fill in data, formulas, or analysis. The performance graphs are all normalized to the C-CPU speed (on the left) so bigger bars are worse. From this data the GPU version is 1.8 times slower than the C version, and the OpenCL C version is 25% slower. Not so good. Remember the questions are there to help you learn, so try to think about them rather than trying to just put down something. From the graph in the worksheet we can clearly see that we re spending a huge amount of time calculating the update. However, the OpenCL code for update is pretty complicated so this level of detail is not very helpful. What we need to do is get more detailed performance data. Open the main_opencl_2_profile.c file. This file is similar to the file from part 1 above except that it defines a bunch more performance counters. These are: Program_perf for loading and compiling the program Create_perf for creating the buffers Write_perf for writing to the buffers Read_perf for reading from the buffers Finish_perf for timing the finish Cleanup_perf for timing the cleanup Page 8 of 26

9 Your job is to now go into the update_cl() method and insert appropriate perf measurement calls to measure the performance of these parts of the program. Go ahead and do this and then run the program with make 2 and./2_profile.run CPU and./2_profile.run GPU. Fill in the data in section two of the worksheet and answer the questions. Note: for each section of this tutorial you use make N and./n_ GPU and./n_ CPU to run the code. It is important that you use the file names mentioned here and listed in the Makefile so that you will be able to submit everything to the cluster in the end. Don t turn the page until you ve finished filling in the worksheet. Page 9 of 26

10 !"#$%&%'(%& 2. Baseline with Profiling So what did you find? Is it what you expected? Here s what I got:!"#$%&'()*'#+),-#./01)()* ,%( &)*+& &*,-../,,0 59:%,'#;'/*'( &&0**, -*,+ <%*2'#30=9>,' &1,0 &1,* &0*- <'%:#?%,%,. 30=9)('#./02/%=.& -+- 3/'%,'#$>11'/&, +/ A)*)&- &/*.. /1.. 3('%*>9 +,& 01++ A)*%(#<%*2'#B%(>' --2*.*1&) --2*.*1&) --2*.*1*. CD'/-'%: 1+* -/, &./).% &2*% &21% &2/% &2.% &%,2*%!"#$%&()*'#+),-#./0E()*2# 89$:;$"<% 3=$">?@% AB>BC;% D:BE$%F"E"% 3:$"E$%G?H$:C% 3'I@B=$%!:'#:"I%,21%,2/% J$"<%F"E"% J">#$% 5@<"E$%,2.%,% 343!5% 3643!5% 3643!5% 3647!5% 3647!5% I ve removed the answers to the questions to make it less tempting to look ahead and cheat. But there s a discussion of them in the following text. A few things jump out at me immediately. The first is that I m spending a huge amount of time in Finish, particularly on the CPU. That s strange since Finish doesn t actually do any work other than call clfinish(). Of course if you answered question D (which was discussed in the lectures) you ll remember that OpenCL submits work asynchronously. That is, the work isn t done when you call clenqueue() it s just scheduled. So when you call clfinish(), the runtime will stop your program until all the work is done. Therefore, the timers you put around clenqueuendrange() and clenqueuereadbuffer, etc., don t record much (if any) time because they are just enqueueing the kernel. Another strange thing is that I see almost not time for compiling the kernel, despite the fact that I m creating and calling clbuildprogram() 40 times in this program. Your results are probably a bit different if you re not running on a Mac. The reason is that Mac OS X caches programs when you compile them the first time, so each subsequent compilation of the same source code is really fast. If they weren t cached the program would spend a huge amount of time re-compiling. Now make a copy of your source file and call it main_opencl_3_profile_finish.c and add in clfinish() as needed for the performance counters you added before. This will make sure that the host waits for the OpenCL commands to finish before continuing with your program. By doing this you will synchronize with the OpenCL device, and your performance measurements on the host side will be accurate. After you have done this, run your program, fill in part 3 of the worksheet, and answer the questions. Note: OpenCL provides the ability to request events from commands and use those events to get information on when they were submitted to OpenCL, when they started, and when they finished. This is another way (indeed the preferred way) to get this information, but it is tricker. Don t turn the page until you ve finished filling in the worksheet. Page 10 of 26

11 3. Baseline with Profiling (and clfinish) Now the results should make a lot more sense. Finish is now a tiny part of the overall time (nearly 0) and we can see where we are spending time.!"#$%&'()*'#+),-#./01)()*2#3%*4#5(6)*)& : 8;98.: 8;9<.: =0,%( &)*+& &+*,* :>4%,'#?'/*'( &&/** 0)0) &01/ &/,, &100 &.11 80A>)('#./02/%A -&,,& 8/'%,'#$B11'/& 1 &- D/),'#C%,% +-*, 0+., 6)*)&- 1. 8('%*B> 0,- /+1-6)*%(#@%*2'#E%(B',,2*-*0&),,2*-*0&),,2*-*0*- FG'/-'%4 0+* *0* &-0,!"#$$%#&%'()%"*+,-$./%0)$*1%'*%2.2$-.')%'()/)%3.$-)/4 B>4%,'#&>''4B>#G&#898.: &2*- &20/ H#,)A'#0*#4%,%#A0G'A'*,.+3 ))3 H#,)A'#0*#:>4%,' )03 -*3 -% &2*% &20% &2.% &2-% &% 12*% 120% 12.% 12-% 1%!"#$%&()*'#+),-#./0I()*2#3%*4#5(6)*)&-7# 9:$;<$"=% 4>$"?@A% BC?CD<% E;CF$%G"F"% 4;$"F$%H@I$;D% 4'JAC>$%!;'#;"J% K$"=%G"F"% K"?#$% 6A="F$% 454!6% 4754!6% 4754!6% 4758!6% 4758!6% With clfinish() forcing the runtime to complete each operation we can now accurately time what is going on. Clearly we are spending a lot of time moving data (write and read) and a good deal of time in cleanup. (Your results are likely to be different depending on your GPU and system!) For my system, I m wasting a lot of time in cleanup and moving data (write and read). The update calculation is actually pretty fast and compile program is not too bad. Now note that the ratios of these times will depend on the ratio of speeds of the CPU and GPU. If you have a much slower GPU (like the lab machines; roughly half as fast) and a faster CPU (like the lab machines; roughly 50% faster) then you may see that the update time is a lot larger on the GPU. Now with this data we can start to see some results. The speedup for the update calculation on the OpenCL CPU is 1.82 vs. the C version. This is decent (although not great) considering that we have 2 CPU cores running in OpenCL. The reason it s not 2.0 is that the kernel code has a lot of extra overhead to deal with the edge cases. The C code deals with them once in the loops and is therefore more efficient. If we look at where our time is being spent, we see that we re spending only 1/3 of our time doing update and nearly 50% of our time doing data movement. (You may see different numbers depending on things like how much of your time you re spending doing compilation.) This kind of information lets us know what optimization we need to do: we need to get the compilation, cleanup, and writing data out of the main loop since these things only need to be done once. Then we can keep the data on the OpenCL device and just read back the data for the range calculation. To do this open the file main_opencl_4_profile_nooverhead.c and fill in the missing parts. You can get most of the code you ll need from your current file. All this file does is add two calls at the beginning (setup_cl_compute() and copy_data_to_device()) and then change the main loop to call update_cl() with the appropriate cl_mem buffer objects instead of the data pointers, and then call read_back_data() to get the data back from the range calculation. At the end it calls cleanup_cl() to cleanup. The program uses the iterations count to determine which buffer to read from and write to (see the get_in_buffer() and get_out_buffer() functions). This way the data can be kept on the device and we just change the kernel arguments to swap buffers. Page 11 of 26

12 // ======== Setup the computation setup_cl_compute(); start_perf_measurement(&write_perf); copy_data_to_device(in, out); stop_perf_measurement(&write_perf); // ======== Compute while (range > LIMIT) { // Calculation start_perf_measurement(&update_perf); update_cl(get_in_buffer(), get_out_buffer()); stop_perf_measurement(&update_perf); // Read back the data start_perf_measurement(&read_perf); read_back_data(get_out_buffer(), out); stop_perf_measurement(&read_perf); // Compute Range start_perf_measurement(&range_perf); range = find_range(out, SIZE*SIZE); stop_perf_measurement(&range_perf); iterations++; printf("iteration %d, range=%f.\n", iterations, range); } // ======== Finish and cleanup OpenCL start_perf_measurement(&finish_perf); clfinish(opencl_queue); stop_perf_measurement(&finish_perf); start_perf_measurement(&cleanup_perf); cleanup_cl(); stop_perf_measurement(&cleanup_perf); The new main loop with the OpenCL setup code moved out. The functions you need to define are in yellow. Make the changes to get this program working and then get your performance numbers and fill in the next section of the worksheet. Remember to make sure your Final Range Values match so you can be (reasonably) sure that your code is correct! Don t turn the page until you ve finished filling in the worksheet. Page 12 of 26

13 4. Overhead Outside of Loop Now that we ve removed the obvious overheads (compilation, setup, cleanup, and copying data to the device) from the main loop we can start seeing some performance improvements.!"#$%&'(&)*#$+,-.*&#/0#1// /,)9 &)*+&,-*. +/-) &2/% 62*),&#:&';&9 &&0** -)1+ )1&* <);=&#3/>2+,& &-.0 &0,* &-/+ <&)*#?),) &.-. &)-/ &% 3/>2.9&#5'/=')>. /* A'.,&#?),) &*& /&& B.;.-(.. 39&);+2 &+ &0+.2-% B.;)9#<);=&#C)9+&,,2*/*-&),,2*/*-&),,2*/*-*/ $%&'(&)* -+* %!"#$$%#&%'()%"*+,-$./%0)$*1%'*%2.2$-.')%'()/)%3.$-)/4 +2*),&#-2&&*+2#%-#34356 &2*) )2), D#,.>&#/;#*),)#>/%&>&;, &)3 //3.2/% D#,.>&#/;#62*),& /,)9#-2&&*+2#0'/>#EF &2* )21 D#*),)#>/%&>&;,#0'/>#E &03 &,3.%!"#$%&'(&)*#$+,-.*&#/0#1//2# 9:$;<$"=% BC?CD<% E;CF$%G"F"% 4'JAC>$%!;'#;"J% K$"=%G"F"% K"?#$% 6A="F$% 454!6% 4754!6% 4754!6% 4758!6% 4758!6% With the overhead removed from the loop we have now actually succeeded in accelerating our code using OpenCL. The CPU version now takes 75% as long as the C version (not so good given that we should be at 50% with two CPUs) and the GPU version is faster. The CL-CPU version is now 1.8x faster than the initial version and the CL-GPU version is 3.4x faster. Again, depending on the relative speeds of the GPUs and CPUs you may see different improvements, or even none at all. But the most relevant part is that our data movement is now 13% of our total time down from 47% before. If you were spending a lot of time on compilation you will see a smaller relative decrease. Looking at the change in just data movement time (read plus write time) we see that we are spending only 15% and 19% of the time, respectively, doing data movement on the CPU and GPU as we were before. This is a > 5x reduction in data movement time! So why is the data movement reduced? Well, simply because we re not constantly copying the data toand-from the device. We are copying it back each time we do range, but that s a lot less movement than before. Looking at the data above, the next optimization step for my machine is to get the range computation onto the OpenCL device so we can eliminate the read data and accelerate range. (Of course it would be great to get the update kernel to run even faster, but we re not going to touch it for now.) But before we do that we re going to take a look at the impact of local dimensions on performance. To do this, open the file main_opencl_5_explore_local.c. This file simply has a new main function that walks through a bunch of local dimensions (stored in locals[][]) and then calls run with them set. You need to copy your code from the last section into here, rename its main function to run and update the code to use the local dimension chosen by the new main loop. Do this and answer the questions in the worksheet. Don t turn the page until you ve finished filling in the worksheet. Page 13 of 26

14 5. Exploring Local Dimensions When we explicitly set the local dimensions we need to choose values that evenly divide the global dimensions. So for this program, with global dimensions of 4096x4096, we can choose just about any power of 2 by power of 2. However, the total size of the local dimensions is dictated by the hardware resources on the device. For the lab and my machine this limits the maximum workgroup size to 512, and for the cluster to While any value of local dimensions that evenly divides the global and whose product is less than that maximum is supported, the performance may vary dramatically. Remember that the hardware computeunits execute multiple threads from their work-groups at the same time. If the size of the work-group is less than the minimum number of threads the compute-unit executes at once, that additional hardware will be wasted. In the case of Nvidia hardware, each streaming multiprocessor (compute-unit) has 8 hardware cores which execute one thread on every other cycle (16 physical threads at once) and the architecture executes threads in groups of 32. This means that if the compute-unit has less than 32 threads at any given time it will be wasting processor cores. (Take a look at the slide in the OpenCL lecture.)!"#$%&'()*+,#-(./'#0*12+3*(+3#(+#452# :%: ;%; <%< =%= :>%:> ;!>%:?(4/' )**+ &&***, -.*-, &,/.) /+-) /.*& /&*& 8&@/42#A2)+2' -*/0 &,++&0./**) 1/.& -/.* -*)/ --10 B/+,2#C(1&D42 &1.* &1-0 &1-) &1-) &1.* &1.0 &1+0 B2/@#0/4/ &-** &-/1 &-** &-1/ &-/) &--& &-.& C(1&*'2#7)(,)/1 &,-*./ -. -* C)2/42#EDFF2)3,,,,,,, G)*42#0/4/..* &1) &/, &11 &1/ &/, &11 H*+*35,,,,,,, C'2/+D& &+) &1/ &+) &1- &/. &+) &+) H*+/'#B/+,2#I/'D2 JK2)52/@ +). ++** &1*, 1-0 **& **, *-0!"#$%&'()*+,"+!-..+/",%&+/'$)+ $(# $'# $&# $%# $$# $!#,# +# *# )# (# '# &# %# $#!# &"#'56+."7%&+8'$)59'"59+ -.//# $0$# %0%# '0'# +0+# $)0$)# %()0$# $!!"#,!"# # +!"# 8936:;<# *!"# =>:>?5# (!"# 8436A3#C;D34?# 8EF<>93#G4EH46F# '!"# I367#B6A6# &!"# I6:H3# %!"#.<76A3#.J9>K6JE:# $!"#!"# Performance differences for different local dimension sizes along with device utilization. Trying out different local dimensions gives us a feeling for how important this is. When we plot them together with utilization we see that we need full utilization to get the highest performance. (Note that our performance may not be limited by utilization, so it may not scale directly. In this case we are limited by non-coalesced accesses at some point.) But on my machine, the performance increases all the way, with the NULL version clearly not giving the best performance. The bottom line is that if you care about the last bit of performance you should not let the runtime try to guess the best local size. Page 14 of 26

15 6. Range in a Kernel Now it is time to put the range calculation into a kernel. The tricky part about calculating the range is that it is a reduction, and, as you remember from the lectures, reductions are limited by synchronization. To make this simple we re going to do the last stage of the reduction on the host. That is, you will write an OpenCL kernel where each work-item will calculate the min and max for part of the data and write those to an OpenCL buffer. This (smaller) buffer will then be read back to the host where the final reduction will be done.!"#$ %&'()$!!!!!!! " " " " " " " # # # # $! %&' %() # # # * * * * * * ,,, $" %&' %(),, * * * $# %&' %() $* %&' %() $+ %&' %() $, %&' %() $- %&' %() * * * * * * The range kernel. Each work-item processes some number of items from the update kernel s output and stores the minimum and maximum across those values in the range buffer. (e.g., work-item 0 processes the first 7 items in the data and work-item 1 processes the next 7.) The number of items each work-item processes is determined by the number of work-items and the size of the data to process. The range buffer is then read back to the host, which does the final reduction. The range kernel therefore takes the output from the update kernel as its input and produces a range_buffer output. To implement this, open the main_opencl_6_range_kernel.c file. This file is similar to what you had in step 4 (before adding the local size experiment) but it has an additional kernel variable (range_kernel), an additional cl_mem variable (range_buffer), and an additional host buffer (range_data) for interacting with the range kernel. You will need to add your range kernel to the kernel.cl file, update the setup_cl_compute() method to create the appropriate range_buffer, range_kernel, and range_data. (Also update cleanup_cl() to clean up after them.) I ve provided a skeleton range kernel below for you to start with. The number of work-items to use for the range kernel is defined in the RANGE_SIZE #define. // The number of work items to use to calculate the range #define RANGE_SIZE 1024*4 // ======== Compute while (range > LIMIT) { // Calculation start_perf_measurement(&update_perf); update_cl(get_in_buffer(), get_out_buffer()); stop_perf_measurement(&update_perf); // Range start_perf_measurement(&range_perf); range_cl(get_out_buffer()); stop_perf_measurement(&range_perf); // Read back the data start_perf_measurement(&read_perf); read_back_data(range_buffer, range_data); stop_perf_measurement(&read_perf); // Compute Range start_perf_measurement(&reduction_perf); range = find_range(range_data, RANGE_SIZE*2); stop_perf_measurement(&reduction_perf); iterations++; printf("iteration %d, range=%f.\n", iterations, range); } The changes to the program for running the range kernel. Instead of reading back the data and then calling range, we call range_cl() (which enqueues the range kernel) and then read back the range_buffer to do the final reduction on the Page 15 of 26

16 host. Note that you need to make sure the read_back_data() function reads the right amount of data. kernel void range(global float *data, int total_size, global float *range) { float max, min; // Find out which items this work-item processes int size_per_workitem =... int start_index =... int stop_index =... // Finds the min/max for our chunk of the data min = max = 0.0f; for (int i=start_index; i<stop_index; i++) { if (...) min =... else if (...) max =... } // Write the min and max back to the range we will return to the host range[...] = min; range[...] = max; } The range kernel skeleton. The input is the data to process (the output from the last update kernel, set by clsetkernelargs()), the total size of the data to process, and the results are written to the range output. 5.1*67 89 :;34434=%%% < '43/!!"$!!"'43/!<$!!"'43/!<$!!"'43/!<$!!"'43/!<$!!"'43/!<$!!"'43/!<$! This section of the Khronos OpenCL quick reference card will be helpful in filling in the kernel code above. :;3443A=%%%!!!!!#!43/#!43/!#!'43/#!'43/!!! # "-! "!!" $!!!!!!!!!!!!!!!!% Now.!!""!#$ implement the :!#!: range kernel and change the program &!,3%!/ to use it. Test the code by verifying that the %&'()*!!!"%&'()*!!##!%&'()*!'$!!!!!! number.!!""!##!"!'$ of iterations :!# ' and the final range "!!""!1#!!"!%$ value are the same %!)# 1 %!)#=!88!% %!)#= as the previous versions. If they are not you "!!""!##!!"!'$ #!;!' "!!""!##!!"!'$ # ' have a bug and will have to fix your code. "!!"7! $!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! "!!""!##!!"!'$ "#!;!' "!!""!##!"!'$! "#!;!' Once you ve got it working, fill in the worksheet $ and continue. "!!""!#$ "!!""!&#!"!/#!"!0$ "! " $!!!!!!!!% "!!""!&#!"!/#!"!0$ "!!""!##!!"!'$ #! &#!/$!;!0 Don t &! turn the page $ until you ve finished filling in the worksheet.,3%!/ 0 &!<!/!;!0 '!42!#!8!' # :;3443"=%%% "!!""!##!!"!'$ "!!""!##!!"!'$ '!42!'!8!# # #!,3%!' $%&'(%)*!! #!!! # $ 43/!! #!! '43/!! # $!!!"43/ #!!!!!!'43/ $!!!"'43/ #!!!!!!'43/ $ % 2% % 34 %= % 2% % 34 %= % 2% % 34 %= % 2% % 34 %= % 2% %=88!>?$!:!34 %= % 2% %=88!>?$!:!34 %= :;3443<=%!!!! " " 3%%! "!!""!##!"!$%!#!"!$&#$!!!!!!!!!!!!!!!!!! +),-.!#!/&!!!!!# $%! $&#$!!!! %&'()*!!!"%&'()*!!##!%&'()*!$%!#!%&'()*!$&#$! $%!#!$&#!!!!# $%! $&#$!!!!!!!!!!!!!!!!!!!!!!! "!!" $!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!/&! "!!""!##!"!'$!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 0,1!&2!#!!!!!# '$!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!,3%!' %&'()*!!!"%&'()*!!##!%&'()*!'$!!!!!!!!!!!!# '$!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! "!!""!##!"!'$!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!# '$!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! %&'()*!!!"%&'()*!!##!%&'()*!'$!!!!!!!!!!# '$!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! "!!""!##!"!'$!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!# '$!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!# '$!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! "!+>1.!""!()*(#!"!#$!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+>1. ()*(!!#$!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! %&'()* +>1.!"%&'()*!()*(#!%&'()*!!#$!!!!!!!!+>1. ()*(!!#$!!!!! "!!""!()*(+#!"!()*(,#!"!#$!!!!!!! ()*(+ ()*(, $! %&'()*!!!"%&'()*!()*(+#!%&'()*!()*(,#!!%&'()*! #$!!!!!!!!! ()*(+ ()*(, $!!!! "!+?@*!""!#$!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 043!&2!#!,3%!' 543*,6!()*3%! &2!#!,3%!'!/&! #!8! ()*( 9/*.!,3%! 43/*6.&),/* # :;3443"=%%%%%%%%%%!!!!$! 43/!#!'43/!! " " " 3% "!!""$! "!!""$! "!!""!#$! "!!""$! "!!""$! "!!""!#$! "!!""!'541(65#$! "!!""!'#!"!#$! "!!""$! "!!""!#$! "!!""!##!"!'$! "!!""$! "!B1?C!""$! "!!""!##!"!'$! # # ' # ' "!!""$!!!!!!!!!!!!!!!!!!!!!!! "!!" $! *A# "!!""$! "!!""!##!"!'$! #!,3%!' "!!""$! "!!""!&#!"!/#!"!0$! "!!""!##! $!!!!!!!! B*/'63!'!42!#!8!'#!!!!!# # "!!""!##! $!!!!!!!!!!!! B*/'63!'!42!'!8!##!!!!!# # "!!""!##!"!'$! # '!<!/6'3@!"# '$ "!!""!##!"!<%786$! # "!!""!##!43/ <(#7$! "!!""!##!"!'$! #A?;!'A?! 43/!!!""!#$! "!!""!##!43/ $! Page 16 of #!!<!?A! 26 "!!""!##!43/!!$ "!!""!#$! "!!""!1#!43/!!< $! "!C)@!""$!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"'43/!!!&!04)($ C'4*/!D,D!!!"'43/!!!&!04)($!!!!!!&!04)($! %&'()*!!!!!&!04)($!! %&'()*!!!"'43/!!!&!04)($! "!!""!##!"!'$! # "!.)E!""!##!"!'$! +&-.'/*!# '!"#A'$ "!.)E*!""!##!43/ $! +&-.'/*!#A' ' "!.)E(!""!##!"!'$!!!!!!!!!!!!!! +&-.'/*!#A' # "!!" $!!! #!! "!!" $!! "!!""!##!"!'$! "!!""!##!"!'#!43/!! 9:;4$! "!(?*>!""$! "!())>*!""!##!43/ $! +&-.'/*!# ' "!!""!#$! # "!(+F(>!""$!!!!!!!!!!!!!!!!!!!!!!!!! "!+?*!""$!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

17 7. Coalescing the Range Accesses So what happened? The performance got worse when you ran the range kernel on the GPU. Why is that? Well, as the questions in part 6 hinted, the way we wrote the range kernel resulted in a lot of uncoalesced memory accesses. (Go back to the lecture notes if you re not certain what this means.)!"#$%&'(#)&#%#*(+&(, -.-/0-1.-/0-1.2/0 345%, &)*+& *,-. &,/./ &2,% 067%5(#*(+&(, &&.** 00&) )1-* $%&'(#-48695( &0-. *0. *).1 $(%7#:%5%, ), &% -486),(#/+4'+%8 -,0 -+(%5(#;9<<(+= *% >+)5(#:%5% &*&,&/?)&)=@ - - -,(%&96 &+ &./ -20% $(79A5)4& - &?)&%,#$%&'(#B%,9( //2*,*0&) //2*,*0&) //2*,*0*, CD(+@(%7 0+*., %!"#$$%#&%'()%"*+,-$./%0)$*1%'*%2.2$-.')%'()/)%3.$-)/4 967%5(#=6((796#D=#-.-/0 &2+. )21- E#5)8(#4&#7%5%#84D(8(&5,3,3-2,% E#5)8(#4&#067%5( *&3,03 +%&'(#=6((796#D=#-.-/0 &2*0-2&/ 345%,#=6((796#<+48#FG &2&* %!"#$%&'(#)&#%#*(+&(,# 454!6% 4754!6% 4754!6% 4758!6% 4758!6% 9:$;<$"=% >$=?@A'B% 4C$"B?D% EFBFG<% H;FI$%J"I"% 4;$"I$%K?L$;G% 4'MDFC$%!;'#;"M% >$"=%J"I"% >"B#$% 6D="I$% Putting the range kernel on the OpenCL device actually slowed down the GPU. It did give a small speedup on the CPU, however. Let s first take a look at the CPU. The CPU code is running faster because we ve eliminated the time spent reading the data back. We re still not at a 2x speedup, however. The GPU code is more disappointing. The resulting application is nearly twice as slow as when we ran the range code on the CPU, even though we ve eliminated the data movement. The reason for this is that the way we wrote the range kernel results in uncoalesced memory accesses, which dramatically reduce the available bandwidth. To understand this, it s important to remember that when you read memory from DRAM you get a large chunk of memory. In the case of a GPU, it s typically at least 384 bytes, which is 12 float values. If you only use one of those 12, then you are throwing away 11/12ths (92%) of your bandwidth. The wider the DRAM bandwidth the worse a problem uncoalesced access can be. Unfortunately coalescing is tricky. Older GPUs have very specific rules for coalescing (each thread has to access the next item in an array and they have to do it at the same time and in order) while newer GPUs have more relaxed rules (any threads that access neighboring data at the same time). So getting good coalescing behavior is tricky. Luckily our range kernel is simple enough that we can understand what is going on and how to fix it. You now need to change your range kernel so that each work-item in a work-group accesses an element in the input array that is next to the one accessed by the next work-item in the work-group. Take a look at the picture below to understand. Page 17 of 26

18 !"#$%&'()*+,-)./012)3)456781'9)3):);&<#19 =>?<#&'9?'0!!!!!!! " " " " " " " # # # # $%&'!!!! ()%'! # # # * * * * * * ,,, $%&' " " " " ",, @ $%&' # # # # # $%&' * * * * * $%&' $%&',,,,, $%&' E<#&'9?'0.#91')ABC)<;)7#>0D/012! " # * +, -! " # * +, -! " # * $%&'! " # * ()%'! " # * +, -! " # * +, -! " # * +, -! $%&' +, -! +, - " # * @.#91')45@BC)<;)7#>0D/012 Example of our range kernel. In the original accesses (top) we assume we read 4 values from DRAM on every access, but only use one of them. (The real hardware is far worse in this regard.) In the re-ordered version below, we read 8 values from DRAM on every two accesses, and use 7 of them. This gives us a 7x increase in effective bandwidth, and can improve our performance by requiring only 2 memory accesses in place of 7. Now go and write a second kernel ( range_coalesced ) that works this way and fill in the next part of the tutorial. Make sure you validate your results by looking at the final range value. Copy your main_opencl_6_range_kernel.c file to a new main_opencl_7_range_coalesced.c file and change it to use the new kernel. Run it and fill in the worksheet. (You can add the new kernel to the same kernel.cl file by calling it range_coalesced instead.) Don t turn the page until you ve finished filling in the worksheet. Page 18 of 26

19 After fixing the range kernel to be coalescing we see that the GPU performance is vastly improved (13x faster range compared to uncoalesced, plus reduced data movement). The CPU performance is about the same, but the range kernel itself is actually 10% slower. The reason for this is that the CPU has a cache and hardware prefetcher, so it doesn t benefit much from changing the order of the accesses. In fact, straight linear accesses (the uncoalesced version) is faster on the CPU due to this hardware.!"#$%&'()*(#+,(#-&./(#0***())() $1$23 $41$23 $ %+&' &)*+& *,-,.*)) 378&+(#9(:.(' &&/** 0..0 ).,/ -&./(#$%;7<+( &0,/ 1/) 0.* -(&8#=&+& - )- $%;7>'(#2:%/:&;, -0 $:(&+(#?<@@(:),, A:>+(#=&+& &*- -&. B>.>),,, $'(&.<7 &+ &/1 -(8<*+>%., & B>.&'#-&./(#C&'<( 112*-*0&) 112*-*0&) 112*-*0*- DE(:,(&8 0+*.-, ).*!"#$$%#&%'()%"*+,-$./%0)$*1%'*%2.2$-.')%'()/)%3.$-)/4 <78&+(#)7((8<7#E)#$1$23 &2*, )2., F#+>;(#%.#8&+&#;%E(;(.+ -3 /3 F#+>;(#%.#*%;7<+&+>%. 1-3 *.3 :&./(#)7((8<7#E)#$1$23 &20* -2)0 :&./(#)7((8<7#@:%;#GH,21& &-21 6%+&'#)7((8<7#@:%;#GH &2, -2+ &2-% &%,2*%,20%,2.%,2-%,%!"#$%&'()*(#+,(#-&./(#0**())()# 454!6% 4754!6% 4754!6% 4758!6% 4758!6% 9:$;<$"=% >$=?@A'B% 4C$"B?D% After coalescing the range accesses we see that we ve basically eliminated data movement from our application and achieved a reasonable 3x speedup on a slow GPU. EFBFG<% H;FI$%J"I"% 4;$"I$%K?L$;G% 4'MDFC$%!;'#;"M% >$"=%J"I"% >"B#$% 6D="I$% Overall Speedup If you click on the Overall Speedup tab at the bottom of the tutorial worksheet you can see the impact of the various optimizations on the application performance. Clearly the biggest improvements came from removing the overhead from inside the loop (not surprising) and coalescing the range kernel. It s important to have this kind of information for your optimizations so you can see the benefit of each step. &"!"##$%#&'($)*++,-*$.+#"/0+$1&$%$%&,+$ %" $" #" MOM2H" M@OM2H" M@OG2H"!" #'"()*+,-.+" $'"()*+,-.+" %'"()*+,-.+" /-01"2345,-.6" /-01"2345,-.6" 7).8"9,:-.-*1;" &'"<=+31+)8" E-F+.*-4.*"4." 01+"G2H" I'"J).6+"-.")" L'"M4),+*9+"01+" K+3.+," J).6+" N999+**+*" Overall speedups from the changes in the tutorial. CPU vs. GPU We ve now encountered two cases where the CPU runs more slowly with a GPU-optimized kernel: The GPU update kernel has the overhead of checking the borders each time vs. the C code which skips the borders in the loop. The coalesced GPU range kernel accesses data in a bad order for the CPU. Page 19 of 26

20 This points out that the particular kernel you want to use will depend on the hardware. Now you might think you could just change the global dimensions of the GPU kernel to skip the border and then have a kernel that runs great on both architectures. However, while doing this will get you 2.0x speedup on the CPU, it will get you a huge slowdown on the v1.1 GPUs (lab and my laptop) because the memory accesses will no longer be coalesced. But on the v2.0 GPU (in the cluster) you d still get great performance because they have more relaxed coalescing rules. So while OpenCL does offer portability, it clearly does not offer performance portability. Going Faster How could you make this code faster? Well, the biggest performance issue now is the update kernel. Since there is a lot of data reuse in this kernel we could use the local shared memory to manually cache the input data before we process. This could potentially speed up the kernel a lot, but would make it far more complicated, and would slow it down on the CPU. (CPUs don t have local memory, so any code that accesses it is wasted time.) On newer GPUs this will be almost a waste of effort since they have caches, which would do this for us. Other than that, the data clearly indicates that we re still spending a good deal of time on overhead. To minimize this we need to processes larger problem sizes (if they fit on the GPU) and run them for longer (e.g., more iterations). If our problem is larger and takes longer to run, the percent of time spent on overhead will decrease and we ll see a corresponding speedup. An Aside: How slow is Java? I re-wrote the simple C version in Java, keeping the code as similar as possible to the C. This was far easier than writing in an ancient, primitive language like C or Fortran. Now most people will tell you that Java is about half as fast as C code, but for this case, it was actually 3.5x faster, thereby beating the GPU implementation. Why? I don t honestly know, and this is a real problem for performance optimization. Java has the potential to do better compilation due to the runtime nature of its JIT (e.g., at runtime it can determine which parts of code to optimize based on how they are being used). So perhaps the java is unrolling the loop, or maybe automatically vectorizing? But on the other hand it has the overhead of array bounds checking and memory management. So who knows what is going on. However, on top of good performance, Java has very efficient libraries for parallelizing code, which would enable the code could run even faster with a bit more effort. (It would take far less effort to use Java s parallel libraries than to use OpenCL. But there are of course OpenCL interfaces for Java as well.) Regardless, this is a clear indication that you shouldn t discredit Java s ability to run fast. Indeed, combined with the fantastic productivity improvements for the developer, Java is an excellent choice for a large range of applications. &"!"##$%#&'($)*++,-*$%$./0$1*2345+,$1*+6%7$./0$8"."$ %" '(')*" $" '+(')*" '+(,)*" -./.(')*" #"!" '(')*" '+(')*" '+(,)*" -./.(')*" Comparing the performance of the optimized OpenCL implementations vs. Java. Surprised? Java is really quite good these days. Page 20 of 26