PDC Summer School Introduction to High-Performance Computing: OpenCL Lab



Instructor: David Black-Schaffer

Introduction

This lab assignment is designed to give you experience using OpenCL to convert a standard C program for parallel execution on multiple CPU and GPU cores. You will start by profiling a model PDE solver. Using those performance results, you will move portions of the computation over to an OpenCL GPU device to accelerate the computation. At each step you will analyze the resulting performance and answer a series of questions that prompt you to think about what is going on.

Learning Objectives

1. Write and understand OpenCL code
2. Understand the OpenCL compute and memory models
3. Profile and analyze application performance
4. Understand the impact of local work group size on performance
5. Write your own OpenCL kernel
6. Optimize a simple kernel for coalesced memory accesses

Assumed Background

1. Basic knowledge of C programming, compilation, and debugging on Linux
2. Basic understanding of GPU architectures (from the lecture earlier today)
3. Some exposure to program performance analysis

If you are concerned about your level of background, please make sure you choose a partner whose skills complement your own.

Materials

Lecture Notes: GPU Architectures for Non-Graphics People
Lecture Notes: Introduction to OpenCL
Source code starter files provided on the summer school website
The OpenCL Specification v. 1.0
OpenCL Quick Reference Card
Nvidia CUDA C Programming Guide v. 3.2

The last three pieces of documentation can be found by simply googling their titles. You may need them for the later parts of the lab.

Hardware

The lab machines are equipped with a GeForce 8400GS graphics card with 512MB of memory. These are Nvidia Compute version 1.1 cards with 1 streaming multiprocessor with 8 hardware cores processing 32 threads in a group (warp) running at up to 1.3GHz.
The CPUs on the lab machines are Intel Core2Quad Q9550 (Yorkfield) at 2.83GHz. These are 2 dual-core dies in one package sharing a northbridge, with 6MB of cache on each die.

The cluster machines are equipped with four Tesla C2050 graphics cards with 3GB of memory. These are Nvidia Compute version 2.0 cards with 14 streaming multiprocessors with 32 hardware cores each processing 32 threads in a group (warp) running at up to 1.1GHz. The CPUs on the cluster machines are dual Intel Xeon E5620s (Westmere-EP) at 2.4GHz. These are 4-core (8-thread) processors with 12MB of L3 cache each, connected by QPI.

The data used in this tutorial document is from an Apple MacBook Pro with a GeForce 9400M with 256MB of memory. It is an Nvidia Compute 1.1 device with 2 streaming multiprocessors with 16 cores each processing 32 threads in a group (warp) running at up to 1.1GHz. The CPU is an Intel Core2Duo P8800 at 2.66GHz with 3MB of L2 cache.

A: Getting Started

To get started you will run the OpenCL Hello World code that was discussed in the lecture.

1. Download the zip file of the code from the PDC website.
2. Unzip it in your home directory on one of the lab machines.
3. Type make hello
4. Run ./hello_world.run

If all goes well you should get one thousand sine calculations printed out to the terminal. To make sure you understand what is going on, go to the code and make and verify the following changes:

1. Change the code to run on the CPU or GPU (make sure it works on both). Note: if your code crashes when you change it to run on the CPU, you probably need to do some error checking. The hello_world program has none. Try looking at the error codes returned by clGetPlatformIDs, clGetDeviceIDs, and clCreateCommandQueue. (You can find out how to get the error codes back by looking at the OpenCL documentation.) What's going on? (You can look up the error IDs in the cl.h file, which can be found online or at the end of this document.)
2. Change the kernel to calculate the cosine of the numbers instead of the sine.
3. Change the size of the array processed to be 1024 times larger, remove the printout of the results, and time (using the time command on the command line) the difference in speed executing on the CPU vs. GPU.
4. Think about the results. Can you explain why one is faster than the other?
5. Now change the program to calculate the cosine without using OpenCL and compare the performance. What do you notice?
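The error-checking advice in step 1 can be sketched in plain C. This is a minimal, hypothetical helper and not part of the lab code: the names cl_error_name and check_cl are invented here, and the numeric codes are repeated from the standard cl.h header (with a SKETCH_ prefix) only so that the sketch compiles without OpenCL installed. In the real program you would pass in the cl_int that an OpenCL call returns, or writes through its error-code argument.

```c
#include <assert.h>
#include <stdio.h>

/* A few error codes with the values the standard cl.h header gives them.
   They are repeated here only so the sketch builds without OpenCL. */
enum {
    SKETCH_CL_SUCCESS             = 0,
    SKETCH_CL_DEVICE_NOT_FOUND    = -1,
    SKETCH_CL_INVALID_VALUE       = -30,
    SKETCH_CL_INVALID_DEVICE_TYPE = -31,
    SKETCH_CL_INVALID_PLATFORM    = -32,
    SKETCH_CL_INVALID_DEVICE      = -33,
    SKETCH_CL_INVALID_CONTEXT     = -34
};

/* Hypothetical helper: map an error code to a readable name. */
const char *cl_error_name(int code) {
    switch (code) {
    case SKETCH_CL_SUCCESS:             return "CL_SUCCESS";
    case SKETCH_CL_DEVICE_NOT_FOUND:    return "CL_DEVICE_NOT_FOUND";
    case SKETCH_CL_INVALID_VALUE:       return "CL_INVALID_VALUE";
    case SKETCH_CL_INVALID_DEVICE_TYPE: return "CL_INVALID_DEVICE_TYPE";
    case SKETCH_CL_INVALID_PLATFORM:    return "CL_INVALID_PLATFORM";
    case SKETCH_CL_INVALID_DEVICE:      return "CL_INVALID_DEVICE";
    case SKETCH_CL_INVALID_CONTEXT:     return "CL_INVALID_CONTEXT";
    default:                            return "other error (look it up in cl.h)";
    }
}

/* Hypothetical helper: returns 1 on success; otherwise reports which
   call failed and returns 0 so the caller can stop cleanly. */
int check_cl(int code, const char *what) {
    assert(what != NULL);
    if (code == SKETCH_CL_SUCCESS) return 1;
    fprintf(stderr, "%s failed: %s (%d)\n", what, cl_error_name(code), code);
    return 0;
}
```

A call site would then look like `if (!check_cl(error, "clGetDeviceIDs")) return 1;`, which turns a silent crash into a message naming the failing call.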
B: Introduction to the PDE Solver

Program Overview

This tutorial uses a very simple PDE-solver-like program as a demonstration. The program is provided in a standard C version in the file main_c_1.c. The program works as follows:

    int main (int argc, const char * argv[]) {
        float range = BIG_RANGE;
        float *in, *out;

        // ======== Initialize
        create_data(&in, &out);

        // ======== Compute
        while (range > LIMIT) {
            // Calculation
            update(in, out);
            // Compute Range
            range = find_range(out, SIZE*SIZE);
            swap(&in, &out);
        }
    }

Basic program flow. (Profiling measurements have been removed for simplicity.)

The program creates two 4096x4096 arrays (in and out) and then performs an update by processing the in array to generate new values for the out array. The out array is then processed to determine the range (maximum value minus minimum value). If the range is too large, the program swaps the arrays (in becomes out and out becomes in) and repeats until the range converges below the limit.

The update calculation takes in the four neighbors (a, b, c, d) along with the central point (e) to calculate the next central value as a weighted sum. This is a basic 5-point stencil operation, which can be visualized as:

[Figure: the 5-point stencil, showing the central point e and its four neighbors a, b, c, and d.]

The update iterates through the whole in matrix looking at the neighbors to calculate the new value, which is written into the out matrix. For the next update the in and out matrices are swapped, thereby overwriting the old values with the new ones.

    void update(float *in, float *out) {
        for (int y=1; y<SIZE-1; y++) {
            for (int x=1; x<SIZE-1; x++) {
                float a = in[SIZE*(y-1)+(x)];
                float b = in[SIZE*(y)+(x-1)];
                float c = in[SIZE*(y+1)+(x)];
                float d = in[SIZE*(y)+(x+1)];
                float e = in[SIZE*y+x];
                out[SIZE*y+x] = (0.1*a+0.2*b+0.2*c+0.1*d+0.4*e);
            }
        }
    }

The update calculation is simply a 5-point stencil.

After the update, the program runs a range calculation on the output matrix that simply finds the minimum and maximum values and returns the difference between them as the range.

    float find_range(float *data, int size) {
        float max, min;
        max = min = 0.0f;
        // Iterate over the data and find the min/max
        for (int i=0; i<size; i++) {
            if (data[i] < min)
                min = data[i];
            else if (data[i] > max)
                max = data[i];
        }
        // Report the range
        return (max-min);
    }

Finding the range simply iterates over all the data and keeps track of the minimum and maximum values, and then returns the difference. (This function is in util.c, as it is shared by all the sample programs.)

The program repeats the loop of update/range until the range is below the LIMIT specified in the parameters.h file. Don't change anything in the parameters.h file as it is used by all programs.

Running the Program

1. Type make 1 to build the C version of the program.
2. Run the program by typing ./1_c.run

When you run the program you will see the output converge after approximately 42 iterations. The program will print out the range at convergence. You should use this value to check that the optimized versions of the code later on in the lab are producing the same results. In addition to the range results, the program will print out profiling information. From this you can see how long the program took in total (Total) and how long the update (Update) and range (Range) calculations took. It also reports the standard deviation, which is a good sanity check as to how reliable your performance measurements are.

    1. C
    Allocating data 2x (4096x4096 float, 64.0MB).
    Range starts as:
    Iteration 1, range=
    Iteration 42, range=
    Total: Total: ms (Avg: ms, Std. dev.: ms, 1 samples)
    Update: Total: ms (Avg: ms, Std. dev.: ms, 42 samples) [ignoring first/last samples]
    Range: Total: ms (Avg: ms, Std. dev.: ms, 42 samples) [ignoring first/last samples]

Output of the C program showing the final range and the profiling information. Note the large standard deviation on the Update measurement (13%), indicating that something else may have been happening on the computer at the same time.

Performance Measurements

Take a look at the code in the main_c_1.c file to see how the application is implemented. The only significant differences from the code listed above are for the performance measurements. As you can see, performance measurements are made using perf variables. These must be initialized first, which is done in init_all_perfs(). The perfs are timers which are started by calling start_perf_measurement() and stopped with stop_perf_measurement().
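To make the idea of a perf timer concrete, here is a minimal sketch of how such a start/stop timer could accumulate samples. It is not the lab's implementation, which is not shown here: the sketch_perf type and all sketch_ function names are hypothetical. It also assumes clock() as the time source, which measures processor time; real measurements of a program that waits on a GPU would want a wall-clock source instead.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the lab's perf type. */
typedef struct {
    clock_t started;    /* clock() value captured by the last start */
    double  total_ms;   /* sum of all recorded samples */
    double  sum_sq_ms;  /* sum of squared samples, used for the variance */
    int     samples;    /* number of completed start/stop pairs */
} sketch_perf;

void sketch_init(sketch_perf *p) {
    p->started = 0;
    p->total_ms = 0.0;
    p->sum_sq_ms = 0.0;
    p->samples = 0;
}

/* Record one elapsed time in milliseconds. */
void sketch_record(sketch_perf *p, double elapsed_ms) {
    p->total_ms += elapsed_ms;
    p->sum_sq_ms += elapsed_ms * elapsed_ms;
    p->samples++;
}

void sketch_start(sketch_perf *p) {
    p->started = clock();
}

/* Each start/stop pair adds exactly one sample. */
void sketch_stop(sketch_perf *p) {
    double elapsed_ms =
        1000.0 * (double)(clock() - p->started) / (double)CLOCKS_PER_SEC;
    sketch_record(p, elapsed_ms);
}

double sketch_avg(const sketch_perf *p) {
    assert(p->samples > 0);
    return p->total_ms / p->samples;
}

/* Population variance; the standard deviation is its square root. */
double sketch_variance(const sketch_perf *p) {
    double avg = sketch_avg(p);
    return p->sum_sq_ms / p->samples - avg * avg;
}
```

Keeping the running sum and sum of squares means the average and spread can be reported at the end without storing every sample.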
Each time a start-stop pair is called, the perf records the elapsed time. At the end the results are printed out with the print_perfs() call.

Now take a look at the performance numbers you got and decide which part of your code should be optimized with OpenCL first.

C. Accelerating (Decelerating?) with OpenCL

Now that you understand the basic algorithm and have seen the source code, it is time to accelerate your program by using OpenCL. As you can see from the performance data for the C code, the best place to start is by replacing the update() function with an OpenCL one. Ready? If the lectures were any good this shouldn't take more than a few minutes. You can use the hello world file as a starting point.

I'm just kidding. Starting from nothing is a major pain. So I've done this for you in the main_opencl_1.c file, which is summarized below:

    int main (int argc, const char * argv[]) {
        float range = BIG_RANGE;
        float *in, *out;

        // ======== Initialize
        init_all_perfs();
        create_data(&in, &out);

        // ======== Setup OpenCL
        setup_cl(argc, argv, &opencl_device, &opencl_context, &opencl_queue);

        // ======== Compute
        while (range > LIMIT) {
            // Calculation
            update_cl(in, out);
            // Compute Range
            range = find_range(out, SIZE*SIZE);
            iterations++;
            swap(&in, &out);
            printf("iteration %d, range=%f.\n", iterations, range);
        }
    }

main_opencl_1.c

Note that the only two changes to the high-level algorithm are to set up OpenCL at the beginning (setup_cl()) and to call update_cl() instead of update(). We're going to skip the contents of setup_cl() for now. You can take a look at it in opencl_utils.c if you're interested, but all it does is look for the right type of device based on whether you passed in CPU or GPU on the command line and create a context and queue for you. Instead, take a look at update_cl(). This is where the work is being done.

    void update_cl(float *in, float *out) {
        cl_int error;

        // Load the program source
        char* program_text = load_source_file("kernel.cl");

        // Create the program
        cl_program program;
        program = clCreateProgramWithSource(opencl_context, 1, (const char**)&program_text, NULL, &error);

        // Compile the program and check for errors
        error = clBuildProgram(program, 1, &opencl_device, NULL, NULL, NULL);

        // Create the computation kernel
        cl_kernel kernel = clCreateKernel(program, "update", &error);

        // Create the data objects
        cl_mem in_buffer, out_buffer;
        in_buffer = clCreateBuffer(opencl_context, CL_MEM_READ_ONLY, SIZE_BYTES, NULL, &error);
        out_buffer = clCreateBuffer(opencl_context, CL_MEM_WRITE_ONLY, SIZE_BYTES, NULL, &error);

        // Copy data to the device
        error = clEnqueueWriteBuffer(opencl_queue, in_buffer, CL_FALSE, 0, SIZE_BYTES, in, 0, NULL, NULL);
        error = clEnqueueWriteBuffer(opencl_queue, out_buffer, CL_FALSE, 0, SIZE_BYTES, out, 0, NULL, NULL);

        // Set the kernel arguments
        error = clSetKernelArg(kernel, 0, sizeof(in_buffer), &in_buffer);
        error = clSetKernelArg(kernel, 1, sizeof(out_buffer), &out_buffer);

        // Enqueue the kernel
        size_t global_dimensions[] = {SIZE,SIZE,0};
        error = clEnqueueNDRangeKernel(opencl_queue, kernel, 2, NULL, global_dimensions, NULL, 0, NULL, NULL);

        // Enqueue a read to get the data back
        error = clEnqueueReadBuffer(opencl_queue, out_buffer, CL_FALSE, 0, SIZE_BYTES, out, 0, NULL, NULL);

        // Wait for it to finish
        error = clFinish(opencl_queue);

        // Cleanup
        clReleaseMemObject(out_buffer);
        clReleaseMemObject(in_buffer);
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        free(program_text);
    }

update_cl() with all the error-checking code removed to make it easier to read.

As you can see, this code is shockingly similar to the code in the Hello World OpenCL program. All it does is:

1. Create a program (loaded from the text file "kernel.cl")
2. Build it
3. Create a kernel ("update")
4. Create two memory objects ("in_buffer" and "out_buffer")
5. Enqueue a write to write the data to the memory objects (why would we copy data to the out buffer?)
6. Set the kernel arguments
7. Enqueue the kernel execution
8. Enqueue a read to get the data back
9. Wait for the execution to finish
10. Clean up all the resources we allocated

There's really nothing more to it than that. You can open the kernel.cl file and see the update kernel. It is very similar to the update() C code, except that it has to find out which update it should do (get_global_id()) and needs to avoid processing if it is on the edge. (The C version uses loops from 1 to SIZE-1 to avoid the edges.)

    kernel void update(global float *in, global float *out) {
        int WIDTH = get_global_size(0);
        int HEIGHT = get_global_size(1);

        // Don't do anything if we are on the edge.
        if (get_global_id(0) == 0 || get_global_id(1) == 0)
            return;
        if (get_global_id(0) == (WIDTH-1) || get_global_id(1) == (HEIGHT-1))
            return;

        int y = get_global_id(1);
        int x = get_global_id(0);

        // Load the data
        float a = in[WIDTH*(y-1)+(x)];
        float b = in[WIDTH*(y)+(x-1)];
        float c = in[WIDTH*(y+1)+(x)];
        float d = in[WIDTH*(y)+(x+1)];
        float e = in[WIDTH*y+x];

        // Do the computation and write back the results
        out[WIDTH*y+x] = (0.1*a+0.2*b+0.2*c+0.1*d+0.4*e);
    }

The OpenCL kernel update().

Running the OpenCL version

1. Type make 1 (as before)
2. To run the CPU version type ./1_opencl.run CPU
3. To run the GPU version type ./1_opencl.run GPU

So what did you get? How do the multicore OpenCL CPU version and the OpenCL GPU version compare to the standard C version for speed?

D. Tutorial Worksheet

Well, it's kind of hard to make sense of all those performance numbers, so now is the time to open the Tutorial Worksheet file and start filling in some numbers. Indeed, for the rest of the tutorial you'll be using this to track and analyze your performance. Once you've finished implementing the optimizations you'll then submit your code to the cluster. This will give you a second set of performance numbers which you will then compare to the ones you get on the lab machine to see the impact of different GPUs on performance and optimizations.

Don't Cheat

For you to learn the most you should read the appropriate part of the text here, write and debug the code, fill in the worksheet, and answer the questions in the worksheet, in that order. So when the tutorial says "don't turn the page until you've finished filling in the worksheet", please don't. If you cheat and read ahead you will bias yourself and will learn a lot less. If you get stuck, ask for help before you read ahead.

Benchmarking is hard

You'll be getting timing information running on the lab machines. If you're doing something else at the same time (such as web browsing, filling in a worksheet, etc.) it will impact your measurements. You can use the standard deviation reported in the output to evaluate how much you can trust your measurements. But it would be best to fill in the worksheet on your own laptop using Excel, rather than OpenOffice on the lab machines.

Get started

Go ahead and fill in the numbers from your first three runs in the tutorial worksheet. They should go under the "1. Baseline" section. Be sure to fill in the Final Range Values to make sure that you're actually computing the same thing on all versions of the code. (It's very easy to get amazing speedups by computing nothing without realizing it.) Now answer the questions for part 1 of the worksheet.

Don't turn the page until you've finished filling in the worksheet.

1. Baseline

Note: the graphs and data shown here are for running on my MacBook Pro. Depending on your hardware and operating system you may see different results. So you're going to have to think for yourself to figure this stuff out. The performance discussion below is meant as a guideline.

[Figure: worksheet section 1 (Baseline), with the timings and final range values for the C-CPU, CL-CPU, and CL-GPU versions, the worksheet questions, and a bar chart of runtime normalized to the C-CPU version.]

The tutorial worksheet uses light blue to indicate places you need to fill in data, formulas, or analysis. The performance graphs are all normalized to the C-CPU speed (on the left), so bigger bars are worse. From this data the GPU version is 1.8 times slower than the C version, and the OpenCL CPU version is 25% slower. Not so good. Remember the questions are there to help you learn, so try to think about them rather than trying to just put down something.

From the graph in the worksheet we can clearly see that we're spending a huge amount of time calculating the update. However, the OpenCL code for update is pretty complicated, so this level of detail is not very helpful. What we need to do is get more detailed performance data. Open the main_opencl_2_profile.c file.
This file is similar to the file from part 1 above except that it defines a bunch more performance counters. These are:

program_perf for loading and compiling the program
create_perf for creating the buffers
write_perf for writing to the buffers
read_perf for reading from the buffers
finish_perf for timing the finish
cleanup_perf for timing the cleanup

Your job is now to go into the update_cl() function and insert appropriate perf measurement calls to measure the performance of these parts of the program. Go ahead and do this and then run the program with make 2 and ./2_profile.run CPU and ./2_profile.run GPU. Fill in the data in section two of the worksheet and answer the questions.

Note: for each section of this tutorial you use make N and ./n_ GPU and ./n_ CPU to run the code. It is important that you use the file names mentioned here and listed in the Makefile so that you will be able to submit everything to the cluster in the end.

Don't turn the page until you've finished filling in the worksheet.

2. Baseline with Profiling

So what did you find? Is it what you expected? Here's what I got:

[Figure: worksheet section 2 (Baseline with Profiling), with per-counter timings for the C-CPU, CL-CPU, and CL-GPU versions and a stacked bar chart broken down into Update, Range, Read Data, Compile Program, Create Buffers, Write Data, Finish, Cleanup, and Overhead.]

I've removed the answers to the questions to make it less tempting to look ahead and cheat. But there's a discussion of them in the following text.

A few things jump out at me immediately. The first is that I'm spending a huge amount of time in Finish, particularly on the CPU. That's strange since Finish doesn't actually do any work other than call clFinish(). Of course, if you answered question D (which was discussed in the lectures) you'll remember that OpenCL submits work asynchronously. That is, the work isn't done when you call one of the clEnqueue functions; it's just scheduled. So when you call clFinish(), the runtime will stop your program until all the work is done. Therefore, the timers you put around clEnqueueNDRangeKernel(), clEnqueueReadBuffer(), etc., don't record much (if any) time because they are just enqueueing the commands.

Another strange thing is that I see almost no time for compiling the kernel, despite the fact that I'm creating the program and calling clBuildProgram() 40 times in this program. Your results are probably a bit different if you're not running on a Mac. The reason is that Mac OS X caches programs when you compile them the first time, so each subsequent compilation of the same source code is really fast. If they weren't cached the program would spend a huge amount of time re-compiling.
Now make a copy of your source file, call it main_opencl_3_profile_finish.c, and add in clFinish() as needed for the performance counters you added before. This will make sure that the host waits for the OpenCL commands to finish before continuing with your program. By doing this you will synchronize with the OpenCL device, and your performance measurements on the host side will be accurate. After you have done this, run your program, fill in part 3 of the worksheet, and answer the questions.

Note: OpenCL provides the ability to request events from commands and use those events to get information on when they were submitted to OpenCL, when they started, and when they finished. This is another way (indeed the preferred way) to get this information, but it is trickier.

Don't turn the page until you've finished filling in the worksheet.
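The asynchronous behavior described above can be illustrated with a toy queue in plain C. This is only a model, not OpenCL: toy_queue, toy_enqueue, and toy_finish are hypothetical names, and the "cost" of a command is an abstract number of work units. A real runtime is free to start executing commands before the host calls clFinish(); the toy defers all work to the finish call to make the effect on host-side timers easy to see.

```c
#include <assert.h>

#define TOY_QUEUE_CAPACITY 16

/* Toy model of a command queue: commands are only recorded when they
   are enqueued and are executed when the host asks to finish. */
typedef struct {
    int pending[TOY_QUEUE_CAPACITY]; /* cost of each queued command */
    int count;                       /* commands queued, not yet run */
    int units_done;                  /* total work units executed so far */
} toy_queue;

void toy_init(toy_queue *q) {
    q->count = 0;
    q->units_done = 0;
}

/* Like an enqueue call: schedules the command and returns at once.
   Returns the work done during the call, which is always 0. */
int toy_enqueue(toy_queue *q, int cost) {
    assert(q->count < TOY_QUEUE_CAPACITY);
    q->pending[q->count++] = cost;
    return 0;
}

/* Like clFinish(): does not return until everything queued has run.
   Returns the work done during the call. */
int toy_finish(toy_queue *q) {
    int done = 0;
    for (int i = 0; i < q->count; i++)
        done += q->pending[i];
    q->count = 0;
    q->units_done += done;
    return done;
}
```

A timer wrapped around toy_enqueue() sees no work at all, while a timer around toy_finish() is charged for everything queued before it. Calling the finish function right after each enqueue moves each command's cost into the timer that surrounds it.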

3. Baseline with Profiling (and clFinish)

Now the results should make a lot more sense. Finish is now a tiny part of the overall time (nearly 0) and we can see where we are spending time.

[Figure: worksheet section 3 (Baseline with Profiling and clFinish), with per-counter timings for the C-CPU, CL-CPU, and CL-GPU versions, the update speedup vs. C-CPU, the percentage of time spent on data movement and on update, and a stacked bar chart broken down into Update, Range, Read Data, Compile Program, Create Buffers, Write Data, Finish, Cleanup, and Overhead.]

With clFinish() forcing the runtime to complete each operation we can now accurately time what is going on. Clearly we are spending a lot of time moving data (write and read) and a good deal of time in cleanup. (Your results are likely to be different depending on your GPU and system!)

For my system, I'm wasting a lot of time in cleanup and moving data (write and read). The update calculation is actually pretty fast and compiling the program is not too bad. Note that the ratios of these times will depend on the ratio of speeds of the CPU and GPU. If you have a much slower GPU (like the lab machines; roughly half as fast) and a faster CPU (like the lab machines; roughly 50% faster) then you may see that the update time is a lot larger on the GPU.

Now with this data we can start to see some results. The speedup for the update calculation on the OpenCL CPU is 1.82 vs. the C version. This is decent (although not great) considering that we have 2 CPU cores running in OpenCL. The reason it's not 2.0 is that the kernel code has a lot of extra overhead to deal with the edge cases. The C code deals with them once in the loops and is therefore more efficient.
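The edge-handling overhead mentioned above can be seen by emulating the kernel launch in plain C. This is a sketch under stated assumptions, not the lab code: GRID is a tiny illustrative size (the lab uses 4096), and stencil, update_loops, update_item, and update_emulated are hypothetical names. The nested loop in update_emulated stands in for the OpenCL runtime handing one (x, y) global ID to each work-item.

```c
#include <assert.h>

#define GRID 6  /* tiny grid for illustration only */

/* The weighted 5-point sum shared by both versions. */
static float stencil(const float *in, int x, int y, int width) {
    return 0.1f * in[width * (y - 1) + x]
         + 0.2f * in[width * y + (x - 1)]
         + 0.2f * in[width * (y + 1) + x]
         + 0.1f * in[width * y + (x + 1)]
         + 0.4f * in[width * y + x];
}

/* Loop version: the loop bounds exclude the edges once, up front. */
void update_loops(const float *in, float *out) {
    for (int y = 1; y < GRID - 1; y++)
        for (int x = 1; x < GRID - 1; x++)
            out[GRID * y + x] = stencil(in, x, y, GRID);
}

/* Kernel-style version: called once per work-item, so every call has
   to re-check whether it sits on an edge. Returns 1 if it did work. */
int update_item(const float *in, float *out, int x, int y,
                int width, int height) {
    if (x == 0 || y == 0 || x == width - 1 || y == height - 1)
        return 0;
    out[width * y + x] = stencil(in, x, y, width);
    return 1;
}

/* Stand-in for the launch over a GRID x GRID global range.
   Returns how many work-items were launched only to do nothing. */
int update_emulated(const float *in, float *out) {
    int idle = 0;
    for (int y = 0; y < GRID; y++)
        for (int x = 0; x < GRID; x++)
            if (!update_item(in, out, x, y, GRID, GRID))
                idle++;
    return idle;
}
```

Both versions write the same interior values, but the kernel-style version pays for an edge test in every work-item and launches border work-items that return immediately.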
If we look at where our time is being spent, we see that we're spending only 1/3 of our time doing update and nearly 50% of our time doing data movement. (You may see different numbers depending on things like how much of your time you're spending doing compilation.) This kind of information lets us know what optimization we need to do: we need to get the compilation, cleanup, and writing data out of the main loop, since these things only need to be done once. Then we can keep the data on the OpenCL device and just read back the data for the range calculation.

To do this, open the file main_opencl_4_profile_nooverhead.c and fill in the missing parts. You can get most of the code you'll need from your current file. All this file does is add two calls at the beginning (setup_cl_compute() and copy_data_to_device()), change the main loop to call update_cl() with the appropriate cl_mem buffer objects instead of the data pointers, and then call read_back_data() to get the data back for the range calculation. At the end it calls cleanup_cl() to clean up. The program uses the iterations count to determine which buffer to read from and write to (see the get_in_buffer() and get_out_buffer() functions). This way the data can be kept on the device and we just change the kernel arguments to swap buffers.
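The buffer swap by iteration count can be sketched as follows. This is an assumption about how such selection could work, not the code from the starter file: buffer_handle stands in for cl_mem so the sketch needs no OpenCL headers, the handle values are arbitrary, and the sketch_ names are hypothetical.

```c
#include <assert.h>

/* Stand-in for cl_mem so that the sketch builds without OpenCL. */
typedef int buffer_handle;

/* Two device buffers that stay on the device for the whole run.
   The handle values are arbitrary placeholders. */
static buffer_handle sketch_buffers[2] = {100, 200};
static int sketch_iterations = 0;

/* Even iterations read buffer 0, odd iterations read buffer 1. */
buffer_handle sketch_get_in_buffer(void) {
    assert(sketch_iterations >= 0);
    return sketch_buffers[sketch_iterations % 2];
}

/* The output is always the buffer that is not being read. */
buffer_handle sketch_get_out_buffer(void) {
    assert(sketch_iterations >= 0);
    return sketch_buffers[(sketch_iterations + 1) % 2];
}
```

Because only the roles of the two handles change between iterations, no data has to be copied to swap them; the kernel arguments are simply set to the other buffer.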

    // ======== Setup the computation
    setup_cl_compute();
    start_perf_measurement(&write_perf);
    copy_data_to_device(in, out);
    stop_perf_measurement(&write_perf);

    // ======== Compute
    while (range > LIMIT) {
        // Calculation
        start_perf_measurement(&update_perf);
        update_cl(get_in_buffer(), get_out_buffer());
        stop_perf_measurement(&update_perf);

        // Read back the data
        start_perf_measurement(&read_perf);
        read_back_data(get_out_buffer(), out);
        stop_perf_measurement(&read_perf);

        // Compute Range
        start_perf_measurement(&range_perf);
        range = find_range(out, SIZE*SIZE);
        stop_perf_measurement(&range_perf);

        iterations++;
        printf("iteration %d, range=%f.\n", iterations, range);
    }

    // ======== Finish and cleanup OpenCL
    start_perf_measurement(&finish_perf);
    clFinish(opencl_queue);
    stop_perf_measurement(&finish_perf);
    start_perf_measurement(&cleanup_perf);
    cleanup_cl();
    stop_perf_measurement(&cleanup_perf);

The new main loop with the OpenCL setup code moved out. The functions you need to define are in yellow (in the original handout).

Make the changes to get this program working and then get your performance numbers and fill in the next section of the worksheet. Remember to make sure your Final Range Values match so you can be (reasonably) sure that your code is correct!

Don't turn the page until you've finished filling in the worksheet.

4. Overhead Outside of Loop

Now that we've removed the obvious overheads (compilation, setup, cleanup, and copying data to the device) from the main loop, we can start seeing some performance improvements.

[Figure: worksheet section 4 (Overhead Outside of Loop), with per-counter timings for the C-CPU, CL-CPU, and CL-GPU versions, the update speedup vs. C-CPU, the percentage of time spent on data movement and on update, the total speedup and data movement relative to section 3, and a stacked bar chart broken down into Update, Range, Read Data, Compile Program, Write Data, Finish, Cleanup, and Overhead.]

With the overhead removed from the loop we have now actually succeeded in accelerating our code using OpenCL. The CPU version now takes 75% as long as the C version (not so good given that we should be at 50% with two CPUs) and the GPU version is faster.

The CL-CPU version is now 1.8x faster than the initial version and the CL-GPU version is 3.4x faster. Again, depending on the relative speeds of the GPUs and CPUs you may see different improvements, or even none at all. But the most relevant part is that our data movement is now 13% of our total time, down from 47% before. If you were spending a lot of time on compilation you will see a smaller relative decrease. Looking at the change in just data movement time (read plus write time), we see that on the CPU and GPU we are spending only 15% and 19%, respectively, of the time we spent on data movement before. This is a > 5x reduction in data movement time!

So why is the data movement reduced? Well, simply because we're not constantly copying the data to and from the device. We are copying it back each time we do range, but that's a lot less movement than before.
Looking at the data above, the next optimization step for my machine is to get the range computation onto the OpenCL device so we can eliminate the read data and accelerate range. (Of course it would be great to get the update kernel to run even faster, but we're not going to touch it for now.)

But before we do that we're going to take a look at the impact of local dimensions on performance. To do this, open the file main_opencl_5_explore_local.c. This file simply has a new main function that walks through a bunch of local dimensions (stored in locals[][]) and then calls run with them set. You need to copy your code from the last section into here, rename its main function to run, and update the code to use the local dimension chosen by the new main loop. Do this and answer the questions in the worksheet.

Don't turn the page until you've finished filling in the worksheet.

5. Exploring Local Dimensions

When we explicitly set the local dimensions we need to choose values that evenly divide the global dimensions. So for this program, with global dimensions of 4096x4096, we can choose just about any power of 2 by power of 2. However, the total size of the local dimensions is dictated by the hardware resources on the device. For the lab machines and my machine this limits the maximum work-group size to 512; the cluster GPUs have their own limit.

While any value of local dimensions that evenly divides the global dimensions and whose product does not exceed that maximum is supported, the performance may vary dramatically. Remember that the hardware compute units execute multiple threads from their work-groups at the same time. If the size of the work-group is less than the minimum number of threads the compute unit executes at once, that additional hardware will be wasted. In the case of Nvidia hardware, each streaming multiprocessor (compute unit) has 8 hardware cores which execute one thread on every other cycle (16 physical threads at once), and the architecture executes threads in groups of 32. This means that if the compute unit has fewer than 32 threads at any given time it will be wasting processor cores. (Take a look at the slide in the OpenCL lecture.)

[Figure: worksheet section 5 (Exploring Local Dimensions on the GPU), with per-counter timings for local sizes NULL, 1x1, 2x2, 4x4, 8x8, 16x16, and 256x1, and a chart of runtime and device utilization for each local size.]

Performance differences for different local dimension sizes along with device utilization.

Trying out different local dimensions gives us a feeling for how important this is. When we plot them together with utilization we see that we need full utilization to get the highest performance. (Note that our performance may not be limited by utilization, so it may not scale directly. In this case we are limited by non-coalesced accesses at some point.) But on my machine, the performance increases all the way, with the NULL version clearly not giving the best performance. The bottom line is that if you care about the last bit of performance you should not let the runtime try to guess the best local size.

6. Range in a Kernel

Now it is time to put the range calculation into a kernel. The tricky part about calculating the range is that it is a reduction, and, as you remember from the lectures, reductions are limited by synchronization. To make this simple we're going to do the last stage of the reduction on the host. That is, you will write an OpenCL kernel where each work-item will calculate the min and max for part of the data and write those to an OpenCL buffer. This (smaller) buffer will then be read back to the host where the final reduction will be done.

Figure: The range kernel. Each work-item processes some number of items from the update kernel's output and stores the minimum and maximum across those values in the range buffer. (E.g., work-item 0 processes the first 7 items in the data and work-item 1 processes the next 7.) The number of items each work-item processes is determined by the number of work-items and the size of the data to process. The range buffer is then read back to the host, which does the final reduction.

The range kernel therefore takes the output from the update kernel as its input and produces a range_buffer output. To implement this, open the main_opencl_6_range_kernel.c file. This file is similar to what you had in step 4 (before adding the local size experiment) but it has an additional kernel variable (range_kernel), an additional cl_mem variable (range_buffer), and an additional host buffer (range_data) for interacting with the range kernel. You will need to add your range kernel to the kernel.cl file and update the setup_cl_compute() method to create the appropriate range_buffer, range_kernel, and range_data. (Also update cleanup_cl() to clean up after them.) I've provided a skeleton range kernel below for you to start with.
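If the split of work described above is hard to picture, it may help to model it on the host first. The following plain C sketch plays the role of every work-item in turn. It is not the kernel you are asked to write: the function name, the layout of the partial results (min at 2*w, max at 2*w + 1), and seeding min and max from the first element of each chunk are all assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side model of the first stage of the reduction. Each "work-item"
 * w scans its own contiguous chunk of the data and stores that chunk's
 * minimum in partial[2*w] and its maximum in partial[2*w + 1].
 * partial must have room for 2 * num_work_items floats. */
void partial_min_max(const float *data, size_t total_size,
                     size_t num_work_items, float *partial)
{
    /* This sketch assumes the data splits evenly across the work-items. */
    assert(num_work_items > 0 && total_size % num_work_items == 0);
    size_t size_per_work_item = total_size / num_work_items;
    assert(size_per_work_item > 0);

    for (size_t w = 0; w < num_work_items; w++) {
        size_t start = w * size_per_work_item;
        size_t stop = start + size_per_work_item;
        float min = data[start];
        float max = data[start];
        for (size_t i = start + 1; i < stop; i++) {
            if (data[i] < min) min = data[i];
            else if (data[i] > max) max = data[i];
        }
        partial[2 * w] = min;
        partial[2 * w + 1] = max;
    }
}
```

With 14 values and 2 work-items, each one handles 7 values, which matches the example in the figure caption. The host then only has to reduce 4 numbers instead of 14.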
The number of work-items to use for the range kernel is defined in the RANGE_SIZE #define.

    // The number of work items to use to calculate the range
    #define RANGE_SIZE (1024*4)

    // ======== Compute
    while (range > LIMIT) {
        // Calculation
        start_perf_measurement(&update_perf);
        update_cl(get_in_buffer(), get_out_buffer());
        stop_perf_measurement(&update_perf);

        // Range
        start_perf_measurement(&range_perf);
        range_cl(get_out_buffer());
        stop_perf_measurement(&range_perf);

        // Read back the data
        start_perf_measurement(&read_perf);
        read_back_data(range_buffer, range_data);
        stop_perf_measurement(&read_perf);

        // Compute Range
        start_perf_measurement(&reduction_perf);
        range = find_range(range_data, RANGE_SIZE*2);
        stop_perf_measurement(&reduction_perf);

        iterations++;
        printf("iteration %d, range=%f.\n", iterations, range);
    }

The changes to the program for running the range kernel. Instead of reading back the data and then calling range, we call range_cl() (which enqueues the range kernel) and then read back the range_buffer to do the final reduction on the host. Note that you need to make sure the read_back_data() function reads the right amount of data.

    kernel void range(global float *data, int total_size, global float *range) {
        float max, min;
        // Find out which items this work-item processes
        int size_per_workitem = ...
        int start_index = ...
        int stop_index = ...
        // Finds the min/max for our chunk of the data
        min = max = 0.0f;
        for (int i=start_index; i<stop_index; i++) {
            if (...)
                min = ...
            else if (...)
                max = ...
        }
        // Write the min and max back to the range we will return to the host
        range[...] = min;
        range[...] = max;
    }

The range kernel skeleton. The input is the data to process (the output from the last update kernel, set by clSetKernelArg()), the total size of the data to process, and the results are written to the range output.

Figure: This section of the Khronos OpenCL quick reference card will be helpful in filling in the kernel code above.

Now implement the range kernel and change the program to use it. Test the code by verifying that the number of iterations and the final range value are the same as the previous versions. If they are not you have a bug and will have to fix your code. Once you've got it working, fill in the worksheet and continue.

Don't turn the page until you've finished filling in the worksheet.
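On the host side of this step, two details are easy to get wrong: how much data to read back, and what the final reduction has to do with it. The sketch below shows both in plain C. Note that find_range_sketch() and range_buffer_bytes() are hypothetical names used only here; the lab's own find_range() and read_back_data() are not shown in this handout and may be written differently.

```c
#include <assert.h>
#include <stddef.h>

/* Final reduction on the host: the range is the largest value minus the
 * smallest value in the buffer. Because the buffer holds only the
 * per-work-item minima and maxima, scanning all of it gives the same
 * answer no matter where the kernel placed the min and max entries. */
float find_range_sketch(const float *values, size_t count)
{
    assert(values != NULL && count > 0);
    float min = values[0];
    float max = values[0];
    for (size_t i = 1; i < count; i++) {
        if (values[i] < min) min = values[i];
        if (values[i] > max) max = values[i];
    }
    return max - min;
}

/* Number of bytes to read back from the range buffer:
 * one min and one max (two floats) per work-item. */
size_t range_buffer_bytes(size_t num_work_items)
{
    return num_work_items * 2 * sizeof(float);
}
```

This is why the loop above passes RANGE_SIZE*2 to find_range(): the buffer holds two floats per work-item, so the read-back has to cover that many floats as well, not the full 4096x4096 data set.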

7. Coalescing the Range Accesses

So what happened? The performance got worse when you ran the range kernel on the GPU. Why is that? Well, as the questions in part 6 hinted, the way we wrote the range kernel resulted in a lot of uncoalesced memory accesses. (Go back to the lecture notes if you're not certain what this means.)

Table: 6. Range in a Kernel. Timings for C/CPU, CL/CPU, and CL/GPU, with the speedups and the share of time spent on data movement derived from them.

Figure: Putting the range kernel on the OpenCL device actually slowed down the GPU. It did give a small speedup on the CPU, however.

Let's first take a look at the CPU. The CPU code is running faster because we've eliminated the time spent reading the data back. We're still not at a 2x speedup, however.

The GPU code is more disappointing. The resulting application is nearly twice as slow as when we ran the range code on the CPU, even though we've eliminated the data movement. The reason for this is that the way we wrote the range kernel results in uncoalesced memory accesses, which dramatically reduce the available bandwidth.

To understand this, it's important to remember that when you read memory from DRAM you get a large chunk of memory. In the case of a GPU, it's typically at least 384 bits, which is 12 float values. If you only use one of those 12, then you are throwing away 11/12ths (92%) of your bandwidth. The wider the DRAM access, the worse a problem uncoalesced access can be.

Unfortunately coalescing is tricky. Older GPUs have very specific rules for coalescing (each thread has to access the next item in an array and they have to do it at the same time and in order) while newer GPUs have more relaxed rules (any threads that access neighboring data at the same time are coalesced). Luckily our range kernel is simple enough that we can understand what is going on and how to fix it.

You now need to change your range kernel so that each work-item in a work-group accesses an element in the input array that is next to the one accessed by the next work-item in the work-group. Take a look at the picture below to understand.
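The difference between the two access orders comes down to one line of index arithmetic. The sketch below, in plain C on the host, computes which element each work-item touches at a given step and counts how many separate DRAM chunks that step needs. The function names are invented for this illustration, and the model of DRAM as fixed-size aligned chunks is a simplification, not a description of any particular GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Original layout: work-item w owns one contiguous chunk, so at step i
 * neighboring work-items touch elements that are a whole chunk apart. */
size_t chunked_index(size_t w, size_t i, size_t items_per_work_item)
{
    return w * items_per_work_item + i;
}

/* Re-ordered layout: at step i, work-items 0, 1, 2, ... touch
 * neighboring elements, so one DRAM access can serve several of them. */
size_t interleaved_index(size_t w, size_t i, size_t num_work_items)
{
    return i * num_work_items + w;
}

/* Counts how many distinct DRAM segments (of segment_size floats each)
 * the work-items touch at step i. Both layouts produce indices that
 * increase with w, so equal segments are always adjacent and a single
 * pass comparing against the previous segment is enough. */
size_t segments_touched(int interleaved, size_t i, size_t num_work_items,
                        size_t items_per_work_item, size_t segment_size)
{
    assert(num_work_items > 0 && segment_size > 0);
    size_t count = 0;
    size_t previous_segment = 0;
    for (size_t w = 0; w < num_work_items; w++) {
        size_t index = interleaved
            ? interleaved_index(w, i, num_work_items)
            : chunked_index(w, i, items_per_work_item);
        size_t segment = index / segment_size;
        if (w == 0 || segment != previous_segment)
            count++;
        previous_segment = segment;
    }
    return count;
}
```

With 7 work-items, 7 items each, and 4 values per DRAM access, the first step of the original layout touches 7 different segments, while the first step of the re-ordered layout touches only 2.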

Figure: Example of our range kernel. In the original accesses (top) we assume we read 4 values from DRAM on every access, but only use one of them. (The real hardware is far worse in this regard.) In the re-ordered version below, we read 8 values from DRAM on every two accesses, and use 7 of them. This gives us a 3.5x increase in effective bandwidth (7/8 of what we read is used instead of 1/4), and can improve our performance by requiring only 2 memory accesses in place of 7.

Now go and write a second kernel (range_coalesced) that works this way and fill in the next part of the tutorial. Make sure you validate your results by looking at the final range value. Copy your main_opencl_6_range_kernel.c file to a new main_opencl_7_range_coalesced.c file and change it to use the new kernel. Run it and fill in the worksheet. (You can add the new kernel to the same kernel.cl file by calling it range_coalesced instead.)

Don't turn the page until you've finished filling in the worksheet.
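The reason the final range value is a valid check is that changing which elements each work-item reads does not change the overall minimum and maximum. A small host-side model in plain C makes this easy to convince yourself of. The function below is a hypothetical illustration, not the range_coalesced kernel itself: it plays each work-item in turn, reading every num_work_items-th element, and folds the partial results into the final range.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side model of the re-ordered reduction. "Work-item" w reads
 * elements w, w + num_work_items, w + 2*num_work_items, ... and the
 * per-work-item results are then reduced to a single range. */
float interleaved_range(const float *data, size_t total_size,
                        size_t num_work_items)
{
    assert(num_work_items > 0 && num_work_items <= total_size);
    float overall_min = data[0];
    float overall_max = data[0];
    for (size_t w = 0; w < num_work_items; w++) {
        /* Stage one: what work-item w would compute in the kernel. */
        float min = data[w];
        float max = data[w];
        for (size_t i = w + num_work_items; i < total_size;
             i += num_work_items) {
            if (data[i] < min) min = data[i];
            else if (data[i] > max) max = data[i];
        }
        /* Stage two: the host's final reduction over the partial results. */
        if (min < overall_min) overall_min = min;
        if (max > overall_max) overall_max = max;
    }
    return overall_max - overall_min;
}
```

Whatever number of work-items you pass in, the result should equal the range you get from scanning the data in order. If your new kernel gives a different final range than the version from part 6, the indexing is the first place to look.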

After fixing the range kernel to be coalescing we see that the GPU performance is vastly improved (13x faster range compared to uncoalesced, plus reduced data movement). The CPU performance is about the same, but the range kernel itself is actually 10% slower. The reason for this is that the CPU has a cache and hardware prefetcher, so it doesn't benefit much from changing the order of the accesses. In fact, straight linear accesses (the uncoalesced version) are faster on the CPU due to this hardware.

Table: 7. Coalesce the Range Accesses. Timings for C/CPU, CL/CPU, and CL/GPU, with the speedups and the share of time spent on data movement and computation derived from them.

Figure: After coalescing the range accesses we see that we've basically eliminated data movement from our application and achieved a reasonable 3x speedup on a slow GPU.

Overall Speedup

If you click on the Overall Speedup tab at the bottom of the tutorial worksheet you can see the impact of the various optimizations on the application performance. Clearly the biggest improvements came from removing the overhead from inside the loop (not surprising) and coalescing the range kernel. It's important to have this kind of information for your optimizations so you can see the benefit of each step.

Figure: Overall speedups from the changes in the tutorial.

CPU vs. GPU

We've now encountered two cases where the CPU runs more slowly with a GPU-optimized kernel:

1. The GPU update kernel has the overhead of checking the borders each time vs. the C code which skips the borders in the loop.
2. The coalesced GPU range kernel accesses data in a bad order for the CPU.

This points out that the particular kernel you want to use will depend on the hardware. Now you might think you could just change the global dimensions of the GPU kernel to skip the border and then have a kernel that runs great on both architectures. However, while doing this will get you a 2.0x speedup on the CPU, it will get you a huge slowdown on the v1.1 GPUs (lab and my laptop) because the memory accesses will no longer be coalesced. But on the v2.0 GPU (in the cluster) you'd still get great performance because it has more relaxed coalescing rules. So while OpenCL does offer portability, it clearly does not offer performance portability.

Going Faster

How could you make this code faster? Well, the biggest performance issue now is the update kernel. Since there is a lot of data reuse in this kernel we could use the local shared memory to manually cache the input data before we process it. This could potentially speed up the kernel a lot, but would make it far more complicated, and would slow it down on the CPU. (CPUs don't have local memory, so any code that stages data through it is wasted time.) On newer GPUs this will be almost a waste of effort since they have caches, which would do this for us.

Other than that, the data clearly indicates that we're still spending a good deal of time on overhead. To minimize this we need to process larger problem sizes (if they fit on the GPU) and run them for longer (e.g., more iterations). If our problem is larger and takes longer to run, the percent of time spent on overhead will decrease and we'll see a corresponding speedup.

An Aside: How slow is Java?

I re-wrote the simple C version in Java, keeping the code as similar as possible to the C. This was far easier than writing in an ancient, primitive language like C or Fortran. Now most people will tell you that Java is about half as fast as C code, but for this case, it was actually 3.5x faster, thereby beating the GPU implementation. Why? I don't honestly know, and this is a real problem for performance optimization. Java has the potential to do better compilation due to the runtime nature of its JIT (e.g., at runtime it can determine which parts of code to optimize based on how they are being used). So perhaps Java is unrolling the loop, or maybe automatically vectorizing? But on the other hand it has the overhead of array bounds checking and memory management. So who knows what is going on.

However, on top of good performance, Java has very efficient libraries for parallelizing code, which would enable the code to run even faster with a bit more effort. (It would take far less effort to use Java's parallel libraries than to use OpenCL. But there are of course OpenCL interfaces for Java as well.) Regardless, this is a clear indication that you shouldn't discredit Java's ability to run fast. Indeed, combined with the fantastic productivity improvements for the developer, Java is an excellent choice for a large range of applications.

Figure: Comparing the performance of the optimized OpenCL implementations vs. Java. Surprised? Java is really quite good these days.
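To come back to the overhead point made under Going Faster: the claim that longer runs shrink the share of time spent on overhead is just a ratio, and a few lines of plain C show how quickly it falls. This is a simple model with an invented function name; it assumes the overhead is paid once per run (compiling the program, creating buffers, and so on) and that every iteration costs the same, which your own measurements may or may not bear out.

```c
#include <assert.h>

/* Fraction of the total run time spent on fixed overhead when a run
 * does `iterations` passes through the compute loop. overhead_seconds
 * is paid once per run; seconds_per_iteration is the cost of one pass. */
double overhead_fraction(double overhead_seconds,
                         double seconds_per_iteration,
                         unsigned long iterations)
{
    assert(overhead_seconds >= 0.0 && seconds_per_iteration > 0.0);
    assert(iterations > 0);
    double compute_seconds = seconds_per_iteration * (double)iterations;
    return overhead_seconds / (overhead_seconds + compute_seconds);
}
```

As a made-up example, with 1 second of overhead and 1 second per iteration, overhead is half of the run at one iteration and a quarter of it at three. Plugging in the overhead and per-iteration times from your own worksheet will tell you how many iterations you need before the overhead stops mattering.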


Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

Optimizing Application Performance with CUDA Profiling Tools

Optimizing Application Performance with CUDA Profiling Tools Optimizing Application Performance with CUDA Profiling Tools Why Profile? Application Code GPU Compute-Intensive Functions Rest of Sequential CPU Code CPU 100 s of cores 10,000 s of threads Great memory

More information

Hypercosm. Studio. www.hypercosm.com

Hypercosm. Studio. www.hypercosm.com Hypercosm Studio www.hypercosm.com Hypercosm Studio Guide 3 Revision: November 2005 Copyright 2005 Hypercosm LLC All rights reserved. Hypercosm, OMAR, Hypercosm 3D Player, and Hypercosm Studio are trademarks

More information

OpenACC Programming and Best Practices Guide

OpenACC Programming and Best Practices Guide OpenACC Programming and Best Practices Guide June 2015 2015 openacc-standard.org. All Rights Reserved. Contents 1 Introduction 3 Writing Portable Code........................................... 3 What

More information

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?

More information

University of Hull Department of Computer Science. Wrestling with Python Week 01 Playing with Python

University of Hull Department of Computer Science. Wrestling with Python Week 01 Playing with Python Introduction Welcome to our Python sessions. University of Hull Department of Computer Science Wrestling with Python Week 01 Playing with Python Vsn. 1.0 Rob Miles 2013 Please follow the instructions carefully.

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware

More information

SYCL for OpenCL. Andrew Richards, CEO Codeplay & Chair SYCL Working group GDC, March 2014. Copyright Khronos Group 2014 - Page 1

SYCL for OpenCL. Andrew Richards, CEO Codeplay & Chair SYCL Working group GDC, March 2014. Copyright Khronos Group 2014 - Page 1 SYCL for OpenCL Andrew Richards, CEO Codeplay & Chair SYCL Working group GDC, March 2014 Copyright Khronos Group 2014 - Page 1 Where is OpenCL today? OpenCL: supported by a very wide range of platforms

More information

GPUs for Scientific Computing

GPUs for Scientific Computing GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

More information

Parallel Processing and Software Performance. Lukáš Marek

Parallel Processing and Software Performance. Lukáš Marek Parallel Processing and Software Performance Lukáš Marek DISTRIBUTED SYSTEMS RESEARCH GROUP http://dsrg.mff.cuni.cz CHARLES UNIVERSITY PRAGUE Faculty of Mathematics and Physics Benchmarking in parallel

More information

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase

More information

Evaluation of CUDA Fortran for the CFD code Strukti

Evaluation of CUDA Fortran for the CFD code Strukti Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center

More information

CSCI E 98: Managed Environments for the Execution of Programs

CSCI E 98: Managed Environments for the Execution of Programs CSCI E 98: Managed Environments for the Execution of Programs Draft Syllabus Instructor Phil McGachey, PhD Class Time: Mondays beginning Sept. 8, 5:30-7:30 pm Location: 1 Story Street, Room 304. Office

More information

Clustering Billions of Data Points Using GPUs

Clustering Billions of Data Points Using GPUs Clustering Billions of Data Points Using GPUs Ren Wu ren.wu@hp.com Bin Zhang bin.zhang2@hp.com Meichun Hsu meichun.hsu@hp.com ABSTRACT In this paper, we report our research on using GPUs to accelerate

More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

What you should know about: Windows 7. What s changed? Why does it matter to me? Do I have to upgrade? Tim Wakeling

What you should know about: Windows 7. What s changed? Why does it matter to me? Do I have to upgrade? Tim Wakeling What you should know about: Windows 7 What s changed? Why does it matter to me? Do I have to upgrade? Tim Wakeling Contents What s all the fuss about?...1 Different Editions...2 Features...4 Should you

More information

Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration

Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration Jinglin Zhang, Jean François Nezan, Jean-Gabriel Cousin, Erwan Raffin To cite this version: Jinglin Zhang,

More information

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

Rootbeer: Seamlessly using GPUs from Java

Rootbeer: Seamlessly using GPUs from Java Rootbeer: Seamlessly using GPUs from Java Phil Pratt-Szeliga. Dr. Jim Fawcett. Dr. Roy Welch. Syracuse University. Rootbeer Overview and Motivation Rootbeer allows a developer to program a GPU in Java

More information

GDB Tutorial. A Walkthrough with Examples. CMSC 212 - Spring 2009. Last modified March 22, 2009. GDB Tutorial

GDB Tutorial. A Walkthrough with Examples. CMSC 212 - Spring 2009. Last modified March 22, 2009. GDB Tutorial A Walkthrough with Examples CMSC 212 - Spring 2009 Last modified March 22, 2009 What is gdb? GNU Debugger A debugger for several languages, including C and C++ It allows you to inspect what the program

More information

Parallelism and Cloud Computing

Parallelism and Cloud Computing Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication

More information

Contributions to Gang Scheduling

Contributions to Gang Scheduling CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,

More information


ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,

More information

PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)

PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology) PARALLEL JAVASCRIPT Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology) JAVASCRIPT Not connected with Java Scheme and self (dressed in c clothing) Lots of design errors (like automatic semicolon

More information


INSTALLATION GUIDE ENTERPRISE DYNAMICS 9.0 INSTALLATION GUIDE ENTERPRISE DYNAMICS 9.0 PLEASE NOTE PRIOR TO INSTALLING On Windows 8, Windows 7 and Windows Vista you must have Administrator rights to install the software. Installing Enterprise Dynamics

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

Grid Computing for Artificial Intelligence

Grid Computing for Artificial Intelligence Grid Computing for Artificial Intelligence J.M.P. van Waveren May 25th 2007 2007, Id Software, Inc. Abstract To show intelligent behavior in a First Person Shooter (FPS) game an Artificial Intelligence

More information


PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0 PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0 15 th January 2014 Al Chrosny Director, Software Engineering TreeAge Software, Inc. achrosny@treeage.com Andrew Munzer Director, Training and Customer

More information

Tech Tip: Understanding Server Memory Counters

Tech Tip: Understanding Server Memory Counters Tech Tip: Understanding Server Memory Counters Written by Bill Bach, President of Goldstar Software Inc. This tech tip is the second in a series of tips designed to help you understand the way that your

More information

CUDA Basics. Murphy Stein New York University

CUDA Basics. Murphy Stein New York University CUDA Basics Murphy Stein New York University Overview Device Architecture CUDA Programming Model Matrix Transpose in CUDA Further Reading What is CUDA? CUDA stands for: Compute Unified Device Architecture

More information

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 Performance Study VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 VMware VirtualCenter uses a database to store metadata on the state of a VMware Infrastructure environment.

More information

Overview of HPC Resources at Vanderbilt

Overview of HPC Resources at Vanderbilt Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources

More information

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,

More information

A3 Computer Architecture

A3 Computer Architecture A3 Computer Architecture Engineering Science 3rd year A3 Lectures Prof David Murray david.murray@eng.ox.ac.uk www.robots.ox.ac.uk/ dwm/courses/3co Michaelmas 2000 1 / 1 6. Stacks, Subroutines, and Memory

More information

The Mainframe Virtualization Advantage: How to Save Over Million Dollars Using an IBM System z as a Linux Cloud Server

The Mainframe Virtualization Advantage: How to Save Over Million Dollars Using an IBM System z as a Linux Cloud Server Research Report The Mainframe Virtualization Advantage: How to Save Over Million Dollars Using an IBM System z as a Linux Cloud Server Executive Summary Information technology (IT) executives should be

More information

Lecture 10: Dynamic Memory Allocation 1: Into the jaws of malloc()

Lecture 10: Dynamic Memory Allocation 1: Into the jaws of malloc() CS61: Systems Programming and Machine Organization Harvard University, Fall 2009 Lecture 10: Dynamic Memory Allocation 1: Into the jaws of malloc() Prof. Matt Welsh October 6, 2009 Topics for today Dynamic

More information

Practice #3: Receive, Process and Transmit

Practice #3: Receive, Process and Transmit INSTITUTO TECNOLOGICO Y DE ESTUDIOS SUPERIORES DE MONTERREY CAMPUS MONTERREY Pre-Practice: Objective Practice #3: Receive, Process and Transmit Learn how the C compiler works simulating a simple program

More information

Multi-core Programming System Overview

Multi-core Programming System Overview Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,

More information

MEAP Edition Manning Early Access Program Hello! ios Development version 14

MEAP Edition Manning Early Access Program Hello! ios Development version 14 MEAP Edition Manning Early Access Program Hello! ios Development version 14 Copyright 2013 Manning Publications For more information on this and other Manning titles go to www.manning.com brief contents

More information

CSC230 Getting Starting in C. Tyler Bletsch

CSC230 Getting Starting in C. Tyler Bletsch CSC230 Getting Starting in C Tyler Bletsch What is C? The language of UNIX Procedural language (no classes) Low-level access to memory Easy to map to machine language Not much run-time stuff needed Surprisingly

More information

Format string exploitation on windows Using Immunity Debugger / Python. By Abysssec Inc WwW.Abysssec.Com

Format string exploitation on windows Using Immunity Debugger / Python. By Abysssec Inc WwW.Abysssec.Com Format string exploitation on windows Using Immunity Debugger / Python By Abysssec Inc WwW.Abysssec.Com For real beneficiary this post you should have few assembly knowledge and you should know about classic

More information

Efficient Parallel Graph Exploration on Multi-Core CPU and GPU

Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Pervasive Parallelism Laboratory Stanford University Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun Graph and its Applications Graph Fundamental

More information

The continuum of data management techniques for explicitly managed systems

The continuum of data management techniques for explicitly managed systems The continuum of data management techniques for explicitly managed systems Svetozar Miucin, Craig Mustard Simon Fraser University MCES 2013. Montreal Introduction Explicitly Managed Memory systems lack

More information

Turbomachinery CFD on many-core platforms experiences and strategies

Turbomachinery CFD on many-core platforms experiences and strategies Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29

More information

INTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism

INTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism Intel Cilk Plus: A Simple Path to Parallelism Compiler extensions to simplify task and data parallelism Intel Cilk Plus adds simple language extensions to express data and task parallelism to the C and

More information

Pristine s Day Trading Journal...with Strategy Tester and Curve Generator

Pristine s Day Trading Journal...with Strategy Tester and Curve Generator Pristine s Day Trading Journal...with Strategy Tester and Curve Generator User Guide Important Note: Pristine s Day Trading Journal uses macros in an excel file. Macros are an embedded computer code within

More information

Achieving business benefits through automated software testing. By Dr. Mike Bartley, Founder and CEO, TVS (mike@testandverification.

Achieving business benefits through automated software testing. By Dr. Mike Bartley, Founder and CEO, TVS (mike@testandverification. Achieving business benefits through automated software testing By Dr. Mike Bartley, Founder and CEO, TVS (mike@testandverification.com) 1 Introduction During my experience of test automation I have seen

More information

2: Computer Performance

2: Computer Performance 2: Computer Performance http://people.sc.fsu.edu/ jburkardt/presentations/ fdi 2008 lecture2.pdf... John Information Technology Department Virginia Tech... FDI Summer Track V: Parallel Programming 10-12

More information



More information