PDC Summer School Introduction to High-Performance Computing: OpenCL Lab
Instructor: David Black-Schaffer

Introduction

This lab assignment is designed to give you experience leveraging OpenCL to convert a standard C program for parallel execution on multiple CPU and GPU cores. You will start by profiling a model PDE solver. Using those performance results you will move portions of the computation over to an OpenCL GPU device to accelerate the computation. At each step you will analyze the resulting performance and answer a series of questions to provoke you to think about what is going on.

Learning Objectives

1. Write and understand OpenCL code
2. Understand the OpenCL compute and memory models
3. Profile and analyze application performance
4. Understand the impact of local work group size on performance
5. Write your own OpenCL kernel
6. Optimize a simple kernel for coalesced memory accesses

Assumed Background

1. Basic knowledge of C programming, compilation, and debugging on Linux
2. Basic understanding of GPU architectures (from the lecture earlier today)
3. Some exposure to program performance analysis

If you are concerned about your level of background, please make sure you choose a partner whose skills complement your own.

Materials

- Lecture Notes: GPU Architectures for Non-Graphics People
- Lecture Notes: Introduction to OpenCL
- Source code starter files provided on the summer school website
- The OpenCL Specification v. 1.0
- OpenCL Quick Reference Card
- Nvidia CUDA C Programming Guide v. 3.2

The last three pieces of documentation can be found by simply googling their titles. You may need them for the later parts of the lab.

Hardware

The lab machines are equipped with a GeForce 8400GS graphics card with 512MB of memory. These are Nvidia Compute version 1.1 cards with 1 streaming multiprocessor with 8 hardware cores processing 32 threads in a group (warp) running at up to 1.3GHz.
The CPUs on the lab machines are Intel Core2Quad Q9550 (Yorkfield) at 2.83GHz. These are 2 dual-core dies in one package sharing a northbridge, with 6MB of cache on each die.

The cluster machines are equipped with four Tesla C2050 graphics cards with 3GB of memory. These are Nvidia Compute version 2.0 cards with 14 streaming multiprocessors with 32 hardware cores each processing 32 threads in a group (warp) running at up to 1.1GHz. The CPUs on the cluster machines are
dual Intel Xeon E5620s (Westmere-EP) at 2.4GHz. These are 4-core (8-thread) processors with 12MB of L3 cache each, connected by QPI.

The data used in this tutorial document is from an Apple MacBook Pro with a GeForce 9400M with 256MB of memory. It is an Nvidia Compute 1.1 device with 2 streaming multiprocessors with 16 cores each processing 32 threads in a group (warp) running at up to 1.1GHz. The CPU is an Intel Core2Duo P8800 at 2.66GHz with 3MB of L2 cache.

A: Getting Started

To get started you will run the OpenCL Hello World code that was discussed in the lecture.

1. Download the zip file of the code from the PDC website.
2. Unzip it in your home directory on one of the lab machines.
3. Type make hello
4. Run ./hello_world.run

If all goes well you should get one thousand sine calculations printed out to the terminal. To make sure you understand what is going on, go to the code and make and verify the following changes:

1. Change the code to run on the CPU or GPU (make sure it works on both). Note: if your code crashes when you change it to run on the CPU, you probably need to do some error checking. The hello_world program has none. Try looking at the error codes returned by clGetPlatformIDs, clGetDeviceIDs, and clCreateCommandQueue. (You can find out how to get the error codes back by looking at the OpenCL documentation.) What's going on? (You can look up the error IDs in the cl.h file, which can be found online or at the end of this document.)
2. Change the kernel to calculate the cosine of the numbers instead of the sine.
3. Change the size of the array processed to be 1024 times larger, remove the printout of the results, and time (using the time command on the command line) the difference in speed executing on the CPU vs. GPU.
4. Think about the results. Can you explain why one is faster than the other?
5. Now change the program to calculate the cosine without using OpenCL and compare the performance. What do you notice?
B: Introduction to the PDE Solver

Program Overview

This tutorial uses a very simple PDE-solver-like program as a demonstration. The program is provided in a standard C version in the file main_c_1.c. The program works as follows:

int main (int argc, const char * argv[]) {
    float range = BIG_RANGE;
    float *in, *out;
    // ======== Initialize
    create_data(&in, &out);
    // ======== Compute
    while (range > LIMIT) {
        // Calculation
        update(in, out);
        // Compute Range
        range = find_range(out, SIZE*SIZE);
        swap(&in, &out);
    }
}

Basic program flow. (Profiling measurements have been removed for simplicity.)

The program creates two 4096x4096 arrays (in and out) and then performs an update by processing the in array to generate new values for the out array. The out array is then processed to determine the range (maximum value minus minimum value). If the range is too large, the program swaps the arrays (in becomes out and out becomes in) and repeats until the range converges below the limit.

The update calculation takes in the four neighbors (a, b, c, d) along with the central point (e) to calculate the next central value as a weighted sum. This is a basic 5-point stencil operation.

[Figure: the 5-point stencil, showing the central point e and its four neighbors a, b, c, d in the in matrix producing one value in the out matrix.]

The update iterates through the whole in matrix looking at the neighbors to calculate the new value, which is written into the out matrix. For the next update the in and out matrices are swapped, thereby overwriting the old values with the new ones.

void update(float *in, float *out) {
    for (int y=1; y<SIZE-1; y++) {
        for (int x=1; x<SIZE-1; x++) {
            float a = in[SIZE*(y-1)+(x)];
            float b = in[SIZE*(y)+(x-1)];
            float c = in[SIZE*(y+1)+(x)];
            float d = in[SIZE*(y)+(x+1)];
            float e = in[SIZE*y+x];
            out[SIZE*y+x] = (0.1*a+0.2*b+0.2*c+0.1*d+0.4*e);
        }
    }
}

The update calculation is simply a 5-point stencil.

After the update, the program runs a range calculation on the output matrix that simply finds the minimum and maximum values and returns the difference between them as the range.

float find_range(float *data, int size) {
    float max, min;
    max = min = 0.0f;
    // Iterate over the data and find the min/max
    for (int i=0; i<size; i++) {
        if (data[i] < min) min = data[i];
        else if (data[i] > max) max = data[i];
    }
    // Report the range
    return (max-min);
}

Finding the range simply iterates over all the data and keeps track of the minimum and maximum values, and then returns the difference. (This function is in util.c, as it is shared by all
the sample programs.) The program repeats the loop of update/range until the range is below the LIMIT specified in the parameters.h file. Don't change anything in the parameters.h file, as it is used by all programs.

Running the Program

1. Type make 1 to build the C version of the program.
2. Run the program by typing ./1_c.run

When you run the program you will see the output converge after approximately 42 iterations. The program will print out the range at convergence. You should use this value to check that the optimized versions of the code later on in the lab are producing the same results. In addition to the range results, the program will print out profiling information. From this you can see how long the program took in total (Total) and how long the update (Update) and range (Range) calculations took. It also reports the standard deviation, which is a good sanity check as to how reliable your performance measurements are.

1. C
Allocating data 2x (4096x4096 float, 64.0MB).
Range starts as:
Iteration 1, range=
Iteration 42, range=
Total:  Total: ms (Avg: ms, Std. dev.: ms, 1 samples)
Update: Total: ms (Avg: ms, Std. dev.: ms, 42 samples) [ignoring first/last samples]
Range:  Total: ms (Avg: ms, Std. dev.: ms, 42 samples) [ignoring first/last samples]

Output of the C program showing the final range and the profiling information. Note the large standard deviation on the Update measurement (13%) indicating that something else may have been happening on the computer at the same time.

Performance Measurements

Take a look at the code in the main_c_1.c file to see how the application is implemented. The only significant differences from the code listed above are for the performance measurements. As you can see, performance measurements are made using perf variables. These must be initialized first, which is done in init_all_perfs(). The perfs are timers which are started by calling start_perf_measurement() and stopped with stop_perf_measurement().
Each time a start-stop pair is called, the perf records the elapsed time. At the end the results are printed out with the print_perfs() call. Now take a look at the performance numbers you got and decide which part of your code should be optimized with OpenCL first.

C. Accelerating (Decelerating?) with OpenCL

Now that you understand the basic algorithm and have seen the source code, it is time to accelerate your program by using OpenCL. As you can see from the performance data for the C code, the best place to start is by replacing the update() function with an OpenCL one. Ready? If the lectures were any good this shouldn't take more than a few minutes. You can use the hello world file as a starting point.

I'm just kidding. Starting from nothing is a major pain. So I've done this for you in the main_opencl_1.c file, which is summarized below:

int main (int argc, const char * argv[]) {
    float range = BIG_RANGE;
    float *in, *out;
    // ======== Initialize
    init_all_perfs();
    create_data(&in, &out);
    // ======== Setup OpenCL
    setup_cl(argc, argv, &opencl_device, &opencl_context, &opencl_queue);
    // ======== Compute
    while (range > LIMIT) {
        // Calculation
        update_cl(in, out);
        // Compute Range
        range = find_range(out, SIZE*SIZE);
        iterations++;
        swap(&in, &out);
        printf("iteration %d, range=%f.\n", iterations, range);
    }
}

main_opencl_1.c

Note that the only two changes to the high-level algorithm are to set up OpenCL at the beginning (setup_cl) and to call update_cl() instead of update(). We're going to skip the contents of setup_cl() for now. You can take a look at it in opencl_utils.c if you're interested, but all it does is look for the right type of device based on whether you passed in CPU or GPU on the command line and creates a context and queue for you. Instead, take a look at update_cl(). This is where the work is being done.

void update_cl(float *in, float *out) {
    cl_int error;
    // Load the program source
    char* program_text = load_source_file("kernel.cl");
    // Create the program
    cl_program program;
    program = clCreateProgramWithSource(opencl_context, 1, (const char**)&program_text, NULL, &error);
    // Compile the program and check for errors
    error = clBuildProgram(program, 1, &opencl_device, NULL, NULL, NULL);
    // Create the computation kernel
    cl_kernel kernel = clCreateKernel(program, "update", &error);
    // Create the data objects
    cl_mem in_buffer, out_buffer;
    in_buffer = clCreateBuffer(opencl_context, CL_MEM_READ_ONLY, SIZE_BYTES, NULL, &error);
    out_buffer = clCreateBuffer(opencl_context, CL_MEM_WRITE_ONLY, SIZE_BYTES, NULL, &error);
    // Copy data to the device
    error = clEnqueueWriteBuffer(opencl_queue, in_buffer, CL_FALSE, 0, SIZE_BYTES, in, 0, NULL, NULL);
    error = clEnqueueWriteBuffer(opencl_queue, out_buffer, CL_FALSE, 0, SIZE_BYTES, out, 0, NULL, NULL);
    // Set the kernel arguments
    error = clSetKernelArg(kernel, 0, sizeof(in_buffer), &in_buffer);
    error = clSetKernelArg(kernel, 1, sizeof(out_buffer), &out_buffer);
    // Enqueue the kernel
    size_t
global_dimensions[] = {SIZE,SIZE,0};
    error = clEnqueueNDRangeKernel(opencl_queue, kernel, 2, NULL, global_dimensions, NULL, 0, NULL, NULL);
    // Enqueue a read to get the data back
    error = clEnqueueReadBuffer(opencl_queue, out_buffer, CL_FALSE, 0, SIZE_BYTES, out, 0, NULL, NULL);
    // Wait for it to finish
    error = clFinish(opencl_queue);
    // Cleanup
    clReleaseMemObject(out_buffer);
    clReleaseMemObject(in_buffer);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    free(program_text);
}

update_cl() with all the error-checking code removed to make it easier to read.

As you can see, this code is shockingly similar to the code in the Hello World OpenCL program. All it does is:

1. Create a program (loaded from the text file kernel.cl)
2. Build it
3. Create a kernel (update)
4. Create two memory objects (in_buffer and out_buffer)
5. Enqueue a write to write the data to the memory objects (why would we copy data to the out buffer?)
6. Set the kernel arguments
7. Enqueue the kernel execution
8. Enqueue a read to get the data back
9. Wait for the execution to finish
10. Clean up all the resources we allocated

There's really nothing more to it than that. You can open the kernel.cl file and see the update kernel. It is very similar to the update() C code, except that it has to find out which update it should do (get_global_id()) and needs to avoid processing if it is on the edge. (The C version uses loops from 1 to SIZE-1 to avoid the edges.)

kernel void update(global float *in, global float *out) {
    int WIDTH = get_global_size(0);
    int HEIGHT = get_global_size(1);
    // Don't do anything if we are on the edge.
    if (get_global_id(0) == 0 || get_global_id(1) == 0) return;
    if (get_global_id(0) == (WIDTH-1) || get_global_id(1) == (HEIGHT-1)) return;
    int y = get_global_id(1);
    int x = get_global_id(0);
    // Load the data
    float a = in[WIDTH*(y-1)+(x)];
    float b = in[WIDTH*(y)+(x-1)];
    float c = in[WIDTH*(y+1)+(x)];
    float d = in[WIDTH*(y)+(x+1)];
    float e = in[WIDTH*y+x];
    // Do the computation and write back the results
    out[WIDTH*y+x] = (0.1*a+0.2*b+0.2*c+0.1*d+0.4*e);
}

The OpenCL kernel update().

Running the OpenCL version

1. Type make 1 (as before)
2. To run the CPU version type ./1_opencl.run CPU
3. To run the GPU version type ./1_opencl.run GPU

So what did you get? How does the multicore OpenCL CPU version and the OpenCL GPU version compare to the standard C version for speed?
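To convince yourself that the kernel's edge test is equivalent to the C loops from 1 to SIZE-1, here is a small self-contained C sketch (an 8x8 toy grid, not the lab's 4096x4096; all names here are illustrative) that runs the same stencil both ways and compares the results:

```c
#include <string.h>

#define N 8

// Loop version: skips the edges via the loop bounds, like update() in main_c_1.c.
static void update_loop(const float *in, float *out) {
    for (int y = 1; y < N-1; y++)
        for (int x = 1; x < N-1; x++) {
            float a = in[N*(y-1)+x], b = in[N*y+(x-1)];
            float c = in[N*(y+1)+x], d = in[N*y+(x+1)];
            float e = in[N*y+x];
            out[N*y+x] = 0.1f*a + 0.2f*b + 0.2f*c + 0.1f*d + 0.4f*e;
        }
}

// Work-item version: one call per (x,y), skipping the edges with an
// early return, like the OpenCL kernel does with get_global_id().
static void update_item(const float *in, float *out, int x, int y) {
    if (x == 0 || y == 0 || x == N-1 || y == N-1) return;
    float a = in[N*(y-1)+x], b = in[N*y+(x-1)];
    float c = in[N*(y+1)+x], d = in[N*y+(x+1)];
    float e = in[N*y+x];
    out[N*y+x] = 0.1f*a + 0.2f*b + 0.2f*c + 0.1f*d + 0.4f*e;
}

// Run both versions on the same input; return 1 if the outputs match.
static int stencil_versions_agree(void) {
    float in[N*N], out1[N*N], out2[N*N];
    for (int i = 0; i < N*N; i++) in[i] = (float)(i % 7);
    memset(out1, 0, sizeof out1);
    memset(out2, 0, sizeof out2);
    update_loop(in, out1);
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            update_item(in, out2, x, y);
    return memcmp(out1, out2, sizeof out1) == 0;
}
```

The per-item version does the edge comparison for every single point, which is exactly the extra work the kernel pays relative to the C loops.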
D. Tutorial Worksheet

Well, it's kind of hard to make sense of all those performance numbers, so now is the time to open the Tutorial Worksheet file and start filling in some numbers. Indeed, for the rest of the tutorial you'll be using this to track and analyze your performance. Once you've finished implementing the optimizations you'll then submit your code to the cluster. This will give you a second set of performance numbers
which you will then compare to the ones you get on the lab machine to see the impact of different GPUs on performance and optimizations.

Don't Cheat

For you to learn the most you should read the appropriate part of the text here, write and debug the code, fill in the worksheet, and answer the questions in the worksheet, in that order. So when the tutorial says "don't turn the page until you've finished filling in the worksheet" please don't. If you cheat and read ahead you will bias yourself and will learn a lot less. If you get stuck, ask for help before you read ahead.

Benchmarking is hard

You'll be getting timing information running on the lab machines. If you're doing something else at the same time (such as web browsing, filling in a worksheet, etc.) it will impact your measurements. You can use the standard deviation reported in the output to evaluate how much you can trust your measurements. It would be best to fill in the worksheet on your own laptop using Excel, rather than OpenOffice on the lab machines.

Get started

Go ahead and fill in the numbers from your first three runs in the tutorial worksheet. They should go under the 1. Baseline section. Be sure to fill in the Final Range Values to make sure that you're actually computing the same thing on all versions of the code. (It's very easy to get amazing speedups by computing nothing without realizing it.) Now answer the questions for part 1 of the worksheet.

Don't turn the page until you've finished filling in the worksheet.
1. Baseline

Note: the graphs and data shown here are for running on my MacBook Pro. Depending on your hardware and operating system you may see different results, so you're going to have to think for yourself to figure this stuff out. The performance discussion below is meant as a guideline.

[Worksheet screenshot: baseline timing data and a chart comparing the C-CPU, CL-CPU, and CL-GPU runs, with the worksheet questions and sample answers.]

The tutorial worksheet uses light blue to indicate places you need to fill in data, formulas, or analysis. The performance graphs are all normalized to the C-CPU speed (on the left), so bigger bars are worse. From this data the GPU version is 1.8 times slower than the C version, and the OpenCL CPU version is 25% slower. Not so good. Remember, the questions are there to help you learn, so try to think about them rather than just putting something down.

From the graph in the worksheet we can clearly see that we're spending a huge amount of time calculating the update. However, the OpenCL code for update is pretty complicated, so this level of detail is not very helpful. What we need to do is get more detailed performance data. Open the main_opencl_2_profile.c file.
This file is similar to the file from part 1 above, except that it defines a bunch more performance counters. These are:

- program_perf for loading and compiling the program
- create_perf for creating the buffers
- write_perf for writing to the buffers
- read_perf for reading from the buffers
- finish_perf for timing the finish
- cleanup_perf for timing the cleanup
Your job is now to go into the update_cl() method and insert appropriate perf measurement calls to measure the performance of these parts of the program. Go ahead and do this and then run the program with make 2 and ./2_profile.run CPU and ./2_profile.run GPU. Fill in the data in section two of the worksheet and answer the questions.

Note: for each section of this tutorial you use make N and ./N_ CPU and ./N_ GPU to run the code. It is important that you use the file names mentioned here and listed in the Makefile so that you will be able to submit everything to the cluster in the end.

Don't turn the page until you've finished filling in the worksheet.
2. Baseline with Profiling

So what did you find? Is it what you expected? Here's what I got:

[Worksheet screenshot: per-phase timing data and a chart breaking the runs down into Overhead, Cleanup, Finish, Write Data, Create Buffers, Compile Program, Read Data, Range, and Update for the C-CPU, CL-CPU, and CL-GPU runs.]

I've removed the answers to the questions to make it less tempting to look ahead and cheat. But there's a discussion of them in the following text. A few things jump out at me immediately. The first is that I'm spending a huge amount of time in Finish, particularly on the CPU. That's strange, since Finish doesn't actually do any work other than call clFinish(). Of course, if you answered question D (which was discussed in the lectures) you'll remember that OpenCL submits work asynchronously. That is, the work isn't done when you call clEnqueue...(), it's just scheduled. So when you call clFinish(), the runtime will stop your program until all the work is done. Therefore, the timers you put around clEnqueueNDRangeKernel() and clEnqueueReadBuffer(), etc., don't record much (if any) time, because they are just enqueueing the kernel.

Another strange thing is that I see almost no time for compiling the kernel, despite the fact that I'm creating and calling clBuildProgram() 40 times in this program. Your results are probably a bit different if you're not running on a Mac. The reason is that Mac OS X caches programs when you compile them the first time, so each subsequent compilation of the same source code is really fast. If they weren't cached, the program would spend a huge amount of time re-compiling.
Now make a copy of your source file, call it main_opencl_3_profile_finish.c, and add in clFinish() calls as needed for the performance counters you added before. This will make sure that the host waits for the OpenCL commands to finish before continuing with your program. By doing this you will synchronize with the OpenCL device, and your performance measurements on the host side will be accurate. After you have done this, run your program, fill in part 3 of the worksheet, and answer the questions.

Note: OpenCL provides the ability to request events from commands and use those events to get information on when they were submitted to OpenCL, when they started, and when they finished. This is another way (indeed the preferred way) to get this information, but it is trickier.

Don't turn the page until you've finished filling in the worksheet.
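If you do try the event route later, clGetEventProfilingInfo() returns cl_ulong timestamps in nanoseconds (for CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, among others). The conversion to milliseconds is just a division, sketched here with plain integer types so it runs without a device:

```c
// Elapsed time in milliseconds between two OpenCL-style nanosecond
// timestamps (e.g. the CL_PROFILING_COMMAND_START and _END values a
// real program would obtain from clGetEventProfilingInfo).
double event_elapsed_ms(unsigned long long start_ns, unsigned long long end_ns) {
    return (double)(end_ns - start_ns) / 1.0e6;
}
```

Because the timestamps come from the device, this measures the command itself, not the host-side wait, which is what makes events the preferred method.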
3. Baseline with Profiling (and clFinish)

Now the results should make a lot more sense. Finish is now a tiny part of the overall time (nearly 0) and we can see where we are spending time.

[Worksheet screenshot: per-phase timing data with clFinish() synchronization, plus derived figures such as the update speedup and the fraction of time spent on data movement and update.]

With clFinish() forcing the runtime to complete each operation we can now accurately time what is going on. Clearly we are spending a lot of time moving data (write and read) and a good deal of time in cleanup. (Your results are likely to be different depending on your GPU and system!)

For my system, I'm wasting a lot of time in cleanup and moving data (write and read). The update calculation is actually pretty fast and compiling the program is not too bad. Now note that the ratios of these times will depend on the ratio of speeds of the CPU and GPU. If you have a much slower GPU (like the lab machines; roughly half as fast) and a faster CPU (like the lab machines; roughly 50% faster) then you may see that the update time is a lot larger on the GPU.

Now with this data we can start to see some results. The speedup for the update calculation on the OpenCL CPU is 1.82 vs. the C version. This is decent (although not great) considering that we have 2 CPU cores running in OpenCL. The reason it's not 2.0 is that the kernel code has a lot of extra overhead to deal with the edge cases. The C code deals with them once in the loops and is therefore more efficient.
If we look at where our time is being spent, we see that we're spending only 1/3 of our time doing update and nearly 50% of our time doing data movement. (You may see different numbers depending on things like how much of your time you're spending doing compilation.) This kind of information lets us know what optimization we need to do: we need to get the compilation, cleanup, and writing of data out of the main loop, since these things only need to be done once. Then we can keep the data on the OpenCL device and just read back the data for the range calculation.

To do this, open the file main_opencl_4_profile_nooverhead.c and fill in the missing parts. You can get most of the code you'll need from your current file. All this file does is add two calls at the beginning (setup_cl_compute() and copy_data_to_device()) and then change the main loop to call update_cl() with the appropriate cl_mem buffer objects instead of the data pointers, and then call read_back_data() to get the data back for the range calculation. At the end it calls cleanup_cl() to clean up.

The program uses the iterations count to determine which buffer to read from and write to (see the get_in_buffer() and get_out_buffer() functions). This way the data can be kept on the device and we just change the kernel arguments to swap buffers.
// ======== Setup the computation
setup_cl_compute();
start_perf_measurement(&write_perf);
copy_data_to_device(in, out);
stop_perf_measurement(&write_perf);

// ======== Compute
while (range > LIMIT) {
    // Calculation
    start_perf_measurement(&update_perf);
    update_cl(get_in_buffer(), get_out_buffer());
    stop_perf_measurement(&update_perf);
    // Read back the data
    start_perf_measurement(&read_perf);
    read_back_data(get_out_buffer(), out);
    stop_perf_measurement(&read_perf);
    // Compute Range
    start_perf_measurement(&range_perf);
    range = find_range(out, SIZE*SIZE);
    stop_perf_measurement(&range_perf);
    iterations++;
    printf("iteration %d, range=%f.\n", iterations, range);
}

// ======== Finish and cleanup OpenCL
start_perf_measurement(&finish_perf);
clFinish(opencl_queue);
stop_perf_measurement(&finish_perf);
start_perf_measurement(&cleanup_perf);
cleanup_cl();
stop_perf_measurement(&cleanup_perf);

The new main loop with the OpenCL setup code moved out. The functions you need to define are in yellow.

Make the changes to get this program working, then collect your performance numbers and fill in the next section of the worksheet. Remember to make sure your Final Range Values match so you can be (reasonably) sure that your code is correct!

Don't turn the page until you've finished filling in the worksheet.
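The iteration-parity trick behind get_in_buffer()/get_out_buffer() can be modeled without any OpenCL types. In this hypothetical sketch the buffers are just the indices 0 and 1 rather than cl_mem objects, and the names are made up to avoid confusion with the real functions:

```c
// Model of double-buffering by iteration parity: the buffer written in
// one iteration becomes the input of the next, so the data never needs
// to be copied between buffers -- only the kernel arguments change.
static int model_iterations = 0;

int model_in_buffer(void)  { return model_iterations % 2; }
int model_out_buffer(void) { return 1 - model_iterations % 2; }
```

Incrementing the iteration count flips which index each function returns, which is exactly the swap(&in, &out) of the C version expressed as arithmetic.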
4. Overhead Outside of Loop

Now that we've removed the obvious overheads (compilation, setup, cleanup, and copying data to the device) from the main loop, we can start seeing some performance improvements.

[Worksheet screenshot: timing data with the overhead moved out of the loop, including the update speedups and the fraction of time spent on data movement.]

With the overhead removed from the loop we have now actually succeeded in accelerating our code using OpenCL. The CPU version now takes 75% as long as the C version (not so good, given that we should be at 50% with two CPUs) and the GPU version is faster. The CL-CPU version is now 1.8x faster than the initial version and the CL-GPU version is 3.4x faster. Again, depending on the relative speeds of the GPUs and CPUs you may see different improvements, or even none at all. But the most relevant part is that our data movement is now 13% of our total time, down from 47% before. If you were spending a lot of time on compilation you will see a smaller relative decrease.

Looking at the change in just data movement time (read plus write time), we see that we are spending only 15% and 19% of the time, respectively, doing data movement on the CPU and GPU as we were before. This is a > 5x reduction in data movement time! So why is the data movement reduced? Simply because we're not constantly copying the data to and from the device. We are copying it back each time we do range, but that's a lot less movement than before.
Looking at the data above, the next optimization step for my machine is to get the range computation onto the OpenCL device so we can eliminate the read data and accelerate range. (Of course it would be great to get the update kernel to run even faster, but we're not going to touch it for now.) But before we do that, we're going to take a look at the impact of local dimensions on performance.

To do this, open the file main_opencl_5_explore_local.c. This file simply has a new main function that walks through a bunch of local dimensions (stored in locals[][]) and then calls run() with them set. You need to copy your code from the last section into this file, rename its main function to run, and update the code to use the local dimension chosen by the new main loop. Do this and answer the questions in the worksheet.

Don't turn the page until you've finished filling in the worksheet.
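Before running the sweep, note what makes a local size legal at all. This small C check captures the two constraints you will be exploring (the 512 used in the test below is the lab machines' maximum work-group size; a real program would query CL_DEVICE_MAX_WORK_GROUP_SIZE):

```c
// Check whether a local work-group size (lx, ly) is usable for a
// global size of (gx, gy) on a device whose maximum work-group size
// is max_wg: both local dimensions must divide the global dimensions
// evenly, and their product must fit in one work-group.
int local_size_ok(int gx, int gy, int lx, int ly, int max_wg) {
    return gx % lx == 0 && gy % ly == 0 && lx * ly <= max_wg;
}
```

Passing an illegal local size to clEnqueueNDRangeKernel() makes the enqueue fail with an error code, which is another reason to keep the error checking in place.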
5. Exploring Local Dimensions

When we explicitly set the local dimensions we need to choose values that evenly divide the global dimensions. So for this program, with global dimensions of 4096x4096, we can choose just about any power of 2 by power of 2. However, the total size of the local dimensions is dictated by the hardware resources on the device. For the lab machines and my machine this limits the maximum work-group size to 512; the cluster's limit is higher. While any value of local dimensions that evenly divides the global and whose product is less than that maximum is supported, the performance may vary dramatically.

Remember that the hardware compute-units execute multiple threads from their work-groups at the same time. If the size of the work-group is less than the minimum number of threads the compute-unit executes at once, that additional hardware will be wasted. In the case of Nvidia hardware, each streaming multiprocessor (compute-unit) has 8 hardware cores which each execute one thread on every other cycle (16 physical threads at once), and the architecture executes threads in groups of 32. This means that if the compute-unit has fewer than 32 threads at any given time it will be wasting processor cores. (Take a look at the slide in the OpenCL lecture.)

[Worksheet table and chart: total and per-phase times for a range of local dimensions (NULL, 2x2, 4x4, 8x8, 16x16, and larger sizes), together with the resulting device utilization.]
Performance differences for different local dimension sizes, along with device utilization.

Trying out different local dimensions gives us a feeling for how important this is. When we plot them together with utilization we see that we need full utilization to get the highest performance. (Note that our performance may not be limited by utilization, so it may not scale directly. In this case we are limited by non-coalesced accesses at some point.) But on my machine the performance increases all the way, with the NULL version clearly not giving the best performance. The bottom line is that if you care about the last bit of performance you should not let the runtime try to guess the best local size.
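The utilization argument can be made concrete with a little arithmetic: a work-group occupies whole warps whether or not it fills them. A C sketch of that calculation (the warp size of 32 is Nvidia-specific; other vendors differ):

```c
// Fraction of hardware lanes doing useful work for a given work-group
// size, assuming threads are scheduled in warps of `warp` lanes: the
// group occupies ceil(group_size/warp) warps, full or not.
double warp_utilization(int group_size, int warp) {
    int warps = (group_size + warp - 1) / warp;  // warps occupied, rounded up
    return (double)group_size / (double)(warps * warp);
}
```

A 2x2 work-group (4 threads) therefore uses only 1/8 of a warp's lanes, while any multiple of 32 (8x8, 16x16, etc.) achieves full utilization, which matches the trend in the worksheet chart.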
6. Range in a Kernel

Now it is time to put the range calculation into a kernel. The tricky part about calculating the range is that it is a reduction, and, as you remember from the lectures, reductions are limited by synchronization. To make this simple we're going to do the last stage of the reduction on the host. That is, you will write an OpenCL kernel where each work-item will calculate the min and max for part of the data and write those to an OpenCL buffer. This (smaller) buffer will then be read back to the host where the final reduction will be done.

[Figure: the range kernel, showing each work-item reading a contiguous chunk of the update kernel's output and writing one min/max pair to the range buffer.]

The range kernel. Each work-item processes some number of items from the update kernel's output and stores the minimum and maximum across those values in the range buffer. (E.g., work-item 0 processes the first 7 items in the data and work-item 1 processes the next 7.) The number of items each work-item processes is determined by the number of work-items and the size of the data to process. The range buffer is then read back to the host, which does the final reduction.

The range kernel therefore takes the output from the update kernel as its input and produces a range_buffer output. To implement this, open the main_opencl_6_range_kernel.c file. This file is similar to what you had in step 4 (before adding the local size experiment) but it has an additional kernel variable (range_kernel), an additional cl_mem variable (range_buffer), and an additional host buffer (range_data) for interacting with the range kernel. You will need to add your range kernel to the kernel.cl file and update the setup_cl_compute() method to create the appropriate range_buffer, range_kernel, and range_data. (Also update cleanup_cl() to clean up after them.) I've provided a skeleton range kernel below for you to start with.
The number of work-items to use for the range kernel is defined in the RANGE_SIZE #define.

// The number of work items to use to calculate the range
#define RANGE_SIZE 1024*4

// ======== Compute
while (range > LIMIT) {
    // Calculation
    start_perf_measurement(&update_perf);
    update_cl(get_in_buffer(), get_out_buffer());
    stop_perf_measurement(&update_perf);

    // Range
    start_perf_measurement(&range_perf);
    range_cl(get_out_buffer());
    stop_perf_measurement(&range_perf);

    // Read back the data
    start_perf_measurement(&read_perf);
    read_back_data(range_buffer, range_data);
    stop_perf_measurement(&read_perf);

    // Compute Range
    start_perf_measurement(&reduction_perf);
    range = find_range(range_data, RANGE_SIZE*2);
    stop_perf_measurement(&reduction_perf);

    iterations++;
    printf("iteration %d, range=%f.\n", iterations, range);
}

The changes to the program for running the range kernel. Instead of reading back the data and then calling range, we call range_cl() (which enqueues the range kernel) and then read back the range_buffer to do the final reduction on the host.
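The host-side final reduction just scans the min/max pairs the kernel wrote. The real find_range() is provided with the lab; a minimal, layout-independent sketch of what it might do is shown below (this works whatever order the kernel writes its mins and maxes in, since the overall min and max of all the pairs are the overall min and max of the data):

```c
#include <stddef.h>

/* Sketch of the host-side final reduction: scan all of the
 * per-work-item min/max values and return the overall range. */
float find_range(const float *range_data, size_t count)
{
    float min = range_data[0];
    float max = range_data[0];
    for (size_t i = 1; i < count; i++) {
        if (range_data[i] < min) min = range_data[i];
        if (range_data[i] > max) max = range_data[i];
    }
    return max - min;  /* the "range" the convergence loop tests */
}
```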
Note that you need to make sure the read_back_data() function reads the right amount of data.

kernel void range(global float *data, int total_size, global float *range) {
    float max, min;

    // Find out which items this work-item processes
    int size_per_workitem = ...
    int start_index = ...
    int stop_index = ...

    // Find the min/max for our chunk of the data
    min = max = 0.0f;
    for (int i=start_index; i<stop_index; i++) {
        if (...)
            min = ...
        else if (...)
            max = ...
    }

    // Write the min and max back to the range we will return to the host
    range[...] = min;
    range[...] = max;
}

The range kernel skeleton. The input is the data to process (the output from the last update kernel, set by clSetKernelArg()), the total size of the data to process, and the results are written to the range output. The work-item functions section of the Khronos OpenCL quick reference card will be helpful in filling in the kernel code above.

Now implement the range kernel and change the program to use it. Test the code by verifying that the number of iterations and the final range value are the same as in the previous versions. If they are not, you have a bug and will have to fix your code. Once you've got it working, fill in the worksheet and continue.

Don't turn the page until you've finished filling in the worksheet.
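The only tricky part of the skeleton is the index arithmetic. Here is the per-work-item logic written as a plain C function so it can be reasoned about and tested on the host. This is a sketch of one possible completion, not the official solution: in the actual kernel, workitem_id would come from get_global_id(0) and num_workitems from the global size (RANGE_SIZE), and note that it initializes min and max from the first element rather than 0.0f, which the skeleton's initialization would get wrong for all-positive or all-negative chunks.

```c
/* Sketch of what one work-item computes: the min and max over its
 * contiguous chunk of the data, written as one min/max pair into
 * the range buffer. In OpenCL C, workitem_id = get_global_id(0). */
void workitem_range(const float *data, int total_size,
                    int workitem_id, int num_workitems,
                    float *range)
{
    int size_per_workitem = total_size / num_workitems;
    int start_index = workitem_id * size_per_workitem;
    int stop_index = start_index + size_per_workitem;

    float min = data[start_index];
    float max = data[start_index];
    for (int i = start_index + 1; i < stop_index; i++) {
        if (data[i] < min) min = data[i];
        else if (data[i] > max) max = data[i];
    }

    /* One min/max pair per work-item (an assumed layout: min at
     * 2*id, max at 2*id+1, giving RANGE_SIZE*2 floats in total). */
    range[2 * workitem_id] = min;
    range[2 * workitem_id + 1] = max;
}
```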
7. Coalescing the Range Accesses

So what happened? The performance got worse when you ran the range kernel on the GPU. Why is that? Well, as the questions in part 6 hinted, the way we wrote the range kernel resulted in a lot of uncoalesced memory accesses. (Go back to the lecture notes if you're not certain what this means.)

[Table: per-phase timings for the range-in-a-kernel version on the CPU and GPU; the values are not recoverable from the transcription.]

Putting the range kernel on the OpenCL device actually slowed down the GPU. It did give a small speedup on the CPU, however.

Let's first take a look at the CPU. The CPU code is running faster because we've eliminated the time spent reading the data back. We're still not at a 2x speedup, however. The GPU code is more disappointing. The resulting application is nearly twice as slow as when we ran the range code on the CPU, even though we've eliminated the data movement. The reason for this is that the way we wrote the range kernel results in uncoalesced memory accesses, which dramatically reduce the available bandwidth.

To understand this, it's important to remember that when you read memory from DRAM you get a large chunk of memory. In the case of a GPU, it's typically at least 384 bits, which is 12 float values. If you only use one of those 12, then you are throwing away 11/12ths (92%) of your bandwidth. The wider the DRAM interface, the worse a problem uncoalesced access can be.
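The arithmetic behind that waste figure is simple enough to sketch (the helper name is hypothetical; it assumes 32-bit floats and the 384-bit-per-transaction figure from the text):

```c
/* Fraction of DRAM bandwidth wasted when only `used` of the floats
 * delivered by each transaction are actually consumed. */
double wasted_bandwidth(int transaction_bits, int used)
{
    int floats_per_transaction = transaction_bits / 32;  /* 384/32 = 12 */
    return (double)(floats_per_transaction - used) / floats_per_transaction;
}
```

With 384-bit transactions and one useful float per transaction this gives 11/12, or about 92% of the bandwidth thrown away.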
Unfortunately, coalescing is tricky. Older GPUs have very specific rules for coalescing (each thread has to access the next item in an array, and they have to do it at the same time and in order), while newer GPUs have more relaxed rules (any threads that access neighboring data at the same time). Luckily our range kernel is simple enough that we can understand what is going on and how to fix it.

You now need to change your range kernel so that each work-item in a work-group accesses an element in the input array that is next to the one accessed by the next work-item in the work-group. Take a look at the picture below to understand.
[Figure: the original (uncoalesced) and re-ordered (coalesced) access patterns of the range kernel, assuming 4 values are read from DRAM per access.]

Example of our range kernel. In the original accesses (top) we assume we read 4 values from DRAM on every access, but only use one of them. (The real hardware is far worse in this regard.) In the re-ordered version below, we read 8 values from DRAM on every two accesses, and use 7 of them. This gives us a 7x increase in effective bandwidth, and can improve our performance by requiring only 2 memory accesses in place of 7.

Now go and write a second kernel (range_coalesced) that works this way and fill in the next part of the tutorial. Make sure you validate your results by looking at the final range value. Copy your main_opencl_6_range_kernel.c file to a new main_opencl_7_range_coalesced.c file and change it to use the new kernel. Run it and fill in the worksheet. (You can add the new kernel to the same kernel.cl file by calling it range_coalesced instead.)

Don't turn the page until you've finished filling in the worksheet.
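The fix described above is purely in the index arithmetic: instead of each work-item walking its own contiguous chunk, all work-items stride through the array together, so that on iteration i work-item w touches element i*num_workitems + w and neighboring work-items hit neighboring addresses. A plain-C sketch of the two indexing schemes (the helper names are hypothetical, for illustration only):

```c
/* Uncoalesced: work-item w reads its own private contiguous chunk,
 * so adjacent work-items are size_per_workitem elements apart. */
int chunk_index(int w, int i, int size_per_workitem)
{
    return w * size_per_workitem + i;
}

/* Coalesced: on each iteration i, adjacent work-items read
 * adjacent elements, which the hardware can merge into one
 * wide DRAM transaction. */
int strided_index(int w, int i, int num_workitems)
{
    return i * num_workitems + w;
}
```

Note that both schemes touch every element exactly once; only the order changes, which is why the final range value must come out the same.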
After fixing the range kernel to be coalescing we see that the GPU performance is vastly improved (a 13x faster range compared to uncoalesced, plus reduced data movement). The CPU performance is about the same, but the range kernel itself is actually 10% slower. The reason for this is that the CPU has a cache and a hardware prefetcher, so it doesn't benefit much from changing the order of the accesses. In fact, straight linear access (the uncoalesced version) is faster on the CPU because of this hardware.

[Table: per-phase timings for the coalesced range kernel on the CPU and GPU; the values are not recoverable from the transcription.]

After coalescing the range accesses we see that we've basically eliminated data movement from our application and achieved a reasonable 3x speedup on a slow GPU.

Overall Speedup

If you click on the Overall Speedup tab at the bottom of the tutorial worksheet you can see the impact of the various optimizations on the application performance. Clearly the biggest improvements came from removing the overhead from inside the loop (not surprising) and coalescing the range kernel. It's important to have this kind of information for your optimizations so you can see the benefit of each step.
[Figure: overall speedup after each step of the tutorial, for the CPU and GPU devices.]

Overall speedups from the changes in the tutorial.

CPU vs. GPU

We've now encountered two cases where the CPU runs more slowly with a GPU-optimized kernel:
1. The GPU update kernel has the overhead of checking the borders each time, vs. the C code which skips the borders in the loop.
2. The coalesced GPU range kernel accesses data in a bad order for the CPU.
This points out that the particular kernel you want to use will depend on the hardware. Now you might think you could just change the global dimensions of the GPU kernel to skip the border and then have a kernel that runs great on both architectures. However, while doing this will get you a 2.0x speedup on the CPU, it will get you a huge slowdown on the v1.1 GPUs (the lab machines and my laptop) because the memory accesses will no longer be coalesced. On the v2.0 GPUs (in the cluster) you'd still get great performance because they have more relaxed coalescing rules. So while OpenCL does offer portability, it clearly does not offer performance portability.

Going Faster

How could you make this code faster? Well, the biggest performance issue now is the update kernel. Since there is a lot of data reuse in this kernel, we could use the local shared memory to manually cache the input data before we process it. This could potentially speed up the kernel a lot, but it would make the kernel far more complicated, and it would slow it down on the CPU. (CPUs don't have local memory, so any code that accesses it is wasted time.) On newer GPUs this would be almost a waste of effort, since they have caches that would do this for us.

Other than that, the data clearly indicates that we're still spending a good deal of time on overhead. To minimize this we need to process larger problem sizes (if they fit on the GPU) and run them for longer (e.g., more iterations). If our problem is larger and takes longer to run, the percentage of time spent on overhead will decrease and we'll see a corresponding speedup.

An Aside: How slow is Java?

I re-wrote the simple C version in Java, keeping the code as similar as possible to the C. This was far easier than writing in an ancient, primitive language like C or Fortran. Now most people will tell you that Java is about half as fast as C code, but in this case it was actually 3.5x faster, thereby beating the GPU implementation. Why?
I don't honestly know, and this is a real problem for performance optimization. Java has the potential to do better compilation due to the runtime nature of its JIT (e.g., at runtime it can determine which parts of the code to optimize based on how they are being used). So perhaps the Java is unrolling the loop, or maybe automatically vectorizing? But on the other hand, it has the overhead of array bounds checking and memory management. So who knows what is going on.

However, on top of good performance, Java has very efficient libraries for parallelizing code, which would enable the code to run even faster with a bit more effort. (It would take far less effort to use Java's parallel libraries than to use OpenCL. But there are of course OpenCL interfaces for Java as well.) Regardless, this is a clear indication that you shouldn't discredit Java's ability to run fast. Indeed, combined with the fantastic productivity improvements for the developer, Java is an excellent choice for a large range of applications.

[Figure: runtime of the simple C version and the optimized OpenCL implementations (CPU and GPU) vs. the Java version.]

Comparing the performance of the optimized OpenCL implementations vs. Java. Surprised? Java is really quite good these days.
More informationApplications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61
F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase
More informationEvaluation of CUDA Fortran for the CFD code Strukti
Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center
More informationCSCI E 98: Managed Environments for the Execution of Programs
CSCI E 98: Managed Environments for the Execution of Programs Draft Syllabus Instructor Phil McGachey, PhD Class Time: Mondays beginning Sept. 8, 5:30-7:30 pm Location: 1 Story Street, Room 304. Office
More informationClustering Billions of Data Points Using GPUs
Clustering Billions of Data Points Using GPUs Ren Wu ren.wu@hp.com Bin Zhang bin.zhang2@hp.com Meichun Hsu meichun.hsu@hp.com ABSTRACT In this paper, we report our research on using GPUs to accelerate
More informationParallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
More informationRevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
More informationWhat you should know about: Windows 7. What s changed? Why does it matter to me? Do I have to upgrade? Tim Wakeling
What you should know about: Windows 7 What s changed? Why does it matter to me? Do I have to upgrade? Tim Wakeling Contents What s all the fuss about?...1 Different Editions...2 Features...4 Should you
More informationImplementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration
Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration Jinglin Zhang, Jean François Nezan, Jean-Gabriel Cousin, Erwan Raffin To cite this version: Jinglin Zhang,
More informationOverview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket
More informationStream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
More informationRootbeer: Seamlessly using GPUs from Java
Rootbeer: Seamlessly using GPUs from Java Phil Pratt-Szeliga. Dr. Jim Fawcett. Dr. Roy Welch. Syracuse University. Rootbeer Overview and Motivation Rootbeer allows a developer to program a GPU in Java
More informationGDB Tutorial. A Walkthrough with Examples. CMSC 212 - Spring 2009. Last modified March 22, 2009. GDB Tutorial
A Walkthrough with Examples CMSC 212 - Spring 2009 Last modified March 22, 2009 What is gdb? GNU Debugger A debugger for several languages, including C and C++ It allows you to inspect what the program
More informationParallelism and Cloud Computing
Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication
More informationContributions to Gang Scheduling
CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,
More informationANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING
ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,
More informationPARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)
PARALLEL JAVASCRIPT Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology) JAVASCRIPT Not connected with Java Scheme and self (dressed in c clothing) Lots of design errors (like automatic semicolon
More informationINSTALLATION GUIDE ENTERPRISE DYNAMICS 9.0
INSTALLATION GUIDE ENTERPRISE DYNAMICS 9.0 PLEASE NOTE PRIOR TO INSTALLING On Windows 8, Windows 7 and Windows Vista you must have Administrator rights to install the software. Installing Enterprise Dynamics
More informationParallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
More informationGrid Computing for Artificial Intelligence
Grid Computing for Artificial Intelligence J.M.P. van Waveren May 25th 2007 2007, Id Software, Inc. Abstract To show intelligent behavior in a First Person Shooter (FPS) game an Artificial Intelligence
More informationPERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0
PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0 15 th January 2014 Al Chrosny Director, Software Engineering TreeAge Software, Inc. achrosny@treeage.com Andrew Munzer Director, Training and Customer
More informationTech Tip: Understanding Server Memory Counters
Tech Tip: Understanding Server Memory Counters Written by Bill Bach, President of Goldstar Software Inc. This tech tip is the second in a series of tips designed to help you understand the way that your
More informationCUDA Basics. Murphy Stein New York University
CUDA Basics Murphy Stein New York University Overview Device Architecture CUDA Programming Model Matrix Transpose in CUDA Further Reading What is CUDA? CUDA stands for: Compute Unified Device Architecture
More informationVirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5
Performance Study VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 VMware VirtualCenter uses a database to store metadata on the state of a VMware Infrastructure environment.
More informationOverview of HPC Resources at Vanderbilt
Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources
More informationE6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices
E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,
More informationA3 Computer Architecture
A3 Computer Architecture Engineering Science 3rd year A3 Lectures Prof David Murray david.murray@eng.ox.ac.uk www.robots.ox.ac.uk/ dwm/courses/3co Michaelmas 2000 1 / 1 6. Stacks, Subroutines, and Memory
More informationThe Mainframe Virtualization Advantage: How to Save Over Million Dollars Using an IBM System z as a Linux Cloud Server
Research Report The Mainframe Virtualization Advantage: How to Save Over Million Dollars Using an IBM System z as a Linux Cloud Server Executive Summary Information technology (IT) executives should be
More informationLecture 10: Dynamic Memory Allocation 1: Into the jaws of malloc()
CS61: Systems Programming and Machine Organization Harvard University, Fall 2009 Lecture 10: Dynamic Memory Allocation 1: Into the jaws of malloc() Prof. Matt Welsh October 6, 2009 Topics for today Dynamic
More informationPractice #3: Receive, Process and Transmit
INSTITUTO TECNOLOGICO Y DE ESTUDIOS SUPERIORES DE MONTERREY CAMPUS MONTERREY Pre-Practice: Objective Practice #3: Receive, Process and Transmit Learn how the C compiler works simulating a simple program
More informationMulti-core Programming System Overview
Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,
More informationMEAP Edition Manning Early Access Program Hello! ios Development version 14
MEAP Edition Manning Early Access Program Hello! ios Development version 14 Copyright 2013 Manning Publications For more information on this and other Manning titles go to www.manning.com brief contents
More informationCSC230 Getting Starting in C. Tyler Bletsch
CSC230 Getting Starting in C Tyler Bletsch What is C? The language of UNIX Procedural language (no classes) Low-level access to memory Easy to map to machine language Not much run-time stuff needed Surprisingly
More informationFormat string exploitation on windows Using Immunity Debugger / Python. By Abysssec Inc WwW.Abysssec.Com
Format string exploitation on windows Using Immunity Debugger / Python By Abysssec Inc WwW.Abysssec.Com For real beneficiary this post you should have few assembly knowledge and you should know about classic
More informationEfficient Parallel Graph Exploration on Multi-Core CPU and GPU
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Pervasive Parallelism Laboratory Stanford University Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun Graph and its Applications Graph Fundamental
More informationThe continuum of data management techniques for explicitly managed systems
The continuum of data management techniques for explicitly managed systems Svetozar Miucin, Craig Mustard Simon Fraser University MCES 2013. Montreal Introduction Explicitly Managed Memory systems lack
More informationTurbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
More informationINTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism
Intel Cilk Plus: A Simple Path to Parallelism Compiler extensions to simplify task and data parallelism Intel Cilk Plus adds simple language extensions to express data and task parallelism to the C and
More informationPristine s Day Trading Journal...with Strategy Tester and Curve Generator
Pristine s Day Trading Journal...with Strategy Tester and Curve Generator User Guide Important Note: Pristine s Day Trading Journal uses macros in an excel file. Macros are an embedded computer code within
More informationAchieving business benefits through automated software testing. By Dr. Mike Bartley, Founder and CEO, TVS (mike@testandverification.
Achieving business benefits through automated software testing By Dr. Mike Bartley, Founder and CEO, TVS (mike@testandverification.com) 1 Introduction During my experience of test automation I have seen
More information2: Computer Performance
2: Computer Performance http://people.sc.fsu.edu/ jburkardt/presentations/ fdi 2008 lecture2.pdf... John Information Technology Department Virginia Tech... FDI Summer Track V: Parallel Programming 10-12
More informationSUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS - 2013 UPDATE
SUBJECT: SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE KEYWORDS:, CORE, PROCESSOR, GRAPHICS, DRIVER, RAM, STORAGE SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE Below is a summary of key components of an ideal SolidWorks
More information