Data-parallel Acceleration of PARSEC Black-Scholes Benchmark


Data-parallel Acceleration of PARSEC Black-Scholes Benchmark
AUGUST ANDRÉN and PATRIK HAGERNÄS
KTH Information and Communication Technology
Bachelor of Science Thesis
Stockholm, Sweden 2013
TRITA-ICT-EX-2013:158


Data-parallel acceleration of PARSEC Black-Scholes benchmark
Patrik Hagernäs
August Andrén
Stockholm 2013
Parallel programming
School of Information and Communication Technology
Royal Institute of Technology
IR-EE number

Abstract

The way programmers have been relying on processor improvements to gain speedup in their applications is no longer applicable in the same fashion. Programmers usually have to parallelize their code to utilize the CPU cores in the system to gain a significant speedup. To accelerate parallel applications further, there are a couple of techniques available. One technique is to vectorize some of the parallel code. Another technique is to move parts of the parallel code to the GPGPU and utilize this highly multithreaded unit of the system. The main focus of this report is to accelerate the data-parallel workload Black-Scholes of the PARSEC benchmark suite. We are going to compare three accelerations of this workload: using vector instructions in the CPU, using the GPGPU, and using a combination of them both. The two fundamental aspects are the speedup and how much programming effort each technique requires. To accelerate with vectorization in the CPU we use the SSE & AVX instruction sets, and to accelerate the workload on the GPGPU we use OpenACC.

Contents

1 Introduction
   1.1 Background
   1.2 Problem description
       1.2.1 Problem
       1.2.2 Problem statement
   1.3 Purpose
   1.4 Goal
   1.5 Method
   1.6 Limitations
2 Theoretic background
   2.1 PARSEC
       2.1.1 PARSEC benchmark suite
       2.1.2 Black-Scholes
   2.2 Vector instructions
       2.2.1 SIMD - Single Instruction, Multiple Data
       2.2.2 SSE - Streaming SIMD Extensions
       2.2.3 AVX - Advanced Vector Extensions
   2.3 GPGPU - General Purpose GPU
       2.3.1 CUDA
       2.3.2 OpenACC
3 Methodology
   3.1 Vector instructions
   3.2 GPGPU
       3.2.1 OpenACC
       3.2.2 Larger input
   3.3 Combined vector instructions with GPGPU

4 Result
   4.1 Vector instructions
   4.2 GPGPU acceleration
   4.3 Compare GPU to vector instructions
   4.4 Combined GPU with vector instructions
5 Summary & Conclusion
   5.1 Conclusion
       5.1.1 Vectorization
       5.1.2 GPGPU - OpenACC
       5.1.3 Combined CPU and GPGPU
       5.1.4 General conclusions
   5.2 Summary
   5.3 Future research
Appendix
   GPGPU - OpenACC: blackscholes-acc.c
   Vector instructions - SSE: blackscholes-simd.c
   Combined - OpenACC & SSE: blackscholes-combined.c

List of Figures

2.1 Single Instruction Single Data vs Single Instruction Multiple Data
2.2 YMM and XMM registers
2.3 Structure of a CUDA kernel
3.1 SSE and AVX functions used to test vector instructions
3.2 Definition set for SSE and AVX
3.3 Alignment for SSE and AVX. Note that the _MM_ALIGN16 definition is present in more than one place in the code
4.1 Graph to establish how well vector instructions work with threads for SSE and different sizes of data. Data span: Large
4.2 Graph to establish how well vector instructions work with threads for AVX and different sizes of data. Data span: Large
4.3 Graph to establish how well vector instructions work with threads for AVX and different sizes of data. Data span: Small - Medium
4.4 The ratio of how much time the application spends on transferring data and how much it actually spends on the algorithm
4.5 SSE speedup & GPGPU speedup
4.6 Execution time of the CPU & the GPU kernel
4.7 Execution time of combined Black-Scholes & GPGPU-accelerated Black-Scholes prior to and after the GPGPU has reached maximum capacity
4.8 Execution time in seconds of all three versions of Black-Scholes used in the research

List of Tables

2.1 PARSEC workloads
4.1 SSE
4.2 AVX
4.3 OpenACC - Zorn vs Single GPU system
5.1 Programming effort

1 Introduction

This section will focus on introducing the background of the problem statement, and the goals and purpose of the research. It will also include what tools have been used and the delimitations of this research.

1.1 Background

Computer industries have for a very long time relied on processor improvements that led to increases in clock cycles per second. These processor improvements can be connected to the prediction known as Moore's Law. As early as 1965, Gordon Moore made this prediction: he predicted that the number of transistors on a processor chip would double every twelfth month. Nowadays it is more commonly quoted as every eighteenth month or even every second year, but it is still considered to be accurate [3]. These processor improvements were a reality for developers, and they could rely on them for their applications to gain speedup. This went on until processors reached a clock frequency limit which was hard to push further without running into serious thermal problems. To maintain speedup in processors, the industry set a new standard of using multicore processors. Multicore processors as a standard did not come without a sacrifice. Developers could no longer rely on the processors to improve and wait for an increased clock frequency to achieve speedup. Software had to be parallelised in order to utilize the processor and achieve the desired speedup. This parallelization was not always considered to be an easy task. There are a lot of techniques for making a program run in parallel, and more are still being developed. These techniques are based on different ways of threading the program, dividing the workload and utilizing the hardware. Developers are constantly trying to adapt to these new techniques to achieve speedup in their applications [7].

1.2 Problem description

Many of today's parallel applications cannot be fully utilized because they are limited by the number of cores in the CPU. There are techniques which can be used to achieve a speedup without increasing the number of cores, instead utilizing the advanced extended CPU registers. Speedup could also be achieved by moving parts of the workload from the CPU to the GPGPU (General Purpose Graphics Processing Unit).

1.2.1 Problem

The main focus of this study is to utilize the PARSEC benchmark and the Black-Scholes workload in a heterogeneous system, using both vector instructions in the CPU as well as moving part of the workload to the GPGPU, and to achieve a speedup compared to the already parallelized code. The reason for using the Black-Scholes workload from the PARSEC benchmark suite is that it is already data-parallel code which can easily be vectorized and GPGPU accelerated. There are already many comparisons between running data-parallel code on the CPU and running it on the GPGPU. Most of the time these comparisons are made between a non-vectorized data-parallel algorithm running on the CPU and an equivalent algorithm on the GPGPU; these conditions are not considered to be fair. In some cases the comparisons have been remade using vectorized code and the results did come out differently.

1.2.2 Problem statement

Which technique is most efficient for increasing the speed of a data-parallel benchmark application within the PARSEC suite: vector instructions, accelerating it on the GPGPU, or a combination of both?

1.3 Purpose

The purpose of this study is to focus on two fundamental acceleration techniques: vectorization in the CPU and running code on the GPGPU. We apply these techniques to a data-parallel application from the PARSEC benchmark suite to determine the optimal solution for speedup and performance acceleration. This research will also look into how much programming effort is required by each technique to create a relevant and functional program.

1.4 Goal

The goal of this study is to manage to run the PARSEC benchmark Black-Scholes with both vectorization and GPGPU acceleration and get a speedup. Hopefully the result will help developers form some sort of general idea of when it is good to vectorize the code, use the GPGPU, or perhaps use a combination of them both.

1.5 Method

This research is not about reinventing the wheel, and we will try, if possible, not to rewrite any code that has been written by people with greater knowledge in this area. We must however learn these techniques and understand the code we manage to find, or be able to write the code that we could not find. To help us achieve the goals, we have access to a four-core Intel i7 QM processor with AVX/SSE support, a twelve-core AMD Opteron 6172 with SSE instructions, an Nvidia GTX 680 graphics card, and a cluster of graphics cards with Nvidia Tesla M2090, Nvidia GeForce GTX 580 and Nvidia Tesla C2050. In order to apply these techniques we have read previous research reports, tutorials and manuals. Our research materials and manuals have been obtained mostly from the archives of the developers of these different techniques. To obtain our results we have programmed in C using GPGPU and vector instruction APIs with different compilers such as g++, ICC, PGI, nvcc and Visual Studio. We started programming from the beginning with smaller programs in each technique and then moved forward to more advanced programs. Eventually we applied the techniques to the desired application. Using timing operations and estimating the effort needed to complete the task, we could determine our results in terms of acceleration and programming effort.

1.6 Limitations

This report may not include fully optimized code, which has been determined through tests, due to our lack of experience with these techniques. The performance tests may have been run at different levels of optimization due to the difficulty of applying the techniques. The hardware architectures cannot really be compared equally, which means that there cannot be any guarantee that the results are completely fair.

2 Theoretic background

This research is based on accelerating applications using different acceleration techniques. The techniques used in this study are vector instructions as well as usage of the GPGPU. To limit the area of interest, the main focus was set on a few specific techniques within these areas. This section will introduce both of the areas and the techniques chosen for this study. It will also explain PARSEC (The Princeton Application Repository for Shared-Memory Computers) and compare the PARSEC benchmark suite against other benchmark suites.

2.1 PARSEC

PARSEC (The Princeton Application Repository for Shared-Memory Computers) is one of the more popular benchmark suites within parallel programming. PARSEC has earned this popularity because it provides a large and diverse repository of applications. All these applications have been selected from several application domains and they cover different areas in parallel programming. In the latest version of PARSEC (3.0) the benchmark suite contains thirteen different workloads. Every single one of these workloads has been parallelised using different techniques [2]. When learning about PARSEC, other benchmarks are often mentioned. The most commonly mentioned benchmark is SPLASH-2. SPLASH-2 is a benchmark suite that also focuses on parallel applications. Naturally this has led to a lot of comparisons between PARSEC and SPLASH-2. There is no conclusion determining which of these benchmarks is better than the other. SPLASH-2 is an older benchmark suite, started in 1990, and its applications are outdated in some respects. Despite the younger and more diverse PARSEC benchmark suite, SPLASH-2 with its old programs is still useful depending on model and research area [1][2].

Program        Application Domain   Parallelization Model   Granularity   Working Set   Data Sharing   Data Exchange
blackscholes   Financial Analysis   data-parallel           coarse        small         low            low
bodytrack      Computer Vision      data-parallel           medium        medium        high           medium
canneal        Engineering          unstructured            fine          unbounded     high           high
dedup          Enterprise Storage   pipeline                medium        unbounded     high           high
facesim        Animation            data-parallel           coarse        large         low            medium
ferret         Similarity Search    pipeline                medium        unbounded     high           high
fluidanimate   Animation            data-parallel           fine          large         low            medium
freqmine       Data Mining          data-parallel           medium        unbounded     high           medium
raytrace       Rendering            data-parallel           medium        unbounded     high           low
streamcluster  Data Mining          data-parallel           medium        medium        low            medium
swaptions      Financial Analysis   data-parallel           coarse        medium        low            low
vips           Media Processing     data-parallel           coarse        medium        low            medium
x264           Media Processing     pipeline                coarse        medium        high           high

Table 2.1: PARSEC workloads

2.1.1 PARSEC benchmark suite

PARSEC contains thirteen applications, as previously mentioned, which can be seen in table 2.1. Each of these is an application in a specific area of interest. What makes these applications so valuable is the fact that they are all state-of-the-art applications within their area. Each workload is parallelised in multiple ways, which enables various benchmark studies [2][1]. Every workload is interesting, but this research will focus on Black-Scholes.

2.1.2 Black-Scholes

Black-Scholes is the workload in PARSEC that this study is trying to accelerate. It is an application that comes from a mathematical model for pricing investment instruments on the financial market. The application is data-parallel, which is very convenient since it does not need to be made data-parallel before accelerating it. The parallel part of Black-Scholes consists of only one for loop that contains the Black-Scholes algorithm. Parallelizing it with OpenMP is therefore very simple, and the flow of the application consists of three basic parts: copy in data, the parallel for loop with the Black-Scholes algorithm, and copy out the result.

2.2 Vector instructions

This section will introduce vector instructions used by the CPU. To explain in more detail how these can be used, we will introduce one architecture and two vector instruction techniques that we used in our research: the SIMD (Single Instruction, Multiple Data) architecture, and the SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) techniques.

2.2.1 SIMD - Single Instruction, Multiple Data

SIMD is a CPU architecture which allows instructions to be vectorized. The SIMD architecture allows one CPU instruction to work on multiple blocks of data simultaneously. For instance, you may need to perform an operation on all elements of an array; with SIMD you can then work on large chunks of data instead of one element at a time. The number of cores in the system sets the limit for the level of parallelism.

Figure 2.1: SISD vs SIMD

SIMD is an architecture that has been around for some time now and it is based on floating-point calculations. Earlier processors were not capable of handling floating point, which meant that floating-point calculations had to be done in a separate unit. Floating-point calculations were in high demand, and it was not long before floating-point support was introduced in the processors. Along with the introduction of floating-point calculation capabilities in the processors, new classifications were introduced; one of these was SIMD [4].
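To make this concrete, here is a small sketch (our own illustration, not code from the thesis) contrasting a scalar loop with an SSE version that processes four floats per instruction; the function and array names are hypothetical.

#include <xmmintrin.h> /* SSE intrinsics */

/* Scalar (SISD): one addition per loop iteration. */
void add_scalar(float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}

/* SIMD (SSE): four additions per loop iteration.
   Assumes n is a multiple of 4 and a, b are 16-byte aligned. */
void add_sse(float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);          /* load 4 floats into an XMM register */
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&a[i], _mm_add_ps(va, vb)); /* add and store 4 floats at once */
    }
}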

In order to use SIMD, the developer needs to invoke vector instructions so that the compiler converts the instructions correctly. A correct conversion can be seen in the assembler code, where one instruction operates on multiple data elements. Otherwise the compiler will convert the code to the more common SISD (Single Instruction, Single Data) form, where each instruction can only handle one set of data. Note that this can also be done in parallel, depending on the algorithm's parallelism and the number of available cores in the processor [6][4].

2.2.2 SSE - Streaming SIMD Extensions

SSE is one of the vector instruction sets that has been developed as an evolution of SIMD. SSE was not the first instruction set implemented with the SIMD architecture; before that, Intel had implemented an instruction set called MMX. The MMX instructions were later reworked into what is now known as the first version of SSE. SSE has since been updated through several versions that implemented functions based on user feedback. The most commonly used version is SSE3, which came with the second generation family of Intel processors [9]. In modern processors there are certain registers used for the purpose of vector instructions such as SSE. These are commonly known as XMM registers (as can be seen in figure 2.2) and can contain 128 bits. Using these registers instead of the 64- or 32-bit registers (depending on CPU architecture), you can scale down two to four instructions into one single instruction if done properly [8].

2.2.3 AVX - Advanced Vector Extensions

AVX is an advanced version of the previously mentioned SSE. This new instruction set has been developed with an extended register architecture. As mentioned in the SSE section, there are registers named XMM which are 128-bit registers. The new register architecture has 256-bit registers called YMM, where the lower 128 bits are the formerly known XMM registers (see figure 2.2) [5][6].
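As an illustration (ours, not the thesis's), the same addition expressed with SSE and with AVX intrinsics; the AVX version uses the 256-bit YMM registers and therefore handles eight floats per instruction instead of four.

#include <immintrin.h> /* AVX intrinsics (also pulls in SSE) */

void sse_vs_avx(void) {
    __m128 x4 = _mm_set1_ps(1.0f);     /* 128-bit XMM register holding 4 floats */
    __m128 y4 = _mm_add_ps(x4, x4);    /* one SSE instruction: 4 additions */

    __m256 x8 = _mm256_set1_ps(1.0f);  /* 256-bit YMM register holding 8 floats */
    __m256 y8 = _mm256_add_ps(x8, x8); /* one AVX instruction: 8 additions */

    (void)y4; (void)y8;                /* silence unused-variable warnings */
}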

Figure 2.2: YMM and XMM registers

2.3 GPGPU - General Purpose GPU

GPGPU is a term that was coined when it became advantageous to run parallel code on the GPU that has nothing to do with graphics. The advantage of a GPU over an ordinary CPU is its multithreading capability: a GPU can multithread much better and with many more threads. The GPU cores also run at a lower frequency, generating less heat. This is what many parallel applications are looking for.

2.3.1 CUDA

CUDA is an environment to use when programming in parallel. CUDA provides a full C++ compiler with the purpose of scaling code to hundreds of cores and thousands of threads on a GPU. With CUDA you enable heterogeneous systems, meaning you can combine both the CPU and the GPU within the code. In detail, the main serial thread runs on the CPU and the parallel parts of the code run on the GPU. These parallel parts are called kernels, and the GPU can run multiple kernels at the same time. The kernels run as a grid of blocks of threads, as shown in figure 2.3.

Figure 2.3: CUDA kernel. The grid contains three blocks; each block contains threads.
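As a concrete illustration (our sketch, not code from the thesis), a trivial CUDA kernel that adds two arrays; the kernel name, array names and launch configuration are hypothetical, and the thread-index computation it uses is explained in the text below.

__global__ void add(const float *a, const float *b, float *c, int n) {
    /* global thread index: block index * threads per block + thread index */
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                   /* guard against the last, partially filled block */
        c[idx] = a[idx] + b[idx];
}

/* Launched as a grid of blocks of threads, for example:  */
/*   add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);     */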

Communication between threads in parallel code is available within the same block. The communication is achieved through shared memory and synchronization between the threads. The way a thread knows its own index in the kernel is by taking its block index, multiplying it by the block dimension (the number of threads per block) and adding the thread index, as shown in the code below.

int idx = blockIdx.x * blockDim.x + threadIdx.x;

There are ways to optimize CUDA code and reduce overheads: memory coalescing, shared memory, cache-efficient texture memory accesses, loop unrolling, parallel reduction and page-locked memory allocation.

2.3.2 OpenACC

OpenACC is a programming standard that was developed to make it easier to program in parallel on heterogeneous systems. Accelerating an application on the GPGPU with plain CUDA can be very painful if the programmer is new to these concepts. OpenACC can be used as a bridge between programming in parallel on the CPU and programming plain CUDA on the GPGPU. OpenACC can be compared to OpenMP, since both use pragma directives to execute code in parallel; however, OpenMP can only execute on the CPU, whereas OpenACC can execute on both the CPU and the GPGPU. The standard is developed by the companies Cray, CAPS, Nvidia and PGI. An OpenACC directive can look like this:

#pragma acc kernels loop independent copyin(neededarray[0:MAX]) copy(resultarray[0:10])

The directive first defines that it is an independent loop that will be accelerated. The kernel will need the array that is copied in using the copyin() clause; the whole array is copied in, from zero up to the predefined MAX definition. The resultarray is copied in before the kernel launch and copied out after the kernel has terminated using the copy() clause; only the first ten values are copied.
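A minimal, self-contained sketch of this pattern (our example, not code from the thesis; the array names and the doubling operation are placeholders):

#include <stdio.h>
#define MAX 1000000

int main(void) {
    static float neededarray[MAX], resultarray[MAX];
    for (int i = 0; i < MAX; i++)
        neededarray[i] = (float)i;

    /* neededarray is only read on the device, so copyin() is enough;
       resultarray is copied in before the kernel and back out afterwards. */
    #pragma acc kernels loop independent copyin(neededarray[0:MAX]) copy(resultarray[0:MAX])
    for (int i = 0; i < MAX; i++)
        resultarray[i] = 2.0f * neededarray[i];

    printf("%f\n", resultarray[10]);
    return 0;
}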

3 Methodology

This chapter will document the methodology of the study: vectorization, acceleration with the GPGPU, and a combination of them both. If the reader follows this chapter, the same results should be obtained, provided that the exact same hardware architecture and environment are used. The source files and input files used in this study come from the standard PARSEC 3.0 package [11].

3.1 Vector instructions

In order to vectorize the PARSEC application Black-Scholes we had to learn how to work with vector instructions to begin with. PARSEC already included a vectorized version of Black-Scholes, but to understand it and adjust it to AVX we needed to know more about vector instructions. To achieve our goal of vectorizing Black-Scholes we had to start from the beginning. We decided to figure out when and where it is possible to use vector instructions. First of all we wanted to know when we could apply vector instructions to the code, and then evaluate when we thought it was worth applying them. It was important for us to analyze and know what actually happened with the code. When we reached this point we would evolve our code to a more advanced stage, and later, with our new knowledge, work with Black-Scholes. When researching vector instructions and looking at Black-Scholes, we narrowed down the critical sections in which we could utilize the vector instructions. These critical sections contained loops and arrays. With this knowledge we could minimize the area of code we wished to look at. Knowing that we should focus on loops and arrays, we created a simple program using arrays and two kinds of loops: one dependent loop and one non-dependent loop. We predicted that the non-dependent loop was going to work, because if there were no dependencies there would be no restrictions on working with multiple instructions at the same time.

So we were more interested in the dependent loop. As suspected, the non-dependent loop was vectorized while the dependent loop could not be vectorized. The way we were able to determine whether a loop could be vectorized was to use the Intel C++ Compiler (ICC). This compiler allows us to automatically check the code and validate whether a loop can or cannot be vectorized. If a loop can be vectorized, it will automatically be vectorized in the most optimal way the compiler is capable of. This is very much like the gcc -O2 option, which allows for automatic optimization. What ICC also contributes is a report option that prints which loops have been vectorized. Using these features we could determine which parts of the code we could apply vector instructions to. To further analyze our auto-vectorized code, we created an assembler file using a compilation command and compared the non-vectorized and vectorized code to determine which instructions the compiler used and how it decided to vectorize the code. We also used GDB (the GNU Project Debugger) with split options so we could see the assembler code while debugging the program. We were able to find the instructions which used the XMM or YMM CPU registers that are characteristic of vector instructions. Learning from this analysis, we considered ourselves ready to evolve our code and apply these vector instructions on our own. To evolve our code we decided to create a new version for SSE and one for AVX, and time the three methods against each other to see if we could manage to control the vector instructions and also compare the time difference between our results. We successfully produced three different parallel functions: using AVX, using SSE and without vector instructions. At this point we felt that we could apply and understand vector instructions to the point where we could look into Black-Scholes. Looking into the vectorized Black-Scholes code, we could discuss the content and understand how it would be mapped to the registers. To convert the SSE functions to AVX we followed the already implemented definition set used to select whether the program should run with float or double instructions. We implemented a new definition using the AVX 256-bit commands. Correcting this alone was not sufficient, because when using SSE commands you only align to 16 bytes when moving to and from the registers, while AVX uses 32 bytes.
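A sketch of the kind of test loops we mean (our illustration under our own assumptions, not the exact test program used in the study):

/* No loop-carried dependency: every iteration is independent,
   so the compiler is free to vectorize this loop. */
void no_dependency(float *a, const float *b, const float *c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Loop-carried dependency: a[i] depends on a[i-1] from the previous
   iteration, so this loop cannot be vectorized as written. */
void with_dependency(float *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + a[i];
}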

Figure 3.1: SSE and AVX functions used to test vector instructions.
Figure 3.2: Definition set for SSE and AVX.
Figure 3.3: Alignment for SSE and AVX. Note that the _MM_ALIGN16 definition is present in more than one place in the code.

So we had to create our own definition for the AVX set and changed it to the alignment shown in figure 3.3. We also automatically activated OpenMP to allow for parallel execution and used the OpenMP time calculations. With this experiment we received both expected and unexpected results, which we will present in the result section. We ran a lot of different tests to gather the results we considered were needed to determine how well the vector instructions worked with parallelism and how well the GPU managed with smaller data sets. To get the fairest possible comparison between vector instructions and the GPU, we ran the tests on our different hardware architectures to see which of them gave the fairest result.

3.2 GPGPU

The idea was to first learn about CUDA and OpenACC in order to implement Black-Scholes on the GPGPU. The OpenACC standard is very clear and easy to use, hence the first implementation was done with OpenACC.

The Portland Group compiler will be used to accelerate the code on the device. A licence is needed for the compiler, and this was obtained from their website.

3.2.1 OpenACC

To run OpenACC on our test devices we used the PGI (The Portland Group) [10] compiler for the C language (pgcc). Since PGI is one of the founders of OpenACC, there is a lot of help and many tutorials for this compiler, and its availability to us supported our choice to use it. Changing the code is very simple, since there is an OpenMP version of Black-Scholes and OpenMP is very similar to OpenACC. The Black-Scholes algorithm is implemented in the method BlkSchlsEqEuroNoDiv, and the first thing that must be changed is to make this function inline. If we try to compile with functions that are not inlined, the compiler tells us that function calls are not supported. An inlined function is a function whose code can simply be put where the function was called.

fptype BlkSchlsEqEuroNoDiv(
inline fptype BlkSchlsEqEuroNoDiv(

This notifies the compiler that the complete body of the function can be transferred into the code where the function was called. The CNDF method also needs to be inlined in the Black-Scholes application. Now it is time to locate all the data that the device will need inside the function and copy that data in using the copyin() clause. With a quick look in the method we locate that these arrays are used:

price
sptprice
strike
rate
volatility
otime
otype
prices

If we look after the method call, we see that only the prices array is needed afterwards, and therefore we use the copy() clause for that array so it gets copied out from the device after the kernel has terminated. The OpenMP Black-Scholes version is changed to run OpenACC by changing the pragma directive.

OpenMP:
#pragma omp parallel for private(i, price, priceDelta)

OpenACC:
#pragma acc kernels loop independent copyin(price[0:numOptions], sptprice[0:numOptions], strike[0:numOptions], rate[0:numOptions], volatility[0:numOptions], otime[0:numOptions], otype[0:numOptions]) copy(prices[0:numOptions])

When we give the attribute kernels loop, the compiler will try to accelerate the loop below this line on the GPGPU. The independent keyword tells the compiler that the loop is independent, i.e. that no iteration of the loop changes data used in another iteration. Using the PGI accelerator compiler we set four flags:

-fast -ta=nvidia -Minfo=all,accel -Minline

These flags tell the compiler to accelerate the code, that the target should be an Nvidia device, and to give us all the information about the automatic acceleration during compilation. With the flag -Minline we tell the compiler to support inlined functions and to consider float constants to be of type float. The whole compilation process is in the appendix under OpenACC Compile & Run.
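Put together, the accelerated region looks roughly like the sketch below (our reconstruction of the structure, not the exact PARSEC source; the argument list of BlkSchlsEqEuroNoDiv is an assumption):

/* The OpenACC directive is attached directly to the option loop. */
#pragma acc kernels loop independent \
    copyin(sptprice[0:numOptions], strike[0:numOptions], rate[0:numOptions], \
           volatility[0:numOptions], otime[0:numOptions], otype[0:numOptions]) \
    copy(prices[0:numOptions])
for (i = 0; i < numOptions; i++) {
    /* inlined Black-Scholes formula, evaluated for one option per iteration */
    prices[i] = BlkSchlsEqEuroNoDiv(sptprice[i], strike[i], rate[i],
                                    volatility[i], otime[i], otype[i], 0);
}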

The altered code has a timing function for the total execution time. The PGI compiler has support for timings on the device: transfer time and kernel time. To use this, the environment variable PGI_ACC_TIME needs to be set to 1. The code can now be accelerated on the GPGPU with OpenACC, and the whole code is attached in the appendix. We then run tests with the given input files of the PARSEC Black-Scholes application, with input sizes ranging from 4, 16 and 4 000 rows up to 10 million rows. This gives us results that show the speedup and the ratio between transporting data to the GPGPU and executing the algorithm on that data.

3.2.2 Larger input

After managing to run the two major programs, we decided to run tests to gather the information needed to evaluate our problem statement. After doing speedup tests on both the GPU and the CPU, we realised that our data was not sufficient to determine which of the optimization techniques was the optimal one. We then decided to change the code a bit to be able to work with larger sets of data. We generated our own data set, which we ran up to 280 million rows instead of the largest given data set of 10 million rows from PARSEC. With these new data sets we managed to compare the GPU against the CPU successfully. To change the application to do this we need to add a for loop that initializes the data array with data. We need to add it somewhere before the for loop that fills the input arrays. The two for loops will then look like this.

For CPU:

for (loopnum = 0; loopnum < numOptions; ++loopnum) {
    /* dummy values; the original constants are elided */
    data[loopnum].s          = /* ... */;
    data[loopnum].strike     = /* ... */;
    data[loopnum].r          = /* ... */;
    data[loopnum].divq       = /* ... */;
    data[loopnum].v          = /* ... */;
    data[loopnum].t          = /* ... */;
    data[loopnum].OptionType = 'C';
    data[loopnum].divs       = /* ... */;
    data[loopnum].DGrefval   = /* ... */;
}
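Since the dummy constants are elided above, a runnable equivalent with arbitrary placeholder values (our choice, not the values used in the thesis) could look like this:

for (loopnum = 0; loopnum < numOptions; ++loopnum) {
    data[loopnum].s          = 100.0f;  /* spot price       (arbitrary dummy value) */
    data[loopnum].strike     = 100.0f;  /* strike price     (arbitrary dummy value) */
    data[loopnum].r          = 0.02f;   /* risk-free rate   (arbitrary dummy value) */
    data[loopnum].divq       = 0.0f;    /* dividend rate    (arbitrary dummy value) */
    data[loopnum].v          = 0.30f;   /* volatility       (arbitrary dummy value) */
    data[loopnum].t          = 1.0f;    /* time to maturity (arbitrary dummy value) */
    data[loopnum].OptionType = 'C';     /* call option, as in the original loop     */
    data[loopnum].divs       = 0.0f;    /* (arbitrary dummy value) */
    data[loopnum].DGrefval   = 0.0f;    /* reference value  (arbitrary dummy value) */
}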

For GPU:

for (i = 0; i < numOptions; i++) {
    otype[i]      = (data[i].OptionType == 'P') ? 1 : 0;
    sptprice[i]   = data[i].s;
    strike[i]     = data[i].strike;
    rate[i]       = data[i].r;
    volatility[i] = data[i].v;
    otime[i]      = data[i].t;
}

This fills the input arrays with dummy values that will not give us any reasonable results, but we can now see the speedup gained beyond 10 million data rows.

3.3 Combined vector instructions with GPGPU

We created a test file based on the vectorized version of Black-Scholes. We looked at where the program called the calculation algorithm for the first time and divided that part into two methods: one which called the GPU and one which used the CPU. The major problem was to get these methods to run in parallel, meaning that the GPU and the CPU would execute at their fullest potential simultaneously. We used OpenMP to try to get this to work properly, but without success. It was not that we did not manage to run them in parallel, but rather that it was not the best solution for the way our program worked, given the data transfer time compared to the execution time on the GPU. Another problem we ran into was that we had a hard time determining how much data the GPU could actually handle. Despite trying different ways to solve this, we did not manage to create a self-sufficient program that could determine the upper limit of the GPU by itself. So we had to manually adjust the size that would be computed on the GPU and the CPU, or run twice on the GPU. In the code this can be seen as the MAX_SPAN variable, which is the number of data rows transferred to the GPU. We found that the maximum capacity of the GPU in our case was 198 million data rows. After that point, it became interesting to see what we could achieve by combining the GPU with the CPU.
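A rough sketch of how such a manual split can be expressed (our illustration, not the thesis code; MAX_SPAN is the variable mentioned above, while the two helper functions are hypothetical names for the OpenACC and SSE code paths):

void blackscholes_acc(long first, long last);   /* hypothetical OpenACC code path   */
void blackscholes_simd(long first, long last);  /* hypothetical SSE/OpenMP code path */

/* Rows [0, gpuRows) go to the GPU; the remainder is handled on the CPU. */
void run_combined(long numOptions, long MAX_SPAN) {
    long gpuRows = (numOptions > MAX_SPAN) ? MAX_SPAN : numOptions;

    blackscholes_acc(0, gpuRows);               /* GPU: OpenACC-accelerated loop    */
    if (gpuRows < numOptions)
        blackscholes_simd(gpuRows, numOptions); /* CPU: SSE-vectorized OpenMP loop  */
}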

We moved the program over to our GPU cluster Zorn, which was capable of running both vector instructions (SSE4.2) and GPU code properly. But in order to compile with the PGI compiler, a few adjustments had to be made to the vectorized code. The major change was that we could no longer align data on 16 bytes, because the compiler did not allow that. Even though we had to remove the _MM_ALIGN16 definition from the code, the PGI compiler helps out with the vectorization, which meant that we could utilize the SSE part of Black-Scholes anyway.

4 Result

In this chapter all of the results we have gathered will be presented; note that we will not draw any conclusions from the results in this chapter. Conclusions will be presented in the next chapter. We have divided the results into four parts: vector instructions, GPGPU acceleration, the combination, and one part where we present results comparing the GPU and SSE. All of the tests that have generated the results below have been executed at least ten times, and we have selected the median to minimize the risk of gathering results affected by hardware issues or other unforeseen complications. We have also booked and run the tests at different times of day and validated the results to make sure we could utilize the CPU and GPU to their fullest potential.

4.1 Vector instructions

We started by comparing our two CPUs: the third-generation quad-core Intel i7 with hyper-threading and the AMD Opteron 6172 with 48 cores available. The following results were gathered running with the maximum number of threads on both CPUs and the SSE vector instructions:

Table 4.1: SSE. Columns: Hardware Architecture (Intel i7, 4 cores / AMD Opteron), Workload (data rows), Execution time with the maximum number of cores.

The AMD Opteron processor does not support the AVX vector instructions, so only the Intel i7 processor was able to run the following tests. These tests are also done using the maximum number of threads. Note that the data set which contains four rows could not be run with AVX instructions because it was too small.

Table 4.2: AVX. Columns: Hardware Architecture (Intel i7, 4 cores), Workload (data rows), Execution time with the maximum number of cores; the 4-row workload is marked "-".

From the gathered data we could see that the Intel i7 was faster, so we decided to run a few tests to determine which of the data sets would be interesting to look at in a speedup scenario. The only interesting data set was the one with 10 million rows. We generated the following graphs containing execution times for one, two, four and eight cores, using both AVX and SSE, to see if there is a similar pattern.

Figure 4.1: Graph to establish how well vector instructions work with threads for SSE and different sizes of data. Data span: Large.

Figure 4.2: Graph to establish how well vector instructions work with threads for AVX and different sizes of data. Data span: Large.

These tests gave us a view of how well vector instructions can work with parallelism. Using eight threads generated the following speedup:

Speedup = Execution time(1 thread) / Execution time(n threads)
Speedup SSE = 1453,404 / 299,996 = 4,9x
Speedup AVX = 1745,111 / 316,59 = 5,5x

Even though there is a good speedup from the sequential version to eight threads, there is potential for even greater speedup. Going back to the AMD Opteron processor and its 48 cores, we ran new tests to determine how much speedup we could obtain using a lot more cores, in order to see if there was a limit to the speedup with vector instructions. The result can be viewed in figure 4.3 below.

Figure 4.3: Graph to establish how well vector instructions work with threads for AVX with different sizes of data. Data span: Small - Medium.

After only reaching a speedup of 21x on the AMD processor, we decided to run the tests on Zorn to see if this was a hardware issue, and also to determine which of our hardware would be the most suitable for our comparison. The best speedup we could manage to get on Zorn was 10,8x with 16 threads, at a best execution time of 308,5 ms.

4.2 GPGPU acceleration

The following data compares our GPGPU cluster Zorn to a single GTX 680; this result might ease the conclusion about the long initiation time on Zorn.

Table 4.3: OpenACC - Zorn vs Single GPU system. Columns: Hardware (NVIDIA GTX 680 / NVIDIA Tesla M2090), Workload, Transfer time, Algorithm time, Total time.

In table 4.3 we see both the transfer time and the algorithm execution time. We can see that the total time does not add up with the other results. The reason the total time is so much higher is that each GPU has an initiation time. The initiation time can be determined by subtracting the algorithm execution time and the transfer time from the total time. As we can see, moving the data to the GPU is very time consuming. When running Black-Scholes with one of the smaller input sets we see the following performance output, using the environment variable PGI_ACC_TIME=1. The row below shows how much time the acceleration spends on executing the kernel and how much on moving data.

kernels=87ms data=12678ms

This tells us that the time the program spent on moving data is actually 145 times more than the time it spent on the algorithm itself. The power of the GPU only shows when the input rows are many more. When running Black-Scholes with as much as 10 million input rows we see this performance output instead:

kernels=9198ms data=108629ms

In comparison to the previous execution, this execution only spends about 11 times more on data transfer than on algorithm execution. When we do runs larger than 10 million rows, we see that this ratio stagnates at about 10 times more on data transfer than on algorithm execution time.

Figure 4.4: The ratio of how much time the application spends on transferring data and how much it actually spends on the algorithm.

From figure 4.4 we see that the time spent on moving data is overwhelming until the workload is close to 10 million input rows. Then the ratio lies steady at around ten times more on moving data than on the execution.

4.3 Compare GPU to vector instructions

In order to answer our problem statement we have to compare these different optimization techniques against each other. We decided to look at the execution time over a span of data sets. We chose the data span of 10 million to 100 million data rows and compared SSE vs OpenACC, CPU vs GPU.

Figure 4.5: SSE speedup & GPGPU speedup.

We also decided to show how powerful the GPU kernel is when executing the algorithm, and how much time is actually spent transferring the data to the GPU. Using a log10 graph of the execution times in milliseconds, where we compare the CPU and the GPU kernel, we can view the difference in execution time.

Figure 4.6: Execution time of the CPU & the GPU kernel.

4.4 Combined GPU with vector instructions

After comparing the CPU and the GPU, we took the time to combine the two into one solution to determine at which point we benefit from using the CPU and GPU together. As mentioned before, we found the point where the memory of our GPU reached maximum capacity (198 million data rows) and ran tests around that area. So we started to measure the maximum speed for 180 million to 280 million data rows.

Figure 4.7: Execution time of combined Black-Scholes & GPGPU-accelerated Black-Scholes prior to and after the GPGPU has reached maximum capacity.

In figure 4.7 we have used the most optimal time for the combined version prior to the GPU memory allocation cap. During our tests, executing the combined version was most optimal using only the GPU before the cap. We also wanted to view the entire scope we have been working on with all the different techniques. We ran and plotted the following graph (figure 4.8) to show the difference between these three techniques in terms of execution time. We chose the data span from 10 million to 280 million rows to really show the difference in execution time and how these techniques compare to each other.

Figure 4.8: Execution time in seconds of all three versions of Black-Scholes used in the research.

Using this graph we can actually see that we managed to increase the speed from the start as well, by adjusting the SSE version to run on the PGI compiler. We can also see how quickly the SSE-only Black-Scholes version becomes less effective than the GPU version when working with such big data sets.

5 Summary & Conclusion

This chapter will include explanations and ideas about the results in this thesis. It will also include a summary of the whole thesis.

5.1 Conclusion

Within this section we will discuss and evaluate the gathered results. We will talk about both the speedup and the programming effort in the two areas vectorization and GPGPU acceleration.

5.1.1 Vectorization

This part will focus on the results of the vectorized version of Black-Scholes: how well the speed of the vectorized Black-Scholes increases with parallelism, and how much programming effort is needed to compile and run a vectorized Black-Scholes.

Speedup

Our first speedup result was to compare the AVX and SSE instructions to see which one was faster. In terms of speedup we managed to obtain a 5,5 times speedup using the AVX instructions, while SSE only reached a 4,9 times speedup. Still, SSE managed to perform better in absolute execution time in our tests on our only AVX-capable CPU, which only had four cores. We started looking for a reason why this occurred, because generally AVX is said to be faster than SSE. We could not determine for sure why this was happening, but we found what other developers had answered when asked about the same problems. The first interesting reason was pointed out by a developer at Intel, who said that SSE and AVX have the same upper limit for store performance: the amount of data that can be loaded and stored simultaneously is equal. The second interesting point we found was that two move instructions with SSE were faster than one move instruction with AVX. This gave us the realisation that we might have reached the upper limit of store performance, and with AVX using slower move instructions it seemed reasonable that SSE managed to be faster.

However, since AVX managed to reach a higher speedup, we cannot be sure that it would not reach or exceed the same speed as SSE with more cores applied; unfortunately we did not have the hardware to test this. When looking only at SSE in terms of speedup, we were a bit disappointed when it came to utilizing the parallelism. We managed to get speedups, but not in the fashion we expected. Running the vectorized version on our AMD Opteron 6172, we only managed to get a 21 times speedup (see figure 4.3), while the normal version got a 47 times speedup at most. The major concern here was that when applying more than 32 cores we got a speed decrease. This made us wonder whether there was some cap on the registers, or an increased amount of cache misses when applying more cores while working with the vector instructions. We were not able to determine why this happened; we just had to accept the fact that we did not manage to get the speedup we expected. Looking at the speedup generated across the different hardware, we can conclude that our solution does not scale all too well with parallelism. This is not proof that SSE in general is suboptimal with parallelism, but in our case it was.

Programming effort

In terms of programming effort, we managed to avoid a lot of work with the vector instruction version of Black-Scholes since, as mentioned earlier, PARSEC supplied us with an already vectorized version. But comparing the vectorized version with the normal Black-Scholes, it contains a lot of changes, and it would probably have taken a very long time to implement this ourselves and make it as optimal as they have done, even though it is not at all impossible to recreate a fully functional vectorized version. The effort of vectorizing Black-Scholes ourselves would have been demanding, but implementing the AVX support in the already vectorized version took very little effort. That being said, if you have already vectorized a program for SSE, then the transition to AVX is almost effortless, while converting a non-vectorized program into a vectorized one would be difficult and take time if you have no earlier experience of doing so. You really have to understand the code in its entirety to be able to apply the correct functions for the most optimal use.

5.1.2 GPGPU - OpenACC

OpenACC is a good technique for accelerating code on the GPGPU. This thesis chose between OpenCL, CUDA and OpenACC. All three were tried out, but the main research used OpenACC because of the minimal problems it produced when programming and compiling the code.

Speedup

In this part the main discussion will be about the speedup and timings when running Black-Scholes with OpenACC on a GPU cluster. The big issue when accelerating on the GPGPU is the transfer time. This issue cannot be minimized any further; the results tell us that even with the best data optimization possible and the largest workload, the data movement takes about ten times longer than the execution of the algorithm. This conclusion led us to not spend more time trying to optimize the application further. Where we could have done some optimization is in the algorithm part of the application, by mapping the OpenACC code better to CUDA code through specifying the number of grids and blocks to use. Since the data transfer is optimized as much as possible, and the data transfer time is so much larger than the algorithm execution time, that optimization would not have gained the study very much. The speedup we get when we compare running Black-Scholes on the CPU without vector instructions and accelerating it on the GPU is what we had expected. With smaller inputs it gives no speedup at all, and it is actually slower to run on the GPU, since the initiation time and transfer time are longer than the whole execution time on the CPU.

Programming effort

When accelerating code on the GPGPU there are some important concepts to understand, and this is somewhat time consuming. The biggest thing to understand is the architecture of a GPGPU with respect to grids, blocks and threads. This research used OpenACC, where the compiler maps our algorithm to CUDA code. If the acceleration does not need to be optimized any further on the GPGPU, knowledge about the GPGPU architecture is not necessary. Since it is so easy to translate an OpenMP application to OpenACC, and this research was lucky enough to start off with a good OpenMP version of the Black-Scholes application, the programming effort was very small. To accelerate using OpenACC it is still necessary to learn about an OpenACC compiler and the limitations of OpenACC, for example that functions have to be able to be inlined and that ambiguous pointers are not supported.

Learning this does not need to take more than one or two days. To accelerate with plain CUDA or OpenCL, more knowledge is required; from our perspective those techniques are far more complicated.

5.1.3 Combined CPU and GPGPU

In this section we will discuss the results we received when running our combined version of Black-Scholes.

Speedup

We got some interesting results when combining the CPU with the GPU in terms of speedup. We started doing tests at workloads below the maximum capacity of the GPU, where we realised there was no real need to use both the GPU and the CPU, because we had already shown that already at 30 million rows the GPU was much faster than the CPU. We decided to focus on the area where the GPU would reach maximum capacity (180 million to 198 million rows) and see if we could manage to get a speedup with the help of the CPU. We made sure the GPU-only version was faster than the GPU-combined-with-CPU version for our data span before the cap, in order to validate our results. We predicted that when the GPU fills up there would have to be another data transfer once the GPU was done, and for the lower data sets of 10 million and 20 million rows we had already shown that it was worth using the CPU. So these were the most interesting values to look at when we started running the tests. To our surprise, when combining the code for the CPU and the GPU, we actually managed to get a speedup on the vectorized part compared to our previous vectorized version of Black-Scholes. So not only did we get a speedup on the lower part of the data sets, but even up to 60 million data rows we obtained a speedup by using the CPU. We cannot really explain the speedup of the CPU code under the PGI compiler, since the code was already vectorized, but we suspect it has something to do with the alignment of data. When using the align functions you set boundaries in bytes to the amount declared in the alignment functions, and alignment is also used to align data to the cache line to improve cache performance. We suspect that the compiler optimized this alignment in a way that was more efficient than our own alignments.


More information

Hands-on CUDA exercises

Hands-on CUDA exercises Hands-on CUDA exercises CUDA Exercises We have provided skeletons and solutions for 6 hands-on CUDA exercises In each exercise (except for #5), you have to implement the missing portions of the code Finished

More information

An Introduction to Parallel Computing/ Programming

An Introduction to Parallel Computing/ Programming An Introduction to Parallel Computing/ Programming Vicky Papadopoulou Lesta Astrophysics and High Performance Computing Research Group (http://ahpc.euc.ac.cy) Dep. of Computer Science and Engineering European

More information

CUDA programming on NVIDIA GPUs

CUDA programming on NVIDIA GPUs p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view

More information

Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus

Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus A simple C/C++ language extension construct for data parallel operations Robert Geva robert.geva@intel.com Introduction Intel

More information

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization

More information

Parallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild.

Parallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild. Parallel Computing: Strategies and Implications Dori Exterman CTO IncrediBuild. In this session we will discuss Multi-threaded vs. Multi-Process Choosing between Multi-Core or Multi- Threaded development

More information

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,

More information

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Modern GPU

More information

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile GPU Computing with CUDA Lecture 4 - Optimizations Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 3 Control flow Coalescing Latency hiding

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

GPGPU for Real-Time Data Analytics: Introduction. Nanyang Technological University, Singapore 2

GPGPU for Real-Time Data Analytics: Introduction. Nanyang Technological University, Singapore 2 GPGPU for Real-Time Data Analytics: Introduction Bingsheng He 1, Huynh Phung Huynh 2, Rick Siow Mong Goh 2 1 Nanyang Technological University, Singapore 2 A*STAR Institute of High Performance Computing,

More information

OpenACC Programming and Best Practices Guide

OpenACC Programming and Best Practices Guide OpenACC Programming and Best Practices Guide June 2015 2015 openacc-standard.org. All Rights Reserved. Contents 1 Introduction 3 Writing Portable Code........................................... 3 What

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Amin Safi Faculty of Mathematics, TU dortmund January 22, 2016 Table of Contents Set

More information

Black-Scholes option pricing. Victor Podlozhnyuk vpodlozhnyuk@nvidia.com

Black-Scholes option pricing. Victor Podlozhnyuk vpodlozhnyuk@nvidia.com Black-Scholes option pricing Victor Podlozhnyuk vpodlozhnyuk@nvidia.com June 007 Document Change History Version Date Responsible Reason for Change 0.9 007/03/19 vpodlozhnyuk Initial release 1.0 007/04/06

More information

Intelligent Heuristic Construction with Active Learning

Intelligent Heuristic Construction with Active Learning Intelligent Heuristic Construction with Active Learning William F. Ogilvie, Pavlos Petoumenos, Zheng Wang, Hugh Leather E H U N I V E R S I T Y T O H F G R E D I N B U Space is BIG! Hubble Ultra-Deep Field

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

More information

A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators

A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators Sandra Wienke 1,2, Christian Terboven 1,2, James C. Beyer 3, Matthias S. Müller 1,2 1 IT Center, RWTH Aachen University 2 JARA-HPC, Aachen

More information

Turbomachinery CFD on many-core platforms experiences and strategies

Turbomachinery CFD on many-core platforms experiences and strategies Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29

More information

Clustering Billions of Data Points Using GPUs

Clustering Billions of Data Points Using GPUs Clustering Billions of Data Points Using GPUs Ren Wu ren.wu@hp.com Bin Zhang bin.zhang2@hp.com Meichun Hsu meichun.hsu@hp.com ABSTRACT In this paper, we report our research on using GPUs to accelerate

More information

Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1

Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1 Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion

More information

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it t.diamanti@cineca.it Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate

More information

Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and

More information

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

More information

Embedded Systems: map to FPGA, GPU, CPU?

Embedded Systems: map to FPGA, GPU, CPU? Embedded Systems: map to FPGA, GPU, CPU? Jos van Eijndhoven jos@vectorfabrics.com Bits&Chips Embedded systems Nov 7, 2013 # of transistors Moore s law versus Amdahl s law Computational Capacity Hardware

More information

OpenACC Parallelization and Optimization of NAS Parallel Benchmarks

OpenACC Parallelization and Optimization of NAS Parallel Benchmarks OpenACC Parallelization and Optimization of NAS Parallel Benchmarks Presented by Rengan Xu GTC 2014, S4340 03/26/2014 Rengan Xu, Xiaonan Tian, Sunita Chandrasekaran, Yonghong Yan, Barbara Chapman HPC Tools

More information

SUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS - 2013 UPDATE

SUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS - 2013 UPDATE SUBJECT: SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE KEYWORDS:, CORE, PROCESSOR, GRAPHICS, DRIVER, RAM, STORAGE SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE Below is a summary of key components of an ideal SolidWorks

More information

Introduction to CUDA C

Introduction to CUDA C Introduction to CUDA C What is CUDA? CUDA Architecture Expose general-purpose GPU computing as first-class capability Retain traditional DirectX/OpenGL graphics performance CUDA C Based on industry-standard

More information

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),

More information

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007 Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

More information

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as

More information

Texture Cache Approximation on GPUs

Texture Cache Approximation on GPUs Texture Cache Approximation on GPUs Mark Sutherland Joshua San Miguel Natalie Enright Jerger {suther68,enright}@ece.utoronto.ca, joshua.sanmiguel@mail.utoronto.ca 1 Our Contribution GPU Core Cache Cache

More information

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Qingyu Meng, Alan Humphrey, Martin Berzins Thanks to: John Schmidt and J. Davison de St. Germain, SCI Institute Justin Luitjens

More information

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute

More information

Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual

Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual Overview Metrics Monitor is part of Intel Media Server Studio 2015 for Linux Server. Metrics Monitor is a user space shared library

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

CUDA Programming. Week 4. Shared memory and register

CUDA Programming. Week 4. Shared memory and register CUDA Programming Week 4. Shared memory and register Outline Shared memory and bank confliction Memory padding Register allocation Example of matrix-matrix multiplication Homework SHARED MEMORY AND BANK

More information

High Performance GPGPU Computer for Embedded Systems

High Performance GPGPU Computer for Embedded Systems High Performance GPGPU Computer for Embedded Systems Author: Dan Mor, Aitech Product Manager September 2015 Contents 1. Introduction... 3 2. Existing Challenges in Modern Embedded Systems... 3 2.1. Not

More information

Accelerating variant calling

Accelerating variant calling Accelerating variant calling Mauricio Carneiro GSA Broad Institute Intel Genomic Sequencing Pipeline Workshop Mount Sinai 12/10/2013 This is the work of many Genome sequencing and analysis team Mark DePristo

More information

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application

More information

GeoImaging Accelerator Pansharp Test Results

GeoImaging Accelerator Pansharp Test Results GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance

More information

Generations of the computer. processors.

Generations of the computer. processors. . Piotr Gwizdała 1 Contents 1 st Generation 2 nd Generation 3 rd Generation 4 th Generation 5 th Generation 6 th Generation 7 th Generation 8 th Generation Dual Core generation Improves and actualizations

More information

INTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism

INTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism Intel Cilk Plus: A Simple Path to Parallelism Compiler extensions to simplify task and data parallelism Intel Cilk Plus adds simple language extensions to express data and task parallelism to the C and

More information

Accelerating sequential computer vision algorithms using OpenMP and OpenCL on commodity parallel hardware

Accelerating sequential computer vision algorithms using OpenMP and OpenCL on commodity parallel hardware Accelerating sequential computer vision algorithms using OpenMP and OpenCL on commodity parallel hardware 25 August 2014 Copyright 2001 2014 by NHL Hogeschool and Van de Loosdrecht Machine Vision BV All

More information

Direct GPU/FPGA Communication Via PCI Express

Direct GPU/FPGA Communication Via PCI Express Direct GPU/FPGA Communication Via PCI Express Ray Bittner, Erik Ruf Microsoft Research Redmond, USA {raybit,erikruf}@microsoft.com Abstract Parallel processing has hit mainstream computing in the form

More information

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS Performance 1(6) HIGH PERFORMANCE CONSULTING COURSE OFFERINGS LEARN TO TAKE ADVANTAGE OF POWERFUL GPU BASED ACCELERATOR TECHNOLOGY TODAY 2006 2013 Nvidia GPUs Intel CPUs CONTENTS Acronyms and Terminology...

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

GPGPUs, CUDA and OpenCL

GPGPUs, CUDA and OpenCL GPGPUs, CUDA and OpenCL Timo Lilja January 21, 2010 Timo Lilja () GPGPUs, CUDA and OpenCL January 21, 2010 1 / 42 Course arrangements Course code: T-106.5800 Seminar on Software Techniques Credits: 3 Thursdays

More information

AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL. AMD Embedded Solutions 1

AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL. AMD Embedded Solutions 1 AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL AMD Embedded Solutions 1 Optimizing Parallel Processing Performance and Coding Efficiency with AMD APUs and Texas Multicore Technologies SequenceL Auto-parallelizing

More information

Four Keys to Successful Multicore Optimization for Machine Vision. White Paper

Four Keys to Successful Multicore Optimization for Machine Vision. White Paper Four Keys to Successful Multicore Optimization for Machine Vision White Paper Optimizing a machine vision application for multicore PCs can be a complex process with unpredictable results. Developers need

More information

~ Greetings from WSU CAPPLab ~

~ Greetings from WSU CAPPLab ~ ~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)

More information

The Bus (PCI and PCI-Express)

The Bus (PCI and PCI-Express) 4 Jan, 2008 The Bus (PCI and PCI-Express) The CPU, memory, disks, and all the other devices in a computer have to be able to communicate and exchange data. The technology that connects them is called the

More information

EE361: Digital Computer Organization Course Syllabus

EE361: Digital Computer Organization Course Syllabus EE361: Digital Computer Organization Course Syllabus Dr. Mohammad H. Awedh Spring 2014 Course Objectives Simply, a computer is a set of components (Processor, Memory and Storage, Input/Output Devices)

More information

GPGPU Parallel Merge Sort Algorithm

GPGPU Parallel Merge Sort Algorithm GPGPU Parallel Merge Sort Algorithm Jim Kukunas and James Devine May 4, 2009 Abstract The increasingly high data throughput and computational power of today s Graphics Processing Units (GPUs), has led

More information

Lattice QCD Performance. on Multi core Linux Servers

Lattice QCD Performance. on Multi core Linux Servers Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most

More information

OpenCL Programming for the CUDA Architecture. Version 2.3

OpenCL Programming for the CUDA Architecture. Version 2.3 OpenCL Programming for the CUDA Architecture Version 2.3 8/31/2009 In general, there are multiple ways of implementing a given algorithm in OpenCL and these multiple implementations can have vastly different

More information

Parallelization: Binary Tree Traversal

Parallelization: Binary Tree Traversal By Aaron Weeden and Patrick Royal Shodor Education Foundation, Inc. August 2012 Introduction: According to Moore s law, the number of transistors on a computer chip doubles roughly every two years. First

More information

PCI vs. PCI Express vs. AGP

PCI vs. PCI Express vs. AGP PCI vs. PCI Express vs. AGP What is PCI Express? Introduction So you want to know about PCI Express? PCI Express is a recent feature addition to many new motherboards. PCI Express support can have a big

More information

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket

More information

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 g_suhakaran@vssc.gov.in THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833

More information

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics 22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC

More information