Data-parallel Acceleration of PARSEC Black-Scholes Benchmark


Data-parallel Acceleration of PARSEC Black-Scholes Benchmark
AUGUST ANDRÉN and PATRIK HAGERNÄS
KTH Information and Communication Technology
Bachelor of Science Thesis
Stockholm, Sweden 2013
TRITA-ICT-EX-2013:158


Data-parallel acceleration of PARSEC Black-Scholes benchmark
Patrik Hagernäs
August Andrén
Stockholm 2013
Parallel programming
School of Information and Communication Technology
Royal Institute of Technology
IR-EE number

Abstract

The way programmers have been relying on processor improvements to gain speedup in their applications is no longer applicable in the same fashion. Programmers usually have to parallelize their code to utilize the CPU cores in the system to gain a significant speedup. To accelerate parallel applications further, there are a couple of techniques available. One technique is to vectorize some of the parallel code. Another technique is to move parts of the parallel code to the GPGPU and utilize this highly multithreaded unit of the system. The main focus of this report is to accelerate the data-parallel workload Black-Scholes of the PARSEC benchmark suite. We are going to compare three accelerations of this workload: using vector instructions in the CPU, using the GPGPU, and using a combination of them both. The two fundamental aspects are the speedup and how much programming effort each technique requires. To accelerate with vectorization in the CPU we use the SSE & AVX instruction sets, and to accelerate the workload on the GPGPU we use OpenACC.

Contents

1 Introduction
   1.1 Background
   1.2 Problem description
       1.2.1 Problem
       1.2.2 Problem statement
   1.3 Purpose
   1.4 Goal
   1.5 Method
   1.6 Limitations
2 Theoretic background
   2.1 PARSEC
       2.1.1 PARSEC benchmark suite
       2.1.2 Black-Scholes
   2.2 Vector instructions
       2.2.1 SIMD - Single Instruction, Multiple Data
       2.2.2 SSE - Streaming SIMD Extensions
       2.2.3 AVX - Advanced Vector Extensions
   2.3 GPGPU - General Purpose GPU
       2.3.1 CUDA
       2.3.2 OpenACC
3 Methodology
   3.1 Vector instructions
   3.2 GPGPU
       3.2.1 OpenACC
       3.2.2 Larger input
   3.3 Combined vector instructions with GPGPU

4 Result
   4.1 Vector instructions
   4.2 GPGPU acceleration
   4.3 Compare GPU to vector instructions
   4.4 Combined GPU with vector instructions
5 Summary & Conclusion
   5.1 Conclusion
       5.1.1 Vectorization
       5.1.2 GPGPU - OpenACC
       5.1.3 Combined CPU and GPGPU
       5.1.4 General conclusions
   5.2 Summary
   5.3 Future research
Appendix
   GPGPU - OpenACC: blackscholes-acc.c
   Vector instructions - SSE: blackscholes-simd.c
   Combined - OpenACC & SSE: blackscholes-combined.c

List of Figures

2.1 Single Instruction Single Data vs Single Instruction Multiple Data
2.2 YMM and XMM registers
2.3 Structure of a CUDA kernel
3.1 SSE and AVX functions used to test vector instructions
3.2 Definition set for SSE and AVX
3.3 Alignment for SSE and AVX. Note that the _MM_ALIGN16 definition is present in more than one place in the code
4.1 Graph to establish how well vector instructions work with threads for SSE and different sizes of data. Data span: Large
4.2 Graph to establish how well vector instructions work with threads for AVX and different sizes of data. Data span: Large
4.3 Graph to establish how well vector instructions work with threads for AVX and different sizes of data. Data span: Small - Medium
4.4 The ratio of how much time the application spends on transferring data and how much it actually spends on the algorithm
4.5 SSE speedup & GPGPU speedup
4.6 Execution time of the CPU & the GPU kernel
4.7 Execution time of combined Black-Scholes & GPGPU-accelerated Black-Scholes prior to and after the GPGPU has reached maximum capacity
4.8 Execution time in seconds of all three versions of Black-Scholes used in the research

List of Tables

2.1 PARSEC workloads
4.1 SSE
4.2 AVX
4.3 OpenACC - Zorn vs Single GPU system
5.1 Programming effort

1 Introduction

This section will focus on introducing the background of the problem statement, and the goals and purpose of the research. It will also include what tools have been used and the delimitations of this research.

1.1 Background

Computer industries have for a very long time relied on processor improvements that led to increases in clock cycles per second. These processor improvements can be connected to the prediction known as Moore's Law. As early as 1965, Gordon Moore made this prediction: he predicted that the number of transistors on a processor chip would double every twelfth month. Nowadays it is more commonly quoted as every eighteenth month or even every second year, but it is still considered to be accurate [3]. These processor improvements were a reality for developers, and they could rely on them for their applications to gain speedup. This went on until processors reached a clock frequency limit which was hard to push further without running into serious thermal problems. To maintain speedup in processors, the industry set a new standard of using multicore processors. Multicore processors as a standard did not come without a sacrifice. Developers could no longer rely on the processors to improve and wait for an increased clock frequency to achieve speedup. Software had to be parallelised in order to utilize the processor and achieve the desired speedup. This parallelization was not always considered to be an easy task. There are a lot of techniques for making a program run in parallel, and more are still being developed. These techniques are based on different ways of threading the program, dividing the workload and utilizing the hardware. Developers are constantly trying to adapt to these new techniques to achieve speedup in their applications [7].

1.2 Problem description

Many of today's parallel applications cannot be fully utilized because they are limited by the number of cores in the CPU. There are techniques which can be used to achieve a speedup without increasing the number of cores, instead utilizing the advanced extended CPU registers. Speedup could also be achieved by moving parts of the workload from the CPU to the GPGPU (General Purpose Graphics Processing Unit).

1.2.1 Problem

The main focus of this study is to utilize the PARSEC benchmark and the Black-Scholes workload in a heterogeneous system, using both vector instructions in the CPU as well as moving part of the workload to the GPGPU, and to achieve a speedup compared to the already parallelized code. The reason for using the Black-Scholes workload from the PARSEC benchmark suite is that it is already data-parallel code which can easily be vectorized and GPGPU accelerated. There are already many comparisons between running data-parallel code on the CPU and running it on the GPGPU. Most of the time these comparisons are made between a non-vectorized data-parallel algorithm running on the CPU and an equivalent algorithm on the GPGPU; these conditions are not considered to be fair. In some cases the comparisons have been remade using vectorized code and the results did come out differently.

1.2.2 Problem statement

Which technique is most efficient for increasing the speed of a data-parallel benchmark application within the PARSEC suite: vector instructions, accelerating it on the GPGPU, or a combination of both?

1.3 Purpose

The purpose of this study is to focus on two fundamental acceleration techniques: vectorization in the CPU and running code on the GPGPU. We apply these techniques to a data-parallel application from the PARSEC benchmark suite to determine the optimal solution for speedup and performance acceleration. This research will also look into how much programming effort is required by each technique to create a relevant and functional program.

1.4 Goal

The goal of this study is to manage to run the PARSEC benchmark Black-Scholes with both vectorization and GPGPU acceleration and get a speedup. Hopefully the result will help developers form some sort of general idea of when it is good to vectorize the code, use the GPGPU, or perhaps use a combination of them both.

1.5 Method

This research is not about reinventing the wheel, and we will try, if possible, not to rewrite any code that has been written by people with greater knowledge in this area. We must however learn these techniques and understand the code we manage to find, or be able to write the code that we could not find. To help us achieve the goals, we have access to a four-core Intel i7 QM processor with AVX/SSE support, a twelve-core AMD Opteron 6172 with SSE instructions, an Nvidia GTX 680 graphics card, and a cluster of graphics cards with Nvidia Tesla M2090, Nvidia GeForce GTX 580 and Nvidia Tesla C2050. In order to apply these techniques we have read previous research reports, tutorials and manuals. Our research materials and manuals have been obtained mostly from the archives of the developers of these different techniques. To obtain our results we have programmed in C using GPGPU and vector instruction APIs with different compilers such as g++, ICC, PGI, nvcc and Visual Studio. We started programming from the beginning with smaller programs in each technique and then moved forward to more advanced programs. Eventually we applied the techniques to the desired application. Using timing operations and estimating the effort needed to complete the task, we could determine our results in terms of acceleration and programming effort.

1.6 Limitations

This report may not include fully optimized code, which has been determined through tests, due to our lack of experience with these techniques. The performance tests may have been run at different levels of optimization due to the difficulty of applying the techniques. The hardware architectures cannot really be compared equally, which means that there cannot be any guarantee that the results are completely fair.

2 Theoretic background

This research is based on accelerating applications using different acceleration techniques. The techniques used in this study are vector instructions as well as usage of the GPGPU. To limit the area of interest, the main focus was set on a few specific techniques within these areas. This section will introduce both of the areas and the techniques chosen for this study. It will also explain PARSEC (The Princeton Application Repository for Shared-Memory Computers) and compare the PARSEC benchmark suite against other benchmark suites.

2.1 PARSEC

PARSEC (The Princeton Application Repository for Shared-Memory Computers) is one of the more popular benchmark suites within parallel programming. PARSEC has earned this popularity because it provides a large and diverse repository of applications. All these applications have been selected from several application domains and they cover different areas in parallel programming. In the latest version of PARSEC (3.0) the benchmark suite contains thirteen different workloads. Every single one of these workloads has been parallelised using different techniques [2]. When learning about PARSEC, other benchmarks are often mentioned. The most commonly mentioned benchmark is SPLASH-2. SPLASH-2 is a benchmark suite that also focuses on parallel applications. Naturally this has led to a lot of comparisons between PARSEC and SPLASH-2. There is no conclusion determining which of these benchmarks is better than the other. SPLASH-2 is an older benchmark suite, started in 1990, and its applications are outdated in some respects. Despite the younger and more diverse PARSEC benchmark suite, SPLASH-2 with its old programs is still useful depending on model and research area [1][2].

Program        Application Domain   Parallelization Model   Granularity   Working Set   Data Sharing   Data Exchange
blackscholes   Financial Analysis   data-parallel           coarse        small         low            low
bodytrack      Computer Vision      data-parallel           medium        medium        high           medium
canneal        Engineering          unstructured            fine          unbounded     high           high
dedup          Enterprise Storage   pipeline                medium        unbounded     high           high
facesim        Animation            data-parallel           coarse        large         low            medium
ferret         Similarity Search    pipeline                medium        unbounded     high           high
fluidanimate   Animation            data-parallel           fine          large         low            medium
freqmine       Data Mining          data-parallel           medium        unbounded     high           medium
raytrace       Rendering            data-parallel           medium        unbounded     high           low
streamcluster  Data Mining          data-parallel           medium        medium        low            medium
swaptions      Financial Analysis   data-parallel           coarse        medium        low            low
vips           Media Processing     data-parallel           coarse        medium        low            medium
x264           Media Processing     pipeline                coarse        medium        high           high

Table 2.1: PARSEC workloads

2.1.1 PARSEC benchmark suite

PARSEC contains thirteen applications, as previously mentioned, which can be seen in table 2.1. Each of these is an application in a specific area of interest. What makes these applications so valuable is the fact that they are all state-of-the-art applications within their area. Each workload is parallelised in multiple ways, which enables various benchmark studies [2][1]. Every workload is interesting, but this research will focus on Black-Scholes.

2.1.2 Black-Scholes

Black-Scholes is the workload in PARSEC that this study is trying to accelerate. It is an application that comes from a mathematical model for pricing investment instruments on the financial market. The application is data-parallel, which is very convenient since it does not need to be made data-parallel before accelerating it. The parallel part of Black-Scholes consists of only one for loop that contains the Black-Scholes algorithm. Parallelizing it with OpenMP is therefore very simple, and the flow of the application consists of three basic parts: copy in data, the parallel for loop with the Black-Scholes algorithm, and copy out the result.

2.2 Vector instructions

This section will introduce vector instructions used by the CPU. To explain in more detail how these can be used, we will introduce one architecture and two vector instruction techniques that we used in our research: the SIMD (Single Instruction, Multiple Data) architecture, and the SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) techniques.

2.2.1 SIMD - Single Instruction, Multiple Data

SIMD is a CPU architecture which allows instructions to be vectorized. The SIMD architecture allows one CPU instruction to work on multiple blocks of data simultaneously. For instance, you may need to perform an operation on all elements of an array; with SIMD you can then work on large chunks of data instead of one element at a time. The number of cores in the system sets the limit for the level of parallelism.

Figure 2.1: SISD vs SIMD

SIMD is an architecture that has been around for some time now and it is based on floating-point calculations. Earlier processors were not capable of handling floating point, which meant that floating-point calculations had to be done in a separate unit. Floating-point calculations were in high demand, and it was not long before floating-point support was introduced in the processors. Along with the introduction of floating-point calculation capabilities in the processors, new classifications were introduced; one of these was SIMD [4].
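To make this concrete, here is a small sketch (our own illustration, not code from the thesis) contrasting a scalar loop with an SSE version that processes four floats per instruction; the function and array names are hypothetical.

#include <xmmintrin.h> /* SSE intrinsics */

/* Scalar (SISD): one addition per loop iteration. */
void add_scalar(float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}

/* SIMD (SSE): four additions per loop iteration.
   Assumes n is a multiple of 4 and a, b are 16-byte aligned. */
void add_sse(float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);          /* load 4 floats into an XMM register */
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&a[i], _mm_add_ps(va, vb)); /* add and store 4 floats at once */
    }
}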

In order to use SIMD, the developer needs to invoke vector instructions so that the compiler converts the instructions correctly. A correct conversion can be seen in the assembler code, where one instruction operates on multiple data elements. Otherwise the compiler will convert the code to the more common SISD (Single Instruction, Single Data) form, where each instruction can only handle one set of data. Note that this can also be done in parallel, depending on the algorithm's parallelism and the number of available cores in the processor [6][4].

2.2.2 SSE - Streaming SIMD Extensions

SSE is one of the vector instruction sets that has been developed as an evolution of SIMD. SSE was not the first instruction set implemented with the SIMD architecture; before that, Intel had implemented an instruction set called MMX. The MMX instructions were later reworked into what is now known as the first version of SSE. SSE has since been updated through several versions that implemented functions based on user feedback. The most commonly used version is SSE3, which came with the second generation family of Intel processors [9]. In modern processors there are certain registers used for the purpose of vector instructions such as SSE. These are commonly known as XMM registers (as can be seen in figure 2.2) and can contain 128 bits. Using these registers instead of the 64- or 32-bit registers (depending on CPU architecture), you can scale down two to four instructions into one single instruction if done properly [8].

2.2.3 AVX - Advanced Vector Extensions

AVX is an advanced version of the previously mentioned SSE. This new instruction set has been developed with an extended register architecture. As mentioned in the SSE section, there are registers named XMM which are 128-bit registers. The new register architecture has 256-bit registers called YMM, where the lower 128 bits are the formerly known XMM registers (see figure 2.2) [5][6].
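As an illustration (ours, not the thesis's), the same addition expressed with SSE and with AVX intrinsics; the AVX version uses the 256-bit YMM registers and therefore handles eight floats per instruction instead of four.

#include <immintrin.h> /* AVX intrinsics (also pulls in SSE) */

void sse_vs_avx(void) {
    __m128 x4 = _mm_set1_ps(1.0f);     /* 128-bit XMM register holding 4 floats */
    __m128 y4 = _mm_add_ps(x4, x4);    /* one SSE instruction: 4 additions */

    __m256 x8 = _mm256_set1_ps(1.0f);  /* 256-bit YMM register holding 8 floats */
    __m256 y8 = _mm256_add_ps(x8, x8); /* one AVX instruction: 8 additions */

    (void)y4; (void)y8;                /* silence unused-variable warnings */
}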

Figure 2.2: YMM and XMM registers

2.3 GPGPU - General Purpose GPU

GPGPU is a term that was coined when it became advantageous to run parallel code on the GPU that has nothing to do with graphics. The advantage of a GPU over an ordinary CPU is its multithreading capability: a GPU can multithread much better and with many more threads. The GPU cores also run at a lower frequency, generating less heat. This is what many parallel applications are looking for.

2.3.1 CUDA

CUDA is an environment to use when programming in parallel. CUDA provides a full C++ compiler with the purpose of scaling code to hundreds of cores and thousands of threads on a GPU. With CUDA you enable heterogeneous systems, meaning you can combine both the CPU and the GPU within the code. In detail, the main serial thread runs on the CPU and the parallel parts of the code run on the GPU. These parallel parts are called kernels, and the GPU can run multiple kernels at the same time. The kernels run as a grid of blocks of threads, as shown in figure 2.3.

Figure 2.3: CUDA kernel. The grid contains three blocks; each block contains threads.
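As a concrete illustration (our sketch, not code from the thesis), a trivial CUDA kernel that adds two arrays; the kernel name, array names and launch configuration are hypothetical, and the thread-index computation it uses is explained in the text below.

__global__ void add(const float *a, const float *b, float *c, int n) {
    /* global thread index: block index * threads per block + thread index */
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                   /* guard against the last, partially filled block */
        c[idx] = a[idx] + b[idx];
}

/* Launched as a grid of blocks of threads, for example:  */
/*   add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);     */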

Communication between threads in parallel code is available within the same block. The communication is achieved through shared memory and synchronization between the threads. The way a thread knows its own index in the kernel is by taking its block index, multiplying it by the block dimension (the number of threads per block) and adding the thread index, as shown in the code below.

int idx = blockIdx.x * blockDim.x + threadIdx.x;

There are ways to optimize CUDA code and reduce overheads: memory coalescing, shared memory, cache-efficient texture memory accesses, loop unrolling, parallel reduction and page-locked memory allocation.

2.3.2 OpenACC

OpenACC is a programming standard that was developed to make it easier to program in parallel on heterogeneous systems. Accelerating an application on the GPGPU with plain CUDA can be very painful if the programmer is new to these concepts. OpenACC can be used as a bridge between programming in parallel on the CPU and programming plain CUDA on the GPGPU. OpenACC can be compared to OpenMP, since both use pragma directives to execute code in parallel; however, OpenMP can only execute on the CPU, whereas OpenACC can execute on both the CPU and the GPGPU. The standard is developed by the companies Cray, CAPS, Nvidia and PGI. An OpenACC directive can look like this:

#pragma acc kernels loop independent copyin(neededarray[0:MAX]) copy(resultarray[0:10])

The directive first defines that it is an independent loop that will be accelerated. The kernel will need the array that is copied in using the copyin() clause; the whole array is copied in, from zero up to the predefined MAX definition. The resultarray is copied in before the kernel launch and copied out after the kernel has terminated using the copy() clause; only the first ten values are copied.
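A minimal, self-contained sketch of this pattern (our example, not code from the thesis; the array names and the doubling operation are placeholders):

#include <stdio.h>
#define MAX 1000000

int main(void) {
    static float neededarray[MAX], resultarray[MAX];
    for (int i = 0; i < MAX; i++)
        neededarray[i] = (float)i;

    /* neededarray is only read on the device, so copyin() is enough;
       resultarray is copied in before the kernel and back out afterwards. */
    #pragma acc kernels loop independent copyin(neededarray[0:MAX]) copy(resultarray[0:MAX])
    for (int i = 0; i < MAX; i++)
        resultarray[i] = 2.0f * neededarray[i];

    printf("%f\n", resultarray[10]);
    return 0;
}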

3 Methodology

This chapter will document the methodology of the study: vectorization, acceleration with the GPGPU, and a combination of them both. If the reader follows this chapter, the same results should be obtained, provided that the exact same hardware architecture and environment are used. The source files and input files used in this study come from the standard PARSEC 3.0 package [11].

3.1 Vector instructions

In order to vectorize the PARSEC application Black-Scholes we had to learn how to work with vector instructions to begin with. PARSEC already included a vectorized version of Black-Scholes, but to understand it and adjust it to AVX we needed to know more about vector instructions. To achieve our goal of vectorizing Black-Scholes we had to start from the beginning. We decided to figure out when and where it is possible to use vector instructions. First of all we wanted to know when we could apply vector instructions to the code, and then evaluate when we thought it was worth applying them. It was important for us to analyze and know what actually happened with the code. When we reached this point we would evolve our code to a more advanced stage, and later, with our new knowledge, work with Black-Scholes. When researching vector instructions and looking at Black-Scholes, we narrowed down the critical sections in which we could utilize the vector instructions. These critical sections contained loops and arrays. With this knowledge we could minimize the area of code we wished to look at. Knowing that we should focus on loops and arrays, we created a simple program using arrays and two kinds of loops: one dependent loop and one non-dependent loop. We predicted that the non-dependent loop was going to work, because if there were no dependencies there would be no restrictions on working with multiple instructions at the same time.

So we were more interested in the dependent loop. As suspected, the non-dependent loop was vectorized while the dependent loop could not be vectorized. The way we were able to determine whether a loop could be vectorized was to use the Intel C++ Compiler (ICC). This compiler allows us to automatically check the code and validate whether a loop can or cannot be vectorized. If a loop can be vectorized, it will automatically be vectorized in the most optimal way the compiler is capable of. This is very much like the gcc -O2 option, which allows for automatic optimization. What ICC also contributes is a report option that prints which loops have been vectorized. Using these features we could determine which parts of the code we could apply vector instructions to. To further analyze our auto-vectorized code, we created an assembler file using a compilation command and compared the non-vectorized and vectorized code to determine which instructions the compiler used and how it decided to vectorize the code. We also used GDB (the GNU Project Debugger) with split options so we could see the assembler code while debugging the program. We were able to find the instructions which used the XMM or YMM CPU registers that are characteristic of vector instructions. Learning from this analysis, we considered ourselves ready to evolve our code and apply these vector instructions on our own. To evolve our code we decided to create a new version for SSE and one for AVX, and time the three methods against each other to see if we could manage to control the vector instructions and also compare the time difference between our results. We successfully produced three different parallel functions: using AVX, using SSE and without vector instructions. At this point we felt that we could apply and understand vector instructions to the point where we could look into Black-Scholes. Looking into the vectorized Black-Scholes code, we could discuss the content and understand how it would be mapped to the registers. To convert the SSE functions to AVX we followed the already implemented definition set used to select whether the program should run with float or double instructions. We implemented a new definition using the AVX 256-bit commands. Correcting this alone was not sufficient, because when using SSE commands you only align to 16 bytes when moving to and from the registers, while AVX uses 32 bytes.
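A sketch of the kind of test loops we mean (our illustration under our own assumptions, not the exact test program used in the study):

/* No loop-carried dependency: every iteration is independent,
   so the compiler is free to vectorize this loop. */
void no_dependency(float *a, const float *b, const float *c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Loop-carried dependency: a[i] depends on a[i-1] from the previous
   iteration, so this loop cannot be vectorized as written. */
void with_dependency(float *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + a[i];
}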

Figure 3.1: SSE and AVX functions used to test vector instructions.
Figure 3.2: Definition set for SSE and AVX.
Figure 3.3: Alignment for SSE and AVX. Note that the _MM_ALIGN16 definition is present in more than one place in the code.

So we had to create our own definition for the AVX set and changed it to the alignment shown in figure 3.3. We also automatically activated OpenMP to allow for parallel execution and used the OpenMP time calculations. With this experiment we received both expected and unexpected results, which we will present in the result section. We ran a lot of different tests to gather the results we considered were needed to determine how well the vector instructions worked with parallelism and how well the GPU managed with smaller data sets. To get the fairest possible comparison between vector instructions and the GPU, we ran the tests on our different hardware architectures to see which of them gave the fairest result.

3.2 GPGPU

The idea was to first learn about CUDA and OpenACC in order to implement Black-Scholes on the GPGPU. The OpenACC standard is very clear and easy to use, hence the first implementation was done with OpenACC.

The Portland Group compiler will be used to accelerate the code on the device. A licence is needed for the compiler, and this was obtained from their website.

3.2.1 OpenACC

To run OpenACC on our test devices we used the PGI (The Portland Group) [10] compiler for the C language (pgcc). Since PGI is one of the founders of OpenACC, there is a lot of help and many tutorials for this compiler, and its availability to us supported our choice to use it. Changing the code is very simple, since there is an OpenMP version of Black-Scholes and OpenMP is very similar to OpenACC. The Black-Scholes algorithm is implemented in the method BlkSchlsEqEuroNoDiv, and the first thing that must be changed is to make this function inline. If we try to compile with functions that are not inlined, the compiler tells us that function calls are not supported. An inlined function is a function whose code can simply be put where the function was called.

fptype BlkSchlsEqEuroNoDiv(
inline fptype BlkSchlsEqEuroNoDiv(

This notifies the compiler that the complete body of the function can be transferred into the code where the function was called. The CNDF method also needs to be inlined in the Black-Scholes application. Now it is time to locate all the data that the device will need inside the function and copy that data in using the copyin() clause. With a quick look in the method we locate that these arrays are used:

price
sptprice
strike
rate
volatility
otime
otype
prices

If we look after the method call, we see that only the prices array is needed afterwards, and therefore we use the copy() clause for that array so it gets copied out from the device after the kernel has terminated. The OpenMP Black-Scholes version is changed to run OpenACC by changing the pragma directive.

OpenMP:
#pragma omp parallel for private(i, price, priceDelta)

OpenACC:
#pragma acc kernels loop independent copyin(price[0:numOptions], sptprice[0:numOptions], strike[0:numOptions], rate[0:numOptions], volatility[0:numOptions], otime[0:numOptions], otype[0:numOptions]) copy(prices[0:numOptions])

When we give the attribute kernels loop, the compiler will try to accelerate the loop below this line on the GPGPU. The independent keyword tells the compiler that the loop is independent, i.e. that no iteration of the loop changes data used in another iteration. Using the PGI accelerator compiler we set four flags:

-fast -ta=nvidia -Minfo=all,accel -Minline

These flags tell the compiler to accelerate the code, that the target should be an Nvidia device, and to give us all the information about the automatic acceleration during compilation. With the flag -Minline we tell the compiler to support inlined functions and to consider float constants to be of type float. The whole compilation process is in the appendix under OpenACC Compile & Run.
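Put together, the accelerated region looks roughly like the sketch below (our reconstruction of the structure, not the exact PARSEC source; the argument list of BlkSchlsEqEuroNoDiv is an assumption):

/* The OpenACC directive is attached directly to the option loop. */
#pragma acc kernels loop independent \
    copyin(sptprice[0:numOptions], strike[0:numOptions], rate[0:numOptions], \
           volatility[0:numOptions], otime[0:numOptions], otype[0:numOptions]) \
    copy(prices[0:numOptions])
for (i = 0; i < numOptions; i++) {
    /* inlined Black-Scholes formula, evaluated for one option per iteration */
    prices[i] = BlkSchlsEqEuroNoDiv(sptprice[i], strike[i], rate[i],
                                    volatility[i], otime[i], otype[i], 0);
}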

The altered code has a timing function for the total execution time. The PGI compiler has support for timings on the device: transfer time and kernel time. To use this, the environment variable PGI_ACC_TIME needs to be set to 1. The code can now be accelerated on the GPGPU with OpenACC, and the whole code is attached in the appendix. We then run tests with the given input files of the PARSEC Black-Scholes application, with input sizes ranging from 4, 16 and 4 000 rows up to 10 million rows. This gives us results that show the speedup and the ratio between transporting data to the GPGPU and executing the algorithm on that data.

3.2.2 Larger input

After managing to run the two major programs, we decided to run tests to gather the information needed to evaluate our problem statement. After doing speedup tests on both the GPU and the CPU, we realised that our data was not sufficient to determine which of the optimization techniques was the optimal one. We then decided to change the code a bit to be able to work with larger sets of data. We generated our own data set, which we ran up to 280 million rows instead of the largest given data set of 10 million rows from PARSEC. With these new data sets we managed to compare the GPU against the CPU successfully. To change the application to do this we need to add a for loop that initializes the data array with data. We need to add it somewhere before the for loop that fills the input arrays. The two for loops will then look like this.

For CPU:

for (loopnum = 0; loopnum < numOptions; ++loopnum) {
    /* dummy values; the original constants are elided */
    data[loopnum].s          = /* ... */;
    data[loopnum].strike     = /* ... */;
    data[loopnum].r          = /* ... */;
    data[loopnum].divq       = /* ... */;
    data[loopnum].v          = /* ... */;
    data[loopnum].t          = /* ... */;
    data[loopnum].OptionType = 'C';
    data[loopnum].divs       = /* ... */;
    data[loopnum].DGrefval   = /* ... */;
}
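Since the dummy constants are elided above, a runnable equivalent with arbitrary placeholder values (our choice, not the values used in the thesis) could look like this:

for (loopnum = 0; loopnum < numOptions; ++loopnum) {
    data[loopnum].s          = 100.0f;  /* spot price       (arbitrary dummy value) */
    data[loopnum].strike     = 100.0f;  /* strike price     (arbitrary dummy value) */
    data[loopnum].r          = 0.02f;   /* risk-free rate   (arbitrary dummy value) */
    data[loopnum].divq       = 0.0f;    /* dividend rate    (arbitrary dummy value) */
    data[loopnum].v          = 0.30f;   /* volatility       (arbitrary dummy value) */
    data[loopnum].t          = 1.0f;    /* time to maturity (arbitrary dummy value) */
    data[loopnum].OptionType = 'C';     /* call option, as in the original loop     */
    data[loopnum].divs       = 0.0f;    /* (arbitrary dummy value) */
    data[loopnum].DGrefval   = 0.0f;    /* reference value  (arbitrary dummy value) */
}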

For GPU:

for (i = 0; i < numOptions; i++) {
    otype[i]      = (data[i].OptionType == 'P') ? 1 : 0;
    sptprice[i]   = data[i].s;
    strike[i]     = data[i].strike;
    rate[i]       = data[i].r;
    volatility[i] = data[i].v;
    otime[i]      = data[i].t;
}

This fills the input arrays with dummy values that will not give us any reasonable results, but we can now see the speedup gained beyond 10 million data rows.

3.3 Combined vector instructions with GPGPU

We created a test file based on the vectorized version of Black-Scholes. We looked at where the program called the calculation algorithm for the first time and divided that part into two methods: one which called the GPU and one which used the CPU. The major problem was to get these methods to run in parallel, meaning that the GPU and the CPU would execute at their fullest potential simultaneously. We used OpenMP to try to get this to work properly, but without success. It was not that we did not manage to run them in parallel, but rather that it was not the best solution for the way our program worked, given the data transfer time compared to the execution time on the GPU. Another problem we ran into was that we had a hard time determining how much data the GPU could actually handle. Despite trying different ways to solve this, we did not manage to create a self-sufficient program that could determine the upper limit of the GPU by itself. So we had to manually adjust the size that would be computed on the GPU and the CPU, or run twice on the GPU. In the code this can be seen as the MAX_SPAN variable, which is the number of data rows transferred to the GPU. We found that the maximum capacity of the GPU in our case was 198 million data rows. After that point, it became interesting to see what we could achieve by combining the GPU with the CPU.
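A rough sketch of how such a manual split can be expressed (our illustration, not the thesis code; MAX_SPAN is the variable mentioned above, while the two helper functions are hypothetical names for the OpenACC and SSE code paths):

void blackscholes_acc(long first, long last);   /* hypothetical OpenACC code path   */
void blackscholes_simd(long first, long last);  /* hypothetical SSE/OpenMP code path */

/* Rows [0, gpuRows) go to the GPU; the remainder is handled on the CPU. */
void run_combined(long numOptions, long MAX_SPAN) {
    long gpuRows = (numOptions > MAX_SPAN) ? MAX_SPAN : numOptions;

    blackscholes_acc(0, gpuRows);               /* GPU: OpenACC-accelerated loop    */
    if (gpuRows < numOptions)
        blackscholes_simd(gpuRows, numOptions); /* CPU: SSE-vectorized OpenMP loop  */
}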

We moved the program over to our GPU cluster Zorn, which was capable of running both vector instructions (SSE4.2) and GPU code properly. But in order to compile with the PGI compiler, a few adjustments had to be made to the vectorized code. The major change was that we could no longer align data on 16 bytes, because the compiler did not allow that. Even though we had to remove the _MM_ALIGN16 definition from the code, the PGI compiler helps out with the vectorization, which meant that we could utilize the SSE part of Black-Scholes anyway.

4 Result

In this chapter all of the results we have gathered will be presented; note that we will not draw any conclusions from the results in this chapter. Conclusions will be presented in the next chapter. We have divided the results into four parts: vector instructions, GPGPU acceleration, the combination, and one part where we present results comparing the GPU and SSE. All of the tests that have generated the results below have been executed at least ten times, and we have selected the median to minimize the risk of gathering results affected by hardware issues or other unforeseen complications. We have also booked and run the tests at different times of day and validated the results to make sure we could utilize the CPU and GPU to their fullest potential.

4.1 Vector instructions

We started by comparing our two CPUs: the third-generation quad-core Intel i7 with hyper-threading and the AMD Opteron 6172 with 48 cores available. The following results were gathered running with the maximum number of threads on both CPUs and the SSE vector instructions:

Table 4.1: SSE. Columns: Hardware Architecture (Intel i7, 4 cores / AMD Opteron), Workload (data rows), Execution time with the maximum number of cores.

The AMD Opteron processor does not support the AVX vector instructions, so only the Intel i7 processor was able to run the following tests. These tests are also done using the maximum number of threads. Note that the data set which contains four rows could not be run with AVX instructions because it was too small.

Table 4.2: AVX. Columns: Hardware Architecture (Intel i7, 4 cores), Workload (data rows), Execution time with the maximum number of cores; the 4-row workload is marked "-".

From the gathered data we could see that the Intel i7 was faster, so we decided to run a few tests to determine which of the data sets would be interesting to look at in a speedup scenario. The only interesting data set was the one with 10 million rows. We generated the following graphs containing execution times for one, two, four and eight cores, using both AVX and SSE, to see if there is a similar pattern.

Figure 4.1: Graph to establish how well vector instructions work with threads for SSE and different sizes of data. Data span: Large.

Figure 4.2: Graph to establish how well vector instructions work with threads for AVX and different sizes of data. Data span: Large.

These tests gave us a view of how well vector instructions can work with parallelism. Using eight threads generated the following speedup:

Speedup = Execution time(1 thread) / Execution time(n threads)
Speedup SSE = 1453,404 / 299,996 = 4,9x
Speedup AVX = 1745,111 / 316,59 = 5,5x

Even though there is a good speedup from the sequential version to eight threads, there is potential for even greater speedup. Going back to the AMD Opteron processor and its 48 cores, we ran new tests to determine how much speedup we could obtain using a lot more cores, in order to see if there was a limit to the speedup with vector instructions. The result can be viewed in figure 4.3 below.

Figure 4.3: Graph to establish how well vector instructions work with threads for AVX with different sizes of data. Data span: Small - Medium.

After only reaching a speedup of 21x on the AMD processor, we decided to run the tests on Zorn to see if this was a hardware issue, and also to determine which of our hardware would be the most suitable for our comparison. The best speedup we could manage to get on Zorn was 10,8x with 16 threads, at a best execution time of 308,5 ms.

4.2 GPGPU acceleration

The following data compares our GPGPU cluster Zorn to a single GTX 680; this result might ease the conclusion about the long initiation time on Zorn.

Table 4.3: OpenACC - Zorn vs Single GPU system. Columns: Hardware (NVIDIA GTX 680 / NVIDIA Tesla M2090), Workload, Transfer time, Algorithm time, Total time.

In table 4.3 we see both the transfer time and the algorithm execution time. We can see that the total time does not add up with the other results. The reason the total time is so much higher is that each GPU has an initiation time. The initiation time can be determined by subtracting the algorithm execution time and the transfer time from the total time. As we can see, moving the data to the GPU is very time consuming. When running Black-Scholes with one of the smaller input sets we see the following performance output, using the environment variable PGI_ACC_TIME=1. The row below shows how much time the acceleration spends on executing the kernel and how much on moving data.

kernels=87ms data=12678ms

This tells us that the time the program spent on moving data is actually 145 times more than the time it spent on the algorithm itself. The power of the GPU only shows when the input rows are many more. When running Black-Scholes with as much as 10 million input rows we see this performance output instead:

kernels=9198ms data=108629ms

In comparison to the previous execution, this execution only spends about 11 times more on data transfer than on algorithm execution. When we do runs larger than 10 million rows, we see that this ratio stagnates at about 10 times more on data transfer than on algorithm execution time.

Figure 4.4: The ratio of how much time the application spends on transferring data and how much it actually spends on the algorithm.

From figure 4.4 we see that the time spent on moving data is overwhelming until the workload is close to 10 million input rows. Then the ratio lies steady at around ten times more on moving data than on the execution.

4.3 Compare GPU to vector instructions

In order to answer our problem statement we have to compare these different optimization techniques against each other. We decided to look at the execution time over a span of data sets. We chose the data span of 10 million to 100 million data rows and compared SSE vs OpenACC, CPU vs GPU.

Figure 4.5: SSE speedup & GPGPU speedup.

We also decided to show how powerful the GPU kernel is when executing the algorithm, and how much time is actually spent transferring the data to the GPU. Using a log10 graph of the execution times in milliseconds, where we compare the CPU and the GPU kernel, we can view the difference in execution time.

Figure 4.6: Execution time of the CPU & the GPU kernel.

4.4 Combined GPU with vector instructions

After comparing the CPU and the GPU, we took the time to combine the two into one solution to determine at which point we benefit from using the CPU and GPU together. As mentioned before, we found the point where the memory of our GPU reached maximum capacity (198 million data rows) and ran tests around that area. So we started to measure the maximum speed for 180 million to 280 million data rows.

Figure 4.7: Execution time of combined Black-Scholes & GPGPU-accelerated Black-Scholes prior to and after the GPGPU has reached maximum capacity.

In figure 4.7 we have used the most optimal time for the combined version prior to the GPU memory allocation cap. During our tests, executing the combined version was most optimal using only the GPU before the cap. We also wanted to view the entire scope we have been working on with all the different techniques. We ran and plotted the following graph (figure 4.8) to show the difference between these three techniques in terms of execution time. We chose the data span from 10 million to 280 million rows to really show the difference in execution time and how these techniques compare to each other.

Figure 4.8: Execution time in seconds of all three versions of Black-Scholes used in the research.

Using this graph we can actually see that we managed to increase the speed from the start as well, by adjusting the SSE version to run on the PGI compiler. We can also see how quickly the SSE-only Black-Scholes version becomes less effective than the GPU version when working with such big data sets.

5 Summary & Conclusion

This chapter will include explanations and ideas about the results in this thesis. It will also include a summary of the whole thesis.

5.1 Conclusion

Within this section we will discuss and evaluate the gathered results. We will talk about both the speedup and the programming effort in the two areas vectorization and GPGPU acceleration.

5.1.1 Vectorization

This part will focus on the results of the vectorized version of Black-Scholes: how well the speed of the vectorized Black-Scholes increases with parallelism, and how much programming effort is needed to compile and run a vectorized Black-Scholes.

Speedup

Our first speedup result was to compare the AVX and SSE instructions to see which one was faster. In terms of speedup we managed to obtain a 5,5 times speedup using the AVX instructions, while SSE only reached a 4,9 times speedup. Still, SSE managed to perform better in absolute execution time in our tests on our only AVX-capable CPU, which only had four cores. We started looking for a reason why this occurred, because generally AVX is said to be faster than SSE. We could not determine for sure why this was happening, but we found what other developers had answered when asked about the same problems. The first interesting reason was pointed out by a developer at Intel, who said that SSE and AVX have the same upper limit for store performance: the amount of data that can be loaded and stored simultaneously is equal. The second interesting point we found was that two move instructions with SSE were faster than one move instruction with AVX. This gave us the realisation that we might have reached the upper limit of store performance, and with AVX using slower move instructions it seemed reasonable that SSE managed to be faster.

However, since AVX managed to reach a higher speedup, we cannot be sure that it would not reach or exceed the same speed as SSE with more cores applied; unfortunately we did not have the hardware to test this. When looking only at SSE in terms of speedup, we were a bit disappointed when it came to utilizing the parallelism. We managed to get speedups, but not in the fashion we expected. Running the vectorized version on our AMD Opteron 6172, we only managed to get a 21 times speedup (see figure 4.3), while the normal version got a 47 times speedup at most. The major concern here was that when applying more than 32 cores we got a speed decrease. This made us wonder whether there was some cap on the registers, or an increased amount of cache misses when applying more cores while working with the vector instructions. We were not able to determine why this happened; we just had to accept the fact that we did not manage to get the speedup we expected. Looking at the speedup generated across the different hardware, we can conclude that our solution does not scale all too well with parallelism. This is not proof that SSE in general is suboptimal with parallelism, but in our case it was.

Programming effort

In terms of programming effort, we managed to avoid a lot of work with the vector instruction version of Black-Scholes since, as mentioned earlier, PARSEC supplied us with an already vectorized version. But comparing the vectorized version with the normal Black-Scholes, it contains a lot of changes, and it would probably have taken a very long time to implement this ourselves and make it as optimal as they have done, even though it is not at all impossible to recreate a fully functional vectorized version. The effort of vectorizing Black-Scholes ourselves would have been demanding, but implementing the AVX support in the already vectorized version took very little effort. That being said, if you have already vectorized a program for SSE, then the transition to AVX is almost effortless, while converting a non-vectorized program into a vectorized one would be difficult and take time if you have no earlier experience of doing so. You really have to understand the code in its entirety to be able to apply the correct functions for the most optimal use.

5.1.2 GPGPU - OpenACC

OpenACC is a good technique for accelerating code on the GPGPU. This thesis chose between OpenCL, CUDA and OpenACC. All three were tried out, but the main research used OpenACC because of the minimal problems it produced when programming and compiling the code.

Speedup

In this part the main discussion will be about the speedup and timings when running Black-Scholes with OpenACC on a GPU cluster. The big issue when accelerating on the GPGPU is the transfer time. This issue cannot be minimized any further; the results tell us that even with the best data optimization possible and the largest workload, the data movement takes about ten times longer than the execution of the algorithm. This conclusion led us to not spend more time trying to optimize the application further. Where we could have done some optimization is in the algorithm part of the application, by mapping the OpenACC code better to CUDA code through specifying the number of grids and blocks to use. Since the data transfer is optimized as much as possible, and the data transfer time is so much larger than the algorithm execution time, that optimization would not have gained the study very much. The speedup we get when we compare running Black-Scholes on the CPU without vector instructions and accelerating it on the GPU is what we had expected. With smaller inputs it gives no speedup at all, and it is actually slower to run on the GPU, since the initiation time and transfer time are longer than the whole execution time on the CPU.

Programming effort

When accelerating code on the GPGPU there are some important concepts to understand, and this is somewhat time consuming. The biggest thing to understand is the architecture of a GPGPU with respect to grids, blocks and threads. This research used OpenACC, where the compiler maps our algorithm to CUDA code. If the acceleration does not need to be optimized any further on the GPGPU, knowledge about the GPGPU architecture is not necessary. Since it is so easy to translate an OpenMP application to OpenACC, and this research was lucky enough to start off with a good OpenMP version of the Black-Scholes application, the programming effort was very small. To accelerate using OpenACC it is still necessary to learn about an OpenACC compiler and the limitations of OpenACC, for example that functions have to be able to be inlined and that ambiguous pointers are not supported.

Learning this does not need to take more than one or two days. To accelerate with plain CUDA or OpenCL, more knowledge is required; from our perspective those techniques are far more complicated.

5.1.3 Combined CPU and GPGPU

In this section we will discuss the results we received when running our combined version of Black-Scholes.

Speedup

We got some interesting results when combining the CPU with the GPU in terms of speedup. We started doing tests at workloads below the maximum capacity of the GPU, where we realised there was no real need to use both the GPU and the CPU, because we had already shown that already at 30 million rows the GPU was much faster than the CPU. We decided to focus on the area where the GPU would reach maximum capacity (180 million to 198 million rows) and see if we could manage to get a speedup with the help of the CPU. We made sure the GPU-only version was faster than the GPU-combined-with-CPU version for our data span before the cap, in order to validate our results. We predicted that when the GPU fills up there would have to be another data transfer once the GPU was done, and for the lower data sets of 10 million and 20 million rows we had already shown that it was worth using the CPU. So these were the most interesting values to look at when we started running the tests. To our surprise, when combining the code for the CPU and the GPU, we actually managed to get a speedup on the vectorized part compared to our previous vectorized version of Black-Scholes. So not only did we get a speedup on the lower part of the data sets, but even up to 60 million data rows we obtained a speedup by using the CPU. We cannot really explain the speedup of the CPU code under the PGI compiler, since the code was already vectorized, but we suspect it has something to do with the alignment of data. When using the align functions you set boundaries in bytes to the amount declared in the alignment functions, and alignment is also used to align data to the cache line to improve cache performance. We suspect that the compiler optimized this alignment in a way that was more efficient than our own alignments.


More information

Hands-on CUDA exercises

Hands-on CUDA exercises Hands-on CUDA exercises CUDA Exercises We have provided skeletons and solutions for 6 hands-on CUDA exercises In each exercise (except for #5), you have to implement the missing portions of the code Finished

More information

An Introduction to Parallel Computing/ Programming

An Introduction to Parallel Computing/ Programming An Introduction to Parallel Computing/ Programming Vicky Papadopoulou Lesta Astrophysics and High Performance Computing Research Group (http://ahpc.euc.ac.cy) Dep. of Computer Science and Engineering European

More information

CUDA programming on NVIDIA GPUs

CUDA programming on NVIDIA GPUs p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view

More information

Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus

Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus A simple C/C++ language extension construct for data parallel operations Robert Geva robert.geva@intel.com Introduction Intel

More information

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization

More information

Parallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild.

Parallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild. Parallel Computing: Strategies and Implications Dori Exterman CTO IncrediBuild. In this session we will discuss Multi-threaded vs. Multi-Process Choosing between Multi-Core or Multi- Threaded development

More information

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,

More information

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Modern GPU

More information

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile GPU Computing with CUDA Lecture 4 - Optimizations Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 3 Control flow Coalescing Latency hiding

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

GPGPU for Real-Time Data Analytics: Introduction. Nanyang Technological University, Singapore 2

GPGPU for Real-Time Data Analytics: Introduction. Nanyang Technological University, Singapore 2 GPGPU for Real-Time Data Analytics: Introduction Bingsheng He 1, Huynh Phung Huynh 2, Rick Siow Mong Goh 2 1 Nanyang Technological University, Singapore 2 A*STAR Institute of High Performance Computing,

More information

OpenACC Programming and Best Practices Guide

OpenACC Programming and Best Practices Guide OpenACC Programming and Best Practices Guide June 2015 2015 openacc-standard.org. All Rights Reserved. Contents 1 Introduction 3 Writing Portable Code........................................... 3 What

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Amin Safi Faculty of Mathematics, TU dortmund January 22, 2016 Table of Contents Set

More information

Black-Scholes option pricing. Victor Podlozhnyuk vpodlozhnyuk@nvidia.com

Black-Scholes option pricing. Victor Podlozhnyuk vpodlozhnyuk@nvidia.com Black-Scholes option pricing Victor Podlozhnyuk vpodlozhnyuk@nvidia.com June 007 Document Change History Version Date Responsible Reason for Change 0.9 007/03/19 vpodlozhnyuk Initial release 1.0 007/04/06

More information

Intelligent Heuristic Construction with Active Learning

Intelligent Heuristic Construction with Active Learning Intelligent Heuristic Construction with Active Learning William F. Ogilvie, Pavlos Petoumenos, Zheng Wang, Hugh Leather E H U N I V E R S I T Y T O H F G R E D I N B U Space is BIG! Hubble Ultra-Deep Field

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

More information

A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators

A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators Sandra Wienke 1,2, Christian Terboven 1,2, James C. Beyer 3, Matthias S. Müller 1,2 1 IT Center, RWTH Aachen University 2 JARA-HPC, Aachen

More information

Turbomachinery CFD on many-core platforms experiences and strategies

Turbomachinery CFD on many-core platforms experiences and strategies Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29

More information

Clustering Billions of Data Points Using GPUs

Clustering Billions of Data Points Using GPUs Clustering Billions of Data Points Using GPUs Ren Wu ren.wu@hp.com Bin Zhang bin.zhang2@hp.com Meichun Hsu meichun.hsu@hp.com ABSTRACT In this paper, we report our research on using GPUs to accelerate

More information

Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1

Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1 Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion

More information

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it t.diamanti@cineca.it Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate

More information

Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and

More information

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

More information

Embedded Systems: map to FPGA, GPU, CPU?

Embedded Systems: map to FPGA, GPU, CPU? Embedded Systems: map to FPGA, GPU, CPU? Jos van Eijndhoven jos@vectorfabrics.com Bits&Chips Embedded systems Nov 7, 2013 # of transistors Moore s law versus Amdahl s law Computational Capacity Hardware

More information

OpenACC Parallelization and Optimization of NAS Parallel Benchmarks

OpenACC Parallelization and Optimization of NAS Parallel Benchmarks OpenACC Parallelization and Optimization of NAS Parallel Benchmarks Presented by Rengan Xu GTC 2014, S4340 03/26/2014 Rengan Xu, Xiaonan Tian, Sunita Chandrasekaran, Yonghong Yan, Barbara Chapman HPC Tools

More information

SUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS - 2013 UPDATE

SUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS - 2013 UPDATE SUBJECT: SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE KEYWORDS:, CORE, PROCESSOR, GRAPHICS, DRIVER, RAM, STORAGE SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE Below is a summary of key components of an ideal SolidWorks

More information

Introduction to CUDA C

Introduction to CUDA C Introduction to CUDA C What is CUDA? CUDA Architecture Expose general-purpose GPU computing as first-class capability Retain traditional DirectX/OpenGL graphics performance CUDA C Based on industry-standard

More information

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),

More information

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007 Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

More information

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as

More information

Texture Cache Approximation on GPUs

Texture Cache Approximation on GPUs Texture Cache Approximation on GPUs Mark Sutherland Joshua San Miguel Natalie Enright Jerger {suther68,enright}@ece.utoronto.ca, joshua.sanmiguel@mail.utoronto.ca 1 Our Contribution GPU Core Cache Cache

More information

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Qingyu Meng, Alan Humphrey, Martin Berzins Thanks to: John Schmidt and J. Davison de St. Germain, SCI Institute Justin Luitjens

More information

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute

More information

Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual

Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual Overview Metrics Monitor is part of Intel Media Server Studio 2015 for Linux Server. Metrics Monitor is a user space shared library

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

CUDA Programming. Week 4. Shared memory and register

CUDA Programming. Week 4. Shared memory and register CUDA Programming Week 4. Shared memory and register Outline Shared memory and bank confliction Memory padding Register allocation Example of matrix-matrix multiplication Homework SHARED MEMORY AND BANK

More information

High Performance GPGPU Computer for Embedded Systems

High Performance GPGPU Computer for Embedded Systems High Performance GPGPU Computer for Embedded Systems Author: Dan Mor, Aitech Product Manager September 2015 Contents 1. Introduction... 3 2. Existing Challenges in Modern Embedded Systems... 3 2.1. Not

More information

Accelerating variant calling

Accelerating variant calling Accelerating variant calling Mauricio Carneiro GSA Broad Institute Intel Genomic Sequencing Pipeline Workshop Mount Sinai 12/10/2013 This is the work of many Genome sequencing and analysis team Mark DePristo

More information

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application

More information

GeoImaging Accelerator Pansharp Test Results

GeoImaging Accelerator Pansharp Test Results GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance

More information

Generations of the computer. processors.

Generations of the computer. processors. . Piotr Gwizdała 1 Contents 1 st Generation 2 nd Generation 3 rd Generation 4 th Generation 5 th Generation 6 th Generation 7 th Generation 8 th Generation Dual Core generation Improves and actualizations

More information

INTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism

INTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism Intel Cilk Plus: A Simple Path to Parallelism Compiler extensions to simplify task and data parallelism Intel Cilk Plus adds simple language extensions to express data and task parallelism to the C and

More information

Accelerating sequential computer vision algorithms using OpenMP and OpenCL on commodity parallel hardware

Accelerating sequential computer vision algorithms using OpenMP and OpenCL on commodity parallel hardware Accelerating sequential computer vision algorithms using OpenMP and OpenCL on commodity parallel hardware 25 August 2014 Copyright 2001 2014 by NHL Hogeschool and Van de Loosdrecht Machine Vision BV All

More information

Direct GPU/FPGA Communication Via PCI Express

Direct GPU/FPGA Communication Via PCI Express Direct GPU/FPGA Communication Via PCI Express Ray Bittner, Erik Ruf Microsoft Research Redmond, USA {raybit,erikruf}@microsoft.com Abstract Parallel processing has hit mainstream computing in the form

More information

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS Performance 1(6) HIGH PERFORMANCE CONSULTING COURSE OFFERINGS LEARN TO TAKE ADVANTAGE OF POWERFUL GPU BASED ACCELERATOR TECHNOLOGY TODAY 2006 2013 Nvidia GPUs Intel CPUs CONTENTS Acronyms and Terminology...

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

GPGPUs, CUDA and OpenCL

GPGPUs, CUDA and OpenCL GPGPUs, CUDA and OpenCL Timo Lilja January 21, 2010 Timo Lilja () GPGPUs, CUDA and OpenCL January 21, 2010 1 / 42 Course arrangements Course code: T-106.5800 Seminar on Software Techniques Credits: 3 Thursdays

More information

AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL. AMD Embedded Solutions 1

AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL. AMD Embedded Solutions 1 AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL AMD Embedded Solutions 1 Optimizing Parallel Processing Performance and Coding Efficiency with AMD APUs and Texas Multicore Technologies SequenceL Auto-parallelizing

More information

Four Keys to Successful Multicore Optimization for Machine Vision. White Paper

Four Keys to Successful Multicore Optimization for Machine Vision. White Paper Four Keys to Successful Multicore Optimization for Machine Vision White Paper Optimizing a machine vision application for multicore PCs can be a complex process with unpredictable results. Developers need

More information

~ Greetings from WSU CAPPLab ~

~ Greetings from WSU CAPPLab ~ ~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)

More information

The Bus (PCI and PCI-Express)

The Bus (PCI and PCI-Express) 4 Jan, 2008 The Bus (PCI and PCI-Express) The CPU, memory, disks, and all the other devices in a computer have to be able to communicate and exchange data. The technology that connects them is called the

More information

EE361: Digital Computer Organization Course Syllabus

EE361: Digital Computer Organization Course Syllabus EE361: Digital Computer Organization Course Syllabus Dr. Mohammad H. Awedh Spring 2014 Course Objectives Simply, a computer is a set of components (Processor, Memory and Storage, Input/Output Devices)

More information

GPGPU Parallel Merge Sort Algorithm

GPGPU Parallel Merge Sort Algorithm GPGPU Parallel Merge Sort Algorithm Jim Kukunas and James Devine May 4, 2009 Abstract The increasingly high data throughput and computational power of today s Graphics Processing Units (GPUs), has led

More information

Lattice QCD Performance. on Multi core Linux Servers

Lattice QCD Performance. on Multi core Linux Servers Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most

More information

OpenCL Programming for the CUDA Architecture. Version 2.3

OpenCL Programming for the CUDA Architecture. Version 2.3 OpenCL Programming for the CUDA Architecture Version 2.3 8/31/2009 In general, there are multiple ways of implementing a given algorithm in OpenCL and these multiple implementations can have vastly different

More information

Parallelization: Binary Tree Traversal

Parallelization: Binary Tree Traversal By Aaron Weeden and Patrick Royal Shodor Education Foundation, Inc. August 2012 Introduction: According to Moore s law, the number of transistors on a computer chip doubles roughly every two years. First

More information

PCI vs. PCI Express vs. AGP

PCI vs. PCI Express vs. AGP PCI vs. PCI Express vs. AGP What is PCI Express? Introduction So you want to know about PCI Express? PCI Express is a recent feature addition to many new motherboards. PCI Express support can have a big

More information

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket

More information

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 g_suhakaran@vssc.gov.in THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833

More information

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics 22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC

More information