Project INF BigData


Roberto Fontanarosa, Tobias Rupp, and Steffen Hirschmann

Figure 1: Plot of the learned function from the checker board data set.

Abstract

Prediction and forecasting have become very important in modern society. Regression analysis makes it possible to predict from given data. This paper focuses on regression analysis on spatially adaptive sparse grids using the existing toolbox SG++. The analysis is implemented on graphics cards using NVIDIA's CUDA technology. Several big data sets and the results obtained with the authors' implementation are presented. Finally, the results are compared to existing implementations.

Index Terms: Sparse Grids, Regression Analysis, Big Data, NVIDIA CUDA

1 INTRODUCTION

Today almost all information is stored on computers. In recent years data sets have grown steadily, and they are kept because large amounts of disk space have become affordable. These large data sets are called big data, and they are produced almost everywhere: in medicine, astrophysics, banks and online shops, to name only a few. Big data can be collected for many purposes. One is to learn from the data about future events, so-called prediction, which can be done via regression analysis. Since this is quite time consuming, this paper addresses regression analysis on sparse grids.

1.1 Regression Analysis

Regression analysis is a statistical tool for the investigation of relationships between variables. [13] The investigator usually seeks to ascertain the causal effect of one variable upon another. An example is the change of the weather when the wind changes direction. Independent variables represent the input, while dependent variables represent the output. The result of a regression analysis is a regression function y = f(x), where x is an independent variable and y the dependent variable. After a function has been learned, one must ensure its quality.
Because this testing should never be done with the training data, separate data must be provided for testing. To this end the initial data set is split into two parts. Typically the ratio between training and testing data is 2 : 1.

To actually perform a regression analysis on a sparse grid, the following equation has to be solved according to [11]:

$$\left( \frac{1}{M} B B^T + \lambda C \right) \alpha = \frac{1}{M} B y \qquad (1)$$

where $\lambda$ is the regularisation parameter, $\varphi_i$ are the basis functions and the coefficient vector $\alpha$ of $f_N$ is the solution. The $N \times N$ matrix $C$, $c_{ij} = \int \varphi_i(x) \varphi_j(x) \, dx$, stems from the smoothness term; the $N \times M$ and $M \times N$ matrices $B$ and $B^T$, $b_{ij} = \varphi_i(x_j)$, and the vector $y$ of the target values $y_i$ from the error term. [11] Here $N$ is the number of grid points and $M$ the number of data points.

Authors' e-mail: {fontanro, rupptl, hirschsn}@studi.informatik.uni-stuttgart.de

1.2 Practical use

As stated above, the typical practical use is prediction and forecasting. Regression analysis is also used to understand which of the independent and dependent variables are related. Prediction and forecasting are applied widely. Most people encounter obvious applications such as weather forecasts, but these techniques are also at work behind sophisticated product recommendations or classifications in astrophysics. One could even use them in medical diagnosis.

2 RELATED WORK

SG++ is a toolbox that allows spatially adaptive sparse grids to be used without great expense. It is flexible and removes the vast initial overhead that otherwise has to be spent on implementing sparse grids and the corresponding algorithms. To be able to deal with different kinds of problems in a spatially adaptive way - ranging from interpolation and quadrature via the solution of differential equations to regression, classification, and more - a main motivation behind the development and all considerations was to create a toolbox which can be used in a very flexible and modular way by different users in different applications.
SG++ is capable of doing regression analysis on the CPU and on the GPU. CPU implementations may use the traditional recursive sparse grid algorithms or the iterative ones described in [1]. They can be parallelized using OpenMP or MPI. In addition, Heinecke provided an implementation which was specially adapted to Intel CPU architectures. There is also a module which performs the calculations using OpenCL, so SG++ is already able to calculate on the GPU. Since OpenCL is not able to utilize modern NVIDIA graphics cards to their maximum capacity, we made the effort to implement regression analysis using NVIDIA's CUDA technology.

3 COMPUTE UNIFIED DEVICE ARCHITECTURE

The Compute Unified Device Architecture (CUDA) is a parallel computing architecture from NVIDIA. It allows calculations to be performed on NVIDIA graphics cards. Since the architecture of graphics cards differs greatly from that of CPUs, CUDA allows huge speedups in several applications, particularly highly parallelizable ones (compare [3]). CUDA has been used successfully in astrophysics, computational biology, fluid dynamics simulation and many other fields. It is available for most modern NVIDIA graphics cards from the GeForce, ION, Quadro and Tesla series. For a list of all CUDA-capable devices see [5].

3.1 Concepts

CUDA is an extension to the C programming language. It adds three main abstractions: thread hierarchy, shared memory and synchronization [3].

Thread Hierarchy

Every CUDA thread executes a so-called CUDA kernel. This is a special function which is executed on the device and called from the host. In this context, device means the graphics card and host the CPU. Threads are gathered in blocks. Threads from the same block are scheduled together, reside on the same processor core and share this core's memory. Currently, a block may contain up to 1024 threads [3]. However, the threads of one block may not cover the whole problem. Consequently, blocks are arranged in a so-called grid. Both block size and grid size can be up to three-dimensional. These sizes must be specified at kernel launch time. Which values give the best performance depends on the actual calculations, the amount of memory that needs to be shared and the device used.

Memory Hierarchy

A CUDA-capable graphics card provides different kinds of memory. The longest access times occur at the highest level, where global memory resides. It is shared throughout the whole device and can be accessed from the host via copy methods.
Basically, every program that wants to calculate something on the device must first copy its data to global memory. Memory shared by a block is called shared memory. It is a low-level memory which can be accessed very fast. Graphics cards with a compute capability of 2.0 or higher also have a cache (like CPUs) [3]. The lowest cache level provides shared memory as well as level 1 cache, and one may set a preference between these two.

There are also other kinds of memory on a graphics card which we do not use, namely constant memory, texture memory and surface memory. We explicitly made use of shared memory when loading data, so these types would not have given any advantage, since they are all located in global memory. Besides this, we needed neither the byte addressing of surface memory nor the filtering support which texture memory offers.

4 PRESENTATION OF THE PRODUCT

As already mentioned, a program which performs a regression analysis on sparse grids has to solve equation (1). Matrix $B$ is formed by evaluating every basis function of the sparse grid at every data point. For a given grid point $g$ with level $l_g$ and index $i_g$ the tensor product basis function $\varphi_g$ is evaluated according to the following rule, where $x_m$ is the data point and $d$ the dimensionality of the problem:

$$\varphi_g(x_m) = \prod_{k=1}^{d} \max\left( 1 - \left| 2^{l_{g,k}} x_{m,k} - i_{g,k} \right|,\ 0 \right) \qquad (2)$$

Figure 2: Important part of the architecture of SG++. LearnerVectorizedIdentity instantiates a DMSystemMatrixVectorizedIdentity which uses an OperationMultipleEvalIterativeX to provide the mult and multTranspose functions. These operation classes typically rely on so-called kernels to implement the mult functions.

4.1 Functionality from SG++

SG++ provides classes for creating sparse grids and refining them. It implements the abstract learning process for regression analysis. Since our product will be included in SG++, we relied on it as much as possible to minimize redundancies.
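To make equations (1) and (2) concrete, the following Python sketch builds $B$ for a tiny one-dimensional grid and solves the regularised system directly. It is only an illustration under stated assumptions: $C$ is taken to be the identity matrix, the grid, data points and target values are made-up example values, all function names are hypothetical, and a dense Gaussian elimination stands in for the iterative solver a real implementation would use. None of this is SG++ code.

```python
def hat(level, index, x):
    """One-dimensional linear basis function max(1 - |2^level * x - index|, 0)."""
    return max(1.0 - abs((2 ** level) * x - index), 0.0)


def basis(levels, indices, point):
    """Tensor-product basis function of equation (2) for one grid point."""
    value = 1.0
    for level, index, x in zip(levels, indices, point):
        value *= hat(level, index, x)
    return value


def build_b(grid, data):
    """N x M matrix B with b[i][j] = phi_i(x_j); grid holds (levels, indices) pairs."""
    return [[basis(levels, indices, x) for x in data] for levels, indices in grid]


def build_system(b, y, lam):
    """Both sides of equation (1), ASSUMING C is the identity matrix."""
    n, m = len(b), len(y)
    lhs = [[sum(b[i][k] * b[j][k] for k in range(m)) / m + (lam if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    rhs = [sum(b[i][k] * y[k] for k in range(m)) / m for i in range(n)]
    return lhs, rhs


def solve_dense(lhs, rhs):
    """Gaussian elimination with partial pivoting (stand-in for an iterative solver)."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(lhs, rhs)]  # augmented copy, inputs untouched
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):  # back substitution
        s = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - s) / aug[row][row]
    return x


# Hypothetical example: three 1D grid points, seven data points, and target
# values sampled from the first basis function itself.
grid = [((1,), (1,)), ((2,), (1,)), ((2,), (3,))]
data = [(k / 8.0,) for k in range(1, 8)]
y = [hat(1, 1, x[0]) for x in data]
alpha = solve_dense(*build_system(build_b(grid, data), y, lam=1e-6))
```

Because the targets are sampled from the first basis function and the regularisation is weak, the recovered coefficient vector is close to (1, 0, 0). For realistic grid and data set sizes the matrix $B$ is far too large to be formed and factorised like this, which is why only matrix-vector products with $B$ and $B^T$ are used in practice.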
The abstract learning process implemented in SG++ uses the BiCGSTAB solver to solve the system of linear equations (1). Like all CG-type algorithms, it only requires matrix-vector multiplications. For parallel applications, this multiplication with the system matrix $B B^T + \lambda C$ is implemented in a class called DMSystemMatrixVectorizedIdentity. As the name indicates, the identity matrix serves as $C$. This class, in turn, instantiates an OperationMultipleEvalIterative which implements the routines for multiplication with $B$ and $B^T$. A general overview of the important SG++ classes is given in Figure 2.

The Authors' Implementation

Since we used all the existing SG++ infrastructure, all that was left to do was to define an OperationMultipleEval which implements the multiplications using CUDA. For this purpose we designed a simple layered architecture. At the top layer there is the OperationMultipleEvalIterativeCUDA. It implements the basic functionality required by SG++, such as measuring the execution time. It also instantiates a class from the middle layer, called CUDAKernelWrapper, which essentially does memory management. After the KernelWrapper has prepared the device, it executes the respective function from the lowest layer, called CUDAKernels. This module defines the CUDA kernels which do the actual calculation of $v = B u$ and $u = B^T \alpha$ with $b_{ij} = \varphi_i(x_j)$ as defined in equation (2). An overview of how our classes are integrated in SG++ is given in Figure 3. All classes and modules have been implemented in single and double precision, following the architecture of SG++. In the following, we describe these layers, or rather classes, in more detail.

OperationMultipleEvalIterativeCUDA (OMI)

This class represents the interface between SG++ and the authors' implementation. Because it is derived from the general OperationMultipleEval classes, it inherits the required multiplication methods from SG++. Basically, it instantiates a KernelWrapper object and forwards all multiplication method calls to it. Moreover, as mentioned above, it implements a simple time measurement.

Figure 3: Integration of the authors' classes into SG++ (right). One can easily spot the layered architecture.

CUDAKernelWrapper

The CUDAKernelWrapper has to upload all the data needed by the kernel to the device. This includes uploading the grid and the data set and allocating memory for the result. The KernelWrapper has to handle the case that the data set is too big to fit entirely into device memory and, if necessary, has to split up the data set. After preparing the device memory, the KernelWrapper invokes the kernel with pointers to the allocated memory and the respective sizes. After the kernel execution, the result is downloaded into main memory and returned to the caller. Moreover, the KernelWrapper has to handle the calculation for surplus grid points which cannot be covered by the kernels because of the chosen block and grid sizes. This surplus calculation is CPU based, and its implementation follows the general algorithms described in the CUDAKernels subsection. For the sake of high performance, most of the outer loops in these algorithms are parallelized using OpenMP.

CUDAKernels

As mentioned above, the kernels are the lowest level of the layered architecture. They have to implement the actual multiplication with $B$ and $B^T$, where $B = (b_{ij})$ with $b_{ij} = \varphi_i(x_j)$. The evaluation of the basis functions $\varphi$ is defined in equation (2). The grid points' levels and indices are provided by SG++ as matrices in $\mathbb{R}^{g \times d}$, where $g$ is the number of grid points and $d$ the number of dimensions. The data set is a matrix in $\mathbb{R}^{m \times d}$, where $m$ is the number of data points. Result and source vectors are simple arrays. The implementations of mult and multTranspose are adapted from Heinecke [1].
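The two products can be sketched on the CPU as follows. This is a minimal Python sketch of the loop structure only, assuming linear basis functions without boundary treatment and the data layout described above (one row of levels and indices per grid point, one row of coordinates per data point). The function names, the helper apply_system (which assumes $C$ is the identity and uses the $1/M$ scaling of equation (1)) and the example values are hypothetical; the code makes no attempt to mirror the authors' CUDA kernels.

```python
def phi(levels_g, indices_g, x_m):
    """Tensor-product linear basis function of one grid point at one data point."""
    value = 1.0
    for level, index, x in zip(levels_g, indices_g, x_m):
        value *= max(1.0 - abs((2 ** level) * x - index), 0.0)
    return value


def mult(levels, indices, data, v):
    """u = B v: one result entry per grid point, summed over all data points."""
    u = []
    for levels_g, indices_g in zip(levels, indices):  # independent outer iterations
        acc = 0.0
        for x_m, v_m in zip(data, v):
            acc += v_m * phi(levels_g, indices_g, x_m)
        u.append(acc)
    return u


def mult_transpose(levels, indices, data, alpha):
    """v = B^T alpha: one result entry per data point, summed over all grid points."""
    v = []
    for x_m in data:  # independent outer iterations
        acc = 0.0
        for levels_g, indices_g, alpha_g in zip(levels, indices, alpha):
            acc += alpha_g * phi(levels_g, indices_g, x_m)
        v.append(acc)
    return v


def apply_system(levels, indices, data, alpha, lam):
    """(1/M) B B^T alpha + lam * alpha, ASSUMING C is the identity matrix."""
    m = len(data)
    u = mult(levels, indices, data, mult_transpose(levels, indices, data, alpha))
    return [u_g / m + lam * a_g for u_g, a_g in zip(u, alpha)]


# Hypothetical two-dimensional example: three grid points, three data points.
levels = [[1, 1], [2, 1], [2, 1]]
indices = [[1, 1], [1, 1], [3, 1]]
data = [[0.25, 0.5], [0.5, 0.5], [0.75, 0.25]]

u = mult(levels, indices, data, [1.0, 1.0, 1.0])
v = mult_transpose(levels, indices, data, [1.0, 1.0, 1.0])
```

Each iteration of the outer loop in both functions is independent of the others, which is what makes these products attractive for a massively parallel device. A quick consistency check for such a pair of routines is that the inner product of alpha with B v equals the inner product of B^T alpha with v for arbitrary vectors.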
The basic algorithm for the mult function with linear basis functions is shown in Algorithm 1; multTranspose is shown in Algorithm 2. Here l denotes the matrix of the grid's levels, i the matrix of the grid's indices and x the data set.

Algorithm 1 Pseudo code for function mult. Calculates u = Bv. See [1].
  for g ← 0..g_max do
    u_g ← 0
    for m ← 0..m_max do
      temp ← v_m
      for d ← 0..d_max do
        temp ← temp · max(1 − |2^(l_{g,d}) · x_{m,d} − i_{g,d}|, 0)
      u_g ← u_g + temp

Algorithm 2 Pseudo code for function multTranspose. Calculates v = B^T α. See [1].
  for m ← 0..m_max do
    v_m ← 0
    for g ← 0..g_max do
      temp ← α_g
      for d ← 0..d_max do
        temp ← temp · max(1 − |2^(l_{g,d}) · x_{m,d} − i_{g,d}|, 0)
      v_m ← v_m + temp

As one can see, the innermost loop is the evaluation of the basis function. We also implemented modified basis functions which can handle boundary values. These require a slight change of the basis function evaluation and lead to a case distinction in the inner loop if the function is adjacent to the boundary. All inner basis functions are evaluated in the same way as the linear basis functions.

These algorithms are converted to CUDA kernels in a straightforward way. The outermost for loop is omitted; instead g_max (mult) or m_max (multTranspose) threads are launched, each with the loop variable as its thread index. The kernels make use of several CUDA techniques to achieve good performance. First of all, the kernels are implemented as templates with the dimensionality as template argument. This enables the compiler to unroll the innermost loop, because it has a constant trip count. To force the NVIDIA CUDA compiler to unroll loops with a constant trip count, #pragma unroll has additionally been placed in front of the loop. Another benefit of template kernels is that a thread can keep its associated grid or data point in registers, since the register count must be determined at compile time.

The algorithms require each thread to access the same data or grid point at the same time. To ensure maximum performance, the kernel first loads this data into shared memory before giving all threads access to it. In most cases the grid size is smaller than the data set size. If the ratio between the two gets too large, performance suffers. To take care of this problem, the mult kernel, which iterates over all data set points, can be launched with a block size greater than 1 in y direction. The mult kernel then splits up the data set evenly between these y threads. Afterwards they accumulate their results using the atomicAdd instruction. Atomic instructions are well known for their poor performance, but as the time is reduced roughly by a factor of 1/y if the device has enough resources, the overall performance is better. The kernels also allow this accumulation to take place in shared memory rather than in global memory. CUDA supports atomic operations only for single precision. Self-coded atomic operations, which do not rely directly on hardware synchronization, lead to poor performance. We therefore decided to support this y thread concept only for single precision.

Kernel Calls

The CUDAKernels module also provides sample kernel calls which are prepared to handle a dimensionality of up to 100 using template kernels. A call with a greater number of dimensions will not be able to use template kernels. The predefined block size is 640, since this gave the best performance in the tests we carried out. The number of y threads used for a specific kernel call is determined automatically at run time.

5 KERNEL BENCHMARK

In order to ensure the functionality of the kernel, it has been tested heavily. Later, we used these test programs to perform benchmarks on the kernel without any software other than the test drivers and the kernels themselves. The results are presented in this section.
5.1 Configuration

The kernel has been tested on several NVIDIA graphics cards, namely GeForce 560Ti, GeForce GTX680, Tesla K10 and Tesla K20. On the K10 only one of the two available GPUs has been used. Ubuntu [...] LTS (kernel [...]) served as operating system, using the NVIDIA Linux graphics driver [...] for the 560Ti, GTX680 and K10. The K20 was installed in a Red Hat Linux machine with kernel [...] and NVIDIA Linux graphics driver [...]. All cards were installed in a PCIe 2.0 slot. CUDA compilation tools, release 5.0, V[...], were used to compile the executables for all benchmarks with these compiler flags: -O3 -use_fast_math -arch=sm_20 -DFORCE_UNROLL. Since the 560Ti only supports compute capability 2.0, we used -arch=sm_20 for its benchmarks. For benchmarks with a dimensionality higher than two, we additionally set the -DUSE_SHARED flag. The benchmarks have been conducted in single and double precision.

As a reference, we also ran a CPU-based kernel, which was implemented according to Heinecke [1], just like the CPU calculations in the KernelWrapper. It has been parallelized for shared memory with OpenMP, using all available cores of an Intel Core i[...] (@ 3.4 GHz). This processor is based on the Sandy Bridge architecture and has four physical cores. In addition, it uses Intel's simultaneous multithreading implementation, Hyper-Threading Technology (HTT) [2]. This makes a total of eight parallel OpenMP tasks. The CPU-based calculation was run on the Ubuntu machine named above. GCC [...] (Ubuntu/Linaro [...]ubuntu5) has been used with the following compiler flags: -fopenmp -mfpmath=sse -msse3 -O3.

Random numbers in [0,1) created by rand() of the glibc served as input for the data set and the grid's levels and indices. Every result shown has been averaged over ten tests. The speedup $S = T_{CPU} / T_{GPU}$ is shown in parentheses, rounded to two decimal places. The times are as exact as they can be measured using sys/time.h. Here, only the results of the linear no boundary version are presented.
The versions with modified boundary functions perform slightly worse, but the ratio between the different cards is very similar. The following abbreviations are used for the sake of brevity: ds (data set size), gs (grid size), dims (number of dimensions), sp (single precision), dp (double precision).

5.2 Results

Dimensional comparison

We fixed the data set and grid size and varied the number of dimensions. Calculations have been carried out using single precision. The results are shown in Table 1.

Remarks: The 560Ti has too few registers, so it was not able to calculate the 21 dimensions using 640 threads per block. Reducing the number of threads per block would have solved this problem, but we did not want to vary it.

Data Set Size Comparison

In this benchmark, the number of dimensions was fixed and the data set and grid size were varied. These calculations were also carried out using single precision. Table 2 shows the results.

Single-double precision comparison

This final benchmark compares the single precision performance with the double precision performance of the cards. We chose a moderate problem size for this benchmark. The results are shown in Table 3.

Conclusion

As one can easily see, the card performing best in our benchmarks is the GTX680. It has a clock frequency of 1006MHz [7]. The Tesla K10 has a clock frequency of 745MHz [4]. Since the GTX680 and the Tesla K10 both use the Kepler architecture, this explains most of the difference. Also, we do not explicitly make use of the innovations of shader model 3.x. Implicitly, through -arch=sm_30, the compiler may use some of the benefits discussed in [3], such as more resident blocks and threads per multiprocessor or twice as many registers per multiprocessor. The 560Ti runs at 822MHz [6]. This is approximately 18% lower than the clock frequency of the GTX680. However, the GTX680 performs up to 1.86 times faster than the 560Ti in our benchmarks.
Besides the shader clock frequency, the GTX680 has a higher memory clock frequency and a newer architecture. According to NVIDIA, the peak performance of the GTX680 is 2.44 times higher than that of the 560Ti (optimal conditions; only fused multiply-add instructions). Since we include data transfers, this number is not realistic for our benchmarks. Besides this, our instructions are far from being perfect multiply-add instructions. We even have to call max and abs functions. In single precision the K20 is as good as the GTX680. The more interesting part of this card is its double precision performance. As one can see, the quotient $T_{dp} / T_{sp}$ is almost comparable to that of the CPU. This is a tremendously higher double precision performance than that of any other tested card.

Table 1: Variation of dims; parameters were: sp, ds = [...], gs = [...]

              dims = 1         dims = 11        dims = 21
i[...]        [...]s (1)       51.35s (1)       1min 40.3s (1)
560Ti         [...]s (26.65)   [...]s (42.05)   -
GTX680        [...]s (45.55)   [...]s (62.74)   [...]s (74.37)
K10 (single)  [...]s (30.28)   [...]s (42.25)   [...]s (49.72)
K20           [...]s (19.95)   [...]s (59.78)   [...]s (63.49)

Figure 4: See Table 1 for description. All times in seconds.

Table 2: Variation of ds and gs; parameters were: sp, dims = 5

              ds = gs = [...]   ds = gs = [...]   ds = gs = [...]
i[...]        [...]s (1)        2min [...]s (1)   3h 47min [...]s (1)
560Ti         [...]s (30.36)    [...]s (38.49)    5min [...]s (38.7)
GTX680        [...]s (53.88)    [...]s (70.05)    3min [...]s (72.2)
K10 (single)  [...]s (36.66)    [...]s (46.73)    4min [...]s (48.96)
K20           [...]s (34.84)    [...]s (65.41)    3min [...]s (71.42)

Figure 5: See Table 2 for description. All times in seconds.

Performance

Finally, we ran performance tests. The tests were carried out on different configurations, some with multiple cards or GPUs. We used test sizes where the kernels roughly have their best overall performance. Therefore, ds must equal gs since, again, both kernels were executed in sequence. For configurations with multiple cards, the distribution of the problem between the cards was based on the data set. An evenly sized part of the data set was assigned to each card or GPU. Each of them then calculated its respective intermediate result and from this its final result. Afterwards the multiple vectors were added together, again using CUDA. To accumulate the results of two cards we simply transferred one of the results to the other card using cudaMemcpyPeer. This card then added the two results together. As above, the timings we measured include all memory transfers and, for multiple cards or GPUs, the accumulation of the result. Table 4 shows the achieved effective floating point operations per second for single precision. These do not count the operations which are actually executed by the GPU but rather the operations needed for the abstract algorithm as shown in Algorithms 1 and 2.

Table 4: Performance overview at gs = ds = [...], dims = 20. For multiple cards or GPUs, ds was multiplied by the total number of cards. On the 560Ti, only 16 dimensions were used (see above).

Configuration    eff. GFLOPS
560Ti            (300)
K10 (1 GPU)      420
K10 (2 GPUs)     800
GTX680           [...]
2x GTX680        [...]

One can see that the use of two cards does not double the performance. This is most likely due to the fact that the two-card benchmark had to accumulate two partial results after the actual kernel execution. For a comparison to the OpenCL version built into SG++, see section 7.

6 TESTS

Our product has been tested with different kinds of big data sets to make sure that it works correctly with every kind of data set.
6.1 Data Sets

To compare our results to others, we first tested the checker board data set and the five-dimensional astrophysical data set (DR5). The DR5 data set is a real-world data set that contains photometric data. Regression analysis on it allows astrophysicists to predict the spectroscopic redshift of galaxies. The checker board data set has a size of [...] and three dimensions. DR5 has [...] instances and five dimensions. Both data sets are included in SG++.

As already mentioned above, we tested our product with many different data sets that vary in their size, number of dimensions and number of attributes. We wanted to cover every kind of data set, so that we can be confident our product works correctly regardless of the data set's size or dimensionality.

Afterwards we processed the Contraceptive Method Choice (CMC) data set. It is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The Contraceptive Method Choice data set has 1473 instances and ten attributes. The goal is to predict the current contraceptive method choice of a non-pregnant woman based on her demographic and socio-economic characteristics. [10]

After that we processed the Spambase data set, which has a size of 4601 and 57 dimensions. It contains information about spam mails. Its goal is to find out whether an e-mail is spam or not. [9]

The Poker Hand data set was the next one we tested. It has 11 dimensions and a size of [...], which makes it one of the biggest data sets we tested. The goal is to find out whether a poker hand is a Straight Flush, Royal Flush, two pair etc. The data consists of 10 numbers indicating the suit (hearts, spades, diamonds or clubs) and the value (2, 3, 4, etc.) of 5 cards, followed by an 11th number that indicates the poker hand (Straight Flush, Royal Flush, two pair etc.) [12].

The liver disorders data set has 6 dimensions, which are medical attributes of persons. The objective is to classify the 681 instances according to whether they suffer from a liver disorder or not.

The objective of the skin segmentation data set is to segment people into groups by their skin color [8]. It therefore has 4 attributes (the RGB values of the skin color and a group). It has [...] instances.

Table 3: single vs. double precision benchmark. Parameters were: ds = [...], gs = [...], dims = 10.

              sp                    dp                 T_dp / T_sp
i[...]        [...]min [...]s (1)   2min [...]s (1)    [...]
560Ti         [...]s (42.17)        [...]s (10.03)     4.45
GTX680        [...]s (61.72)        [...]s (12.46)     5.24
K10 (single)  [...]s (44.32)        [...]s (8.2)       5.72
K20           [...]s (64.93)        [...]s (43.82)     1.57

Figure 6: See Table 3 for description. All times in seconds.

6.2 Configuration

We performed the regression analysis, respectively classification, on the GTX680 with the following parameters:

Table 5: Test parameters for the NativeClassifyBenchmark of SG++.

parameter                    value
start level                  3
λ                            1e-6
CG max. iterations           250
CG eps.                      1e-4
#refinements                 6
refinement threshold         0.0
#points refined              100
CG max. iter (first steps)   20
CG eps. (first steps)        1e-1

For all tests the respective data set has been split into two thirds training instances and one third testing instances. When there was a big imbalance between grid size and data set size (number of training instances) and the data set had only very few dimensions, we chose to use y threads rather than shared memory for single precision. The corresponding results in the next section are marked with an asterisk.

6.3 Performance

Since we did not vary the parameters for any data set, the standard linear grid was sometimes not able to capture all the features of a particular data set. Hence, the mean squared error of the regression was too big to call it successful.
We put the corresponding results in brackets. Be aware that the resulting timings of this benchmark also depend on SG++, since we relied on it for creating and refining the sparse grid and for implementing the abstract learning process. For timings regarding only the CUDA part written by the authors see section 5. The results are shown in Table 6.

Remarks: Considering the results and the size of the Poker Hand data set, it evidently needs quite exact calculations, so that double precision has an advantage over single precision, leading to a less refined grid and fewer CG iterations.

7 CONCLUSIONS

With his OpenCL implementation, Heinecke achieved an execution time of 740 seconds for the DR5 data set, using single precision and a linear grid. He performed these tests on an NVIDIA GTX 470. Since we did not have this particular card available for testing, the results have to be compared by a rule of thumb estimate. The NVIDIA GTX 470 has a peak performance of about 1088 GFLOPS, the NVIDIA GTX680 of around 3090 GFLOPS. So the card Heinecke used was approximately three times slower in peak performance. If we divide his result by three, the comparison should be more equitable. The quotient is about 247 seconds. This is still higher than our achieved 189 seconds. The conclusion might be that CUDA is better suited for doing heavy computations on NVIDIA graphics cards.

CUDA Features

Since we relied directly on CUDA, we could make use of CUDA-specific features:

- CUDA streams: we are able to upload/download and compute on different data simultaneously.
- Pre-compiled kernels: this can actually be a disadvantage or lead to more work; see the template kernel we had to use in order to unroll.
- Explicit exploitation of the memory hierarchy: we made use of shared memory to share the same data among all threads in a block as fast as possible.
- Explicit memory management: by the use of cudaMalloc() and cudaFree() one is more flexible in memory management compared to OpenCL's buffer object, especially when it comes to working with streams.
- Atomic operations: they allow for rapid merging of the results of different threads (although they are not the best choice most of the time).

The Implementation

We also implemented the kernels a bit differently, which may have benefited the measurements. We used restricted pointers for all pointers to device memory in the kernel (restricted pointers are a promise not to point to aliased memory, and therefore the compiler is able to optimize more aggressively). The OpenCL kernels are only compiled with strict aliasing rules (-fstrict-aliasing), which is a less limiting variant of restricted pointers.

Table 6 (columns: Checker Board, DR5, CMC, Poker Hand, Spambase, Liver Disorders, Skin Segmentation; rows: sp and dp, each with linear and mod grid type): Timings of the conducted data set regressions, respectively classifications. For a description of the data sets see section 6.1. For the configuration see section 6.2 and for performance remarks see section 6.3.

Also, we implemented the possibility to use y threads: if there is a strong disproportion between data set size and grid size, several threads can be used to compute the same grid point. The kernels are implemented as templates, which ensures the unrolling of the innermost loop at compile time. The OpenCL implementation does not have to rely on this feature, since it is compiled at run time and so the innermost loop can always be unrolled. But this has to be done at run time.

In general it can be assumed that specific CUDA commands are better optimized than the corresponding OpenCL commands, which must be more general. Furthermore, it can be assumed that NVIDIA's CUDA drivers are of better quality than their OpenCL drivers. Sadly, using CUDA means a loss of portability. As stated above, run-time compiled code can be a huge benefit when it comes to creating code that is well suited for a specific task that can only be determined at run time. There may be many more benefits of using OpenCL, but the authors are not very familiar with OpenCL. The stakeholders of a project should therefore weigh which technique is better suited for their particular project.

REFERENCES

[1] Alexander Heinecke and Dirk Pflüger. Emerging architectures enable to boost massively parallel data mining using adaptive sparse grids. pages 14 16, [...].
[2] Intel Corporation. Intel Core i[...] Processor. website, March [...].
[3] NVIDIA Corporation. NVIDIA CUDA C Programming Guide, Version 4.2.
[4] NVIDIA Corporation. Tesla K10 GPU Accelerator - Board Specification, November [...]. tesla/pdf/nv_ds_teslak_family_may_2012_lr.pdf.
[5] NVIDIA Corporation. A Supervised Machine Learning Algorithm for Arrhythmia Analysis. website, March [...]. ics.uci.edu/ml/datasets/arrhythmia.
[6] NVIDIA Corporation. GeForce GTX 560 Ti and GeForce GTX 550 Ti. website, March [...]. product-geforce-gtx-560ti-gtx-550ti-us.html.
[7] NVIDIA Corporation. NVIDIA GeForce GTX 680. website, March [...]. geforce-gtx-680-in.html#pdpcontent=2.
[8] Rajen B. Bhatt et al. Efficient skin region segmentation using low complexity fuzzy decision tree model. In IEEE-INDICON, pages 1-4, Ahmedabad, India, [...].
[9] M. Hopkins, E. Reeber, G. Forman, and J. Suermondt. archive.ics.uci.edu/ml/datasets/spambase.
[10] T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. In Machine Learning, [...].
[11] Dirk Pflüger. Spatially adaptive sparse grids for high-dimensional problems. pages [...], [...].
[12] R. Cattral, F. Oppacher, and D. Deugo. Evolutionary Data Mining with Automatic Rule Generalization. Recent Advances in Computers, Computing and Communications. This was a slightly different dataset that had more classes, and was considerably more difficult.
[13] Alan O. Sykes. An introduction to regression analysis. Sykes_.Regression.pdf.

Intelligent Heuristic Construction with Active Learning

Intelligent Heuristic Construction with Active Learning Intelligent Heuristic Construction with Active Learning William F. Ogilvie, Pavlos Petoumenos, Zheng Wang, Hugh Leather E H U N I V E R S I T Y T O H F G R E D I N B U Space is BIG! Hubble Ultra-Deep Field

More information

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011 Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis

More information

Case Study on Productivity and Performance of GPGPUs

Case Study on Productivity and Performance of GPGPUs Case Study on Productivity and Performance of GPGPUs Sandra Wienke wienke@rz.rwth-aachen.de ZKI Arbeitskreis Supercomputing April 2012 Rechen- und Kommunikationszentrum (RZ) RWTH GPU-Cluster 56 Nvidia

More information

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?

More information

OpenACC 2.0 and the PGI Accelerator Compilers

OpenACC 2.0 and the PGI Accelerator Compilers OpenACC 2.0 and the PGI Accelerator Compilers Michael Wolfe The Portland Group michael.wolfe@pgroup.com This presentation discusses the additions made to the OpenACC API in Version 2.0. I will also present

More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

QCD as a Video Game?

QCD as a Video Game? QCD as a Video Game? Sándor D. Katz Eötvös University Budapest in collaboration with Győző Egri, Zoltán Fodor, Christian Hoelbling Dániel Nógrádi, Kálmán Szabó Outline 1. Introduction 2. GPU architecture

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

GPU Parallel Computing Architecture and CUDA Programming Model

GPU Parallel Computing Architecture and CUDA Programming Model GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel

More information

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization

More information

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware

More information

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.

More information

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get

More information

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it t.diamanti@cineca.it Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate

More information

Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms

Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms Björn Rocker Hamburg, June 17th 2010 Engineering Mathematics and Computing Lab (EMCL) KIT University of the State

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

ultra fast SOM using CUDA

ultra fast SOM using CUDA ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and

More information

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:

More information

Accelerating CFD using OpenFOAM with GPUs

Accelerating CFD using OpenFOAM with GPUs Accelerating CFD using OpenFOAM with GPUs Authors: Saeed Iqbal and Kevin Tubbs The OpenFOAM CFD Toolbox is a free, open source CFD software package produced by OpenCFD Ltd. Its user base represents a wide

More information

Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

More information

HP ProLiant SL270s Gen8 Server. Evaluation Report

HP ProLiant SL270s Gen8 Server. Evaluation Report HP ProLiant SL270s Gen8 Server Evaluation Report Thomas Schoenemeyer, Hussein Harake and Daniel Peter Swiss National Supercomputing Centre (CSCS), Lugano Institute of Geophysics, ETH Zürich schoenemeyer@cscs.ch

More information

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

More information

2: Computer Performance

2: Computer Performance 2: Computer Performance http://people.sc.fsu.edu/ jburkardt/presentations/ fdi 2008 lecture2.pdf... John Information Technology Department Virginia Tech... FDI Summer Track V: Parallel Programming 10-12

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware

More information

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi Assessing the Performance of OpenMP Programs on the Intel Xeon Phi Dirk Schmidl, Tim Cramer, Sandra Wienke, Christian Terboven, and Matthias S. Müller schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum

More information

A quick tutorial on Intel's Xeon Phi Coprocessor

A quick tutorial on Intel's Xeon Phi Coprocessor A quick tutorial on Intel's Xeon Phi Coprocessor www.cism.ucl.ac.be damien.francois@uclouvain.be Architecture Setup Programming The beginning of wisdom is the definition of terms. * Name Is a... As opposed

More information

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Computer Engineering by Amol

More information

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas

More information

GPUs for Scientific Computing

GPUs for Scientific Computing GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

More information

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

Experiences on using GPU accelerators for data analysis in ROOT/RooFit Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,

More information

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,

More information

Evaluation of CUDA Fortran for the CFD code Strukti

Evaluation of CUDA Fortran for the CFD code Strukti Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center

More information

Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration. Sina Meraji sinamera@ca.ibm.com

Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration. Sina Meraji sinamera@ca.ibm.com Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration Sina Meraji sinamera@ca.ibm.com Please Note IBM s statements regarding its plans, directions, and intent are subject to

More information

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 g_suhakaran@vssc.gov.in THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833

More information

GPGPU accelerated Computational Fluid Dynamics

GPGPU accelerated Computational Fluid Dynamics t e c h n i s c h e u n i v e r s i t ä t b r a u n s c h w e i g Carl-Friedrich Gauß Faculty GPGPU accelerated Computational Fluid Dynamics 5th GACM Colloquium on Computational Mechanics Hamburg Institute

More information

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase

More information

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics 22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC

More information

Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture

Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture White Paper Intel Xeon processor E5 v3 family Intel Xeon Phi coprocessor family Digital Design and Engineering Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture Executive

More information

Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism

Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism Jianqiang Dong, Fei Wang and Bo Yuan Intelligent Computing Lab, Division of Informatics Graduate School at Shenzhen,

More information

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS Performance 1(6) HIGH PERFORMANCE CONSULTING COURSE OFFERINGS LEARN TO TAKE ADVANTAGE OF POWERFUL GPU BASED ACCELERATOR TECHNOLOGY TODAY 2006 2013 Nvidia GPUs Intel CPUs CONTENTS Acronyms and Terminology...

More information

Choosing a Computer for Running SLX, P3D, and P5

Choosing a Computer for Running SLX, P3D, and P5 Choosing a Computer for Running SLX, P3D, and P5 This paper is based on my experience purchasing a new laptop in January, 2010. I ll lead you through my selection criteria and point you to some on-line

More information

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

More information

Real-time Visual Tracker by Stream Processing

Real-time Visual Tracker by Stream Processing Real-time Visual Tracker by Stream Processing Simultaneous and Fast 3D Tracking of Multiple Faces in Video Sequences by Using a Particle Filter Oscar Mateo Lozano & Kuzahiro Otsuka presented by Piotr Rudol

More information

NVIDIA GeForce GTX 580 GPU Datasheet

NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines

More information

THE NAS KERNEL BENCHMARK PROGRAM

THE NAS KERNEL BENCHMARK PROGRAM THE NAS KERNEL BENCHMARK PROGRAM David H. Bailey and John T. Barton Numerical Aerodynamic Simulations Systems Division NASA Ames Research Center June 13, 1986 SUMMARY A benchmark test program that measures

More information

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015 GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once

More information

Data-parallel Acceleration of PARSEC Black-Scholes Benchmark

Data-parallel Acceleration of PARSEC Black-Scholes Benchmark Data-parallel Acceleration of PARSEC Black-Scholes Benchmark AUGUST ANDRÉN and PATRIK HAGERNÄS KTH Information and Communication Technology Bachelor of Science Thesis Stockholm, Sweden 2013 TRITA-ICT-EX-2013:158

More information

GeoImaging Accelerator Pansharp Test Results

GeoImaging Accelerator Pansharp Test Results GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance

More information

CUDA programming on NVIDIA GPUs

CUDA programming on NVIDIA GPUs p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view

More information

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket

More information

Texture Cache Approximation on GPUs

Texture Cache Approximation on GPUs Texture Cache Approximation on GPUs Mark Sutherland Joshua San Miguel Natalie Enright Jerger {suther68,enright}@ece.utoronto.ca, joshua.sanmiguel@mail.utoronto.ca 1 Our Contribution GPU Core Cache Cache

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

Introduction to GPU Programming Languages

Introduction to GPU Programming Languages CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information

Turbomachinery CFD on many-core platforms experiences and strategies

Turbomachinery CFD on many-core platforms experiences and strategies Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29

More information

ST810 Advanced Computing

ST810 Advanced Computing ST810 Advanced Computing Lecture 17: Parallel computing part I Eric B. Laber Hua Zhou Department of Statistics North Carolina State University Mar 13, 2013 Outline computing Hardware computing overview

More information

Graphic Processing Units: a possible answer to High Performance Computing?

Graphic Processing Units: a possible answer to High Performance Computing? 4th ABINIT Developer Workshop RESIDENCE L ESCANDILLE AUTRANS HPC & Graphic Processing Units: a possible answer to High Performance Computing? Luigi Genovese ESRF - Grenoble 26 March 2009 http://inac.cea.fr/l_sim/

More information

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011 Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

More information

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,

More information

High Performance Computing in CST STUDIO SUITE

High Performance Computing in CST STUDIO SUITE High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver

More information

Accelerating variant calling

Accelerating variant calling Accelerating variant calling Mauricio Carneiro GSA Broad Institute Intel Genomic Sequencing Pipeline Workshop Mount Sinai 12/10/2013 This is the work of many Genome sequencing and analysis team Mark DePristo

More information

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS Julien Demouth, NVIDIA STAC-A2 BENCHMARK STAC-A2 Benchmark Developed by banks Macro and micro, performance and accuracy Pricing and Greeks for American

More information

Part I Courses Syllabus

Part I Courses Syllabus Part I Courses Syllabus This document provides detailed information about the basic courses of the MHPC first part activities. The list of courses is the following 1.1 Scientific Programming Environment

More information

Big-data Analytics: Challenges and Opportunities

Big-data Analytics: Challenges and Opportunities Big-data Analytics: Challenges and Opportunities Chih-Jen Lin Department of Computer Science National Taiwan University Talk at 台 灣 資 料 科 學 愛 好 者 年 會, August 30, 2014 Chih-Jen Lin (National Taiwan Univ.)

More information

GPGPU Computing. Yong Cao

GPGPU Computing. Yong Cao GPGPU Computing Yong Cao Why Graphics Card? It s powerful! A quiet trend Copyright 2009 by Yong Cao Why Graphics Card? It s powerful! Processor Processing Units FLOPs per Unit Clock Speed Processing Power

More information

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

Optimizing Code for Accelerators: The Long Road to High Performance

Optimizing Code for Accelerators: The Long Road to High Performance Optimizing Code for Accelerators: The Long Road to High Performance Hans Vandierendonck Mons GPU Day November 9 th, 2010 The Age of Accelerators 2 Accelerators in Real Life 3 Latency (ps/inst) Why Accelerators?

More information

Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age

Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age Xuan Shi GRA: Bowei Xue University of Arkansas Spatiotemporal Modeling of Human Dynamics

More information

Learn CUDA in an Afternoon: Hands-on Practical Exercises

Learn CUDA in an Afternoon: Hands-on Practical Exercises Learn CUDA in an Afternoon: Hands-on Practical Exercises Alan Gray and James Perry, EPCC, The University of Edinburgh Introduction This document forms the hands-on practical component of the Learn CUDA

More information

OpenACC Programming and Best Practices Guide

OpenACC Programming and Best Practices Guide OpenACC Programming and Best Practices Guide June 2015 2015 openacc-standard.org. All Rights Reserved. Contents 1 Introduction 3 Writing Portable Code........................................... 3 What

More information

Summer Student Project Report

Summer Student Project Report Summer Student Project Report Dimitris Kalimeris National and Kapodistrian University of Athens June September 2014 Abstract This report will outline two projects that were done as part of a three months

More information

CS 147: Computer Systems Performance Analysis

CS 147: Computer Systems Performance Analysis CS 147: Computer Systems Performance Analysis CS 147: Computer Systems Performance Analysis 1 / 39 Overview Overview Overview What is a Workload? Instruction Workloads Synthetic Workloads Exercisers and

More information

PyFR: Bringing Next Generation Computational Fluid Dynamics to GPU Platforms

PyFR: Bringing Next Generation Computational Fluid Dynamics to GPU Platforms PyFR: Bringing Next Generation Computational Fluid Dynamics to GPU Platforms P. E. Vincent! Department of Aeronautics Imperial College London! 25 th March 2014 Overview Motivation Flux Reconstruction Many-Core

More information

Cluster performance, how to get the most out of Abel. Ole W. Saastad, Dr.Scient USIT / UAV / FI April 18 th 2013

Cluster performance, how to get the most out of Abel. Ole W. Saastad, Dr.Scient USIT / UAV / FI April 18 th 2013 Cluster performance, how to get the most out of Abel Ole W. Saastad, Dr.Scient USIT / UAV / FI April 18 th 2013 Introduction Architecture x86-64 and NVIDIA Compilers MPI Interconnect Storage Batch queue

More information

High Performance Matrix Inversion with Several GPUs

High Performance Matrix Inversion with Several GPUs High Performance Matrix Inversion on a Multi-core Platform with Several GPUs Pablo Ezzatti 1, Enrique S. Quintana-Ortí 2 and Alfredo Remón 2 1 Centro de Cálculo-Instituto de Computación, Univ. de la República

More information

GURLS: A Least Squares Library for Supervised Learning

GURLS: A Least Squares Library for Supervised Learning Journal of Machine Learning Research 14 (2013) 3201-3205 Submitted 1/12; Revised 2/13; Published 10/13 GURLS: A Least Squares Library for Supervised Learning Andrea Tacchetti Pavan K. Mallapragada Center

More information

Retargeting PLAPACK to Clusters with Hardware Accelerators

Retargeting PLAPACK to Clusters with Hardware Accelerators Retargeting PLAPACK to Clusters with Hardware Accelerators Manuel Fogué 1 Francisco Igual 1 Enrique S. Quintana-Ortí 1 Robert van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores.

More information

SIDN Server Measurements

SIDN Server Measurements SIDN Server Measurements Yuri Schaeffer 1, NLnet Labs NLnet Labs document 2010-003 July 19, 2010 1 Introduction For future capacity planning SIDN would like to have an insight on the required resources

More information

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy

More information

HPC Wales Skills Academy Course Catalogue 2015

HPC Wales Skills Academy Course Catalogue 2015 HPC Wales Skills Academy Course Catalogue 2015 Overview The HPC Wales Skills Academy provides a variety of courses and workshops aimed at building skills in High Performance Computing (HPC). Our courses

More information

ArcGIS Pro: Virtualizing in Citrix XenApp and XenDesktop. Emily Apsey Performance Engineer

ArcGIS Pro: Virtualizing in Citrix XenApp and XenDesktop. Emily Apsey Performance Engineer ArcGIS Pro: Virtualizing in Citrix XenApp and XenDesktop Emily Apsey Performance Engineer Presentation Overview What it takes to successfully virtualize ArcGIS Pro in Citrix XenApp and XenDesktop - Shareable

More information

The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA

The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA The Evolution of Computer Graphics Tony Tamasi SVP, Content & Technology, NVIDIA Graphics Make great images intricate shapes complex optical effects seamless motion Make them fast invent clever techniques

More information
