Efficient Gather and Scatter Operations on Graphics Processors

Efficient Gather and Scatter Operations on Graphics Processors

Bingsheng He #, Naga K. Govindaraju *, Qiong Luo #, Burton Smith *
# Hong Kong Univ. of Science and Technology   * Microsoft Corp.
{saven, luo}@cse.ust.hk   {nagag, burtons}@microsoft.com

ABSTRACT

Gather and scatter are two fundamental data-parallel operations, where a large number of data items are read (gathered) from or are written (scattered) to given locations. In this paper, we study these two operations on graphics processing units (GPUs). With superior computing power and high memory bandwidth, GPUs have become a commodity multiprocessor platform for general-purpose high-performance computing. However, due to the random access nature of gather and scatter, a naive implementation of the two operations suffers from a low utilization of the memory bandwidth and consequently a long, unhidden memory latency. Additionally, the architectural details of the GPUs, in particular the memory hierarchy design, are unclear to the programmers. Therefore, we design multi-pass gather and scatter operations to improve their data access locality, and develop a performance model to help understand and optimize these two operations. We have evaluated our algorithms in sorting, hashing, and sparse matrix-vector multiplication in comparison with their optimized CPU counterparts. Our results show that these optimizations yield 2-4X improvement on the GPU bandwidth utilization and 30-50% improvement on the response time. Overall, our optimized GPU implementations are 2-7X faster than their optimized CPU counterparts.

Categories and Subject Descriptors
E.5 [Files]: Optimization, sorting/searching; I.3.1 [Computer Graphics]: Hardware Architecture - graphics processors, parallel processing.

General Terms
Algorithms, Measurement, Performance.

Keywords
Gather, scatter, parallel processing, graphics processors, cache optimization.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SC07, November 10-16, 2007, Reno, Nevada, USA. (c) 2007 ACM.

1. INTRODUCTION

The increasing demand for faster scientific routines to enable physics and multimedia applications on commodity PCs has transformed the graphics processing unit (GPU) into a massively parallel general-purpose co-processor [9]. These applications exhibit a large amount of data parallelism and map well to the data-parallel architecture of the GPU. For instance, the current NVIDIA GPU has over 128 data-parallel processors and a peak memory bandwidth of 86 GB/s. Many numerical algorithms including matrix multiplication [6][7][], sorting [][5][], LU decomposition [9], and fast Fourier transforms [4][8] have been designed on GPUs. Due to the high performance capabilities of GPUs, these scientific algorithms achieve a multi-fold performance improvement over optimized CPU-based algorithms. In order to further improve the performance of GPU-based scientific algorithms and enable optimized implementations similar to those for the CPU-based algorithms, recent GPUs include support for inter-processor communication using shared local stores, and support for scatter and gather operations [6].
Scatter and gather operations are two fundamental operations in many scientific and enterprise computing applications. These operations are implemented as native collective operations in message passing interfaces (MPI) to define communication patterns across the processors [4], and in parallel programming languages such as ZPL [8] and HPF []. Scatter operations write data to arbitrary locations and gather operations read data from arbitrary locations. Both operations are highly memory intensive and form the basic primitives to implement many parallel algorithms such as quicksort [], sparse matrix transpose [8], and others.

In this paper, we study the performance of scatter and gather operations on GPUs. Figure 1 shows the execution time of the scatter and the gather on a GPU with the same input array but either sequential or random read/write locations. The input array is 128 MB. The detailed experimental setup is described in Section 3.5. This figure shows that the location distribution greatly affects the performance of the scatter and the gather. For sequential locations, if we consider the data transfer of the output data array and internal data structures, both operations are able to achieve a system bandwidth of over 60 GB/s. In contrast, for random locations, both operations yield a low utilization of the memory bandwidth. The performance comparison indicates that locality in memory accesses is an important factor for the performance of gather and scatter operations on GPUs.

Figure 1. The elapsed time of the scatter and the gather on a GPU. Both operations underutilize the bandwidth for random locations.

GPU memory architectures are significantly different from CPU memory architectures []. Specifically, GPUs consist of high-bandwidth, high-latency video memory, and the GPU cache sizes are significantly smaller than the CPUs'; therefore, the performance characteristics of scatter and gather operations on GPUs may involve different optimizations than corresponding CPU-based algorithms. Additionally, it was only recently that the scatter functionality was introduced on GPUs, and there is little work in identifying the performance characteristics of the scatter on GPUs.

In this paper, we present a probabilistic analysis to estimate the performance of scatter and gather operations on GPUs. Our analysis accounts for the memory access locality and the parallelism in GPUs. Using our analysis, we design optimized algorithms for scatter and gather. In particular, we design multi-pass algorithms to improve the locality in the memory access and thereby improve the memory performance of these operations. Based on our analysis, our algorithms are able to determine the number of passes required to improve the performance of scatter and gather algorithms. Moreover, we demonstrate through experiments that our performance model is able to closely estimate the performance characteristics of the operations. As a result, our analysis can be applied in runtime systems to generate better execution plans for scatter and gather operations.

We use our scatter and gather to implement three common applications on an NVIDIA GeForce 8800 GPU (G80): radix sort using scatter operations, and the hash search and the sparse-matrix vector multiplication using gather operations. Our results indicate that our optimizations can greatly improve the utilization of the memory bandwidth. Specifically, our optimized algorithms achieve a 2-4X performance improvement over single-pass GPU-based implementations. We also compared the performance of our algorithms with optimized CPU-based algorithms on high-end multi-core CPUs. In practice, our results indicate a 2-7X performance improvement over CPU-based algorithms.

Organization: The remainder of the paper is organized as follows. We give a brief overview of the GPU and the prior work on GPU- or CPU-based cache-efficient algorithms in Section 2. Section 3 describes our performance model and presents our optimization techniques for GPUs. In Section 4, we use the optimized scatter and gather to improve the performance of sorting, the hash search, and the sparse-matrix vector multiplication on the GPU. Finally, we conclude in Section 5.

2. BACKGROUND AND RELATED WORK

In this section, we first review techniques in general-purpose computing on the GPU. Next, we survey the cache-optimized techniques on the CPU and the GPU.

2.1 GPGPU (General-purpose computation on GPU)

A GPU is a SIMD processing unit with high memory bandwidth primarily intended for use in computer graphics rendering applications. In recent years, the GPU has become extremely flexible and programmable. This programmability is strengthened in every major generation (roughly every 18 months) [9]. Such programming flexibility further facilitates the use of the GPU as a general-purpose processor. Recently, NVIDIA has released the CUDA (Compute Unified Device Architecture) framework [6] together with the G80 GPU for general-purpose computation.
Unlike DirectX or OpenGL, CUDA provides a programming model for a thread-based programming language similar to C. Thus, programmers can take advantage of GPUs without requiring graphics know-how. Two of the newly exposed features on the G80 GPU, data scattering and on-chip shared memory for the processing engines, are exploited by our implementation.

GPUs have recently been used for various applications such as matrix operations [6][7][] and FFT computation [4][8]. We now briefly survey the techniques that use GPUs to improve the performance of scientific and database operations. The GPU-based scientific applications include linear algebra computations such as matrix multiplication [6] and sparse matrix computations [5]. Govindaraju et al. presented novel GPU-based algorithms for relational database operators including selections, aggregations [] as well as sorting [], and for data mining operations such as computing frequencies and quantiles for data streams []. For additional information on the state-of-the-art GPGPU techniques, we refer the reader to a recent survey by Owens et al. [9].

The existing work mainly develops OpenGL/DirectX programs to exploit the specialized hardware features of GPUs. In contrast, we use a general-purpose programming model to utilize the GPU hardware features, and implement our algorithms with two common building blocks, i.e., scatter and gather. Recently, Sengupta et al. implemented the segmented scan using the scatter on the GPU []. Our optimization on the scatter can be applied to the segmented scan to further improve its performance.

There is relatively little work on developing efficient algorithms for scatter and gather on GPUs, even though these two operations are commonly provided primitives in traditional MPI architectures [4]. Previous-generation GPUs support gather but do not directly support scatter. Buck described algorithms to implement the scatter using the gather [4]. However, these algorithms usually require complex mechanisms such as sorting, which can result in low bandwidth utilization. Even though new-generation GPUs have more programming flexibility supporting both scatter and gather, the high programmability does not necessarily achieve high bandwidth utilization, as seen in Figure 1. These existing limitations motivate us to develop a performance model to understand the scatter and the gather on the GPU, and to improve their overall performance.
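To make the thread-based CUDA model referred to above concrete, here is a minimal sketch of our own (not code from the paper; all names are illustrative): each thread derives a global index from its block and thread IDs and operates on one array element, and the host picks the block and grid sizes at launch time.

// Minimal CUDA sketch of the thread-based programming model (illustrative only).
#include <cuda_runtime.h>

__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] += 1.0f;                            // each thread owns one element
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    int threadsPerBlock = 256;                      // threads are grouped into blocks,
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // and blocks into a grid
    addOne<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}

The scatter and gather primitives studied in Section 3 can be expressed as kernels of exactly this form, with the per-thread index used to address a location array.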

2.2 Memory optimizations

Memory stalls are an important factor for the overall performance of data-intensive applications, such as relational database systems []. For the design and analysis of memory-efficient algorithms, a number of memory models have been proposed, such as the external memory model [] (also known as the cache-conscious model) and the cache-oblivious model [7]. Our model is similar to the external memory model but applies to the parallel computation on the graphics processor. In addition to modeling the cache cost, Bailey proposed a memory model to estimate the cost of memory bank contention on the vector computer []. In contrast, we focus on the cache performance of the graphics processor.

The algorithms reducing the memory stalls can be categorized into two kinds, cache-conscious [] and cache-oblivious [7]. Cache-conscious algorithms utilize knowledge of cache parameters, such as the cache size. On the other hand, cache-oblivious algorithms do not assume any knowledge of cache parameters. Cache-conscious techniques have been extensively used to improve the memory performance of CPU-based algorithms. LaMarca et al. [6] studied the cache performance of quicksort and showed that cache optimizations can significantly improve its performance. Boncz et al. [] proposed radix clustering with a multi-pass partitioning method in order to optimize the cache performance. Williams et al. [] proposed cache optimizations for matrix operations on the Cell processor.

Memory optimization techniques have also been shown useful for GPU-based algorithms. These memory optimizations need to adapt to the massively threaded architecture of the GPU. Govindaraju et al. [] proposed a memory model for scientific applications and showed that optimizations based on the model greatly improved the overall performance. Fatahalian et al. [6] analyzed the cache and bandwidth utilization of matrix-matrix multiplication. Galoppo et al. [9] designed block-based data access patterns to optimize the cache performance of dense linear systems. In comparison, we propose a general performance model for scatter and gather, and use a multi-pass scheme to improve the cache locality.

3. MODEL AND ALGORITHM

In this section, we first present our memory model of the graphics processor. Next, we give the definitions of the scatter and the gather. We then present our modeling and optimization techniques for the scatter and the gather. Finally, we present our evaluation results for our techniques.

3.1 Memory model on GPU

Current GPUs achieve a high memory bandwidth using a wide memory interface, e.g., the memory interface of the G80 is 384-bit. In order to mask the high memory latency, GPUs have small L1 and L2 SRAM caches. Compared with the CPU, the GPU typically has a flat memory hierarchy, and the details of the GPU memory hierarchy are often missing in vendor hardware specifications. We model the GPU memory as a two-level hierarchy, the cache and the device memory, as shown in Figure 2. A data access to the cache can be either a hit or a miss. If the data is found in the cache, the access is a hit. Otherwise, it is a miss and a memory block is fetched from the device memory. Given the number of cache misses, #miss, and that of cache hits, #hit, the cache hit rate, h, is defined as h = #hit / (#hit + #miss).

We model the GPU as M SIMD multiprocessors. At any time, all processors in a multiprocessor execute the same instruction. The threads on each multiprocessor are grouped into thread warps.
A warp is the minimum schedule unit on a multiprocessor. If a warp is stalled by a memory access, it is put at the end of a waiting list of thread warps. Then, the first warp in the waiting list becomes active and is scheduled to run on the multiprocessor.

Figure 2. The memory model of the GPU (multiprocessors 1..M, a small cache, and high-bandwidth device memory). The GPU consists of M SIMD multiprocessors. The cache is shared by all the multiprocessors.

3.2 Definitions

Gather and scatter are dual operations. A scatter performs indexed writes to an array, and a gather performs indexed reads from an array. We define the two operations in Figure 3. The array L for the scatter contains distinct write locations for each R_in tuple, and that for the gather contains the read locations for each R_out tuple. Essentially, the scatter consists of sequential reads and random writes. In contrast, the gather consists of sequential writes and random reads. Figure 4 illustrates examples of the scatter and the gather.

Primitive: Scatter
Input: R_in[1..n], L[1..n]. Output: R_out[1..n].
Function: R_out[L[i]] = R_in[i], i = 1, ..., n.

Primitive: Gather
Input: R_in[1..n], L[1..n]. Output: R_out[1..n].
Function: R_out[i] = R_in[L[i]], i = 1, ..., n.

Figure 3. Definitions of scatter and gather.

To quantify the performance of the scatter and gather operations, we define the effective bandwidth for the sequential access (resp. the random access) to be the ratio of the amount of the data array tuples accessed by the sequential access (resp. the random access) in the scatter and gather operations to the elapsed time. This measure indicates how much bandwidth is utilized to access the target (effective) data. That is, we define two kinds of effective bandwidth, the sequential and the random bandwidth, to distinguish the sequential and the random access patterns, respectively. Since the sequential access has better locality than the random access, the sequential bandwidth is usually higher than the random bandwidth.
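To make the definitions in Figure 3 concrete, the following is a hedged CUDA sketch of our own (not the paper's code) of the basic single-pass scatter and gather kernels, assuming 8-byte tuples and a precomputed location array L; each thread processes tuples in a grid-stride loop.

// Illustrative single-pass scatter and gather kernels (names are ours).
struct Tuple { int key; int rid; };          // 8-byte tuple (key, record id)

__global__ void scatterKernel(const Tuple *r_in, const int *loc,
                              Tuple *r_out, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        r_out[loc[i]] = r_in[i];             // sequential read, random write
}

__global__ void gatherKernel(const Tuple *r_in, const int *loc,
                             Tuple *r_out, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        r_out[i] = r_in[loc[i]];             // random read, sequential write
}

The scatter kernel reads R_in and L sequentially and writes R_out at arbitrary positions, while the gather kernel does the opposite; this is exactly the distinction the sequential and random effective bandwidths above are meant to capture.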

(a) Scatter    (b) Gather
Figure 4. Examples of scatter and gather.

3.3 Modeling the effective bandwidth

In this subsection, we present our probabilistic model for estimating the sequential bandwidth and the random bandwidth (denoted as B_seq and B_rand). Since the estimation and optimization techniques are similar for the scatter and the gather operations, we describe the techniques for the scatter in detail, and present those for the gather in brief.

3.3.1 Modeling the sequential bandwidth

A sequential access fetches each memory block exactly once. A memory block causes a cache miss at the first access, and the later accesses to this memory block are cache hits. Thus, the sequential access achieves a high effective bandwidth. We estimate the sequential bandwidth using our measured effective bandwidth. To obtain the measured bandwidth, we perform a set of calibration measurements with input data arrays of different sizes. Given the capacity of the device memory, D_m MB, we measure the execution time for the sequential scan when the input array size is 2^i MB, i = 1, ..., log2(D_m). Let the execution time for the input array of 2^i MB be t_i. We compute the sequential bandwidth to be the weighted bandwidth of all calibrations as Eq. 1, where n = log2(D_m). Compared with the estimation using the average value, the weighted sum gives more weight to the bandwidth of the larger sizes, which fully utilize the bus bandwidth of the GPU.

B_seq = Σ_{i=1}^{n} ( 2^i / Σ_{j=1}^{n} 2^j ) * ( 2^i / t_i )    (1)

3.3.2 Modeling the random bandwidth

Compared with the sequential access, the random access has a low cache locality. Some accesses are cache hits, and many others are misses. Given the cache hit rate, h, the expected time for accessing a data array, t, is given in Eq. 2.

t = n * ( (1 - h) * ⌈z/l⌉ * l / B_seq + h * ⌈z/l'⌉ * t_l' )    (2)

In this equation, n is the number of data array tuples; z is the size of each tuple (bytes); l and l' are the memory block size and the transfer unit between the cache and the GPU (bytes), respectively; B_seq is the estimated sequential bandwidth; and t_l' is the time for fetching a cache line from the cache to the GPU. The amortized cost for a cache hit is ⌈z/l'⌉ * t_l'. The data transfer between the device memory and the GPU for a cache miss is ⌈z/l⌉ * l / B_seq. Since B_seq measures the data transfer between the device memory and the GPU as well as the accesses to the GPU cache, we estimate the total cost for a cache miss to be ⌈z/l⌉ * l / B_seq.

We define the random bandwidth as follows.

B_rand = n * z / t = z / ( (1 - h) * ⌈z/l⌉ * l / B_seq + h * ⌈z/l'⌉ * t_l' )    (3)

Given a fixed input, we maximize the cache hit rate in order to maximize the random bandwidth. We develop a probabilistic model to estimate the cache hit rate. Specifically, we estimate the cache hit rate in two cases. One is that the access locations are totally random, and the other is with the knowledge of the number of partitions in the input data. The number of partitions is usually a priori knowledge for various applications such as radix sort and hash partitioning. For instance, given the number of bits used in the radix sort, Bits, the number of partitions for the input data is 2^Bits. With the knowledge of the number of partitions, we have a more accurate estimation of the cache hit rate. When there is no knowledge of the number of partitions, we assume the access locations are random and use the first case to estimate the cache hit rate.

Case I: totally random accesses. We first estimate the expected number of distinct memory blocks, E, accessed by n random accesses.
Suppose the total number of memory blocks for storing the input data array is k. Let E_j be the number of cases accessing exactly j (1 <= j <= n) distinct memory blocks. We estimate E_j as follows.

E_j = C(k, j) * S(n, j) * j!    (4)

In this equation, C(k, j) is the number of cases of choosing j memory blocks from the available k memory blocks; S(n, j) is the Stirling number of the second kind [9], which equals the number of cases of partitioning n accesses into j memory blocks; j! is the number of cases in the permutation of the j memory blocks. Since the total number of cases in accessing the k memory blocks is k^n, we estimate the expected number of distinct memory blocks accessed by these n accesses in Eq. 5.

E = ( Σ_{j=1}^{n} E_j * j ) / k^n    (5)

Based on E and the number of cache lines in the GPU cache, N, we estimate the cache miss rate, m, in Eq. 6.

m = E / n,                              if E <= N
m = ( E + (1 - N/E) * (n - E) ) / n,    otherwise    (6)

When E > N, we model cache misses in two categories: the compulsory misses of loading the E memory blocks, and the capacity misses, (1 - N/E) * (n - E). Intuitively, given a fixed input, the larger the cache size, the smaller the cache miss rate.
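The following host-side sketch (an assumption of ours, not the paper's code) shows one way to evaluate the Case I estimate. Rather than summing the Stirling-number terms of Eqs. 4-5 directly, it uses the standard closed form for the expected number of distinct blocks touched by n uniform random accesses, E = k * (1 - (1 - 1/k)^n), which equals the expectation defined by Eq. 5, and then applies Eq. 6.

// Host-side Case I estimate (illustrative). n: number of accesses,
// k: number of memory blocks holding the input, cacheLines: N in Eq. 6.
#include <cmath>

double expectedDistinctBlocks(double n, double k) {
    return k * (1.0 - std::pow(1.0 - 1.0 / k, n));       // closed form of Eq. 5
}

double caseOneHitRate(double n, double k, double cacheLines) {
    double E = expectedDistinctBlocks(n, k);
    double m;
    if (E <= cacheLines)
        m = E / n;                                        // compulsory misses only
    else
        m = (E + (1.0 - cacheLines / E) * (n - E)) / n;   // plus capacity misses
    return 1.0 - m;                                       // h = 1 - m
}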

Finally, we compute the cache hit rate of Case I to be h = 1 - m.

Case II: given the number of partitions, p. In this case, if the write locations of a thread warp belong to the same partition, they are consecutive. The writes to consecutive locations are sequential, and have a good cache locality. Suppose a warp consists of w threads. Each thread accesses one tuple at a time. The number of tuples accessed by the warp is w. To estimate the number of cache misses, we estimate the expected number of distinct partitions, D, accessed by the warp. Let D_j be the number of cases containing exactly j distinct partitions, 1 <= j <= min(w, p). We estimate D_j to be D_j = C(p, j) * S(w, j) * j!. The estimation is similar to that of the number of distinct memory blocks accessed in Case I. Since the total number of cases is p^w, we estimate D in Eq. 7.

D = ( Σ_{j=1}^{min(p, w)} D_j * j ) / p^w    (7)

The average number of tuples belonging to each of these D partitions is w / D. The number of memory blocks for each partition is ⌈(w/D) * z / l⌉. Therefore, the cache miss rate within a thread warp is estimated in Eq. 8.

m = D * ⌈(w/D) * z / l⌉ / w    (8)

The cache hit rate within a thread warp is then (1 - m). Finally, we compute the cache hit rate of Case II to be the sum of the intra- and inter-warp cache hit rates. The inter-warp cache hit rate is estimated as in Case I, totally random accesses. Note that, even with the knowledge of the number of partitions, the accesses of warps on different multiprocessors are random due to the unknown execution order among warps.

3.4 Improving the memory performance

A basic implementation of the scatter is to sequentially scan L and R_in once, and output all R_in tuples to R_out during the scan. Similarly, the basic implementation of the gather is to scan L once, read the R_in tuples according to L, and write the tuples to R_out sequentially. Given B_seq and B_rand, the total execution time of the scatter and the gather is estimated in Eq. 9 and Eq. 10, respectively. Given an array X, we denote the number of tuples in X as |X| and the size of X in bytes as ||X||.

T_scatter = ( ||R_in|| + ||L|| ) / B_seq + ||R_out|| / B_rand    (9)

T_gather = ( ||L|| + ||R_out|| ) / B_seq + ||R_in|| / B_rand    (10)

This basic implementation is simple. However, if L is random, the scatter and the gather suffer from the random access, which has low cache locality and results in a low bandwidth utilization. Since B_rand is usually much lower than B_seq, we consider a multi-pass optimization scheme to improve the cache locality. We apply the optimization technique to both the scatter and gather operations, and illustrate it using the scatter as an example. Suppose we perform the scatter in nchunk passes. In each pass, we output only those R_in tuples that belong to a certain region of R_out. That is, the algorithm first divides R_out into nchunk chunks, and then performs the scatter in nchunk passes. In the ith pass (1 <= i <= nchunk), it scans L once, and outputs the R_in tuples belonging to the ith chunk of R_out (i.e., the write locations are between ||R_out|| * (i - 1) / nchunk and ||R_out|| * i / nchunk). Since each chunk is much smaller than R_out, our multi-pass scatter has a better cache locality than the single-pass one.

Figure 5. Single-pass vs. multi-pass scatter: (a) the single-pass scatter; (b) the multi-pass scatter, which writes one chunk of R_out per pass and has a better cache locality than the single-pass one.
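A hedged CUDA sketch of the multi-pass scatter follows (ours, not the paper's code; names are illustrative). Each pass rescans L and R_in but only performs the writes whose destinations fall inside the current chunk of R_out, so the random writes of a pass stay within a cache-sized region; the host driver runs nchunk such passes, with nchunk chosen according to Eq. 11 below.

// One pass of the multi-pass scatter plus a simple host driver (illustrative).
#include <algorithm>

struct Tuple { int key; int rid; };              // same 8-byte tuple as before

__global__ void scatterPassKernel(const Tuple *r_in, const int *loc,
                                  Tuple *r_out, int n,
                                  int chunkBegin, int chunkEnd) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        int dest = loc[i];
        if (dest >= chunkBegin && dest < chunkEnd)   // write only this pass's region
            r_out[dest] = r_in[i];
    }
}

void multiPassScatter(const Tuple *r_in, const int *loc, Tuple *r_out,
                      int n, int nOut, int nchunk) {
    int chunkSize = (nOut + nchunk - 1) / nchunk;
    for (int c = 0; c < nchunk; ++c)                 // nchunk passes over R_in and L
        scatterPassKernel<<<1024, 256>>>(r_in, loc, r_out, n,
                                         c * chunkSize,
                                         std::min((c + 1) * chunkSize, nOut));
}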
Let us illustrate our multi-pass scheme using an example, as shown in Figure 5. For simplicity, we assume that the GPU cache can hold only one memory block and a block can hold three tuples. Under this assumption, each write in the single-pass scheme in Figure 5(a) results in a miss. Consequently, the single-pass scheme has a low cache locality. In comparison, the multi-pass scheme divides R_out into two chunks, and writes one chunk in each pass. It achieves an average hit rate of 2/3.

We estimate the total execution time of the multi-pass scatter as scanning R_in and L nchunk times and outputting the data. Given the nchunk value, we estimate the cache hit rate in a similar way to the single-pass scheme. The major difference is that in the estimation of the random bandwidth, the number of data accesses considered in each pass is n/nchunk, where n is the total number of data accesses in the scatter. When nchunk > 1, n/nchunk is smaller than n, so the multi-pass scheme has a higher cache hit rate than the single-pass scheme. Given the estimated random bandwidth for the multi-pass scheme when the number of passes is nchunk, B'_rand, we estimate the execution time of the multi-pass scatter, T'_scatter, in Eq. 11. We choose the nchunk value so that T'_scatter is minimized.

T'_scatter = ( ||R_in|| + ||L|| ) * nchunk / B_seq + ||R_out|| / B'_rand    (11)

3.5 Evaluation

We evaluated our model and optimization techniques with different data sizes and data distributions. We increase the data size to be sufficiently large to show the performance trend. The default tuple size is 8 bytes, and the default size of the input data array is 128 MB (i.e., the input data array consists of 16 million tuples). The default data distribution is random. We have implemented and tested our algorithms on a PC with an NVIDIA G80 GPU. The PC runs Windows XP on an Intel Core 2 Quad CPU. We also run our CPU-based algorithms on a Windows Server machine with two AMD Opteron dual-core processors. The hardware configuration of the PC and the server is shown in Table 1. The cache configuration of the GPU is obtained based on the GPU memory model proposed by Govindaraju et al. []. The cache latency information of the CPUs is obtained using a cache calibrator [], and that of the GPU is obtained from the hardware specification [6]. We estimate the transfer unit between the cache and the GPU to be the memory interface width, i.e., 384 bits, since we do not have any details on this value. We compute the theoretical memory bandwidth to be the bus width multiplied by the memory clock rate. Thus, the GPU, Intel and AMD machines have a theoretical bandwidth of 86.4 GB/s, .4 GB/s and 8. GB/s, respectively. Based on our measurements of sequentially scanning a large array, the G80 achieves a peak memory bandwidth of around 69 GB/s whereas the Intel and the AMD achieve 5.6 GB/s and 5. GB/s, respectively.

To evaluate our model, we define the accuracy of our model to be (1 - |v - v'| / v), where v and v' are the measured and estimated values, respectively. We run each experiment five times and report the average value. Since we aim at validating our cache model and evaluating the scatter and the gather on the GPU, the results for the GPU-based algorithms in this section do not include the data transfer time between the CPU and the GPU.

Table 1. Hardware configuration

                           GPU (G80)     CPU (Intel)                 CPU (AMD)
Processors                 1350 MHz      2.4 GHz (quad-core)         2.4 GHz (two dual-core)
Cache size                 -             L1: 32 KB x 4, L2: 8 MB     L1: 64 KB, L2: 1 MB
Cache block size (bytes)   256           L1: 64, L2: 128             L1: 64, L2: 64
Cache access time (cycle)  -             -                           -
DRAM (MB)                  768           -                           -
DRAM latency (cycle)       -             -                           -
Bus width (bit)            384           -                           -
Memory clock (GHz)         1.8           -                           -

Validating the performance model. Figure 6 demonstrates the measured and estimated performance of sequential scatter and gather operations on the GPU. The gather and the scatter have a very similar performance trend as the data size increases. The figure also indicates that our estimation achieves an accuracy of over 87% for the gather and the scatter (min: 7%, max: 99%).

Figure 6. The measured and estimated performance of the sequential single-pass scatter (top) and gather (bottom) on the GPU, with the data size varied.

Figure 7. The measured and estimated performance of the single-pass scatter (top) and gather (bottom) on the GPU with random locations.

Figure 7 shows the measured and estimated performance with the data size varied when the read/write locations are totally random. Our model has a high accuracy in predicting the performance of the gather and the scatter. The average accuracy of our model on the gather and the scatter is over 90% (min: 6%, max: 98%). Again, the gather and the scatter have a very similar performance trend as the data size increases. In the following, we report the results for the scatter only, because the results of the gather are similar to those of the scatter.

Figure 8 shows the measured and estimated performance with the number of partitions in the input data varied. The data size is 128 MB. Given the number of partitions, the partition ID of each tuple is randomly assigned. When the number of partitions is larger than 32 (i.e., the warp size on the G80), the measured time increases dramatically due to the reduced cache reuse within a warp. We also observe this performance increase in our estimated time. These figures indicate that our estimation is accurate for different data sizes and different numbers of partitions in the input data. The average accuracy in Figure 8 is over 80% (min: 7%, max: 99%).

Figure 8. The measured and estimated single-pass scatter performance on the GPU: (top) the effective bandwidth of the random access; (bottom) the execution time.

Figure 9. The measured and estimated performance of the multi-pass scatter on the GPU with the number of passes varied.

Finally, we validate our performance model on the multi-pass scheme. Figure 9 shows the measured and estimated performance of the multi-pass scatter with the number of passes varied. The suitable value for the number of passes is 6. When the number of passes is smaller than the suitable value, the scatter performance improves because the cache locality of the scatter is improved. When the number of passes is larger than the suitable value, the overhead of extra scans in the multi-pass scheme becomes significant. The average accuracy of our model in Figure 9 is 86% (min: 7%, max: 99%), and our estimations for different data sizes and different numbers of partitions are also highly accurate.

Evaluating the multi-pass scheme. We first consider whether we can improve the performance by dividing the input array into multiple segments and applying our multi-pass scheme to each segment. Since a segment can fit into the cache, the cost of the multiple scans is minimized. Figure 10 shows the scatter performance when the number of segments, #bin, increases. The input array is 128 MB. When #bin is 512, the segment size is 256 KB, which is smaller than the GPU cache size. When the #bin value increases, the scatter time increases as well. This is because the cache locality between the scatters on different segments gets worse.

We next evaluate our multi-pass scheme with the number of partitions, p, varied. The result is shown in Figure 11. When p <= 8, the suitable number of passes in our multi-pass scheme is one, and the multi-pass scatter reduces to the single-pass one. As p increases beyond 8, our multi-pass scheme outperforms the single-pass scheme. Regardless of the number of partitions in the input relation, our model correctly predicts a suitable value for the number of passes, and our multi-pass optimization technique improves the scatter performance by up to three times.

Figure 10. The scatter performance with the number of segments varied.

Figure 11. The performance of the multi-pass scatter on the GPU with the number of partitions, p, varied (single-pass vs. multi-pass).

Finally, we compare the performance of the scatter on the GPU and the CPU when the write locations are random. We obtain similar performance results as the number of partitions is varied, and omit those results. Figure 12 shows the scatter performance with the tuple size varied and the number of tuples fixed to 16 million. The maximum tuple size is 16 bytes due to the limited size of the device memory. Figure 13 shows the elapsed time with the number of tuples varied and the tuple size fixed to be 8 bytes.

On the CPU, we consider whether our multi-pass scheme and the multithreading technique can improve the performance of the scatter. We find that the multi-pass scheme yields little performance improvement on either Intel or AMD. This is due to the difference between the memory architectures of the GPU and the CPU. We therefore apply the single-pass scheme to the gather and the scatter on the CPU. The multithreading technique improves the performance of the scatter by -X and -4X on Intel and AMD, respectively.

On the GPU, the effective bandwidth of the random access increases as the tuple size increases. When the tuple size is fixed, the effective bandwidth is stable as the data size increases. Regardless of the tuple size and the data size, the GPU-based scatter has a much better performance than the optimized CPU-based one. The speedup is 7-X and -4X on Intel and AMD, respectively. On both the GPU and the CPUs, the effective bandwidth is much lower than the peak bandwidth. This indicates that the random access pattern has low locality and yields low bus utilization. Nevertheless, our multi-pass scheme improves the effective bandwidth of the GPU-based scatter by 2-4X.

Figure 12. The scatter performance on the CPU and the GPU with the tuple size varied (GPU single-pass, GPU multi-pass, Intel with optimization, AMD with optimization).

Figure 13. The scatter time on the CPU and the GPU with the data size varied.

4. APPLICATION AND ANALYSIS

Scatter and gather are two fundamental data-parallel operations for many parallel algorithms [9]. In this section, we use scatter and gather to implement three memory-intensive algorithms on GPUs: the radix sort, the hash search, and the sparse-matrix vector multiplication. Specifically, we use our improved scatter to implement the radix sort and our improved gather for the hash search and the sparse-matrix vector multiplication. We also compare the end-to-end performance of the three applications on the GPU and the CPU. Our total execution time for the GPU-based algorithms includes the data transfer time between the CPU and the GPU.

4.1 Radix sort

Radix sort is one of the highly parallel sorting algorithms [4]. It has a linear-time computational complexity. In this study, we implement the most significant digit (MSD) radix sort. The algorithm first performs P-phase partitioning on the input data, and then sorts the partitions in parallel. In each phase, we divide the input data into multiple partitions using Bits bits. In the ith (1 <= i <= P) phase of the radix sort, we use the (i*Bits)th to ((i+1)*Bits)th bits from the most significant bit. After partitioning, we use the bitonic sort [] to sort each partition. The bitonic sort can exploit the fast shared memory on the G80 [6], where the shared memory of each multiprocessor is 16 KB. Different from the cache, the shared memory is programmer-manipulated. Additionally, it is shared by the threads on the same multiprocessor. When the partition size is smaller than the shared memory size, the bitonic sort works entirely on the data in the shared memory, and causes no further cache misses.
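As an illustration of sorting a partition entirely in shared memory, here is a hedged CUDA sketch of a bitonic sort kernel (ours, not the paper's code). It sorts keys only, assumes each partition is padded to a power-of-two size SORT_SIZE equal to the thread-block size, and assigns one partition to each thread block; once the partition is loaded, all compare-exchange steps operate on shared memory.

// Illustrative bitonic sort of one partition per thread block in shared memory.
#define SORT_SIZE 512                      // power of two; launch with blockDim.x == SORT_SIZE

__global__ void bitonicSortPartition(int *keys) {
    __shared__ int s[SORT_SIZE];
    int tid  = threadIdx.x;
    int base = blockIdx.x * SORT_SIZE;
    s[tid] = keys[base + tid];             // load the partition into shared memory
    __syncthreads();

    for (int k = 2; k <= SORT_SIZE; k <<= 1) {        // bitonic merge stages
        for (int j = k >> 1; j > 0; j >>= 1) {        // compare-exchange distance
            int partner = tid ^ j;
            if (partner > tid) {
                bool ascending = ((tid & k) == 0);
                if ((s[tid] > s[partner]) == ascending) {
                    int tmp = s[tid]; s[tid] = s[partner]; s[partner] = tmp;
                }
            }
            __syncthreads();
        }
    }
    keys[base + tid] = s[tid];             // write the sorted partition back
}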
We determine the suitable P value so that the resulting size of each partition is smaller than the shared memory size, i.e., P = log_f(n/S), where f is the number of partitions generated in each phase, i.e., 2^Bits; n is the size of the input array (bytes); and S is the shared memory size of each multiprocessor (bytes).

Figure 14. An example of radix sort (Data, PID, Loc and Output arrays; Phase 1 uses the first bit from the left, Phase 2 the second bit from the left).

We implement the partitioning scheme using the scatter. Given an input array of data tuples, dataArray, the partitioning process is performed in three steps. The algorithm first computes the partition ID for each tuple and stores all partition IDs into an array, pidArray, in the same order as dataArray. Next, it computes the write location for each tuple based on pidArray, and stores the write locations into a third array, locArray. Finally, it applies our multi-pass scatter to output dataArray according to locArray. An example of the radix sort on eight tuples is illustrated in Figure 14.

The first step is fast, because it is a sequential scatter and a scan. We consider optimizations on the second and third steps. For the second step, we use histograms to compute the write location for each tuple. Our histograms are similar to those in the parallel radix sort proposed by Zagha [4]. The main difference is that we fully consider the hardware features of the GPU in designing our histograms. Specifically, in our partitioning algorithm, each thread is responsible for a portion of the input array. Because the histogram for each thread is accessed randomly, we store the histograms in the shared memory to reduce the latency of these random accesses. In our implementation, each histogram element is encoded in one byte. The size of each histogram is 2^Bits bytes. Since each thread has one histogram, the Bits value must satisfy the following constraint due to the limited size of the shared memory: T * 2^Bits <= S, where T is the number of concurrent threads sharing the shared memory and S is the size of the shared memory.

For the third step, we determine the suitable number of passes to minimize the execution time of the radix sort. The number of phases, i.e., the number of scatter operations, in the radix sort is P. We estimate the execution time of each scatter as in Case II with the number of partitions being 2^Bits. The total execution time of these P scatter operations is C = P * T_scatter. Considering the constraint in the second step, we choose the suitable Bits value to minimize the C value.

We have implemented the radix sort on the GPU and the CPU. Figure 15 shows the performance impact of our optimization techniques on the GPU and the CPU. Each array element is an 8-byte record, including a four-byte key value and a four-byte record identifier. The key values are randomly distributed. For the CPU implementation, we consider both multithreading and cache-optimization techniques. Suppose the CPU has C_p cores. We first divide the input array into C_p partitions, and sort these partitions in parallel. Each partition is sorted by a thread running an optimized CPU-based radix sort, which is a multi-pass algorithm []. It determines the number of bits used in each pass according to the cache parameters of the CPU, e.g., the number of cache lines in the L1 and the L2 caches and the number of entries in the data TLB (Translation Lookaside Buffer). According to the existing cost model [], the number of bits in each pass is set to six on both Intel and AMD so that the cache stalls are minimized. The number of partitions generated in one pass is 2^6 = 64. Finally, we merge the C_p sorted partitions into one.

The performance comparison for the CPU-based radix sort with and without optimizations is shown at the top of Figure 15. Our optimization techniques greatly improve the radix sort, by 85% on AMD, and slightly improve the radix sort on Intel. On the GPU, the number of bits used in each phase of the radix sort, Bits, is five. That is, in each phase, an array is divided into 32 sub-arrays. Our test data takes three phases of partitioning in total. The suitable number of passes in our optimized scatter is four according to our performance model. As shown in Figure 15, our multi-pass scheme improves the GPU-based radix sort by over %. Note that the data transfer time between the GPU and the CPU is -% of the total execution time of the GPU-based radix sort.

Figure 16 highlights the performance of the GPU-based radix sort, the GPU-based bitonic sort [], and the CPU-based radix sort []. The GPU-based radix sort is twice as fast as the GPU-based bitonic sort. Moreover, our GPU-based algorithm is over X faster than the best optimized CPU-based radix sort. Finally, we perform a back-of-envelope calculation of the system bandwidth.
We estimate the data transfer of the GPU-based radix sort by considering the cache misses estimated by our model and the internal data structures in the radix sort. The CPU-based radix sort is estimated using the existing cost model []. The estimated bandwidth of the radix sort is on average .4 GB/s, . GB/s, and . GB/s on the GPU, Intel and AMD, respectively.

Figure 15. The performance impact of optimizations on the CPU (top, with and without optimization on Intel and AMD) and the GPU (bottom, with and without optimization), with the number of keys (million) varied.

Figure 16. The execution time of sorting algorithms on the GPU and the CPU (GPU radix sort with optimization, GPU bitonic sort, Intel and AMD radix sort with optimization).

4.2 Hash search

A hash table is commonly used to speed up equality search on a set of records. Applications using hash tables include the hash join and hash partitioning in databases []. A hash table consists of multiple hash headers, each of which maintains a pointer to its corresponding bucket. Each bucket stores the key values of the records that have the same hash value, and the pointers to the records. An example of the hash table is illustrated in Figure 17(a). Since dynamic allocation and deallocation of the device memory is not supported on current GPUs, we use an array to store the buckets, and each header maintains the start position of its corresponding bucket in the array.

The hash search takes as input a number of search keys, performs probes on a hash table with these keys, and outputs the matching records to a result buffer in the device memory. In our implementation, each thread is responsible for one search key. To avoid conflicts when multiple threads write results to the shared result buffer concurrently, we implement the hash search in four steps. First, for each search key, we use the hash function to compute its corresponding bucket ID and determine the (start, end) pair indicating the start and end locations of the bucket. Second, we scan the bucket and determine the number of matching results. Third, based on the number of results for each key, we compute a prefix sum on these numbers to determine the start location at which the results of each search key are written, and output the record IDs of the matching tuples to an array, L. Fourth, we perform a gather according to L to fetch the actual result records. Due to the random nature of the hash search, we estimate the execution time of the gather as in Case I. Figure 17(b) illustrates the four steps of searching two keys.

Figure 17. An example of hash search: (a) the hash table (headers and buckets of records R1-R8, with hash function H(x) = x mod 7); (b) searching two keys (step 1: the (start, end) pairs; step 2: the number of matches; step 3: the record IDs of the matching tuples; step 4: the output records).
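As a concrete illustration of the first two steps above, the following hedged CUDA sketch (ours, not the paper's code) probes the bucket of each search key and counts its matches; the flat bucket layout as (start, end) ranges into a key array and all names are assumptions of ours.

// Illustrative probe-and-count kernel for hash-search steps one and two.
__global__ void probeCountKernel(const int *searchKeys, int numKeys,
                                 const int *bucketStart, const int *bucketEnd,
                                 const int *bucketKeys, int numBuckets,
                                 int *matchCount) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < numKeys; i += stride) {
        int key = searchKeys[i];
        int b = ((unsigned)key) % numBuckets;        // hash function, e.g. modulo
        int count = 0;
        for (int j = bucketStart[b]; j < bucketEnd[b]; ++j)
            if (bucketKeys[j] == key)
                ++count;
        matchCount[i] = count;   // a prefix sum over matchCount then gives the
                                 // write offsets used in steps three and four
    }
}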
Figure 18 shows the execution time of the hash search on the GPU and the CPU. Note that the data transfer time is 4-6% of the total execution time of the hash search. The values of the search keys and the keys in the hash table are all 32-bit random integers. The record size is 8 bytes. We fixed the number of search keys to be 4 million and varied the total number of records in the hash table. The load factor of the hash table is around 99%. Similar to the CPU-based radix sort, we consider multithreading and cache-optimization techniques for the CPU-based hash search. Suppose the CPU has C_p cores. We first divide the search keys into C_p partitions, and probe the hash table using the search keys of each partition in parallel. In our implementation, we use C_p threads and each thread is in charge of one partition. Each thread runs the cache-efficient hash search using a multi-pass scheme similar to that on the GPU. The difference is that the chunk size is set to be the L2 data cache size. Similar to the CPU-based radix sort, the performance improvement of our optimization on AMD is significant and that on Intel is moderate. Based on our performance model, the suitable number of passes for the GPU-based multi-pass gather is 6.

We observe that the optimized GPU-based hash search is 7.X and .5X faster than its optimized CPU-based counterpart on Intel and AMD, respectively. Additionally, the hash search optimized with our multi-pass scheme is on average 5% faster than the single-pass scheme. We also varied the number of search keys and the number of records in the hash table, and obtained similar performance results. Through a back-of-envelope calculation, the estimated bandwidth of the hash search is on average . GB/s, .56 GB/s, and . GB/s on the GPU, Intel and AMD, respectively.

Figure 18. Hash search on the GPU and the CPU (GPU with and without optimization, Intel and AMD with optimization), with the number of keys in the hash table (million) varied.

4.3 Sparse-matrix vector multiplication (SpMV)

Sparse-matrix vector multiplication (SpMV) is widely used in many linear solvers [5]. SpMV stores the sparse matrix in a compressed format and performs the multiplication on the compressed matrix. We adopt the compressed sparse row (CSR) format from the Intel MKL library [] to represent the sparse matrix: for each row, only the non-zero elements are stored. In row i, if column j is not zero, we represent this cell as a tuple (j, v), where v is the value of the cell (i, j). This compressed format stores the matrix in four arrays: values, columns, pointerB and pointerE. Array values stores all the values of non-zero cells in the order of their row and column indexes. Array columns stores all the column indexes for the non-zero cells in the row order, i.e., columns[i] stores the column index of the corresponding cell of values[i]. Arrays pointerB and pointerE store the start and end positions of each row in the values array, respectively. That is, the positions of the non-zero cells of the ith row are between pointerB[i] and (pointerE[i] - 1).

Given a sparse matrix M and a dense vector V, we store M in the compressed format and compute the multiplication W = M * V. Our algorithm computes W in two steps. First, we scan values and columns once, and compute the partial multiplication results of values and V into an array, R, where R[i] = values[i] * V[columns[i]]. This multiplication is done through a gather that fetches the values of the vector V at the read locations given in the array columns. Second, based on R, we sum up the partial results into W:

W[i] = Σ_{j = pointerB[i]}^{pointerE[i] - 1} R[j]

When the matrix is large, SpMV is both data- and computation-intensive. We use our multi-pass scheme to reduce the random accesses to the vector. Applying our performance model to SpMV, we estimate the execution time of the gather as in Case I.
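The two SpMV steps described above can be sketched in CUDA as follows (ours, not the paper's code; single precision, with the CSR arrays values, columns, pointerB and pointerE as described). The first kernel performs the gather from V at the positions in columns and forms the partial products R; the second sums each row's slice of R into W.

// Illustrative two-step SpMV over the CSR layout described above.
__global__ void spmvGatherKernel(const float *values, const int *columns,
                                 const float *V, float *R, int nnz) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < nnz; i += stride)
        R[i] = values[i] * V[columns[i]];        // gather from V via columns
}

__global__ void spmvRowSumKernel(const float *R, const int *pointerB,
                                 const int *pointerE, float *W, int numRows) {
    int stride = gridDim.x * blockDim.x;
    for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < numRows; row += stride) {
        float sum = 0.0f;
        for (int j = pointerB[row]; j < pointerE[row]; ++j)
            sum += R[j];                         // W[row] = sum of the row's partials
        W[row] = sum;
    }
}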
Figure 19. Conjugate gradient benchmark on the GPU and the CPU (GFLOPS; GPU with and without optimization, Intel and AMD with optimization), for classes A and B.

Figure 20. Conjugate gradient benchmark with real-world matrices on the GPU and the CPU (GFLOPS; matrices Stanford_Berkeley, pkustk4, pkustk8 and nd4k).

Figure 19 plots the performance in GFLOPS (Giga Floating-point Operations per Second) of SpMV on the GPU and the CPU. We applied our SpMV algorithm to the conjugate gradient benchmark [5], where SpMV is a core component of the sparse linear solver. The benchmark solves a linear system by calling the conjugate gradient (CG) method niter times. Each execution of CG has 25 iterations. Each iteration consists of a SpMV on the matrix and a dense vector, in addition to some vector manipulation. In our experiments, the total execution time of SpMV is over 90% of the total execution time of the CG benchmark on both the GPU and the CPU. There are two classes in the benchmark, classes A and B. The vector sizes are 14 and 75 thousand (i.e., 56 KB and 300 KB) for classes A and B, respectively. The niter value is 15 and 75 for classes A and B, respectively.

The CPU-based SpMV uses the Intel MKL routine mkl_dcsrmv []. The MKL routine does not utilize all the cores on the CPU, so we further consider multithreading techniques on the CPU. Suppose the CPU has C_p cores. We horizontally divide the matrix into C_p sub-matrices. Next, each thread performs the multiplication of a sub-matrix and the vector using the mkl_dcsrmv routine. The multithreading technique improves the benchmark by around 2X on both AMD and Intel.

Since the vector size is smaller than the cache size on the G80, the vector fits into the cache. As a result of this small vector size, the performance of the GPU-based algorithm with optimization is similar to that without. Note that the GPU-based SpMV uses single precision, whereas the MKL routine uses double precision. The single-precision computation may have up to a 2X performance advantage over the double-precision computation. Nevertheless, the GPU-based algorithm is more than six times faster than the CPU-based algorithm.

We further evaluated our SpMV algorithms using the CG benchmark with real-world matrices downloaded from the Sparse Matrix Collection [7]. We follow Gahvari's categorization of these matrices [8]: the matrix size is small, medium or large, and the number of non-zero elements per row is low, medium or high. Figure 20 shows the performance comparison for four large real-world matrices, namely pkustk8, pkustk4, Stanford_Berkeley and nd4k. These matrices have a similar size, and their non-zero elements per row increase from 98 and 45 to 99. The benchmark configuration is the same as class B. The optimized SpMV with our multi-pass scheme is moderately faster than the one with the single-pass scheme. Moreover, the GPU-based algorithm is over six times faster than the CPU-based algorithm. We also evaluated our algorithms with matrices of different sizes and obtained similar performance results.

5. CONCLUSION AND FUTURE WORK

We presented a probabilistic model to estimate the memory performance of scatter and gather operations. Moreover, we proposed optimization techniques to improve the bandwidth utilization of these two operations by improving the memory locality of data accesses. We have applied our techniques to three scientific applications and have compared their performance against prior GPU-based algorithms and optimized CPU-based algorithms. Our results show a significant performance improvement on the NVIDIA 8800 GPU.

There are several avenues for future work. We are interested in applying our scatter and gather operations to other scientific applications. We are also interested in developing efficient scatter and gather operations on other architectures such as the IBM Cell processor and AMD Fusion. Finally, it is an interesting future direction to extend our algorithms to multiple hosts, e.g., clusters or MPPs without shared memory.

6. ACKNOWLEDGMENTS

The authors thank the shepherd, Leonid Oliker, and the anonymous reviewers for their insightful suggestions, all of which improved this work. Funding for this work was provided in part by grants DAG5/6.EG and 676 from the Hong Kong Research Grants Council.

7. REFERENCES

[1] CWI Calibrator.
[2] HPF.
[3] Intel Math Kernel Library 9.
[4] MPI.
[5] NAS parallel benchmarks.
[6] NVIDIA CUDA (Compute Unified Device Architecture).
[7] Sparse matrix collection.
[8] ZPL.
[9] M. Abramowitz and I. A. Stegun (Eds.). Stirling numbers of the second kind. In Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing. New York: Dover, 1972.
[10] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9), 1988.
[11] D. H. Bailey. Vector computer memory bank contention. IEEE Transactions on Computers, C-36(3), March 1987.
[12] G. E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-90-190, Nov. 1990.
[13] P. Boncz, S. Manegold and M. Kersten. Database architecture optimized for the new bottleneck: memory access. In Proc. of the International Conference on Very Large Data Bases (VLDB), pp. 54-65, 1999.
[14] I. Buck. Taking the plunge into GPU computing. In GPU Gems 2, M. Pharr (Ed.). Addison-Wesley, Mar. 2005.
[15] J. Bolz, I. Farmer, E. Grinspun and P. Schröder. Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Transactions on Graphics, 2003.
[16] K. Fatahalian, J. Sugerman and P. Hanrahan. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2004.
[17] M. Frigo, C. E. Leiserson, H. Prokop and S. Ramachandran. Cache-oblivious algorithms. In Proc. of the 40th Annual Symposium on Foundations of Computer Science, 1999.
[18] H. Gahvari. Benchmarking sparse matrix-vector multiply. Master's thesis, Computer Science Division, U.C. Berkeley, December 2006.
[19] N. Galoppo, N. Govindaraju, M. Henson and D. Manocha. LU-GPU: efficient algorithms for solving dense linear systems on graphics hardware. In Proc. of the 2005 ACM/IEEE Conference on Supercomputing.
[20] N. Govindaraju, J. Gray, R. Kumar and D. Manocha. GPUTeraSort: high performance graphics co-processor sorting for large database management. In Proc. of the 2006 ACM SIGMOD International Conference on Management of Data.
[21] N. Govindaraju, S. Larsen, J. Gray and D. Manocha. A memory model for scientific algorithms on graphics processors. In Proc. of the 2006 ACM/IEEE Conference on Supercomputing.
[22] N. Govindaraju, B. Lloyd, W. Wang, M. Lin and D. Manocha. Fast computation of database operations using graphics processors. In Proc. of the 2004 ACM SIGMOD International Conference on Management of Data.
[23] N. Govindaraju, N. Raghuvanshi and D. Manocha. Fast and approximate stream mining of quantiles and frequencies using graphics processors. In Proc. of the 2005 ACM SIGMOD International Conference on Management of Data.
[24] D. Horn. Lib GPU FFT, 2006.
[25] P. Kipfer, M. Segal and R. Westermann. UberFlow: a GPU-based particle engine. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2004.
[26] A. LaMarca and R. Ladner. The influence of caches on the performance of sorting. In Proc. of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, 1997.
[27] E. Larsen and D. McAllister. Fast matrix multiplies using graphics hardware. In Proc. of the 2001 ACM/IEEE Conference on Supercomputing.
[28] K. Moreland and E. Angel. The FFT on a GPU. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2003.
[29] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn and T. J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26, 2007.
[30] T. Purcell, C. Donner, M. Cammarano, H. W. Jensen and P. Hanrahan. Photon mapping on programmable graphics hardware. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2003.
[31] S. Sengupta, M. Harris, Y. Zhang and J. D. Owens. Scan primitives for GPU computing. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2007.
[32] J. Vitter. External memory algorithms and data structures: dealing with massive data. ACM Computing Surveys, 2001.
[33] S. Williams, J. Shalf, L. Oliker, P. Husbands and K. Yelick. Dense and sparse matrix operations on the Cell processor. Lawrence Berkeley National Laboratory, Paper LBNL-585, May 2005.
[34] M. Zagha and G. E. Blelloch. Radix sort for vector multiprocessors. In Proc. of the 1991 ACM/IEEE Conference on Supercomputing.
