Efficient Gather and Scatter Operations on Graphics Processors

Efficient Gather and Scatter Operations on Graphics Processors

Bingsheng He #, Naga K. Govindaraju *, Qiong Luo #, Burton Smith *
# Hong Kong Univ. of Science and Technology   * Microsoft Corp.
{saven, luo}@cse.ust.hk   {nagag, burtons}@microsoft.com

ABSTRACT

Gather and scatter are two fundamental data-parallel operations, where a large number of data items are read (gathered) from or are written (scattered) to given locations. In this paper, we study these two operations on graphics processing units (GPUs). With superior computing power and high memory bandwidth, GPUs have become a commodity multiprocessor platform for general-purpose high-performance computing. However, due to the random access nature of gather and scatter, a naive implementation of the two operations suffers from a low utilization of the memory bandwidth and consequently a long, unhidden memory latency. Additionally, the architectural details of the GPUs, in particular the memory hierarchy design, are unclear to the programmers. Therefore, we design multi-pass gather and scatter operations to improve their data access locality, and develop a performance model to help understand and optimize these two operations. We have evaluated our algorithms in sorting, hashing, and sparse matrix-vector multiplication in comparison with their optimized CPU counterparts. Our results show that these optimizations yield 2-4X improvement on the GPU bandwidth utilization and 30-50% improvement on the response time. Overall, our optimized GPU implementations are 2-7X faster than their optimized CPU counterparts.

Categories and Subject Descriptors
E.5 [Files]: Optimization, sorting/searching; I.3.1 [Computer Graphics]: Hardware Architecture - graphics processors, parallel processing.

General Terms
Algorithms, Measurement, Performance.

Keywords
Gather, scatter, parallel processing, graphics processors, cache optimization.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SC07, November 10-16, 2007, Reno, Nevada, USA. (c) 2007 ACM.

1. INTRODUCTION

The increasing demand for faster scientific routines to enable physics and multimedia applications on commodity PCs has transformed the graphics processing unit (GPU) into a massively parallel general-purpose co-processor [9]. These applications exhibit a large amount of data parallelism and map well to the data-parallel architecture of the GPU. For instance, the current NVIDIA GPU has over 128 data-parallel processors and a peak memory bandwidth of 86 GB/s. Many numerical algorithms including matrix multiplication [6][7][], sorting [][5][], LU decomposition [9], and fast Fourier transforms [4][8] have been designed on GPUs. Due to the high performance capabilities of GPUs, these scientific algorithms achieve a multi-fold performance improvement over optimized CPU-based algorithms. In order to further improve the performance of GPU-based scientific algorithms and enable optimized implementations similar to those for the CPU-based algorithms, recent GPUs include support for inter-processor communication using shared local stores, and support for scatter and gather operations [6].
Scatter and gather operations are two fundamental operations in many scientific and enterprise computing applications. These operations are implemented as native collective operations in message passing interfaces (MPI) to define communication patterns across the processors [4], and in parallel programming languages such as ZPL [8] and HPF []. Scatter operations write data to arbitrary locations and gather operations read data from arbitrary locations. Both operations are highly memory intensive and form the basic primitives to implement many parallel algorithms such as quicksort [], sparse matrix transpose [8], and others.

In this paper, we study the performance of scatter and gather operations on GPUs. Figure 1 shows the execution time of the scatter and the gather on a GPU with the same input array but either sequential or random read/write locations. The input array is 128 MB. The detailed experimental setup is described in Section 3.5. This figure shows that the location distribution greatly affects the performance of the scatter and the gather. For sequential locations, if we consider the data transfer of the output data array and internal data structures, both operations are able to achieve a system bandwidth of over 60 GB/s. In contrast, for random locations, both operations yield a low utilization of the memory bandwidth. The performance comparison indicates that locality in memory accesses is an important factor for the performance of gather and scatter operations on GPUs.

Figure 1. The elapsed time of the scatter and the gather on a GPU. Both operations underutilize the bandwidth for random locations.

GPU memory architectures are significantly different from CPU memory architectures []. Specifically, GPUs consist of high-bandwidth, high-latency video memory, and the GPU cache sizes are significantly smaller than the CPUs'; therefore, the performance characteristics of scatter and gather operations on GPUs may involve different optimizations than corresponding CPU-based algorithms. Additionally, it was only recently that the scatter functionality was introduced on GPUs, and there is little work in identifying the performance characteristics of the scatter on GPUs.

In this paper, we present a probabilistic analysis to estimate the performance of scatter and gather operations on GPUs. Our analysis accounts for the memory access locality and the parallelism in GPUs. Using our analysis, we design optimized algorithms for scatter and gather. In particular, we design multi-pass algorithms to improve the locality in the memory access and thereby improve the memory performance of these operations. Based on our analysis, our algorithms are able to determine the number of passes required to improve the performance of scatter and gather algorithms. Moreover, we demonstrate through experiments that our performance model is able to closely estimate the performance characteristics of the operations. As a result, our analysis can be applied in runtime systems to generate better execution plans for scatter and gather operations.

We use our scatter and gather to implement three common applications on an NVIDIA GeForce 8800 GPU (G80): radix sort using scatter operations, and the hash search and the sparse-matrix vector multiplication using gather operations. Our results indicate that our optimizations can greatly improve the utilization of the memory bandwidth. Specifically, our optimized algorithms achieve a 2-4X performance improvement over single-pass GPU-based implementations. We also compared the performance of our algorithms with optimized CPU-based algorithms on high-end multi-core CPUs. In practice, our results indicate a 2-7X performance improvement over CPU-based algorithms.

Organization: The remainder of the paper is organized as follows. We give a brief overview of the GPU and the prior work on GPU- or CPU-based cache-efficient algorithms in Section 2. Section 3 describes our performance model and presents our optimization techniques for GPUs. In Section 4, we use the optimized scatter and gather to improve the performance of sorting, the hash search, and the sparse-matrix vector multiplication on the GPU. Finally, we conclude in Section 5.

2. BACKGROUND AND RELATED WORK

In this section, we first review techniques in general-purpose computing on the GPU. Next, we survey the cache-optimized techniques on the CPU and the GPU.

2.1 GPGPU (General-purpose computation on GPU)

A GPU is a SIMD processing unit with high memory bandwidth primarily intended for use in computer graphics rendering applications. In recent years, the GPU has become extremely flexible and programmable. This programmability is strengthened in every major generation (roughly every 18 months) [9]. Such programming flexibility further facilitates the use of the GPU as a general-purpose processor. Recently, NVIDIA has released the CUDA (Compute Unified Device Architecture) framework [6] together with the G80 GPU for general-purpose computation.
Unlike DirectX or OpenGL, CUDA provides a programming model for a thread-based programming language similar to C. Thus, programmers can take advantage of GPUs without requiring graphics know-how. Two of the newly exposed features on the G80 GPU, data scattering and on-chip shared memory for the processing engines, are exploited by our implementation.

GPUs have recently been used for various applications such as matrix operations [6][7][] and FFT computation [4][8]. We now briefly survey the techniques that use GPUs to improve the performance of scientific and database operations. The GPU-based scientific applications include linear algebra computations such as matrix multiplication [6] and sparse matrix computations [5]. Govindaraju et al. presented novel GPU-based algorithms for relational database operators including selections, aggregations [] as well as sorting [], and for data mining operations such as computing frequencies and quantiles for data streams []. For additional information on the state-of-the-art GPGPU techniques, we refer the reader to a recent survey by Owens et al. [9].

The existing work mainly develops OpenGL/DirectX programs to exploit the specialized hardware features of GPUs. In contrast, we use a general-purpose programming model to utilize the GPU hardware features, and implement our algorithms with two common building blocks, i.e., scatter and gather. Recently, Sengupta et al. implemented the segmented scan using the scatter on the GPU []. Our optimization on the scatter can be applied to the segmented scan to further improve its performance.

There is relatively little work on developing efficient algorithms for scatter and gather on GPUs, even though these two operations are commonly provided primitives in traditional MPI architectures [4]. Previous-generation GPUs support gather but do not directly support scatter. Buck described algorithms to implement the scatter using the gather [4]. However, these algorithms usually require complex mechanisms such as sorting, which can result in low bandwidth utilization. Even though new-generation GPUs have more programming flexibility supporting both scatter and gather, the high programmability does not necessarily achieve high bandwidth utilization, as seen in Figure 1. These existing limitations motivate us to develop a performance model to understand the scatter and the gather on the GPU, and to improve their overall performance.
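To make the thread-based CUDA model referred to above concrete, here is a minimal sketch of our own (not code from the paper; all names are illustrative): each thread derives a global index from its block and thread IDs and operates on one array element, and the host picks the block and grid sizes at launch time.

// Minimal CUDA sketch of the thread-based programming model (illustrative only).
#include <cuda_runtime.h>

__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] += 1.0f;                            // each thread owns one element
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    int threadsPerBlock = 256;                      // threads are grouped into blocks,
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // and blocks into a grid
    addOne<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}

The scatter and gather primitives studied in Section 3 can be expressed as kernels of exactly this form, with the per-thread index used to address a location array.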

2.2 Memory optimizations

Memory stalls are an important factor for the overall performance of data-intensive applications, such as relational database systems []. For the design and analysis of memory-efficient algorithms, a number of memory models have been proposed, such as the external memory model [] (also known as the cache-conscious model) and the cache-oblivious model [7]. Our model is similar to the external memory model but applies to the parallel computation on the graphics processor. In addition to modeling the cache cost, Bailey proposed a memory model to estimate the cost of memory bank contention on the vector computer []. In contrast, we focus on the cache performance of the graphics processor.

The algorithms reducing the memory stalls can be categorized into two kinds, cache-conscious [] and cache-oblivious [7]. Cache-conscious algorithms utilize knowledge of cache parameters, such as the cache size. On the other hand, cache-oblivious algorithms do not assume any knowledge of cache parameters. Cache-conscious techniques have been extensively used to improve the memory performance of CPU-based algorithms. LaMarca et al. [6] studied the cache performance of quicksort and showed that cache optimizations can significantly improve its performance. Boncz et al. [] proposed radix clustering with a multi-pass partitioning method in order to optimize the cache performance. Williams et al. [] proposed cache optimizations for matrix operations on the Cell processor.

Memory optimization techniques have also been shown useful for GPU-based algorithms. These memory optimizations need to adapt to the massively threaded architecture of the GPU. Govindaraju et al. [] proposed a memory model for scientific applications and showed that optimizations based on the model greatly improved the overall performance. Fatahalian et al. [6] analyzed the cache and bandwidth utilization of matrix-matrix multiplication. Galoppo et al. [9] designed block-based data access patterns to optimize the cache performance of dense linear systems. In comparison, we propose a general performance model for scatter and gather, and use a multi-pass scheme to improve the cache locality.

3. MODEL AND ALGORITHM

In this section, we first present our memory model of the graphics processor. Next, we give the definitions of the scatter and the gather. We then present our modeling and optimization techniques for the scatter and the gather. Finally, we present our evaluation results for our techniques.

3.1 Memory model on GPU

Current GPUs achieve a high memory bandwidth using a wide memory interface, e.g., the memory interface of the G80 is 384-bit. In order to mask the high memory latency, GPUs have small L1 and L2 SRAM caches. Compared with the CPU, the GPU typically has a flat memory hierarchy, and the details of the GPU memory hierarchy are often missing in vendor hardware specifications. We model the GPU memory as a two-level hierarchy, the cache and the device memory, as shown in Figure 2. A data access to the cache can be either a hit or a miss. If the data is found in the cache, the access is a hit. Otherwise, it is a miss and a memory block is fetched from the device memory. Given the number of cache misses, #miss, and that of cache hits, #hit, the cache hit rate, h, is defined as h = #hit / (#hit + #miss).

We model the GPU as M SIMD multiprocessors. At any time, all processors in a multiprocessor execute the same instruction. The threads on each multiprocessor are grouped into thread warps.
A warp is the minimum schedule unit on a multiprocessor. If a warp is stalled by a memory access, it is put at the end of a waiting list of thread warps. Then, the first warp in the waiting list becomes active and is scheduled to run on the multiprocessor.

Figure 2. The memory model of the GPU (multiprocessors 1..M, a small cache, and high-bandwidth device memory). The GPU consists of M SIMD multiprocessors. The cache is shared by all the multiprocessors.

3.2 Definitions

Gather and scatter are dual operations. A scatter performs indexed writes to an array, and a gather performs indexed reads from an array. We define the two operations in Figure 3. The array L for the scatter contains distinct write locations for each R_in tuple, and that for the gather contains the read locations for each R_out tuple. Essentially, the scatter consists of sequential reads and random writes. In contrast, the gather consists of sequential writes and random reads. Figure 4 illustrates examples of the scatter and the gather.

Primitive: Scatter
Input: R_in[1..n], L[1..n]. Output: R_out[1..n].
Function: R_out[L[i]] = R_in[i], i = 1, ..., n.

Primitive: Gather
Input: R_in[1..n], L[1..n]. Output: R_out[1..n].
Function: R_out[i] = R_in[L[i]], i = 1, ..., n.

Figure 3. Definitions of scatter and gather.

To quantify the performance of the scatter and gather operations, we define the effective bandwidth for the sequential access (resp. the random access) to be the ratio of the amount of the data array tuples accessed by the sequential access (resp. the random access) in the scatter and gather operations to the elapsed time. This measure indicates how much bandwidth is utilized to access the target (effective) data. That is, we define two kinds of effective bandwidth, the sequential and the random bandwidth, to distinguish the sequential and the random access patterns, respectively. Since the sequential access has better locality than the random access, the sequential bandwidth is usually higher than the random bandwidth.
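To make the definitions in Figure 3 concrete, the following is a hedged CUDA sketch of our own (not the paper's code) of the basic single-pass scatter and gather kernels, assuming 8-byte tuples and a precomputed location array L; each thread processes tuples in a grid-stride loop.

// Illustrative single-pass scatter and gather kernels (names are ours).
struct Tuple { int key; int rid; };          // 8-byte tuple (key, record id)

__global__ void scatterKernel(const Tuple *r_in, const int *loc,
                              Tuple *r_out, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        r_out[loc[i]] = r_in[i];             // sequential read, random write
}

__global__ void gatherKernel(const Tuple *r_in, const int *loc,
                             Tuple *r_out, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        r_out[i] = r_in[loc[i]];             // random read, sequential write
}

The scatter kernel reads R_in and L sequentially and writes R_out at arbitrary positions, while the gather kernel does the opposite; this is exactly the distinction the sequential and random effective bandwidths above are meant to capture.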

(a) Scatter    (b) Gather
Figure 4. Examples of scatter and gather.

3.3 Modeling the effective bandwidth

In this subsection, we present our probabilistic model for estimating the sequential bandwidth and the random bandwidth (denoted as B_seq and B_rand). Since the estimation and optimization techniques are similar for the scatter and the gather operations, we describe the techniques for the scatter in detail, and present those for the gather in brief.

3.3.1 Modeling the sequential bandwidth

A sequential access fetches each memory block exactly once. A memory block causes a cache miss at the first access, and the later accesses to this memory block are cache hits. Thus, the sequential access achieves a high effective bandwidth. We estimate the sequential bandwidth using our measured effective bandwidth. To obtain the measured bandwidth, we perform a set of calibration measurements with input data arrays of different sizes. Given the capacity of the device memory, D_m MB, we measure the execution time for the sequential scan when the input array size is 2^i MB, i = 1, ..., log2(D_m). Let the execution time for the input array of 2^i MB be t_i. We compute the sequential bandwidth to be the weighted bandwidth of all calibrations as Eq. 1, where n = log2(D_m). Compared with the estimation using the average value, the weighted sum gives more weight to the bandwidth of the larger sizes, which fully utilize the bus bandwidth of the GPU.

B_seq = Σ_{i=1}^{n} ( 2^i / Σ_{j=1}^{n} 2^j ) * ( 2^i / t_i )    (1)

3.3.2 Modeling the random bandwidth

Compared with the sequential access, the random access has a low cache locality. Some accesses are cache hits, and many others are misses. Given the cache hit rate, h, the expected time for accessing a data array, t, is given in Eq. 2.

t = n * ( (1 - h) * ⌈z/l⌉ * l / B_seq + h * ⌈z/l'⌉ * t_l' )    (2)

In this equation, n is the number of data array tuples; z is the size of each tuple (bytes); l and l' are the memory block size and the transfer unit between the cache and the GPU (bytes), respectively; B_seq is the estimated sequential bandwidth; and t_l' is the time for fetching a cache line from the cache to the GPU. The amortized cost for a cache hit is ⌈z/l'⌉ * t_l'. The data transfer between the device memory and the GPU for a cache miss is ⌈z/l⌉ * l / B_seq. Since B_seq measures the data transfer between the device memory and the GPU as well as the accesses to the GPU cache, we estimate the total cost for a cache miss to be ⌈z/l⌉ * l / B_seq.

We define the random bandwidth as follows.

B_rand = n * z / t = z / ( (1 - h) * ⌈z/l⌉ * l / B_seq + h * ⌈z/l'⌉ * t_l' )    (3)

Given a fixed input, we maximize the cache hit rate in order to maximize the random bandwidth. We develop a probabilistic model to estimate the cache hit rate. Specifically, we estimate the cache hit rate in two cases. One is that the access locations are totally random, and the other is with the knowledge of the number of partitions in the input data. The number of partitions is usually a priori knowledge for various applications such as radix sort and hash partitioning. For instance, given the number of bits used in the radix sort, Bits, the number of partitions for the input data is 2^Bits. With the knowledge of the number of partitions, we have a more accurate estimation of the cache hit rate. When there is no knowledge of the number of partitions, we assume the access locations are random and use the first case to estimate the cache hit rate.

Case I: totally random accesses. We first estimate the expected number of distinct memory blocks, E, accessed by n random accesses.
Suppose the total number of memory blocks for storing the input data array is k. Let E_j be the number of cases accessing exactly j (1 <= j <= n) distinct memory blocks. We estimate E_j as follows.

E_j = C(k, j) * S(n, j) * j!    (4)

In this equation, C(k, j) is the number of cases of choosing j memory blocks from the available k memory blocks; S(n, j) is the Stirling number of the second kind [9], which equals the number of cases of partitioning n accesses into j memory blocks; j! is the number of cases in the permutation of the j memory blocks. Since the total number of cases in accessing the k memory blocks is k^n, we estimate the expected number of distinct memory blocks accessed by these n accesses in Eq. 5.

E = ( Σ_{j=1}^{n} E_j * j ) / k^n    (5)

Based on E and the number of cache lines in the GPU cache, N, we estimate the cache miss rate, m, in Eq. 6.

m = E / n,                              if E <= N
m = ( E + (1 - N/E) * (n - E) ) / n,    otherwise    (6)

When E > N, we model cache misses in two categories: the compulsory misses of loading the E memory blocks, and the capacity misses, (1 - N/E) * (n - E). Intuitively, given a fixed input, the larger the cache size, the smaller the cache miss rate.
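The following host-side sketch (an assumption of ours, not the paper's code) shows one way to evaluate the Case I estimate. Rather than summing the Stirling-number terms of Eqs. 4-5 directly, it uses the standard closed form for the expected number of distinct blocks touched by n uniform random accesses, E = k * (1 - (1 - 1/k)^n), which equals the expectation defined by Eq. 5, and then applies Eq. 6.

// Host-side Case I estimate (illustrative). n: number of accesses,
// k: number of memory blocks holding the input, cacheLines: N in Eq. 6.
#include <cmath>

double expectedDistinctBlocks(double n, double k) {
    return k * (1.0 - std::pow(1.0 - 1.0 / k, n));       // closed form of Eq. 5
}

double caseOneHitRate(double n, double k, double cacheLines) {
    double E = expectedDistinctBlocks(n, k);
    double m;
    if (E <= cacheLines)
        m = E / n;                                        // compulsory misses only
    else
        m = (E + (1.0 - cacheLines / E) * (n - E)) / n;   // plus capacity misses
    return 1.0 - m;                                       // h = 1 - m
}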

Finally, we compute the cache hit rate of Case I to be h = 1 - m.

Case II: given the number of partitions, p. In this case, if the write locations of a thread warp belong to the same partition, they are consecutive. The writes to consecutive locations are sequential, and have a good cache locality. Suppose a warp consists of w threads. Each thread accesses one tuple at a time. The number of tuples accessed by the warp is w. To estimate the number of cache misses, we estimate the expected number of distinct partitions, D, accessed by the warp. Let D_j be the number of cases containing exactly j distinct partitions, 1 <= j <= min(w, p). We estimate D_j to be D_j = C(p, j) * S(w, j) * j!. The estimation is similar to that of the number of distinct memory blocks accessed in Case I. Since the total number of cases is p^w, we estimate D in Eq. 7.

D = ( Σ_{j=1}^{min(p, w)} D_j * j ) / p^w    (7)

The average number of tuples belonging to each of these D partitions is w / D. The number of memory blocks for each partition is ⌈(w/D) * z / l⌉. Therefore, the cache miss rate within a thread warp is estimated in Eq. 8.

m = D * ⌈(w/D) * z / l⌉ / w    (8)

The cache hit rate within a thread warp is then (1 - m). Finally, we compute the cache hit rate of Case II to be the sum of the intra- and inter-warp cache hit rates. The inter-warp cache hit rate is estimated as in Case I, totally random accesses. Note that, even with the knowledge of the number of partitions, the accesses of warps on different multiprocessors are random due to the unknown execution order among warps.

3.4 Improving the memory performance

A basic implementation of the scatter is to sequentially scan L and R_in once, and output all R_in tuples to R_out during the scan. Similarly, the basic implementation of the gather is to scan L once, read the R_in tuples according to L, and write the tuples to R_out sequentially. Given B_seq and B_rand, the total execution time of the scatter and the gather is estimated in Eq. 9 and Eq. 10, respectively. Given an array X, we denote the number of tuples in X as |X| and the size of X in bytes as ||X||.

T_scatter = ( ||R_in|| + ||L|| ) / B_seq + ||R_out|| / B_rand    (9)

T_gather = ( ||L|| + ||R_out|| ) / B_seq + ||R_in|| / B_rand    (10)

This basic implementation is simple. However, if L is random, the scatter and the gather suffer from the random access, which has low cache locality and results in a low bandwidth utilization. Since B_rand is usually much lower than B_seq, we consider a multi-pass optimization scheme to improve the cache locality. We apply the optimization technique to both the scatter and gather operations, and illustrate it using the scatter as an example. Suppose we perform the scatter in nchunk passes. In each pass, we output only those R_in tuples that belong to a certain region of R_out. That is, the algorithm first divides R_out into nchunk chunks, and then performs the scatter in nchunk passes. In the ith pass (1 <= i <= nchunk), it scans L once, and outputs the R_in tuples belonging to the ith chunk of R_out (i.e., the write locations are between ||R_out|| * (i - 1) / nchunk and ||R_out|| * i / nchunk). Since each chunk is much smaller than R_out, our multi-pass scatter has a better cache locality than the single-pass one.

Figure 5. Single-pass vs. multi-pass scatter: (a) the single-pass scatter; (b) the multi-pass scatter, which writes one chunk of R_out per pass and has a better cache locality than the single-pass one.
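A hedged CUDA sketch of the multi-pass scatter follows (ours, not the paper's code; names are illustrative). Each pass rescans L and R_in but only performs the writes whose destinations fall inside the current chunk of R_out, so the random writes of a pass stay within a cache-sized region; the host driver runs nchunk such passes, with nchunk chosen according to Eq. 11 below.

// One pass of the multi-pass scatter plus a simple host driver (illustrative).
#include <algorithm>

struct Tuple { int key; int rid; };              // same 8-byte tuple as before

__global__ void scatterPassKernel(const Tuple *r_in, const int *loc,
                                  Tuple *r_out, int n,
                                  int chunkBegin, int chunkEnd) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        int dest = loc[i];
        if (dest >= chunkBegin && dest < chunkEnd)   // write only this pass's region
            r_out[dest] = r_in[i];
    }
}

void multiPassScatter(const Tuple *r_in, const int *loc, Tuple *r_out,
                      int n, int nOut, int nchunk) {
    int chunkSize = (nOut + nchunk - 1) / nchunk;
    for (int c = 0; c < nchunk; ++c)                 // nchunk passes over R_in and L
        scatterPassKernel<<<1024, 256>>>(r_in, loc, r_out, n,
                                         c * chunkSize,
                                         std::min((c + 1) * chunkSize, nOut));
}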
Let us illustrate our multi-pass scheme using an example, as shown in Figure 5. For simplicity, we assume that the GPU cache can hold only one memory block and a block can hold three tuples. Under this assumption, each write in the single-pass scheme in Figure 5(a) results in a miss. Consequently, the single-pass scheme has a low cache locality. In comparison, the multi-pass scheme divides R_out into two chunks, and writes one chunk in each pass. It achieves an average hit rate of 2/3.

We estimate the total execution time of the multi-pass scatter as scanning R_in and L nchunk times and outputting the data. Given the nchunk value, we estimate the cache hit rate in a similar way to the single-pass scheme. The major difference is that in the estimation of the random bandwidth, the number of data accesses considered in each pass is n/nchunk, where n is the total number of data accesses in the scatter. When nchunk > 1, n/nchunk is smaller than n, so the multi-pass scheme has a higher cache hit rate than the single-pass scheme. Given the estimated random bandwidth for the multi-pass scheme when the number of passes is nchunk, B'_rand, we estimate the execution time of the multi-pass scatter, T'_scatter, in Eq. 11. We choose the nchunk value so that T'_scatter is minimized.

T'_scatter = ( ||R_in|| + ||L|| ) * nchunk / B_seq + ||R_out|| / B'_rand    (11)

3.5 Evaluation

We evaluated our model and optimization techniques with different data sizes and data distributions. We increase the data size to be sufficiently large to show the performance trend. The default tuple size is 8 bytes, and the default size of the input data array is 128 MB (i.e., the input data array consists of 16 million tuples). The default data distribution is random. We have implemented and tested our algorithms on a PC with an NVIDIA G80 GPU. The PC runs Windows XP on an Intel Core 2 Quad CPU. We also run our CPU-based algorithms on a Windows Server machine with two AMD Opteron dual-core processors. The hardware configuration of the PC and the server is shown in Table 1. The cache configuration of the GPU is obtained based on the GPU memory model proposed by Govindaraju et al. []. The cache latency information of the CPUs is obtained using a cache calibrator [], and that of the GPU is obtained from the hardware specification [6]. We estimate the transfer unit between the cache and the GPU to be the memory interface width, i.e., 384 bits, since we do not have any details on this value. We compute the theoretical memory bandwidth to be the bus width multiplied by the memory clock rate. Thus, the GPU, Intel and AMD machines have a theoretical bandwidth of 86.4 GB/s, .4 GB/s and 8. GB/s, respectively. Based on our measurements of sequentially scanning a large array, the G80 achieves a peak memory bandwidth of around 69 GB/s whereas the Intel and the AMD achieve 5.6 GB/s and 5. GB/s, respectively.

To evaluate our model, we define the accuracy of our model to be (1 - |v - v'| / v), where v and v' are the measured and estimated values, respectively. We run each experiment five times and report the average value. Since we aim at validating our cache model and evaluating the scatter and the gather on the GPU, the results for the GPU-based algorithms in this section do not include the data transfer time between the CPU and the GPU.

Table 1. Hardware configuration

                           GPU (G80)     CPU (Intel)                 CPU (AMD)
Processors                 1350 MHz      2.4 GHz (quad-core)         2.4 GHz (two dual-core)
Cache size                 -             L1: 32 KB x 4, L2: 8 MB     L1: 64 KB, L2: 1 MB
Cache block size (bytes)   256           L1: 64, L2: 128             L1: 64, L2: 64
Cache access time (cycle)  -             -                           -
DRAM (MB)                  768           -                           -
DRAM latency (cycle)       -             -                           -
Bus width (bit)            384           -                           -
Memory clock (GHz)         1.8           -                           -

Validating the performance model. Figure 6 demonstrates the measured and estimated performance of sequential scatter and gather operations on the GPU. The gather and the scatter have a very similar performance trend as the data size increases. The figure also indicates that our estimation achieves an accuracy of over 87% for the gather and the scatter (min: 7%, max: 99%).

Figure 6. The measured and estimated performance of the sequential single-pass scatter (top) and gather (bottom) on the GPU, with the data size varied.

Figure 7. The measured and estimated performance of the single-pass scatter (top) and gather (bottom) on the GPU with random locations.

Figure 7 shows the measured and estimated performance with the data size varied when the read/write locations are totally random. Our model has a high accuracy in predicting the performance of the gather and the scatter. The average accuracy of our model on the gather and the scatter is over 90% (min: 6%, max: 98%). Again, the gather and the scatter have a very similar performance trend as the data size increases. In the following, we report the results for the scatter only, because the results of the gather are similar to those of the scatter.

Figure 8 shows the measured and estimated performance with the number of partitions in the input data varied. The data size is 128 MB. Given the number of partitions, the partition ID of each tuple is randomly assigned. When the number of partitions is larger than 32 (i.e., the warp size on the G80), the measured time increases dramatically due to the reduced cache reuse within a warp. We also observe this performance increase in our estimated time. These figures indicate that our estimation is accurate for different data sizes and different numbers of partitions in the input data. The average accuracy in Figure 8 is over 80% (min: 7%, max: 99%).

Figure 8. The measured and estimated single-pass scatter performance on the GPU: (top) the effective bandwidth of the random access; (bottom) the execution time.

Figure 9. The measured and estimated performance of the multi-pass scatter on the GPU with the number of passes varied.

Finally, we validate our performance model on the multi-pass scheme. Figure 9 shows the measured and estimated performance of the multi-pass scatter with the number of passes varied. The suitable value for the number of passes is 6. When the number of passes is smaller than the suitable value, the scatter performance improves because the cache locality of the scatter is improved. When the number of passes is larger than the suitable value, the overhead of extra scans in the multi-pass scheme becomes significant. The average accuracy of our model in Figure 9 is 86% (min: 7%, max: 99%), and our estimations for different data sizes and different numbers of partitions are also highly accurate.

Evaluating the multi-pass scheme. We first consider whether we can improve the performance by dividing the input array into multiple segments and applying our multi-pass scheme to each segment. Since a segment can fit into the cache, the cost of the multiple scans is minimized. Figure 10 shows the scatter performance when the number of segments, #bin, increases. The input array is 128 MB. When #bin is 512, the segment size is 256 KB, which is smaller than the GPU cache size. When the #bin value increases, the scatter time increases as well. This is because the cache locality between the scatters on different segments gets worse.

We next evaluate our multi-pass scheme with the number of partitions, p, varied. The result is shown in Figure 11. When p <= 8, the suitable number of passes in our multi-pass scheme is one, and the multi-pass scatter reduces to the single-pass one. As p increases beyond 8, our multi-pass scheme outperforms the single-pass scheme. Regardless of the number of partitions in the input relation, our model correctly predicts a suitable value for the number of passes, and our multi-pass optimization technique improves the scatter performance by up to three times.

Figure 10. The scatter performance with the number of segments varied.

Figure 11. The performance of the multi-pass scatter on the GPU with the number of partitions, p, varied (single-pass vs. multi-pass).

Finally, we compare the performance of the scatter on the GPU and the CPU when the write locations are random. We obtain similar performance results as the number of partitions is varied, and omit those results. Figure 12 shows the scatter performance with the tuple size varied and the number of tuples fixed to 16 million. The maximum tuple size is 16 bytes due to the limited size of the device memory. Figure 13 shows the elapsed time with the number of tuples varied and the tuple size fixed to be 8 bytes.

On the CPU, we consider whether our multi-pass scheme and the multithreading technique can improve the performance of the scatter. We find that the multi-pass scheme yields little performance improvement on either Intel or AMD. This is due to the difference between the memory architectures of the GPU and the CPU. We therefore apply the single-pass scheme to the gather and the scatter on the CPU. The multithreading technique improves the performance of the scatter by -X and -4X on Intel and AMD, respectively.

On the GPU, the effective bandwidth of the random access increases as the tuple size increases. When the tuple size is fixed, the effective bandwidth is stable as the data size increases. Regardless of the tuple size and the data size, the GPU-based scatter has a much better performance than the optimized CPU-based one. The speedup is 7-X and -4X on Intel and AMD, respectively. On both the GPU and the CPUs, the effective bandwidth is much lower than the peak bandwidth. This indicates that the random access pattern has low locality and yields low bus utilization. Nevertheless, our multi-pass scheme improves the effective bandwidth of the GPU-based scatter by 2-4X.

Figure 12. The scatter performance on the CPU and the GPU with the tuple size varied (GPU single-pass, GPU multi-pass, Intel with optimization, AMD with optimization).

Figure 13. The scatter time on the CPU and the GPU with the data size varied.

4. APPLICATION AND ANALYSIS

Scatter and gather are two fundamental data-parallel operations for many parallel algorithms [9]. In this section, we use scatter and gather to implement three memory-intensive algorithms on GPUs: the radix sort, the hash search, and the sparse-matrix vector multiplication. Specifically, we use our improved scatter to implement the radix sort and our improved gather for the hash search and the sparse-matrix vector multiplication. We also compare the end-to-end performance of the three applications on the GPU and the CPU. Our total execution time for the GPU-based algorithms includes the data transfer time between the CPU and the GPU.

4.1 Radix sort

Radix sort is one of the highly parallel sorting algorithms [4]. It has a linear-time computational complexity. In this study, we implement the most significant digit (MSD) radix sort. The algorithm first performs P-phase partitioning on the input data, and then sorts the partitions in parallel. In each phase, we divide the input data into multiple partitions using Bits bits. In the ith (1 <= i <= P) phase of the radix sort, we use the (i*Bits)th to ((i+1)*Bits)th bits from the most significant bit. After partitioning, we use the bitonic sort [] to sort each partition. The bitonic sort can exploit the fast shared memory on the G80 [6], where the shared memory of each multiprocessor is 16 KB. Different from the cache, the shared memory is programmer-manipulated. Additionally, it is shared by the threads on the same multiprocessor. When the partition size is smaller than the shared memory size, the bitonic sort works entirely on the data in the shared memory, and causes no further cache misses.
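As an illustration of sorting a partition entirely in shared memory, here is a hedged CUDA sketch of a bitonic sort kernel (ours, not the paper's code). It sorts keys only, assumes each partition is padded to a power-of-two size SORT_SIZE equal to the thread-block size, and assigns one partition to each thread block; once the partition is loaded, all compare-exchange steps operate on shared memory.

// Illustrative bitonic sort of one partition per thread block in shared memory.
#define SORT_SIZE 512                      // power of two; launch with blockDim.x == SORT_SIZE

__global__ void bitonicSortPartition(int *keys) {
    __shared__ int s[SORT_SIZE];
    int tid  = threadIdx.x;
    int base = blockIdx.x * SORT_SIZE;
    s[tid] = keys[base + tid];             // load the partition into shared memory
    __syncthreads();

    for (int k = 2; k <= SORT_SIZE; k <<= 1) {        // bitonic merge stages
        for (int j = k >> 1; j > 0; j >>= 1) {        // compare-exchange distance
            int partner = tid ^ j;
            if (partner > tid) {
                bool ascending = ((tid & k) == 0);
                if ((s[tid] > s[partner]) == ascending) {
                    int tmp = s[tid]; s[tid] = s[partner]; s[partner] = tmp;
                }
            }
            __syncthreads();
        }
    }
    keys[base + tid] = s[tid];             // write the sorted partition back
}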
We determine the suitable P value so that the resulting size of each partition is smaller than the shared memory size, i.e., P = log_f(n/S), where f is the number of partitions generated in each phase, i.e., 2^Bits; n is the size of the input array (bytes); and S is the shared memory size of each multiprocessor (bytes).

Figure 14. An example of radix sort (Data, PID, Loc and Output arrays; Phase 1 uses the first bit from the left, Phase 2 the second bit from the left).

We implement the partitioning scheme using the scatter. Given an input array of data tuples, dataArray, the partitioning process is performed in three steps. The algorithm first computes the partition ID for each tuple and stores all partition IDs into an array, pidArray, in the same order as dataArray. Next, it computes the write location for each tuple based on pidArray, and stores the write locations into a third array, locArray. Finally, it applies our multi-pass scatter to output dataArray according to locArray. An example of the radix sort on eight tuples is illustrated in Figure 14.

The first step is fast, because it is a sequential scatter and a scan. We consider optimizations on the second and third steps. For the second step, we use histograms to compute the write location for each tuple. Our histograms are similar to those in the parallel radix sort proposed by Zagha [4]. The main difference is that we fully consider the hardware features of the GPU in designing our histograms. Specifically, in our partitioning algorithm, each thread is responsible for a portion of the input array. Because the histogram for each thread is accessed randomly, we store the histograms in the shared memory to reduce the latency of these random accesses. In our implementation, each histogram element is encoded in one byte. The size of each histogram is 2^Bits bytes. Since each thread has one histogram, the Bits value must satisfy the following constraint due to the limited size of the shared memory: T * 2^Bits <= S, where T is the number of concurrent threads sharing the shared memory and S is the size of the shared memory.

For the third step, we determine the suitable number of passes to minimize the execution time of the radix sort. The number of phases, i.e., the number of scatter operations, in the radix sort is P. We estimate the execution time of each scatter as in Case II with the number of partitions being 2^Bits. The total execution time of these P scatter operations is C = P * T_scatter. Considering the constraint in the second step, we choose the suitable Bits value to minimize the C value.

We have implemented the radix sort on the GPU and the CPU. Figure 15 shows the performance impact of our optimization techniques on the GPU and the CPU. Each array element is an 8-byte record, including a four-byte key value and a four-byte record identifier. The key values are randomly distributed. For the CPU implementation, we consider both multithreading and cache-optimization techniques. Suppose the CPU has C_p cores. We first divide the input array into C_p partitions, and sort these partitions in parallel. Each partition is sorted by a thread running an optimized CPU-based radix sort, which is a multi-pass algorithm []. It determines the number of bits used in each pass according to the cache parameters of the CPU, e.g., the number of cache lines in the L1 and the L2 caches and the number of entries in the data TLB (Translation Lookaside Buffer). According to the existing cost model [], the number of bits in each pass is set to six on both Intel and AMD so that the cache stalls are minimized. The number of partitions generated in one pass is 2^6 = 64. Finally, we merge the C_p sorted partitions into one.

The performance comparison for the CPU-based radix sort with and without optimizations is shown at the top of Figure 15. Our optimization techniques greatly improve the radix sort, by 85% on AMD, and slightly improve the radix sort on Intel. On the GPU, the number of bits used in each phase of the radix sort, Bits, is five. That is, in each phase, an array is divided into 32 sub-arrays. Our test data takes three phases of partitioning in total. The suitable number of passes in our optimized scatter is four according to our performance model. As shown in Figure 15, our multi-pass scheme improves the GPU-based radix sort by over %. Note that the data transfer time between the GPU and the CPU is -% of the total execution time of the GPU-based radix sort.

Figure 16 highlights the performance of the GPU-based radix sort, the GPU-based bitonic sort [], and the CPU-based radix sort []. The GPU-based radix sort is twice as fast as the GPU-based bitonic sort. Moreover, our GPU-based algorithm is over X faster than the best optimized CPU-based radix sort. Finally, we perform a back-of-envelope calculation of the system bandwidth.
We estimate the data transfer of the GPU-based radix sort by considering the cache misses estimated by our model and the internal data structures in the radix sort. The CPU-based radix sort is estimated using the existing cost model []. The estimated bandwidth of the radix sort is on average .4 GB/s, . GB/s, and . GB/s on the GPU, Intel and AMD, respectively.

Figure 15. The performance impact of optimizations on the CPU (top, with and without optimization on Intel and AMD) and the GPU (bottom, with and without optimization), with the number of keys (million) varied.

Figure 16. The execution time of sorting algorithms on the GPU and the CPU (GPU radix sort with optimization, GPU bitonic sort, Intel and AMD radix sort with optimization).

4.2 Hash search

A hash table is commonly used to speed up equality search on a set of records. Applications using hash tables include the hash join and hash partitioning in databases []. A hash table consists of multiple hash headers, each of which maintains a pointer to its corresponding bucket. Each bucket stores the key values of the records that have the same hash value, and the pointers to the records. An example of the hash table is illustrated in Figure 17(a). Since dynamic allocation and deallocation of the device memory is not supported on current GPUs, we use an array to store the buckets, and each header maintains the start position of its corresponding bucket in the array.

The hash search takes as input a number of search keys, performs probes on a hash table with these keys, and outputs the matching records to a result buffer in the device memory. In our implementation, each thread is responsible for one search key. To avoid conflicts when multiple threads write results to the shared result buffer concurrently, we implement the hash search in four steps. First, for each search key, we use the hash function to compute its corresponding bucket ID and determine the (start, end) pair indicating the start and end locations of the bucket. Second, we scan the bucket and determine the number of matching results. Third, based on the number of results for each key, we compute a prefix sum on these numbers to determine the start location at which the results of each search key are written, and output the record IDs of the matching tuples to an array, L. Fourth, we perform a gather according to L to fetch the actual result records. Due to the random nature of the hash search, we estimate the execution time of the gather as in Case I. Figure 17(b) illustrates the four steps of searching two keys.

Figure 17. An example of hash search: (a) the hash table (headers and buckets of records R1-R8, with hash function H(x) = x mod 7); (b) searching two keys (step 1: the (start, end) pairs; step 2: the number of matches; step 3: the record IDs of the matching tuples; step 4: the output records).
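As a concrete illustration of the first two steps above, the following hedged CUDA sketch (ours, not the paper's code) probes the bucket of each search key and counts its matches; the flat bucket layout as (start, end) ranges into a key array and all names are assumptions of ours.

// Illustrative probe-and-count kernel for hash-search steps one and two.
__global__ void probeCountKernel(const int *searchKeys, int numKeys,
                                 const int *bucketStart, const int *bucketEnd,
                                 const int *bucketKeys, int numBuckets,
                                 int *matchCount) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < numKeys; i += stride) {
        int key = searchKeys[i];
        int b = ((unsigned)key) % numBuckets;        // hash function, e.g. modulo
        int count = 0;
        for (int j = bucketStart[b]; j < bucketEnd[b]; ++j)
            if (bucketKeys[j] == key)
                ++count;
        matchCount[i] = count;   // a prefix sum over matchCount then gives the
                                 // write offsets used in steps three and four
    }
}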
Figure 18 shows the execution time of the hash search on the GPU and the CPU. Note that the data transfer time is 4-6% of the total execution time of the hash search. The values of the search keys and the keys in the hash table are all 32-bit random integers. The record size is 8 bytes. We fixed the number of search keys to be 4 million and varied the total number of records in the hash table. The load factor of the hash table is around 99%. Similar to the CPU-based radix sort, we consider multithreading and cache-optimization techniques for the CPU-based hash search. Suppose the CPU has C_p cores. We first divide the search keys into C_p partitions, and probe the hash table using the search keys of each partition in parallel. In our implementation, we use C_p threads and each thread is in charge of one partition. Each thread runs the cache-efficient hash search using a multi-pass scheme similar to that on the GPU. The difference is that the chunk size is set to be the L2 data cache size. Similar to the CPU-based radix sort, the performance improvement of our optimization on AMD is significant and that on Intel is moderate. Based on our performance model, the suitable number of passes for the GPU-based multi-pass gather is 6.

We observe that the optimized GPU-based hash search is 7.X and .5X faster than its optimized CPU-based counterpart on Intel and AMD, respectively. Additionally, the hash search optimized with our multi-pass scheme is on average 5% faster than the single-pass scheme. We also varied the number of search keys and the number of records in the hash table, and obtained similar performance results. Through a back-of-envelope calculation, the estimated bandwidth of the hash search is on average . GB/s, .56 GB/s, and . GB/s on the GPU, Intel and AMD, respectively.

Figure 18. Hash search on the GPU and the CPU (GPU with and without optimization, Intel and AMD with optimization), with the number of keys in the hash table (million) varied.

4.3 Sparse-matrix vector multiplication (SpMV)

Sparse-matrix vector multiplication (SpMV) is widely used in many linear solvers [5]. SpMV stores the sparse matrix in a compressed format and performs the multiplication on the compressed matrix. We adopt the compressed sparse row (CSR) format from the Intel MKL library [] to represent the sparse matrix: for each row, only the non-zero elements are stored. In row i, if column j is not zero, we represent this cell as a tuple (j, v), where v is the value of the cell (i, j). This compressed format stores the matrix in four arrays: values, columns, pointerB and pointerE. Array values stores all the values of non-zero cells in the order of their row and column indexes. Array columns stores all the column indexes for the non-zero cells in the row order, i.e., columns[i] stores the column index of the corresponding cell of values[i]. Arrays pointerB and pointerE store the start and end positions of each row in the values array, respectively. That is, the positions of the non-zero cells of the ith row are between pointerB[i] and (pointerE[i] - 1).

Given a sparse matrix M and a dense vector V, we store M in the compressed format and compute the multiplication W = M * V. Our algorithm computes W in two steps. First, we scan values and columns once, and compute the partial multiplication results of values and V into an array, R, where R[i] = values[i] * V[columns[i]]. This multiplication is done through a gather that fetches the values of the vector V at the read locations given in the array columns. Second, based on R, we sum up the partial results into W:

W[i] = Σ_{j = pointerB[i]}^{pointerE[i] - 1} R[j]

When the matrix is large, SpMV is both data- and computation-intensive. We use our multi-pass scheme to reduce the random accesses to the vector. Applying our performance model to SpMV, we estimate the execution time of the gather as in Case I.
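The two SpMV steps described above can be sketched in CUDA as follows (ours, not the paper's code; single precision, with the CSR arrays values, columns, pointerB and pointerE as described). The first kernel performs the gather from V at the positions in columns and forms the partial products R; the second sums each row's slice of R into W.

// Illustrative two-step SpMV over the CSR layout described above.
__global__ void spmvGatherKernel(const float *values, const int *columns,
                                 const float *V, float *R, int nnz) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < nnz; i += stride)
        R[i] = values[i] * V[columns[i]];        // gather from V via columns
}

__global__ void spmvRowSumKernel(const float *R, const int *pointerB,
                                 const int *pointerE, float *W, int numRows) {
    int stride = gridDim.x * blockDim.x;
    for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < numRows; row += stride) {
        float sum = 0.0f;
        for (int j = pointerB[row]; j < pointerE[row]; ++j)
            sum += R[j];                         // W[row] = sum of the row's partials
        W[row] = sum;
    }
}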
Figure 19. Conjugate gradient benchmark on the GPU and the CPU (GFLOPS; GPU with and without optimization, Intel and AMD with optimization), for classes A and B.

Figure 20. Conjugate gradient benchmark with real-world matrices on the GPU and the CPU (GFLOPS; matrices Stanford_Berkeley, pkustk4, pkustk8 and nd4k).

Figure 19 plots the performance in GFLOPS (Giga Floating-point Operations per Second) of SpMV on the GPU and the CPU. We applied our SpMV algorithm to the conjugate gradient benchmark [5], where SpMV is a core component of the sparse linear solver. The benchmark solves a linear system by calling the conjugate gradient (CG) method niter times. Each execution of CG has 25 iterations. Each iteration consists of a SpMV on the matrix and a dense vector, in addition to some vector manipulation. In our experiments, the total execution time of SpMV is over 90% of the total execution time of the CG benchmark on both the GPU and the CPU. There are two classes in the benchmark, classes A and B. The vector sizes are 14 and 75 thousand (i.e., 56 KB and 300 KB) for classes A and B, respectively. The niter value is 15 and 75 for classes A and B, respectively.

The CPU-based SpMV uses the Intel MKL routine mkl_dcsrmv []. The MKL routine does not utilize all the cores on the CPU, so we further consider multithreading techniques on the CPU. Suppose the CPU has C_p cores. We horizontally divide the matrix into C_p sub-matrices. Next, each thread performs the multiplication of a sub-matrix and the vector using the mkl_dcsrmv routine. The multithreading technique improves the benchmark by around 2X on both AMD and Intel.

Since the vector size is smaller than the cache size on the G80, the vector fits into the cache. As a result of this small vector size, the performance of the GPU-based algorithm with optimization is similar to that without. Note that the GPU-based SpMV uses single precision, whereas the MKL routine uses double precision. The single-precision computation may have up to a 2X performance advantage over the double-precision computation. Nevertheless, the GPU-based algorithm is more than six times faster than the CPU-based algorithm.

We further evaluated our SpMV algorithms using the CG benchmark with real-world matrices downloaded from the Sparse Matrix Collection [7]. We follow Gahvari's categorization of these matrices [8]: the matrix size is small, medium or large, and the number of non-zero elements per row is low, medium or high. Figure 20 shows the performance comparison for four large real-world matrices, namely pkustk8, pkustk4, Stanford_Berkeley and nd4k. These matrices have a similar size, and their non-zero elements per row increase from 98 and 45 to 99. The benchmark configuration is the same as class B. The optimized SpMV with our multi-pass scheme is moderately faster than the one with the single-pass scheme. Moreover, the GPU-based algorithm is over six times faster than the CPU-based algorithm. We also evaluated our algorithms with matrices of different sizes and obtained similar performance results.

5. CONCLUSION AND FUTURE WORK

We presented a probabilistic model to estimate the memory performance of scatter and gather operations. Moreover, we proposed optimization techniques to improve the bandwidth utilization of these two operations by improving the memory locality of data accesses. We have applied our techniques to three scientific applications and have compared their performance against prior GPU-based algorithms and optimized CPU-based algorithms. Our results show a significant performance improvement on the NVIDIA 8800 GPU.

There are several avenues for future work. We are interested in applying our scatter and gather operations to other scientific applications. We are also interested in developing efficient scatter and gather operations on other architectures such as the IBM Cell processor and AMD Fusion. Finally, it is an interesting future direction to extend our algorithms to multiple hosts, e.g., clusters or MPPs without shared memory.

6. ACKNOWLEDGMENTS

The authors thank the shepherd, Leonid Oliker, and the anonymous reviewers for their insightful suggestions, all of which improved this work. Funding for this work was provided in part by grants DAG5/6.EG and 676 from the Hong Kong Research Grants Council.

7. REFERENCES

[1] CWI Calibrator.
[2] HPF.
[3] Intel Math Kernel Library 9.
[4] MPI.
[5] NAS parallel benchmarks.
[6] NVIDIA CUDA (Compute Unified Device Architecture).
[7] Sparse matrix collection.
[8] ZPL.
[9] M. Abramowitz and I. A. Stegun (Eds.). Stirling numbers of the second kind. In Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing. New York: Dover, 1972.
[10] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9), 1988.
[11] D. H. Bailey. Vector computer memory bank contention. IEEE Transactions on Computers, C-36(3), March 1987.
[12] G. E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-90-190, Nov. 1990.
[13] P. Boncz, S. Manegold and M. Kersten. Database architecture optimized for the new bottleneck: memory access. In Proc. of the International Conference on Very Large Data Bases (VLDB), pp. 54-65, 1999.
[14] I. Buck. Taking the plunge into GPU computing. In GPU Gems 2, M. Pharr (Ed.). Addison-Wesley, Mar. 2005.
[15] J. Bolz, I. Farmer, E. Grinspun and P. Schröder. Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Transactions on Graphics, 2003.
[16] K. Fatahalian, J. Sugerman and P. Hanrahan. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2004.
[17] M. Frigo, C. E. Leiserson, H. Prokop and S. Ramachandran. Cache-oblivious algorithms. In Proc. of the 40th Annual Symposium on Foundations of Computer Science, 1999.
[18] H. Gahvari. Benchmarking sparse matrix-vector multiply. Master's thesis, Computer Science Division, U.C. Berkeley, December 2006.
[19] N. Galoppo, N. Govindaraju, M. Henson and D. Manocha. LU-GPU: efficient algorithms for solving dense linear systems on graphics hardware. In Proc. of the 2005 ACM/IEEE Conference on Supercomputing.
[20] N. Govindaraju, J. Gray, R. Kumar and D. Manocha. GPUTeraSort: high performance graphics co-processor sorting for large database management. In Proc. of the 2006 ACM SIGMOD International Conference on Management of Data.
[21] N. Govindaraju, S. Larsen, J. Gray and D. Manocha. A memory model for scientific algorithms on graphics processors. In Proc. of the 2006 ACM/IEEE Conference on Supercomputing.
[22] N. Govindaraju, B. Lloyd, W. Wang, M. Lin and D. Manocha. Fast computation of database operations using graphics processors. In Proc. of the 2004 ACM SIGMOD International Conference on Management of Data.
[23] N. Govindaraju, N. Raghuvanshi and D. Manocha. Fast and approximate stream mining of quantiles and frequencies using graphics processors. In Proc. of the 2005 ACM SIGMOD International Conference on Management of Data.
[24] D. Horn. Lib GPU FFT, 2006.
[25] P. Kipfer, M. Segal and R. Westermann. UberFlow: a GPU-based particle engine. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2004.
[26] A. LaMarca and R. Ladner. The influence of caches on the performance of sorting. In Proc. of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, 1997.
[27] E. Larsen and D. McAllister. Fast matrix multiplies using graphics hardware. In Proc. of the 2001 ACM/IEEE Conference on Supercomputing.
[28] K. Moreland and E. Angel. The FFT on a GPU. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2003.
[29] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn and T. J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26, 2007.
[30] T. Purcell, C. Donner, M. Cammarano, H. W. Jensen and P. Hanrahan. Photon mapping on programmable graphics hardware. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2003.
[31] S. Sengupta, M. Harris, Y. Zhang and J. D. Owens. Scan primitives for GPU computing. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2007.
[32] J. Vitter. External memory algorithms and data structures: dealing with massive data. ACM Computing Surveys, 2001.
[33] S. Williams, J. Shalf, L. Oliker, P. Husbands and K. Yelick. Dense and sparse matrix operations on the Cell processor. Lawrence Berkeley National Laboratory, Paper LBNL-585, May 2005.
[34] M. Zagha and G. E. Blelloch. Radix sort for vector multiprocessors. In Proc. of the 1991 ACM/IEEE Conference on Supercomputing.
