Efficient Gather and Scatter Operations on Graphics Processors
Bingsheng He #    Naga K. Govindaraju *    Qiong Luo #    Burton Smith *
# Hong Kong Univ. of Science and Technology    * Microsoft Corp.
{saven, luo}@cse.ust.hk    {nagag, burtons}@microsoft.com

ABSTRACT
Gather and scatter are two fundamental data-parallel operations, where a large number of data items are read (gathered) from or are written (scattered) to given locations. In this paper, we study these two operations on graphics processing units (GPUs). With superior computing power and high memory bandwidth, GPUs have become a commodity multiprocessor platform for general-purpose high-performance computing. However, due to the random access nature of gather and scatter, a naive implementation of the two operations suffers from a low utilization of the memory bandwidth and consequently a long, unhidden memory latency. Additionally, the architectural details of the GPUs, in particular the memory hierarchy design, are unclear to the programmers. Therefore, we design multi-pass gather and scatter operations to improve their data access locality, and develop a performance model to help understand and optimize these two operations. We have evaluated our algorithms in sorting, hashing, and sparse matrix-vector multiplication in comparison with their optimized CPU counterparts. Our results show that these optimizations yield 2-4X improvement on the GPU bandwidth utilization and 30-50% improvement on the response time. Overall, our optimized GPU implementations are 2-7X faster than their optimized CPU counterparts.

Categories and Subject Descriptors
E.5 [Files]: Optimization, sorting/searching; I.3.1 [Computer Graphics]: Hardware Architecture -- graphics processors, parallel processing.

General Terms
Algorithms, Measurement, Performance.

Keywords
Gather, scatter, parallel processing, graphics processors, cache optimization.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SC07, November 10-16, 2007, Reno, Nevada, USA. Copyright 2007 ACM.

1. INTRODUCTION
The increasing demand for faster scientific routines to enable physics and multimedia applications on commodity PCs has transformed the graphics processing unit (GPU) into a massively parallel general-purpose co-processor [9]. These applications exhibit a large amount of data parallelism and map well to the data-parallel architecture of the GPU. For instance, the current NVIDIA GPU has 128 data-parallel processors and a peak memory bandwidth of 86.4 GB/s. Many numerical algorithms including matrix multiplication [6][7][], sorting [][5][], LU decomposition [9], and fast Fourier transforms [4][8] have been designed on GPUs. Due to the high performance capabilities of GPUs, these scientific algorithms achieve significant performance improvements over optimized CPU-based algorithms. In order to further improve the performance of GPU-based scientific algorithms and enable optimized implementations similar to those for the CPU-based algorithms, recent GPUs include support for inter-processor communication using shared local stores, and support for scatter and gather operations [6]. Scatter and gather are two fundamental operations in many scientific and enterprise computing applications. These operations are implemented as native collective operations in message passing interfaces (MPI) to define communication patterns across the processors [4], and in parallel programming languages such as ZPL [8] and HPF [].
Scatter operations write data to arbitrary locations, and gather operations read data from arbitrary locations. Both operations are highly memory intensive and form the basic primitives used to implement many parallel algorithms such as quicksort [] and sparse matrix transpose [8]. In this paper, we study the performance of scatter and gather operations on GPUs. Figure 1 shows the execution time of the scatter and the gather on a GPU with the same input array but either sequential or random read/write locations. The input array is 128MB. The detailed experimental setup is described in Section 3.5. This figure shows that the location distribution greatly affects the performance of the scatter and the gather. For sequential locations, if we consider the data transfer of the output data array and internal data structures, both operations achieve a system bandwidth of over 60 GB/s. In contrast, for random locations, both operations yield a low utilization of the memory bandwidth. This comparison indicates that locality in memory accesses is an important factor for the performance of gather and scatter operations on GPUs.
Figure 1. The elapsed time of the scatter and the gather on a GPU. Both operations underutilize the bandwidth for random locations.

GPU memory architectures are significantly different from CPU memory architectures []. Specifically, GPUs consist of high-bandwidth, high-latency video memory, and GPU cache sizes are significantly smaller than those of CPUs; therefore, the performance characteristics of scatter and gather operations on GPUs may involve different optimizations than the corresponding CPU-based algorithms. Additionally, it was only recently that the scatter functionality was introduced on GPUs, and there is little work identifying the performance characteristics of the scatter on GPUs. In this paper, we present a probabilistic analysis to estimate the performance of scatter and gather operations on GPUs. Our analysis accounts for the memory access locality and the parallelism in GPUs. Using our analysis, we design optimized algorithms for scatter and gather. In particular, we design multi-pass algorithms to improve the locality of the memory accesses and thereby improve the memory performance of these operations. Based on our analysis, our algorithms are able to determine the number of passes required to improve the performance of the scatter and gather algorithms. Moreover, we demonstrate through experiments that our performance model closely estimates the performance characteristics of the operations. As a result, our analysis can be applied in runtime systems to generate better execution plans for scatter and gather operations. We use our scatter and gather to implement three common applications on an NVIDIA GeForce 8800 GPU (G80): radix sort using scatter operations, and hash search and sparse matrix-vector multiplication using gather operations. Our results indicate that our optimizations can greatly improve the utilization of the memory bandwidth.
Specifically, our optimized algorithms achieve a 2-4X performance improvement over single-pass GPU-based implementations. We also compared the performance of our algorithms with optimized CPU-based algorithms on high-end multi-core CPUs. In practice, our results indicate a 2-7X performance improvement over the CPU-based algorithms.

Organization: The remainder of the paper is organized as follows. We give a brief overview of the GPU and the prior work on GPU- or CPU-based cache-efficient algorithms in Section 2. Section 3 describes our performance model and presents our optimization techniques for GPUs. In Section 4, we use the optimized scatter and gather to improve the performance of sorting, hash search and sparse matrix-vector multiplication on the GPU. Finally, we conclude in Section 5.

2. BACKGROUND AND RELATED WORK
In this section, we first review techniques in general-purpose computing on the GPU. Next, we survey cache-optimized techniques on the CPU and the GPU.

2.1 GPGPU (General-purpose computation on GPU)
A GPU is a SIMD processing unit with high memory bandwidth, primarily intended for use in computer graphics rendering applications. In recent years, the GPU has become extremely flexible and programmable, and this programmability is strengthened in every major generation (roughly every 18 months) [9]. Such programming flexibility further facilitates the use of the GPU as a general-purpose processor. Recently, NVIDIA released the CUDA (Compute Unified Device Architecture) framework [6] together with the G80 GPU for general-purpose computation. Unlike DirectX or OpenGL, CUDA provides a programming model for a thread-based programming language similar to C. Thus, programmers can take advantage of GPUs without requiring graphics know-how. Two of the newly exposed features of the G80 GPU, data scattering and on-chip shared memory for the processing engines, are exploited by our implementation.
GPUs have recently been used for various applications such as matrix operations [6][7][] and FFT computation [4][8]. We now briefly survey the techniques that use GPUs to improve the performance of scientific and database operations. GPU-based scientific applications include linear algebra computations such as matrix multiplication [6] and sparse matrix computations [5]. Govindaraju et al. presented novel GPU-based algorithms for relational database operators including selections and aggregations [] as well as sorting [], and for data mining operations such as computing frequencies and quantiles of data streams []. For additional information on state-of-the-art GPGPU techniques, we refer the reader to a recent survey by Owens et al. [9]. The existing work mainly develops OpenGL/DirectX programs to exploit the specialized hardware features of GPUs. In contrast, we use a general-purpose programming model to utilize the GPU hardware features, and implement our algorithms with two common building blocks, i.e., scatter and gather. Recently, Sengupta et al. implemented the segmented scan using the scatter on the GPU []. Our optimization of the scatter can be applied to the segmented scan to further improve its performance. There is relatively little work on developing efficient algorithms for scatter and gather on GPUs, even though these two operations are commonly provided primitives in traditional MPI architectures [4]. Previous-generation GPUs support gather but do not directly support scatter. Buck described algorithms to implement the scatter using the gather [4]. However, these algorithms usually require complex mechanisms such as sorting, which can result in low bandwidth utilization. Even though new-generation GPUs have more programming flexibility and support both scatter and gather, the high programmability does not necessarily achieve high bandwidth utilization, as seen in Figure 1.
These existing limitations motivate us to develop a performance model to understand the scatter and the gather on the GPU, and to improve their overall performance.
2.2 Memory optimizations
Memory stalls are an important factor in the overall performance of data-intensive applications, such as relational database systems []. For the design and analysis of memory-efficient algorithms, a number of memory models have been proposed, such as the external memory model [] (also known as the cache-conscious model) and the cache-oblivious model [7]. Our model is similar to the external memory model but applies to parallel computation on the graphics processor. In addition to modeling the cache cost, Bailey proposed a memory model to estimate the cost of memory bank contention on the vector computer []. In contrast, we focus on the cache performance of the graphics processor. Algorithms reducing memory stalls can be categorized into two kinds, cache-conscious [] and cache-oblivious [7]. Cache-conscious algorithms utilize knowledge of cache parameters, such as the cache size. In contrast, cache-oblivious algorithms do not assume any knowledge of cache parameters. Cache-conscious techniques have been extensively used to improve the memory performance of CPU-based algorithms. LaMarca et al. [6] studied the cache performance of quicksort and showed that cache optimizations can significantly improve its performance. Boncz et al. [] proposed radix clustering with a multi-pass partitioning method in order to optimize the cache performance. Williams et al. [] proposed cache optimizations for matrix operations on the Cell processor. Memory optimization techniques have also been shown useful for GPU-based algorithms. These memory optimizations need to adapt to the massively threaded architecture of the GPU. Govindaraju et al. [] proposed a memory model for scientific applications and showed that optimizations based on the model greatly improved the overall performance. Fatahalian et al. [6] analyzed the cache and bandwidth utilization of matrix-matrix multiplication. Galoppo et al.
[9] designed block-based data access patterns to optimize the cache performance of dense linear systems. In comparison, we propose a general performance model for scatter and gather, and use a multi-pass scheme to improve the cache locality.

3. MODEL AND ALGORITHM
In this section, we first present our memory model of the graphics processor. Next, we give the definitions of the scatter and the gather. We then present our modeling and optimization techniques for the scatter and the gather. Finally, we present the evaluation results for our techniques.

3.1 Memory model on GPU
Current GPUs achieve a high memory bandwidth using a wide memory interface, e.g., the memory interface of the G80 is 384-bit. In order to mask the high memory latency, GPUs have small L1 and L2 SRAM caches. Compared with the CPU, the GPU typically has a flat memory hierarchy, and the details of the GPU memory hierarchy are often missing from vendor hardware specifications. We model the GPU memory as a two-level hierarchy, the cache and the device memory, as shown in Figure 2. A data access to the cache is either a hit or a miss. If the data is found in the cache, the access is a hit. Otherwise, it is a miss and a memory block is fetched from the device memory. Given the number of cache misses, #miss, and that of cache hits, #hit, the cache hit rate, h, is defined as h = #hit / (#hit + #miss). We model the GPU as M SIMD multiprocessors. At any time, all processors in a multiprocessor execute the same instruction. The threads on each multiprocessor are grouped into thread warps. A warp is the minimum schedule unit on a multiprocessor. If a warp is stalled by a memory access, it is put at the end of a waiting list of thread warps. Then, the first warp in the waiting list becomes active and is scheduled to run on the multiprocessor.

Figure 2. The memory model of the GPU. The GPU consists of M SIMD multiprocessors on top of a small cache and the high-bandwidth device memory.
The cache is shared by all the multiprocessors.

3.2 Definitions
Gather and scatter are dual operations. A scatter performs indexed writes to an array, and a gather performs indexed reads from an array. We define the two operations in Figure 3. For the scatter, the array L contains a distinct write location for each R_in tuple; for the gather, it contains the read location for each R_out tuple. Essentially, the scatter consists of sequential reads and random writes. In contrast, the gather consists of sequential writes and random reads. Figure 4 illustrates examples of the scatter and the gather.

Primitive: Scatter. Input: R_in[1..n], L[1..n]. Output: R_out[1..n]. Function: R_out[L[i]] = R_in[i], i = 1, ..., n.
Primitive: Gather. Input: R_in[1..n], L[1..n]. Output: R_out[1..n]. Function: R_out[i] = R_in[L[i]], i = 1, ..., n.
Figure 3. Definitions of scatter and gather.

To quantify the performance of the scatter and gather operations, we define the effective bandwidth of the sequential access (resp. the random access) to be the ratio of the amount of data array tuples accessed by the sequential access (resp. the random access) in the scatter and gather operations to the elapsed time. This measure indicates how much bandwidth is utilized to access the target (effective) data. That is, we define two kinds of effective bandwidth, the sequential bandwidth and the random bandwidth, to distinguish the sequential and the random access patterns. Since the sequential access has better locality than the random access, the sequential bandwidth is usually higher than the random bandwidth.
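The two primitives can be sketched in a few lines of Python (a sketch of the definitions only, not the parallel GPU implementation; indices here are 0-based rather than the paper's 1-based notation):

```python
def scatter(r_in, loc):
    # R_out[L[i]] = R_in[i]: sequential reads of R_in and L, random writes.
    r_out = [None] * len(r_in)
    for i, tup in enumerate(r_in):
        r_out[loc[i]] = tup
    return r_out

def gather(r_in, loc):
    # R_out[i] = R_in[L[i]]: random reads of R_in, sequential writes.
    return [r_in[j] for j in loc]
```

When L is a permutation, gathering with the same L undoes the scatter, which is a quick way to see the duality: `gather(scatter(r, L), L) == r`.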
Figure 4. Examples of scatter (a) and gather (b).

3.3 Modeling the effective bandwidth
In this subsection, we present our probabilistic model for estimating the sequential bandwidth and the random bandwidth (denoted as B_seq and B_rand). Since the estimation and optimization techniques are similar for the scatter and the gather operations, we describe the techniques for the scatter in detail, and present those for the gather in brief.

3.3.1 Modeling the sequential bandwidth
A sequential access fetches each memory block exactly once. A memory block causes a cache miss at the first access, and later accesses to this memory block are cache hits. Thus, the sequential access achieves a high effective bandwidth. We estimate the sequential bandwidth from measured effective bandwidths. To obtain the measured bandwidth, we perform a set of calibration measurements with input data arrays of different sizes. Given the capacity of the device memory, D_m MB, we measure the execution time of the sequential scan for input array sizes of 2^i MB, i = 1, ..., log2(D_m). Let the execution time for the input array of 2^i MB be t_i. We compute the sequential bandwidth as the weighted bandwidth of all log2(D_m) calibrations in Eq. 1, where n = log2(D_m). Compared with an estimate using the plain average, the weighted sum gives more weight to the bandwidths of the larger sizes, which more fully utilize the bus bandwidth of the GPU.

B_{seq} = \sum_{i=1}^{n} \frac{2^i}{\sum_{j=1}^{n} 2^j} \cdot \frac{2^i}{t_i}    (1)

3.3.2 Modeling the random bandwidth
Compared with the sequential access, the random access has low cache locality. Some accesses are cache hits, and many others are misses.
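The calibration of Eq. 1 can be sketched as follows. This is a minimal sketch: the function name is hypothetical, and it assumes the weight of calibration i is its array size 2^i normalized by the total, which is one plausible reading of the weighted sum.

```python
def seq_bandwidth(times, dm_mb):
    """Weighted sequential bandwidth: calibration i scans a 2^i MB array
    in times[i-1] time units; larger arrays get proportionally larger
    weights (dm_mb must be a power of two)."""
    n = dm_mb.bit_length() - 1          # n = log2(D_m)
    total = sum(2 ** j for j in range(1, n + 1))
    return sum((2 ** i / total) * (2 ** i / times[i - 1])
               for i in range(1, n + 1))
```

As a sanity check, if every calibration runs at the same bandwidth b (i.e., t_i = 2^i / b), the weighted sum returns exactly b, since the weights sum to one.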
Given the cache hit rate, h, the expected time, t, for accessing a data array is given in Eq. 2:

t = n \left( (1-h) \cdot \lceil z/l \rceil \cdot \frac{l}{B_{seq}} + h \cdot \lceil z/l' \rceil \cdot t' \right)    (2)

In this equation, n is the number of data array tuples; z is the size of each tuple (bytes); l and l' are the memory block size and the transfer unit between the cache and the GPU (bytes); B_seq is the estimated sequential bandwidth; and t' is the time for fetching a cache line from the cache to the GPU. The amortized cost for a cache hit is \lceil z/l' \rceil \cdot t'. The data transfer between the device memory and the GPU for a cache miss is \lceil z/l \rceil \cdot l / B_{seq}; since B_seq measures the data transfer between the device memory and the GPU as well as the accesses to the GPU cache, we estimate the total cost for a cache miss to be \lceil z/l \rceil \cdot l / B_{seq}.

We define the random bandwidth as follows:

B_{rand} = \frac{nz}{t} = \frac{z}{(1-h) \cdot \lceil z/l \rceil \cdot l / B_{seq} + h \cdot \lceil z/l' \rceil \cdot t'}    (3)

Given a fixed input, we maximize the cache hit rate in order to maximize the random bandwidth. We develop a probabilistic model to estimate the cache hit rate. Specifically, we estimate the cache hit rate in two cases. In one, the access locations are totally random; in the other, the number of partitions in the input data is known. The number of partitions is usually a priori knowledge in various applications such as radix sort and hash partitioning. For instance, given the number of bits used in the radix sort, bits, the number of partitions of the input data is 2^bits. With the knowledge of the number of partitions, we obtain a more accurate estimate of the cache hit rate. When the number of partitions is unknown, we assume the access locations are random and use the first case to estimate the cache hit rate.

Case I: totally random accesses. We first estimate the expected number of distinct memory blocks, E, accessed by n random accesses. Suppose the total number of memory blocks storing the input data array is k. Let E_j be the number of cases accessing exactly j (1 <= j <= n) distinct memory blocks. We estimate E_j as follows:

E_j = C(k, j) \cdot S(n, j) \cdot j!
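The cost model of Eqs. 2 and 3 is easy to evaluate numerically. The sketch below uses hypothetical names and assumes a miss costs ceil(z/l) * l / B_seq and a hit costs ceil(z/l') * t', which is one plausible reading of the equations:

```python
from math import ceil

def access_time(n, z, l, lp, h, b_seq, t_prime):
    # Eq. 2: expected time for n tuple accesses with hit rate h.
    miss_cost = ceil(z / l) * (l / b_seq)   # block fetch(es) from device memory
    hit_cost = ceil(z / lp) * t_prime       # cache-line fetch(es) from the cache
    return n * ((1 - h) * miss_cost + h * hit_cost)

def rand_bandwidth(n, z, l, lp, h, b_seq, t_prime):
    # Eq. 3: B_rand = n*z / t.
    return n * z / access_time(n, z, l, lp, h, b_seq, t_prime)
```

In the limit h = 1 the model reduces to pure cache-hit cost, and in the limit h = 0 to pure device-memory cost, which is a quick consistency check on the two terms.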
(4)

In this equation, C(k, j) is the number of ways of choosing j memory blocks from the k available memory blocks; S(n, j) is the Stirling number of the second kind [9], which equals the number of ways of partitioning the n accesses into j non-empty memory blocks; and j! is the number of permutations of the j memory blocks. Since the total number of cases of accessing the k memory blocks is k^n, we estimate the expected number of distinct memory blocks accessed by these n accesses in Eq. 5:

E = \frac{\sum_{j=1}^{n} E_j \cdot j}{k^n}    (5)

Based on E and the number of cache lines in the GPU cache, N, we estimate the cache miss rate, m, in Eq. 6:

m = \begin{cases} E/n, & E \le N \\ \frac{E + (1 - N/E)(n - E)}{n}, & \text{otherwise} \end{cases}    (6)

When E > N, we model cache misses in two categories: the compulsory misses of loading the E memory blocks, and the capacity misses, (1 - N/E)(n - E). Intuitively, given a fixed input, the larger the cache size, the smaller the cache miss rate.
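Eqs. 4-6 can be evaluated directly with Stirling numbers of the second kind. A small sketch (hypothetical function names; Python 3.8+ for math.comb):

```python
from functools import lru_cache
from math import comb, factorial

@lru_cache(maxsize=None)
def stirling2(n, j):
    # Stirling number of the second kind: partitions of n items into j
    # non-empty sets; recurrence S(n,j) = j*S(n-1,j) + S(n-1,j-1).
    if n == j:
        return 1
    if j == 0 or j > n:
        return 0
    return j * stirling2(n - 1, j) + stirling2(n - 1, j - 1)

def expected_distinct_blocks(n, k):
    # Eqs. 4-5: E = sum_j j * C(k,j) * S(n,j) * j! / k^n.
    return sum(j * comb(k, j) * stirling2(n, j) * factorial(j)
               for j in range(1, min(n, k) + 1)) / k ** n

def miss_rate(n, k, cache_lines):
    # Eq. 6: only compulsory misses if E fits in the cache,
    # else compulsory plus capacity misses.
    e = expected_distinct_blocks(n, k)
    if e <= cache_lines:
        return e / n
    return (e + (1 - cache_lines / e) * (n - e)) / n
```

A useful cross-check: the sum in Eq. 5 agrees with the closed form k * (1 - (1 - 1/k)^n) for the expected number of distinct bins hit by n uniform throws into k bins.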
Finally, we compute the cache hit rate of Case I as h = 1 - m.

Case II: the number of partitions, p, is given. In this case, if the write locations of a thread warp belong to the same partition, they are consecutive. Writes to consecutive locations are sequential, and have good cache locality. Suppose a warp consists of w threads, and each thread accesses one tuple at a time, so the number of tuples accessed by the warp is w. To estimate the number of cache misses, we estimate the expected number of distinct partitions, D, accessed by the warp. Let D_j be the number of cases containing exactly j distinct partitions, 1 <= j <= min(w, p). We estimate D_j to be D_j = C(p, j) \cdot S(w, j) \cdot j!, similarly to the estimation of the number of distinct memory blocks accessed in Case I. Since the total number of cases is p^w, we estimate D in Eq. 7:

D = \frac{\sum_{j=1}^{\min(p, w)} D_j \cdot j}{p^w}    (7)

The average number of tuples belonging to each of these D partitions is w/D, and the number of memory blocks for each partition is \lceil \frac{wz}{Dl} \rceil. Therefore, the cache miss rate within a thread warp is estimated in Eq. 8:

m = \frac{D \cdot \lceil \frac{wz}{Dl} \rceil}{w}    (8)

The cache hit rate within a thread warp is (1 - m). Finally, we compute the cache hit rate of Case II to be the sum of the intra- and inter-warp cache hit rates. The inter-warp cache hit rate is estimated as in Case I, totally random accesses. Note that, even with the knowledge of the number of partitions, the accesses of warps on different multiprocessors are random due to the unknown execution order among warps.

3.4 Improving the memory performance
A basic implementation of the scatter is to sequentially scan L and R_in once, and output all R_in tuples to R_out during the scan. Similarly, the basic implementation of the gather is to scan L once, read the R_in tuples according to L, and write the tuples to R_out sequentially. Given B_seq and B_rand, the total execution times of the scatter and the gather are estimated in Eq. 9 and Eq. 10, respectively.
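The expectation in Eq. 7 is the number of distinct bins hit by w uniform throws into p bins, which has the closed form p * (1 - (1 - 1/p)^w). The sketch below (hypothetical names) uses that closed form in place of the Stirling-number sum, together with Eq. 8 for the intra-warp miss rate:

```python
from math import ceil

def expected_distinct_partitions(w, p):
    # Closed form of the Eq. 7 expectation: w uniform accesses over p
    # partitions touch p * (1 - (1 - 1/p)^w) distinct partitions on average.
    return p * (1.0 - (1.0 - 1.0 / p) ** w)

def intra_warp_miss_rate(w, p, z, l):
    # Eq. 8: the warp touches D partitions; each holds w/D tuples on
    # average, occupying ceil(w*z / (D*l)) memory blocks, each a miss.
    d = expected_distinct_partitions(w, p)
    return d * ceil(w * z / (d * l)) / w
```

With p = 1 all writes of a warp fall into one partition, so a 32-thread warp writing 8-byte tuples into a 256-byte block incurs a single miss, i.e., a miss rate of 1/32, matching the intuition that fewer partitions per warp means better locality.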
Given an array X, we denote the number of tuples in X as |X| and the size of X in bytes as ||X||.

T_{scatter} = \frac{||R_{in}|| + ||L||}{B_{seq}} + \frac{||R_{out}||}{B_{rand}}    (9)

T_{gather} = \frac{||R_{out}|| + ||L||}{B_{seq}} + \frac{||R_{in}||}{B_{rand}}    (10)

This basic implementation is simple. However, if L is random, the scatter and the gather suffer from random accesses, which have low cache locality and result in low bandwidth utilization. Since B_rand is usually much lower than B_seq, we consider a multi-pass optimization scheme to improve the cache locality. We apply the optimization technique to both the scatter and the gather, and illustrate it using the scatter as an example. Suppose we perform the scatter in nchunk passes. In each pass, we output only those R_in tuples that belong to a certain region of R_out. That is, the algorithm first divides R_out into nchunk chunks, and then performs the scatter in nchunk passes. In the i-th pass (1 <= i <= nchunk), it scans L once, and outputs the R_in tuples belonging to the i-th chunk of R_out (i.e., the write locations are between ||R_{out}|| \cdot (i-1)/nchunk and ||R_{out}|| \cdot i/nchunk). Since each chunk is much smaller than R_out, our multi-pass scatter has better cache locality than the single-pass one.

Figure 5. Single-pass vs. multi-pass scatter: the multi-pass scatter has better cache locality than the single-pass one.

Let us illustrate our multi-pass scheme using the example shown in Figure 5. For simplicity, we assume that the GPU cache can hold only one memory block and that a block can hold three tuples. Under this assumption, each write in the single-pass scheme in Figure 5(a) results in a miss. Consequently, the single-pass scheme has low cache locality. In comparison, the multi-pass scheme divides R_out into two chunks, and writes one chunk in each pass, achieving a much higher average hit rate.
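The multi-pass scheme can be sketched as follows (a sequential sketch of the pass structure, not the parallel GPU kernel; names are hypothetical):

```python
def multi_pass_scatter(r_in, loc, nchunk):
    """Pass i writes only tuples whose destination falls in the i-th
    chunk of r_out, so writes within a pass stay in a small region."""
    n = len(r_in)
    r_out = [None] * n
    chunk = -(-n // nchunk)              # ceil(n / nchunk)
    for i in range(nchunk):
        lo, hi = i * chunk, min((i + 1) * chunk, n)
        # one sequential scan of L and R_in per pass
        for j, dest in enumerate(loc):
            if lo <= dest < hi:
                r_out[dest] = r_in[j]
    return r_out
```

The output is identical to a single-pass scatter; only the write order (and hence the locality) changes. The extra nchunk scans of R_in and L are the cost that Eq. 11 trades against the improved random bandwidth when picking nchunk.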
We estimate the total execution time of the multi-pass scatter as scanning R_in and L nchunk times and outputting the data. Given the nchunk value, we estimate the cache hit rate in a similar way to the single-pass scheme. The major difference is that, in the estimation of the random bandwidth, the number of data accesses considered in each pass is n/nchunk, where n is the total number of data accesses in the scatter. When nchunk > 1, n/nchunk is smaller than n, so the multi-pass scheme has a higher cache hit rate than the single-pass scheme. Given the estimated random bandwidth of the multi-pass scheme with nchunk passes, B'_rand, we estimate the execution time of the multi-pass scatter, T'_scatter, in Eq. 11. We choose the nchunk value that minimizes T'_scatter.

T'_{scatter} = \frac{(||R_{in}|| + ||L||) \cdot nchunk}{B_{seq}} + \frac{||R_{out}||}{B'_{rand}}    (11)

3.5 Evaluation
We evaluated our model and optimization techniques with different data sizes and data distributions. We increase the data size to be sufficiently large to show the performance trend. The default tuple size is 8 bytes, and the default size of the input data array is 128MB (i.e., the input data array consists of 16 million tuples). The default data distribution is random. We have implemented and tested our algorithms on a PC with an NVIDIA G80 GPU. The PC runs Windows XP on an Intel Core
6 Quad CPU. We also run our CPU-based algorithms on a Windows Server server machine with two AMD Opteron 8 dual-core processors. The hardware configuration of the PC and the server is shown in Table. The cache configuration of the GPU is obtained based on the GPU memory model proposed by Govindaraju et al. []. The cache latency information of the CPU are obtained using a cache calibrator [], and those of the GPU are obtained from the hardware specification [6]. We estimate the transfer unit between the cache and the GPU to be the memory interface, i.e., 84 bits, since we do not have any details on this value. We compute the theoretical memory bandwidth to be the bus width multiplying the memory clock rate. Thus, the GPU, Intel and AMD have a theoretical bandwidth of 86.4 G/s,.4 G/s and 8. G/sec, respectively. ased on our measurements on uentially scanning a large array, the G8 achieves a peak memory bandwidth of around 69. G/s whereas the Intel and the AMD 5.6 G/s and 5. G/s, respectively. To evaluate our model, we define the accuracy of our model to be vv' (- v ), where v and v are the measured and estimated values, respectively. We run each experiment five times and report the average value. Since we aim at validating our cache model and evaluating the scatter and the gather on the GPU, the results for the GPU-based algorithms in this section do not include the data transfer time between the CPU and the GPU. Table. Hardware configuration GPU(G8) CPU(Intel) CPU(AMD) Processors 5MHz GHz 4 (Quad-core).4GHz (two dual-core) Cache size 9 K L: K 4, L: 8M L: 64K, L: M Cache block 56 L: 64, L: 8 L: 8, L: size (bytes) 8 Cache access L:, L: L:, L: 9 time (cycle) DRAM (M) DRAM latency (cycle) 8 8 us width (bit) Memory clock.8.. (GHz) Validating the performance model. Figure 6 demonstrates the measured and estimated performance of uential scatter and gather operations on the GPU. The gather and the scatter have a very similar performance trend as the data size increases. 
The sequential bandwidth is estimated to be 6.9 GB/s. The figure also indicates that our estimation achieves an accuracy of over 87% for the gather and the scatter (min: 7%, max: 99%). Figure 7 shows the measured and estimated performance with the data size varied when the read/write locations are totally random. Our model has a high accuracy in predicting the performance of the gather and the scatter; the average accuracy on the gather and the scatter is over 9% (min: 6%, max: 98%). Again, the gather and the scatter have very similar performance trends as the data size increases. In the following, we report the results for the scatter only, because the results of the gather are similar to those of the scatter.

Figure 6. The measured and estimated performance of the sequential single-pass scatter (top) and gather (bottom) on the GPU.

Figure 7. The measured and estimated performance of the single-pass scatter (top) and gather (bottom) on the GPU with random locations.

Figure 8 shows the measured and estimated performance with the number of partitions in the input data varied. The data size is
128MB. Given the number of partitions, the partition ID of each tuple is randomly assigned. When the number of partitions is larger than 32 (i.e., the warp size on the G80), the measured time increases dramatically due to the reduced cache reuse within a warp. We also observe this increase in our estimated time. These figures indicate that our estimation is accurate for different data sizes and different numbers of partitions in the input data. The average accuracy in Figure 8 is over 8% (min: 7%, max: 99%).

Figure 8. The measured and estimated single-pass scatter performance on the GPU: (top) the effective bandwidth of the random access; (bottom) the execution time.

Figure 9. The measured and estimated performance of the multi-pass scatter on the GPU with the number of passes varied.

Finally, we validate our performance model on the multi-pass scheme. Figure 9 shows the measured and estimated performance of the multi-pass scatter with the number of passes varied. The suitable value for the number of passes is 6. When the number of passes is smaller than the suitable value, the scatter performance improves with more passes because the cache locality of the scatter is improved. When the number of passes is larger than the suitable value, the overhead of the extra scans in the multi-pass scheme becomes significant. The average accuracy of our model in Figure 9 is 86% (min: 7%, max: 99%), and our estimates for different data sizes and different numbers of partitions are also highly accurate.

Evaluating the multi-pass scheme. We first consider whether we can improve the performance by dividing the input array into multiple segments and applying our multi-pass scheme to each segment. Since a segment can fit into the cache, the cost of the multiple scans is minimized. Figure 10 shows the scatter performance as the number of segments, #bin, increases. The input array is 128MB.
When #bin is 512, the segment size is 256KB, which is smaller than the GPU cache size. As the #bin value increases, the scatter time increases as well, because the cache locality between the scatters on different segments gets worse. We next evaluate our multi-pass scheme with the number of partitions, p, varied. The results are shown in Figure 11. When p <= 8, the suitable number of passes in our multi-pass scheme is one, and the multi-pass scatter reduces to the single-pass one. As p increases beyond 8, our multi-pass scheme outperforms the single-pass scheme. Regardless of the number of partitions in the input relation, our model correctly predicts a suitable value for the number of passes, and our multi-pass optimization technique improves the scatter performance up to three times.

Figure 10. The scatter performance with the number of segments varied.

Figure 11. The performance of the multi-pass scatter on the GPU with the number of partitions, p, varied.

Finally, we compare the performance of the scatter on the GPU and the CPU when the write locations are random. We obtain similar performance results as the number of partitions is varied, and omit those results. Figure 12 shows the scatter performance with the tuple size varied and the number of tuples fixed at 16 million. The maximum tuple size is 16 bytes due to the limited size of the device memory. Figure 13 shows the elapsed time with the number of tuples varied and the tuple size fixed at 8 bytes.
On the CPU, we consider whether our multi-pass scheme and the multithreading technique can improve the performance of the scatter. We find that the multi-pass scheme yields little performance improvement on both Intel and AMD. This is due to the difference between the memory architectures of the GPU and the CPU. We therefore apply the single-pass scheme to the gather and the scatter on the CPU. The multithreading technique improves the performance of the scatter by -X and -4X on Intel and AMD, respectively. On the GPU, the effective bandwidth of the random access increases as the tuple size increases. When the tuple size is fixed, the effective bandwidth is stable as the data size increases. Regardless of the tuple size and the data size, the GPU-based scatter has much better performance than the optimized CPU-based one. The speedup is 7-X and -4X on Intel and AMD, respectively. On both the GPU and the CPUs, the effective bandwidth is much lower than the peak bandwidth. This indicates that the random access pattern has low locality and yields low bus utilization. Nevertheless, our multi-pass scheme improves the effective bandwidth of the GPU-based scatter by 2-4X.
Figure 12. The scatter performance on the CPU and the GPU with the tuple size varied.
Figure 13. The scatter time on the CPU and the GPU with the data size varied.
4. APPLICATION AND ANALYSIS
Scatter and gather are two fundamental data-parallel operations for many parallel algorithms [9]. In this section, we use scatter and gather to implement three memory-intensive algorithms on GPUs, including the radix sort, the hash search, and the sparse-matrix vector multiplication.
Specifically, we use our improved scatter to implement the radix sort, and our improved gather for the hash search and the sparse-matrix vector multiplication. We also compare the end-to-end performance of the three applications on the GPU and the CPU. Our total execution time for the GPU-based algorithms includes the data transfer time between the CPU and the GPU.
4.1 Radix sort
Radix sort is one of the highly parallel sorting algorithms [34]. It has a linear-time computational complexity. In this study, we implement the most significant digit (MSD) radix sort. The algorithm first performs P-phase partitioning on the input data, and then sorts the partitions in parallel. In each phase, we divide the input data into multiple partitions using a group of bits of each key. In the ith (1<=i<=P) phase of the radix sort, we use the ((i-1)*bits)th to (i*bits-1)th bits from the most significant bit. After partitioning, we use the bitonic sort [] to sort each partition. The bitonic sort can exploit the fast shared memory on the G80 [6], where the shared memory of each multiprocessor is 16 KB. Different from the cache, the shared memory is explicitly managed by the programmer. Additionally, it is shared by the threads on the same multiprocessor. When the partition size is smaller than the shared memory size, the bitonic sort works entirely on the data in the shared memory, and causes no further cache misses. We determine the suitable P value so that the resulting size of each partition is smaller than the shared memory size, i.e., P = log_f(n/S), where f is the number of partitions generated in each phase, i.e., f = 2^bits; n is the size of the input array (bytes); and S is the shared memory size of each multiprocessor (bytes).
Figure 14. An example of radix sort on eight tuples: each phase computes the partition ID (PID) and the write location (Loc) of each tuple, and scatters the data to the output accordingly; phase 1 partitions on the first bit from the left, and phase 2 on the second bit from the left.
We implement the partitioning scheme using the scatter. Given an input array of data tuples, dataArray, the partitioning process is performed in three steps.
The algorithm first computes the partition ID for each tuple and stores all partition IDs into an array, pidArray, in the same order as dataArray. Next, it computes the write location for each tuple based on pidArray, and stores the write locations into a third array, locArray. Finally, it applies our multi-pass scatter to output dataArray according to locArray. An example of the radix sort on eight tuples is illustrated in Figure 14. The first step is fast, because it is a sequential scatter and a scan. We consider optimizations on the second and third steps. For the second step, we use histograms to compute the write location for each tuple. Our histograms are similar to those in the
parallel radix sort proposed by Zagha [34]. The main difference is that we fully consider the hardware features of the GPU in designing our histograms. Specifically, in our partitioning algorithm, each thread is responsible for a portion of the input array. Because the histogram for each thread is accessed randomly, we store the histograms in the shared memory to reduce the latency of these random accesses. In our implementation, each histogram element is encoded in one byte. The size of each histogram is 2^bits bytes. Since each thread has one histogram, the bits value must satisfy the following constraint due to the limited size of the shared memory: T * 2^bits <= S, where T is the number of concurrent threads sharing the shared memory and S is the size of the shared memory. For the third step, we determine the suitable number of passes to minimize the execution time of the radix sort. The number of phases, i.e., the number of scatter operations, in the radix sort is P. We estimate the execution time of each scatter as in Case II with 2^bits partitions. The total execution time of these P scatter operations is C = P * T_scatter. Considering the constraint in the second step, we choose the suitable bits value to minimize the C value.
We have implemented the radix sort on the GPU and the CPU. Figure 15 shows the performance impact of our optimization techniques on the GPU and the CPU. Each array element is an 8-byte record, including a four-byte key value and a four-byte record identifier. The key values are randomly distributed. For the CPU implementation, we consider both multithreading and cache-optimization techniques. Suppose the CPU has C_p cores. We first divide the input array into C_p partitions, and sort these partitions in parallel. Each partition is sorted by a thread running an optimized CPU-based radix sort, which is a multi-pass algorithm [].
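The three-step partitioning and the per-thread histograms described above can be sketched as follows. This is a sequential Python model with illustrative names: the "threads" are simulated chunks of the input, each building its own histogram, and the prefix sum over (partition, thread) pairs yields each thread's write offsets:

```python
def partition_by_digit(data, key, bits, shift, num_threads=2):
    """One partitioning phase: (1) compute partition IDs, (2) derive
    write locations from per-thread histograms + a prefix sum,
    (3) scatter the tuples to their write locations."""
    n = len(data)
    f = 1 << bits                                       # 2^bits partitions
    pid = [(key(t) >> shift) & (f - 1) for t in data]   # step 1

    # Step 2: one histogram per (simulated) thread over its chunk.
    chunk = (n + num_threads - 1) // num_threads
    hist = [[0] * f for _ in range(num_threads)]
    for t in range(num_threads):
        for i in range(t * chunk, min((t + 1) * chunk, n)):
            hist[t][pid[i]] += 1

    # Prefix sum over (partition, thread) gives each thread's start
    # offset inside each output partition.
    start = [[0] * f for _ in range(num_threads)]
    total = 0
    for p in range(f):
        for t in range(num_threads):
            start[t][p] = total
            total += hist[t][p]

    # Step 3: scatter each tuple to its write location.
    out = [None] * n
    for t in range(num_threads):
        for i in range(t * chunk, min((t + 1) * chunk, n)):
            p = pid[i]
            out[start[t][p]] = data[i]
            start[t][p] += 1
    return out
```

On the GPU the histograms live in shared memory and the scatter is our multi-pass scatter; here everything runs sequentially so only the structure of the computation carries over.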
It determines the number of bits used in each pass according to the cache parameters of the CPU, e.g., the number of cache lines in the L1 and the L2 caches and the number of entries in the data TLB (Translation Lookaside Buffer). According to the existing cost model [], the number of bits in each pass is set to be six on both Intel and AMD so that the cache stalls are minimized. The number of partitions generated in one pass is 2^6 = 64. Finally, we merge the C_p sorted partitions into one. The performance comparison for the CPU-based radix sort with and without optimizations is shown at the top of Figure 15. Our optimization techniques greatly improve the radix sort by 85% on AMD and slightly improve the radix sort on Intel.
On the GPU, the number of bits used in each phase of the radix sort, bits, is five. That is, in each phase, an array is divided into 2^5 = 32 sub-arrays. Our test data takes three phases of partitioning in total. The suitable number of passes in our optimized scatter is four according to our performance model. As shown in Figure 15, our multi-pass scheme improves the GPU-based radix sort by over %. Note that the data transfer time between the GPU and the CPU is -% of the total execution time of the GPU-based radix sort.
Figure 16 highlights the performance of the GPU-based radix sort, the GPU-based bitonic sort [] and the CPU-based radix sort []. The GPU-based radix sort is twice as fast as the GPU-based bitonic sort. Moreover, our GPU-based algorithm is over X faster than the best optimized CPU-based radix sort. Finally, we perform a back-of-envelope calculation on the system bandwidth. We estimate the data transfer of the GPU-based radix sort through considering the cache misses estimated by our model and the internal data structures in the radix sort. The CPU-based radix sort is estimated using the existing cost model []. The estimated bandwidth of the radix sort is on average .4 GB/s, . GB/s, and .
GB/s on the GPU, Intel, and AMD, respectively.
Figure 15. The performance impact of optimizations on the CPU and the GPU: (top) the CPU-based radix sort with and without optimization on Intel and AMD; (bottom) the GPU-based radix sort with and without optimization, with the number of keys (million) varied.
Figure 16. The execution time of sorting algorithms on the GPU and the CPU: the GPU-based radix sort and bitonic sort, and the optimized CPU-based radix sort on Intel and AMD, with the number of keys (million) varied.
4.2 Hash search
A hash table is commonly used to speed up equality search on a set of records. Applications using hash tables include the hash join and hash partitioning in databases []. A hash table consists of multiple hash headers, each of which maintains a pointer to its corresponding bucket. Each bucket stores the key values of the records that have the same hash value, and the pointers to the records. An example of the hash table is illustrated in Figure 17
(a). Since dynamic allocation and deallocation of the device memory is not supported on current GPUs, we use an array to store the buckets, and each header maintains the start position of its corresponding bucket in the array. The hash search takes as input a number of search keys, performs probes on a hash table with these keys, and outputs the matching records to a result buffer in the device memory. In our implementation, each thread is responsible for one search key. To avoid conflicts when multiple threads write results to the shared result buffer concurrently, we implement the hash search in four steps. First, for each search key, we use the hash function to compute its corresponding bucket ID and determine the (start, end) pair indicating the start and end locations of the bucket. Second, we scan the bucket and determine the number of matching results. Third, based on the number of results for each key, we compute a prefix sum on these numbers to determine the start location at which the results of each search key are written, and output the record IDs of the matching tuples to an array, L. Fourth, we perform a gather according to L to fetch the actual result records. Due to the random nature of the hash search, we estimate the execution time of the gather as in Case I. Figure 17(b) illustrates the four steps of searching two keys with the hash function H(x) = x mod 7. Additionally, the hash search optimized with our multi-pass scheme is on average 5% faster than the single-pass scheme. We also varied the number of search keys and the number of records in the hash table, and obtained similar performance results. Through a back-of-envelope calculation, the estimated bandwidth of the hash search is on average . GB/s, .56 GB/s, and .
GB/s on the GPU, Intel, and AMD, respectively.
Figure 17. An example of hash search: (a) a hash table with hash function H(x) = x mod 7, showing the headers, the buckets, and the records; (b) the four steps of searching two keys.
Figure 18 shows the execution time of the hash search on the GPU and the CPU. Note that the data transfer time is 4-6% of the total execution time of the hash search. The values of the search keys and the keys in the hash table are all 32-bit random integers. The record size is 8 bytes. We fixed the number of search keys to be 4 million and varied the total number of records in the hash table. The load factor of the hash table is around 99%. Similar to the CPU-based radix sort, we consider multithreading and cache-optimization techniques for the CPU-based hash search. Suppose the CPU has C_p cores. We first divide the search keys into C_p partitions, and probe the hash table using the search keys of each partition in parallel. In our implementation, we used C_p threads and each thread is in charge of one partition. Each thread runs the cache-efficient hash search using a multi-pass scheme similar to that on the GPU. The difference is that the chunk size is set to be the L2 data cache size. Similar to the CPU-based radix sort, the performance improvement of our optimization on AMD is significant and that on Intel is moderate. Based on our performance model, the suitable number of passes for the GPU-based multi-pass scatter is 6. We observe that the optimized GPU-based hash search is 7.X and .5X faster than its optimized CPU-based counterpart on Intel and AMD, respectively.
Figure 18. Hash search on the GPU and the CPU.
4.3 Sparse-matrix vector multiplication (SpMV)
Sparse-matrix vector multiplication (SpMV) is widely used in many linear solvers [5].
SpMV stores the sparse matrix in a compressed format and performs the multiplication on the compressed matrix. We adopt the compressed sparse row (CSR) format from the Intel MKL library [3] to represent the sparse matrix: for each row, only the non-zero elements are stored. In row i, if the cell at column j is non-zero, we represent this cell as a tuple (j, v), where v is the value of the cell (i, j). This compressed format stores the matrix in four arrays: values, columns, pointerB and pointerE. Array values stores all the values of the non-zero cells in the order of their row and column indexes. Array columns stores all the column indexes of the non-zero cells in the row order, i.e., columns[i] stores the column index of the cell corresponding to values[i]. Arrays pointerB and pointerE store the start and end positions of each row in the values array, respectively. That is, the positions of the non-zero cells of the ith row are between pointerB[i] and (pointerE[i]-1). Given a sparse matrix M and a dense vector V, we store M in the compressed format and compute the multiplication W = M*V. Our algorithm computes W in two steps. First, we scan values and
columns once, and compute the partial multiplication results of values and V into an array, R, where R[i] = values[i] * V[columns[i]]. This multiplication is done through a gather that fetches the values from the vector V at the read locations given in the array columns. Second, based on R, we sum up the partial results into W: W[i] = sum of R[j] for pointerB[i] <= j <= pointerE[i]-1. When the matrix is large, SpMV is both data- and computation-intensive. We use our multi-pass scheme to reduce the random accesses to the vector. Applying our performance model to SpMV, we estimate the execution time of the gather as in Case I.
Figure 19. Conjugate gradient benchmark on the GPU and the CPU (classes A and B; GFLOPS of the GPU with and without optimization, and of the optimized Intel and AMD implementations).
Figure 20. Conjugate gradient benchmark with real-world matrices (Stanford_Berkeley, pkustk4, pkustk8, nd4k) on the GPU and the CPU.
Figure 19 plots the performance in GFLOPS (Giga Floating Point Operations per Second) of SpMV on the GPU and the CPU. We applied our SpMV algorithm to the conjugate gradient benchmark [5], where SpMV is a core component of the sparse linear solver. The benchmark solves a linear system through calling the conjugate gradient (CG) method niter times. Each execution of CG has 25 iterations. Each iteration consists of a SpMV on the matrix and a dense vector, in addition to some vector manipulation. In our experiments, the total execution time of SpMV is over 90% of the total execution time of the CG benchmark on both the GPU and the CPU. There are two classes in the benchmark, classes A and B. The vector sizes are 14 and 75 thousand elements (i.e., 56 KB and 300 KB, respectively) for classes A and B. The niter value is 15 and 75 for classes A and B, respectively. The CPU-based SpMV uses the Intel MKL routine mkl_dcsrmv [3].
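The two-step SpMV above (a gather into the partial-result array R, then a per-row reduction) can be sketched as follows. This is a sequential Python model; the GPU version performs step 1 as a parallel, multi-pass gather, and the array names follow the CSR description above:

```python
def spmv_csr(values, columns, pointerB, pointerE, V):
    """Two-step SpMV on a CSR matrix: (1) gather V at the column
    indexes and form the partial products R; (2) sum R over each
    row's [pointerB[i], pointerE[i]) range to produce W = M*V."""
    # Step 1: a gather from V at read locations given by `columns`.
    R = [values[i] * V[columns[i]] for i in range(len(values))]
    # Step 2: per-row reduction W[i] = R[pointerB[i]] + ... + R[pointerE[i]-1].
    return [sum(R[pointerB[i]:pointerE[i]]) for i in range(len(pointerB))]
```

For the 2x3 matrix [[1, 0, 2], [0, 3, 0]], the CSR arrays are values = [1, 2, 3], columns = [0, 2, 1], pointerB = [0, 2], pointerE = [2, 3]; the random reads of V through `columns` are exactly the gather whose locality our multi-pass scheme improves.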
The MKL routine does not utilize all the cores on the CPU. We further consider multithreading techniques on the CPU. Suppose the CPU has C_p cores. We horizontally divide the matrix into C_p sub-matrices. Next, each thread performs the multiplication on a sub-matrix and the vector using the mkl_dcsrmv routine. The multithreading technique improves the benchmark around X on both AMD and Intel. Since the vector size is smaller than the cache size on the G80, the vector fits into the cache. As a result of this small vector size, the performance of the GPU-based algorithm with optimization is similar to that without. Note that the GPU-based SpMV uses single precision, whereas the MKL routine uses double precision. The single-precision computation may have up to a 2X performance advantage over the double-precision computation. Nevertheless, the GPU-based algorithm is more than six times faster than the CPU-based algorithm. We further evaluated our SpMV algorithms using the CG benchmark with real-world matrices downloaded from the Sparse Matrix Collection [7]. We follow Gahvari's categorization of these matrices [18]: the matrix size is small, medium or large, and the number of non-zero elements per row is low, medium or high. Figure 20 shows the performance comparison for four large real-world matrices, namely pkustk8, pkustk4, Stanford_Berkeley and nd4k. These matrices have a similar size, and their non-zero elements per row increase from , 98, and 45 to 99. The benchmark configuration is the same as class B. The optimized SpMV with our multi-pass scheme is moderately faster than the one with the single-pass scheme. Moreover, the GPU-based algorithm is over six times faster than the CPU-based algorithm. We also evaluated our algorithms with matrices of different sizes and obtained similar performance results.
5. CONCLUSION AND FUTURE WORK
We presented a probabilistic model to estimate the memory performance of scatter and gather operations.
Moreover, we proposed optimization techniques to improve the bandwidth utilization of these two operations by improving the memory locality of data accesses. We have applied our techniques to three scientific applications and have compared their performance against prior GPU-based algorithms and optimized CPU-based algorithms. Our results show a significant performance improvement on the NVIDIA 8800 GPU. There are several avenues for future work. We are interested in applying our scatter and gather operations to other scientific applications. We are also interested in developing efficient scatter and gather operations on other architectures such as the IBM Cell processor and AMD Fusion. Finally, it is an interesting future direction to extend our algorithms to multiple hosts, e.g., clusters or MPPs without shared memory.
6. ACKNOWLEDGMENTS
The authors thank the shepherd, Leonid Oliker, and the anonymous reviewers for their insightful suggestions, all of which improved this work. Funding for this work was provided in part by grants DAG5/6.EG and 676 from the Hong Kong Research Grants Council.
7. REFERENCES
[1] CWI Calibrator.
[2] HPF.
[3] Intel Math Kernel Library 9.1.
[4] MPI.
[5] NAS parallel benchmarks.
[6] NVIDIA CUDA (Compute Unified Device Architecture).
[7] Sparse matrix collection.
[8] ZPL.
[9] M. Abramowitz and I. A. Stegun (Eds.). Stirling numbers of the second kind. In Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing. New York: Dover, 1972.
[10] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116-1127, 1988.
[11] D. H. Bailey. Vector computer memory bank contention. IEEE Transactions on Computers, vol. C-36, Mar. 1987.
[12] G. E. Blelloch. Prefix sums and their applications. Technical report, CMU-CS-90-190, Nov. 1990.
[13] P. Boncz, S. Manegold and M. Kersten. Database architecture optimized for the new bottleneck: memory access. In Proc. of the International Conference on Very Large Data Bases (VLDB), pp. 54-65, 1999.
[14] I. Buck. Taking the plunge into GPU computing. In GPU Gems 2, M. Pharr (Ed.). Addison Wesley, Mar. 2005.
[15] J. Bolz, I. Farmer, E. Grinspun and P. Schröder. Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Transactions on Graphics, 2003.
[16] K. Fatahalian, J. Sugerman and P. Hanrahan. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2004.
[17] M. Frigo, C. E. Leiserson, H. Prokop and S. Ramachandran. Cache-oblivious algorithms. In Proc. of the 40th Annual Symposium on Foundations of Computer Science, 1999.
[18] H. Gahvari. Benchmarking sparse matrix-vector multiply. Master's thesis, Computer Science Division, U.C. Berkeley, December 2006.
[19] N. Galoppo, N. Govindaraju, M. Henson and D. Manocha. LU-GPU: efficient algorithms for solving dense linear systems on graphics hardware. In Proc. of the 2005 ACM/IEEE Conference on Supercomputing.
[20] N. Govindaraju, J. Gray, R. Kumar and D. Manocha.
GPUTeraSort: high performance graphics coprocessor sorting for large database management. In Proc. of the 2006 ACM SIGMOD International Conference on Management of Data.
[21] N. Govindaraju, S. Larsen, J. Gray and D. Manocha. A memory model for scientific algorithms on graphics processors. In Proc. of the 2006 ACM/IEEE Conference on Supercomputing.
[22] N. Govindaraju, B. Lloyd, W. Wang, M. Lin and D. Manocha. Fast computation of database operations using graphics processors. In Proc. of the 2004 ACM SIGMOD International Conference on Management of Data.
[23] N. Govindaraju, N. Raghuvanshi and D. Manocha. Fast and approximate stream mining of quantiles and frequencies using graphics processors. In Proc. of the 2005 ACM SIGMOD International Conference on Management of Data.
[24] D. Horn. Lib GPU FFT, 2006.
[25] P. Kipfer, M. Segal and R. Westermann. UberFlow: a GPU-based particle engine. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2004.
[26] A. LaMarca and R. Ladner. The influence of caches on the performance of sorting. In Proc. of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, 1997.
[27] E. Larsen and D. McAllister. Fast matrix multiplies using graphics hardware. In Proc. of the 2001 ACM/IEEE Conference on Supercomputing.
[28] K. Moreland and E. Angel. The FFT on a GPU. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2003.
[29] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn and T. J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, volume 26, 2007.
[30] T. Purcell, C. Donner, M. Cammarano, H. Jensen and P. Hanrahan. Photon mapping on programmable graphics hardware. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2003.
[31] S. Sengupta, M. Harris, Y. Zhang and J. D. Owens. Scan primitives for GPU computing. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2007.
[32] J. Vitter.
External memory algorithms and data structures: dealing with massive data. ACM Computing Surveys, 33(2):209-271, 2001.
[33] S. Williams, J. Shalf, L. Oliker, P. Husbands and K. Yelick. Dense and sparse matrix operations on the Cell processor. Lawrence Berkeley National Laboratory, Paper LBNL-585, May 2005.
[34] M. Zagha and G. E. Blelloch. Radix sort for vector multiprocessors. In Proc. of the 1991 ACM/IEEE Conference on Supercomputing.
More informationOpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
More informationCUDA programming on NVIDIA GPUs
p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
More informationGPGPU Computing. Yong Cao
GPGPU Computing Yong Cao Why Graphics Card? It s powerful! A quiet trend Copyright 2009 by Yong Cao Why Graphics Card? It s powerful! Processor Processing Units FLOPs per Unit Clock Speed Processing Power
More informationMaximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
More informationBig Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
More informationParallel Computing for Data Science
Parallel Computing for Data Science With Examples in R, C++ and CUDA Norman Matloff University of California, Davis USA (g) CRC Press Taylor & Francis Group Boca Raton London New York CRC Press is an imprint
More informationIntroducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
More informationGPGPU for Real-Time Data Analytics: Introduction. Nanyang Technological University, Singapore 2
GPGPU for Real-Time Data Analytics: Introduction Bingsheng He 1, Huynh Phung Huynh 2, Rick Siow Mong Goh 2 1 Nanyang Technological University, Singapore 2 A*STAR Institute of High Performance Computing,
More informationPerformance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute
More informationGPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
More informationGPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy
More information1. Memory technology & Hierarchy
1. Memory technology & Hierarchy RAM types Advances in Computer Architecture Andy D. Pimentel Memory wall Memory wall = divergence between CPU and RAM speed We can increase bandwidth by introducing concurrency
More informationOn Sorting and Load Balancing on GPUs
On Sorting and Load Balancing on GPUs Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology SE-42 Göteborg, Sweden {cederman,tsigas}@chalmers.se Abstract
More informationMulti-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
More informationChapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan
Chapter 2 Basic Structure of Computers Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Outline Functional Units Basic Operational Concepts Bus Structures Software
More informationDesign and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,
More informationOverview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
More informationEnhancing Cloud-based Servers by GPU/CPU Virtualization Management
Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department
More informationLSN 2 Computer Processors
LSN 2 Computer Processors Department of Engineering Technology LSN 2 Computer Processors Microprocessors Design Instruction set Processor organization Processor performance Bandwidth Clock speed LSN 2
More informationThis Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?
This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo
More informationIntro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1
Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion
More informationScalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
More informationNVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist
NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get
More informationACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING
ACCELERATING COMMERCIAL LINEAR DYNAMIC AND Vladimir Belsky Director of Solver Development* Luis Crivelli Director of Solver Development* Matt Dunbar Chief Architect* Mikhail Belyi Development Group Manager*
More information64-Bit versus 32-Bit CPUs in Scientific Computing
64-Bit versus 32-Bit CPUs in Scientific Computing Axel Kohlmeyer Lehrstuhl für Theoretische Chemie Ruhr-Universität Bochum March 2004 1/25 Outline 64-Bit and 32-Bit CPU Examples
More informationChapter 13: Query Processing. Basic Steps in Query Processing
Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing
More informationConfiguring Memory on the HP Business Desktop dx5150
Configuring Memory on the HP Business Desktop dx5150 Abstract... 2 Glossary of Terms... 2 Introduction... 2 Main Memory Configuration... 3 Single-channel vs. Dual-channel... 3 Memory Type and Speed...
More informationBenchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
More informationApplications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61
F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase
More informationIntroduction to GPU Architecture
Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three
More informationBenchmarking Cassandra on Violin
Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract
More informationAccelerating Wavelet-Based Video Coding on Graphics Hardware
Wladimir J. van der Laan, Andrei C. Jalba, and Jos B.T.M. Roerdink. Accelerating Wavelet-Based Video Coding on Graphics Hardware using CUDA. In Proc. 6th International Symposium on Image and Signal Processing
More informationRecord Storage and Primary File Organization
Record Storage and Primary File Organization 1 C H A P T E R 4 Contents Introduction Secondary Storage Devices Buffering of Blocks Placing File Records on Disk Operations on Files Files of Unordered Records
More informationMemory Management Outline. Background Swapping Contiguous Memory Allocation Paging Segmentation Segmented Paging
Memory Management Outline Background Swapping Contiguous Memory Allocation Paging Segmentation Segmented Paging 1 Background Memory is a large array of bytes memory and registers are only storage CPU can
More informationScalability and Classifications
Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static
More informationMeasuring Cache and Memory Latency and CPU to Memory Bandwidth
White Paper Joshua Ruggiero Computer Systems Engineer Intel Corporation Measuring Cache and Memory Latency and CPU to Memory Bandwidth For use with Intel Architecture December 2008 1 321074 Executive Summary
More informationBindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27
Logistics Week 1: Wednesday, Jan 27 Because of overcrowding, we will be changing to a new room on Monday (Snee 1120). Accounts on the class cluster (crocus.csuglab.cornell.edu) will be available next week.
More informationQCD as a Video Game?
QCD as a Video Game? Sándor D. Katz Eötvös University Budapest in collaboration with Győző Egri, Zoltán Fodor, Christian Hoelbling Dániel Nógrádi, Kálmán Szabó Outline 1. Introduction 2. GPU architecture
More informationHIGH PERFORMANCE CONSULTING COURSE OFFERINGS
Performance 1(6) HIGH PERFORMANCE CONSULTING COURSE OFFERINGS LEARN TO TAKE ADVANTAGE OF POWERFUL GPU BASED ACCELERATOR TECHNOLOGY TODAY 2006 2013 Nvidia GPUs Intel CPUs CONTENTS Acronyms and Terminology...
More informationRadeon HD 2900 and Geometry Generation. Michael Doggett
Radeon HD 2900 and Geometry Generation Michael Doggett September 11, 2007 Overview Introduction to 3D Graphics Radeon 2900 Starting Point Requirements Top level Pipeline Blocks from top to bottom Command
More informationThe Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA
The Evolution of Computer Graphics Tony Tamasi SVP, Content & Technology, NVIDIA Graphics Make great images intricate shapes complex optical effects seamless motion Make them fast invent clever techniques
More informationHow To Write A Mapreduce Framework On A Computer Or A Graphics Processor
Mars: A MapReduce Framework on Graphics Processors Bingsheng He Wenbin Fang Naga K. Govindaraju # Qiong Luo Tuyong Wang * Hong Kong Univ. of Sci. & Tech. {saven, benfaung, luo}@cse.ust.hk # Microsoft Corp.
More informationGPU for Scientific Computing. -Ali Saleh
1 GPU for Scientific Computing -Ali Saleh Contents Introduction What is GPU GPU for Scientific Computing K-Means Clustering K-nearest Neighbours When to use GPU and when not Commercial Programming GPU
More informationGPU File System Encryption Kartik Kulkarni and Eugene Linkov
GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through
More informationCUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application
More information18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two
age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,
More informationAn examination of the dual-core capability of the new HP xw4300 Workstation
An examination of the dual-core capability of the new HP xw4300 Workstation By employing single- and dual-core Intel Pentium processor technology, users have a choice of processing power options in a compact,
More informationImplementation of Canny Edge Detector of color images on CELL/B.E. Architecture.
Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture. Chirag Gupta,Sumod Mohan K cgupta@clemson.edu, sumodm@clemson.edu Abstract In this project we propose a method to improve
More informationMulticore Parallel Computing with OpenMP
Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large
More informationMapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu
1 MapReduce on GPUs Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 2 MapReduce MAP Shuffle Reduce 3 Hadoop Open-source MapReduce framework from Apache, written in Java Used by Yahoo!, Facebook, Ebay,
More informationIP Video Rendering Basics
CohuHD offers a broad line of High Definition network based cameras, positioning systems and VMS solutions designed for the performance requirements associated with critical infrastructure applications.
More informationOptimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server
Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Technology brief Introduction... 2 GPU-based computing... 2 ProLiant SL390s GPU-enabled architecture... 2 Optimizing
More informationLattice QCD Performance. on Multi core Linux Servers
Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most
More informationGPU Computing - CUDA
GPU Computing - CUDA A short overview of hardware and programing model Pierre Kestener 1 1 CEA Saclay, DSM, Maison de la Simulation Saclay, June 12, 2012 Atelier AO and GPU 1 / 37 Content Historical perspective
More informationIntroduction to GPU Computing
Matthis Hauschild Universität Hamburg Fakultät für Mathematik, Informatik und Naturwissenschaften Technische Aspekte Multimodaler Systeme December 4, 2014 M. Hauschild - 1 Table of Contents 1. Architecture
More informationNVIDIA GeForce GTX 580 GPU Datasheet
NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines
More informationEfficient Parallel Graph Exploration on Multi-Core CPU and GPU
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Pervasive Parallelism Laboratory Stanford University Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun Graph and its Applications Graph Fundamental
More informationPARALLEL PROGRAMMING
PARALLEL PROGRAMMING TECHNIQUES AND APPLICATIONS USING NETWORKED WORKSTATIONS AND PARALLEL COMPUTERS 2nd Edition BARRY WILKINSON University of North Carolina at Charlotte Western Carolina University MICHAEL
More informationThree Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture
White Paper Intel Xeon processor E5 v3 family Intel Xeon Phi coprocessor family Digital Design and Engineering Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture Executive
More informationIntroduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
More informationRetargeting PLAPACK to Clusters with Hardware Accelerators
Retargeting PLAPACK to Clusters with Hardware Accelerators Manuel Fogué 1 Francisco Igual 1 Enrique S. Quintana-Ortí 1 Robert van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores.
More informationParallel Firewalls on General-Purpose Graphics Processing Units
Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering
More information