Designing Efficient Sorting Algorithms for Manycore GPUs
Nadathur Satish
Dept. of Electrical Engineering and Computer Sciences
University of California, Berkeley
Berkeley, CA

Mark Harris
Michael Garland
NVIDIA Corporation
Santa Clara, CA

To appear in Proc. 23rd IEEE International Parallel and Distributed Processing Symposium, May 2009.

Abstract

We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, we carefully design our algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. We exploit the high-speed on-chip shared memory provided by NVIDIA's GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors.

I. INTRODUCTION

Sorting is a computational building block of fundamental importance and is one of the most widely studied algorithmic problems [1], [2]. The importance of sorting has also led to the design of efficient sorting algorithms for a variety of parallel architectures [3]. Many algorithms rely on the availability of efficient sorting routines as a basis for their own efficiency, and many other algorithms can be conveniently phrased in terms of sorting. Database systems make extensive use of sorting operations [4]. The construction of spatial data structures that are essential in computer graphics and geographic information systems is fundamentally a sorting process [5], [6]. Efficient sort routines are also a useful building block in implementing algorithms like sparse matrix multiplication and parallel programming patterns like MapReduce [7], [8]. It is therefore important to provide efficient sorting routines on practically any programming platform, and as computer architectures evolve there is a continuing need to explore efficient sorting techniques on emerging architectures.

One of the dominant trends in microprocessor architecture in recent years has been continually increasing chip-level parallelism. Multicore CPUs providing 2-4 scalar cores, typically augmented with vector units, are now commonplace, and there is every indication that the trend towards increasing parallelism will continue towards manycore chips that provide far higher degrees of parallelism. GPUs have been at the leading edge of this drive towards increased chip-level parallelism for some time and are already fundamentally manycore processors. Current NVIDIA GPUs, for example, contain up to 240 scalar processing elements per chip [9], and in contrast to earlier generations of GPUs, they can be programmed directly in C using CUDA [10], [11]. In this paper, we describe the design of efficient sorting algorithms for such manycore GPUs using CUDA. The programming flexibility provided by CUDA and the current generation of GPUs allows us to consider a much broader range of algorithmic choices than were convenient on past generations of GPUs.
We specifically focus on two classes of sorting algorithms: a radix sort that directly manipulates the binary representation of keys and a merge sort that requires only a comparison function on keys.

The GPU is a massively multithreaded processor which can support, and indeed expects, several thousand concurrent threads. Exposing large amounts of fine-grained parallelism is critical for efficient algorithm design on such architectures. In radix sort, we exploit the inherent fine-grained parallelism of the algorithm by building our algorithm upon efficient parallel scan operations [12]. We expose fine-grained parallelism in merge sort by developing an algorithm for pairwise parallel merging of sorted sequences, adapting schemes based on parallel splitting and binary search previously described in the literature [13], [14], [15]. We demonstrate how to impose a blockwise structure on the sorting algorithms, allowing us to exploit the fast on-chip memory provided by the GPU architecture. We also use this on-chip memory for locally ordering data to improve coherence, thus resulting in substantially better bandwidth utilization for the scatter steps used by radix sort.

Our experimental results demonstrate that our radix sort algorithm is faster than all previously published GPU sorting techniques when running on current-generation NVIDIA GPUs. Our tests further demonstrate that our merge sort algorithm is the fastest comparison-based GPU sort algorithm described in the literature, and is faster in several cases than other GPU-based radix sort implementations. Finally, we demonstrate that our radix sort is highly competitive with multicore CPU implementations, being up to 4.1 times faster than comparable routines on an 8-core 2.33 GHz Intel E5345 system and on average 23% faster than the most optimized published multicore CPU sort [16] running on a 4-core 3.22 GHz Intel Q9550 processor.
II. PARALLEL COMPUTING ON THE GPU

Before discussing the design of our sorting algorithms, we briefly review the salient details of NVIDIA's current GPU architecture and the CUDA parallel programming model. As their name implies, GPUs (Graphics Processing Units) came about as accelerators for graphics applications, predominantly those using the OpenGL and DirectX programming interfaces. Due to the tremendous parallelism inherent in graphics, GPUs have long been massively parallel machines. Although they were originally purely fixed-function devices, GPUs have rapidly evolved into increasingly flexible programmable processors.

Modern NVIDIA GPUs, beginning with the GeForce 8800 GTX, are fully programmable manycore chips built around an array of parallel processors [9], as illustrated in Figure 1. The GPU consists of an array of SM multiprocessors, each of which is capable of supporting up to 1,024 co-resident concurrent threads. NVIDIA's current products range in size from 1 SM at the low end to 30 SMs at the high end. A single SM, shown in Figure 1, contains 8 scalar SP processors, each with 2,048 32-bit registers, for a total of 64 KB of register space per SM. Each SM is also equipped with a 16 KB on-chip memory that has very low access latency and high bandwidth, similar to an L1 cache. All thread management, including creation, scheduling, and barrier synchronization, is performed entirely in hardware by the SM with essentially zero overhead.

To efficiently manage its large thread population, the SM employs a SIMT (Single Instruction, Multiple Thread) architecture [9], [10]. Threads are executed in groups of 32 called warps. The threads of a warp are executed on separate scalar processors which share a single multithreaded instruction unit. The SM transparently manages any divergence in the execution of threads in a warp. This SIMT architecture allows the hardware to achieve substantial efficiencies while executing non-divergent data-parallel codes.

CUDA [10], [11] provides the means for developers to execute parallel programs on the GPU. In the CUDA programming model, an application is organized into a sequential host program that may execute parallel programs, referred to as kernels, on a parallel device. Typically, the host program executes on the CPU and the parallel kernels execute on the GPU, although CUDA kernels may also be compiled for efficient execution on multicore CPUs [17]. A kernel is a SPMD-style (Single Program, Multiple Data) computation, executing a scalar sequential program across a set of parallel threads. The programmer organizes these threads into thread blocks; a kernel thus consists of a grid of one or more blocks. A thread block is a group of concurrent threads that can cooperate amongst themselves through barrier synchronization and a per-block shared memory space private to that block. When invoking a kernel, the programmer specifies both the number of blocks and the number of threads per block to be created when launching the kernel. The thread blocks of a CUDA kernel essentially virtualize the SM multiprocessors of the physical GPU. We can think of each thread block as a virtual multiprocessor where each thread has a fixed register footprint and each block has a fixed allocation of per-block shared memory.
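To make the model concrete, here is a minimal kernel of our own (an illustrative sketch, not code from the paper): each 256-thread block stages a tile of the input in its per-block shared memory, synchronizes at a barrier, and writes the tile back reversed. The kernel name and the toy computation are ours, and the input size is assumed to be a multiple of the block size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block loads one 256-element tile into shared memory, waits at a
// barrier until the tile is complete, then writes the tile out reversed.
__global__ void reverseTiles(const int *in, int *out)
{
    __shared__ int tile[256];                    // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                   // cooperative load
    __syncthreads();                             // barrier: tile fully loaded
    out[i] = tile[blockDim.x - 1 - threadIdx.x]; // read a partner's element
}

int main()
{
    const int t = 256, n = 1 << 20;              // n assumed a multiple of t
    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemset(d_in, 0, n * sizeof(int));        // demo input: all zeros
    reverseTiles<<<n / t, t>>>(d_in, d_out);     // grid of n/t blocks
    cudaDeviceSynchronize();
    printf("launched %d blocks of %d threads\n", n / t, t);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

The launch configuration mirrors the decomposition used throughout this paper: a grid of roughly n/t blocks of t threads, with cooperation confined to a block's shared memory.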
A. Efficiency Considerations

The SIMT execution of threads is largely transparent to the CUDA programmer. However, much like cache line sizes on traditional CPUs, it can be ignored when designing for correctness, but must be carefully considered when designing for peak performance. To achieve best efficiency, kernels should avoid execution divergence, where threads within a warp follow different execution paths. Divergence between warps, however, introduces no performance penalty.

The on-chip shared memory provided by the SM is an essential ingredient for efficient cooperation and communication amongst threads in a block. It is particularly advantageous when a thread block can load a block of data into on-chip shared memory, process it there, and then write the final result back out to external memory.

The threads of a warp are free to load from and store to any valid address, thus supporting general gather and scatter access to memory. However, when threads of a warp access consecutive words in memory, the hardware is able to coalesce these accesses into aggregate transactions with the memory system, resulting in substantially higher memory throughput. For instance, a warp of 32 threads gathering from widely separated addresses will issue 32 requests to memory, while a warp reading 32 consecutive words will issue 2 requests.

Finally, the GPU relies on multithreading, as opposed to a cache, to hide the latency of transactions with external memory. It is therefore necessary to design algorithms that create enough parallel work to keep the machine fully utilized. For current-generation hardware, a minimum of around 5,000-10,000 threads must be live simultaneously to efficiently utilize the entire chip.
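The coalescing rule above is easy to see in a microbenchmark. The two kernels below are our own illustration (the names and the stride scheme are assumptions, not the paper's code): the first has each warp read 32 consecutive words, the second forces widely separated addresses.

```cuda
// Coalesced: thread k reads word k, so each warp touches 32 consecutive
// words and the hardware merges them into a few wide transactions.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Scattered: consecutive threads read words 'stride' apart, defeating
// coalescing; in the worst case each thread costs its own transaction.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (int)(((long long)i * stride) % n); // widely separated reads
    if (i < n) out[i] = in[j];
}
```

Timing the two kernels over the same array (e.g., with CUDA events) exposes the bandwidth gap that motivates the locally ordered scatters used by our radix sort.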
B. Algorithm Design

When designing algorithms in the CUDA programming model, our central concern is in dividing the required work up into pieces that may be processed by p thread blocks of t threads each. Throughout this paper, we will use a thread block size of t = 256. Using a power-of-2 size makes certain algorithms, such as parallel scan and bitonic sort, easier to implement, and we have empirically determined that 256-thread blocks provide the overall best performance. For operations such as sort that are required to process input arrays of size n, we choose p ≈ n/t, assigning a single or small constant number of input elements to each thread.

We treat the number of thread blocks as the analog of processor count in parallel algorithm analysis, despite the fact that threads within a block will execute on multiple processing cores within the SM multiprocessor. It is at the thread block (or SM) level at which internal communication is cheap and external communication becomes expensive. Therefore, focusing on the decomposition of work between the p thread blocks is the most important factor in assessing performance. It is fairly natural to think of a single thread block as the rough equivalent of a CRCW asynchronous PRAM [18], using explicit barriers to synchronize. Broadly speaking, we find that efficient PRAM algorithms are generally the most efficient algorithms at the block level. Since global synchronization can only be achieved via the barrier implicit between successive kernel calls, the need for global synchronization drives the decomposition of parallel algorithms into separate kernels.

Fig. 1. GeForce GTX 280 GPU with 240 scalar processor cores (SPs), organized in 30 multiprocessors (SMs); each SM pairs 8 SPs with 16 KB of shared memory, backed by off-chip DRAM.

III. RADIX SORT

Radix sort is one of the oldest and best-known sorting algorithms, and on sequential machines it is often amongst the most efficient for sorting small keys. It assumes that the keys are d-digit numbers and sorts on one digit of the keys at a time, from least to most significant. For a fixed key size d, the complexity of sorting n input records will be O(n).

The sorting algorithm used within each of the d passes of radix sort is typically a counting sort or bucket sort [2]. Each radix-2^b digit is a b-bit string within the key. For a given digit of each key, we compute the number of keys whose digits are smaller plus the number of keys with the same digit occurring earlier in the sequence. This will be the output index at which the element should be written, which we will refer to as the rank of the element. Having computed the rank of each element, we complete the pass by scattering the elements into the output array. Since this counting is stable (i.e., it preserves the relative ordering of keys with equal digits), sorting each digit from least to most significant is guaranteed to leave the sequence correctly sorted after all d passes are complete.

Radix sort is fairly easy to parallelize, as the counting sort used for each pass can be reduced to a parallel prefix sum, or scan, operation [12], [19], [20], [21]. Scans are a fundamental data-parallel primitive with many uses, and they can be implemented efficiently on manycore processors like the GPU [22]. Past experience suggests that radix sort is amongst the easiest of parallel sorts to implement [23] and is as efficient as more sophisticated algorithms, such as sample sort, when n/p is small [23], [21].

The simplest scan-based technique is to sort keys 1 bit at a time, referred to by Blelloch [12] as a split operation. This approach has been implemented in CUDA by Harris et al. [24] and is publicly available as part of the CUDA Data Parallel Primitives (CUDPP) library [25]. While conceptually quite simple, this approach is not particularly efficient when the arrays are in external DRAM. For 32-bit keys, it will perform 32 scatter operations that reorder the entire sequence.
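The split primitive itself is compact. Below is our own block-level sketch of a 1-bit split (not the CUDPP code; the function name and the simple double-buffered Hillis-Steele scan are our choices), assuming a 256-thread block with one key per thread:

```cuda
// Stable 1-bit split of a 256-element tile: keys whose given bit is 0
// keep their relative order and are packed before keys whose bit is 1.
// Assumes blockDim.x == 256 and one key per thread; dst may point into
// shared memory so the reordering never touches external DRAM.
__device__ void split1bit(unsigned key, int bit, unsigned *dst)
{
    __shared__ unsigned scan[2][256];       // double buffer for the scan
    unsigned t = threadIdx.x;
    unsigned isZero = 1u - ((key >> bit) & 1u);

    int cur = 0, nxt = 1;
    scan[cur][t] = isZero;                  // flag: "my bit is zero"
    __syncthreads();
    for (unsigned off = 1; off < 256; off *= 2) {  // inclusive prefix sum
        scan[nxt][t] = (t >= off) ? scan[cur][t] + scan[cur][t - off]
                                  : scan[cur][t];
        __syncthreads();
        int tmp = cur; cur = nxt; nxt = tmp;
    }
    unsigned zerosBefore = scan[cur][t] - isZero;  // exclusive count
    unsigned totalZeros  = scan[cur][255];
    // Zeros are packed first in order; ones follow in order.
    unsigned pos = isZero ? zerosBefore
                          : totalZeros + (t - zerosBefore);
    dst[pos] = key;
    __syncthreads();                        // tile fully reordered
}
```

Applying this b times, once per bit of the current digit from least significant to most, leaves a tile sorted by that digit, which is exactly how phase 1 of the pass described below proceeds.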
Transferring data to/from external memory is relatively expensive on modern processors, so we would prefer to avoid this level of data movement if possible. One natural way to reduce the number of scatters is to consider more than b = 1 bits per pass. To do this efficiently, we can have each of the p blocks compute a histogram counting the occurrences of the 2^b possible digits in its assigned tile of data, which can then be combined using scan [19], [21]. Le Grand [26] and He et al. [27] have implemented similar schemes in CUDA. While more efficient, we have found that this scheme also makes inefficient use of external memory bandwidth. The higher radix requires fewer scatters to global memory. However, it still performs scatters where consecutive elements in the sequence may be written to very different locations in memory. This sacrifices the bandwidth improvement available due to coalesced writes, which in practice can be as high as a factor of 10.

To design an efficient radix sort, we begin by dividing the sequence into tiles that will be assigned to p thread blocks. We focus specifically on making efficient use of memory bandwidth by (1) minimizing the number of scatters to global memory and (2) maximizing the coherence of scatters. Data blocking and a digit size b > 1 accomplish the first goal. We accomplish the second goal by using on-chip shared memory to locally sort data blocks by the current radix-2^b digit. This converts scattered writes to external memory into scattered writes to on-chip memory, which is roughly 2 orders of magnitude faster.

We implement each pass of the radix sort in four phases. As global synchronization is required between each phase, they correspond to separate parallel kernel invocations.

1) Each block loads and sorts its tile in on-chip memory using b iterations of 1-bit split.
2) Each block writes its 2^b-entry digit histogram and the sorted data tile to global memory.
3) Perform a prefix sum over the p × 2^b histogram table, stored in column-major order, to compute global digit offsets [19], [21] (see Figure 2).
4) Using the prefix sum results, each block copies its elements to their correct output position.

Fig. 2. Per-block histograms, stored in column-major order for the global offset prefix sum.

As we discussed earlier, our CUDA kernels are executed by blocks of t = 256 threads each. While assigning one element per thread would be a natural design choice, handling a larger number of elements per thread is actually somewhat more efficient. We process 4 elements per thread, or 1024 elements per block. Performing more independent serial work in each thread improves overall parallel work efficiency and provides more opportunities to hide latency. Since each block will process a tile of 1024 elements, we use p = n/1024 blocks in our computations.

The choice of b is determined by two competing factors. We pre-sort each tile in Step 1 to limit the scatter in Step 4 to only 2^b contiguous blocks. Larger values of b will decrease the coherence of this scatter. On the other hand, a small b leads to a large number of passes, each of which performs a scatter in global memory. Given that each block processes O(t) elements, we expect that the number of buckets 2^b should be at most O(√t), since this is the largest size for which we can expect uniform random keys to (roughly) fill all buckets uniformly. We find empirically that choosing b = 4, which in fact happens to produce exactly √t buckets, provides the best balance between these factors and the best overall performance.
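For reference, the rank computation at the heart of each pass is the classical counting sort. The host-side sketch below is our own sequential restatement (not the GPU code): the GPU version distributes the count over p tiles, lays the per-tile histograms out in column-major order, and replaces the serial prefix sum with a parallel scan.

```cuda
#include <vector>

static const int B = 4;                    // bits per digit, as chosen above

// One stable counting-sort pass on the digit selected by 'shift'.
void countingSortPass(const std::vector<unsigned> &in,
                      std::vector<unsigned> &out, int shift)
{
    const int R = 1 << B;                  // 2^b buckets
    int count[R] = {0};
    for (unsigned k : in)                  // histogram the current digit
        ++count[(k >> shift) & (R - 1)];
    int offset[R], sum = 0;                // exclusive prefix sum of counts
    for (int d = 0; d < R; ++d) { offset[d] = sum; sum += count[d]; }
    for (unsigned k : in)                  // stable scatter by rank
        out[offset[(k >> shift) & (R - 1)]++] = k;
}
```

Invoking this 32/B = 8 times with shift = 0, B, 2B, ..., swapping the roles of in and out between passes, yields a fully sorted 32-bit sequence.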
IV. MERGE SORT

Since direct manipulation of keys as in radix sort is not always allowed, it is important to provide efficient comparison-based sorting algorithms as well. From the many available choices, we have chosen to explore divide-and-conquer merge sort. As we will see, this leads to an efficient sort and also provides an efficient merge procedure which is a valuable computational primitive in its own right.

Sorting networks, such as Batcher's bitonic sorting network [28], are among the oldest parallel sorting techniques. Bitonic sort is often the fastest sort for small sequences but its performance suffers for larger inputs [23], [21]. This reflects both its asymptotically inefficient O(n log^2 n) complexity and its relatively high communication cost.

Quicksort can be implemented in parallel using segmented scan primitives [12]. Similar strategies can also be used for other partitioning sorts, such as MSB radix sort, which recursively partitions the sequence based on the high-order bits of each key [27]. However, the relatively high overhead of the segmented scan procedures leads to a sort that is not competitive with other alternatives [23], [22]. Cederman and Tsigas [29] achieve a substantially more efficient quicksort by using explicit partitioning for large sequences coupled with bitonic sort for small sequences.
Another elegant parallelization technique is used by sample sort [30], [31], [32]. It randomly selects a subset of the input, called splitters, which are then sorted by some other efficient procedure. The sorted sequence of splitters can be used to divide all input records into buckets corresponding to the intervals delimited by successive splitters. Each bucket can then be sorted in parallel and the sorted output is simply the concatenation of the sorted buckets. Sample sort has proven in the past to be one of the most efficient parallel sorts on large inputs, particularly when the cost of interprocessor communication is high. However, it appears less effective when n/p, the number of elements per processor, is small [23], [21]. Since we use a fairly small n/p = 256, this makes sample sort less attractive. We are also working in a programming model where each of the p thread blocks is given a statically-allocated fixed footprint in on-chip memory. This makes the randomly varying bucket size produced by sample sort inconvenient.

Merge sort is generally regarded as the preferred technique for external sorting, where the sequence being sorted is stored in a large external memory and the processor only has direct access to a much smaller memory. In some respects, this fits the situation we face. Every thread block running on the GPU has access to shared external DRAM, up to 4 GB in current products, but only at most 16 KB of on-chip memory. The latency of working in on-chip memory is roughly two orders of magnitude lower than that of external DRAM. To keep operations in on-chip memory as much as possible, we must be able to divide up the total work to be done into fairly small tiles of bounded size. Given this, it is also important to recognize that sorting any reasonably large sequence will require moving data in and out of the chip, as the on-chip memory is simply too small to store all data being sorted. Since the cost of interprocessor communication is no higher than the cost of storing data in global memory, the implicit load balancing afforded by merge sort appears more beneficial than sample sort's avoidance of interprocessor communication.
The merge sort procedure consists of three steps:

1) Divide the input into p equal-sized tiles.
2) Sort all p tiles in parallel with p thread blocks.
3) Merge all p sorted tiles.

Since we assume the input sequence is a contiguous array of records, the division performed in Step (1) is trivial. For sorting individual data tiles in Step (2), we use an implementation of Batcher's odd-even merge sort [28]. For sorting t values on chip with a t-thread block, we have found the Batcher sorting networks to be substantially faster than either radix sort or quicksort. We use the odd-even merge sort, rather than the more common bitonic sort, because our experiments show that it is roughly 5-10% faster in practice.

All the real work of merge sort occurs in the merging process of Step (3), which we accomplish with a pairwise merge tree of log p depth. At each level of the merge tree, we merge pairs of corresponding odd and even subsequences. This is obviously an inherently parallel process. However, the number of pairs to be merged decreases geometrically. This coarse-grained parallelism is insufficient to fully utilize massively parallel architectures. Our primary focus, therefore, is on designing a process for pairwise merging that will expose substantial fine-grained parallelism.

A. Parallel Merging

Following prior work on merging pairs of sorted sequences in parallel for the PRAM model [13], [33], [14], [15], we use two key insights in designing an efficient merge algorithm. First, we can use a divide-and-conquer technique to expose higher levels of parallelism at each level of the merge tree. And second, computing the final position of elements in the merged sequence can be done efficiently using parallel binary searches.

Given the sorted sequences A and B, we want to compute the sorted sequence C = merge(A, B). If these sequences are sufficiently small, say of size no greater than t = 256 elements each, we can merge them using a single t-thread block. For an element a_i in A, we need only compute rank(a_i, C), which is the position of element a_i in the merged sequence C. Because both A and B are sorted, we know that rank(a_i, C) = i + rank(a_i, B), where rank(a_i, B) is simply the count of elements b_j in B with b_j < a_i, which we compute efficiently using binary search. Elements of B can obviously be treated in the same way. Therefore, we can efficiently merge these two sequences by having each thread of the block compute the output rank of its corresponding elements in A and B, subsequently writing those elements to the correct position. Since this can be done in on-chip memory, it will be very efficient.

We merge larger arrays by dividing them up into tiles of size at most t that can be merged independently using the blockwise process we have just outlined. To do so in parallel, we begin by constructing two sequences of splitters S_A and S_B by selecting every t-th element of A and B, respectively. By construction, these splitters partition A and B into contiguous tiles of at most t elements each. We now construct a single merged splitter set S = merge(S_A, S_B), which we achieve with a nested invocation of our merge procedure. We use the combined set S to split both A and B into contiguous tiles of at most t = 256 elements. To do so, we must compute the rank of each splitter s in both input sequences. The rank of a splitter s = a_i drawn from A is obviously i, the position from which it was drawn. To compute its rank in B, we first compute its rank in S_B. We can compute this directly as rank(s, S_B) = rank(s, S) - rank(s, S_A), since both terms on the right-hand side are its known positions in the arrays S and S_A. Given the rank of s in S_B, we have now established a t-element window in B in which s would fall, bounded by the splitters in S_B that bracket s. We determine the rank of s in B via binary search within this window.
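The block-level merge admits a direct translation into code. The following device-side sketch is our own illustration of the rank-and-scatter scheme (the function names are ours, and for simplicity it assumes the tiles already sit in shared memory with one element of each input per thread); using "less than" for elements of A and "less than or equal" for elements of B keeps the merge stable and guarantees the output positions form a permutation even when keys are equal:

```cuda
// Number of elements of v[0..n) that are < x (strict lower bound).
__device__ int rankLess(const int *v, int n, int x)
{
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (v[mid] < x) lo = mid + 1; else hi = mid;
    }
    return lo;
}

// Number of elements of v[0..n) that are <= x.
__device__ int rankLessEq(const int *v, int n, int x)
{
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (v[mid] <= x) lo = mid + 1; else hi = mid;
    }
    return lo;
}

// Merge sorted tiles A[0..nA) and B[0..nB) into C, one element per thread.
// Assumes nA, nB <= blockDim.x and that A, B, C are in shared memory.
__device__ void mergeTiles(const int *A, int nA,
                           const int *B, int nB, int *C)
{
    int t = threadIdx.x;
    if (t < nA)                        // rank(a_t, C) = t + rank(a_t, B)
        C[t + rankLess(B, nB, A[t])] = A[t];
    if (t < nB)                        // ties resolve with A's copy first
        C[t + rankLessEq(A, nA, B[t])] = B[t];
    __syncthreads();                   // C complete for all threads
}
```

Each binary search costs O(log t) steps, so a pair of t-element tiles is merged in O(log t) time by t threads, which is the fine-grained parallelism the merge tree needs at its wide levels.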
V. PERFORMANCE ANALYSIS

We now examine the experimental performance of our sorting algorithms. Our performance tests are all based on sorting sequences of key-value pairs where both keys and values are 32-bit words. Although this does not address all sorting problems, it covers many cases that are of primary importance to us, such as sorting points in space and building irregular data structures. We use a uniform random number generator to produce random keys. We report GPU times as execution time only and do not include the cost of transferring input data from the host CPU memory across the PCIe bus to the GPU's on-board memory. Sorting is frequently most important as one building block of a larger-scale computation. In such cases, the data to be sorted is being generated by a kernel on the GPU and the resulting sorted array will be consumed by a kernel on the GPU.

Figure 3 reports the running time for our sorting implementations. The input arrays are randomly generated sequences whose lengths range from 1K elements to 16M elements, only some of which are power-of-2 sizes. One central goal of our algorithm design is to scale across a significant range of physical parallelism. To illustrate this scaling, we measure performance on a range of NVIDIA GeForce GPUs: the GTX 280 (30 SMs), 9800 GTX+ (16 SMs), 8800 Ultra (16 SMs), 8800 GT (14 SMs), and 8600 GTS (4 SMs). Our measurements demonstrate the scaling we expect to see: the progressively more parallel devices achieve progressively faster running times.

The sorting performance trends are more clearly apparent in Figure 4, which shows sorting rates derived from Figure 3. We compute sorting rates by dividing the input size by total running time, and thus measure the number of key-value pairs sorted per second. We expect that radix sort, which does O(n) work, should converge to a roughly constant sorting rate, whereas merge sort should converge to a rate of O(1/log n). For small array sizes, running times can be dominated by fixed overhead and there may not be enough work to fully utilize the available hardware parallelism, but these trends are apparent at larger array sizes. The radix sorting rate exhibits a larger degree of variability, which we believe is due to the load placed on the memory subsystem by radix sort's global scatters.
Fig. 3. Sorting time for varying sequence sizes across a range of GPU chips: (a) radix sort; (b) merge sort.

Fig. 4. Sorting rates (elements sorted per second) derived from the running times shown in Figure 3: (a) radix sort; (b) merge sort.

We next examine the impact of different key distributions on our sort performance. Figure 5 shows sorting rates for different key sizes. Here, we have simply restricted the number of random bits we generate for keys to the low-order 8, 16, 24, or full 32 bits. As expected, this makes a substantial difference for radix sort, where we can limit the number of passes over the data we make. Measured sorting rates for 8-, 16-, and 24-bit keys are, precisely as expected, 4, 2, and 1.33 times faster than 32-bit keys for all but the smallest sequences. For merge sort, the effect is far less dramatic, although having only 256 unique keys (the 8-bit case) does result in 5-10% better performance.

We also adopt the technique of Thearling and Smith [20] for generating key sequences with different entropy levels by computing keys from the bitwise AND of multiple random samples. As they did, we show performance in Figure 6 using 1-5 samples per key, effectively producing 32, 25.95, 17.41, 10.78, and 6.42 unique bits per key, respectively. We also show the performance for 0 unique bits, corresponding to AND-ing infinitely many samples together, thus making all keys zero. We see that decreasing entropy, i.e., fewer unique bits per key, results in progressively higher performance. This agrees with the results reported by Thearling and Smith [20] and Dusseau et al. [21].
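Generating such keys is a one-liner; the sketch below is our own restatement of the Thearling-Smith scheme (the function name is an assumption):

```cuda
#include <cstdint>
#include <random>

// Thearling-Smith-style key: bitwise AND of 'samples' uniform 32-bit draws.
// Each bit is set with probability 2^-samples, so samples = 1..5 gives
// roughly 32, 25.95, 17.41, 10.78, and 6.42 effective bits of entropy;
// AND-ing "infinitely many" samples drives every key to zero.
uint32_t entropyKey(std::mt19937 &rng, int samples)
{
    uint32_t key = 0xFFFFFFFFu;
    for (int i = 0; i < samples; ++i)
        key &= static_cast<uint32_t>(rng());
    return key;
}
```

For example, with samples = 2 each bit is 1 with probability 1/4, and 32 × H(1/4) ≈ 25.95 effective bits, matching the figure above.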
Finally, we also measured the performance of both sorts when the randomly generated sequences were presorted. As expected, this made no measurable difference to the radix sort. Applying merge sort to presorted sequences produced nearly identical measurements to the case of uniform 0 keys shown in Figure 6, resulting in 1.3 times higher performance for sequences of 1M keys or more.

Fig. 5. Sort performance on GeForce GTX 280 with restricted key sizes: (a) radix sort; (b) merge sort.

Fig. 6. Sort performance on GeForce GTX 280 with varying key entropy: (a) radix sort; (b) merge sort.

A. Comparison with GPU-based Methods

Figure 7 shows the relative performance of both our sorting routines, as well as the radix sort published by Le Grand [26] in GPU Gems 3, the radix sort algorithm implemented by Sengupta et al. [22] in CUDPP [25], and the bitonic sort system GPUSort of Govindaraju et al. [34]. All performance measurements reflect the running time of the original implementations provided by the respective authors. We measure performance on the 8800 Ultra, rather than the more recent GTX 280, to provide a more fair comparison, as some of these older codes have not been tuned for the newer hardware.

GPUSort is an example of traditional graphics-based GPGPU programming techniques; all computation is done in pixel shaders via the OpenGL API. Note that GPUSort only handles power-of-2 input sizes on the GPU, performing post-processing on the CPU for arrays of other sizes. Therefore, we only measure GPUSort performance on power-of-2 sized sequences, since only these reflect actual GPU performance. GPU-ABiSort [35], another well-known graphics-based GPU sort routine, does not run correctly on current-generation GPUs. However, it was previously measured to be about 5% faster than GPUSort on a GeForce 7800 system. Therefore, we believe that the GPUSort performance on the GeForce 8800 should be representative of the GPU-ABiSort performance as well. All other sorts shown are implemented in CUDA.

Several trends are apparent in this graph. First of all, the CUDA-based sorts are generally substantially faster than GPUSort. This is in part due to the intrinsic advantages of CUDA.
Fig. 7. Sorting rates for several GPU-based methods on an 8800 Ultra.

Directly programming the GPU via CUDA imposes less overhead than the graphics API and exposes architectural features, such as load/store to memory and the on-chip shared memory, which are not available to graphics programs like GPUSort. Furthermore, the bitonic sort used by GPUSort does O(n log^2 n) work, as opposed to the more work-efficient O(n) radix sort algorithms and our O(n log n) merge sort algorithm.

Our results clearly show that our radix sort code delivers substantially higher performance than all the other sorting algorithms we tested. It is faster across all input sizes and the relative performance gap increases for larger inputs. At the largest input sizes, it can sort at greater than 2 times the rate of all other algorithms and at greater than 4 times the GPUSort rate. The algorithm suggested by Le Grand [26] is competitive at array sizes up to 1M elements, at which point its sorting rate degrades substantially. This shows the importance of the block-level sorting we perform to improve scatter coherence. Our radix sort is roughly twice as fast as the CUDPP radix sort [22].

The results also show that our merge sort is roughly twice as fast as GPUSort, which is the only other comparison-based sort shown. At all but the largest input sizes, it is also faster than the CUDPP radix sort, and is competitive with Le Grand's radix sort for large input sizes. Only our radix sort routine consistently outperforms the merge sort by a substantial margin.

In addition to the sorts shown in Figure 7, we can also draw comparisons with two other CUDA-based partitioning sorts. The numbers reported by He et al. [27] for their radix sort show their sorting performance to be roughly on par with the CUDPP radix sort. Therefore, their method should perform at roughly half the speed of our radix sort and slightly slower than our merge sort. The quicksort method of Cederman and Tsigas [29] sorts key-only sequences, as opposed to the key-value sorts profiled above. We have not implemented a key-only variant of our merge sort and so cannot reliably compare its performance. However, we can compare their reported performance with our key-only radix sort results shown in Figure 8b. For uniform random sequences of 1M-16M elements, their sorting rate on GTX 280 is between 51 and 66 million keys/sec. Our key-only radix sorting rate for these sequence sizes is on average 3 times higher.

B. Comparison with CPU-based Methods

We also compare the performance of our sorting routines with high-speed multicore CPU routines. In practice, applications running on a system containing both GPU and CPU are faced with the choice of sorting data on either processor. Our experiments demonstrate that our sorting routines are highly competitive with comparable CPU routines. All other factors being equal, we conclude that applications faced with the choice of processor to use in sorting should choose the processor in whose memory the data is stored.

We benchmarked the performance of the quicksort routine (tbb::parallel_sort) provided by Intel's Threading Building Blocks (TBB), our own efficient TBB-based radix sort implementation, and a carefully hand-tuned radix sort implementation that uses SSE2 vector instructions and Pthreads.
We ran each using 8 threads on an 8-core 2.33 GHz Intel Core2 Xeon E5345 ("Clovertown") system. Here the CPU cores are distributed between two physical sockets, each containing a multichip module with twin Core2 chips and 4 MB L2 cache per chip, for a total of 8 cores and 16 MB of L2 cache between all cores.

Figure 8a shows the results of these tests. As we can see, the algorithms developed in this paper are highly competitive. Our radix sort produces the fastest running times for all sequences of 8K elements and larger. It is on average 4.4 times faster than tbb::parallel_sort for input sizes larger than 8K elements. For the same size range, it is on average 3 and 3.5 times faster than our TBB radix sort and the hand-tuned SIMD radix sort, respectively. For the two comparison-based sorts, our merge sort and tbb::parallel_sort, our merge sort is faster for all inputs larger than 8K elements by an average factor of 2.1.

Recently, Chhugani et al. [16] published results for a multicore merge sort that is highly tuned for the Intel Core2 architecture and makes extensive use of SSE2 vector instructions. This appears to be the fastest multicore sorting routine available for the Intel Core2 architecture. They report sorting times for 32-bit floating-point sequences (keys only) and benchmark performance using a 4-core Intel Q9550 ("Yorkfield") processor clocked at 3.22 GHz with 32 KB L1 cache per core and a 12 MB L2 cache. Figure 8b plots their results along with a modified version of our radix sort for processing 32-bit key-only floating-point sequences. The combination of aggressive hand tuning and a higher-end core produces noticeably higher performance than we were able to achieve on the Clovertown system. Nevertheless, we see that our CUDA radix sort performance is on average 23% faster than the multicore CPU merge sort, sorting on average 187 million keys per second for the data shown, compared to an average of 152 million keys per second for the quad-core CPU.
Fig. 8. Performance comparison with efficient multicore sort implementations: (a) 8-core Clovertown (key-value pairs); (b) 4-core Yorkfield (32-bit floating-point keys only).

VI. CONCLUSION

We have presented efficient algorithms for both radix sort and merge sort on manycore GPUs. Our experimental results demonstrate that our radix sort technique is the fastest published sorting algorithm for modern GPU processors and is up to 4 times more efficient than techniques that map sorting onto the graphics API. In addition to being the fastest GPU sorting technique, it is also faster than comparable sorting routines on multicore CPUs. Our merge sort provides the additional flexibility of comparison-based sorting while remaining one of the fastest sorting methods in our performance tests.

We achieve this algorithmic efficiency by concentrating as much work as possible in the fast on-chip memory provided by the NVIDIA GPU architecture and by exposing enough fine-grained parallelism to take advantage of the thousands of parallel threads supported by this architecture. We believe that these key design principles also point the way towards efficient design for manycore processors in general. When making the transition from the coarse-grained parallelism of multicore chips to the fine-grained parallelism of manycore chips, the structure of efficient algorithms changes from a largely task-parallel structure to a more data-parallel structure. This is reflected in our use of data-parallel primitives in radix sort and fine-grained merging in merge sort. The exploitation of fast memory spaces, whether implicitly cached or explicitly managed, is also a central theme for efficiency on modern processors. Consequently, we believe that the design techniques that we have explored in the context of GPUs will prove applicable to other manycore processors as well.

Starting from the algorithms that we have described, there are obviously a number of possible directions for future work. We have focused on one particular sorting problem, namely sorting sequences of word-length key-value pairs. Other important variants include sequences with long keys and/or variable-length keys. In such situations, an efficient sorting routine might make somewhat different efficiency trade-offs than ours. It would also be interesting to explore out-of-core variants of our algorithms which could support sequences larger than available RAM; this is a natural generalization, since our algorithms are already inherently designed to work on small subsequences at a time in the GPU's on-chip memory. Finally, there are other sorting algorithms whose efficient parallelization on manycore GPUs we believe should be explored, foremost among them being sample sort.

The CUDA source code for our implementations of the radix sort and merge sort algorithms described in this paper will be publicly available in the NVIDIA CUDA SDK [36] (version 2.2) as well as in the CUDA Data-Parallel Primitives Library [25].
ACKNOWLEDGEMENT

We would like to thank Shubhabrata Sengupta for help in implementing the scan primitives used by our radix sort routines and Brian Budge for providing his SSE-based implementation of radix sort. We also thank Naga Govindaraju and Scott Le Grand for providing copies of their sort implementations.

REFERENCES

[1] D. E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Second ed. Boston, MA: Addison-Wesley, 1998.
[2] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, Second ed. MIT Press, Sep. 2001.
[3] S. G. Akl, Parallel Sorting Algorithms. Orlando: Academic Press, Inc., 1985.
[4] G. Graefe, "Implementing sorting in database systems," ACM Comput. Surv., vol. 38, no. 3, pp. 1-37, 2006.
[5] C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha, "Fast BVH construction on GPUs," in Proc. Eurographics 2009, Mar. 2009.
[6] V. Havran, "Heuristic ray shooting algorithms," Ph.D. Thesis, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague, Nov. 2000.
[7] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107-113, 2008.
[8] B. He, W. Fang, N. K. Govindaraju, Q. Luo, and T. Wang, "Mars: A MapReduce framework on graphics processors," in Proc. 17th Int'l Conference on Parallel Architectures and Compilation Techniques, 2008.
[9] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A unified graphics and computing architecture," IEEE Micro, vol. 28, no. 2, pp. 39-55, Mar/Apr 2008.
[10] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA," Queue, vol. 6, no. 2, pp. 40-53, Mar/Apr 2008.
[11] NVIDIA CUDA Programming Guide, NVIDIA Corporation, Jun. 2008, version 2.0.
[12] G. E. Blelloch, Vector Models for Data-Parallel Computing. Cambridge, MA, USA: MIT Press, 1990.
[13] F. Gavril, "Merging with parallel processors," Commun. ACM, vol. 18, no. 10, 1975.
[14] T. Hagerup and C. Rüb, "Optimal merging and sorting on the EREW PRAM," Information Processing Letters, vol. 33, 1989.
[15] D. Z. Chen, "Efficient parallel binary search on sorted arrays, with applications," IEEE Trans. Parallel and Dist. Systems, vol. 6, no. 4, 1995.
[16] J. Chhugani, W. Macy, A. Baransi, A. D. Nguyen, M. Hagog, S. Kumar, V. W. Lee, Y.-K. Chen, and P. Dubey, "Efficient implementation of sorting on multicore SIMD CPU architecture," in Proc. 34th Int'l Conference on Very Large Data Bases, Aug. 2008.
[17] J. A. Stratton, S. S. Stone, and W. W. Hwu, "MCUDA: An efficient implementation of CUDA kernels for multicore CPUs," in 21st Annual Workshop on Languages and Compilers for Parallel Computing (LCPC 2008), Jul. 2008.
[18] P. B. Gibbons, "A more practical PRAM model," in Proc. 1st ACM Symposium on Parallel Algorithms and Architectures, 1989.
[19] M. Zagha and G. E. Blelloch, "Radix sort for vector multiprocessors," in Proc. ACM/IEEE Conference on Supercomputing, 1991.
[20] K. Thearling and S. Smith, "An improved supercomputer sorting benchmark," in Proc. ACM/IEEE Conference on Supercomputing, 1992.
[21] A. C. Dusseau, D. E. Culler, K. E. Schauser, and R. P. Martin, "Fast parallel sorting under LogP: Experience with the CM-5," IEEE Trans. Parallel Distrib. Syst., vol. 7, no. 8, Aug. 1996.
[22] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens, "Scan primitives for GPU computing," in Graphics Hardware 2007, Aug. 2007.
[23] G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha, "A comparison of sorting algorithms for the Connection Machine CM-2," in Proc. Third ACM Symposium on Parallel Algorithms and Architectures, 1991.
[24] M. Harris, S. Sengupta, and J. D. Owens, "Parallel prefix sum (scan) with CUDA," in GPU Gems 3, H. Nguyen, Ed. Addison-Wesley, Aug. 2007.
[25] CUDPP: CUDA Data-Parallel Primitives Library, http://gpgpu.org/developer/cudpp/, 2009.
[26] S. Le Grand, "Broad-phase collision detection with CUDA," in GPU Gems 3, H. Nguyen, Ed. Addison-Wesley Professional, Jul. 2007, ch. 32.
[27] B. He, N. K. Govindaraju, Q. Luo, and B. Smith, "Efficient gather and scatter operations on graphics processors," in Proc. ACM/IEEE Conference on Supercomputing, 2007.
[28] K. E. Batcher, "Sorting networks and their applications," in Proc. AFIPS Spring Joint Computer Conference, vol. 32, 1968.
[29] D. Cederman and P. Tsigas, "A practical quicksort algorithm for graphics processors," in Proc. 16th Annual European Symposium on Algorithms (ESA 2008), Sep. 2008.
[30] W. D. Frazer and A. C. McKellar, "Samplesort: A sampling approach to minimal storage tree sorting," J. ACM, vol. 17, no. 3, 1970.
[31] J. H. Reif and L. G. Valiant, "A logarithmic time sort for linear size networks," J. ACM, vol. 34, no. 1, 1987.
[32] J. S. Huang and Y. C. Chow, "Parallel sorting and data partitioning by sampling," in Proc. Seventh IEEE Int'l Computer Software and Applications Conference, Nov. 1983.
[33] R. Cole, "Parallel merge sort," SIAM J. Comput., vol. 17, no. 4, 1988.
[34] N. K. Govindaraju, J. Gray, R. Kumar, and D. Manocha, "GPUTeraSort: High performance graphics coprocessor sorting for large database management," in Proc. 2006 ACM SIGMOD Int'l Conference on Management of Data, 2006.
[35] A. Greß and G. Zachmann, "GPU-ABiSort: Optimal parallel sorting on stream architectures," in Proc. 20th International Parallel and Distributed Processing Symposium, Apr. 2006.
[36] NVIDIA CUDA SDK, 2009.
High Performance Matrix Inversion with Several GPUs
High Performance Matrix Inversion on a Multi-core Platform with Several GPUs Pablo Ezzatti 1, Enrique S. Quintana-Ortí 2 and Alfredo Remón 2 1 Centro de Cálculo-Instituto de Computación, Univ. de la República
Introduction to GPU Architecture
Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three
Analysis and Modeling of MapReduce s Performance on Hadoop YARN
Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University [email protected] Dr. Thomas C. Bressoud Dept. of Mathematics and
Parallel Computing for Data Science
Parallel Computing for Data Science With Examples in R, C++ and CUDA Norman Matloff University of California, Davis USA (g) CRC Press Taylor & Francis Group Boca Raton London New York CRC Press is an imprint
The Methodology of Application Development for Hybrid Architectures
Computer Technology and Application 4 (2013) 543-547 D DAVID PUBLISHING The Methodology of Application Development for Hybrid Architectures Vladimir Orekhov, Alexander Bogdanov and Vladimir Gaiduchok Department
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism
Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism Jianqiang Dong, Fei Wang and Bo Yuan Intelligent Computing Lab, Division of Informatics Graduate School at Shenzhen,
GRID SEARCHING Novel way of Searching 2D Array
GRID SEARCHING Novel way of Searching 2D Array Rehan Guha Institute of Engineering & Management Kolkata, India Abstract: Linear/Sequential searching is the basic search algorithm used in data structures.
FPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
Introduction to GPGPU. Tiziano Diamanti [email protected]
[email protected] Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate
Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus
Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus A simple C/C++ language extension construct for data parallel operations Robert Geva [email protected] Introduction Intel
Programming GPUs with CUDA
Programming GPUs with CUDA Max Grossman Department of Computer Science Rice University [email protected] COMP 422 Lecture 23 12 April 2016 Why GPUs? Two major trends GPU performance is pulling away from
EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES
ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada [email protected]
The Design and Implement of Ultra-scale Data Parallel. In-situ Visualization System
The Design and Implement of Ultra-scale Data Parallel In-situ Visualization System Liu Ning [email protected] Gao Guoxian [email protected] Zhang Yingping [email protected] Zhu Dengming [email protected]
Navigating Big Data with High-Throughput, Energy-Efficient Data Partitioning
Application-Specific Architecture Navigating Big Data with High-Throughput, Energy-Efficient Data Partitioning Lisa Wu, R.J. Barker, Martha Kim, and Ken Ross Columbia University Xiaowei Wang Rui Chen Outline
Turbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.
The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA
The Evolution of Computer Graphics Tony Tamasi SVP, Content & Technology, NVIDIA Graphics Make great images intricate shapes complex optical effects seamless motion Make them fast invent clever techniques
GPU File System Encryption Kartik Kulkarni and Eugene Linkov
GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through
Benchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
HPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?
This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo
CHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.
PARALLEL PROGRAMMING
PARALLEL PROGRAMMING TECHNIQUES AND APPLICATIONS USING NETWORKED WORKSTATIONS AND PARALLEL COMPUTERS 2nd Edition BARRY WILKINSON University of North Carolina at Charlotte Western Carolina University MICHAEL
RevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
Overview of HPC Resources at Vanderbilt
Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources
Scheduling Task Parallelism" on Multi-Socket Multicore Systems"
Scheduling Task Parallelism" on Multi-Socket Multicore Systems" Stephen Olivier, UNC Chapel Hill Allan Porterfield, RENCI Kyle Wheeler, Sandia National Labs Jan Prins, UNC Chapel Hill Outline" Introduction
Efficient Gather and Scatter Operations on Graphics Processors
Efficient Gather and Operations on Graphics Processors ingsheng He # Naga K. Govindaraju * Qiong Luo # urton Smith * # Hong Kong Univ. of Science and Technology * Microsoft Corp. {saven, luo}@cse.ust.hk
Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework
Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Amin Safi Faculty of Mathematics, TU dortmund January 22, 2016 Table of Contents Set
FAST PARALLEL SORTING ALGORITHMS ON GPUS
FAST PARALLEL SORTING ALGORITHMS ON GPUS Bilal Jan 1,3, Bartolomeo Montrucchio 1, Carlo Ragusa 2, Fiaz Gul Khan 1 and Omar Khan 1 1 Dipartimento di Automatica e Informatica, Politecnico di Torino, Torino,
In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller
In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency
Spring 2011 Prof. Hyesoon Kim
Spring 2011 Prof. Hyesoon Kim Today, we will study typical patterns of parallel programming This is just one of the ways. Materials are based on a book by Timothy. Decompose Into tasks Original Problem
BSPCloud: A Hybrid Programming Library for Cloud Computing *
BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China [email protected],
Chapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides
Introduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing
CUDA SKILLS Yu-Hang Tang June 23-26, 2015 CSRC, Beijing day1.pdf at /home/ytang/slides Referece solutions coming soon Online CUDA API documentation http://docs.nvidia.com/cuda/index.html Yu-Hang Tang @
CUDA Basics. Murphy Stein New York University
CUDA Basics Murphy Stein New York University Overview Device Architecture CUDA Programming Model Matrix Transpose in CUDA Further Reading What is CUDA? CUDA stands for: Compute Unified Device Architecture
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
NVIDIA Tools For Profiling And Monitoring. David Goodwin
NVIDIA Tools For Profiling And Monitoring David Goodwin Outline CUDA Profiling and Monitoring Libraries Tools Technologies Directions CScADS Summer 2012 Workshop on Performance Tools for Extreme Scale
Course Development of Programming for General-Purpose Multicore Processors
Course Development of Programming for General-Purpose Multicore Processors Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University Richmond, VA 23284 [email protected]
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Pervasive Parallelism Laboratory Stanford University Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun Graph and its Applications Graph Fundamental
