FAST PARALLEL SORTING ALGORITHMS ON GPUS
Bilal Jan 1,3, Bartolomeo Montrucchio 1, Carlo Ragusa 2, Fiaz Gul Khan 1 and Omar Khan 1
1 Dipartimento di Automatica e Informatica, Politecnico di Torino, Torino, I-10129 Italy
2 Dipartimento di Ingegneria Elettrica, Politecnico di Torino, Torino, I-10129 Italy
3 [email protected]

ABSTRACT

This paper presents a comparative analysis of three widely used parallel sorting algorithms, Odd-Even sort, Rank sort and Bitonic sort, in terms of sorting rate, sorting time and speed-up on CPU and on different GPU architectures. Alongside, we have implemented a novel parallel algorithm, the min-max butterfly network, for finding the minimum and maximum in large data sets. All algorithms have been implemented using the data-parallelism model available on multi-core GPUs, through the OpenCL specification, to achieve high performance. Our results show a minimum speed-up of 19x for bitonic sort over the odd-even sorting technique for small queue sizes on the CPU, and a maximum speed-up of 23x for very large queue sizes on the NVIDIA Quadro 6000 GPU architecture. Our implementation of full-butterfly network sorting performs relatively better than all three sorting techniques: bitonic, odd-even and rank sort. For the min-max butterfly network, our findings report high speed-up on the NVIDIA Quadro 6000 GPU for data set sizes reaching 2^24, with much lower sorting time.

KEYWORDS

Parallel Computing, Parallel Sorting Algorithms, GPUs, Butterfly Network, OpenCL

1. INTRODUCTION

Parallelism at the chip level is the hub of advancements in microprocessor architectures for high-performance computing, as a result of which multi-core CPUs [1] are commonly available in the market. These multi-core processors in personal computers, however, were not sufficient for highly data- and computation-intensive tasks.
As a result of collective efforts by industry and academia, modular and specialized hardware in the form of sound cards or graphics accelerators is increasingly present in most personal computers. These cards provide much higher performance than legacy on-board units. Graphics cards or graphics processing units (GPUs), introduced primarily for high-end gaming requiring high resolution, are now intensively used as co-processors to the CPU for general-purpose computing [2, 3]. The GPU itself is a multi-core processor supporting thousands of threads [4] running concurrently. GPUs consist of dozens of streaming processors with hundreds of cores, aligned in a particular way to form a single hardware unit. Thread management at this hardware level requires context-switching time close to null, otherwise performance is penalized. Apart from high-end games, general-purpose CPU-bound applications with significant data independence are well suited to such devices. Hence data-parallel codes are performed efficiently, since the hardware can be classified as SIMT (single-instruction, multiple-threads). Performance evaluation in GFLOPS (giga floating-point operations per second) shows that GPUs outperform their CPU counterparts; for example, a high-end Core i7 processor (3.46 GHz) delivers a peak of only a few tens of GFLOPS 1. Table 1 reports some architectural details of the GPUs versus the Intel Core2 system that we have used for our implementation of the sorting algorithms. The devices include a high-end graphics card, the Quadro 6000, comprising 14 streaming multiprocessors with 32 cores each, and

1 Intel Core i7 Specification.
also low-end graphics cards, namely the GeForce GT 320M with 3 processors of 8 cores each. One such architecture, the GTX 260 with 27 processors of 8 cores each, is depicted in Fig. 1. For the low end we have used the GT 320M, as it is more ideally suited for laptops, where the fewer cores give a good balance with battery power. The more powerful Quadro 6000 and GTX 260 are well suited for desktops, with power requirements of 204 W and 182 W respectively.

Architecture Details    Quadro 6000   GTX 260   GT 320M   Core2 Quad Q8400
Total Cores             448           216       24        4
Microprocessors         14            27        3
Clock Rate (MHz)                                          2660
FLOPs
Mem. Bandwidth (GB/s)   144

Table 1: Architecture details of the GPUs and CPU

Figure 1: GTX 260 device architecture

To program a GPU, one needs vendor-provided APIs: NVIDIA's CUDA 2, ATI's FireStream 3, or the OpenCL specification by the Khronos Group 4. One difference between CUDA and OpenCL is that CUDA is specific to GPU devices, whereas OpenCL is heterogeneous and targets all devices conforming to its specification [5], [6]. This may include GPUs and/or CPUs, but to achieve high performance it primarily focuses on GPUs. OpenCL adopts a C style: it is an extension of C99 with some extra keywords and a slightly modified syntax for the threads driving kernels. OpenCL runs two pieces of code. One is the kernel, also called the device program, a specific piece of code running on the device; it is executed concurrently by several threads, consisting of thousands of threads on the target device, and this is where the parallelism takes place. The other, called the host program, runs entirely on the CPU side and launches the kernels, i.e. the SIMT-based programs. Thread management is hardware-based; the programmer only organizes the work domain into several work-items divided into one or more work-groups. The overall problem domain, called the NDRange, can have up to three dimensions. A work-item, or thread, the basic execution unit in the NDRange, is identified by a global and a local addressing scheme for each dimension of the NDRange and its work-groups.
A global address is unique across all threads, whereas two threads of different work-groups can have the same local address. This scheme is outlined in Fig. 2 for a two-dimensional problem.

Figure 2: NDRange addressing scheme

A single-dimensional address can be computed as:

    global_id  = group_id * group_size + local_id

provided that it fulfils the following expressions:

    local_id   = global_id mod group_size
    group_id   = floor(global_id / group_size)
    num_groups = NDRange_size / group_size

Here, group_size and NDRange_size are set in the host program by the programmer. The dimensional limits differ from device to device, up to a maximum of three dimensions (0, 1 and 2). For the Quadro 6000 the maximum sizes for the three dimensions are 1024*1024*64, and 512*512*64 for both the GTX 260 and the GT 320M. This corresponds to approximately 67 and 16 million threads for the Quadro and the GTX respectively. All threads are executed in thread blocks containing 32 threads, referred to as warps; some devices also support execution of half-warps.

Our focus in this paper is to report the performance of sorting algorithms on graphics cards, which is of significant importance to various computer science applications. The choice of sorting technique is vital to the performance of some applications, for instance discrete event simulations, where frequent sorting of events directly affects the performance of the simulation.

2 NVIDIA CUDA GPGPU framework.
3 ATI FireStream GPGPU framework.
4 OpenCL Specification.
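The addressing relations above can be sketched in plain C for a single dimension. This is only an illustrative sketch: the helper names below (global_id, local_id_of, etc.) are hypothetical stand-ins for what OpenCL exposes through get_global_id, get_local_id, get_group_id and get_num_groups.

```c
#include <stddef.h>

/* One-dimensional NDRange addressing, mirroring the formulas above.
   group_size and ndrange_size are chosen by the host program. */
size_t global_id(size_t group_id, size_t group_size, size_t local_id) {
    return group_id * group_size + local_id;   /* global = group*size + local */
}
size_t local_id_of(size_t gid, size_t group_size) {
    return gid % group_size;                   /* local = global mod size */
}
size_t group_id_of(size_t gid, size_t group_size) {
    return gid / group_size;                   /* group = floor(global/size) */
}
size_t num_groups(size_t ndrange_size, size_t group_size) {
    return ndrange_size / group_size;          /* groups per dimension */
}
```

For example, with a work-group size of 64, the work-item with local id 5 in group 2 has global id 133, and the two thread-local functions recover the group and local ids from it.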
The algorithms discussed in the following are the bitonic, odd-even and rank sorting algorithms.

2. PARALLEL SORTING ALGORITHMS

Sorting on the GPU requires transferring data from main memory to the on-board GPU global memory. Although the on-device bandwidth is in the range of 144 GB/s, the PCI bandwidth is only around 2.5 GB/s; synchronization and memory transfers between CPU and GPU therefore affect system performance adversely, and only those sorting techniques that require a minimum amount of synchronization are efficient. Compared to serial sorting algorithms, parallel algorithms are designed to require high data independence between elements in order to achieve better performance. Techniques that involve large data dependencies are categorized as sequential sorting algorithms.

2.1. Odd-Even Sort

The odd-even sort is a parallel sorting algorithm based on the bubble-sort technique. Adjacent pairs of items in an array are exchanged if they are found to be out of order. What makes the technique distinct from bubble-sort is that it works on disjoint pairs, i.e., it alternates between odd-even and even-odd pairs of elements of the array. The technique works in multiple passes over a queue Q of size N. In each pass, elements at odd-numbered positions perform a bubble-sort comparison check, after which elements at even-numbered positions do the same. The maximum number of iterations, or passes, for odd-even sort is N/2. The total running time of this technique is O(log2 N). The algorithm works as follows (Algorithm 1):
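The pass structure described above can be sketched serially in C. This is a minimal sketch, not the paper's OpenCL kernel; on the GPU, every compare-exchange within a pass is independent and would map to one work-item.

```c
#include <stddef.h>

static void swapf(float *a, float *b) { float t = *a; *a = *b; *b = t; }

/* Odd-even (transposition) sort of queue q of length n.
   Passes alternate between even-odd and odd-even pairs; n passes
   are always enough to sort the array. */
void odd_even_sort(float *q, size_t n) {
    for (size_t pass = 0; pass < n; ++pass) {
        size_t start = pass % 2;              /* alternate pair alignment */
        for (size_t i = start; i + 1 < n; i += 2)
            if (q[i] > q[i + 1])              /* compare-exchange one pair */
                swapf(&q[i], &q[i + 1]);
    }
}
```

All pairs within one pass are disjoint, which is exactly the data independence that lets the inner loop be flattened into parallel work-items.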
2.2. Rank Sort

There are two phases to the rank-sort algorithm. In the first phase, for each element in a queue Q of size N, the count of elements smaller than it is maintained in another data structure of the same size N. This is called the ranking phase and is depicted in Algorithm 2. Since each element is compared against the n − 1 other elements, there are n(n − 1) computational steps in total. But since the comparison requires sharing of data and not changing of data, the ranking can be done in O(N) total steps with N processors. This also means the technique is feasible for shared-memory architectures. The second phase involves placing the elements of queue Q according to their ranks, and can be performed in O(log2 n) steps. For optimization, the number of elements is divided based on the number of processors using m = n/p.
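The two phases can be sketched serially in C. This is an illustrative sketch, with one assumption not stated above: ties are broken by index so that equal elements get distinct ranks.

```c
#include <stddef.h>

/* Rank sort: phase 1 computes, for each element, how many elements are
   smaller (ties broken by index to keep ranks unique); phase 2 scatters
   each element to its rank. out must have length n. On the GPU, each
   outer iteration is independent and maps to one work-item. */
void rank_sort(const float *q, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        size_t rank = 0;
        for (size_t j = 0; j < n; ++j)         /* phase 1: ranking */
            if (q[j] < q[i] || (q[j] == q[i] && j < i))
                ++rank;
        out[rank] = q[i];                      /* phase 2: placement */
    }
}
```

Note that phase 1 only reads the input, which is why the comparisons parallelize without conflicts on shared-memory hardware.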
2.3. Bitonic Sort

The property of bitonic sort that its sequence of comparisons is data-independent makes it one of the fastest and most suitable parallel sorting algorithms. To sort an arbitrary sequence, bitonic sort takes two steps. In the first step it turns the arbitrary sequence into a bitonic sequence: a sequence which either monotonically increases or decreases, reaches a single maximum or minimum, and after that maximum or minimum again monotonically decreases or increases. For example, a sequence that increases from 3 up to 9 and then decreases is bitonic, and any cyclic shift of it is bitonic as well. In the second step the bitonic sequence is sorted as follows. Let N be a bitonic sequence of length n = 2^k; sorting the entire length of n elements requires k steps. In the first step, N(0) is compared with N(n/2), N(1) with N(n/2 + 1), and so on up to N(n/2 − 1) with N(n − 1), and elements are exchanged accordingly, yielding the two subsequences N(0) to N(n/2 − 1) and N(n/2) to N(n − 1). In the second step the same procedure is applied to each subsequence, each yielding further subsequences, and after the k-th step there are 2^k subsequences of length 1, so all elements of the bitonic sequence are sorted. This sorting step consists of O(n log n) comparators, as n/2 compare/exchange operations are performed in every step and the total number of steps is k = log n; in parallel processing it takes n processors to sort with O(log^2 n) complexity. The total number of steps required by bitonic sort across both phases, creating the bitonic sequence and sorting it, is k(k + 1)/2. For example, an arbitrary sequence of 2^4 elements would take 10 steps.

2.4. Min-max Butterfly Network

The butterfly network is a special form of the hypercube. A k-dimensional butterfly has (k + 1)2^k vertices and k*2^(k+1) edges [7].
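Stepping back to the bitonic sort of Section 2.3, the two steps can be sketched as a recursive serial C routine for n = 2^k elements. This is an illustrative sketch, not the paper's OpenCL kernel; the point is that every compare-exchange within a stage is independent, so each could be one work-item.

```c
#include <stddef.h>

/* dir = 1 sorts ascending, dir = 0 descending. */
static void cmp_exchange(float *q, size_t i, size_t j, int dir) {
    if ((q[i] > q[j]) == dir) { float t = q[i]; q[i] = q[j]; q[j] = t; }
}

/* Step 2: sort a bitonic subsequence q[lo .. lo+n-1]. */
static void bitonic_merge(float *q, size_t lo, size_t n, int dir) {
    if (n > 1) {
        size_t m = n / 2;
        for (size_t i = lo; i < lo + m; ++i)   /* n/2 independent comparators */
            cmp_exchange(q, i, i + m, dir);
        bitonic_merge(q, lo, m, dir);
        bitonic_merge(q, lo + m, m, dir);
    }
}

/* Step 1 + 2: sort an arbitrary sequence of n = 2^k elements. */
void bitonic_sort(float *q, size_t lo, size_t n, int dir) {
    if (n > 1) {
        size_t m = n / 2;
        bitonic_sort(q, lo, m, 1);       /* ascending half            */
        bitonic_sort(q, lo + m, m, 0);   /* descending half: bitonic  */
        bitonic_merge(q, lo, n, dir);    /* merge the bitonic result  */
    }
}
```

Unrolling the recursion stage by stage recovers exactly the k(k + 1)/2 steps counted above.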
In this case vertices represent input data, whereas edges represent possible data movements. We consider a 2x2 butterfly network acting as a comparator, placing the minimum and maximum numbers at its upper and lower leaves respectively. The min-max butterfly of N numbers has log2 N stages, giving a total complexity of (N/2) log2 N butterflies in terms of comparators or cross points, with N/2 as the number of 2x2 butterflies in each stage. The min-max butterfly network is of significant importance to several network applications and dynamic systems involving minimum and maximum values or quantities. The butterfly network, in general, has its roots in many diverse areas: DSP (FFT calculation), switching fabrics (Benes networks), network flows, Hamiltonian cycle construction, min-max fairness problems, etc. This paper exploits the butterfly network in a parallel way to find the minimum and maximum values of the input data. Fig. 3 shows a schematic of a min-max butterfly of 8 random numbers x(0), x(1), ..., x(7). The input to stage 1 is obtained from a random variate generator. At each stage a single butterfly compares two numbers and places them at its upper and lower leaves accordingly. At the last stage, i.e. stage log2 N, the minimum and maximum values are output at the upper leaf of the first butterfly and the lower leaf of the last butterfly respectively; in this case x(0) and x(7) are the resulting min-max values. The stages of the min-max butterfly structure cannot be parallelized, as the output of any stage s_i is the input to the subsequent stage s_i+1, and hence the algorithm works stage by stage, sequentially. For efficient resource utilization and high performance, we have instead introduced parallelism inside each stage.
Referring to Fig. 3, it can be seen that at any stage s_i up to log2 N, parallelism can be imposed by executing similar operations concurrently. For example, at stage s_1 each 2x2 butterfly can be executed in parallel, constrained only by the maximum number of threads that the underlying hardware architecture can run in parallel. The degree of parallelism remains constant throughout all stages s_1, s_2, ..., log2 N.

Figure 3: Min-max 8x8 butterfly

The min-max butterfly algorithm works as follows:
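The stage-by-stage procedure can be sketched serially in C. One detail is an assumption of this sketch rather than taken from the paper: comparators are paired a halving stride apart (as in an FFT butterfly), which places the minimum at x[0] and the maximum at x[n-1] after the last stage. Comparators within a stage are independent, so on the GPU each would be one work-item.

```c
#include <stddef.h>

/* Min-max butterfly over n = 2^k values: log2(n) stages of n/2
   2x2 comparators each (min to the upper leaf, max to the lower leaf).
   After the final stage, x[0] holds the minimum and x[n-1] the maximum;
   the rest of the array is NOT fully sorted. */
void min_max_butterfly(float *x, size_t n) {
    for (size_t stride = n / 2; stride >= 1; stride /= 2) {   /* stages */
        for (size_t base = 0; base < n; base += 2 * stride)   /* blocks */
            for (size_t j = base; j < base + stride; ++j)     /* comparators */
                if (x[j] > x[j + stride]) {  /* min up, max down */
                    float t = x[j];
                    x[j] = x[j + stride];
                    x[j + stride] = t;
                }
    }
}
```

Each stage halves the region that can still contain the minimum (and, symmetrically, the maximum), so log2 N stages suffice.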
3. RELATED WORK

Sorting has been one of the most widely studied algorithmic topics for the last three decades. Owing to space limitations, we discuss in this section only the relevant parallel sorting techniques on GPUs. An overview of parallel sorting algorithms is given in [8]. A quick-sort implementation on the GPU using CUDA is considered in [9], which shows quick-sort to be an efficient alternative to both bitonic and radix sort on GPUs for larger data sequences; bitonic sort is suggested for smaller sequences. The quick-sort algorithm discussed in [9] uses a divide-and-conquer approach, forming left and right sequences depending on whether the current value is greater or smaller than a pivot value; for each recursive call, a new pivot value has to be selected. On the GPU, [9] proposes two steps: (1) creating the sub-sequences, and (2) assigning each sub-sequence to a thread for sorting. The overall complexity of this GPU quick-sort technique is O(n log n), with a worst case of O(n^2). A merge- and radix-sort implementation for GPUs is provided in [10]. Here, radix sort first divides the full sequence n into n/p thread blocks, with p as the total number of available threads. Each sub-sequence is then locally sorted by radix sort in on-chip shared memory, reducing the number of scatters to global memory and maximizing their coherence. Scattered I/O is made efficient by placing a single procedure call to write data to a single data stream coming from multiple buffers; however, this is not supported on all GPU devices, on which all writes are sequential [11]. On recent cards, including the NVIDIA G80 series and the AMD R600 series, this is no longer a problem. Their technique achieves a complexity of O(t) with t threads handling 2^b buckets. The merge sort of [10] follows the same divide-and-conquer technique, where the complete sequence is divided into p equally sized tiles.
Afterwards all tiles are sorted in parallel using odd-even sort with p thread blocks, and then merged together using merge-sort conventions on a tree of depth log p. This technique is well suited to external sorting, where a processor has access only to a small memory address space. However, the degree of parallelism is reduced as the higher levels of the tree are merged, and thus the parallel GPU architecture is not fully utilized. An adaptive bitonic scheme is proposed in [12]; their technique sorts n values using p stream processors, achieving the optimal complexity of O((n log n)/p). Bitonic sort has also been implemented in [13] using the Imagine stream processor. An overview of sorting queues for traffic simulations is covered in [14]; their approach is to study the behavior of relatively large groups of transport agents.

4. PERFORMANCE ANALYSIS

Several different C data structures and built-in routines are usually used when implementing sorting algorithms. In the OpenCL framework this is not the case: only a few math functions are supported, and most of the usual structures and routines are absent, so they have to be implemented explicitly by the developer. Moreover, as memory cannot be allocated dynamically in kernels, all memory has to be allocated beforehand.

4.1. Experimental Setup

This section examines the performance of our sorting algorithms. The performance tests are carried out on varying queue sizes, where each queue size is a power of 2. The input data type is float for all algorithms. Random numbers are generated following uniform and/or exponential distributions to populate the input queue. For the uniform distribution, values range from 1 to 2^n. All necessary variable initializations for input/output, the random variate generators, and output from the queues are performed locally on the CPU, whereas the actual sorting is carried out entirely on the GPU side. The GPU devices for running our
simulations are the NVIDIA Quadro 6000, the NVIDIA GeForce GTX 260 and the NVIDIA GeForce GT 320M. The GT 320M, designed for notebooks, consumes less power and has fewer cores, with 1 GB of global memory. The GTX 260, on the other hand, is a high-end graphics card with a large number of cores, 216 in total, and 896 MB of global memory. The NVIDIA Quadro 6000, built on the innovative NVIDIA Fermi architecture, has 14 microprocessors of 32 cores each, resulting in 448 cores in total, arranged as an array of streaming multiprocessors. For comparison with the CPU we have implemented the same algorithms to run sequentially on the CPU, using an Intel Core2 Quad Q8400 at 2.66 GHz with 4 GB of RAM.

4.2. Results and Discussion

4.2.1. Sorting Time

Sorting time is recorded as the actual duration of sorting the queue, in seconds, and does not take into account memory copies or other contention times. Fig. 4 reports the sorting times of bitonic, odd-even and rank sort on the different GPU devices and the CPU. Data independence makes bitonic and odd-even sort suitable for parallel systems: Figs. 4.a and 4.b illustrate how much faster bitonic and odd-even sort run on the GPU devices, obtaining considerable speed-up over their respective serial implementations on the CPU. On the other hand, as shown in Fig. 4.c, rank sort performs considerably better on the CPU than on the GPU devices because of the data dependency during sorting. On the Quadro 6000, bitonic sort recorded a minimum sorting time of 1.97 seconds for the very largest queue size and 0.12 seconds for a smaller queue; for odd-even sort the time is 0.23 seconds, and for rank sort it is 28.8 seconds for queue size 2^15. On the GTX 260, rank sort recorded a maximum sorting time of 47.2 s for queue size 2^15, whereas an equivalently sized queue took 0.3 seconds with bitonic sort and 0.7 seconds with odd-even sort. On the GT 320M, the sorting time recorded for rank sort at queue size 2^15 is 283 seconds, or 4m 43s, which is huge, as expected.
The time for a complete odd-even sort on the GT 320M was recorded as 2.17 seconds, and 0.16 seconds for bitonic sort. From our results, we see an average speed-up of 2.73 for odd-even sort on the GTX 260 over the GT 320M, and a larger average speed-up on the Quadro 6000. For bitonic sort, the average speed-up over the GT 320M is 1.11 when the GTX 260 is used, and higher when the Quadro 6000 is used. For rank sort, average speed-ups of 9.25 and 5.79 are achieved on the Quadro 6000 and the GTX 260 respectively. This shows that both odd-even and rank sort achieve considerable speed-up as the number of on-device cores increases. The sorting time for min-max butterfly and for full-butterfly network sorting is, in both cases, lower than the sorting times of all three of bitonic, odd-even and rank sort. Performance is improved because of the parallel nature of the algorithm and better code optimization. Interestingly, for completely descending-ordered data, the min-max butterfly, besides its sole purpose of finding the minimum and maximum of the data, gives completely sorted data in less sorting time than the others. Figs. 7.a and 8.b show our results for min-max butterfly for large queue sizes.

4.2.2. Sorting Rate

Fig. 5 shows the sorting rate of bitonic, odd-even and rank sort, determined as the ratio of queue length to sorting time. For a smaller queue size of 2^12, rank sort has a rounded rate of 8 elements on the GT 320M, 46 elements on the GTX 260 and 9 elements on the Quadro 6000. In contrast, odd-even sort shows a rounded rate of 31,000 elements on the GT 320M, 69,000 on the GTX 260 and 0.2 million elements on the Quadro 6000. Bitonic sort shows a rounded rate of 1.4 million elements on the GT 320M, 2.9 million on the GTX 260 and 9.5 million elements on the Quadro 6000. For the serial implementation on the Intel Q8400 CPU, the sorting rates are 0.9 million, 49,000 and 36,000 elements for bitonic, odd-even and rank sort respectively, for the same queue size. However, for odd-even and rank sort we can observe that the sorting rate approaches zero as the size of the queue increases. Of these, rank sort
converges more quickly than odd-even sort. This suggests that both odd-even and rank sort do not scale well to large queue sizes. On the other hand, bitonic sort performs well in all cases on the Quadro 6000 and the GTX 260, giving rounded sorting rates of 19 million and 10 million elements respectively even for very large queue sizes. Sorting rates for the min-max butterfly are shown in Fig. 7.b for the different GPU architectures and the Intel system.

4.2.3. Speedup

Fig. 6 shows the speed-ups of the different algorithms and architectures. The speed-up of bitonic sort over odd-even sort on the GT 320M is recorded as 45.4x for queue size 2^12 and 332.7x for the largest queue size; on the GTX 260 the speed-up is 42.4x for queue size 2^12, and on the Quadro 6000 it reaches 45.3x for the largest queue size. The reduced speed-up on the Quadro 6000, even though it has 18x more cores than the GT 320M, suggests that bitonic sort's performance edge over odd-even sort may shrink as the degree of parallelism increases. Figs. 6.a and 6.b show the speed-up achieved on the Quadro 6000 against the other architectures for bitonic and odd-even sort respectively; as can be seen from the figures, the speed-up increases with the queue size. A speed-up comparison of the different GPUs against the Intel CPU for rank sort is highlighted in Fig. 6.c; as shown there, rank sort performs considerably better on the CPU than on the GPU devices because of the data dependency during sorting, as it is not designed specifically for parallel systems. One thing is clear by now: as the number of cores on the GPU increases, the speed-up of the sorting algorithms also increases. Speed-up improvements on the different GPU and CPU architectures are drawn in Fig. 7.c and Fig. 8 for min-max butterfly and full-butterfly sorting respectively. Full-butterfly gives a complete sorting of large random data and performs well relative to the other sorting algorithms discussed here.
Due to content and space limitations, we leave the algorithm and implementation details of the full-butterfly network sorting technique to a future paper.

Figure 4: Sorting time of the different algorithms on the different architectures ((a) Bitonic sort; (b) Odd-Even sort; (c) Rank sort)
Figure 5: Sorting rate of the different algorithms on the different architectures ((a) Bitonic sort; (b) Odd-Even sort; (c) Rank sort)

Figure 6: Speed-up among the different architectures for the different algorithms ((a) Quadro 6000 for Bitonic sort; (b) Quadro 6000 for Odd-Even sort; (c) GPUs vs CPU for Rank sort)

Figure 7: Min-max butterfly ((a) sorting time; (b) sorting rate; (c) speed-up)

Figure 8: Performance comparison of full-butterfly sort and the others ((a) speed-up against all; (b) sorting time)
5. CONCLUSION

We have tested the performance of the parallel bitonic, odd-even and rank-sort algorithms on GPUs and compared them with their serial implementations on the CPU. It is shown that performance is affected mainly by two things: the nature of the algorithm and the hardware architecture. Bitonic sort, being easily parallelizable, achieves a maximum speed-up of 23x over the odd-even sorting technique on the Quadro 6000 GPU, whereas rank sort performs well on the CPU due to the data dependency of that algorithm. The performance of our algorithms, min-max butterfly and full-butterfly sort, is higher than the rest. Future work will be dedicated to the design and implementation details of our full-butterfly sort and a feasibility study of parallel sorting algorithms for the hold operation.

REFERENCES

[1] M. Creeger, "Multicore CPUs for the masses," Queue, vol. 3, pp. 64-ff, Sep. 2005.
[2] H. Hacker, C. Trinitis, J. Weidendorfer, and M. Brehm, "Considering GPGPU for HPC centers: Is it worth the effort?," in Facing the Multicore-Challenge (R. Keller, D. Kramer, and J.-P. Weiss, eds.), vol. 6310 of Lecture Notes in Computer Science, Springer, 2010.
[3] J. Nickolls and W. J. Dally, "The GPU computing era," IEEE Micro, vol. 30, no. 2, 2010.
[4] Y. Zhang and J. D. Owens, "A quantitative performance analysis model for GPU architectures," in HPCA, IEEE Computer Society, 2011.
[5] M. Garland, "Parallel computing with CUDA," in IPDPS, p. 1, IEEE, 2010.
[6] V. V. Kindratenko, J. Enos, G. Shi, M. T. Showerman, G. W. Arnold, J. E. Stone, J. C. Phillips, and W.-m. W. Hwu, "GPU clusters for high-performance computing," in CLUSTER, pp. 1-8, IEEE, 2009.
[7] F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann Publishers, Inc., 1992.
[8] S. G. Akl, Parallel Sorting Algorithms. Academic Press, 1985.
[9] D. Cederman and P. Tsigas, "GPU-quicksort: A practical quicksort algorithm for graphics processors," ACM Journal of Experimental Algorithmics, vol. 14, 2009.
[10] N. Satish, M. Harris, and M.
Garland, "Designing efficient sorting algorithms for manycore GPUs," in Proc. 23rd IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2009), Rome, Italy, IEEE Computer Society, May 2009.
[11] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens, "Scan primitives for GPU computing," in Graphics Hardware (M. Segal and T. Aila, eds.), San Diego, California, USA, Eurographics Association, 2007.
[12] A. Gress and G. Zachmann, "GPU-ABiSort: Optimal parallel sorting on stream architectures," in Proc. 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2006), Rhodes Island, Greece, IEEE Computer Society, Apr. 2006.
[13] T. J. Purcell, C. Donner, M. Cammarano, H. W. Jensen, and P. Hanrahan, "Photon mapping on programmable graphics hardware," in Proceedings of Graphics Hardware 2003, pp. 41-50, 2003.
[14] D. Strippgen and K. Nagel, "Using common graphics hardware for multi-agent traffic simulation with CUDA," in SimuTools (O. Dalle, G. A. Wainer, L. F. Perrone, and G. Stea, eds.), ICST, 2009.
Parallel Prefix Sum (Scan) with CUDA. Mark Harris [email protected]
Parallel Prefix Sum (Scan) with CUDA Mark Harris [email protected] April 2007 Document Change History Version Date Responsible Reason for Change February 14, 2007 Mark Harris Initial release April 2007
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:
Computer Graphics Hardware An Overview
Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and
Chapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides
GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism
Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism Jianqiang Dong, Fei Wang and Bo Yuan Intelligent Computing Lab, Division of Informatics Graduate School at Shenzhen,
GPU Parallel Computing Architecture and CUDA Programming Model
GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel
Introduction to GPU Programming Languages
CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure
APPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE
APPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE Tuyou Peng 1, Jun Peng 2 1 Electronics and information Technology Department Jiangmen Polytechnic, Jiangmen, Guangdong, China, [email protected]
PARALLEL PROGRAMMING
PARALLEL PROGRAMMING TECHNIQUES AND APPLICATIONS USING NETWORKED WORKSTATIONS AND PARALLEL COMPUTERS 2nd Edition BARRY WILKINSON University of North Carolina at Charlotte Western Carolina University MICHAEL
Control 2004, University of Bath, UK, September 2004
Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of
Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software
GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas
MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu
1 MapReduce on GPUs Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 2 MapReduce MAP Shuffle Reduce 3 Hadoop Open-source MapReduce framework from Apache, written in Java Used by Yahoo!, Facebook, Ebay,
Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
L20: GPU Architecture and Models
L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.
Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com
CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com Modern GPU
Introduction to GPU Architecture
Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three
GPU File System Encryption Kartik Kulkarni and Eugene Linkov
GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through
Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration
Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration Jinglin Zhang, Jean François Nezan, Jean-Gabriel Cousin, Erwan Raffin To cite this version: Jinglin Zhang,
HPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
High Performance GPGPU Computer for Embedded Systems
High Performance GPGPU Computer for Embedded Systems Author: Dan Mor, Aitech Product Manager September 2015 Contents 1. Introduction... 3 2. Existing Challenges in Modern Embedded Systems... 3 2.1. Not
GPU Hardware and Programming Models. Jeremy Appleyard, September 2015
GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la
Introduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
Optimization of Cluster Web Server Scheduling from Site Access Statistics
Optimization of Cluster Web Server Scheduling from Site Access Statistics Nartpong Ampornaramveth, Surasak Sanguanpong Faculty of Computer Engineering, Kasetsart University, Bangkhen Bangkok, Thailand
GPGPU Parallel Merge Sort Algorithm
GPGPU Parallel Merge Sort Algorithm Jim Kukunas and James Devine May 4, 2009 Abstract The increasingly high data throughput and computational power of today s Graphics Processing Units (GPUs), has led
Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1
Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion
Turbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
Speeding up GPU-based password cracking
Speeding up GPU-based password cracking SHARCS 2012 Martijn Sprengers 1,2 Lejla Batina 2,3 [email protected] KPMG IT Advisory 1 Radboud University Nijmegen 2 K.U. Leuven 3 March 17-18, 2012 Who
Spring 2011 Prof. Hyesoon Kim
Spring 2011 Prof. Hyesoon Kim Today, we will study typical patterns of parallel programming This is just one of the ways. Materials are based on a book by Timothy. Decompose Into tasks Original Problem
Parallel Firewalls on General-Purpose Graphics Processing Units
Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering
ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING
ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected]
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected] Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket
Home Exam 3: Distributed Video Encoding using Dolphin PCI Express Networks. October 20 th 2015
INF5063: Programming heterogeneous multi-core processors because the OS-course is just to easy! Home Exam 3: Distributed Video Encoding using Dolphin PCI Express Networks October 20 th 2015 Håkon Kvale
FPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
Stream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Pervasive Parallelism Laboratory Stanford University Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun Graph and its Applications Graph Fundamental
Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga
Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.
Experiences With Mobile Processors for Energy Efficient HPC
Experiences With Mobile Processors for Energy Efficient HPC Nikola Rajovic, Alejandro Rico, James Vipond, Isaac Gelado, Nikola Puzovic, Alex Ramirez Barcelona Supercomputing Center Universitat Politècnica
Hadoop on a Low-Budget General Purpose HPC Cluster in Academia
Hadoop on a Low-Budget General Purpose HPC Cluster in Academia Paolo Garza, Paolo Margara, Nicolò Nepote, Luigi Grimaudo, and Elio Piccolo Dipartimento di Automatica e Informatica, Politecnico di Torino,
Scalability and Classifications
Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static
In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller
In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency
Binary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
Designing Efficient Sorting Algorithms for Manycore GPUs
1 Designing Efficient Sorting Algorithms for Manycore GPUs Nadathur Satish Dept. of Electrical Engineering and Computer Sciences University of California, Berkeley Berkeley, CA Email: [email protected]
Overview of HPC Resources at Vanderbilt
Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources
ST810 Advanced Computing
ST810 Advanced Computing Lecture 17: Parallel computing part I Eric B. Laber Hua Zhou Department of Statistics North Carolina State University Mar 13, 2013 Outline computing Hardware computing overview
Intrusion Detection Architecture Utilizing Graphics Processors
Acta Informatica Pragensia 1(1), 2012, 50 59, DOI: 10.18267/j.aip.5 Section: Online: aip.vse.cz Peer-reviewed papers Intrusion Detection Architecture Utilizing Graphics Processors Liberios Vokorokos 1,
Applying Parallel and Distributed Computing for Image Reconstruction in 3D Electrical Capacitance Tomography
AUTOMATYKA 2010 Tom 14 Zeszyt 3/2 Pawe³ Kapusta*, Micha³ Majchrowicz*, Robert Banasiak* Applying Parallel and Distributed Computing for Image Reconstruction in 3D Electrical Capacitance Tomography 1. Introduction
Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors
Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors Joe Davis, Sandeep Patel, and Michela Taufer University of Delaware Outline Introduction Introduction to GPU programming Why MD
Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter
Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Daniel Weingaertner Informatics Department Federal University of Paraná - Brazil Hochschule Regensburg 02.05.2011 Daniel
Figure 1: Graphical example of a mergesort 1.
CSE 30321 Computer Architecture I Fall 2011 Lab 02: Procedure Calls in MIPS Assembly Programming and Performance Total Points: 100 points due to its complexity, this lab will weight more heavily in your
Recommended hardware system configurations for ANSYS users
Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range
PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)
PARALLEL JAVASCRIPT Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology) JAVASCRIPT Not connected with Java Scheme and self (dressed in c clothing) Lots of design errors (like automatic semicolon
Introduction to DISC and Hadoop
Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and
Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age
Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age Xuan Shi GRA: Bowei Xue University of Arkansas Spatiotemporal Modeling of Human Dynamics
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
GPU Accelerated Monte Carlo Simulations and Time Series Analysis
GPU Accelerated Monte Carlo Simulations and Time Series Analysis Institute of Physics, Johannes Gutenberg-University of Mainz Center for Polymer Studies, Department of Physics, Boston University Artemis
Dynamic resource management for energy saving in the cloud computing environment
Dynamic resource management for energy saving in the cloud computing environment Liang-Teh Lee, Kang-Yuan Liu, and Hui-Yang Huang Department of Computer Science and Engineering, Tatung University, Taiwan
The Methodology of Application Development for Hybrid Architectures
Computer Technology and Application 4 (2013) 543-547 D DAVID PUBLISHING The Methodology of Application Development for Hybrid Architectures Vladimir Orekhov, Alexander Bogdanov and Vladimir Gaiduchok Department
Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as
E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices
E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,
Experiences on using GPU accelerators for data analysis in ROOT/RooFit
Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,
Speeding Up RSA Encryption Using GPU Parallelization
2014 Fifth International Conference on Intelligent Systems, Modelling and Simulation Speeding Up RSA Encryption Using GPU Parallelization Chu-Hsing Lin, Jung-Chun Liu, and Cheng-Chieh Li Department of
The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
WS on Models, Algorithms and Methodologies for Hierarchical Parallelism in new HPC Systems The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
Multi-GPU Load Balancing for In-situ Visualization
Multi-GPU Load Balancing for In-situ Visualization R. Hagan and Y. Cao Department of Computer Science, Virginia Tech, Blacksburg, VA, USA Abstract Real-time visualization is an important tool for immediately
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING
ACCELERATING COMMERCIAL LINEAR DYNAMIC AND Vladimir Belsky Director of Solver Development* Luis Crivelli Director of Solver Development* Matt Dunbar Chief Architect* Mikhail Belyi Development Group Manager*
Resource Allocation Schemes for Gang Scheduling
Resource Allocation Schemes for Gang Scheduling B. B. Zhou School of Computing and Mathematics Deakin University Geelong, VIC 327, Australia D. Walsh R. P. Brent Department of Computer Science Australian
Real-time Visual Tracker by Stream Processing
Real-time Visual Tracker by Stream Processing Simultaneous and Fast 3D Tracking of Multiple Faces in Video Sequences by Using a Particle Filter Oscar Mateo Lozano & Kuzahiro Otsuka presented by Piotr Rudol
