On Sorting and Load Balancing on GPUs
Daniel Cederman and Philippas Tsigas
Distributed Computing and Systems, Chalmers University of Technology, SE-412 96 Göteborg, Sweden

Abstract

In this paper we take a look at GPU-Quicksort, an efficient Quicksort algorithm suitable for the highly parallel multi-core graphics processors. Quicksort had previously been considered an inefficient sorting solution for graphics processors, but GPU-Quicksort often performs better than the fastest known sorting implementations for graphics processors, such as radix and bitonic sort. Quicksort can thus be seen as a viable alternative for sorting large quantities of data on graphics processors.

We also take a look at a comparison of different load balancing schemes. To get maximum performance on the many-core graphics processors it is important to have an even balance of the workload so that all processing units contribute equally to the task at hand. This can be hard to achieve when the cost of a task is not known beforehand and when new sub-tasks are created dynamically during execution. With the recent advent of scatter operations and atomic hardware primitives it is now possible to bring some of the more elaborate dynamic load balancing schemes from the conventional SMP systems domain to the graphics processor domain.

This work was supported by Microsoft Research through its European PhD Scholarship Programme and partially supported by the Swedish Research Council (VR). The results presented in this extended abstract appeared before in the Proceedings of the 16th Annual European Symposium on Algorithms (ESA 2008), Lecture Notes in Computer Science Vol. 5193, Springer-Verlag, 2008 [2] and in the Proceedings of Graphics Hardware 2008 (GH 2008), ACM/Eurographics Association, 2008 [3].

1 Introduction

Multi-core systems are now commonly available on desktop systems and it seems very likely that in the future we will see an increase in the number of cores, as both Intel and AMD target many-core systems. But already now there are cheap and commonly available many-core systems in the form of modern graphics processors. Due to the many embarrassingly parallel problems in 3D-rendering, graphics processors have come quite a bit on the way to massive parallelism, and high-end graphics processors currently boast up to 240 processing cores.

Until recently the only way to take advantage of the GPU was to transform the problem into the graphics domain and use the tools available there. This, however, was a very awkward abstraction layer and made the GPU hard to use. Better tools are now available, among them CUDA, which is NVIDIA's initiative to bring general purpose computation to their graphics processors [5]. It consists of a compiler and a run-time for a C/C++-based language which can be used to create kernels that can be executed on CUDA-enabled graphics processors.

2 System Model

In CUDA you have unrestricted access to the main graphics memory, known as the global memory. There is no cache memory, but the hardware supports coalescing of memory operations, so that several read or write operations on consecutive memory locations can be merged into one big read or write operation, which makes better use of the memory bus and provides far greater performance. Newer graphics processors support most of the common atomic operations, such as CAS (Compare-And-Swap) and FAA (Fetch-And-Add), when accessing the memory, and these can be used to implement efficient parallel data structures.
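To make the primitives concrete, the following minimal CUDA sketch (an illustration only, not code from the paper) shows how FAA and CAS are exposed: atomicAdd atomically fetches a counter and advances it, which can be used to hand out unique slots in an array, and atomicCAS atomically replaces a value only if it still holds an expected value. Note that these operations require a CUDA device that supports atomics.

// Illustrative sketch, not from the paper: FAA (atomicAdd) and CAS (atomicCAS).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void atomicsDemo(int *tail, int *out, int *winner)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    // FAA: each thread atomically fetches the current tail and advances it,
    // giving every thread a unique slot in the output array.
    int slot = atomicAdd(tail, 1);
    out[slot] = id;

    // CAS: only the first thread to see *winner == 0 succeeds in writing its id.
    atomicCAS(winner, 0, id + 1);
}

int main()
{
    int *d_tail, *d_out, *d_winner;
    cudaMalloc(&d_tail, sizeof(int));
    cudaMalloc(&d_out, 256 * sizeof(int));
    cudaMalloc(&d_winner, sizeof(int));
    cudaMemset(d_tail, 0, sizeof(int));
    cudaMemset(d_winner, 0, sizeof(int));

    atomicsDemo<<<2, 128>>>(d_tail, d_out, d_winner);
    cudaDeviceSynchronize();

    int slots = 0, winner = 0;
    cudaMemcpy(&slots, d_tail, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&winner, d_winner, sizeof(int), cudaMemcpyDeviceToHost);
    printf("slots handed out: %d, CAS winner: %d\n", slots, winner);

    cudaFree(d_tail); cudaFree(d_out); cudaFree(d_winner);
    return 0;
}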
The high-range graphics processors currently consist of up to 30 multiprocessors, each of which can perform SIMD (Single Instruction, Multiple Data) instructions on 8 memory positions at a time. Each multiprocessor has 16 kB of a very fast local memory that allows information to be communicated between threads assigned to the same thread block. A thread block is a set of threads that are assigned to run on the same multiprocessor.
All thread blocks have the same number of threads assigned to them, and this number is specified by the programmer. Depending on how many registers and how much local memory the block of threads requires, there could be multiple blocks assigned to a single multiprocessor. All the threads in a scheduled thread block are run from start to finish before the block can be swapped out, so if more blocks are needed than there is room for on any of the multiprocessors, the leftover blocks will be run sequentially.

The GPU schedules threads depending on which warp they are in. Threads with id 0..31 are assigned to the first warp, threads with id 32..63 to the next, and so on. When a warp is scheduled for execution, the threads which perform the same instructions are executed concurrently (limited by the size of the multiprocessor) whereas threads that deviate are executed sequentially.

3 Overview

Given a sufficiently parallel algorithm, it is possible to get very good performance out of CUDA. It is important, however, to try to make all threads in the same warp perform the same instructions most of the time, so that the processor can fully utilize the SIMD operations, and also, since there is no cache, to try to organize data so that memory operations coalesce as much as possible, something which is not always trivial. The local memory is very fast, just as fast as accessing a register, and should be used for common data and communication, but it is very small since it is being shared by a larger number of threads, and there is a challenge in how to use it optimally.

This paper is divided into two parts. In the first part we take a look at the GPU-Quicksort algorithm and in the second we look at a comparison between different load balancing schemes.
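As a rough illustration of these two points (my sketch, not the authors' code), the kernel below reads global memory so that consecutive threads touch consecutive addresses, letting each warp's reads coalesce, and then uses the fast block-local memory (CUDA shared memory) for the communication between the threads of a block:

// Illustrative sketch, not from the paper: coalesced reads plus a block-local
// reduction in shared memory. Assumes blockDim.x is a power of two.
#include <cuda_runtime.h>

__global__ void blockSums(const int *in, int *blockSum, int n)
{
    extern __shared__ int sdata[];        // fast per-block ("local") memory
    int tid = threadIdx.x;
    int sum = 0;

    // Thread tid reads element base + tid in each pass, so threads 0..31 of a
    // warp read 32 consecutive words and the loads can be coalesced.
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        sum += in[i];

    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction entirely in shared memory; the intra-block communication
    // needs no global memory traffic.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) blockSum[blockIdx.x] = sdata[0];
}

// Launch example: blockSums<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_sums, n);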
4 GPU-Quicksort

In this section we will give an overview of the GPU-Quicksort algorithm. For a more detailed description please consult the original paper [2].

The method used by the algorithm is to recursively partition the sequence to be sorted, i.e. to move all elements that are lower than a specific pivot value to a position to the left of the pivot and to move all elements with a higher value to the right of the pivot. This is done until the entire sequence has been sorted.

In each partition iteration a new pivot value is picked and as a result two new subsequences are created that can be sorted independently. After a while there will be enough subsequences available that each thread block can be assigned one of them. But before that point is reached, the thread blocks need to work together on the same sequences. For this reason, the algorithm is divided up into two, albeit rather similar, phases.

First Phase

In the first phase, several thread blocks might be working on different parts of the same sequence of elements to be sorted. This requires appropriate synchronization between the thread blocks, since the results of the different blocks need to be merged together to form the two resulting subsequences. Newer graphics processors provide access to atomic primitives that can aid somewhat in this synchronization, but they are not yet available on the high-end graphics processors. Because of that, there is still a need to have a thread block barrier-function between the partition iterations. The reason for this is that the blocks might be executed sequentially and there is no way of knowing in which order they will be executed. The only way to synchronize thread blocks is to wait until all blocks have finished executing. Then one can assign new subsequences to them. Exiting and reentering the GPU is not expensive, but it is also not delay-free, since parameters need to be copied from the CPU to the GPU.

When there are enough subsequences so that each thread block can be assigned its own subsequence, the algorithm enters the second phase.

Second Phase

In the second phase, each thread block is assigned its own subsequence of input data, eliminating the need for thread block synchronization. This means that the second phase can run entirely on the graphics processor. By using an explicit stack and always recursing on the smallest subsequence, the shared memory required for bookkeeping is minimized.

In-place

On conventional SMP systems it is favorable to perform the sorting in-place, since that gives good cache behavior. But on GPUs, because of their limited cache memory and the expensive thread synchronization that is required when hundreds of threads need to communicate with each other, the advantages of sorting in-place quickly fade away. Here it is better to aim for reads and writes to be coalesced to increase performance, something that is not possible on conventional SMP systems. For these reasons it is better, performance-wise, to use an auxiliary buffer instead of sorting in-place. So, in each partition iteration, data is read from the primary buffer and the result is written to the auxiliary buffer. Then the two buffers switch places, with the primary becoming the auxiliary and vice versa.

4.1 Partitioning

When a sequence is going to be partitioned it is logically divided into m equally sized sections, where m is the number of thread blocks available. Each thread block is then
assigned a section of the sequence. The thread block goes through its assigned data, with all threads in the block accessing consecutive memory so that the reads can be coalesced. This is important, since reads being coalesced will significantly lower the memory access time.

Synchronization

The objective is to partition the sequence, i.e. to move all elements that are lower than the pivot to a position to the left of the pivot in the auxiliary buffer and to move the elements with a higher value than the pivot to the right of the pivot. The problem here is to synchronize this in an efficient way.

Cumulative Sum

A possible solution is to let each thread read an element and then synchronize the threads using a barrier function. By calculating a cumulative sum of the number of threads that want to write to the left and to the right of the pivot respectively, each thread would know that x threads with a lower thread id than its own are going to write to the left of the pivot and that y threads are going to write to the right of the pivot. Each thread then knows that it can write its element to either buf[x+1] or buf[n-(y+1)], depending on whether the element is lower or higher than the pivot.

A Two-Pass Solution

But calculating a cumulative sum is not free, so to improve performance the algorithm goes through the sequence two times. In the first pass each thread just counts the number of elements it has seen that have a value higher (or lower) than the pivot. Then, when the block has finished going through its assigned data, these sums are used to calculate the cumulative sum. Now each thread knows how much memory the threads with a lower id than its own need in total, turning it into an implicit memory-allocation scheme that only needs to run once for every thread block, in each iteration. In the first phase, where several thread blocks access the same sequence, an additional cumulative sum needs to be calculated for the total memory used by each thread block.

When each thread knows where to store its elements, the algorithm goes through the data in a second pass, storing the elements at their new positions in the auxiliary buffer. As a final step, the pivot value is stored in the gap between the two resulting subsequences. The pivot value is now at its final position, which is why it does not need to be included in either of the two subsequences.
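The CUDA kernel below is a hypothetical sketch of this two-pass scheme for the second phase, where one thread block owns one subsequence. The names (partitionBlock, THREADS) and the serial prefix sum are mine, used only for clarity; the real implementation [2, 4] is considerably more refined. The kernel counts in a first pass, turns the per-thread counts into write offsets with a cumulative sum, scatters in a second pass, and finally fills the gap with the pivot value.

// Hypothetical sketch of the two-pass partition step (not the authors' code).
// One thread block partitions in[lo,hi) around pivot into the auxiliary buffer
// out[], leaving elements < pivot on the left, elements > pivot on the right,
// and the pivot value itself in the gap between them.
#include <cuda_runtime.h>

#define THREADS 256   // assumed block size for this sketch

__global__ void partitionBlock(const int *in, int *out,
                               unsigned lo, unsigned hi, int pivot)
{
    __shared__ unsigned ltOff[THREADS], gtOff[THREADS];
    __shared__ unsigned midLo, midHi;
    unsigned tid = threadIdx.x;

    // Pass 1: count elements smaller/greater than the pivot. Consecutive
    // threads read consecutive addresses, so the reads coalesce.
    unsigned lt = 0, gt = 0;
    for (unsigned i = lo + tid; i < hi; i += blockDim.x) {
        int v = in[i];
        lt += (v < pivot);
        gt += (v > pivot);
    }
    ltOff[tid] = lt;
    gtOff[tid] = gt;
    __syncthreads();

    // Cumulative sums over the per-thread counts give each thread its write
    // offsets (done serially by thread 0 here purely for clarity).
    if (tid == 0) {
        unsigned accL = 0, accG = 0;
        for (unsigned t = 0; t < blockDim.x; ++t) {
            unsigned l = ltOff[t], g = gtOff[t];
            ltOff[t] = accL;  gtOff[t] = accG;
            accL += l;        accG += g;
        }
        midLo = lo + accL;   // first slot of the pivot gap
        midHi = hi - accG;   // first slot of the "> pivot" region
    }
    __syncthreads();

    // Pass 2: re-read the same elements and scatter them to their slots.
    unsigned l = lo + ltOff[tid];
    unsigned g = midHi + gtOff[tid];
    for (unsigned i = lo + tid; i < hi; i += blockDim.x) {
        int v = in[i];
        if (v < pivot)      out[l++] = v;
        else if (v > pivot) out[g++] = v;
    }
    __syncthreads();

    // The gap holds only pivot values; the pivot is now at its final position.
    for (unsigned i = midLo + tid; i < midHi; i += blockDim.x)
        out[i] = pivot;
}

On the host side, the first phase then amounts to repeatedly launching a partition kernel, letting it run to completion (kernel termination acting as the only available barrier between thread blocks), swapping the primary and auxiliary buffers, and assigning new pivots and subsequences before the next launch.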
5 Experimental Evaluation

Now we take a look at the experiments performed in the original GPU-Quicksort paper [2]. The GPU-Quicksort algorithm was evaluated on a dual-processor dual-core AMD Opteron 1.8 GHz machine. Two different graphics processors were used, the low-end NVIDIA 8600GTS 256 MiB with 4 multiprocessors and the high-end NVIDIA 8800GTX 768 MiB with 16 multiprocessors. Since the 8800GTX provides no support for the atomic FAA operation, an implementation of the algorithm that exits to the CPU for block-synchronization needed to be used.

GPU-Quicksort was compared to the following state-of-the-art GPU sorting algorithms:

GPUSort: Uses bitonic merge sort [6].
Radix-Merge: Uses radix sort to sort blocks that are then merged [7].
Global Radix: Uses radix sort on the entire sequence [11].
Hybridsort: Uses a bucket sort followed by a merge sort [13].
STL-Introsort: This is the Introsort implementation found in the C++ Standard Library. Introsort is based on Quicksort, but switches to heap-sort when the recursion depth gets too large [10].

All measurements are done on the actual sorting phase; they do not include the time it took to set up the data structures and to transfer the data on and off the graphics memory. The reason for this is the different methods used to transfer data, which would not give a fair comparison between the GPU-based algorithms. Transfer times are also irrelevant if the data to be sorted is already available on the GPU. For those reasons, this way of measuring has become a standard in the literature. The source code of GPU-Quicksort is available for non-commercial use [4].

For benchmarking, a uniform, sorted, zero, bucket, gaussian and staggered distribution was used; these are defined and motivated in [8]. They are commonly used yardsticks to compare the performance of different sorting algorithms. The source of the random uniform values is the Mersenne Twister [9].

5.1 Discussion

Quicksort has a worst case complexity of O(n²), but in practice, and on average when using a random pivot, it tends to be close to O(n log n), which is the lower bound for comparison sorts. In all experiments GPU-Quicksort showed the best performance or was among the best. There was no distribution that caused problems for the performance of GPU-Quicksort. As can be seen when comparing the performance on the two GPUs, GPU-Quicksort shows a speedup of approximately 3x on the higher-end GPU.

On the CPU, Quicksort is normally seen as a faster algorithm, as it can potentially pick better pivot points and does not need an extra check to determine when the sequence is fully sorted. The time complexity of radix sort is O(n), but that hides a potentially high constant which is dependent on the key size. Optimizations are possible to lower this constant, such as constantly checking if the sequence
has been sorted, but that can be expensive when dealing with longer keys. Quicksort being a comparison sort also means that it is easier to modify it to handle different key types.

Figure 1. Results on the 8800GTX. (Time in milliseconds on a logarithmic scale versus number of elements in millions, for the uniform, sorted, zero, bucket, gaussian and staggered distributions.)

Figure 2. Results on the 8600GTS. (Same measurements as Figure 1, additionally including the hybrid sort.)

The hybrid approach uses atomic instructions that were only available on the 8600GTS. It outperforms both GPU-Quicksort and the global radix sort on the uniform distribution. But it loses speed on the staggered distributions and becomes immensely slow on the zero distribution. The authors state that the algorithm drops in performance when faced with already sorted data, so they suggest randomizing the data first, but this would not affect the result on the zero distribution.

GPUSort does not increase as much in performance as the other algorithms when executed on the higher-end GPU. It goes from being much faster than the slow radix-merge sort to performing on par with, and even slightly slower than, it. The global radix sort showed a 3x speed improvement, as did GPU-Quicksort.

All algorithms showed about the same performance on
the uniform, bucket and Gaussian distributions. GPUSort always shows the same result independent of distribution, since it is a sorting network, which means it always performs the same number of operations regardless of the distribution.

The staggered distribution was more interesting. On the low-end GPU the hybrid sorting was more than twice as slow as on the uniform distribution. GPU-Quicksort also dropped in speed and started to show the same performance as GPUSort. This can probably be attributed to the choice of pivot selection, which was more optimized for uniform distributions.

The zero distribution, which can be seen as an already sorted sequence, affected the algorithms to different extents. The STL reference implementation increased dramatically in performance, since its two-way partitioning function always returned even partitions regardless of the pivot chosen. GPU-Quicksort shows the best performance as it does a three-way partitioning and can sort the sequence in O(n) time.

6 Dynamic Load Balancing on Graphics Processors

Load balancing plays a significant role in the design of efficient parallel algorithms and applications. We have looked at a comparison of four different dynamic load balancing methods that was made to see which one is the most suited to the highly parallel world of graphics processors [3]. Three of these methods were lock-free and one was lock-based. They were evaluated on the task of creating an octree partitioning of a set of particles. The experiments showed that synchronization can be very expensive and that new methods that take more advantage of the graphics processors' features and capabilities might be required. They also showed that lock-free methods achieve better performance than blocking ones and that they can be made to scale with increased numbers of processing units.

7 Load Balancing Methods

This section gives an overview of the different load balancing methods that were compared in the paper [3].

7.1 Static Task List

The default method for load balancing used in CUDA is to divide the data that is to be processed into a list of blocks or tasks. Each processing unit then takes one task from the list and executes it. When the list is empty, all processing units stop and control is returned to the CPU. This is a lock-free method and it is excellent when the work can be easily divided into chunks of similar processing time, but it needs to be improved upon when this information is not known beforehand. Any new tasks that are created during execution will have to wait until all the statically assigned tasks are done, or be processed by the thread block that created them, which could lead to an unbalanced workload on the multiprocessors.

The method, as implemented in this work, consists of two steps that are performed iteratively. The only data structures required are two arrays containing tasks to be processed and a tail pointer. One of the arrays is called the in-array and only allows read operations, while the other, called the out-array, only allows write operations.

In the first step of the first iteration the in-array contains all the initial tasks. For each task in the array a thread block is started. Each thread block then reads task i, where i is the thread block ID. Since no writing is allowed to this array, there is no need for any synchronization when reading. If any new task is created by a thread block while performing its assigned task, it is added to the out-array. This is done by incrementing the tail pointer using the atomic FAA instruction. FAA returns the value of a variable and increments it by a specified number atomically. Using this instruction the tail pointer can be moved safely, so that multiple thread blocks can write to the array concurrently.

The first step is over when all tasks in the in-array have been executed. In the second step the out-array is checked to see if it is empty or not. If it is empty, the work is completed. If not, the pointers to the out- and in-array are switched so that they change roles. Then a new thread block is started for each of the items in the new in-array, and this process is repeated until the out-array is empty.
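A hypothetical sketch of this scheme follows (the names Task, needsSplit and staticListStep are mine, not from the paper): each thread block handles one task from the in-array and appends any newly created sub-tasks to the out-array by advancing the shared tail pointer with FAA.

// Hypothetical sketch of the static task list (not the authors' code).
#include <cuda_runtime.h>

struct Task { int begin, end; };   // stand-in for whatever a task contains

__device__ bool needsSplit(const Task &t) { return (t.end - t.begin) > 32; }

__global__ void staticListStep(const Task *inTasks, Task *outTasks, int *outTail)
{
    // Block i processes task i of the read-only in-array; no synchronization
    // is needed for the read.
    Task t = inTasks[blockIdx.x];

    // ... all threads of the block cooperate on the actual work for t ...

    // New sub-tasks are appended to the out-array: FAA on the tail pointer
    // hands this block two private slots, so several blocks can append
    // concurrently without interfering.
    if (threadIdx.x == 0 && needsSplit(t)) {
        int mid  = (t.begin + t.end) / 2;
        int slot = atomicAdd(outTail, 2);
        outTasks[slot]     = Task{t.begin, mid};
        outTasks[slot + 1] = Task{mid, t.end};
    }
}

// Host-side idea: launch one block per task in the in-array, read back the
// tail counter, swap the in- and out-array pointers, and repeat while the
// new in-array is non-empty.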
7.2 Blocking Dynamic Task Queue

In order to be able to add new tasks during runtime, a parallel dynamic task queue was designed that thread blocks can use to announce and acquire new tasks. As several thread blocks might try to access the queue simultaneously, it is protected by a lock so that only one thread block can access the queue at any given time. This is a very easy and standard way to implement a shared queue, but it lowers the available parallelism, since only one thread block can access the queue at a time even if there is no conflict between them.

The queue is array-based and uses the atomic CAS (Compare-And-Swap) instruction to set a lock variable to ensure mutual exclusion. When the work is done the lock variable is reset so that another thread block might try to grab it.
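As a minimal, hypothetical sketch of that locking discipline (an illustration only, not the paper's queue), one thread per block spins with CAS on the lock word before the block touches the shared queue, and releases it afterwards:

// Hypothetical sketch of the blocking queue's mutual exclusion (not the
// authors' code). Only one thread per block takes the lock on behalf of the
// whole block.
#include <cuda_runtime.h>

struct BlockingQueue {
    int *items;   // array-based storage
    int  tail;    // next free slot
    int  lock;    // 0 = free, 1 = taken
};

__device__ void enqueueBlocking(BlockingQueue *q, int item)
{
    if (threadIdx.x == 0) {
        // Spin until the CAS manages to change lock from 0 to 1. This busy
        // waiting is what the evaluation later identifies as the main cost
        // of the blocking method.
        while (atomicCAS(&q->lock, 0, 1) != 0) { }

        q->items[q->tail++] = item;    // critical section

        __threadfence();               // make the update visible to others...
        atomicExch(&q->lock, 0);       // ...before the lock is released
    }
    __syncthreads();                   // the rest of the block waits here
}

__global__ void producer(BlockingQueue *q)
{
    enqueueBlocking(q, blockIdx.x);    // each block announces one new task
}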
7.3 Lock-free Dynamic Task Queue

A lock-free queue was implemented to avoid the problems that come with locking and also in order to study the behavior of lock-free synchronization on graphics processors. A lock-free queue guarantees that, without using blocking at all, at least one thread block will always succeed in enqueueing or dequeueing an item at any given time, even in the presence of concurrent operations. Since an operation will only have to be repeated at an actual conflict, it can deliver much more parallelism.

The implementation is based upon the simple and efficient array-based lock-free queue described in a paper by Tsigas and Zhang [14]. A tail pointer keeps track of the tail of the queue, and tasks are added to the queue using CAS. If the CAS operation fails it must be due to a conflict with another thread block, so the operation is repeated on the new tail until it succeeds. This way at least one thread block is always assured to successfully enqueue an item.

The queue uses lazy updating of the tail and head pointers to lower contention. Instead of changing the head/tail pointer after every enqueue/dequeue operation, something that is done with expensive CAS operations, it is only updated every x-th time. This increases the time it takes to find the actual head/tail, since several queue positions need to be checked. But by reading consecutive positions in the array, the reads will be coalesced by the hardware into one fast read operation, and the extra time can be made lower than the time it takes to try to update the head/tail pointer x times.

7.4 Task Stealing

Task stealing is a popular load balancing scheme. Each processing unit is given a set of tasks and when it has completed them it tries to steal a task from another processing unit which has not yet completed its assigned tasks. If a unit creates a new task it is added to its own local set of tasks.

One of the most used task stealing methods is the lock-free scheme by Arora et al. [1] with multiple array-based double ended queues (deques). This method will be referred to as ABP task stealing in the remainder of this paper. In this scheme each thread block is assigned its own deque. Tasks are then added and removed from the tail of the deque in a LIFO manner. When the deque is empty the process tries to steal from the head of another process' deque. Since only the owner of the deque is accessing the tail of the deque, there is no need for expensive synchronization when the deque contains more than one element. Several thread blocks might however try to steal at the same time, requiring synchronization, but stealing is assumed to occur less often than a normal local access.

The implementation is based on the basic non-blocking version by Arora et al. [1]. The stealing is performed in a global round-robin fashion, so thread block i looks at thread block i+1 followed by i+2 and so on.
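Both the lock-free queue and the steal operation rest on the same basic pattern: read the shared pointer, attempt a CAS, and only retry if another block actually interfered. The fragment below is a much-simplified, hypothetical illustration of that pattern for a bounded enqueue; it omits the dequeue side, wrap-around and the lazy pointer updates of the real Tsigas-Zhang queue, and the names are mine.

// Hypothetical sketch of a CAS-with-retry enqueue (not the authors' code).
// Slots are claimed by CAS directly on the array; the tail pointer is then
// advanced cooperatively. A failed CAS means some other block succeeded, so
// the structure as a whole always makes progress.
#include <cuda_runtime.h>

#define EMPTY (-1)   // marker for a free slot; real items must never be EMPTY

__device__ bool enqueueLockFree(int *items, unsigned *tail,
                                unsigned capacity, int item)
{
    while (true) {
        unsigned t = atomicAdd(tail, 0u);        // atomic read of the tail
        if (t >= capacity)
            return false;                        // queue is full

        if (atomicCAS(&items[t], EMPTY, item) == EMPTY) {
            atomicCAS(tail, t, t + 1);           // help advance the tail
            return true;                         // we own slot t
        }
        // Someone else claimed slot t first: help move the tail forward and
        // retry on the new tail.
        atomicCAS(tail, t, t + 1);
    }
}

// items[] must be initialized to EMPTY and *tail to 0 before launching.
__global__ void producers(int *items, unsigned *tail, unsigned capacity)
{
    if (threadIdx.x == 0)
        enqueueLockFree(items, tail, capacity, (int)blockIdx.x);
}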
8 Octree Partitioning

To evaluate the dynamic load balancing methods described in the previous section, they were applied to the task of creating an octree partitioning of a set of particles [12]. An octree is a tree-based spatial data structure that recursively divides the space in each direction, creating eight octants. This is done until an octant contains less than a specific number of particles. The fact that there is no information beforehand on how deep each branch in the tree will be makes this a suitable problem for dynamic load balancing.

9 Experimental Evaluation

Two different graphics processors were used in the experiments: a lower-end GT-series NVIDIA graphics processor with 512 MiB of memory, and the NVIDIA 8800GT with 512 MiB of memory and 112 cores. Two input distributions were used, one where all particles were randomly picked from a cubic space and one where they were randomly picked from a space shaped like a geometrical tube. All methods were initialized by a single iteration using one thread block. The maximum number of particles allowed in any given octant was set to the same value for all experiments.

9.1 Discussion

Figure 3 is taken from the original paper and shows the time it took to partition two different particle sets on the 8800GT graphics processor using each of the load balancing methods while varying the number of threads per block and blocks per grid [3]. The static method always uses one block per task and is thus shown in a 2D graph. Figure 4 shows the time taken to partition particle sets of varying size using the combination of threads per block and blocks per grid found to be optimal in the previously described graph.

Figure 3 (a) clearly shows that using too few threads with the blocking method gives the worst performance in all of the experiments. This is due to the expensive spinning on the lock variable. These repeated attempts to acquire the lock cause the bus to be locked for long periods of time during which only 32-bit memory accesses are done. With a higher thread count the number of concurrent thread blocks is lowered from three to one, due to the limited number of available registers per thread, which leads to less lock contention and much better performance. This shows that the blocking method scales very poorly. In fact, the best result is achieved when using less than ten blocks, that is, by not using all of the multiprocessors!
Figure 3. Comparison of load balancing methods on the 8800GT: (a) Blocking Queue, (b) Non-Blocking Queue, (c) ABP Task Stealing, (d) Static Task List. Shows the time taken to partition a Uniform (filled grid) and Tube (unfilled grid) distribution of half a million particles using different combinations of threads/block and blocks/grid.

Figure 4. A comparison of the load balancing methods on the uniform distribution.
The non-blocking queue-based method, shown in Figure 3 (b), can take better advantage of an increased number of blocks per grid. The performance increases quickly when more blocks are added, but the effect fades well before the number of blocks reaches 42, the number of blocks that can run concurrently when using a small number of threads, which is where it was expected to fade. This means that even though its performance is much better than that of its blocking counterpart, it still does not scale as well as would have been expected. This can also clearly be seen when the thread boundary is passed and an increase in performance is shown instead of the anticipated drop.

In Figure 3 (c) the result from the ABP task stealing can be seen, and it lies closer to the ideal. Adding more blocks continues to increase the performance over a wider range of block counts than for the queue-based methods. Adding more threads also increases performance until the expected drop at higher thread counts per block. There is also a slight drop after 32 threads, since the warp size is passed and there are now incomplete warps being scheduled. Figure 4 shows that the work stealing gives great performance and is not affected negatively by the increase in the number of cores on the 8800GT.

In Figure 3 (d) the static method shows similar behavior to the task stealing. When increasing the number of threads used by the static method from 8 upwards there is a steady increase in performance. Then there are the expected drops, due to incomplete warps and fewer concurrent thread blocks. Increasing the number of threads further does not give any increase in speed, as the synchronization overhead in the octree partitioning algorithm becomes dominant. The optimal number of threads found for the static method in this way is what was used when it was compared to the other methods in Figure 4. As can be seen in Figure 3, adding more blocks than needed is not a problem, since the only extra cost is an extra read of the finishing condition for each additional block.

10 Conclusions

The blocking queue performed quite poorly and scaled badly when faced with more processing units, something which can be attributed to the inherent busy waiting. The non-blocking queue performed better but scaled poorly when the number of processing units got too high. Since the number of tasks increased quickly and the tree itself was relatively shallow, the static queue performed well. The ABP task stealing method performed very well and outperformed the static method. The experiments showed that lock-free methods achieve better performance than blocking ones and that they can be made to scale with increased numbers of processing units.

References

[1] N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread Scheduling for Multiprogrammed Multiprocessors. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, pages 119-129, 1998.
[2] D. Cederman and P. Tsigas. A Practical Quicksort Algorithm for Graphics Processors. In Proceedings of the 16th Annual European Symposium on Algorithms (ESA 2008), Lecture Notes in Computer Science Vol. 5193. Springer-Verlag, 2008.
[3] D. Cederman and P. Tsigas. On Dynamic Load Balancing on Graphics Processors. In Proceedings of Graphics Hardware (GH 2008), pages 57-64. ACM Press, 2008.
[4] D. Cederman and P. Tsigas. GPU Quicksort Library. www.cs.chalmers.se/~dcs/gpuqsortdcs.html, December 2007.
[5] NVIDIA. CUDA.
[6] N. Govindaraju, N. Raghuvanshi, M. Henson, and D. Manocha.
A Cache-Efficient Sorting Algorithm for Database and Data Mining Computations using Graphics Processors. Technical report, University of North Carolina at Chapel Hill, 2005.
[7] M. Harris, S. Sengupta, and J. D. Owens. Parallel Prefix Sum (Scan) with CUDA. In H. Nguyen, editor, GPU Gems 3. Addison-Wesley, Aug. 2007.
[8] D. R. Helman, D. A. Bader, and J. JáJá. A Randomized Parallel Sorting Algorithm with an Experimental Study. Journal of Parallel and Distributed Computing, 52(1):1-23, 1998.
[9] M. Matsumoto and T. Nishimura. Mersenne Twister: a 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator. ACM Transactions on Modeling and Computer Simulation, 8(1):3-30, 1998.
[10] D. R. Musser. Introspective Sorting and Selection Algorithms. Software - Practice and Experience, 27(8):983-993, 1997.
[11] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan Primitives for GPU Computing. In Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, pages 97-106, 2007.
[12] M. Shephard and M. Georges. Automatic three-dimensional mesh generation by the finite Octree technique. International Journal for Numerical Methods in Engineering, 32:709-749, 1991.
[13] E. Sintorn and U. Assarsson. Fast Parallel GPU-Sorting Using a Hybrid Algorithm. In Workshop on General Purpose Processing on Graphics Processing Units, 2007.
[14] P. Tsigas and Y. Zhang. A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems. In Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 134-143, 2001.