Multi-GPU Load Balancing for In-situ Visualization
R. Hagan and Y. Cao
Department of Computer Science, Virginia Tech, Blacksburg, VA, USA

Abstract

Real-time visualization is an important tool for immediately inspecting the results of scientific simulations. Graphics Processing Units (GPUs), as commodity computing devices, offer massive parallelism that can greatly improve performance for data-parallel applications. A single GPU, however, provides limited resources suitable only for smaller-scale simulations. Multi-GPU computing, on the other hand, allows simulation and rendering to be computed concurrently on separate GPUs, but using multiple GPUs can introduce workload imbalance that decreases utilization and performance. This work proposes load balancing for in-situ visualization on multiple GPUs in a single system. We demonstrate the effectiveness of the load balancing method with an N-body simulation and a ray tracing visualization by varying input size, supersampling, and simulation parameters. Our results show that the load balancing method can accurately predict the optimal workload balance between simulation and ray tracing to significantly improve performance.

Keywords: Multi-GPU Computing, Load Balancing, In-situ Visualization, N-body Simulation, Ray Tracing

1. Introduction

GPU computing offers massively parallel processing that can greatly accelerate a variety of data-parallel applications. Using multiple GPUs can yield even greater performance gains by overlapping computation, executing multiple tasks on different GPUs. This provides an opportunity to handle larger-scale problems that a single GPU cannot process in real-time. The resulting increase in runtime speed allows real-time navigation and interaction, which leads to a much more effective visualization experience. By designing effective algorithms to run on multiple GPUs, a considerable improvement in computational power can be realized.
Effective load balancing can greatly increase utilization and performance in a multi-GPU environment by distributing workloads equally. These properties make multiple GPUs suitable for in-situ visualization applications that use the GPU for concurrent simulation and rendering in interactive visualization. N-body simulation is one such application, computing the interactions among a group of bodies: the N-body problem can be solved by computing the force of all bodies on each other. The problem arises in many domains, including biomolecular and physics applications. As in the work of [1], the gravitational N-body problem can be expressed as

    F_i = G m_i \sum_{1 \le j \le N,\, j \ne i} \frac{m_j \mathbf{r}_{ij}}{\|\mathbf{r}_{ij}\|^3}    (1)

where F_i is the computed force on body i, m_i is the mass of body i, r_ij is the vector from body i to body j, and G is the gravitational constant. In a molecular simulation, an N-body algorithm can be used to compute the interaction of each atom in the molecule. Such an application benefits from using multiple GPUs to both compute new simulation frames and render those frames in parallel. Computing simulation together with rendering in real-time allows interactive updates in visualization applications, resulting in a smooth interaction experience backed by increased processing power.

While multiple GPUs can offer a large performance gain in visualization applications, scheduling multiple tasks poses challenges. Load imbalance can lead to underutilization of available resources and reduced performance in the visualization. Multi-GPU computing therefore benefits from a load balancing method that improves workload distribution, and that method needs to account for performance in order to maximize use of available resources. Applying the load balancing method to an N-body simulation accounts for workload differences between simulation and rendering to maintain a more equal workload distribution.
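As a concrete illustration of Equation (1), a minimal all-pairs force computation can be sketched in Python with NumPy. This is a CPU stand-in for the CUDA kernel of [1], not the paper's implementation; the softening term `eps` is an added assumption commonly used to avoid division by zero.

```python
import numpy as np

def nbody_forces(pos, mass, G=6.674e-11, eps=1e-9):
    """All-pairs gravitational forces, Equation (1).

    pos:  (N, 3) array of body positions
    mass: (N,) array of body masses
    Returns an (N, 3) array of forces F_i.
    """
    # r_ij = p_j - p_i for every pair, broadcast to shape (N, N, 3)
    r = pos[None, :, :] - pos[:, None, :]
    # |r_ij|^3, softened so the i == j term stays finite
    dist3 = (np.sum(r * r, axis=-1) + eps) ** 1.5
    # m_j / |r_ij|^3, with the diagonal zeroed to enforce j != i
    coef = mass[None, :] / dist3
    np.fill_diagonal(coef, 0.0)
    # F_i = G * m_i * sum_j m_j r_ij / |r_ij|^3
    return G * mass[:, None] * np.einsum('ij,ijk->ik', coef, r)
```

For two unit masses one unit apart (with G = 1), the forces come out equal and opposite, each pointing toward the other body, as the symmetry of Equation (1) requires.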
Visualization algorithms such as ray tracing can be computationally expensive, so specific techniques to distribute and load balance rendering workloads can offer considerable performance gains for visualization applications. Taking advantage of concurrent computations of visualizations with simulation can lead to a significant performance improvement to maintain interactivity in these applications. While load balancing can improve performance when using multiple GPUs, several factors need to be accounted for in visualization. The cost of simulation and rendering can depend on the chosen algorithms for each. The input data size may differ for applications, which can have a varying effect on simulation and rendering time. Accuracy of simulation can be improved by using more accurate techniques or by decreasing the timestep in simulation. In visualization, image quality is important to produce a better result for interactive viewing. Ray tracing is a rendering method that can produce realistic results on the GPU for visualization.
Supersampling can further improve image quality by using multiple samples per pixel to decrease aliasing. Ray tracing with supersampling can greatly improve the results of interactive visualization but significantly increases the computation, which can be accelerated with multiple GPUs. Supersampling and improved simulation accuracy vary the cost of computation, which requires adjusting the load balance for multi-GPU processing. Our method addresses this issue to improve performance in multi-GPU visualization. Due to the significant gains possible with multiple GPUs, we implement load balancing for multi-GPU visualization applications. Our work provides several contributions, including:

- Acceleration of an N-body simulation and ray tracing application using multiple GPUs
- Performance analysis of workload variation based on multiple input parameters
- A load balancing method that predicts the optimal workload distribution and significantly improves performance

2. Related Work

There are several related areas of previous research, including multi-GPU computing and visualization using the GPU. Previous work in multi-GPU visualization includes several applications that use multiple GPUs for rendering. Fogal et al. present a system for visualizing volume datasets on a GPU cluster [2]; however, their work could benefit from additional load balancing between GPU tasks to provide more flexible and effective workload distribution for simulation and rendering. Monfort et al. present an analysis of split-frame and alternate-frame rendering with multiple GPUs for a game engine [3], including an analysis of load balancing for a combined rendering mode to improve utilization of multiple GPUs. Binotto et al. present load balancing for a CFD simulation application [4], using an initial static analysis followed by adaptive load balancing based on factors including performance results.
While these previous works present load balancing techniques, they do not focus on load balancing for in-situ visualization across rendering and simulation tasks. We present a performance model and load balancing technique for simulation and rendering that improves load balancing and accounts for the pipelining required in a multi-GPU environment. Other work has focused on streaming for out-of-core rendering. Gobbetti et al. present Far Voxels, a visualization framework for out-of-core rendering of large datasets using level-of-detail and visibility culling [5]. Crassin et al. present GigaVoxels, an out-of-core framework for volume rendering of massive datasets using a view-dependent data representation [6]. While our load balancing method also uses multiple GPUs and similar pipelining to visualize datasets, we focus specifically on load balancing for in-situ visualization using ray tracing. Several other frameworks have been proposed that use multiple GPUs for general-purpose computation. Harmony is a framework that dynamically schedules kernels [7]. Merge provides a framework for heterogeneous scheduling that exposes a map-reduce interface [8]. DCGN is a framework that allows dynamic communication through a message-passing API [9]. However, these works provide general frameworks not specific to visualization and could benefit from additional tools for load balancing. We employ similar techniques to improve multi-GPU performance, but we provide improved load balancing in a combined visualization and simulation application. Other previous work has focused on GPU computing for molecular dynamics, which relates to the N-body simulation and visualization used in our work. Past work includes Amber, a molecular dynamics software package that offers tools for molecular simulation [10]; such a simulation can be used to compute the motion of atoms over time via an N-body simulation.
Other research in molecular dynamics includes work by Anandakrishnan et al. on using an N-body simulation to compute the interaction of atoms in a molecule [11]. Humphrey et al. present Visual Molecular Dynamics (VMD), a software package for visualization of molecular datasets [12]. However, VMD provides primarily off-line rendering that does not simulate and render each frame interactively for real-time user interaction. Stone et al. present work that computes molecular dynamics simulations on multi-core CPUs and GPUs [13]; furthermore, they visualize molecular orbitals in an interactive rendering in VMD. We also apply our work to an N-body simulation, but we focus on load balancing across multiple GPUs for both simulation and rendering, while their work focuses on data-parallel algorithms on single GPUs. Chen et al. present multi-GPU load balancing applied to molecular dynamics, using dynamic load balancing with a task queue [14]. Their framework focuses on fine-grained load balancing within a single GPU, while ours focuses on coarse-grained task scheduling among GPUs for both simulation and visualization. While there has been considerable work in visualization and multi-GPU computing, we focus on a method for load balancing between simulation and rendering to improve utilization in the pipelined memory model useful for in-situ visualization.

3. Methods

Our approach addresses workload imbalance in in-situ visualization applications. We first describe the problems in this application area and then present our load balancing method for solving them.
3.1 Multi-GPU Architecture

In comparison with a single GPU, a multi-GPU implementation has several advantages. Most notably, multiple GPUs can overlap concurrent computations. However, unlike a single GPU, memory transfers are required to ensure that each GPU has the data it needs. Figure 1 shows the multi-GPU configuration used in our application. It identifies how multiple GPUs can be used for overlapping computation between simulation and rendering, while host memory is used to transfer results among GPUs. For in-situ visualization, simulation data must be transferred to rendering tasks to render the resulting image: the data is first transferred to the host and then to the recipient GPU. This creates a pipelined model of execution in which multiple GPUs process data concurrently but must transfer data through host memory.

Fig. 1: Diagram of multi-GPU configuration: "Sim" refers to simulation and "Vis" refers to visualization
Fig. 2: Memory transfers between host and GPU memory

3.2 Multi-GPU Workload Imbalance

While multiple GPUs greatly increase the available processing power and improve performance by overlapping computation, they introduce issues of workload distribution among processors. This workload imbalance results from the synchronization required through host memory. Each task either sends or receives data through a buffer. A single buffer would require simulation tasks to wait for rendering tasks to read its data before writing the next frame, which can cause significant idle time that decreases utilization and performance. Multiple buffers allow one task to read or write data to several buffers before having to wait for other tasks to process the data, as shown in Figure 2. Thus, multiple buffers can improve load balance at the cost of additional memory.
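The multi-buffer exchange through host memory can be sketched as a bounded producer-consumer queue. This is a simplified illustration, not the paper's implementation: Python threads stand in for the simulation and rendering GPUs, strings stand in for frame data, and `NUM_BUFFERS` is an assumed configuration parameter.

```python
import threading
import queue

NUM_BUFFERS = 3   # number of host-memory buffers shared between tasks (assumed)
FRAMES = 10

# A bounded queue models the pool of host buffers: the simulation task
# blocks on put() when all buffers are full (simulation faster than
# rendering), and the rendering task blocks on get() when all buffers
# are empty (rendering faster than simulation).
buffers = queue.Queue(maxsize=NUM_BUFFERS)
rendered = []

def simulate():
    for frame in range(FRAMES):
        positions = f"positions for frame {frame}"  # placeholder for GPU results
        buffers.put(positions)                      # write into a free host buffer

def render():
    for _ in range(FRAMES):
        positions = buffers.get()                   # read the next host buffer
        rendered.append(f"image from {positions}")
        buffers.task_done()

sim = threading.Thread(target=simulate)
vis = threading.Thread(target=render)
sim.start(); vis.start()
sim.join(); vis.join()
```

The blocking behavior of the bounded queue is exactly the idle time described in this section: whichever side runs faster ends up waiting once the buffer pool is exhausted or drained.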
Host memory is finite, however, so tasks must eventually wait if the workload imbalance is significant. If simulating a frame takes less time than rendering one, host memory eventually fills and simulation must wait for rendering; if rendering a frame takes less time than simulation, the host buffers drain and rendering tasks must wait for simulation. Figure 3 shows the case where performance decreases due to improper workload distribution: since rendering GPUs must read simulation data before simulation can overwrite it, idle time is introduced when simulation time is shorter than rendering time.

Fig. 3: Simulation tasks must wait for ray tracing to read results. "W1" refers to writing data to the first buffer in host memory from a GPU, while "R2" refers to reading data from the second buffer in host memory

Accounting for workload imbalance can eliminate this idle time by distributing work equally among available processors, as shown in Figure 4. Balanced tasks send and receive data at an equal rate, improving utilization; however, this requires formulating a load balancing method. Varying task workloads complicate the load balancing problem. For example, the rendering technique or the number of supersampling samples can change the workload and introduce additional idle time, and simulation factors such as accuracy or the type of simulation can also affect runtime, resulting in a different optimal workload balance. These issues demonstrate the need for load balancing in in-situ visualization: given an initial set of rendering and simulation characteristics for a specific visualization, finding the optimal load balance reduces idle time and improves performance.

3.3 Load Balancing

Our multi-GPU implementation enables load balancing techniques for in-situ visualization. Our test
application uses a gravitational N-body simulation based on the method of [1] to compute particle interactions, while ray tracing renders the results. Particle position data is transferred through the host using multiple buffers to pipeline positions between simulation and rendering. Since simulation and rendering may have different workloads, it is important to address possible workload imbalance between the tasks. In this application, load balancing requires partitioning the dataset to distribute work to GPUs. Two types of work partitioning are possible: inter-frame and intra-frame. Inter-frame partitioning distributes complete frames of data; ray tracing in this implementation uses inter-frame partitioning to render entire frames, which avoids the communication needed to combine partial results and improves performance. Intra-frame partitioning distributes parts of a single frame to GPUs; the N-body simulation uses intra-frame partitioning by having each GPU update only a subset of the particles for a single frame. Since each frame of simulation depends on the previous one, simulation cannot be computed out of order; multiple GPUs can therefore only accelerate simulation by having each GPU update a subset of the dataset.

Intra-frame partitioning for simulation is thus necessary to apply load balancing. While this requires communication to combine the results for each frame, the computation can be distributed among multiple GPUs. This load balancing method therefore uses groups of GPUs to compute frames of simulation data, while individual GPUs separately render consecutive frames. Partitioning both rendering and simulation tasks enables load balancing in the visualization application: as more processors are dedicated to simulation, fewer are available to render consecutive frames. The total visualization time per frame determines the optimal balance between rendering and simulation among the available processors.

Fig. 4: Use of load balancing reduces idle time

To address workload imbalance, we present a load balancing method that finds the optimal distribution of work. We construct a three-dimensional parameter matrix M that is used to determine the appropriate balance for the visualization application. The input dimensions of M are the number of samples for supersampling, the number of iterations for simulation, and the input size; the associated output values are performance times for these configurations. Our method first collects performance results for this matrix and then computes the desired workload balance for a new set of input parameters. Thus, we find the solution to the function

    f(i, s, p) = g    (2)

where f computes the optimal workload distribution, i is the number of iterations for simulation, s is the number of samples for supersampling, p is the number of particles in the simulation, and g is the number of GPUs allocated to rendering versus simulation. The predicted optimal load balance is computed from previous results through trilinear interpolation. Given known optimal workload distributions for sets of input parameters, our model predicts the optimal load balancing result g for a new set of input parameters:

    L_{isp} = L_{000}(1-i)(1-s)(1-p) + L_{100}\, i(1-s)(1-p)
            + L_{010}(1-i)s(1-p) + L_{001}(1-i)(1-s)p
            + L_{101}\, i(1-s)p + L_{011}(1-i)sp
            + L_{110}\, is(1-p) + L_{111}\, isp    (3)

where L is the optimal load balance, i is the number of simulation iterations, s is the number of samples for ray tracing, and p is the number of particles. Here, i, s, and p are normalized to the range [0, 1] for interpolation, and the result g is rounded to the nearest integer.
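Equation (3) is standard trilinear interpolation over the corners of the parameter cube. A sketch in Python follows; the corner values in `L` are hypothetical measured optima, not the paper's data.

```python
def predict_gpu_split(i, s, p, L):
    """Trilinear interpolation of the optimal rendering-GPU count, Equation (3).

    i, s, p: iterations, samples, particles, each normalized to [0, 1]
    L: dict mapping a corner (ci, cs, cp) in {0,1}^3 to the measured
       optimal number of rendering GPUs at that corner
    """
    g = 0.0
    for ci in (0, 1):
        for cs in (0, 1):
            for cp in (0, 1):
                # weight is i or (1-i) per axis, matching the terms of Eq. (3)
                w = ((i if ci else 1 - i)
                     * (s if cs else 1 - s)
                     * (p if cp else 1 - p))
                g += L[(ci, cs, cp)] * w
    return round(g)  # g is rounded to an integer number of GPUs

# Hypothetical corner measurements: more samples favor more rendering GPUs,
# while more iterations or particles favor fewer.
L = {(0, 0, 0): 5, (1, 0, 0): 3, (0, 1, 0): 6, (0, 0, 1): 4,
     (1, 1, 0): 4, (1, 0, 1): 2, (0, 1, 1): 5, (1, 1, 1): 3}
```

At a corner of the cube the weights collapse to a single term and the prediction equals the measured value there; interior points blend all eight corners, which is the behavior the model relies on for new parameter sets.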
This result gives a prediction of the optimal load balance for a given set of input parameter values.

4. Results

The multi-GPU load balancing method was tested with an N-body simulation and ray tracing of thousands of spheres. All tests were run on a single computer with eight GTX 295 graphics cards. The final result of the visualization and the differences in supersampling can be seen in Figure 5. Aliasing artifacts due to inadequate sampling are visible in the image on the left, rendered with one sample per pixel; using sixteen randomized samples per pixel significantly improves the result.

Fig. 5: Comparison of single sample (left) and 16 sample randomized supersampling (right)

While supersampling improves the quality of the final image, it comes at a performance cost, as shown in Table 1. Execution time increases linearly with the number of samples. Thus, the tradeoff between performance and image quality must be considered when choosing an appropriate number of samples for supersampling.

Table 1: GPU execution time (ms) for ray tracing of 1000 particles, for 1, 4, 8, 12, and 16 samples per pixel

Table 2 shows the execution time for simulation when performing multiple iterations with a smaller timestep. Simulation time increases linearly with the number of iterations. While a smaller timestep provides a more accurate simulation, it introduces additional computation for each frame; decreasing the timestep while increasing the number of iterations therefore increases execution time. Table 3 shows the percent difference in particle positions relative to the simulation with the most iterations, which uses the smallest timestep. Each simulation is carried out over the same total time, with a smaller timestep for simulations run for more iterations. With a smaller timestep, the accuracy of the simulation is improved due to the finer granularity used for integration in the N-body simulation.

Table 4: Execution time for simulation based on number of GPUs used
1 GPU: — | 2 GPUs: — | 3 GPUs: — | 4 GPUs: 9.32 ms | 5 GPUs: 7.58 ms | 6 GPUs: 6.44 ms

Table 4 shows a near-linear decrease in execution time for simulation when partitioning the dataset across multiple GPUs. Due to a small constant overhead for launching kernels, six GPUs achieve slightly less than a six-times speedup over one GPU. Ray tracing has a longer execution time than simulation for smaller dataset sizes. Simulation takes less time for smaller datasets, but with an increased number of simulation iterations its cost can exceed that of ray tracing with fewer samples. These differences in workload affect the final optimal load balance.
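The randomized supersampling compared in the results can be sketched as jittered per-pixel averaging. The `trace` callable here is a hypothetical stand-in for the ray tracer, not the paper's renderer; the per-pixel cost grows linearly with the sample count, matching the linear scaling reported in Table 1.

```python
import random

def supersample_pixel(trace, x, y, samples=16, rng=None):
    """Average `samples` jittered ray-trace results for pixel (x, y).

    trace(u, v) -> (r, g, b) is a hypothetical ray tracer evaluated at
    continuous image coordinates.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    acc = [0.0, 0.0, 0.0]
    for _ in range(samples):
        # jitter the sample position uniformly within the pixel footprint
        u, v = x + rng.random(), y + rng.random()
        r, g, b = trace(u, v)
        acc[0] += r; acc[1] += g; acc[2] += b
    return tuple(c / samples for c in acc)
```

Averaging randomly jittered samples within each pixel is what suppresses the aliasing visible in the single-sample image.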
Table 2: GPU execution time for simulation for 20, 40, 60, and 80 iterations

Table 3: Percent difference in positions from the simulation with the most iterations (columns: Iterations, Percent)

4.1 Workload Characteristics

The multiple input parameters for this application result in many possible workloads, and these varying workloads can introduce a performance penalty if not accounted for in the distribution of work. Figure 6 shows performance times for different workload distributions (numbers of ray tracing tasks) with varying dataset sizes, using 16-sample ray tracing and 80-iteration simulation. The cost of simulation grows faster with dataset size due to the all-pairs nature of the N-body simulation, while ray tracing scales linearly with dataset size. This causes overall performance to be increasingly limited by simulation time for larger datasets.

Fig. 6: Performance time for various input sizes with 16 samples, 80 iterations

Figure 7 shows performance for different input sizes with a varying workload distribution, using four-sample ray tracing and 80 simulation iterations. This graph shows that allocating more GPUs to simulation when the number of samples is low can improve performance. The difference from the trend in Figure 6 also demonstrates that different input parameters can lead to significantly different optimal workload distributions, which requires load balancing.

4.2 Load Balancing

Figure 8 shows the trend of optimal load balance based on the number of iterations for simulation. As shown, increasing the number of iterations requires dedicating more GPUs to simulation to achieve optimal load balance. With the fewest iterations, the majority of GPUs should be allocated to ray tracing due to its greater cost. Increasing the number of samples for supersampling increases the cost of ray tracing and also impacts the load balancing scheme.
Figure 9 shows that increasing the number of samples for supersampling requires additional ray tracing tasks to improve workload balance. A larger dataset size requires fewer GPUs for ray tracing due to the smaller increase in ray tracing cost with larger datasets.

Fig. 7: Performance time for various input sizes with four samples, 80 iterations
Fig. 8: Load balancing for various simulation iterations with four samples, 3000 particles
Fig. 9: Load balancing for various numbers of samples for ray tracing with changing dataset size
Fig. 10: Load balancing for various dataset sizes with 80 simulation iterations

Figure 10 shows the trend for varying dataset size and number of samples with a constant simulation workload. With a larger dataset, simulation becomes increasingly expensive while ray tracing cost increases linearly; it therefore becomes necessary to compute simulation on more GPUs to maintain workload balance. These results demonstrate that significant workload imbalance can arise from the differing workloads of rendering and simulation. Each configuration leads to a different optimal load balancing configuration, and a load balancing method must account for these varying trends to achieve good performance.

4.3 Performance Model

We now summarize the results of applying our load balancing method. Table 5 shows the percent error of our prediction model compared to the actual optimal load-balanced configuration. The model was tested by computing the average percent error over 40 values when varying one dimension (simulation iterations), 20 values when varying two dimensions (iterations and samples), and 12 values when varying three dimensions (iterations, samples, and data size). These results show that the average percent error is below 5 percent for two and three dimensions, and below 10 percent for all categories.
Interpolation with fewer dimensions yields a larger error due to the discretization involved in selecting an integer number of GPUs for load balancing; using more data points across more dimensions decreases this discretization error.

Table 5: Percent error in load balancing model for a varying number of dimensions
1 dimension: 9.88% | 2 dimensions: 2.67% | 3 dimensions: 1.67%

The performance of our load balancing method was also compared against the worst-case and average-case workload distributions. Figure 11 shows the speedup from using the load balancing method. These results show that, by using the model, a significant speedup can consistently be achieved across different parameter configurations.

Fig. 11: Load balancing speedup over the average and worst cases

5. Conclusions and Future Work

We have proposed a multi-GPU load balancing solution for in-situ visualization and presented an analysis of workload properties and load balancing results for an N-body simulation with ray tracing visualization. Our results show that workloads can vary greatly across sets of input parameters, which demonstrates the need for load balancing in multi-GPU computing. Our multi-GPU implementation demonstrates the use of intra- and inter-frame task partitioning for scheduling GPU tasks to enable load balancing. Our tests show that the load balancing method can accurately predict the optimal workload balance and significantly improve performance by increasing utilization of available resources. This work could be extended in several ways. The load balancing approach could be applied to additional visualization applications, where other rendering and simulation methods with varying workloads could be addressed, and different performance models may be useful for other applications. The pipelining model used in this application would also be useful for out-of-core rendering of massive models.

6. Acknowledgements

This work is funded by the Air Force Research Laboratory Munitions Directorate, FA , titled "Unified High-Performance Computing and Visualization Framework on GPU to Support MAV Airframe Research." We also thank Dr. Eli Tilevich for support and discussion on the project.

References

[1] L. Nyland, M. Harris, and J. Prins, Fast N-body simulation with CUDA, GPU Gems 3.
[2] T. Fogal, H. Childs, S. Shankar, J. Krüger, R. D. Bergeron, and P.
Hatcher, Large data visualization on distributed memory multi-GPU clusters, in Proceedings of the Conference on High Performance Graphics (HPG '10), Eurographics Association, 2010.
[3] J. R. Monfort and M. Grossman, Scaling of 3D game engine workloads on modern multi-GPU systems, in Proceedings of the Conference on High Performance Graphics 2009 (HPG '09), ACM, 2009.
[4] A. Binotto, C. Pereira, and D. Fellner, Towards dynamic reconfigurable load-balancing for hybrid desktop platforms, in IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010.
[5] E. Gobbetti and F. Marton, Far voxels: a multiresolution framework for interactive rendering of huge complex 3D models on commodity graphics platforms, in ACM SIGGRAPH 2005 Papers, ACM, 2005.
[6] C. Crassin, F. Neyret, S. Lefebvre, and E. Eisemann, GigaVoxels: ray-guided streaming for efficient and detailed voxel rendering, in Proceedings of the 2009 Symposium on Interactive 3D Graphics and Games (I3D '09), ACM, 2009.
[7] G. F. Diamos and S. Yalamanchili, Harmony: an execution model and runtime for heterogeneous many-core systems, in Proceedings of the 17th International Symposium on High Performance Distributed Computing (HPDC '08), ACM, 2008.
[8] M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng, Merge: a programming model for heterogeneous multi-core systems, SIGPLAN Notices, vol. 43, March.
[9] J. A. Stuart and J. D. Owens, Message passing on data-parallel architectures, in Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS), May.
[10] D. A. Pearlman, D. A. Case, J. W. Caldwell, W. S. Ross, T. E. Cheatham, S. DeBolt, D. Ferguson, G. Seibel, and P. Kollman, AMBER, a package of computer programs for applying molecular mechanics, normal mode analysis, molecular dynamics and free energy calculations to simulate the structural and energetic properties of molecules, Computer Physics Communications, vol. 91, no. 1-3, pp. 1-41.
[11] R. Anandakrishnan and A. V. Onufriev, An N log N approximation based on the natural organization of biomolecules for speeding up the computation of long range interactions, Journal of Computational Chemistry, vol. 31, no. 4.
[12] W. Humphrey, A. Dalke, and K. Schulten, VMD: Visual Molecular Dynamics, Journal of Molecular Graphics, vol. 14.
[13] J. E. Stone, J. Saam, D. J. Hardy, K. L. Vandivort, W.-m. W. Hwu, and K. Schulten, High performance computation and interactive display of molecular orbitals on GPUs and multi-core CPUs, in Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-2), ACM, 2009.
[14] L. Chen, O. Villa, S. Krishnamoorthy, and G. Gao, Dynamic load balancing on single- and multi-GPU systems, in IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010.
