3D Gaussian beam modeling
DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited.

Paul Hursky
3366 North Torrey Pines Court, Suite 310
La Jolla, CA
phone: (858)    fax: (858)
Award Number: N    C

LONG-TERM GOALS

The long-term goals of this research are to:
- provide the underwater acoustic research community and the US Navy with a practical 3D propagation model that can be used over a broad frequency range in passive and active sonar performance prediction, in acoustic communications channel modeling, and in understanding operational concepts impacted by 3D propagation effects;
- provide these predictive capabilities for both pressure and particle velocity sensors.

OBJECTIVES

The initial objective was to implement a 3D Gaussian beam model on multi-core, high-performance computing hardware in order to support mid- and high-frequency applications such as reverberation modeling and acoustic channel simulation. As graphics processing unit (GPU) hardware rapidly emerged as a compelling platform for such work, my objective became to adapt selected acoustic wave propagation models to GPUs.

APPROACH

After assessing the availability and maturity of numerical building blocks for GPUs, I chose a split-step Fourier parabolic equation model as an initial objective: this would allow me to assess the potential of the technology, produce a useful product, and at the same time lay the groundwork for developing a GPU-based 3D Gaussian beam model. There was a very successful off-the-shelf FFT implementation for GPUs, whereas the linear algebra tools needed for some of the other models (e.g., the tri-diagonal linear solver needed by the split-step Pade PE or wavenumber integral models) were still works in progress. I therefore selected the split-step Fourier PE (SSFPE) model for GPU implementation, and I will refer to this implementation as GPU-SSFPE. Although this particular PE model has been somewhat eclipsed by the split-step Pade PE (which requires a tri-diagonal solver), it is still part of the standard OAML set of models, and a case can be made for its use in deep-water scenarios. NVIDIA was the GPU manufacturer that had been most supportive of high-performance scientific computing, in terms of providing tools and software libraries.
NVIDIA provides the Compute Unified Device Architecture (CUDA) framework for developing software to run on GPUs. CUDA provides a C language interface to the GPU hardware: programming in C with a few added constructs to set up Single Instruction Multiple Thread (SIMT) programs, called kernels, which execute in parallel on the individual GPU cores. The strategy for implementing significantly accelerated GPU applications requires managing the flow of data between the CPU and the GPU, and organizing the flow of data among several specialized memory spaces, such as global memory, shared memory, constant memory, texture memory, and register memory. There are ample opportunities to overlap the various data transfers with computations on both the CPU and the GPU. Typically, the bottleneck is in transferring data in and out of the GPU: if too little arithmetic is required for the data being moved between the CPU and the GPU, the GPU cores will not be kept busy enough to achieve the desired acceleration.

I discovered that NVIDIA had opened up its OptiX software infrastructure for ray tracing (tracing optical rays in order to render complex 3D scenes) to scientific applications, such as using rays to model sound propagation. To use OptiX for underwater sound propagation, the 3D waveguide structure must be represented using the data structures OptiX uses for ray tracing. However, it is possible to modify the ray payload and the ray interactions (with objects such as a sound speed layer) to implement customized models, such as a Gaussian beam model for underwater acoustics. There is a great deal of commonality between a 3D Gaussian beam model for underwater acoustics and the tracing of light rays to render realistic 3D scenes, capturing effects such as caustics caused by refraction and transmission through glass objects. Ray tracing is not trivially parallelized: different rays encounter different objects and are extinguished at different path lengths, so the computational load is not evenly distributed over rays. Nevertheless, ray tracing is of sufficient importance to the computer graphics industry that considerable effort has been spent on achieving real-time, high-definition ray-traced rendering. As a result, various acceleration structures have been developed to run ray tracing efficiently on parallel architectures. NVIDIA's general-purpose ray-tracing framework, OptiX, allows third-party developers (like us) to incorporate their own ray payload and their own mathematics into all the ray interactions with objects in a 3D model. In our case, the ray payload includes all of the Gaussian beam parameters that are numerically integrated along the rays, and the mathematics consists of our particular algorithms for numerically integrating these parameters based on the medium properties, which are incorporated into a 3D model of our ocean waveguide. The mathematics is programmed using CUDA C, the same platform I have used to implement our split-step Fourier PE model.
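To make the preceding description concrete, the sketch below shows the kind of per-ray payload and per-arclength update that a Gaussian beam tracer carries along each ray, using the standard dynamic ray equations dq/ds = c p, dp/ds = -(c_nn / c^2) q, and dtau/ds = 1/c (see Ref. [14]). It is a minimal, hypothetical CUDA C sketch; the struct and function names are illustrative, and this is not the OptiX ray program developed in this work.

// Hedged sketch: a per-ray payload of Gaussian beam quantities and a single
// Euler step of the standard dynamic ray equations. Names (BeamPayload,
// advance_beam) are illustrative only; the actual OptiX ray programs differ.
#include <cuda_runtime.h>

struct BeamPayload {
    float2 x;            // ray position (range r, depth z)
    float2 t;            // unit tangent along the ray
    float  c;            // local sound speed at the ray position
    float  tau;          // travel time integrated along the ray
    float  p_re, p_im;   // beam parameter p (complex)
    float  q_re, q_im;   // beam parameter q (complex)
};

// One arclength step ds of the beam-parameter integration:
//   d(tau)/ds = 1/c,  dq/ds = c*p,  dp/ds = -(c_nn/c^2)*q,
// where c_nn is the second derivative of sound speed normal to the ray.
// (The ray-path update for x and t is omitted here for brevity.)
__host__ __device__ void advance_beam(BeamPayload* b, float c_nn, float ds)
{
    b->tau += ds / b->c;

    float dq_re = b->c * b->p_re * ds;
    float dq_im = b->c * b->p_im * ds;
    float dp_re = -(c_nn / (b->c * b->c)) * b->q_re * ds;
    float dp_im = -(c_nn / (b->c * b->c)) * b->q_im * ds;

    b->q_re += dq_re;  b->q_im += dq_im;
    b->p_re += dp_re;  b->p_im += dp_im;
}

int main(void)
{
    // Toy host-side test: isovelocity medium (c_nn = 0), 1 m steps.
    BeamPayload b = { {0.f, 100.f}, {1.f, 0.f}, 1500.f, 0.f, 1.f, 0.f, 0.f, 0.f };
    for (int i = 0; i < 1000; ++i)
        advance_beam(&b, 0.0f, 1.0f);
    return 0;
}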
WORK COMPLETED

An important application for a Gaussian beam model, and perhaps the most demanding in terms of computational load, is synthesizing a high-frequency acoustic channel for acoustic communications. To lay the groundwork for this application, in particular for vector sensors, I did some prototyping and analysis using the Matlab version of Bellhop and the VirTEX channel simulator developed at HLS Research by Martin Siderius and Michael Porter. Bellhop provides a choice of several influence functions for calculating the field. I adapted several of these beam influence functions to produce particle velocity and benchmarked them against the wavenumber-integral OASES model developed by Henrik Schmidt (which can also calculate particle velocities). Integrating these influence functions into a channel simulator involves convolving the analytic-signal form of the broadband acomms waveform being transmitted through the channel with the channel impulse response. A simple and efficient way to perform this operation in terms of the Gaussian beam constructs is to recognize that the pressure gradient along the ray tangent is equivalent to the time derivative of the analytic signal being convolved, while the pressure gradient along the ray normal is obtained by differentiating the beam influence function with respect to the normal coordinate. These two orthogonal convolved components (i.e., ray tangent and ray normal for each traced ray or beam) are projected onto the range and depth axes to form the vector sensor broadband output.
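The following is a minimal, hypothetical sketch of this kind of vector-sensor channel synthesis. For brevity it uses a locally-plane-wave approximation for each arrival (particle velocity of magnitude p/(rho c) directed along the ray tangent at the receiver) and a real toy waveform, rather than the full tangent/normal influence-function derivatives and analytic-signal convolution described above; arrival amplitudes, delays, and angles are assumed to come from a Bellhop/VirTEX-style beam trace.

// Hedged sketch (host code): synthesize pressure and vector-sensor (range, depth)
// channel outputs by summing delayed, scaled copies of a transmit waveform over
// ray arrivals, projecting each arrival onto the range and depth axes using the
// ray tangent angle at the receiver. Plane-wave approximation per arrival.
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    double amp;     // arrival amplitude (from the beam influence function)
    double delay;   // arrival time (s)
    double theta;   // ray tangent angle at the receiver (rad, from horizontal)
} Arrival;

static void synthesize(const double* sig, int nsig, double fs,
                       const Arrival* arr, int narr,
                       double rho, double c,
                       double* p, double* vr, double* vz, int nout)
{
    for (int n = 0; n < nout; ++n) { p[n] = vr[n] = vz[n] = 0.0; }
    for (int k = 0; k < narr; ++k) {
        int shift = (int)(arr[k].delay * fs + 0.5);              // nearest-sample delay
        for (int n = 0; n < nsig && n + shift < nout; ++n) {
            double pk = arr[k].amp * sig[n];                     // delayed, scaled pressure
            p[n + shift]  += pk;
            vr[n + shift] += (pk / (rho * c)) * cos(arr[k].theta);  // range component
            vz[n + shift] += (pk / (rho * c)) * sin(arr[k].theta);  // depth component
        }
    }
}

int main(void)
{
    const double PI = 3.141592653589793;
    const double fs = 48000.0;
    const int nsig = 480, nout = 4800;
    double* sig = (double*)calloc(nsig, sizeof(double));
    double* p   = (double*)calloc(nout, sizeof(double));
    double* vr  = (double*)calloc(nout, sizeof(double));
    double* vz  = (double*)calloc(nout, sizeof(double));
    for (int n = 0; n < nsig; ++n) sig[n] = sin(2.0 * PI * 8000.0 * n / fs);  // toy waveform

    Arrival arr[2] = { {1.0, 0.010, 0.05}, {0.6, 0.013, -0.20} };  // two illustrative paths
    synthesize(sig, nsig, fs, arr, 2, 1000.0, 1500.0, p, vr, vz, nout);
    printf("p[500]=%g vr[500]=%g vz[500]=%g\n", p[500], vr[500], vz[500]);
    free(sig); free(p); free(vr); free(vz);
    return 0;
}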
This work was completed prior to my having set a course to work with GPUs.

I assessed evolving general-purpose graphics processing unit (GPGPU) developments, particularly the key low-level building blocks needed to implement the various types of acoustic propagation models, such as FFTs and the linear algebra building blocks in BLAS and LAPACK. I attended several tutorial sessions on GPGPU software development, developed an understanding of the underlying GPU hardware, and reviewed the available software tools for developing applications. Of the two major GPU manufacturers, NVIDIA and ATI (now part of AMD), NVIDIA was aggressively evangelizing the use of GPUs in high-performance scientific computing, essentially by providing free software development tools and fostering the development of generally useful algorithmic building blocks (like FFTs and BLAS/LAPACK). There are a number of high-level tools that attempt to hide some of this complexity, but my experimentation with these has not yielded good results: there is no interface for specifying how the data is to flow among the GPU and its specialized memories, and it is hard to find a full repertoire of the mathematical functions needed, or, where they exist, they may run four times slower than their CPU counterparts. There are also alternative language bindings to CUDA, including Python (PyCUDA) and Fortran (from the Portland Group, which offers both a low-level Fortran interface to CUDA and an Accelerator that tries to convert existing Fortran code to parallel form to be run in CUDA). A Fortran binding may be helpful for porting a model that is already coded in Fortran, but it is hard to imagine how this can succeed without intelligent restructuring of the original code to adapt it to the GPU architecture. Although I was a bit apprehensive about coding in C, I think that this has allowed me to best expose the potential of this technology. It also set the stage for the more ambitious Gaussian beam implementation using OptiX, which interfaces with CUDA C.

In an initial version of GPU-SSFPE, the GPU was used purely as an FFT engine, in order to establish the performance boost that could be achieved on GPUs and to assess the level of difficulty of developing a more complete GPU implementation. This initial foray into GPU programming produced speedups of 10 to 20 times relative to a CPU-only implementation, and was not significantly more difficult to program than a CPU-only version, in which FFTW ("Fastest FFT in the West") was used as the FFT engine. The open-source FFTW is widely used, including by Matlab, and is reported to be only slightly slower than commercial libraries such as the Intel MKL FFT. FFTW also has a multi-core CPU version using OpenMP, an open standard for parallelizing across multiple cores communicating through shared memory. The OpenMP version of FFTW did not yield run times that scaled with the number of cores, so perhaps there is a better multi-threaded FFT (e.g., in Intel's MKL library).
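To make this configuration concrete, the sketch below wraps a single forward complex transform behind one function, dispatching at compile time to either NVIDIA's CUFFT library (Ref. [12]) on the GPU or to FFTW (Ref. [7]) on the CPU. It is a simplified, illustrative stand-in for how the initial GPU-as-FFT-engine version was structured, not the GPU-SSFPE source; the function name and build lines are assumptions.

// Hedged sketch: one forward complex-to-complex FFT of length N, dispatched to
// either CUFFT (GPU) or FFTW (CPU) at compile time. Illustrative only.
// Build (GPU): nvcc -DUSE_CUFFT fft_demo.cu -lcufft
// Build (CPU): cc fft_demo.c -lfftw3f -lm
#include <stdio.h>
#include <stdlib.h>

#ifdef USE_CUFFT
#include <cuda_runtime.h>
#include <cufft.h>
static void fft_forward(float* re_im, int n)      /* interleaved re/im pairs */
{
    cufftComplex* d_data;
    cudaMalloc(&d_data, n * sizeof(cufftComplex));
    cudaMemcpy(d_data, re_im, n * sizeof(cufftComplex), cudaMemcpyHostToDevice);
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);          /* single 1D transform */
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaMemcpy(re_im, d_data, n * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    cudaFree(d_data);
}
#else
#include <fftw3.h>
static void fft_forward(float* re_im, int n)
{
    fftwf_complex* buf = (fftwf_complex*)re_im;   /* same interleaved layout */
    fftwf_plan plan = fftwf_plan_dft_1d(n, buf, buf, FFTW_FORWARD, FFTW_ESTIMATE);
    fftwf_execute(plan);
    fftwf_destroy_plan(plan);
}
#endif

int main(void)
{
    const int n = 16384;                          /* e.g. a 16k depth grid */
    float* data = (float*)calloc(2 * n, sizeof(float));
    data[0] = 1.0f;                               /* unit impulse -> flat spectrum */
    fft_forward(data, n);
    printf("bin 0: %g + %gi\n", data[0], data[1]);
    free(data);
    return 0;
}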
Once I convinced myself that the performance gains through GPUs were significant, I migrated all the mathematics onto the GPU and fleshed out a relatively complete set of features that would be expected in a split-step Fourier PE. This included three variants of the operator splitting: case I in Ref. [14], case III in [14], and a wider-angle form from [15] that provides the best accuracy in the split-step Fourier repertoire. Each of these operator splittings has different forms for the A and B operators.
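To make the A/B operator structure concrete, the following is a hedged CUDA C sketch of a single range step of the classic narrow-angle splitting (Tappert, Ref. [13]): psi(r+dr) = A * IFFT{ B * FFT{ psi(r) } }, with A = exp(i k0 dr (n^2(z) - 1)/2) applied in the depth domain and B = exp(-i kz^2 dr / (2 k0)) applied in the vertical-wavenumber domain. The wider-angle splittings actually implemented in GPU-SSFPE use different forms of A and B; kernel and variable names here are illustrative.

// Hedged sketch: one range step of a narrow-angle split-step Fourier PE on the
// GPU, using CUFFT and two phase-screen kernels. Not the GPU-SSFPE source;
// starter fields, boundaries, absorbing layers, and the wider-angle splittings
// are omitted, and sign conventions depend on the assumed time dependence.
#include <cuda_runtime.h>
#include <cufft.h>
#include <stdlib.h>

#define PI_F 3.14159265358979f

__device__ cufftComplex cmul(cufftComplex a, cufftComplex b)
{
    cufftComplex r; r.x = a.x * b.x - a.y * b.y; r.y = a.x * b.y + a.y * b.x; return r;
}

// B operator (wavenumber domain): exp(-i kz^2 dr / (2 k0)), with a 1/N factor
// folded in to undo CUFFT's unnormalized inverse transform.
__global__ void apply_B(cufftComplex* psi, int n, float dz, float dr, float k0)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int m = (i <= n / 2) ? i : i - n;                  // FFT bin -> signed mode index
    float kz = 2.0f * PI_F * m / (n * dz);
    float ph = -kz * kz * dr / (2.0f * k0);
    cufftComplex w = { cosf(ph) / n, sinf(ph) / n };
    psi[i] = cmul(psi[i], w);
}

// A operator (depth domain): exp(i k0 dr (n^2(z) - 1) / 2).
__global__ void apply_A(cufftComplex* psi, const float* n2, int n, float dr, float k0)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float ph = 0.5f * k0 * dr * (n2[i] - 1.0f);
    cufftComplex w = { cosf(ph), sinf(ph) };
    psi[i] = cmul(psi[i], w);
}

void ssfpe_step(cufftHandle plan, cufftComplex* d_psi, const float* d_n2,
                int n, float dz, float dr, float k0)
{
    int threads = 256, blocks = (n + threads - 1) / threads;
    cufftExecC2C(plan, d_psi, d_psi, CUFFT_FORWARD);   // depth -> vertical wavenumber
    apply_B<<<blocks, threads>>>(d_psi, n, dz, dr, k0);
    cufftExecC2C(plan, d_psi, d_psi, CUFFT_INVERSE);   // wavenumber -> depth
    apply_A<<<blocks, threads>>>(d_psi, d_n2, n, dr, k0);
}

int main(void)
{
    const int n = 16384;                               // 16k-point depth grid
    const float dz = 0.5f, dr = 5.0f, f = 100.0f, c0 = 1500.0f;
    const float k0 = 2.0f * PI_F * f / c0;

    cufftComplex* d_psi;  float* d_n2;
    cudaMalloc(&d_psi, n * sizeof(cufftComplex));
    cudaMalloc(&d_n2, n * sizeof(float));
    cudaMemset(d_psi, 0, n * sizeof(cufftComplex));    // placeholder; load a starter field here

    float* h_n2 = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h_n2[i] = 1.0f;        // isovelocity placeholder, n^2(z) = 1
    cudaMemcpy(d_n2, h_n2, n * sizeof(float), cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    for (int step = 0; step < 1000; ++step)            // march out in range
        ssfpe_step(plan, d_psi, d_n2, n, dz, dr, k0);
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_n2);  cudaFree(d_psi);  free(h_n2);
    return 0;
}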
I implemented the following starter fields: the Gaussian starter, the Greene starter, a normal mode starter, the Thomson starter (see [17]), the self-starter (see [16]), and a user-provided starter field read from a file. My implementation uses a reduced-pressure formulation in which the reduced pressure is stepped in range and then multiplied by the square root of the density before being output. Using this reduced-pressure form enables the density change at the water-seabed interface to be handled properly. The model inputs are provided through an ASCII input spec that is based on the RAM input spec (RAM can be run off of this spec), but expanded to specify the various starter fields. The operator splitting and the number of threads to use on the GPU are specified using command line arguments. The model outputs a complex pressure field and, optionally, all of the intermediate results generated in each range step, if needed for debugging purposes.

The current implementation assumes the sound speed is fixed and the bathymetry is smoothly varying. The GPU-friendly way to accommodate a range-dependent sound speed is to download a sparse representation of it to the GPU and to use GPU-resident interpolation to produce the finely sampled profiles needed by the model as it marches out in range. I was initially planning to do this by loading the sound speed into texture memory and using the hardware-accelerated interpolation that texture memory provides; all the waveguide specs can be handled in a similar manner. However, the most recent generation of NVIDIA hardware has features that reduce the advantage of using texture memory, so I have set this enhancement aside until a clearer consensus emerges. Currently, all the i/o between the host and the GPU is hidden by performing GPU calculations, host calculations, and the data transfers in parallel.

I benchmarked GPU-SSFPE against historical benchmarks. In making these comparisons I was able to duplicate the historical results, even to a fault (e.g., the leaky duct problem encountered by the Thomson-Chapman formulation, which was fixed by using a different c0). However, I discovered a discrepancy that is fundamental to the SSFPE for problems with significant density and sound speed contrasts. For an ocean-seabed boundary with cp_water = 1500 m/s, cp_seabed = 1700 m/s, and a density ratio of 1.5, results at long range (>15 km) in shallow water exhibited a shift in range of the TL pattern when compared with either RAM or Scooter. Kevin Smith graciously provided the source code for the MMPE model, which is also a split-step Fourier model. Comparisons with MMPE on this and other cases confirmed that this phase offset is endemic to the split-step Fourier PE. Kevin and I found that the discrepancy could be fixed by slightly increasing the waveguide depth, but I have not found a prescriptive solution that can be applied without already having a benchmark solution in hand. My hypothesis is that the phase offset is caused by the smoothing function applied to the density contrast at the ocean-seabed interface, so I have been running other models with the same artificial smoothing function added to the medium in order to understand how this discrepancy can be overcome.
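Split-step Fourier PE codes commonly replace the density discontinuity at the seabed with a smooth transition over a few grid points, and the hypothesis above concerns exactly this smoothing. As a hedged illustration only (not the specific smoothing function used in GPU-SSFPE or MMPE), a hyperbolic-tangent transition of width L centered at the interface depth zb looks like the following; varying L and re-running the benchmark comparisons is one way to test whether the range offset tracks the artificial smoothing.

// Hedged sketch: a smoothed density profile across the water-seabed interface,
//   rho(z) = rho_w + (rho_b - rho_w) * 0.5 * (1 + tanh((z - zb)/L)).
// Illustrative only; the smoothing actually used by GPU-SSFPE or MMPE may differ.
#include <math.h>
#include <stdio.h>

static double smoothed_density(double z, double zb, double L,
                               double rho_water, double rho_seabed)
{
    return rho_water + (rho_seabed - rho_water) * 0.5 * (1.0 + tanh((z - zb) / L));
}

int main(void)
{
    const double zb = 300.0;     // interface depth (m)
    const double L  = 2.0;       // smoothing length scale (m), a few grid cells
    for (double z = 290.0; z <= 310.0; z += 2.0)
        printf("z = %5.1f m  rho = %.3f g/cm^3\n",
               z, smoothed_density(z, zb, L, 1.0, 1.5));
    return 0;
}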
RESULTS

I have learned how to adapt algorithms implemented on single- and multi-core CPUs to GPUs. As optimized building blocks such as a parallelized tri-diagonal linear solver (used in the implicit finite-difference and split-step Pade PE models, as well as in the wavenumber integral models) become available, I will be able to produce a unified suite of PE models, all running on GPUs and all using the same input spec and output formats.
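For reference, the tri-diagonal solve mentioned above is, in its serial form, the classic Thomas algorithm sketched below; the GPU building blocks referred to (e.g., Ref. [6]) parallelize this recurrence using techniques such as cyclic reduction. This is a generic textbook routine shown only to make the building block concrete, not code from any of the models named here.

// Hedged sketch: serial Thomas algorithm for a tridiagonal system
//   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],  i = 0..n-1 (a[0] = c[n-1] = 0).
#include <stdio.h>

static void thomas_solve(int n, const double* a, const double* b,
                         const double* c, const double* d, double* x, double* cp)
{
    cp[0] = c[0] / b[0];
    x[0]  = d[0] / b[0];
    for (int i = 1; i < n; ++i) {                    // forward elimination
        double m = b[i] - a[i] * cp[i - 1];
        cp[i] = c[i] / m;
        x[i]  = (d[i] - a[i] * x[i - 1]) / m;
    }
    for (int i = n - 2; i >= 0; --i)                 // back substitution
        x[i] -= cp[i] * x[i + 1];
}

int main(void)
{
    // Small 4x4 example: -x[i-1] + 2 x[i] - x[i+1] = f[i]; exact answer is all ones.
    double a[4] = {  0, -1, -1, -1 };
    double b[4] = {  2,  2,  2,  2 };
    double c[4] = { -1, -1, -1,  0 };
    double d[4] = {  1,  0,  0,  1 };
    double x[4], work[4];
    thomas_solve(4, a, b, c, d, x, work);
    for (int i = 0; i < 4; ++i) printf("x[%d] = %g\n", i, x[i]);
    return 0;
}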
Timing comparisons were performed on three range-dependent cases, shown below, which I have labeled as: case m5rd, a Munk profile at 400 Hz in which the water depth decreases smoothly from 5 km to 4.5 km; case dickins, with a shallow duct and a 3 km water column depth, interrupted by a seamount, at 230 Hz; and case w500, modeling upslope propagation in a wedge at 500 Hz.

Figure 1: Case m5rd, range-dependent Munk profile, source at 150 m depth, at 400 Hz, zmax=7000, dz= , and fft_sz=16k.

Figure 2: Case dickins, Dickins seamount, source at 18 m depth, at 230 Hz, zmax=4500, dz= , and fft_sz=16k.
Figure 3: Case w500, upslope propagation in an isovelocity wedge, source at 40 m depth, at 500 Hz, zmax=200, dz=0.0625, and fft_sz=6.4k.

The following tables show the results of running the CPU and GPU versions of the three cases described above with varying numbers of cores or threads. The four systems on which the timing comparisons were performed have CPUs and GPUs of varying capabilities and vintages. The GFLOPs/sec numbers do not include all the calculations, only those associated with applying the two operators as described above, so these numbers are underestimates. Nevertheless, these measurements can be used to compare the CPU and GPU capabilities.

Table 1: Laptop, Intel Core i7 quad core (4 cores, 8 threads), Q3 '09, with GeForce GTS 360M (previous generation, mobile version), 96 cores, 1.32 GHz, CUDA 4.0. [Table layout: run time in seconds, GFLOPs/sec, and GPU/CPU ratio, each for the GPU and the CPU at varying thread counts; rows are cases m5rd, dickins, and w500. Numeric entries not recovered.]

Table 2: Desktop, Intel Core 2 Quad (4 cores, 4 threads), Q1 '08, with GeForce GTX 460 (current generation), 336 cores, 1.35 GHz, CUDA 4.0. [Same layout as Table 1; numeric entries not recovered.]
Table 3: Desktop, Intel Core i7 (4 cores, 8 threads), Q2 '09, with GeForce GTX 460 (current generation), 336 cores, 1.35 GHz, CUDA 4.0. [Same layout as Table 1; numeric entries not recovered.]

Table 4: Server, dual-socket Intel Xeon X-series, quad core (8 cores, 16 threads total), Q1 '09, with GeForce GTX 285 (previous generation), 240 cores, 1.48 GHz, CUDA 3.2. [Same layout as Table 1; numeric entries not recovered.]

I have produced an accelerated SSFPE model, running many times faster than a comparable CPU-based version, that can be used as a drop-in replacement for CPU-based models, including UMPE/MMPE and RAM. Both versions were coded in C, using nearly identical code for the mathematics, except for the calls to the CUDA FFT library on the GPU and to FFTW on the CPU. A speedup of this magnitude represents a significant amount of acceleration, especially considering that the hardware required to realize it is readily available in off-the-shelf laptops, desktops, and servers, is updated more frequently than most CPUs (every 6-9 months), and costs significantly less than a typical desktop or laptop computer ($250-$500). Higher-end models, with up to 6 GB of RAM, full double-precision capability, and ECC RAM, are available for more than $2000. To assess the correctness of GPU-SSFPE, I have compared its outputs with those of Scooter (see [18]), a wavenumber integration model, and of RAM, a split-step Pade PE model (see [4]).
Figure 4: Pekeris waveguide comparison at 100 Hz between Scooter and our split-step Fourier PE, with sound speeds of 1500 m/s (water) and 1550 m/s (seabed), seabed density 1.5, attenuation of 0.5 dB/wavelength, 300 m water depth, and source depth of 180 m, with a line-plot TL comparison for a receiver at 100 m depth.
Figure 5: Wedge comparison at 25 Hz between RAM and our split-step Fourier PE, with sound speeds of 1500 m/s (water) and 1700 m/s (seabed), no density contrast at the interface, attenuation of 0.5 dB/wavelength, water depth starting at 300 m and then decreasing from 300 m to 120 m between ranges of 5 km and 12.5 km (a slope of 1.3 degrees), and source depth of 180 m, with a line-plot TL comparison for a receiver at 100 m depth.

IMPACT/APPLICATIONS

The impact of GPU technology on high-performance computing is clear. Clusters of computing nodes with multi-core CPUs are being augmented with GPUs; for example, a 4U dual-socket server can be outfitted with eight GPU cards to greatly increase the GFLOPs per unit volume and per watt. Interestingly, GPUs are being considered for embedded applications as well, because their GFLOPs per watt is competitive. To realize the potential benefits of this technology, it will be important to adapt existing and emerging acoustic models (e.g., 3D models) to the GPU architecture.

RELATED PROJECTS

I have presented these results at the HIFAST conference, and we are hoping to obtain funding to accelerate reverberation and target scattering models using GPU technology.
REFERENCES

1. Steinar H. Gunderson. GPUwave (an earlier GPU implementation, using low-level shader functions, prior to the availability of CUDA).
2. Suzanne T. McDaniel and Ding Lee. A finite-difference treatment of interface conditions for the parabolic wave equation: The horizontal interface. The Journal of the Acoustical Society of America, 71(4):855, 1982.
3. Ding Lee and Suzanne T. McDaniel. A finite-difference treatment of interface conditions for the parabolic wave equation: The irregular interface. The Journal of the Acoustical Society of America, 73(5):1441, 1983.
4. Michael D. Collins. A split-step Pade solution for the parabolic equation method. The Journal of the Acoustical Society of America, 93(4):1736, 1993.
5. Michael D. Collins. Generalization of the split-step Pade solution. The Journal of the Acoustical Society of America, 96(1):382, 1994.
6. Yao Zhang, Jonathan Cohen, and John D. Owens. Fast tridiagonal solvers on the GPU. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10), pages 127-136, New York, NY, 2010. ACM.
7. M. Frigo and S. G. Johnson. The Design and Implementation of FFTW3. Proceedings of the IEEE, 93(2):216-231, 2005.
8. Jason Sanders and Edward Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 2010.
9. David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010.
10. NVIDIA CUDA C Programming Guide. NVIDIA, v4.0 edition, 2011.
11. CUDA C Best Practices Guide. NVIDIA, v4.0 edition, 2011.
12. CUDA CUFFT Library. NVIDIA, v1.0 edition.
13. Frederick Tappert. The parabolic approximation method. In Wave Propagation and Underwater Acoustics, edited by J. B. Keller and J. S. Papadakis, Lecture Notes in Physics, Vol. 70, pages 224-287. Springer-Verlag, New York, NY, 1977.
14. Finn B. Jensen, William A. Kuperman, Michael B. Porter, and Henrik Schmidt. Computational Ocean Acoustics. Springer-Verlag, New York, NY.
15. D. J. Thomson and N. R. Chapman. A wide-angle split-step algorithm for the parabolic equation. The Journal of the Acoustical Society of America, 74(6):1848, 1983.
16. Michael D. Collins. A self-starter for the parabolic equation method. The Journal of the Acoustical Society of America, 92(4):2069, 1992.
17. David J. Thomson. Wide-angle parabolic equation solutions to two range-dependent benchmark problems. The Journal of the Acoustical Society of America, 87(4):1514, 1990.
18. Michael B. Porter. The Acoustic Toolbox (this modeling suite contains the Scooter wavenumber integration model used as a benchmark for our split-step Fourier PE model).
19. Michael D. Collins. Users guide for RAM, versions 1.0 and 1.0p.

PUBLICATIONS

1. Kai Tu, Dario Fertonani, Tolga M. Duman, Milica Stojanovic, John G. Proakis, and Paul Hursky. Mitigation of Intercarrier Interference for OFDM over Time-Varying Underwater Acoustic Channels. IEEE Journal of Oceanic Engineering, 36(1), 2011.
2. Aijun Song, A. Abdi, M. Badiey, and P. Hursky. Experimental Demonstration of Underwater Acoustic Communication by Vector Sensors. IEEE Journal of Oceanic Engineering, 36(3), July 2011.
3. Paul Hursky and Michael B. Porter. Implementation of a Split-Step Fourier Parabolic Equation Model on Graphic Processing Units. 4th Underwater Acoustic Measurements Conference (UAM 2011), Kos, Greece, June 2011.
4. Paul Hursky, Ahmad T. Abawi, and Michael B. Porter. Benefit of Two-Dimensional Aperture in Matched-Field Processing. 4th Underwater Acoustic Measurements Conference (UAM 2011), Kos, Greece, June 2011.
5. Paul Hursky and Michael B. Porter. Accelerating Underwater Acoustic Propagation Modeling Using General Purpose Graphic Processing Units. OCEANS '11 IEEE/MTS Kona, Kona, Hawaii, September 2011.
