College of William & Mary Department of Computer Science
Technical Report WM-CS

Implementing the Dslash Operator in OpenCL

Andy Kowalski, Xipeng Shen
Department of Computer Science, The College of William and Mary
February 3, 2010
Abstract

The Dslash operator is used in Lattice Quantum Chromodynamics (LQCD) applications to implement a Wilson-Dirac sparse matrix-vector product. Typically the Dslash operation has been implemented as a parallel program. Today's Graphics Processing Units (GPUs) are designed to do highly parallel numerical calculations for 3D graphics rendering. This design works well with scientific applications such as LQCD's implementation of the Dslash operator. The Scientific Computing group at the Thomas Jefferson National Accelerator Facility (Jefferson Lab) has implemented the Dslash operator for execution on GPUs using NVIDIA's Compute Unified Device Architecture (CUDA). CUDA applications, however, will only run on NVIDIA hardware. OpenCL (Open Computing Language) is a new open standard for developing parallel programs across CPUs, GPUs and other processors. This paper describes the implementation of the Dslash operator using OpenCL, its performance on NVIDIA GPUs compared with CUDA, and its performance on other hardware platforms.

General Terms: Performance, Languages
Keywords: GPU, OpenCL, CUDA, LQCD, Parallel Programming

1. Introduction

Computer simulations of Quantum Chromodynamics (QCD) using a space-time lattice (also known as lattice QCD or LQCD) play an important role in calculations of quantities involving the strong nuclear force, and are vital underpinnings of research into High Energy Physics, Nuclear Physics and the search for physics beyond the Standard Model of particle interactions. These calculations are highly computationally demanding, often running on the largest supercomputing facilities in the world. The efficiency of the algorithms can determine both the accuracy of the calculation and the scale of the problems that can be tackled. The constant demand for higher efficiency has stimulated a large body of research in both algorithm design and software optimization [4,6,7,8,9,10].
Recent developments in Graphics Processing Units (GPUs) offer many new, remarkable opportunities for high performance computing. Thanks to their tremendous throughput and very high bandwidth, GPUs have emerged as an appealing alternative to CPUs for scientific applications, bringing speedups of up to several orders of magnitude [3, 5, 12]. The goal of this project is to examine the potential of GPUs for enhancing the computing efficiency of a specific implementation of QCD, called Lattice Quantum Chromodynamics (LQCD). LQCD uses a Dslash operator to implement a Wilson-Dirac sparse matrix-vector product. The base implementation used in this project comes from the USQCD community [14]; it has gone through a decade of evolution over many generations of CPU architectures. The Scientific Computing group at the Thomas Jefferson National Accelerator Facility (Jefferson Lab) has recently implemented the Dslash operator with NVIDIA's Compute Unified Device Architecture (CUDA), and showed a clear performance improvement over execution on CPUs. The existing CUDA implementation, however, is specific to the NVIDIA GPU architecture. OpenCL (Open Computing Language) is an open standard for general-purpose parallel programming across various CPUs, GPUs and other processors [13]. Unlike CUDA, OpenCL code can be compiled for execution on NVIDIA and ATI graphics cards as well as general-purpose CPUs. It has rapidly drawn industry attention; for instance, Apple has provided support for it in Mac OS X 10.6. In this project, we implement the Dslash operator in OpenCL and evaluate its programmability, portability and performance on a variety of hardware.

2. Dslash and LQCD

In LQCD, the Dslash operator forms part of the Wilson fermion matrix that describes the interaction of quark fields and the vacuum gluon fields between two points in space-time.
The inverse of the fermion matrix is required in several places in lattice QCD simulations, including the computation of (valence) quark propagators, the evaluation of the Hamiltonian, and the Molecular Dynamics forces in Hybrid Molecular Dynamics Monte Carlo algorithms [6,9]. At the start of a simulation, the fermions and the gauge fields between the fermions are in a random state. Over time, the system settles into a state of equilibrium. Physicists can deduce physical properties from the equilibrium configurations. The resulting configurations can themselves be saved to conduct other calculations; in that case a valence particle is introduced into the configuration and the simulation predicts interactions. The Dslash operator is used in LQCD to efficiently solve Dirac equations; here, the implementation solves the equation shown in Figure 1 below.

Figure 1. [2]
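The equation in Figure 1 did not survive extraction as an image. As a sketch only (the exact sign and normalization conventions follow [2]), the Wilson Dslash term described below conventionally takes the form:

```latex
% Sketch of the Wilson Dslash term; signs and normalization per [2].
% U are the gauge link fields, gamma_mu the Dirac spin matrices, and
% the delta operators shift the fermion field between neighboring sites.
D_{x,y} \;=\; \sum_{\mu}
  \Big[\, U_{x,\mu}\,(1-\gamma_\mu)\,\delta_{x+\hat\mu,\,y}
  \;+\; U^{\dagger}_{x-\hat\mu,\,\mu}\,(1+\gamma_\mu)\,\delta_{x-\hat\mu,\,y} \Big]
```

Each symbol here is defined term-by-term in the discussion that follows the figure.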
Here µ iterates over the space-time dimensions of the lattice (x, y, z and t). The γ_µ are 4x4 Dirac spin matrices. For each sum of the terms there is a forward and a backward direction from each site. In the forward direction, the link between two sites x and x+µ̂ has an associated gauge link field U_{x,µ}, which is a 3x3 complex matrix. Each site also has an associated fermion field ψ. The δ_{x+µ̂,y} operator denotes a shift of the ψ field from site x+µ̂ to site x, which is required to form the numerical derivative terms. Similarly, in the backward direction, each link between two sites x and x−µ̂ has an associated gauge link field U_{x−µ̂,µ} and a δ_{x−µ̂,y} operator. Visually, the formula looks like Figure 2 below.

Figure 2.

In Figure 2, the spinor fields ψ from the Dirac equation are the 4x1 matrices whose elements are 3x1 matrices of color (c). The 4x4 matrices whose elements are U (3x3 matrices) are the product of the projectors (1−γ or 1+γ, depending on direction) and U. The algorithm for Dslash is rather compute intensive: it requires 1320 floating point operations (flops) to perform a Dslash operation on a single site. Since the same calculation is done for each site, it is a good candidate for parallelization and execution on Single Instruction Multiple Data (SIMD) processors like today's GPUs.

3. OpenCL

OpenCL (Open Computing Language) is an open standard for general-purpose parallel programming across CPUs, GPUs and other processors [13]. It is based on C99 and abstracts the specifics of the underlying hardware [11]. OpenCL programs are portable in that they can be recompiled with little to no change to the source code and executed on different hardware. As an example, a program can be created and tested on a CPU; once working, the same code can be compiled and executed on a GPU.

In OpenCL, as in CUDA, the functions executed in parallel are called kernels. Kernels can be either data parallel or task parallel, and can be either compiled from source during program execution or precompiled in advance. Kernels are executed as work items (individual threads) within a work group. A work group is a 1, 2 or 3 dimensional index of work items executed on a single compute unit [13].

The memory model in OpenCL consists of private, local, constant, and global memory. Private memory is accessible only by an individual work item; it is the fastest memory to access, but the smallest in size. Local memory is shared among all the work items within a single work group; access to local memory is faster than to global memory, but it is also limited in size. Constant memory is a region of global memory that does not change during the execution of a kernel. Global memory is the largest of all the memory types and is accessible by all work items within a context.

There are five main steps to executing an OpenCL kernel: initialization, allocation of resources, creation of the program and kernel, execution, and cleanup. During initialization, compute devices are chosen and contexts and command queues are created. Command queues hold the commands to be executed on a compute device; they can be configured to submit the commands to the device either in order or out of order, and the commands can be kernels to execute or reads/writes from/to device memory. The allocation of resources involves creating input and output buffers on the device and copying any input data into its buffer. Creation of the program and kernel involves loading the kernel source code or precompiled binary into a program object; if the kernel is loaded as source code, it needs to be compiled, and a kernel object is then created from the compiled program. Execution involves setting arguments for the kernel and inserting the kernel into a command queue for execution. Then any result buffers are read back into host memory. Lastly, tear-down is done by releasing (freeing) memory objects, kernels, programs, command queues, and contexts.
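The five steps above can be sketched as host-side C code. This is our own minimal illustration, not the paper's Dslash host code: the trivial "scale" kernel and all data sizes are invented for the example, error checking is collapsed, and running it requires an OpenCL SDK and device.

```c
/* Minimal sketch of the five host-side steps; "scale" kernel and sizes
 * are illustrative only. Requires an OpenCL runtime to execute. */
#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

static const char *src =
    "__kernel void scale(__global float *v, float a) {"
    "    size_t i = get_global_id(0);"
    "    v[i] *= a;"
    "}";

int main(void) {
    cl_int err;

    /* 1. Initialization: choose a device, create context and command queue. */
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

    /* 2. Resource allocation: create a device buffer, copy input data in. */
    float host[1024];
    for (int i = 0; i < 1024; i++) host[i] = (float)i;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(host), host, &err);

    /* 3. Program/kernel creation: compile the source at run time. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", &err);

    /* 4. Execution: set arguments, enqueue, read results back to the host. */
    float a = 2.0f;
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(k, 1, sizeof(float), &a);
    size_t global = 1024, local = 64;  /* work-group size 64, as in Section 6 */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(host), host, 0, NULL, NULL);
    printf("host[3] = %f\n", host[3]);

    /* 5. Cleanup: release objects, roughly in reverse order of creation. */
    clReleaseKernel(k); clReleaseProgram(prog); clReleaseMemObject(buf);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```

The Dslash host program follows the same skeleton, with the lattice gauge and spinor fields in place of the single buffer here.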
4. CUDA and OpenCL Language Differences

CUDA and OpenCL differ in terminology, but they share the same concepts and design in terms of how work is laid out. Table 1 shows the general differences in the terminology used to describe the work models of CUDA and OpenCL.

CUDA              OpenCL
Thread            Work-item
Thread Block      Work-group
Global Memory     Global Memory
Constant Memory   Constant Memory
Shared Memory     Local Memory
Local Memory      Private Memory

Table 1. [1]

There are also differences in accessing index information within kernels: CUDA uses predefined variables where OpenCL uses function calls [1]. Table 2 shows the equivalent constructs in CUDA and OpenCL for accessing index information.

CUDA                                        OpenCL
gridDim                                     get_num_groups()
blockDim                                    get_local_size()
blockIdx                                    get_group_id()
threadIdx                                   get_local_id()
no direct equivalent (combine blockDim,
  blockIdx, and threadIdx to calculate
  a global index)                           get_global_id()
no direct equivalent (combine gridDim
  and blockDim to calculate the
  global size)                              get_global_size()

Table 2. [1]

Other naming differences exist between the APIs for similar object definitions and functions, but functionally CUDA does not provide anything equivalent to OpenCL's command queues, which hold the commands to be executed on a compute device. Additionally, OpenCL provides task parallelism capabilities by allowing dependencies to be declared between tasks executing on a device [1].

5. Code Development

Mac OS X

Apple has included support for OpenCL in Mac OS X 10.6 and in the development tools that come with it. The development tools are not installed by default, but are included on the distribution DVD and freely available from the Apple website. To compile OpenCL programs, one only needs to add the -framework OpenCL argument to the Apple-provided gcc compiler. Apple's implementation of OpenCL also includes support for executing OpenCL kernels on both GPUs and the system's CPU.
Apple's Xcode editor recognizes OpenCL source, indenting it and showing different parts of the code in different colors. This makes development easier because types, variable names, comments, etc. can be identified quickly.

NVIDIA SDK on Linux

Linux distributions do not include OpenCL support. To support OpenCL on NVIDIA GPU devices, a suitable version of the NVIDIA device driver must be installed on the system, and CUDA must also be installed. NVIDIA's Software Development Kit (SDK) for GPU programming, version 2.3b, includes support for OpenCL. To compile OpenCL code, one must install the SDK, set the LD_LIBRARY_PATH environment variable to include the path to the library included with the SDK, and set the PATH environment variable to include the path to the CUDA bin directory. NVIDIA's OpenCL SDK is integrated with CUDA, so OpenCL kernels compiled using the NVIDIA SDK are said to perform as well as their equivalent CUDA kernels.

ATI SDK on Linux

At the time of this writing, ATI provides beta versions of its device driver and SDK for OpenCL support. To compile OpenCL code, one must install the SDK, set the ATISTREAMSDKROOT environment variable to the path of the SDK installation, set the LD_LIBRARY_PATH environment variable to include the path to the library included with the SDK, and set the PATH environment variable to include the path to the bin directory in the SDK.
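The environment setup described above can be sketched as shell configuration. All install paths below are placeholders, not the paths from the paper; substitute the actual SDK locations on your system.

```shell
# Mac OS X: no environment setup needed; compile directly with the
# Apple-provided gcc (source file name is illustrative):
#   gcc -framework OpenCL dslash.c -o dslash

# NVIDIA SDK on Linux (paths are placeholders for the real install dirs):
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/path/to/NVIDIA_GPU_Computing_SDK/OpenCL/common/lib:$LD_LIBRARY_PATH

# ATI Stream SDK on Linux (beta at the time of writing):
export ATISTREAMSDKROOT=/path/to/ati-stream-sdk
export PATH=$ATISTREAMSDKROOT/bin:$PATH
export LD_LIBRARY_PATH=$ATISTREAMSDKROOT/lib:$LD_LIBRARY_PATH
```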
Getting the SDK installed such that one could compile was straightforward and successful. Executing the kernels on the ATI GPUs, however, was not. The issue appears to be ATI's requirement of access to the X11 server for communicating with the GPU: remote logins use the X11 server on the remote machine and thus cannot find the GPU. There were also problems getting X11 to work with the device driver. Due to those problems, this project demonstrates the ability to compile OpenCL code with ATI's SDK, but demonstrates portability by executing the code on the CPU of a MacBook Pro.

6. Performance

The OpenCL implementation of Dslash was not fully optimized at the time of this writing. It does not take advantage of red-black labeling, a technique for implementing a complete Dslash operation using half the number of sites; the CUDA implementation, however, does use red-black labeling. Memory coalescing is used for the storage of both the inputs and outputs in the OpenCL implementation, but both are stored in global memory. Of the memory available on the GPUs for storing inputs and outputs in OpenCL, global memory has the worst performance; it proved, however, to be the easiest to use for getting this first OpenCL implementation working.

The performance of the algorithm is measured in effective flops. Since the existing CUDA implementation uses the red-black labeling technique, we measure effective flops using the following formula:

    (flops * num_sites/2 * num_loops) / seconds

Here, flops is the 1320 flops required per site; num_sites is the number of sites; num_loops is 200 (the number of loops performed); and seconds is the recorded time in seconds. We divide the number of sites by 2 because the existing CUDA version utilizes red-black labeling to perform the calculation.

Performance testing was conducted on the processors/systems shown in Table 3 below.

System        Operating System   GPU/Processor
MacBook Pro   Mac OS X 10.6      NVIDIA 9400M GT (16 cores, 256MB)
MacBook Pro   Mac OS X 10.6      Intel 2.4GHz Core 2 Duo (2GB)
SuperMicro    CentOS 5.3         NVIDIA Tesla C1060 (240 cores, 4GB)

Table 3.

NVIDIA 9400M GT

The performance on the NVIDIA 9400M GT was not impressive. This was to be expected, both because the graphics card is in use for the display and because of its limited number of cores and memory. Figure 3 below shows the observed performance.

Figure 3.

One can clearly see that the operation tops out at around 388 Mflops with a work group size of 64 and 512 threads. The program actually caused the screen to flicker, and errored out when executed with more than 512 threads.

Intel 2.4GHz Core 2 Duo

Executing the same code on the Intel 2.4GHz Core 2 Duo CPU of the MacBook was better. Figure 4 below shows the observed performance.

Figure 4.

The performance peaks at 1.0 Gflops. As the number of sites increases, one can see that performance begins to degrade past a certain point. It is believed that it is at this point that cache misses and context switching begin to degrade performance.
NVIDIA Tesla C1060

Next the program was executed on an NVIDIA Tesla C1060, which has 240 cores and 4 GBytes of memory. As expected, the Tesla was the top performer in our test group. It performed best with 64 threads per work group and peaked at about 27 Gflops. Although the Tesla C1060 can deliver even better performance, this is clearly a big gain for the application compared to its performance on a CPU. Figure 5 shows the observed performance.

Figure 5.

OpenCL vs CUDA Implementation

The last test compared the performance of the CUDA implementation of Dslash with that of the OpenCL implementation. Figure 6 shows the observed performance of each with a work group size (block size in CUDA) of 64 threads.

Figure 6.

Although the CUDA implementation outperformed the OpenCL implementation, the curves of the two graphs are very similar; both reach their peak once enough sites are processed. Recall, however, that the CUDA implementation uses red-black labeling, actually performing the Dslash operation on only half the sites and thus launching half the number of threads, while the OpenCL implementation currently uses one thread per site. So one improvement we can make to the OpenCL implementation is to adjust the algorithm so that it only has to execute on half the sites. (Anecdotally, when the OpenCL implementation was executed using half the sites, the performance was comparable; those results are not included in this paper because the result of the Dslash operator was incorrect.) Additionally, the CUDA implementation utilizes texture memory for the inputs, which provides better performance than the global memory used by the OpenCL implementation. So another performance gain can be expected if the OpenCL implementation is modified to use image memory (called texture memory in CUDA).

7. Future Work

The OpenCL implementation of Dslash discussed here does not make use of several of the performance optimizations applied to the CUDA implementation. In particular, utilizing local memory instead of global memory is a known performance enhancement. In terms of the algorithm itself, the CUDA implementation utilizes a trick known as red-black labeling to perform the calculation with only half the sites; hence, it launches half the number of threads of the current OpenCL implementation.

Additionally, performance testing of Dslash on an ATI GPU and comparing it to NVIDIA is of interest. The portability of OpenCL was demonstrated using CentOS, Mac OS X, two different GPUs and a traditional CPU; however, performance testing on ATI GPUs remains of interest to the Scientific Computing group at Jefferson Lab. If performance is comparable, future procurements of GPUs for LQCD applications by the Scientific Computing group can be competed between NVIDIA and ATI hardware without huge development efforts to port Dslash.

8. Conclusions

Implementing Dslash in OpenCL for the first time was a bit tedious. It is important to understand the different types of memory available and the limitations associated with each type. Once those are fully understood, however, the steps required to execute a kernel are mostly clear-cut. Additionally, once the algorithm for the Dslash operator was completely understood, porting the existing CUDA kernel to OpenCL proved time consuming but straightforward. The kernels are similar enough that developers using CUDA should not have a hard time porting them to OpenCL. The portability of OpenCL and its ability to execute on heterogeneous processor architectures is a big advantage.
It was very easy to move the Dslash implementation between Mac OS X and Linux and between CPU and GPU. This flexibility will allow the LQCD community to take advantage of advances in GPU designs from multiple vendors without the major software development efforts traditionally required when migrating to a new processor architecture. Although the Dslash operator in CUDA outperformed the OpenCL implementation, it is important to remember that the CUDA implementation includes optimizations, such as red-black labeling and the use of shared and texture memory on the GPU, that the OpenCL version lacks. It is expected that the OpenCL implementation will perform similarly once the same optimizations are applied.

It is clear that scientific applications like LQCD can take advantage of modern GPU designs. The performance observed on a single high-end graphics card, compared to a traditional CPU, makes such development efforts worthwhile. Implementing such applications in OpenCL adds value by providing portability, reducing development costs on heterogeneous processor architectures, and enabling the use of new hardware platforms sooner.

Acknowledgments

We thank Jie Chen for instruction on the existing CUDA implementation of Dslash, and Balint Joo for instruction on the Dslash operator, the physics, and LQCD. A discussion with Kostas Orginos triggered this exploration of the application of GPUs to LQCD.

References

[1] AMD, Inc. OpenCL and the ATI Stream SDK v2.0 (2009).
[2] Balint Joo (private communication).
[3] M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. A compiler framework for optimization of affine loop nests for GPGPUs. In ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing.
[4] M. A. Clark, R. Babich, K. Barros, R. C. Brower, and C. Rebbi. Solving Lattice QCD systems of equations using mixed precision solvers on GPUs. arXiv e-print [hep-lat].
[5] Y. Dotsenko, N. K. Govindaraju, P. Sloan, C. Boyd, and J. Manferdelli. Fast scan algorithms on graphics processors. In ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing.
[6] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Phys. Lett. B195.
[7] G. I. Egri et al. Lattice QCD as a video game. Comput. Phys. Commun. 177.
[8] J. Foley, K. J. Juge, A. O'Cais, M. Peardon, S. Ryan, and J.-I. Skullerud. Practical all-to-all propagators for lattice QCD. Comput. Phys. Commun. 172.
[9] S. A. Gottlieb, W. Liu, D. Toussaint, R. L. Renken, and R. L. Sugar. Hybrid molecular dynamics algorithms for the numerical simulation of quantum chromodynamics. Phys. Rev. D35.
[10] K. Ibrahim, F. Bodin, and O. Pène. Fine-grained parallelization of lattice QCD kernel routine on GPUs. Journal of Parallel and Distributed Computing, 68(10).
[11] Khronos Group. OpenCL - Parallel Computing for Heterogeneous Devices (2009).
[12] Y. Liu, E. Z. Zhang, and X. Shen. A cross-input adaptive framework for GPU programs optimization. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS).
[13] A. Munshi et al. The OpenCL Specification, Version 1.0, Revision 48. Technical report, Khronos OpenCL Working Group (2009).
[14] USQCD.
WS on Models, Algorithms and Methodologies for Hierarchical Parallelism in new HPC Systems The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
More informationL20: GPU Architecture and Models
L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.
More informationTurbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
More informationGPGPU accelerated Computational Fluid Dynamics
t e c h n i s c h e u n i v e r s i t ä t b r a u n s c h w e i g Carl-Friedrich Gauß Faculty GPGPU accelerated Computational Fluid Dynamics 5th GACM Colloquium on Computational Mechanics Hamburg Institute
More informationNVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist
NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get
More informationSAM XFile. Trial Installation Guide Linux. Snell OD is in the process of being rebranded SAM XFile
SAM XFile Trial Installation Guide Linux Snell OD is in the process of being rebranded SAM XFile Version History Table 1: Version Table Date Version Released by Reason for Change 10/07/2014 1.0 Andy Gingell
More informationST810 Advanced Computing
ST810 Advanced Computing Lecture 17: Parallel computing part I Eric B. Laber Hua Zhou Department of Statistics North Carolina State University Mar 13, 2013 Outline computing Hardware computing overview
More informationCOSCO 2015 Heterogeneous Computing Programming
COSCO 2015 Heterogeneous Computing Programming Michael Meyer, Shunsuke Ishikuro Supporters: Kazuaki Sasamoto, Ryunosuke Murakami July 24th, 2015 Heterogeneous Computing Programming 1. Overview 2. Methodology
More informationAutodesk Revit 2016 Product Line System Requirements and Recommendations
Autodesk Revit 2016 Product Line System Requirements and Recommendations Autodesk Revit 2016, Autodesk Revit Architecture 2016, Autodesk Revit MEP 2016, Autodesk Revit Structure 2016 Minimum: Entry-Level
More informationWriting Applications for the GPU Using the RapidMind Development Platform
Writing Applications for the GPU Using the RapidMind Development Platform Contents Introduction... 1 Graphics Processing Units... 1 RapidMind Development Platform... 2 Writing RapidMind Enabled Applications...
More informationMulti-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
More informationHPC Wales Skills Academy Course Catalogue 2015
HPC Wales Skills Academy Course Catalogue 2015 Overview The HPC Wales Skills Academy provides a variety of courses and workshops aimed at building skills in High Performance Computing (HPC). Our courses
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
More informationMixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms
Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms Björn Rocker Hamburg, June 17th 2010 Engineering Mathematics and Computing Lab (EMCL) KIT University of the State
More informationPERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0
PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0 15 th January 2014 Al Chrosny Director, Software Engineering TreeAge Software, Inc. achrosny@treeage.com Andrew Munzer Director, Training and Customer
More informationThe Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System
The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Qingyu Meng, Alan Humphrey, Martin Berzins Thanks to: John Schmidt and J. Davison de St. Germain, SCI Institute Justin Luitjens
More informationClustering Billions of Data Points Using GPUs
Clustering Billions of Data Points Using GPUs Ren Wu ren.wu@hp.com Bin Zhang bin.zhang2@hp.com Meichun Hsu meichun.hsu@hp.com ABSTRACT In this paper, we report our research on using GPUs to accelerate
More informationParallel Image Processing with CUDA A case study with the Canny Edge Detection Filter
Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Daniel Weingaertner Informatics Department Federal University of Paraná - Brazil Hochschule Regensburg 02.05.2011 Daniel
More informationLecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
More informationE6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices
E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,
More informationAchieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
More informationHow to choose a suitable computer
How to choose a suitable computer This document provides more specific information on how to choose a computer that will be suitable for scanning and post-processing your data with Artec Studio. While
More informationInstallation Guide. (Version 2014.1) Midland Valley Exploration Ltd 144 West George Street Glasgow G2 2HG United Kingdom
Installation Guide (Version 2014.1) Midland Valley Exploration Ltd 144 West George Street Glasgow G2 2HG United Kingdom Tel: +44 (0) 141 3322681 Fax: +44 (0) 141 3326792 www.mve.com Table of Contents 1.
More informationDesign and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures
Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures E Calore, S F Schifano, R Tripiccione Enrico Calore INFN Ferrara, Italy Perspectives of GPU Computing in Physics
More informationOpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,
More informationIntelligent Heuristic Construction with Active Learning
Intelligent Heuristic Construction with Active Learning William F. Ogilvie, Pavlos Petoumenos, Zheng Wang, Hugh Leather E H U N I V E R S I T Y T O H F G R E D I N B U Space is BIG! Hubble Ultra-Deep Field
More informationNVIDIA CUDA GETTING STARTED GUIDE FOR MICROSOFT WINDOWS
NVIDIA CUDA GETTING STARTED GUIDE FOR MICROSOFT WINDOWS DU-05349-001_v6.0 February 2014 Installation and Verification on TABLE OF CONTENTS Chapter 1. Introduction...1 1.1. System Requirements... 1 1.2.
More informationOptimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server
Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Technology brief Introduction... 2 GPU-based computing... 2 ProLiant SL390s GPU-enabled architecture... 2 Optimizing
More informationSeveral tips on how to choose a suitable computer
Several tips on how to choose a suitable computer This document provides more specific information on how to choose a computer that will be suitable for scanning and postprocessing of your data with Artec
More informationGraphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
More informationOpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
More informationOpenACC 2.0 and the PGI Accelerator Compilers
OpenACC 2.0 and the PGI Accelerator Compilers Michael Wolfe The Portland Group michael.wolfe@pgroup.com This presentation discusses the additions made to the OpenACC API in Version 2.0. I will also present
More informationSUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS - 2013 UPDATE
SUBJECT: SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE KEYWORDS:, CORE, PROCESSOR, GRAPHICS, DRIVER, RAM, STORAGE SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE Below is a summary of key components of an ideal SolidWorks
More informationHigh Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates
High Performance Computing (HPC) CAEA elearning Series Jonathan G. Dudley, Ph.D. 06/09/2015 2015 CAE Associates Agenda Introduction HPC Background Why HPC SMP vs. DMP Licensing HPC Terminology Types of
More informationHPC Cluster Decisions and ANSYS Configuration Best Practices. Diana Collier Lead Systems Support Specialist Houston UGM May 2014
HPC Cluster Decisions and ANSYS Configuration Best Practices Diana Collier Lead Systems Support Specialist Houston UGM May 2014 1 Agenda Introduction Lead Systems Support Specialist Cluster Decisions Job
More informationNVIDIA GeForce GTX 580 GPU Datasheet
NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines
More informationSystem Requirements Table of contents
Table of contents 1 Introduction... 2 2 Knoa Agent... 2 2.1 System Requirements...2 2.2 Environment Requirements...4 3 Knoa Server Architecture...4 3.1 Knoa Server Components... 4 3.2 Server Hardware Setup...5
More informationGPGPU Parallel Merge Sort Algorithm
GPGPU Parallel Merge Sort Algorithm Jim Kukunas and James Devine May 4, 2009 Abstract The increasingly high data throughput and computational power of today s Graphics Processing Units (GPUs), has led
More informationRetargeting PLAPACK to Clusters with Hardware Accelerators
Retargeting PLAPACK to Clusters with Hardware Accelerators Manuel Fogué 1 Francisco Igual 1 Enrique S. Quintana-Ortí 1 Robert van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores.
More informationAccelerating Intensity Layer Based Pencil Filter Algorithm using CUDA
Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Computer Engineering by Amol
More informationA general-purpose virtualization service for HPC on cloud computing: an application to GPUs
A general-purpose virtualization service for HPC on cloud computing: an application to GPUs R.Montella, G.Coviello, G.Giunta* G. Laccetti #, F. Isaila, J. Garcia Blas *Department of Applied Science University
More informationScalability and Classifications
Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static
More informationCS 147: Computer Systems Performance Analysis
CS 147: Computer Systems Performance Analysis CS 147: Computer Systems Performance Analysis 1 / 39 Overview Overview Overview What is a Workload? Instruction Workloads Synthetic Workloads Exercisers and
More informationSeveral tips on how to choose a suitable computer
Several tips on how to choose a suitable computer This document provides more specific information on how to choose a computer that will be suitable for scanning and postprocessing of your data with Artec
More informationHardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
More informationSPEEDUP - optimization and porting of path integral MC Code to new computing architectures
SPEEDUP - optimization and porting of path integral MC Code to new computing architectures V. Slavnić, A. Balaž, D. Stojiljković, A. Belić, A. Bogojević Scientific Computing Laboratory, Institute of Physics
More informationANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING
ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,
More informationExploiting GPU Hardware Saturation for Fast Compiler Optimization
Exploiting GPU Hardware Saturation for Fast Compiler Optimization Alberto Magni School of Informatics University of Edinburgh United Kingdom a.magni@sms.ed.ac.uk Christophe Dubach School of Informatics
More informationLe langage OCaml et la programmation des GPU
Le langage OCaml et la programmation des GPU GPU programming with OCaml Mathias Bourgoin - Emmanuel Chailloux - Jean-Luc Lamotte Le projet OpenGPU : un an plus tard Ecole Polytechnique - 8 juin 2011 Outline
More informationImplementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration
Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration Jinglin Zhang, Jean François Nezan, Jean-Gabriel Cousin, Erwan Raffin To cite this version: Jinglin Zhang,
More informationLBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:
More informationMONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA
MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS Julien Demouth, NVIDIA STAC-A2 BENCHMARK STAC-A2 Benchmark Developed by banks Macro and micro, performance and accuracy Pricing and Greeks for American
More informationPedraforca: ARM + GPU prototype
www.bsc.es Pedraforca: ARM + GPU prototype Filippo Mantovani Workshop on exascale and PRACE prototypes Barcelona, 20 May 2014 Overview Goals: Test the performance, scalability, and energy efficiency of
More informationOptimizing a 3D-FWT code in a cluster of CPUs+GPUs
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la
More informationParallel Computing with MATLAB
Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best
More informationRecommended hardware system configurations for ANSYS users
Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range
More informationBuilding a Top500-class Supercomputing Cluster at LNS-BUAP
Building a Top500-class Supercomputing Cluster at LNS-BUAP Dr. José Luis Ricardo Chávez Dr. Humberto Salazar Ibargüen Dr. Enrique Varela Carlos Laboratorio Nacional de Supercómputo Benemérita Universidad
More informationCLOUD GAMING WITH NVIDIA GRID TECHNOLOGIES Franck DIARD, Ph.D., SW Chief Software Architect GDC 2014
CLOUD GAMING WITH NVIDIA GRID TECHNOLOGIES Franck DIARD, Ph.D., SW Chief Software Architect GDC 2014 Introduction Cloud ification < 2013 2014+ Music, Movies, Books Games GPU Flops GPUs vs. Consoles 10,000
More informationMitglied der Helmholtz-Gemeinschaft. OpenCL Basics. Parallel Computing on GPU and CPU. Willi Homberg. 23. März 2011
Mitglied der Helmholtz-Gemeinschaft OpenCL Basics Parallel Computing on GPU and CPU Willi Homberg Agenda Introduction OpenCL architecture Platform model Execution model Memory model Programming model Platform
More informationGPU Profiling with AMD CodeXL
GPU Profiling with AMD CodeXL Software Profiling Course Hannes Würfel OUTLINE 1. Motivation 2. GPU Recap 3. OpenCL 4. CodeXL Overview 5. CodeXL Internals 6. CodeXL Profiling 7. CodeXL Debugging 8. Sources
More information