College of William & Mary Department of Computer Science


Technical Report WM-CS
College of William & Mary, Department of Computer Science

Implementing the Dslash Operator in OpenCL

Andy Kowalski, Xipeng Shen
Department of Computer Science, The College of William and Mary
February 3, 2010

Abstract

The Dslash operator is used in Lattice Quantum Chromodynamics (LQCD) applications to implement a Wilson-Dirac sparse matrix-vector product. Typically the Dslash operation has been implemented as a parallel program. Today's Graphics Processing Units (GPUs) are designed to perform highly parallel numerical calculations for 3D graphics rendering. This design suits scientific applications such as LQCD's implementation of the Dslash operator. The Scientific Computing group at the Thomas Jefferson National Accelerator Facility (Jefferson Lab) has implemented the Dslash operator for execution on GPUs using NVIDIA's Compute Unified Device Architecture (CUDA). CUDA applications, however, run only on NVIDIA hardware. OpenCL (Open Computing Language) is a new open standard for developing parallel programs across CPUs, GPUs and other processors. This paper describes an implementation of the Dslash operator using OpenCL, its performance on NVIDIA GPUs compared with CUDA, and its performance on other hardware platforms.

General Terms: Performance, Languages

Keywords: GPU, OpenCL, CUDA, LQCD, Parallel Programming

1. Introduction

Computer simulations of Quantum Chromodynamics (QCD) using a space-time lattice (also known as lattice QCD or LQCD) play an important role in calculations of quantities involving the strong nuclear force, and are vital underpinnings of research into High Energy Physics, Nuclear Physics and the search for physics beyond the Standard Model of particle interactions. These calculations are highly computationally demanding, often running on the largest supercomputing facilities in the world. The efficiency of the algorithms can determine both the accuracy of a calculation and the scale of the problems that can be tackled. The constant demand for higher efficiency has stimulated a large body of research in both algorithm design and software optimization [4,6,7,8,9,10].
Recent developments in Graphics Processing Units (GPUs) offer many new, remarkable opportunities for high performance computing. Thanks to their tremendous throughput and very high bandwidth, GPUs have emerged as an appealing alternative to CPUs for scientific applications, bringing speedups of up to several orders of magnitude [3, 5, 12]. The goal of this project is to examine the potential of GPUs for enhancing the computing efficiency of a specific implementation of QCD, called Lattice Quantum Chromodynamics (LQCD). LQCD uses a Dslash operator to implement a Wilson-Dirac sparse matrix-vector product. The base implementation used in this project comes from the USQCD community [14]. It has gone through a decade of evolution over many generations of CPU architectures. The Scientific Computing group at the Thomas Jefferson National Accelerator Facility (Jefferson Lab) has recently implemented the Dslash operator with NVIDIA's Compute Unified Device Architecture (CUDA), and showed clear performance improvement over execution on CPUs. The existing CUDA implementation, however, is specific to the NVIDIA GPU architecture. OpenCL (Open Computing Language) is an open standard for general-purpose parallel programming across various CPUs, GPUs and other processors [13]. Unlike CUDA, OpenCL code can be compiled for execution on NVIDIA and ATI graphics cards as well as general-purpose CPUs. It has rapidly drawn industry attention; for instance, Apple has provided support for it in Mac OS X 10.6. In this project, we implement the Dslash operator in OpenCL and evaluate its programmability, portability and performance on a variety of hardware.

2. Dslash and LQCD

In LQCD, the Dslash operator forms part of the Wilson fermion matrix that describes the interaction of quark fields and the vacuum gluon fields between two points in space-time.
The inverse of the fermion matrix is required in several places in lattice QCD simulations, including the computation of (valence) quark propagators, the evaluation of the Hamiltonian, and the Molecular Dynamics forces in Hybrid Molecular Dynamics Monte Carlo algorithms [6,9]. At the start of a simulation, the fermions and the gauge fields between the fermions are in a random state. Over time, the system settles into a state of equilibrium. Physicists can deduce physical properties from the equilibrium configurations. The resulting configurations can themselves be saved to conduct other calculations; in that case a valence particle is introduced into the configuration and the simulation predicts interactions. The Dslash operator is used in LQCD to efficiently solve Dirac equations. Here, the implementation solves the equation shown in Figure 1 below.

Figure 1. [2]
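The equation in Figure 1 is not reproduced by this transcription. In one common convention for the Wilson Dslash term (normalization and sign conventions vary between codes), with the symbols defined in the following paragraph, it reads:

```latex
D_{x,y} \;=\; \sum_{\mu=1}^{4}
  \Bigl[\, U_{x,\mu}\,(1-\gamma_\mu)\,\delta_{x+\hat\mu,\,y}
     \;+\; U^{\dagger}_{x-\hat\mu,\,\mu}\,(1+\gamma_\mu)\,\delta_{x-\hat\mu,\,y} \Bigr]
```

This is a reconstruction from the surrounding description, not the original figure.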

Here µ iterates over the space-time dimensions of the lattice (x, y, z and t). The γ_µ are 4x4 Dirac spin matrices. For each term of the sum there is a forward and a backward direction from each site. In the forward direction, the link between two sites x and x+µ̂ has an associated gauge link field U_{x,µ}, which is a 3x3 complex matrix. Each site also has an associated fermion field ψ. The δ_{x+µ̂,y} operator denotes a shift of the ψ field from site x+µ̂ to site x, which is required to form numerical derivative terms. Similarly, for the backward direction, each link between two sites x and x−µ̂ has an associated gauge link field U_{x−µ̂,µ} and a δ_{x−µ̂,y} operator. Visually, the formula looks like Figure 2 below.

Figure 2.

In Figure 2, the spinor fields ψ from the Dirac equation are the 4x1 matrices containing elements of 3x1 matrices of color (c). The 4x4 matrices containing elements of U (3x3 matrices) are the product of the projectors (1−γ or 1+γ, depending on direction) and U. The algorithm for Dslash is rather compute intensive: it requires 1320 floating point operations (flops) to perform a Dslash operation on a single site. Because the same calculation is done for each site, it is a good candidate for parallelization and execution on Single Instruction Multiple Data (SIMD) processors like today's GPUs.

3. OpenCL

OpenCL (Open Computing Language) is an open standard for general-purpose parallel programming across CPUs, GPUs and other processors [13]. It is based on C99 and abstracts the specifics of the underlying hardware [11]. OpenCL programs are portable in that they can be recompiled with little to no change to the source code and executed on different hardware. As an example, a program can be created and tested on a CPU; once working, the same code can be compiled and executed on a GPU.

In OpenCL, as in CUDA, the functions executed in parallel are called kernels. Kernels can be either data or task parallel, and can be either compiled from source during program execution or precompiled in advance. Kernels are executed as work items (individual threads) within a work group. A work group is a 1-, 2- or 3-dimensional index of work items executed on a single compute unit [13].

The memory model in OpenCL consists of private, local, constant, and global memory. Private memory is accessible only by an individual work item; it is the fastest memory to access, but the smallest in size. Local memory is shared among all the work items within a single work group; access to local memory is faster than to global memory, but it is also limited in size. Constant memory is a region of global memory that does not change during the execution of a kernel. Global memory is the largest of all the memory types and is accessible by all work items within a context.

There are five main steps to executing an OpenCL kernel: initialization, allocation of resources, creation of the program and kernel, execution, and cleanup. During initialization, compute devices are chosen and contexts and command queues are created. Command queues hold the commands to be executed on a compute device; they can be configured to submit commands to the compute device either in order or out of order, and the commands can be kernels to execute or reads/writes from/to device memory. The allocation of resources involves creating input and output buffers on the device and copying any input data into its buffer. Creation of the program and kernel involves loading the kernel source code or precompiled binary into a program object; if the kernel is loaded as source code, it must be compiled, and a kernel object is then created from the compiled program. Execution involves setting arguments for the kernel and inserting the kernel in a command queue for execution; any result buffers are then read back into host memory. Lastly, cleanup is done by releasing (freeing) memory objects, kernels, programs, command queues, and contexts.
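The five steps above map onto the standard OpenCL host API roughly as follows (an outline only; arguments and error checking omitted):

```
1. Initialization:        clGetPlatformIDs, clGetDeviceIDs,
                          clCreateContext, clCreateCommandQueue
2. Resource allocation:   clCreateBuffer, clEnqueueWriteBuffer
3. Program and kernel:    clCreateProgramWithSource (or ...WithBinary),
                          clBuildProgram, clCreateKernel
4. Execution:             clSetKernelArg, clEnqueueNDRangeKernel,
                          clEnqueueReadBuffer
5. Cleanup:               clReleaseMemObject, clReleaseKernel,
                          clReleaseProgram, clReleaseCommandQueue,
                          clReleaseContext
```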

4. CUDA and OpenCL Language Differences

CUDA and OpenCL differ in terminology, but they share the same concepts and design in terms of how work is laid out. Table 1 shows the general differences in terminology used to describe the work models of CUDA and OpenCL.

    CUDA             OpenCL
    Thread           Work-item
    Thread Block     Work-group
    Global Memory    Global Memory
    Constant Memory  Constant Memory
    Shared Memory    Local Memory
    Local Memory     Private Memory

    Table 1. [1]

There are also differences in accessing index information within kernels. CUDA uses predefined variables where OpenCL uses function calls [1]. Table 2 shows the equivalent calls in CUDA and OpenCL for accessing index information.

    CUDA                         OpenCL
    gridDim                      get_num_groups()
    blockDim                     get_local_size()
    blockIdx                     get_group_id()
    threadIdx                    get_local_id()
    No direct equivalent;        get_global_id()
      combine blockDim, blockIdx
      and threadIdx to calculate
      a global index.
    No direct equivalent;        get_global_size()
      combine gridDim and
      blockDim to calculate
      the global size.

    Table 2. [1]

Other naming differences exist between the APIs for similar object definitions and functions, but functionally CUDA does not provide anything equivalent to OpenCL's command queues, which hold the commands to be executed on a compute device. Additionally, OpenCL provides task parallelism capabilities by allowing dependencies to be declared between tasks executing on a device [1].

5. Code Development

Mac OS X

Apple has included support for OpenCL in Mac OS X 10.6 and in the development tools that come with it. The development tools are not installed by default, but are included on the distribution DVD and freely available from the Apple website. To compile OpenCL programs, one only needs to pass the -framework OpenCL argument to the Apple-provided gcc compiler. Apple's implementation of OpenCL also includes support for executing OpenCL kernels on both GPUs and the system's CPU.
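The index mapping in Table 2 can be sketched with two illustrative helpers (these are not part of either API; they just show the arithmetic a CUDA kernel performs to reconstruct the values OpenCL exposes directly):

```c
#include <assert.h>

/* get_global_id(0)   <->  blockIdx.x * blockDim.x + threadIdx.x */
static int global_id(int group_id, int local_size, int local_id)
{
    return group_id * local_size + local_id;
}

/* get_global_size(0) <->  gridDim.x * blockDim.x */
static int global_size(int num_groups, int local_size)
{
    return num_groups * local_size;
}
```

For example, with a work-group (block) size of 64, work-item 5 in group 2 has global index 2*64+5 = 133.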
Apple's Xcode editor recognizes OpenCL source, indenting it and displaying different parts of the code in different colored fonts. This makes it easier to develop code because types, variable names, comments, etc. can be identified quickly.

NVIDIA SDK on Linux

Linux distributions do not include OpenCL support. To support OpenCL on NVIDIA GPU devices, the required version of the NVIDIA device driver must be installed on the system. CUDA must also be installed. NVIDIA's Software Development Kit (SDK) for GPU programming, version 2.3b, includes support for OpenCL. To compile OpenCL code, one must install the SDK, set the LD_LIBRARY_PATH environment variable to include the path to the library included with the SDK, and set the PATH environment variable to include the path to the CUDA bin directory. NVIDIA's OpenCL SDK is integrated with CUDA, so OpenCL kernels compiled using the NVIDIA SDK are said to perform as well as their equivalent CUDA kernels.

ATI SDK on Linux

At the time of this writing, ATI provides beta versions of its device driver and SDK for OpenCL support. To compile OpenCL code, one must install the SDK, set the ATISTREAMSDKROOT environment variable to the path of the SDK installation, set the LD_LIBRARY_PATH environment variable to include the path to the library included with the SDK, and set the PATH environment variable to include the path to the bin directory in the SDK.
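The environment setup described above might look like the following; the install locations are hypothetical and must be replaced with the actual SDK and CUDA paths on your system:

```shell
# Hypothetical install locations -- adjust to your system.
export CUDA_HOME=/usr/local/cuda
export NVSDK_HOME=$HOME/NVIDIA_GPU_Computing_SDK

# NVIDIA SDK: OpenCL library path, plus the CUDA bin directory
export LD_LIBRARY_PATH=$NVSDK_HOME/OpenCL/common/lib:$LD_LIBRARY_PATH
export PATH=$CUDA_HOME/bin:$PATH

# ATI Stream SDK equivalents (beta at the time of writing)
export ATISTREAMSDKROOT=$HOME/ati-stream-sdk
export LD_LIBRARY_PATH=$ATISTREAMSDKROOT/lib:$LD_LIBRARY_PATH
export PATH=$ATISTREAMSDKROOT/bin:$PATH
```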

Getting the SDK installed such that one could compile was straightforward and successful. Executing the kernels on the ATI GPUs was not successful. The issue seems to be ATI's required access to the X11 server for communicating with the GPU: remote logins use the X11 server on the remote machine and thus cannot find the GPU. There were also problems getting X11 to work with the device driver. Due to those problems, this project demonstrates the ability to compile OpenCL code with ATI's SDK, but demonstrated portability by executing the code on the CPU of a MacBook Pro.

6. Performance

The OpenCL implementation of Dslash was not fully optimized at the time of this writing. It does not take advantage of red-black labeling, a technique for implementing a complete Dslash operation using half the number of sites. The CUDA implementation, however, does use red-black labeling. Memory coalescing is used for the storage of both the inputs and outputs by the OpenCL implementation; however, it does this using global memory for both. Of the memory available for the storage of inputs and outputs on the GPUs in OpenCL, global memory has the worst performance, but it proved to be the easiest to use for getting this first OpenCL implementation working.

The performance of the algorithm is measured in effective flops. Since the existing CUDA implementation uses the red-black labeling technique, we measure effective flops using the following formula:

    (flops * num_sites/2 * num_loops) / seconds

Here, flops is the 1320 flops required per site; num_sites is the number of sites; num_loops is 200 (the number of loops performed); and seconds is the recorded time in seconds. We divide the number of sites processed by 2 because the existing CUDA version uses red-black labeling to perform the calculation.

Performance testing was conducted on the processors/systems shown in Table 3 below.

    System       Operating System  GPU/Processor
    MacBook Pro  Mac OS X          NVIDIA 9400M GT (16 cores, 256MB)
    MacBook Pro  Mac OS X          Intel 2.4GHz Core 2 Duo (2GB)
    SuperMicro   CentOS 5.3        NVIDIA Tesla C1060 (240 cores, 4GB)

    Table 3.

NVIDIA 9400M GT

The performance on the NVIDIA 9400M GT was not impressive. This was to be expected because the graphics card is in use for the display and because of the limited number of cores and memory. Figure 3 below shows the observed performance.

Figure 3.

One can clearly see that the operation tops out at around 388 Mflops with a work group size of 64 and 512 threads. The program actually caused the screen to flicker, and the program errored out, when executed with more than 512 threads.

Intel 2.4GHz Core 2 Duo

Executing the same code on the Intel 2.4GHz Core 2 Duo CPU of the MacBook Pro fared better. Figure 4 below shows the observed performance.

Figure 4.

The performance peaks at 1.0 Gflops. As the number of sites increases, performance begins to degrade beyond a certain number of sites; it is believed that at this point cache misses and context switching begin to degrade performance.
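The red-black labeling mentioned above partitions lattice sites by the parity of their coordinates, so a full Dslash sweep can be organized as two half-sweeps over half the sites each. A minimal sketch (the lattice size below is hypothetical, not that of the measured runs):

```c
#include <assert.h>

/* Red-black (even-odd) labeling: a site is "red" when the sum of its
 * coordinates is even, "black" when it is odd.  Nearest neighbors
 * always have the opposite color, so applying Dslash to the sites of
 * one color reads only fields stored on the other color.             */
static int site_parity(int x, int y, int z, int t)
{
    return (x + y + z + t) & 1;   /* 0 = red, 1 = black */
}

/* Count the red sites of a hypothetical L^4 lattice. */
static int count_red(int L)
{
    int red = 0;
    for (int t = 0; t < L; t++)
        for (int z = 0; z < L; z++)
            for (int y = 0; y < L; y++)
                for (int x = 0; x < L; x++)
                    if (site_parity(x, y, z, t) == 0)
                        red++;
    return red;
}
```

For any even L the two colors each hold exactly half the sites, which is why the red-black version launches half as many threads.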
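The effective-flops formula above is straightforward to compute; the sample numbers below are made up for illustration, not measured values:

```c
#include <assert.h>
#include <math.h>

/* Effective Gflops as defined by the formula above; num_sites is
 * halved because the red-black CUDA version processes half the sites. */
static double effective_gflops(double flops_per_site, long num_sites,
                               int num_loops, double seconds)
{
    return flops_per_site * (num_sites / 2.0) * num_loops / seconds / 1e9;
}
```

For example, a hypothetical run over 65536 sites with 200 loops in 10 seconds gives 1320 * 32768 * 200 / 10 = 865,075,200 flops/s, i.e. about 0.87 effective Gflops.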

NVIDIA Tesla C1060

Next the program was executed on an NVIDIA Tesla C1060, which has 240 cores and 4 GBytes of memory. As expected, the Tesla was the top performer in our test group. It performed best with 64 threads per work group and peaked at about 27 Gflops. Although the Tesla C1060 can deliver even better performance, this is clearly a big gain for the application compared to its performance on a CPU. Figure 5 shows the observed performance.

Figure 5.

OpenCL vs CUDA Implementation

The last test was to compare the performance of the CUDA implementation of Dslash with that of the OpenCL implementation. Figure 6 shows the observed performance of each with a work group size (block size in CUDA) of 64 threads.

Figure 6.

Although the CUDA implementation outperformed the OpenCL implementation, the curves of the two graphs are very similar; both reach their peak once a sufficient number of sites is processed. However, recall that the CUDA implementation uses red-black labeling and actually performs the Dslash operation using only half the sites, thus launching half the number of threads, while the OpenCL implementation currently uses one thread for each site. So one improvement we can make to increase the performance of the OpenCL implementation is to adjust the algorithm so that it only has to execute on half the sites. (Anecdotally, when the OpenCL implementation was executed using half the sites, the performance was comparable. Those results are not included in this paper because the result of the Dslash operator was incorrect.) Additionally, the CUDA implementation utilizes texture memory for the inputs; texture memory provides better performance than the global memory used by the OpenCL implementation. So another performance gain can be expected if the OpenCL implementation is modified to use image memory (called texture memory in CUDA).

7. Future Work

The OpenCL implementation of Dslash discussed here does not make use of several of the performance optimizations applied to the CUDA implementation. In particular, utilizing local memory instead of global memory is a known performance enhancement. In terms of the algorithm itself, the CUDA implementation utilizes a trick known as red-black labeling to perform the calculation with only half the sites; hence, it launches half the number of threads as the current OpenCL implementation.

The portability of OpenCL was demonstrated using CentOS, Mac OS X, two different GPUs and a traditional CPU. However, performance testing of Dslash on an ATI GPU, and comparing it with NVIDIA, remains of interest to the Scientific Computing group at Jefferson Lab. If performance is comparable, future procurements of GPUs for use by LQCD applications can be competed between NVIDIA and ATI hardware without huge development efforts to port Dslash.

8. Conclusions

Implementing Dslash in OpenCL for the first time was a bit tedious. It is important to understand the different types of memory available and the limitations associated with each type; once these are fully understood, however, the steps required to execute a kernel are mostly clear-cut. Additionally, once the algorithm for the Dslash operator was completely understood, porting the existing CUDA kernel to OpenCL proved time consuming but straightforward. The kernels are similar enough that developers using CUDA should not have a hard time porting them to OpenCL. The portability of OpenCL and its ability to execute on heterogeneous processor architectures is a big advantage.

It was very easy to move the Dslash implementation between Mac OS X and Linux and between CPU and GPU. This flexibility will allow the LQCD community to take advantage of advances in GPU designs from multiple vendors without the major software development efforts traditionally required when migrating to a new processor architecture. Although the Dslash operator in CUDA outperformed the OpenCL implementation, it is important to remember that the CUDA implementation includes optimizations, such as red-black labeling and the use of shared and texture memory on the GPU. It is expected that the OpenCL implementation will perform similarly with the same optimizations. It is clear that scientific applications like LQCD can take advantage of modern GPU designs. The performance observed on a single high-end graphics card compared with a traditional CPU makes such development efforts worthwhile. Implementation of such applications in OpenCL adds value by providing portability, reduced development costs for heterogeneous processor architectures, and the ability to utilize new hardware platforms sooner.

Acknowledgments

We thank Jie Chen for instruction on the existing CUDA implementation of Dslash, and Balint Joo for instruction on the Dslash operator, the physics, and LQCD. Discussions with Kostas Orginos triggered the exploration of the application of GPUs to LQCD.

References

[1] AMD, Inc. OpenCL and the ATI Stream SDK v2.0 (2009).
[2] Balint Joo (private communication).
[3] M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. A compiler framework for optimization of affine loop nests for GPGPUs. In ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing.
[4] M. A. Clark, R. Babich, K. Barros, R. C. Brower, and C. Rebbi. Solving lattice QCD systems of equations using mixed precision solvers on GPUs. arXiv [hep-lat].
[5] Y. Dotsenko, N. K. Govindaraju, P. Sloan, C. Boyd, and J. Manferdelli. Fast scan algorithms on graphics processors. In ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing.
[6] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Phys. Lett. B195.
[7] G. I. Egri et al. Lattice QCD as a video game. Comput. Phys. Commun. 177.
[8] J. Foley, K. J. Juge, A. O'Cais, M. Peardon, S. Ryan, and J.-I. Skullerud. Practical all-to-all propagators for lattice QCD. Comput. Phys. Commun. 172.
[9] S. A. Gottlieb, W. Liu, D. Toussaint, R. L. Renken, and R. L. Sugar. Hybrid molecular dynamics algorithms for the numerical simulation of quantum chromodynamics. Phys. Rev. D35.
[10] K. Ibrahim, F. Bodin, and O. Pène. Fine-grained parallelization of lattice QCD kernel routine on GPUs. Journal of Parallel and Distributed Computing, 68(10).
[11] Khronos Group. OpenCL: Parallel Computing for Heterogeneous Devices (2009).
[12] Y. Liu, E. Z. Zhang, and X. Shen. A cross-input adaptive framework for GPU program optimizations. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS).
[13] A. Munshi et al. The OpenCL Specification, version 1.0, revision 48. Technical report, Khronos OpenCL Working Group (2009).
[14] USQCD.


GeoImaging Accelerator Pansharp Test Results

GeoImaging Accelerator Pansharp Test Results GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department

More information

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it t.diamanti@cineca.it Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and

More information

Evaluation of CUDA Fortran for the CFD code Strukti

Evaluation of CUDA Fortran for the CFD code Strukti Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center

More information

Graphic Processing Units: a possible answer to High Performance Computing?

Graphic Processing Units: a possible answer to High Performance Computing? 4th ABINIT Developer Workshop RESIDENCE L ESCANDILLE AUTRANS HPC & Graphic Processing Units: a possible answer to High Performance Computing? Luigi Genovese ESRF - Grenoble 26 March 2009 http://inac.cea.fr/l_sim/

More information

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices WS on Models, Algorithms and Methodologies for Hierarchical Parallelism in new HPC Systems The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

More information

L20: GPU Architecture and Models

L20: GPU Architecture and Models L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.

More information

Turbomachinery CFD on many-core platforms experiences and strategies

Turbomachinery CFD on many-core platforms experiences and strategies Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29

More information

GPGPU accelerated Computational Fluid Dynamics

GPGPU accelerated Computational Fluid Dynamics t e c h n i s c h e u n i v e r s i t ä t b r a u n s c h w e i g Carl-Friedrich Gauß Faculty GPGPU accelerated Computational Fluid Dynamics 5th GACM Colloquium on Computational Mechanics Hamburg Institute

More information

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get

More information

SAM XFile. Trial Installation Guide Linux. Snell OD is in the process of being rebranded SAM XFile

SAM XFile. Trial Installation Guide Linux. Snell OD is in the process of being rebranded SAM XFile SAM XFile Trial Installation Guide Linux Snell OD is in the process of being rebranded SAM XFile Version History Table 1: Version Table Date Version Released by Reason for Change 10/07/2014 1.0 Andy Gingell

More information

ST810 Advanced Computing

ST810 Advanced Computing ST810 Advanced Computing Lecture 17: Parallel computing part I Eric B. Laber Hua Zhou Department of Statistics North Carolina State University Mar 13, 2013 Outline computing Hardware computing overview

More information

COSCO 2015 Heterogeneous Computing Programming

COSCO 2015 Heterogeneous Computing Programming COSCO 2015 Heterogeneous Computing Programming Michael Meyer, Shunsuke Ishikuro Supporters: Kazuaki Sasamoto, Ryunosuke Murakami July 24th, 2015 Heterogeneous Computing Programming 1. Overview 2. Methodology

More information

Autodesk Revit 2016 Product Line System Requirements and Recommendations

Autodesk Revit 2016 Product Line System Requirements and Recommendations Autodesk Revit 2016 Product Line System Requirements and Recommendations Autodesk Revit 2016, Autodesk Revit Architecture 2016, Autodesk Revit MEP 2016, Autodesk Revit Structure 2016 Minimum: Entry-Level

More information

Writing Applications for the GPU Using the RapidMind Development Platform

Writing Applications for the GPU Using the RapidMind Development Platform Writing Applications for the GPU Using the RapidMind Development Platform Contents Introduction... 1 Graphics Processing Units... 1 RapidMind Development Platform... 2 Writing RapidMind Enabled Applications...

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

HPC Wales Skills Academy Course Catalogue 2015

HPC Wales Skills Academy Course Catalogue 2015 HPC Wales Skills Academy Course Catalogue 2015 Overview The HPC Wales Skills Academy provides a variety of courses and workshops aimed at building skills in High Performance Computing (HPC). Our courses

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware

More information

Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms

Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms Björn Rocker Hamburg, June 17th 2010 Engineering Mathematics and Computing Lab (EMCL) KIT University of the State

More information

PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0

PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0 PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0 15 th January 2014 Al Chrosny Director, Software Engineering TreeAge Software, Inc. achrosny@treeage.com Andrew Munzer Director, Training and Customer

More information

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Qingyu Meng, Alan Humphrey, Martin Berzins Thanks to: John Schmidt and J. Davison de St. Germain, SCI Institute Justin Luitjens

More information

Clustering Billions of Data Points Using GPUs

Clustering Billions of Data Points Using GPUs Clustering Billions of Data Points Using GPUs Ren Wu ren.wu@hp.com Bin Zhang bin.zhang2@hp.com Meichun Hsu meichun.hsu@hp.com ABSTRACT In this paper, we report our research on using GPUs to accelerate

More information

Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter

Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Daniel Weingaertner Informatics Department Federal University of Paraná - Brazil Hochschule Regensburg 02.05.2011 Daniel

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

How to choose a suitable computer

How to choose a suitable computer How to choose a suitable computer This document provides more specific information on how to choose a computer that will be suitable for scanning and post-processing your data with Artec Studio. While

More information

Installation Guide. (Version 2014.1) Midland Valley Exploration Ltd 144 West George Street Glasgow G2 2HG United Kingdom

Installation Guide. (Version 2014.1) Midland Valley Exploration Ltd 144 West George Street Glasgow G2 2HG United Kingdom Installation Guide (Version 2014.1) Midland Valley Exploration Ltd 144 West George Street Glasgow G2 2HG United Kingdom Tel: +44 (0) 141 3322681 Fax: +44 (0) 141 3326792 www.mve.com Table of Contents 1.

More information

Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures

Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures E Calore, S F Schifano, R Tripiccione Enrico Calore INFN Ferrara, Italy Perspectives of GPU Computing in Physics

More information

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,

More information

Intelligent Heuristic Construction with Active Learning

Intelligent Heuristic Construction with Active Learning Intelligent Heuristic Construction with Active Learning William F. Ogilvie, Pavlos Petoumenos, Zheng Wang, Hugh Leather E H U N I V E R S I T Y T O H F G R E D I N B U Space is BIG! Hubble Ultra-Deep Field

More information

NVIDIA CUDA GETTING STARTED GUIDE FOR MICROSOFT WINDOWS

NVIDIA CUDA GETTING STARTED GUIDE FOR MICROSOFT WINDOWS NVIDIA CUDA GETTING STARTED GUIDE FOR MICROSOFT WINDOWS DU-05349-001_v6.0 February 2014 Installation and Verification on TABLE OF CONTENTS Chapter 1. Introduction...1 1.1. System Requirements... 1 1.2.

More information

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Technology brief Introduction... 2 GPU-based computing... 2 ProLiant SL390s GPU-enabled architecture... 2 Optimizing

More information

Several tips on how to choose a suitable computer

Several tips on how to choose a suitable computer Several tips on how to choose a suitable computer This document provides more specific information on how to choose a computer that will be suitable for scanning and postprocessing of your data with Artec

More information

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:

More information

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization

More information

OpenACC 2.0 and the PGI Accelerator Compilers

OpenACC 2.0 and the PGI Accelerator Compilers OpenACC 2.0 and the PGI Accelerator Compilers Michael Wolfe The Portland Group michael.wolfe@pgroup.com This presentation discusses the additions made to the OpenACC API in Version 2.0. I will also present

More information

SUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS - 2013 UPDATE

SUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS - 2013 UPDATE SUBJECT: SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE KEYWORDS:, CORE, PROCESSOR, GRAPHICS, DRIVER, RAM, STORAGE SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE Below is a summary of key components of an ideal SolidWorks

More information

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates High Performance Computing (HPC) CAEA elearning Series Jonathan G. Dudley, Ph.D. 06/09/2015 2015 CAE Associates Agenda Introduction HPC Background Why HPC SMP vs. DMP Licensing HPC Terminology Types of

More information

HPC Cluster Decisions and ANSYS Configuration Best Practices. Diana Collier Lead Systems Support Specialist Houston UGM May 2014

HPC Cluster Decisions and ANSYS Configuration Best Practices. Diana Collier Lead Systems Support Specialist Houston UGM May 2014 HPC Cluster Decisions and ANSYS Configuration Best Practices Diana Collier Lead Systems Support Specialist Houston UGM May 2014 1 Agenda Introduction Lead Systems Support Specialist Cluster Decisions Job

More information

NVIDIA GeForce GTX 580 GPU Datasheet

NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines

More information

System Requirements Table of contents

System Requirements Table of contents Table of contents 1 Introduction... 2 2 Knoa Agent... 2 2.1 System Requirements...2 2.2 Environment Requirements...4 3 Knoa Server Architecture...4 3.1 Knoa Server Components... 4 3.2 Server Hardware Setup...5

More information

GPGPU Parallel Merge Sort Algorithm

GPGPU Parallel Merge Sort Algorithm GPGPU Parallel Merge Sort Algorithm Jim Kukunas and James Devine May 4, 2009 Abstract The increasingly high data throughput and computational power of today s Graphics Processing Units (GPUs), has led

More information

Retargeting PLAPACK to Clusters with Hardware Accelerators

Retargeting PLAPACK to Clusters with Hardware Accelerators Retargeting PLAPACK to Clusters with Hardware Accelerators Manuel Fogué 1 Francisco Igual 1 Enrique S. Quintana-Ortí 1 Robert van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores.

More information

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Computer Engineering by Amol

More information

A general-purpose virtualization service for HPC on cloud computing: an application to GPUs

A general-purpose virtualization service for HPC on cloud computing: an application to GPUs A general-purpose virtualization service for HPC on cloud computing: an application to GPUs R.Montella, G.Coviello, G.Giunta* G. Laccetti #, F. Isaila, J. Garcia Blas *Department of Applied Science University

More information

Scalability and Classifications

Scalability and Classifications Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static

More information

CS 147: Computer Systems Performance Analysis

CS 147: Computer Systems Performance Analysis CS 147: Computer Systems Performance Analysis CS 147: Computer Systems Performance Analysis 1 / 39 Overview Overview Overview What is a Workload? Instruction Workloads Synthetic Workloads Exercisers and

More information

Several tips on how to choose a suitable computer

Several tips on how to choose a suitable computer Several tips on how to choose a suitable computer This document provides more specific information on how to choose a computer that will be suitable for scanning and postprocessing of your data with Artec

More information

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

More information

SPEEDUP - optimization and porting of path integral MC Code to new computing architectures

SPEEDUP - optimization and porting of path integral MC Code to new computing architectures SPEEDUP - optimization and porting of path integral MC Code to new computing architectures V. Slavnić, A. Balaž, D. Stojiljković, A. Belić, A. Bogojević Scientific Computing Laboratory, Institute of Physics

More information

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,

More information

Exploiting GPU Hardware Saturation for Fast Compiler Optimization

Exploiting GPU Hardware Saturation for Fast Compiler Optimization Exploiting GPU Hardware Saturation for Fast Compiler Optimization Alberto Magni School of Informatics University of Edinburgh United Kingdom a.magni@sms.ed.ac.uk Christophe Dubach School of Informatics

More information

Le langage OCaml et la programmation des GPU

Le langage OCaml et la programmation des GPU Le langage OCaml et la programmation des GPU GPU programming with OCaml Mathias Bourgoin - Emmanuel Chailloux - Jean-Luc Lamotte Le projet OpenGPU : un an plus tard Ecole Polytechnique - 8 juin 2011 Outline

More information

Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration

Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration Jinglin Zhang, Jean François Nezan, Jean-Gabriel Cousin, Erwan Raffin To cite this version: Jinglin Zhang,

More information

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:

More information

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS Julien Demouth, NVIDIA STAC-A2 BENCHMARK STAC-A2 Benchmark Developed by banks Macro and micro, performance and accuracy Pricing and Greeks for American

More information

Pedraforca: ARM + GPU prototype

Pedraforca: ARM + GPU prototype www.bsc.es Pedraforca: ARM + GPU prototype Filippo Mantovani Workshop on exascale and PRACE prototypes Barcelona, 20 May 2014 Overview Goals: Test the performance, scalability, and energy efficiency of

More information

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la

More information

Parallel Computing with MATLAB

Parallel Computing with MATLAB Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best

More information

Recommended hardware system configurations for ANSYS users

Recommended hardware system configurations for ANSYS users Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range

More information

Building a Top500-class Supercomputing Cluster at LNS-BUAP

Building a Top500-class Supercomputing Cluster at LNS-BUAP Building a Top500-class Supercomputing Cluster at LNS-BUAP Dr. José Luis Ricardo Chávez Dr. Humberto Salazar Ibargüen Dr. Enrique Varela Carlos Laboratorio Nacional de Supercómputo Benemérita Universidad

More information

CLOUD GAMING WITH NVIDIA GRID TECHNOLOGIES Franck DIARD, Ph.D., SW Chief Software Architect GDC 2014

CLOUD GAMING WITH NVIDIA GRID TECHNOLOGIES Franck DIARD, Ph.D., SW Chief Software Architect GDC 2014 CLOUD GAMING WITH NVIDIA GRID TECHNOLOGIES Franck DIARD, Ph.D., SW Chief Software Architect GDC 2014 Introduction Cloud ification < 2013 2014+ Music, Movies, Books Games GPU Flops GPUs vs. Consoles 10,000

More information

Mitglied der Helmholtz-Gemeinschaft. OpenCL Basics. Parallel Computing on GPU and CPU. Willi Homberg. 23. März 2011

Mitglied der Helmholtz-Gemeinschaft. OpenCL Basics. Parallel Computing on GPU and CPU. Willi Homberg. 23. März 2011 Mitglied der Helmholtz-Gemeinschaft OpenCL Basics Parallel Computing on GPU and CPU Willi Homberg Agenda Introduction OpenCL architecture Platform model Execution model Memory model Programming model Platform

More information

GPU Profiling with AMD CodeXL

GPU Profiling with AMD CodeXL GPU Profiling with AMD CodeXL Software Profiling Course Hannes Würfel OUTLINE 1. Motivation 2. GPU Recap 3. OpenCL 4. CodeXL Overview 5. CodeXL Internals 6. CodeXL Profiling 7. CodeXL Debugging 8. Sources

More information