Hardware-Aware Analysis and Optimization of Stable Fluids. Presentation Date: Sep 15th 2009. Chrissie C. Cui


1 Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15th 2009 Chrissie C. Cui

2 Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching Conclusions and Future Work

3 Introduction Implement Jos Stam's Stable Fluids fluid simulation algorithm on CPU, GPU and Cell Detailed flop and bandwidth analysis for each computational stage and each implementation Propose new schemes to solve the processor idle problem and the performance loss caused by random memory access

4 Highlights Stam's fluid algorithm is bandwidth-bound: the cores sit idle up to 96% of the time! Detailed flop and bandwidth analysis of each computational stage Make use of otherwise idle processors by using higher-order Mehrstellen methods Adopt a simple static caching scheme to beat the performance barrier in the semi-Lagrangian advection step (a 99% hit rate, which is very impressive)

5 Flops & Bandwidth Analysis - Overview Assumptions: Ideal cache with maximum data reuse Three rows of the computational grid fit in cache Hardware: CPU: Intel Xeon 5100 (Woodcrest) with SSE4 GPU: Nvidia GeForce 8800 Ultra, GeForce 7900 Cell: IBM Cell QS20 Code versions: CPU: Stam's open-source implementation GPU: Harris's implementation Cell: Theodore Kim's implementation A multiply-add is counted as one operation on all architectures. The following example is in 2D; the 3D results can be obtained by applying the same analysis

6 Flops & Bandwidth Analysis - Add Source Source Code: Analysis: Within each loop: 2 loads + 1 store, over the 2 scalar components of velocity and the single density field Flops: 3N² Bandwidth: 9N²
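The source-code screenshot did not survive transcription; the loop the counts describe is the add-source loop of Stam's open-source demo solver. A minimal sketch (array names follow that code):

```c
/* Add an external source field s into field x over timestep dt.
   Per array element: 1 multiply-add, 1 load of s, 1 load + 1 store
   of x. Applied to u, v and the density, this yields 3N^2 flops
   and 9N^2 words of traffic on an N x N grid. */
void add_source(int size, float *x, const float *s, float dt)
{
    for (int i = 0; i < size; i++)
        x[i] += dt * s[i];
}
```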

7 Flops & Bandwidth Analysis - Diffusion Source Code: Analysis: Perform I iterations on each grid cell 1 store for x[i][j], 1 load for x0[i][j] Under the ideal cache assumption: only 1 load, for x[i][j+1] 3 adds and 1 multiply-add in each iteration Flops: (3+12I)N² Bandwidth: (9+9I)N²
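The diffusion solve in Stam's demo code is an iterative relaxation over the 5-point stencil; one sweep can be sketched as follows (a hedged reconstruction; `IX`, `a` and `c` are the names used in that code, and the three-rows-in-cache assumption is what lets all but one neighbor load be amortized):

```c
/* Indexing macro for an (N+2) x (N+2) grid with a 1-cell border. */
#define IX(i, j) ((i) + (N + 2) * (j))

/* One relaxation sweep of the diffusion solve. Per cell: 3 adds to
   sum the four neighbors, then a multiply-add folding in x0, and a
   divide by c (often precomputed as a reciprocal multiply). */
void diffuse_sweep(int N, float *x, const float *x0, float a, float c)
{
    for (int j = 1; j <= N; j++)
        for (int i = 1; i <= N; i++)
            x[IX(i, j)] = (x0[IX(i, j)]
                           + a * (x[IX(i - 1, j)] + x[IX(i + 1, j)]
                                + x[IX(i, j - 1)] + x[IX(i, j + 1)])) / c;
}
```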

8 Flops & Bandwidth Analysis - Projection I Three sub-stages: divergence computation, pressure computation, final projection Divergence Computation: Source Code Analysis 1 store for div[i][j], 1 load for u, 1 load for a new row of v 2 subtracts, 1 add, 1 multiply Flops: 3N² Bandwidth: 4N² Pressure Computation: Computed by a linear solver; per iteration, the stencil analysis matches the diffusion solve Flops: 4IN² Bandwidth: 3IN²
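The divergence loop the counts refer to can be sketched as follows (a hedged reconstruction in the style of Stam's demo `project()` routine; `IX` and the grid spacing h = 1/N are assumptions carried over from that code):

```c
#define IX(i, j) ((i) + (N + 2) * (j))

/* Divergence of the intermediate velocity field (u, v), using
   central differences. Per cell: 2 subtractions, 1 add and
   1 multiply, plus the store of div and the amortized loads
   of the u and v rows. */
void compute_divergence(int N, float *div, const float *u,
                        const float *v, float h)
{
    for (int j = 1; j <= N; j++)
        for (int i = 1; i <= N; i++)
            div[IX(i, j)] = -0.5f * h * (u[IX(i + 1, j)] - u[IX(i - 1, j)]
                                       + v[IX(i, j + 1)] - v[IX(i, j - 1)]);
}
```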

9 Flops & Bandwidth Analysis - Projection II Final Projection Source Code Analysis Loads and stores for u and v Loads for p can be amortized into a single load 1 subtract, 1 multiply-add per line Flops: 5N² Bandwidth: 4N² Sum: Flops: (8+4I)N² Bandwidth: (8+3I)N²
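The final-projection loop subtracts the pressure gradient from the velocity; a hedged sketch in the same style (again assuming `IX` and h = 1/N from Stam's demo code):

```c
#define IX(i, j) ((i) + (N + 2) * (j))

/* Subtract the pressure gradient from (u, v) to make the field
   divergence-free. Each line is 1 subtraction plus 1 multiply-add;
   under the ideal-cache model the loads of p amortize to a single
   load per cell. */
void subtract_gradient(int N, float *u, float *v, const float *p, float h)
{
    for (int j = 1; j <= N; j++)
        for (int i = 1; i <= N; i++) {
            u[IX(i, j)] -= 0.5f * (p[IX(i + 1, j)] - p[IX(i - 1, j)]) / h;
            v[IX(i, j)] -= 0.5f * (p[IX(i, j + 1)] - p[IX(i, j - 1)]) / h;
        }
}
```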

10 Flops & Bandwidth Analysis - Advection I Three steps: backtrace, grid index computation and interpolation Backtrace Source Code Analysis 1 multiply-add for each line Loads from u and v Flops: 2N² Bandwidth: 2N² Grid Index Computation Source Code

11 Flops & Bandwidth Analysis - Advection II Grid Index Computation: Analysis: The if statements can be restated as ternaries (0 flops) and the truncation emulates a floor function; 1 flop for each local variable computation Flops: 4N² Bandwidth: 0 Interpolation Two steps: weight computation and interpolation computation

12 Flops & Bandwidth Analysis - Advection III Interpolation Weights Computation Source Code Flops: 4N² Bandwidth: 0 Interpolation Computation Source Code Analysis 1 store for d[i][j] The loads of d0 cannot be amortized (unpredictable access pattern) 6 flops with multiply-adds Flops: 6N² Bandwidth: 5N²
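The three advection steps analyzed on the last three slides — backtrace, index computation and bilinear interpolation — can be sketched for a single cell as follows (a hedged reconstruction of the inner body of Stam's demo `advect()`, with the clamps written as the ternaries the analysis assumes; `dt0 = dt*N` as in that code):

```c
#define IX(i, j) ((i) + (N + 2) * (j))

/* One cell of semi-Lagrangian advection of field d0 into d. */
void advect_cell(int N, int i, int j, float *d, const float *d0,
                 const float *u, const float *v, float dt0)
{
    /* Backtrace: 1 multiply-add per line. */
    float x = i - dt0 * u[IX(i, j)];
    float y = j - dt0 * v[IX(i, j)];

    /* Grid index computation: clamps as ternaries (0 flops);
       the (int) truncation emulates a floor for positive x, y. */
    x = (x < 0.5f) ? 0.5f : x;
    x = (x > N + 0.5f) ? N + 0.5f : x;
    y = (y < 0.5f) ? 0.5f : y;
    y = (y > N + 0.5f) ? N + 0.5f : y;
    int i0 = (int)x, i1 = i0 + 1;
    int j0 = (int)y, j1 = j0 + 1;

    /* Interpolation weights: 4 flops, no memory traffic. */
    float s1 = x - i0, s0 = 1 - s1;
    float t1 = y - j0, t0 = 1 - t1;

    /* Bilinear interpolation: the four d0 loads land wherever the
       ray terminated, so their traffic cannot be amortized. */
    d[IX(i, j)] = s0 * (t0 * d0[IX(i0, j0)] + t1 * d0[IX(i0, j1)])
                + s1 * (t0 * d0[IX(i1, j0)] + t1 * d0[IX(i1, j1)]);
}
```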

13 Flops & Bandwidth Analysis - Summary To sum up 2D case (summing the stages above) Flops: (30+16I)N² Bandwidth: (33+12I)N² 3D case (extended from 2D) Flops: ( I)N³ Bandwidth: (71+15I)N³

14 Peak Performance Estimate I Hardware specifications: CPU: Intel Xeon 5100 Two cores at 3 GHz Dispatches a 4-float SIMD instruction each clock cycle Peak performance: 24 GFlops/s Peak memory bandwidth: GB/s GPU: Nvidia GeForce 8800 Ultra 128 scalar cores at 1.5 GHz Peak performance: 192 GFlops/s Peak memory bandwidth: 103.7 GB/s Cell: IBM QS20 Cell blade Two Cell chips at 3.2 GHz, 8 Synergistic Processing Elements (SPEs) per Cell Each dispatches a 4-float SIMD instruction every clock cycle Peak performance: GFlops/s Peak memory bandwidth: 25.6 GB/s Performance is evaluated from the developed equations (Table 1 on the next page)

15 Peak Performance Estimate II Table 1: Estimated peak frames per second of Stable Fluids over different resolutions for several architectures. Peak performance is estimated for each architecture assuming the computation is compute-bound (i.e. infinite bandwidth is available) and bandwidth-bound (i.e. infinite flops are available). The lesser of these two quantities is the more realistic estimate. In all cases, the algorithm is bandwidth-bound. Performance Estimate The ratio of computation to data arrival: 2D: CPU 6.65x faster, GPU 5.47x faster, Cell 23.66x faster 3D: CPU 4.47x faster, GPU 3.89x faster, Cell 16.8x faster Processor Idle Rate 2D: CPU 85%, GPU 82%, Cell 96% 3D: CPU 79%, GPU 74%, Cell 94%
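The two estimates described in the caption can be written out explicitly (a sketch of the procedure, assuming 4 bytes per single-precision word; Flops(N) and Words(N) are the per-frame counts from the summary slide):

```latex
\mathrm{FPS}_{\text{compute}} = \frac{F_{\text{peak}}}{\mathrm{Flops}(N)},
\qquad
\mathrm{FPS}_{\text{bandwidth}} = \frac{B_{\text{peak}}}{4 \cdot \mathrm{Words}(N)},
\qquad
\mathrm{FPS} \approx \min\!\left(\mathrm{FPS}_{\text{compute}},\, \mathrm{FPS}_{\text{bandwidth}}\right)
```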

16 Peak Performance Estimate III Arithmetic Intensity What happens when I (the iteration count) goes to infinity? A reasonable explanation: algorithms run well on the Cell and GPU when their arithmetic intensities are much greater than one. As both the 2D and 3D cases are close to one, the available flops will be underutilized.
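In 2D, keeping only the per-iteration terms from the diffusion and pressure analyses (12I + 4I flops against 9I + 3I words), the limiting arithmetic intensity works out to barely above one — a reconstruction of the limit the slide refers to:

```latex
\lim_{I \to \infty} \frac{\mathrm{Flops}}{\mathrm{Words}}
= \frac{(12 + 4)\, I\, N^2}{(9 + 3)\, I\, N^2}
= \frac{4}{3} \approx 1.33
```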

17 Frame Rate Performance Measurement Table 2: Theoretical peak frames per second (the bandwidth-bound values from Table 1) and actual measured frames per second. None of the measured times exceed the predicted theoretical peaks, validating the finding that the algorithm is bandwidth-bound. A GeForce 7900 was used for the 16-bit timings because its frame rates were uniformly superior to the 8800's. Some findings The predicted theoretical peaks were never exceeded, providing additional evidence that the algorithm is bandwidth-bound. A trend on both the GPU and the Cell is that as the resolution increases, the theoretical peak is more closely approached (larger coherent loads)

18 Mehrstellen Schemes - Background Poisson solver for the diffusion and projection stages Discretized version Rewritten in matrix format Going from 2nd order to 4th order accuracy, for fewer iterations
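A standard reconstruction of the setup this slide describes, for the projection stage: the Poisson equation, its second-order 5-point discretization on a grid of spacing h, and the resulting Jacobi update once it is written as a linear system Ap = b:

```latex
\nabla^2 p = \nabla \cdot \mathbf{u}^{*}
\quad\Longrightarrow\quad
\frac{p_{i+1,j} + p_{i-1,j} + p_{i,j+1} + p_{i,j-1} - 4\,p_{i,j}}{h^2} = b_{i,j}
```

```latex
p_{i,j}^{(k+1)} = \tfrac{1}{4}\left(
p_{i+1,j}^{(k)} + p_{i-1,j}^{(k)} + p_{i,j+1}^{(k)} + p_{i,j-1}^{(k)} - h^2 b_{i,j}
\right)
```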

19 Mehrstellen Schemes - Details An alternate discretization that allows us to increase the accuracy from second to fourth order without significantly increasing the complexity of the memory access pattern (2D and 3D stencils)
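The stencils themselves did not survive transcription; the standard 2D Mehrstellen (compact 9-point) discretization of the Poisson equation takes the form:

```latex
\frac{1}{6h^2}\Big[\,4\,(u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1})
+ u_{i+1,j+1} + u_{i+1,j-1} + u_{i-1,j+1} + u_{i-1,j-1}
- 20\,u_{i,j}\Big]
= f_{i,j} + \tfrac{1}{12}\big(f_{i+1,j} + f_{i-1,j} + f_{i,j+1} + f_{i,j-1} - 4 f_{i,j}\big)
```

In 3D the analogous compact 19-point stencil weights the center by −24, the six face neighbors by 2 and the twelve edge neighbors by 1, again over 6h². All accesses stay within one cell of the center, which is why the memory access pattern barely changes.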

20 Mehrstellen Schemes - Results I Spectral radius of the resultant matrix: the error of the current solution is multiplied by the spectral radius of the Jacobi matrix every iteration. Expectation: if the radius is significantly smaller than that of the second-order discretization, then fewer Jacobi iterations are needed overall. Compared: the spectral radius of Jacobi iteration using the Mehrstellen discretization, the equivalent radius for the standard Jacobi matrix, and the number of iterations it would take Mehrstellen Jacobi to achieve an error reduction equivalent to 20 iterations of standard Jacobi
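Since the error shrinks by a factor of the spectral radius every iteration, the equivalent iteration count in the comparison follows directly (with ρ_M and ρ_S the Mehrstellen and standard radii):

```latex
\rho_M^{\,n} = \rho_S^{\,20}
\quad\Longrightarrow\quad
n = 20\,\frac{\ln \rho_S}{\ln \rho_M}
```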

21 Mehrstellen Schemes - Results II Table 3: Spectral radii of the fourth-order accurate Mehrstellen Jacobi matrix (M) and the standard second-order accurate Jacobi matrix (S). The third column gives the number of Mehrstellen iterations necessary to match the error reduction of 20 standard iterations. The last column is the fraction of a Mehrstellen iteration necessary to match the error reduction of one standard iteration.

22 Advection Caching - Scheme Physical characteristics There are reasons to expect that the majority of the vector field exhibits high spatial locality: The time-step size in practice is quite small The projection and diffusion operators smear out the velocity field Large velocities quickly dissipate into smaller ones in both space and time Making use of this: assume that most of the advection rays terminate in regions very close to their origins Static Caching Scheme A two-way approach: Prefetch the rows j-1, j, and j+1 from the d0 array. While iterating over the elements of row j, first check whether the semi-Lagrangian ray terminated in a 3x3 neighborhood of the origin. If so, use the prefetched d0 values for the interpolation. Otherwise, perform the more expensive fetch from main memory.
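The scheme's hit test can be sketched as follows (a hedged reconstruction; on the Cell the "prefetch" would be a DMA of three rows into the SPE local store, which is elided here, and `count_hits` is a hypothetical helper for measuring the hit rates reported on the next slide):

```c
#include <stdlib.h>

/* Returns 1 if the backtraced cell (i0, j0) lies in the 3x3
   neighborhood of the origin cell (i, j) -- i.e. inside the three
   prefetched rows j-1, j, j+1 -- so the cached d0 values suffice
   and no main-memory fetch is needed. */
int in_static_cache(int i, int j, int i0, int j0)
{
    return abs(i0 - i) <= 1 && abs(j0 - j) <= 1;
}

/* Count static-cache hits over one row of n backtrace targets
   (i0[k], j0[k]) originating from cells (k, j). */
int count_hits(int j, int n, const int *i0, const int *j0)
{
    int hits = 0;
    for (int i = 0; i < n; i++)
        if (in_static_cache(i, j, i0[i], j0[i]))
            hits++;
    return hits;
}
```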

23 Advection Caching - Tests Two test scenes 2D scene: eight jets of velocity and density were injected into a 512² simulation at different points and in different directions, in order to induce a wide variety of directions into the velocity field 3D scene: a buoyant pocket of smoke is continually inserted into a 64³ simulation Cache miss rate: 2D: the miss rate never exceeds 0.65% 3D: the miss rate never exceeds 0.44% Bandwidth test for the 2D scene on the Cell: bandwidth achieved by the advection stage on the Cell with and without the static cache

24 Conclusion & Future Work A detailed flop and bandwidth analysis of the implementation of Stable Fluids on current CPU, GPU and Cell architectures Proved theoretically and experimentally that the performance of the algorithm is bandwidth-bound Proposed the use of the Mehrstellen discretization to reduce the number of Jacobi solver iterations and thereby the processor idle rate: this scheme allows the linear solver to terminate 17% earlier in 2D and 33% earlier in 3D Designed a static caching scheme for the advection stage that makes more effective use of the available memory bandwidth: a 2x speedup is measured in the advection stage using this scheme on the Cell Future work: map algorithms that handle free-surface cases to parallel architectures and do the corresponding performance analysis; develop Mehrstellen-like discretization schemes for PCG solvers

25 Thanks for your attention. Questions?


More information

Cache Systems. Memory Hierarchy Design. Example: Two-level Hierarchy. Overview. Locality of Reference. Basic Cache Read Operation

Cache Systems. Memory Hierarchy Design. Example: Two-level Hierarchy. Overview. Locality of Reference. Basic Cache Read Operation Cache Systems Hierarchy Design 4MHz Bus 66MHz Main MHz Cache Bus 66MHz Main MHz Chapter 5 and Appendix C Data object transfer Block transfer Main Cache 4 Overview Example: Two-level Hierarchy Access Time

More information

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as

More information

GPU Physics. Mark Harris NVIDIA Developer Technology

GPU Physics. Mark Harris NVIDIA Developer Technology GPU Physics Mark Harris NVIDIA Developer Technology Game Physics Enhance game experience through simulation Simulate objects and interactions between them Rigid bodies, particles, rag dolls, cloth, fluids,

More information

Implementing a GPU-Efficient FFT

Implementing a GPU-Efficient FFT Why Fast Fourier Transform? Classic algorithm Computationally intensive Implementing a GPU-Efficient FFT Useful Imaging Signal analysis Procedural texturing John Spitzer NVIDIA Corporation What is a FFT?

More information

Real-Time Eulerian Water Simulation Using a Restricted Tall Cell Grid. Monday, August 15, 2011

Real-Time Eulerian Water Simulation Using a Restricted Tall Cell Grid. Monday, August 15, 2011 Real-Time Eulerian Water Simulation Using a Restricted Tall Cell Grid Nuttapong Chentanez Matthias Müller Main Contributions GPU friendly tall cell grid data structure Multigrid Poisson solver for the

More information

Accelerating CST MWS Performance with GPU and MPI Computing. CST workshop series

Accelerating CST MWS Performance with GPU and MPI Computing.  CST workshop series Accelerating CST MWS Performance with GPU and MPI Computing www.cst.com CST workshop series 2010 1 Hardware Based Acceleration Techniques - Overview - Multithreading GPU Computing Distributed Computing

More information

ITERATIVE SOLUTIONS TO LINEAR ALGEBRAIC EQUATIONS. As finer discretizations are being applied with Finite Difference and Finite Element codes:

ITERATIVE SOLUTIONS TO LINEAR ALGEBRAIC EQUATIONS. As finer discretizations are being applied with Finite Difference and Finite Element codes: LECTURE 19 ITERATIVE SOLUTIONS TO LINEAR ALGEBRAIC EQUATIONS As finer discretizations are being applied with Finite Difference and Finite Element codes: Matrices are becoming increasingly larger Density

More information

Faster Matrix-Vector Multiplication on GeForce 8800GTX

Faster Matrix-Vector Multiplication on GeForce 8800GTX Faster Matrix-Vector Multiplication on GeForce 88GTX Noriyuki Fujimoto Graduate School of Information Science and Technology, Osaka University 1-3 Machikaneyama, Toyonaka, Osaka, 6-831, Japan fujimoto@ist.osaka-u.ac.jp

More information

Analysis of GPU Parallel Computing based on Matlab

Analysis of GPU Parallel Computing based on Matlab Analysis of GPU Parallel Computing based on Matlab Mingzhe Wang, Bo Wang, Qiu He, Xiuxiu Liu, Kunshuai Zhu (School of Computer and Control Engineering, University of Chinese Academy of Sciences, Huairou,

More information

Fast Fluid Dynamics on the Single-chip Cloud Computer

Fast Fluid Dynamics on the Single-chip Cloud Computer Fast Fluid Dynamics on the Single-chip Cloud Computer Marco Fais, Francesco Iorio High-Performance Computing Group Autodesk Research Toronto, Canada francesco.iorio@autodesk.com Abstract Fast simulation

More information

Assume a pipelined processor LD R0, (R1) // Reading (R1) into R0 at time t LD R2, (R3) // Reading the instruction at time t

Assume a pipelined processor LD R0, (R1) // Reading (R1) into R0 at time t LD R2, (R3) // Reading the instruction at time t Abbreviated Quiz1 Solution: 1. Explain the utility of each of the following DSP architectural features. Demonstrate with the aid of a simple specific example (described in pseudo-code). - Circular Buffers

More information

Building Blocks. CPUs, Memory and Accelerators

Building Blocks. CPUs, Memory and Accelerators Building Blocks CPUs, Memory and Accelerators Outline Computer layout CPU and Memory What does performance depend on? Limits to performance Silicon-level parallelism Single Instruction Multiple Data (SIMD/Vector)

More information

OpenCL Programming for the CUDA Architecture. Version 2.3

OpenCL Programming for the CUDA Architecture. Version 2.3 OpenCL Programming for the CUDA Architecture Version 2.3 8/31/2009 In general, there are multiple ways of implementing a given algorithm in OpenCL and these multiple implementations can have vastly different

More information

Parallel computing with MATLAB The MathWorks, Inc. 1

Parallel computing with MATLAB The MathWorks, Inc. 1 Parallel computing with MATLAB 2012 The MathWorks, Inc. 1 Going Beyond Serial MATLAB Applications MATLAB Desktop (Client) 2 Programming Parallel Applications (CPU) Ease of Use Built-in support with toolboxes

More information

Chapter 3.3 Computer Architecture and the Fetch-Execute Cycle

Chapter 3.3 Computer Architecture and the Fetch-Execute Cycle Chapter 3.3 Computer Architecture and the Fetch-Execute Cycle 3.3 (a) Von Neumann Architecture The earliest computing machines had fixed programs. For example, a desk calculator (in principle) is a fixed

More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

Experiencing Various Massively Parallel Architectures and Programming Models for Data-Intensive Applications

Experiencing Various Massively Parallel Architectures and Programming Models for Data-Intensive Applications Experiencing Various Massively Parallel Architectures and Programming Models for Data-Intensive Applications Hongliang Gao, Martin Dimitrov, Jingfei Kong, Huiyang Zhou School of Electrical Engineering

More information

PARALLEL PROGRAMMING MANY-CORE COMPUTING: THE LOFAR TELESCOPE (5/5) Rob van Nieuwpoort

PARALLEL PROGRAMMING MANY-CORE COMPUTING: THE LOFAR TELESCOPE (5/5) Rob van Nieuwpoort PARALLEL PROGRAMMING MANY-CORE COMPUTING: THE LOFAR TELESCOPE (5/5) Rob van Nieuwpoort rob@cs.vu.nl escience 2 Enhanced science Apply ICT in the broadest sense to other sciences Data-driven research across

More information

OpenMP Programming on ScaleMP

OpenMP Programming on ScaleMP OpenMP Programming on ScaleMP Dirk Schmidl schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) MPI vs. OpenMP MPI distributed address space explicit message passing typically code redesign

More information

Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine

Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine Ashwin Aji, Wu Feng, Filip Blagojevic and Dimitris Nikolopoulos Forecast Efficient mapping of wavefront algorithms

More information

CSCI-GA Graphics Processing Units (GPUs): Architecture and Programming. OpenCL in Action

CSCI-GA Graphics Processing Units (GPUs): Architecture and Programming. OpenCL in Action CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming OpenCL in Action Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Many slides from this lecture are adapted

More information

High Performance Computing. Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology

High Performance Computing. Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology High Performance Computing Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Why this course? Almost all computers today use parallelism As software developers,

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware

More information

3D Registration Based on Normalized Mutual Information: Performance of CPU vs. GPU Implementation

3D Registration Based on Normalized Mutual Information: Performance of CPU vs. GPU Implementation 3D Registration Based on Normalized Mutual Information: Performance of CPU vs. GPU Implementation Florian Jung, Stefan Wesarg Interactive Graphics Systems Group (GRIS), TU Darmstadt, Germany stefan.wesarg@gris.tu-darmstadt.de

More information

Evaluating polynomials in several variables and their derivatives on a GPU computing processor

Evaluating polynomials in several variables and their derivatives on a GPU computing processor Evaluating polynomials in several variables and their derivatives on a GPU computing processor Jan Verschelde joint work with Genady Yoffe University of Illinois at Chicago Department of Mathematics, Statistics,

More information

A GPU-based Algorithm-specific Optimization for High-performance Background Subtraction. Chulian Zhang, Hamed Tabkhi, Gunar Schirner

A GPU-based Algorithm-specific Optimization for High-performance Background Subtraction. Chulian Zhang, Hamed Tabkhi, Gunar Schirner A GPU-based Algorithm-specific Optimization for High-performance Background Subtraction Chulian Zhang, Hamed Tabkhi, Gunar Schirner Outline Motivation Background Related Work Approach General Optimization

More information

Agenda. Why GPUs? Parallel Processing CUDA Example GPU Memory Advanced Techniques Issues Conclusion

Agenda. Why GPUs? Parallel Processing CUDA Example GPU Memory Advanced Techniques Issues Conclusion GP- GPU Programming Agenda Why GPUs? Parallel Processing CUDA Example GPU Memory Advanced Techniques Issues Conclusion Why GPUs? Moore s Law Concurrency, not speed Hardware support Memory bandwidth Programmability

More information

GPUs for Scientific Computing

GPUs for Scientific Computing GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

More information

Experiences with Seismic Algorithms on GPUs. Scott Morton Geoscience Technology Hess Corporation

Experiences with Seismic Algorithms on GPUs. Scott Morton Geoscience Technology Hess Corporation Experiences with Seismic Algorithms on GPUs Scott Morton Geoscience Technology Hess Corporation Hess Commodity PC Cluster 2005 Hess Commodity PC Cluster 2005 John Hess: What's next? Outline History Other

More information

HP and ANSYS 17. Table of contents. HP Workstations. Technical white paper

HP and ANSYS 17. Table of contents. HP Workstations. Technical white paper Technical white paper HP and ANSYS 17 HP Workstations Table of contents ANSYS Mechanical... 2 HP Workstation recommendations for running ANSYS Mechanical... 2 HP Workstation tips for running ANSYS Mechanical...

More information

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach S. M. Ashraful Kadir 1 and Tazrian Khan 2 1 Scientific Computing, Royal Institute of Technology (KTH), Stockholm, Sweden smakadir@csc.kth.se,

More information