1 Hardware-Aware Analysis and Optimization of Stable Fluids. Presentation Date: Sep 15th, 2009. Chrissie C. Cui
2 Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching Conclusions and Future Work
3 Introduction
Implement Jos Stam's Stable Fluids simulation algorithm on CPU, GPU and Cell.
Detailed flop and bandwidth analysis for each computational stage and each implementation.
Propose new schemes to solve the processor idle problem and the performance loss caused by random memory access.
4 Highlights
Stam's fluid algorithm is bandwidth-bound: the cores sit idle up to 96% of the time!
Detailed flop and bandwidth analysis of each computational stage.
Make use of otherwise idle processors by using higher-order Mehrstellen methods.
Adopt a simple static caching scheme to beat the performance barrier in the semi-Lagrangian advection step (99% hit rate, which is very impressive).
5 Flops & Bandwidth Analysis: Overview
Assumptions: an ideal cache with maximum data reuse; three rows of the computational grid fit in cache.
Hardware: CPU: SSE4 Intel Xeon 5100 (Woodcrest); GPU: Nvidia GeForce 8800 Ultra and GeForce 7900; Cell: IBM Cell QS20.
Code versions: CPU: Stam's open-source implementation; GPU: Harris's implementation; Cell: Theodore Kim's implementation.
A multiply-add is counted as one operation on all architectures. The following analysis is in 2D; the 3D results are obtained by applying the same analysis.
6 Flops & Bandwidth Analysis: Add Source
Analysis: within each loop iteration, 2 loads + 1 store, applied to the two scalar components of velocity and the single density field.
Flops: 3N². Bandwidth: 9N².
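Stam's reference implementation is open source, so the add-source loop can be sketched directly; a minimal version (array names follow his C code):

```c
/* Add a source field s into a field x, scaled by the timestep dt.
   Per cell: 2 loads (x, s) + 1 store (x), and one multiply-add.
   Applied to u, v and the density field, this gives the slide's
   counts of 3N^2 flops and 9N^2 words of memory traffic. */
void add_source(int n, float *x, const float *s, float dt)
{
    int size = (n + 2) * (n + 2);   /* interior grid plus boundary cells */
    for (int i = 0; i < size; i++)
        x[i] += dt * s[i];
}
```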
7 Flops & Bandwidth Analysis: Diffusion
Analysis: perform I solver iterations over each grid cell.
Per cell per iteration: 1 store for x[i][j], 1 load for x0[i][j]; under the ideal cache assumption, x[i][j+1] is the only fresh neighbor load.
3 adds and 1 multiply-add in each iteration.
Flops: (3 + 12I)N². Bandwidth: (9 + 9I)N².
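The diffusion solve in Stam's C code is a short relaxation loop over the linear system; a sketch (boundary handling omitted for brevity):

```c
/* Relaxation solve for diffusion, iters iterations (after Stam's code).
   Per cell per iteration: 1 store for x[i][j] and 1 load for x0[i][j];
   with an ideal cache holding three grid rows, only x[i][j+1] is a
   fresh load among the four neighbor reads. */
#define IX(i, j) ((i) + (N + 2) * (j))

void diffuse(int N, float *x, const float *x0, float a, int iters)
{
    for (int k = 0; k < iters; k++)
        for (int j = 1; j <= N; j++)
            for (int i = 1; i <= N; i++)
                x[IX(i, j)] = (x0[IX(i, j)] +
                               a * (x[IX(i - 1, j)] + x[IX(i + 1, j)] +
                                    x[IX(i, j - 1)] + x[IX(i, j + 1)])) /
                              (1 + 4 * a);
}
```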
8 Flops & Bandwidth Analysis: Projection I
Three substages: divergence computation, pressure computation, final projection.
Divergence computation: 1 store for div[i][j], 1 load for u, 1 load for a new row of v; 2 subtractions, 1 add, 1 multiply. Flops: 3N². Bandwidth: 4N².
Pressure computation: performed by a linear solver; the per-iteration counts are the same as the divergence computation. Flops: 3N². Bandwidth: 4N².
9 Flops & Bandwidth Analysis: Projection II
Final projection: loads and stores for u and v; the loads for p can be amortized into a single load; 1 subtraction and 1 multiply-add per line. Flops: 5N². Bandwidth: 4N².
Projection total: Flops: (8 + 4I)N². Bandwidth: (8 + 3I)N².
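The final projection subtracts the pressure gradient from the velocity; a sketch along the lines of Stam's project() (the divergence and pressure substages are omitted, h is the grid spacing):

```c
/* Final projection: u -= 0.5 * dp/dx, v -= 0.5 * dp/dy (central
   differences). Per line: 1 subtraction and 1 multiply-add; the loads
   of p amortize to one fresh load per cell under the ideal-cache
   assumption, matching the slide's 5N^2 flops / 4N^2 words. */
#define IX(i, j) ((i) + (N + 2) * (j))

void subtract_gradient(int N, float *u, float *v, const float *p, float h)
{
    for (int j = 1; j <= N; j++)
        for (int i = 1; i <= N; i++) {
            u[IX(i, j)] -= 0.5f * (p[IX(i + 1, j)] - p[IX(i - 1, j)]) / h;
            v[IX(i, j)] -= 0.5f * (p[IX(i, j + 1)] - p[IX(i, j - 1)]) / h;
        }
}
```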
10 Flops & Bandwidth Analysis: Advection I
Three steps: backtrace, grid index computation, and interpolation.
Backtrace: 1 multiply-add per line; loads from u and v. Flops: 2N². Bandwidth: 2N².
Grid index computation: source code shown; analysis on the next slide.
11 Flops & Bandwidth Analysis: Advection II
Grid index computation: the if statements can be written as ternaries (0 flops) and emulate a floor function, at 1 flop each for the local variable computation. Flops: 4N². Bandwidth: 0.
Interpolation: two steps, weight computation and interpolation computation.
12 Flops & Bandwidth Analysis: Advection III
Interpolation weight computation: Flops: 4N². Bandwidth: 0.
Interpolation computation: 1 load for d[i][j]; the loads of d0 cannot be amortized (unpredictable access pattern); with multiply-adds, 6 flops. Flops: 6N². Bandwidth: 5N².
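Putting the three advection steps together, a sketch of Stam's advect() with the counted operations marked in comments (boundary handling omitted):

```c
/* Semi-Lagrangian advection of field d from d0 along velocity (u, v).
   The load of d0 at the backtraced cell cannot be amortized, because
   the access pattern is data-dependent. */
#define IX(i, j) ((i) + (N + 2) * (j))

void advect(int N, float *d, const float *d0,
            const float *u, const float *v, float dt)
{
    float dt0 = dt * N;
    for (int j = 1; j <= N; j++)
        for (int i = 1; i <= N; i++) {
            /* backtrace: one multiply-add per coordinate */
            float x = i - dt0 * u[IX(i, j)];
            float y = j - dt0 * v[IX(i, j)];
            /* grid index computation: clamps and floor emulation */
            x = (x < 0.5f) ? 0.5f : (x > N + 0.5f) ? N + 0.5f : x;
            y = (y < 0.5f) ? 0.5f : (y > N + 0.5f) ? N + 0.5f : y;
            int i0 = (int)x, i1 = i0 + 1;
            int j0 = (int)y, j1 = j0 + 1;
            /* interpolation weights */
            float s1 = x - i0, s0 = 1 - s1;
            float t1 = y - j0, t0 = 1 - t1;
            /* bilinear interpolation from d0 */
            d[IX(i, j)] = s0 * (t0 * d0[IX(i0, j0)] + t1 * d0[IX(i0, j1)]) +
                          s1 * (t0 * d0[IX(i1, j0)] + t1 * d0[IX(i1, j1)]);
        }
}
```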
13 Flops & Bandwidth Analysis: Summary
2D case: Flops: (… + …I)N². Bandwidth: (… + …I)N².
3D case (extended from 2D): Flops: (… + …I)N³. Bandwidth: (71 + 15I)N³.
14 Peak Performance Estimate I
Hardware specifications:
CPU: Intel Xeon 5100, two cores at 3 GHz, each dispatching a 4-float SIMD instruction per clock cycle. Peak performance: 24 GFlop/s. Peak memory bandwidth: … GB/s.
GPU: Nvidia GeForce 8800 Ultra, 128 scalar cores at 1.5 GHz. Peak performance: 192 GFlop/s. Peak memory bandwidth: 103.7 GB/s.
Cell: IBM QS20 Cell blade, two Cell chips at 3.2 GHz with 8 Synergistic Processing Elements (SPEs) per chip, each dispatching 4-float SIMD instructions every clock cycle. Peak performance: … GFlop/s. Peak memory bandwidth: 25.6 GB/s.
Performance is evaluated from the equations developed above (Table 1 on the next page).
15 Peak Performance Estimate II
Table 1: Estimated peak frames per second of Stable Fluids over different resolutions for several architectures. Peak performance is estimated for each architecture assuming the computation is compute-bound (i.e. infinite bandwidth is available) and bandwidth-bound (i.e. infinite flops are available). The lesser of these two quantities is the more realistic estimate. In all cases, the algorithm is bandwidth-bound.
Ratio of computation rate to data arrival rate: 2D: CPU 6.65x, GPU 5.47x, Cell 23.66x faster; 3D: CPU 4.47x, GPU 3.89x, Cell 16.8x faster.
Processor idle rate: 2D: CPU 85%, GPU 82%, Cell 96%; 3D: CPU 79%, GPU 74%, Cell 94%.
16 Peak Performance Estimate III
Arithmetic intensity: what happens as I (the iteration count) goes to infinity?
A reasonable explanation: algorithms run well on the Cell and GPU when their arithmetic intensities are much greater than one. Since both the 2D and 3D cases remain close to one, the available flops will be underutilized.
17 Frame Rate Performance Measurement
Table 2: Theoretical peak frames per second (the bandwidth-bound values from Table 1) and actual measured frames per second. None of the measured times exceed the predicted theoretical peaks, validating the finding that the algorithm is bandwidth-bound. A GeForce 7900 was used for the 16-bit timings because its frame rates were uniformly superior.
Findings: the predicted theoretical peaks were never exceeded, providing additional evidence that the algorithm is bandwidth-bound. On both the GPU and Cell, the theoretical peak is more closely approached as the resolution increases (larger coherent loads).
18 Mehrstellen Schemes: Background
Poisson solver for the diffusion and projection stages, in discretized form and rewritten in matrix form.
Moving from a 2nd-order to a 4th-order discretization reduces the number of iterations required.
19 Mehrstellen Schemes: Details
An alternate discretization that increases the accuracy from second to fourth order without significantly increasing the complexity of the memory access pattern, in both 2D and 3D.
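The stencil figures on this slide do not survive transcription. For reference, the standard compact fourth-order (Mehrstellen) discretization of the 2D Poisson equation \(\nabla^2 u = f\) on a grid with spacing \(h\) uses the same neighborhood as the five-point stencil plus its four corners; this is a reconstruction of the well-known nine-point scheme, not copied from the slides:

\[
\frac{1}{6h^2}\Bigl[\,4\,(u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1})
+ (u_{i+1,j+1} + u_{i+1,j-1} + u_{i-1,j+1} + u_{i-1,j-1}) - 20\,u_{i,j}\Bigr]
= f_{i,j} + \frac{h^2}{12}\,\bigl(\nabla^2 f\bigr)_{i,j}
\]

Because the extra corner terms lie in rows that are already resident under the three-row cache assumption, the memory access pattern is essentially unchanged, which is the property the slide emphasizes.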
20 Mehrstellen Schemes: Results I
Spectral radius of the resulting matrix: the error of the current solution is multiplied by the spectral radius of the Jacobi matrix every iteration.
Expectation: if this radius is significantly smaller than that of the second-order discretization, then fewer Jacobi iterations are needed overall.
The slide compares the spectral radius of Jacobi iteration using the Mehrstellen scheme against the equivalent radius of the standard Jacobi matrix, and gives the number of iterations it would take Mehrstellen Jacobi to achieve an error reduction equivalent to 20 iterations of standard Jacobi.
21 Mehrstellen Schemes: Results II
Table 3: Spectral radii of the fourth-order accurate Mehrstellen Jacobi matrix (M) and the standard second-order accurate Jacobi matrix (S). The third column gives the number of Mehrstellen iterations necessary to match the error reduction of 20 standard iterations. The last column is the fraction of a Mehrstellen iteration necessary to match the error reduction of one standard iteration.
22 Advection Caching Scheme
Physical characteristics: there are reasons to expect that the majority of the vector field exhibits high spatial locality. The timestep size in practice is quite small; the projection and diffusion operators smear out the velocity field; and large velocities quickly dissipate into smaller ones in both space and time.
Exploiting this: assume that most advection rays terminate in regions very close to their origins.
Static caching scheme, a two-part approach: prefetch rows j-1, j, and j+1 from the d0 array. While iterating over the elements of row j, first check whether the semi-Lagrangian ray terminated in a 3x3 neighborhood of its origin. If so, use the prefetched d0 values for the interpolation; otherwise, perform the more expensive fetch from main memory.
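The cache-hit test at the heart of the scheme is a cheap neighborhood check on the backtraced cell index; a hypothetical sketch (the function name and layout are illustrative, not taken from the paper's code):

```c
#include <stdlib.h>

/* Returns nonzero if the backtraced ray from origin cell (i, j) ends in
   cell (i0, j0) within the 3x3 neighborhood of the origin, i.e. within
   the prefetched rows j-1, j, j+1 and columns i-1, i, i+1. On a hit,
   interpolation can use the prefetched d0 rows; on a miss, a slow fetch
   from main memory is required. */
int ray_hits_cache(int i, int j, int i0, int j0)
{
    return abs(i0 - i) <= 1 && abs(j0 - j) <= 1;
}
```

With the measured miss rates below 1%, almost every cell takes the fast prefetched path, which is what recovers the lost advection bandwidth on the Cell.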
23 Advection Caching Tests
Two test scenes.
2D scene: eight jets of velocity and density were injected into a 512² simulation at different points and in different directions, in order to induce a wide variety of directions in the velocity field.
3D scene: a buoyant pocket of smoke is continually inserted into a 64³ simulation.
Cache miss rate: 2D: the miss rate never exceeds 0.65%; 3D: it never exceeds 0.44%.
Bandwidth test for the 2D scene on the Cell: bandwidth achieved by the advection stage with and without the static cache.
24 Conclusion & Future Work
A detailed flop and bandwidth analysis of the implementation of Stable Fluids on current CPU, GPU and Cell architectures.
Proved theoretically and experimentally that the performance of the algorithm is bandwidth-bound.
Proposed the use of the Mehrstellen discretization to reduce the number of Jacobi solver iterations and thereby the processor idle rate; this scheme allows the linear solver to terminate 17% earlier in 2D and 33% earlier in 3D.
Designed a static caching scheme for the advection stage that makes more effective use of the available memory bandwidth; a 2x speedup is measured in the advection stage using this scheme on the Cell.
Future work: map algorithms that handle free-surface cases to parallel architectures and perform the corresponding performance analysis; develop Mehrstellen-like discretization schemes for PCG solvers.
25 Thanks for your attention. Questions?
Reference: Theodore Kim, "Hardware-Aware Analysis and Optimization of Stable Fluids", IBM TJ Watson Research Center.
Experiencing Various Massively Parallel Architectures and Programming Models for DataIntensive Applications Hongliang Gao, Martin Dimitrov, Jingfei Kong, Huiyang Zhou School of Electrical Engineering
More informationPARALLEL PROGRAMMING MANYCORE COMPUTING: THE LOFAR TELESCOPE (5/5) Rob van Nieuwpoort
PARALLEL PROGRAMMING MANYCORE COMPUTING: THE LOFAR TELESCOPE (5/5) Rob van Nieuwpoort rob@cs.vu.nl escience 2 Enhanced science Apply ICT in the broadest sense to other sciences Datadriven research across
More informationOpenMP Programming on ScaleMP
OpenMP Programming on ScaleMP Dirk Schmidl schmidl@rz.rwthaachen.de Rechen und Kommunikationszentrum (RZ) MPI vs. OpenMP MPI distributed address space explicit message passing typically code redesign
More informationCellSWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine
CellSWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine Ashwin Aji, Wu Feng, Filip Blagojevic and Dimitris Nikolopoulos Forecast Efficient mapping of wavefront algorithms
More informationCSCIGA Graphics Processing Units (GPUs): Architecture and Programming. OpenCL in Action
CSCIGA.3033004 Graphics Processing Units (GPUs): Architecture and Programming OpenCL in Action Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Many slides from this lecture are adapted
More informationHigh Performance Computing. Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology
High Performance Computing Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Why this course? Almost all computers today use parallelism As software developers,
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware
More information3D Registration Based on Normalized Mutual Information: Performance of CPU vs. GPU Implementation
3D Registration Based on Normalized Mutual Information: Performance of CPU vs. GPU Implementation Florian Jung, Stefan Wesarg Interactive Graphics Systems Group (GRIS), TU Darmstadt, Germany stefan.wesarg@gris.tudarmstadt.de
More informationEvaluating polynomials in several variables and their derivatives on a GPU computing processor
Evaluating polynomials in several variables and their derivatives on a GPU computing processor Jan Verschelde joint work with Genady Yoffe University of Illinois at Chicago Department of Mathematics, Statistics,
More informationA GPUbased Algorithmspecific Optimization for Highperformance Background Subtraction. Chulian Zhang, Hamed Tabkhi, Gunar Schirner
A GPUbased Algorithmspecific Optimization for Highperformance Background Subtraction Chulian Zhang, Hamed Tabkhi, Gunar Schirner Outline Motivation Background Related Work Approach General Optimization
More informationAgenda. Why GPUs? Parallel Processing CUDA Example GPU Memory Advanced Techniques Issues Conclusion
GP GPU Programming Agenda Why GPUs? Parallel Processing CUDA Example GPU Memory Advanced Techniques Issues Conclusion Why GPUs? Moore s Law Concurrency, not speed Hardware support Memory bandwidth Programmability
More informationGPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk OxfordMan Institute of Quantitative Finance Oxford University Mathematical Institute Oxford eresearch
More informationExperiences with Seismic Algorithms on GPUs. Scott Morton Geoscience Technology Hess Corporation
Experiences with Seismic Algorithms on GPUs Scott Morton Geoscience Technology Hess Corporation Hess Commodity PC Cluster 2005 Hess Commodity PC Cluster 2005 John Hess: What's next? Outline History Other
More informationHP and ANSYS 17. Table of contents. HP Workstations. Technical white paper
Technical white paper HP and ANSYS 17 HP Workstations Table of contents ANSYS Mechanical... 2 HP Workstation recommendations for running ANSYS Mechanical... 2 HP Workstation tips for running ANSYS Mechanical...
More informationParallel Ray Tracing using MPI: A Dynamic Loadbalancing Approach
Parallel Ray Tracing using MPI: A Dynamic Loadbalancing Approach S. M. Ashraful Kadir 1 and Tazrian Khan 2 1 Scientific Computing, Royal Institute of Technology (KTH), Stockholm, Sweden smakadir@csc.kth.se,
More information