GPU Acceleration of the SENSEI CFD Code Suite
|
|
|
- Darcy Ford
- 9 years ago
- Views:
Transcription
1 GPU Acceleration of the SENSEI CFD Code Suite Chris Roy, Brent Pickering, Chip Jackson, Joe Derlaga, Xiao Xu Aerospace and Ocean Engineering Primary Collaborators: Tom Scogland, Wu Feng (Computer Science) Kasia Swirydowicz, Arielle Grim-McNally, Eric de Sturler (Math) Ross Glandon, Paul Tranquilli, Adrian Sandu (Computer Science) Virginia Tech, Blacksburg, VA,
2 SENSEI CFD Code SENSEI: Structured Euler/Navier-Stokes Explicit Implicit SENSEI: solves Euler and Navier-Stokes equations 2D/3D multi-block structured grids Finite volume discretization Currently for steady-state problems Explicit: Euler explicit and multi-stage Runge-Kutta Implicit: Euler implicit with linear solves via basic GMRES Written in modern Fortran 03/08 and employs object-oriented features such as derived types and procedure pointers Currently uses OpenMP and MPI for parallelism Employs Array-of-Struct data structures (problems for GPUs ) SENSEI leverages another grant on error estimation and control for CFD (in AFOSR Computational Math) 2
3 SENSEI-Lite CFD Code Preliminary testing of directive-based parallelism employs simplified version: SENSEI-Lite 2D Navier-Stokes equations Finite difference discretization on Cartesian grids Employs same artificial compressibility approach as SENSEI for extending compressible CFD code to incompressible flows Avoids Array-of-Struct data structures 3
4 Collaborations with SENSEI Code Suite Joe Derlaga (PhD Student) Xiao Xu (AOE/Math Postdoc) Brent Pickering (PhD Student) 4
5 Collaborations w/ Math Collaborations w/ Eric de Sturler (Math) and his group for expertise in parallel implicit solvers SENSEI only has basic GMRES with very limited preconditioners Work has begun to incorporate advanced parallel solvers and preconditioners into SENSEI Work is near completion for serial and parallel solvers and preconditioners in SENSEI-Lite 5
6 Collaborations w/ CS (#1) Collaborations w/ Adrian Sandu (CS) and his group for expertise in implicit/explicit time integrators for unsteady flows MAV application requires large variations in cell size from freestream to near-wall boundary layer Explicit schemes tend to be more naturally parallel, but suffer from severe stability limits for small cells Initial collaborations begun using SENSEI-Lite 6
7 Collaborations w/ CS (#2) Collaborations w/ Wu Feng (CS) and his group for expertise in OpenACC directives and hardware / software optimization Initial work has focused on explicit version of SENSEI-Lite Tight collaborations with Wu Feng and Tom Scogland resulted in AIAA Paper at SciTech (journal submission soon) Our collaborations have establish potential paths forward for GPU parallelization of full SENSEI 7
8 Introduction Directive Based Programming provides an alternative to platform specific languages such as CUDA C. Directives or pragmas (in Fortran or C/C++) permit re-use of existing source codes ideally with no modifications. Maintain a single cross-platform code base. Easily incorporate legacy code Shifts re-factoring burden to compiler. Compiler vendors can extend functionality to new architectures. Source code expresses algorithm, not implementation. Directive based APIs are already familiar to many computational scientists (OpenMP). 8
9 Efficient GPU Data Storage SOA preferred over AOS on GPU. Permits contiguous access on GPUs and SIMD hardware. Avoids need for scatter-gather. Array of Struct (AOS) Pressure U-velocity V-velocity Pressure U-velocity V-velocity Pressure U-velocity V-velocity Node 1 Node 1 Node 1 Node 2 Node 2 Node 2 Node 3 Node 3 Node 3 Struct of Array (SOA) Pressure Pressure Pressure U-velocity U-velocity U-velocity V-velocity V-velocity V-velocity Node 1 Node 2 Node 3 Node 1 Node 2 Node 3 Node 1 Node 2 Node 3 Layouts in linear memory for a sequence of 3 grid nodes (3-DOF per node). Red nodes represent memory access for a three-point stencil. 9
10 CFD Code: SENSEI-Lite Solves 2D incompressible Navier Stokes using the artificial compressibility method (Chorin). Nonlinear system with 3 equations (cons. of mass, x+y momentum). Fourth derivative damping (artificial viscosity). Steady state formulation. 1 p β 2 t + u j 4 p λ x j Δx j C j 4 j x = 0 j u i t + u j u i x j + 1 ρ p x i ν 2 u i x j 2 = 0 (Cons. of mass) (Momentum) Compressibility term β is calculated from local flow characteristics. 10
11 CFD Code: SENSEI-Lite Structured grid FDM with second order accurate spatial discretization. 3 D.O.F. per node. 5-point and 9-point stencils require 19 data loads per node (152B double precision, 76B with single precision) 130 FLOPS per node, including 3 FP divide and 2 full-precision square roots. Explicit time integration (forward Euler). Keeps the performance focus on the stencil operations not linear solvers. Data parallel algorithm stores two solution copies (time step n and n+1). Verified code using Method of Manufactured Solutions. Relatively simple code Easy to modify and experiment with. 11
12 CFD Code: SENSEI-Lite Benchmark case: 2D lid driven cavity flow. All cases used fixed-size square domain, with lid velocity 1 m/s, Re 100, and density constant 1 kg/m 3. Uniform Cartesian computational grids were used, ranging in size from to nodes. (Max size: 200 million DOF) 12
13 GFLOPS Preliminary Code Modifications Reduce memory traffic and eliminate temporary arrays. Fuse loops having only local dependencies (e.g., artificial viscosity and residual). Eliminate unnecessary mem-copy Original Code Loop fusion (remove temporary arrays) Loop fusion+ Replace copy with pointer swap 13
14 OpenACC Performance Evaluated on three models of NVIDIA GPU representing two microarchitectures (Fermi, Kepler). Compiler = PGI 13.6 and For NVIDIA case, compiler automatically generated binaries for Compute Capability 1.x, 2.x, and 3.x devices. Experimented with manual optimization of the OpenACC code. 14
15 OpenACC Performance Manual Performance Tuning Compiler defaulted to D thread-block configuration, but this could be overridden using the vector clause. On CUDA devices, launch configuration can have a significant effect on performance. Utilization of shared memory and registers. Can lead to variations in occupancy. Shape of the thread blocks can affect the layout of the shared data structures possible bank conflicts. Used a fixed size problem of 4097x4097 nodes (50 million DOF), and explored full parameter space of thread-block dimensions. 15
16 OpenACC Performance Manual Performance Tuning (Double Precision, 4097x4097 cavity) C2075 K20c 16
17 OpenACC Performance Default and optimized thread-block dimensions PGI 13.6 Double Precision C2075 K20c 17
18 OpenACC Performance Single vs Double Precision Single prec. maximum throughput 2-3x greater than double prec. 32-bit data types require half the memory bandwidth, half the register file and half the shared memory space of 64-bit types. Trivial to switch in Fortran by editing the precision parameter used for the real data type. Like SENSEI, INS code uses ISO C binding module precision was always c_double or c_float. For the explicit INS code, single precision was adequate for converged solution. Lost about 7 significant digits, equivalent to round-off error. May be more of a problem with implicit algorithms. 18
19 OpenACC Performance Manual Performance Tuning (Single Precision, 4097x4097 cavity) C2075 K20c 19
20 OpenACC Performance Summary Tesla C2075 (Fermi) Double Precision (GFLOPS) Single Precision (GFLOPS) Speedup (%) Tesla K20c (Kepler) Double Precision (GFLOPS) Single Precision (GFLOPS) Speedup (%) Default Block Size (GFLOPS) Optimal Block Size (GFLOPS) Speedup (%) % % 44.6% 57.4% Default Block Size (GFLOPS) Optimal Block Size (GFLOPS) Speedup (%) % % 117.8% 69.4% Optimal Thread-block Dimension C2075 (Fermi) Double Precision 16x4 (16x8) Single Precision 16x6 K20c (Kepler) 16x8 32x4 20
21 OpenACC Performance PGI beta. Multiple GPU Used domain decomposition. OpenMP parallel region with one GPU per thread. Boundary values are transferred between regions on each iteration. Incurs thread-synch overhead and requires data transfer across PCIe. OpenACC does not expose mechanism for direct GPU to GPU copy permits use of AMD GPUs, such as the HD is equivalent to two 7970 South Islands dies on a single card. Requires multi-gpu code to make use of both simultaneously. 21
22 GFLOPS OpenACC Performance Multiple GPU NVIDIA c2070 NVIDIA k20x AMD Radeon Number of GPU dies 22
23 OpenACC Performance OpenACC speedup over single CPU thread CPU version compiled from the same source code using PGI vs. Sandy Bridge (E5-2687W) 35 vs. Nehalem (X5560) OpenACC (NVIDIA C2075) OpenACC (NVIDIA K20C) OpenACC (NVIDIA K20X) OpenACC (AMD 7990 (1 of 2 dies)) 0 OpenACC (NVIDIA C2075) OpenACC (NVIDIA K20C) OpenACC (NVIDIA K20X) OpenACC (AMD (1 of 2 dies))
24 GFLOPS OpenACC Performance 120 OpenACC (GPU) performance vs OpenMP (CPU) performance OpenMP (Xeon X5560, 8T/8C) OpenMP (Xeon E5-2687W, 16T/16C) OpenACC (NVIDIA C2075) OpenACC (NVIDIA K20C) OpenACC (NVIDIA K20X) OpenACC (AMD 7990 [1 of 2 24 dies])
25 OpenACC Performance Advantage of software managed cache on GPU for computation on structured grids avoids conflict misses. Nehalem 16T/8C K20c Cache conflict misses. Uniform performance. 25
26 Summary and Path Forward OpenACC provided good performance on GPU, and in conjunction with OpenMP directives permitted a single code base to serve CPU and GPU platforms robust to rapid changes is hardware platforms. However Decided not to apply the OpenACC 1.0 API to SENSEI. Requires too many alterations to current code base. Restricts use of object-oriented features in modern Fortran. Looking to other emerging standards that also support GPUs and other accelerators: OpenACC 2.0 and OpenMP
27 Publications J. M. Derlaga, T. S. Phillips, and C. J. Roy, SENSEI Computational Fluid Dynamics Code: A Case Study in Modern Fortran Software Development, AIAA Paper , 21 st AIAA Computational Fluid Dynamics Conference, San Diego, CA, June 24-27, B. P. Pickering, C. W. Jackson, T. R. W. Scogland, W.-C. Feng, and C. J. Roy, Directive-Based GPU Programming for Computational Fluid Dynamics, AIAA Paper , 52 nd Aerospace Sciences Meeting, National Harbor, MD, January 13-17, Both are in preparation for journal submission 27
PyFR: Bringing Next Generation Computational Fluid Dynamics to GPU Platforms
PyFR: Bringing Next Generation Computational Fluid Dynamics to GPU Platforms P. E. Vincent! Department of Aeronautics Imperial College London! 25 th March 2014 Overview Motivation Flux Reconstruction Many-Core
Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing
Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools
Accelerating CFD using OpenFOAM with GPUs
Accelerating CFD using OpenFOAM with GPUs Authors: Saeed Iqbal and Kevin Tubbs The OpenFOAM CFD Toolbox is a free, open source CFD software package produced by OpenCFD Ltd. Its user base represents a wide
Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga
Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.
CUDA programming on NVIDIA GPUs
p. 1/21 on NVIDIA GPUs Mike Giles [email protected] Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view
Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
GPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
Turbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS
A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 [email protected] THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833
GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy
GPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles [email protected] Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
Parallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
GPGPU accelerated Computational Fluid Dynamics
t e c h n i s c h e u n i v e r s i t ä t b r a u n s c h w e i g Carl-Friedrich Gauß Faculty GPGPU accelerated Computational Fluid Dynamics 5th GACM Colloquium on Computational Mechanics Hamburg Institute
Introduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware
Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms
Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms Björn Rocker Hamburg, June 17th 2010 Engineering Mathematics and Computing Lab (EMCL) KIT University of the State
Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators
A Pattern-Based Comparison of OpenACC & OpenMP for Accelerators Sandra Wienke 1,2, Christian Terboven 1,2, James C. Beyer 3, Matthias S. Müller 1,2 1 IT Center, RWTH Aachen University 2 JARA-HPC, Aachen
A Fast Double Precision CFD Code using CUDA
A Fast Double Precision CFD Code using CUDA Jonathan M. Cohen *, M. Jeroen Molemaker** *NVIDIA Corporation, Santa Clara, CA 95050, USA (e-mail: [email protected]) **IGPP UCLA, Los Angeles, CA 90095, USA
5x in 5 hours Porting SEISMIC_CPML using the PGI Accelerator Model
5x in 5 hours Porting SEISMIC_CPML using the PGI Accelerator Model C99, C++, F2003 Compilers Optimizing Vectorizing Parallelizing Graphical parallel tools PGDBG debugger PGPROF profiler Intel, AMD, NVIDIA
Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers
Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers Haohuan Fu [email protected] High Performance Geo-Computing (HPGC) Group Center for Earth System Science Tsinghua University
Case Study on Productivity and Performance of GPGPUs
Case Study on Productivity and Performance of GPGPUs Sandra Wienke [email protected] ZKI Arbeitskreis Supercomputing April 2012 Rechen- und Kommunikationszentrum (RZ) RWTH GPU-Cluster 56 Nvidia
CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application
OpenACC Parallelization and Optimization of NAS Parallel Benchmarks
OpenACC Parallelization and Optimization of NAS Parallel Benchmarks Presented by Rengan Xu GTC 2014, S4340 03/26/2014 Rengan Xu, Xiaonan Tian, Sunita Chandrasekaran, Yonghong Yan, Barbara Chapman HPC Tools
Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1
Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?
ST810 Advanced Computing
ST810 Advanced Computing Lecture 17: Parallel computing part I Eric B. Laber Hua Zhou Department of Statistics North Carolina State University Mar 13, 2013 Outline computing Hardware computing overview
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,
High Performance Computing in CST STUDIO SUITE
High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver
GPGPU acceleration in OpenFOAM
Carl-Friedrich Gauß Faculty GPGPU acceleration in OpenFOAM Northern germany OpenFoam User meeting Braunschweig Institute of Technology Thorsten Grahs Institute of Scientific Computing/move-csc 2nd October
OpenACC 2.0 and the PGI Accelerator Compilers
OpenACC 2.0 and the PGI Accelerator Compilers Michael Wolfe The Portland Group [email protected] This presentation discusses the additions made to the OpenACC API in Version 2.0. I will also present
NVIDIA Tesla K20-K20X GPU Accelerators Benchmarks Application Performance Technical Brief
NVIDIA Tesla K20-K20X GPU Accelerators Benchmarks Application Performance Technical Brief NVIDIA changed the high performance computing (HPC) landscape by introducing its Fermibased GPUs that delivered
Modeling Rotor Wakes with a Hybrid OVERFLOW-Vortex Method on a GPU Cluster
Modeling Rotor Wakes with a Hybrid OVERFLOW-Vortex Method on a GPU Cluster Mark J. Stock, Ph.D., Adrin Gharakhani, Sc.D. Applied Scientific Research, Santa Ana, CA Christopher P. Stone, Ph.D. Computational
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance
Evaluation of CUDA Fortran for the CFD code Strukti
Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center
The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System
The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Qingyu Meng, Alan Humphrey, Martin Berzins Thanks to: John Schmidt and J. Davison de St. Germain, SCI Institute Justin Luitjens
Parallel Computing with MATLAB
Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best
HPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
Acceleration of a CFD Code with a GPU
Acceleration of a CFD Code with a GPU Dennis C. Jespersen ABSTRACT The Computational Fluid Dynamics code Overflow includes as one of its solver options an algorithm which is a fairly small piece of code
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:
C3.8 CRM wing/body Case
C3.8 CRM wing/body Case 1. Code description XFlow is a high-order discontinuous Galerkin (DG) finite element solver written in ANSI C, intended to be run on Linux-type platforms. Relevant supported equation
Scalability evaluation of barrier algorithms for OpenMP
Scalability evaluation of barrier algorithms for OpenMP Ramachandra Nanjegowda, Oscar Hernandez, Barbara Chapman and Haoqiang H. Jin High Performance Computing and Tools Group (HPCTools) Computer Science
Introduction to GPU Programming Languages
CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure
GPU Hardware and Programming Models. Jeremy Appleyard, September 2015
GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once
OpenACC Basics Directive-based GPGPU Programming
OpenACC Basics Directive-based GPGPU Programming Sandra Wienke, M.Sc. [email protected] Center for Computing and Communication RWTH Aachen University Rechen- und Kommunikationszentrum (RZ) PPCES,
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,
Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61
F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase
Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures
Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures E Calore, S F Schifano, R Tripiccione Enrico Calore INFN Ferrara, Italy Perspectives of GPU Computing in Physics
Large-Scale Reservoir Simulation and Big Data Visualization
Large-Scale Reservoir Simulation and Big Data Visualization Dr. Zhangxing John Chen NSERC/Alberta Innovates Energy Environment Solutions/Foundation CMG Chair Alberta Innovates Technology Future (icore)
Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles [email protected] hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
CSE Case Study: Optimising the CFD code DG-DES
CSE Case Study: Optimising the CFD code DG-DES CSE Team NAG Ltd., [email protected] Naveed Durrani University of Sheffield June 2008 Introduction One of the activities of the NAG CSE (Computational
TWO-DIMENSIONAL FINITE ELEMENT ANALYSIS OF FORCED CONVECTION FLOW AND HEAT TRANSFER IN A LAMINAR CHANNEL FLOW
TWO-DIMENSIONAL FINITE ELEMENT ANALYSIS OF FORCED CONVECTION FLOW AND HEAT TRANSFER IN A LAMINAR CHANNEL FLOW Rajesh Khatri 1, 1 M.Tech Scholar, Department of Mechanical Engineering, S.A.T.I., vidisha
OpenACC Programming and Best Practices Guide
OpenACC Programming and Best Practices Guide June 2015 2015 openacc-standard.org. All Rights Reserved. Contents 1 Introduction 3 Writing Portable Code........................................... 3 What
Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011
Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
AeroFluidX: A Next Generation GPU-Based CFD Solver for Engineering Applications
AeroFluidX: A Next Generation GPU-Based CFD Solver for Engineering Applications Dr. Bjoern Landmann Dr. Kerstin Wieczorek Stefan Bachschuster 18.03.2015 FluiDyna GmbH, Lichtenbergstr. 8, 85748 Garching
Keys to node-level performance analysis and threading in HPC applications
Keys to node-level performance analysis and threading in HPC applications Thomas GUILLET (Intel; Exascale Computing Research) IFERC seminar, 18 March 2015 Legal Disclaimer & Optimization Notice INFORMATION
Module 6 Case Studies
Module 6 Case Studies 1 Lecture 6.1 A CFD Code for Turbomachinery Flows 2 Development of a CFD Code The lecture material in the previous Modules help the student to understand the domain knowledge required
A Case Study - Scaling Legacy Code on Next Generation Platforms
Available online at www.sciencedirect.com ScienceDirect Procedia Engineering 00 (2015) 000 000 www.elsevier.com/locate/procedia 24th International Meshing Roundtable (IMR24) A Case Study - Scaling Legacy
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Amin Safi Faculty of Mathematics, TU dortmund January 22, 2016 Table of Contents Set
YALES2 porting on the Xeon- Phi Early results
YALES2 porting on the Xeon- Phi Early results Othman Bouizi Ghislain Lartigue Innovation and Pathfinding Architecture Group in Europe, Exascale Lab. Paris CRIHAN - Demi-journée calcul intensif, 16 juin
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
CFD simulations using an AMR-like approach in the PDE Framework Peano
CFD simulations using an AMR-like approach in the PDE Framework Peano, Fakultät für Informatik Technische Universität München Germany Miriam Mehl, Hans-Joachim Bungartz, Takayuki Aoki Outline PDE Framework
Effect of Aspect Ratio on Laminar Natural Convection in Partially Heated Enclosure
Universal Journal of Mechanical Engineering (1): 8-33, 014 DOI: 10.13189/ujme.014.00104 http://www.hrpub.org Effect of Aspect Ratio on Laminar Natural Convection in Partially Heated Enclosure Alireza Falahat
HP ProLiant SL270s Gen8 Server. Evaluation Report
HP ProLiant SL270s Gen8 Server Evaluation Report Thomas Schoenemeyer, Hussein Harake and Daniel Peter Swiss National Supercomputing Centre (CSCS), Lugano Institute of Geophysics, ETH Zürich [email protected]
Learn CUDA in an Afternoon: Hands-on Practical Exercises
Learn CUDA in an Afternoon: Hands-on Practical Exercises Alan Gray and James Perry, EPCC, The University of Edinburgh Introduction This document forms the hands-on practical component of the Learn CUDA
ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU
Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents
Differential Relations for Fluid Flow. Acceleration field of a fluid. The differential equation of mass conservation
Differential Relations for Fluid Flow In this approach, we apply our four basic conservation laws to an infinitesimally small control volume. The differential approach provides point by point details of
NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model
NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model Rengan Xu, Xiaonan Tian, Sunita Chandrasekaran, Yonghong Yan, and Barbara Chapman Department of Computer Science, University
1. If we need to use each thread to calculate one output element of a vector addition, what would
Quiz questions Lecture 2: 1. If we need to use each thread to calculate one output element of a vector addition, what would be the expression for mapping the thread/block indices to data index: (A) i=threadidx.x
Spatial Discretisation Schemes in the PDE framework Peano for Fluid-Structure Interactions
Spatial Discretisation Schemes in the PDE framework Peano for Fluid-Structure Interactions T. Neckel, H.-J. Bungartz, B. Gatzhammer, M. Mehl, C. Zenger TUM Department of Informatics Chair of Scientific
E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices
E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,
Computer Graphics Hardware An Overview
Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and
COMPARISON OF SOLUTION ALGORITHM FOR FLOW AROUND A SQUARE CYLINDER
Ninth International Conference on CFD in the Minerals and Process Industries CSIRO, Melbourne, Australia - December COMPARISON OF SOLUTION ALGORITHM FOR FLOW AROUND A SQUARE CYLINDER Y. Saito *, T. Soma,
OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
Part II: Finite Difference/Volume Discretisation for CFD
Part II: Finite Difference/Volume Discretisation for CFD Finite Volume Metod of te Advection-Diffusion Equation A Finite Difference/Volume Metod for te Incompressible Navier-Stokes Equations Marker-and-Cell
Experiences on using GPU accelerators for data analysis in ROOT/RooFit
Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,
High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates
High Performance Computing (HPC) CAEA elearning Series Jonathan G. Dudley, Ph.D. 06/09/2015 2015 CAE Associates Agenda Introduction HPC Background Why HPC SMP vs. DMP Licensing HPC Terminology Types of
THE CFD SIMULATION OF THE FLOW AROUND THE AIRCRAFT USING OPENFOAM AND ANSA
THE CFD SIMULATION OF THE FLOW AROUND THE AIRCRAFT USING OPENFOAM AND ANSA Adam Kosík Evektor s.r.o., Czech Republic KEYWORDS CFD simulation, mesh generation, OpenFOAM, ANSA ABSTRACT In this paper we describe
CFD Applications using CFD++ Paul Batten & Vedat Akdag
CFD Applications using CFD++ Paul Batten & Vedat Akdag Metacomp Products available under Altair Partner Program CFD++ Introduction Accurate multi dimensional polynomial framework Robust on wide variety
Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors
Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors Joe Davis, Sandeep Patel, and Michela Taufer University of Delaware Outline Introduction Introduction to GPU programming Why MD
OpenCL for programming shared memory multicore CPUs
Akhtar Ali, Usman Dastgeer and Christoph Kessler. OpenCL on shared memory multicore CPUs. Proc. MULTIPROG-212 Workshop at HiPEAC-212, Paris, Jan. 212. OpenCL for programming shared memory multicore CPUs
MPI Hands-On List of the exercises
MPI Hands-On List of the exercises 1 MPI Hands-On Exercise 1: MPI Environment.... 2 2 MPI Hands-On Exercise 2: Ping-pong...3 3 MPI Hands-On Exercise 3: Collective communications and reductions... 5 4 MPI
Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
Building an energy dashboard. Energy measurement and visualization in current HPC systems
Building an energy dashboard Energy measurement and visualization in current HPC systems Thomas Geenen 1/58 [email protected] SURFsara The Dutch national HPC center 2H 2014 > 1PFlop GPGPU accelerators
NUMERICAL ANALYSIS OF THE EFFECTS OF WIND ON BUILDING STRUCTURES
Vol. XX 2012 No. 4 28 34 J. ŠIMIČEK O. HUBOVÁ NUMERICAL ANALYSIS OF THE EFFECTS OF WIND ON BUILDING STRUCTURES Jozef ŠIMIČEK email: [email protected] Research field: Statics and Dynamics Fluids mechanics
GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected]
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected] Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket
BLM 413E - Parallel Programming Lecture 3
BLM 413E - Parallel Programming Lecture 3 FSMVU Bilgisayar Mühendisliği Öğr. Gör. Musa AYDIN 14.10.2015 2015-2016 M.A. 1 Parallel Programming Models Parallel Programming Models Overview There are several
MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp
MPI and Hybrid Programming Models William Gropp www.cs.illinois.edu/~wgropp 2 What is a Hybrid Model? Combination of several parallel programming models in the same program May be mixed in the same source
A quick tutorial on Intel's Xeon Phi Coprocessor
A quick tutorial on Intel's Xeon Phi Coprocessor www.cism.ucl.ac.be [email protected] Architecture Setup Programming The beginning of wisdom is the definition of terms. * Name Is a... As opposed
Spring 2011 Prof. Hyesoon Kim
Spring 2011 Prof. Hyesoon Kim Today, we will study typical patterns of parallel programming This is just one of the ways. Materials are based on a book by Timothy. Decompose Into tasks Original Problem
CFD Implementation with In-Socket FPGA Accelerators
CFD Implementation with In-Socket FPGA Accelerators Ivan Gonzalez UAM Team at DOVRES FuSim-E Programme Symposium: CFD on Future Architectures C 2 A 2 S 2 E DLR Braunschweig 14 th -15 th October 2009 Outline
HPC Wales Skills Academy Course Catalogue 2015
HPC Wales Skills Academy Course Catalogue 2015 Overview The HPC Wales Skills Academy provides a variety of courses and workshops aimed at building skills in High Performance Computing (HPC). Our courses
Adaptation of General Purpose CFD Code for Fusion MHD Applications*
Adaptation of General Purpose CFD Code for Fusion MHD Applications* Andrei Khodak Princeton Plasma Physics Laboratory P.O. Box 451 Princeton, NJ, 08540 USA [email protected] Abstract Analysis of many fusion
ME6130 An introduction to CFD 1-1
ME6130 An introduction to CFD 1-1 What is CFD? Computational fluid dynamics (CFD) is the science of predicting fluid flow, heat and mass transfer, chemical reactions, and related phenomena by solving numerically
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
A Load Balancing Tool for Structured Multi-Block Grid CFD Applications
A Load Balancing Tool for Structured Multi-Block Grid CFD Applications K. P. Apponsah and D. W. Zingg University of Toronto Institute for Aerospace Studies (UTIAS), Toronto, ON, M3H 5T6, Canada Email:
