AeroFluidX: A Next Generation GPU-Based CFD Solver for Engineering Applications




AeroFluidX: A Next Generation GPU-Based CFD Solver for Engineering Applications
Dr. Bjoern Landmann, Dr. Kerstin Wieczorek, Stefan Bachschuster
18.03.2015, FluiDyna GmbH, Lichtenbergstr. 8, 85748 Garching b. München, Germany

Content
- CPU vs. GPU computing for CFD
- Culises: short summary of the hybrid GPU-CPU approach
  - Approach for partially accelerated CFD applications
  - Industrial problem set and achievable speedups
- aerofluidx: fully ported flow solver on the GPU
  - Technical approach
  - Problem set and achievable speedups
  - Multi-GPU scaling
- Conclusions and future roadmap for aerofluidx

18.03.2015 www.fluidyna.com

Potential of GPU computing for CFD
Example from the automotive industry: car-truck interference. OpenFOAM, steady RANS simulation with 120M computing cells; on a medium CPU cluster (22 dual-socket blades, i.e. 44 Sandy Bridge Xeon E5-2650 8-core CPUs) the runtime is about 2 days. Classical (unstructured) CFD codes are memory-bandwidth limited.

Per-unit comparison (cost as of Q1/2015; factors relative to the CPU):

Computing unit                      Intel E5-2650 V3 (2.6 GHz, 10 cores)  Nvidia K40    Nvidia K80 (2 GPUs per board)
Cost [EUR]                          1400                                  3500 (2.3x)   4800
Theoretical peak DP perf. [Gflops]  416                                   1430 (3.4x)   1660 per GPU (4.0x)
Max. memory bandwidth [GB/s]        68                                    288 (4.2x)    -
Memory                              compute-node dependent                12 GB         -
Max. power consumption [W]          105                                   235 (2.2x)    300

Platform comparison for the 120M-cell case:
- CPU-only: 22 blades, each equipped with 2x Intel Xeon E5-2650 V3; ~62k EUR for the CPUs alone, plus blade hardware (mainboard, memory, power supply) and an air-conditioned server room. Theoretical peak performance: 18.3 TFLOPS at 4620 W.
- Hybrid CPU-GPU: two single-socket blades with 2x Xeon E5-2650 + 6 Nvidia Tesla K80 in total; ~31k EUR. Theoretical peak performance: 832 Gflops + 19920 Gflops = 20.7 TFLOPS at 2010 W.

Conclusion: hardware and energy costs are more than halved.
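The peak-performance, power, and cost figures on this slide can be checked arithmetically. A small sketch; the one interpretation assumed here is that the 19920-Gflops GPU term corresponds to the six K80 boards contributing two GPU dies each at the slide's 1660-Gflops per-GPU figure:

```python
# Arithmetic check of the two platforms compared on the slide.
# Assumption: 6 K80 boards x 2 GPU dies x ~1660 Gflops per die.

CPU_GFLOPS, CPU_WATT, CPU_EUR = 416, 105, 1400        # Intel E5-2650 V3
K80_GFLOPS_PER_DIE, K80_WATT, K80_EUR = 1660, 300, 4800

# CPU-only cluster: 22 dual-socket blades = 44 CPUs.
cpu_cluster = dict(
    tflops=44 * CPU_GFLOPS / 1000,   # ~18.3 TFLOPS
    watts=44 * CPU_WATT,             # 4620 W
    eur=44 * CPU_EUR,                # 61600 EUR (~62k)
)

# Hybrid: 2 blades carrying 2 CPUs + 6 K80 boards in total.
hybrid = dict(
    tflops=(2 * CPU_GFLOPS + 6 * 2 * K80_GFLOPS_PER_DIE) / 1000,  # ~20.7 TFLOPS
    watts=2 * CPU_WATT + 6 * K80_WATT,   # 2010 W
    eur=2 * CPU_EUR + 6 * K80_EUR,       # 31600 EUR (~31k)
)

print(cpu_cluster, hybrid)
```

The numbers reproduce the slide: roughly half the hardware cost and well under half the power for slightly more peak throughput.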

Library Culises: concept and features
Culises = CUDA Library for Solving Linear Equation Systems (see also www.culises.com), coupled to a simulation tool such as OpenFOAM.
- State-of-the-art solvers for the solution of linear systems (based on Nvidia's AmgX library)
- Multi-GPU and multi-node capable; single or double precision
- Krylov subspace methods: CG, BiCGStab, GMRES for symmetric/non-symmetric matrices
- Preconditioning options: Jacobi (diagonal), incomplete Cholesky (IC), incomplete LU (ILU), algebraic multigrid (AMG)
- Stand-alone multigrid method: algebraic aggregation and classical coarsening; a multitude of smoothers (Jacobi, Jacobi L1, multicolor Gauss-Seidel, ILU, etc.)
- Flexible interfaces for arbitrary applications, e.g. the established coupling with OpenFOAM
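To make the solver building blocks listed above concrete, here is a minimal Jacobi-preconditioned conjugate gradient in NumPy. This is an educational sketch only, not the Culises implementation, which runs such kernels on the GPU via AmgX:

```python
import numpy as np

def jacobi_pcg(A, b, tol=1e-8, maxiter=2000):
    """Jacobi-preconditioned CG for a symmetric positive definite A (dense sketch)."""
    Minv = 1.0 / np.diag(A)        # Jacobi (diagonal) preconditioner
    x = np.zeros_like(b)
    r = b - A @ x
    z = Minv * r
    p = z.copy()
    rz = r @ z
    for k in range(maxiter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, k + 1
        z = Minv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxiter

# 1D Poisson matrix as a stand-in for a pressure-correction system.
# (Its diagonal is constant, so Jacobi is a pure scaling here; on real
# finite-volume matrices the diagonal varies and the preconditioner helps.)
n = 100
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x, iters = jacobi_pcg(A, b)
```

Culises selects among CG, BiCGStab, and GMRES depending on matrix symmetry, and can swap the Jacobi step for IC, ILU, or an AMG cycle.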

Hybrid CPU-GPU scenario (MPI+CUDA)
- OpenFOAM: MPI-parallelized CPU implementation based on domain decomposition; each processor partition contributes its part of the linear system Ax=b.
- Application interface: cudaMemcpy(..., cudaMemcpyHostToDevice) to upload the system, cudaMemcpy(..., cudaMemcpyDeviceToHost) to return the solution x.
- Culises dynamic library: solves the linear system(s) on multiple GPUs (PCG, PBiCG, AMGPCG).
- The MPI-parallel assembly (discretization etc.) of the system matrices remains on the CPUs.
- Overhead is introduced by the memory transfers and the matrix format conversion.
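The division of labour above can be sketched as a toy pipeline. Here NumPy copies stand in for the cudaMemcpy calls and a dense solve stands in for the GPU Krylov solver; the function names are illustrative, not the actual Culises interface:

```python
import numpy as np

def assemble_pressure_system(n):
    """Stand-in for the MPI-parallel FV discretization that stays on the CPU ranks."""
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # toy Poisson-like matrix
    b = np.ones(n)
    return A, b

def culises_solve(A_host, b_host):
    """Stand-in for the Culises library call."""
    A_dev = A_host.copy()   # cudaMemcpyHostToDevice + matrix format conversion (overhead)
    b_dev = b_host.copy()
    x_dev = np.linalg.solve(A_dev, b_dev)   # AMG-preconditioned Krylov solve on the GPU
    return x_dev.copy()     # cudaMemcpyDeviceToHost (overhead)

A, b = assemble_pressure_system(50)
x = culises_solve(A, b)
```

The two copy steps are exactly the per-iteration overhead the slide mentions; the full-GPU aerofluidx approach described later removes them by assembling A directly in device memory.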

Benchmark case
Automotive setup, simpleFoam solver from OpenFOAM:
- Steady-state (SIMPLE¹) method
- k-ω SST model with wall functions
- Geometry: DrivAer, a generic car shape model
Linear solver setup:
- Only the linear system for the pressure correction is accelerated by Culises on the GPU.
- All other linear systems (momentum x/y/z, kinetic energy k, specific dissipation rate ω) are solved on the CPU, since for them the CPU<->GPU overhead outbalances the GPU acceleration.
¹ SIMPLE = Semi-Implicit Method for Pressure-Linked Equations

Benchmark results
Automotive industrial setup (Japanese OEM); same solver setup as for the DrivAer case.
- CPU linear solver for pressure: geometric-algebraic multigrid (GAMG) of OpenFOAM (v2.3.1)
- GPU linear solver for pressure: AMG-preconditioned CG (AMGPCG) of Culises (v1.1)
- 200 SIMPLE iterations

Grid   CPU cores        GPUs          Linear solve  Total simulation  Speedup        Speedup
cells  (Intel E5-2650)  (Nvidia K40)  time [s]      time [s]          linear solver  total simulation
18M    8                1             1779          8407              3.83           1.60
18M    16               2             943           4409              3.17           1.44
62M    16               2             3662          16124             3.00           1.44
62M    32               4             1811          7519              2.60           1.36

Potential speedup for the hybrid approach
Let f = (CPU time spent in the linear solver) / (total CPU time); the remaining fraction 1-f is the assembly of the linear system (discretization etc.). With the linear solver accelerated by a factor a_LS on the GPU and the matrix assembly left on the CPU (a_MA = 1), the total speedup s is limited according to Amdahl's law:

s = 1 / ((1 - f) + f / a_LS)

Examples:
- a_LS = 3, f ≈ 0.46 -> s ≈ 1.44x
- a_LS = 4, f ≈ 0.9 -> s ≈ 3.1x
Note that f(steady-state run) << f(transient run), so transient runs benefit more.
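The Amdahl estimate above is a one-liner; a quick sketch reproducing the slide's two examples:

```python
def hybrid_speedup(f, a_ls):
    """Total speedup when only the linear solver (fraction f of the runtime) is
    accelerated by a factor a_ls and the assembly stays at 1x (Amdahl's law)."""
    return 1.0 / ((1.0 - f) + f / a_ls)

# Steady-state run: f ~ 0.46, linear solver 3x faster -> ~1.44x total.
print(round(hybrid_speedup(0.46, 3.0), 2))   # 1.44
# Transient run spends more time in the solver: f ~ 0.9, 4x -> ~3.1x total.
print(round(hybrid_speedup(0.9, 4.0), 2))    # 3.08
```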

Summary: hybrid approach
Advantages:
+ Universally applicable (coupled to the simulation tool of choice, e.g. OpenFOAM)
+ Full availability of the existing flow models
+ No validation needed
Disadvantages:
- The hybrid CPU-GPU coupling produces overhead
- If the solution of the linear system is not dominant (f < 0.5), the application speedup is limited

Potential speedup for the full GPU approach
If the matrix assembly is accelerated as well (by a factor a_MA), the total speedup becomes

s = 1 / (f / a_LS + (1 - f) / a_MA)

Example: a_LS = 3, a_MA = 3.5, f ≈ 0.46 -> s ≈ 3.3x (compared to 1.44x for the hybrid approach with the same f and a_LS).
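The full-GPU formula above generalizes the hybrid one; a sketch confirming the slide's example:

```python
def full_gpu_speedup(f, a_ls, a_ma):
    """Total speedup when both phases are accelerated: the linear solver by
    a_ls and the matrix assembly by a_ma. With a_ma = 1 this reduces to the
    hybrid (Amdahl) case."""
    return 1.0 / (f / a_ls + (1.0 - f) / a_ma)

# Slide's example: a_LS = 3, a_MA = 3.5, f ~ 0.46 -> ~3.3x total.
print(round(full_gpu_speedup(0.46, 3.0, 3.5), 2))   # 3.25
```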

aerofluidx: a full GPU approach
Targeted physical flow model:
- Incompressible Navier-Stokes equations, single-phase flow
- RANS with wall model, DES
Numerical discretization:
- Finite volume (FV) method on unstructured meshes
- Classical choices for the flux (upwind), gradient evaluation, interpolation method, etc.
- Pressure-velocity coupling via the classical segregated approach to save GPU memory: SIMPLE method for steady-state flow, PISO method for transient flow
Profiling shows that pre- and post-processing are negligible; two parts dominate the solution process:
(1) assembly of the linear systems (momentum and pressure correction)
(2) solution of the linear systems (Culises)

aerofluidx: a full GPU approach (continued)
- The discretization of the equations is ported to the GPU: the finite-volume discretization module runs on the GPU.
- This allows direct coupling to Culises/AmgX: zero overhead from CPU-GPU-CPU memory transfer and matrix format conversion.
- Solving the momentum equations and the turbulence model on the GPU is also beneficial.
- The OpenFOAM environment is supported, which enables a plug-in solution for OpenFOAM users; communication with other input/output file formats is also possible.

aerofluidx: reasonable benchmarking
CPU reference: simpleFoam solver (OpenFOAM v2.3.1); GPU: aerofluidx.
For a fair comparison between OpenFOAM and aerofluidx:
- Convergence criterion: both linear solvers must satisfy the same tolerance for the norm of the residual.
- Solver choice: best available linear solver on the CPU vs. best available linear solver on the GPU.
- Discretization: same methods for flux, gradient, interpolation, etc.
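The shared stopping rule can be written down explicitly. A sketch, assuming the tolerance is applied to the residual norm relative to ||b|| (the slide does not specify the normalization):

```python
import numpy as np

def converged(A, x, b, tol=1e-6):
    """Identical stopping rule for the CPU and the GPU solver: the residual
    norm of Ax=b must fall below the same tolerance (here relative to ||b||,
    an assumed normalization)."""
    return np.linalg.norm(b - A @ x) <= tol * np.linalg.norm(b)

# Trivial check on a small SPD system: the exact solution passes, a bad guess fails.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.linalg.solve(A, b)
print(converged(A, x, b), converged(A, np.zeros(2), b))
```

Comparing solvers that stop on different criteria (e.g. fixed iteration counts vs. residual norms) would make any speedup number meaningless, which is why the slide insists on this point.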

aerofluidx: cavity flow test case
Validation at Re = 400 (laminar) on a 250x250 grid against Ghia et al., "High-Re Solutions for Incompressible Flow Using the Navier-Stokes Equations and a Multigrid Method" (1982).

aerofluidx: cavity flow test case (performance)
CPU: Intel E5-2650 (all 8 cores); GPU: Nvidia K40; 4M grid cells (unstructured); 100 SIMPLE steps with:
- OpenFOAM (OF): pressure GAMG, velocity Gauss-Seidel
- OpenFOAM + Culises (OFC): pressure Culises AMGPCG (2.4x on the pressure system), velocity Gauss-Seidel
- aerofluidx + Culises (AFXC): pressure Culises AMGPCG, velocity Culises Jacobi
Normalized computing time (chart): speedup of all linear solves 1x / 3.97x / 5.3x and of all assembly 1x / 1x / 3.3x for OF / OFC / AFXC, where "all" covers both the pressure and the velocity systems.
Total speedup: OF 1x, OFC 2.98x, AFXC 5.0x.

aerofluidx: airfoil flow test case
Validation at Re = 2000 (laminar), angle of attack 0°, 40K grid cells; velocity profiles at 20% and 80% chord length compared against Lippolis et al., "Incompressible Navier-Stokes Solutions on Unstructured Grids Using a Co-Volume Technique" (1993).

aerofluidx: airfoil flow test case (performance)
CPU: Intel E5-2650 (all 8 cores); GPU: Nvidia K40; 8M grid cells (unstructured); 100 SIMPLE steps with:
- OpenFOAM (OF): pressure GAMG, velocity Gauss-Seidel
- OpenFOAM + Culises (OFC): pressure Culises AMGPCG (1.8x on the pressure system), velocity Gauss-Seidel
- aerofluidx + Culises (AFXC): pressure Culises AMGPCG, velocity Culises Jacobi
Normalized computing time (chart): speedup of all linear solves 1x / 1.28x / 1.91x and of all assembly 1x / 1x / 3.72x for OF / OFC / AFXC.
Total speedup: OF 1x, OFC 1.16x, AFXC 2.3x.

aerofluidx: 3D sphere test case (laminar)
CPU: Intel E5-2650, 2 GHz, 8-core Sandy Bridge; GPU: Nvidia K40; 2.4M grid cells (unstructured); 1000 SIMPLE steps with:
- OpenFOAM (OF): pressure GAMG, velocity Gauss-Seidel
- aerofluidx (AFXC): pressure Culises AMGPCG, velocity Culises Jacobi
The OpenFOAM and aerofluidx results agree, and both fit the experimental data and the numerical simulation of B. Fornberg (1988).

aerofluidx: sphere flow test case (performance)
CPU: Intel E5-2650 (all 8 cores); GPU: Nvidia K40; 8M grid cells (unstructured); 1000 SIMPLE steps with:
- OpenFOAM (OF): pressure GAMG, velocity Gauss-Seidel
- OpenFOAM + Culises (OFC): pressure Culises AMGPCG (1.8x on the pressure system), velocity Gauss-Seidel
- aerofluidx + Culises (AFXC): pressure Culises AMGPCG, velocity Culises Jacobi
Normalized computing time (chart): speedup of all linear solves 1x / 1.33x / 2.36x and of all assembly 1x / 1x / 5.03x for OF / OFC / AFXC.
Total speedup: OF 1x, OFC 1.18x, AFXC 2.9x.

aerofluidx: 3D elbow pipe flow (laminar)
OpenFOAM vs. aerofluidx; 8M grid cells (unstructured); 6000 SIMPLE steps. The case is part of the JAMA (Japan Automobile Manufacturers Association) benchmarking test suite.
With optimized assembly code the total simulation speedup is 2.8x (chart: assembly 5.8x, linear solve 2.2x vs. OpenFOAM).

aerofluidx: turbulent flow validation
OpenFOAM vs. aerofluidx; 10K grid cells (structured); Re = 180000; k-ω SST model with standard wall functions. Good agreement between OpenFOAM and aerofluidx.

aerofluidx: multi-GPU weak scaling
Cavity flow, laminar; structured meshes of 8M, 16M, 24M and 32M grid cells (8M is the maximum for one Tesla K40); 100 SIMPLE iterations. GPU data exchange goes via host memory (no use of CUDA-aware MPI yet).

Parallel efficiency [%] (Nvidia K40):
GPUs  Assembly  Velocity systems  Pressure system (AmgX)  Pressure iterations (# AMG levels)  Total
1     100       100               100                     42 (6)                              100
2     91        67                80                      44 (6)                              79
3     81        65                64                      47 (6)                              65
4     79        62                62                      55 (7)                              63

aerofluidx: multi-GPU strong scaling
Airfoil flow, laminar; unstructured mesh with 8M cells (the maximum for one Tesla K40); 100 SIMPLE iterations; strong scaling up to 4 GPUs, GPU exchange via host memory. Scaling is improved by merging the coarse-level matrices of all GPUs onto one single GPU ("coarse-level consolidation").

AMG grid for the pressure system (6 levels):
LVL   ROWS     NNZ       SPRSTY    Mem (GB)
0(D)  8023002  32086418  4.98e-07  0.557
1(D)  1251522  7966120   5.09e-06  0.227
2(D)  178726   1238808   3.88e-05  0.0347
3(D)  25179    175283    0.000276  0.00491
4(D)  3529     24377     0.00196   0.000684
5(D)  494      3346      0.0137    8.95e-05

Speedup vs. 1 GPU (Nvidia K40):
GPUs  Assembly  Velocity systems  Pressure system  Total simulation
1     1         1                 1                1
2     1.95      1.64              1.72             1.76
3     2.94      2.48              2.36             2.49
4     3.60      3.20              2.80             2.99

Observations:
- The aerofluidx assembly phase scales better than the linear solver phase.
- The pressure solver scales worse than the velocity solver, as expected for a multi-level vs. a one-level method.
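The two scaling slides report different metrics; the relation between speedup and parallel efficiency can be sketched briefly, here applied to the total-simulation speedups from the strong-scaling table:

```python
def strong_efficiency(speedup, n_gpus):
    """Strong scaling (fixed problem size): efficiency [%] = speedup / GPU count."""
    return 100.0 * speedup / n_gpus

def weak_efficiency(t_1gpu, t_ngpu):
    """Weak scaling (problem size grows with GPU count): the ideal runtime is
    constant, so efficiency [%] = T(1 GPU) / T(N GPUs)."""
    return 100.0 * t_1gpu / t_ngpu

# Total-simulation speedups vs. 1 GPU from the strong-scaling table above.
for n, s in [(2, 1.76), (3, 2.49), (4, 2.99)]:
    print(f"{n} GPUs: {strong_efficiency(s, n):.0f}% efficient")
```

The 4-GPU total speedup of 2.99x thus corresponds to roughly 75% strong-scaling efficiency, comparable to the 63% total weak-scaling efficiency on 4 GPUs from the previous slide.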

Conclusions
Culises, the hybrid approach for accelerated CFD applications (OpenFOAM):
- Generally applicable to industrial cases, including the various existing flow models
- Significant speedup of the linear solver by employing GPUs
- Moderate speedup of the total simulation; better acceleration for transient flow problems
- Multi-GPU scaling of multigrid-based methods still improvable
aerofluidx, the fully ported GPU flow solver that harvests the full GPU computing power:
- General applicability requires rewriting a large portion of the existing code; validation is necessary
- A steady-state, incompressible, multi-GPU, unstructured multigrid flow solver is established and validated
- Significant speedup of the matrix assembly
- Enhanced speedup of the total simulation

Release planning (speedups measured against standard OpenFOAM):
- fv0.98 (2014): steady-state laminar flow; single GPU; speedup > 2x
- fv1.0 (2015, aerofluidx V1.0): turbulent flow (RANS, k-ω SST); multi-GPU, single-node and multi-node; speedup 2-3x
- fv1.2: unsteady flow; advanced turbulence modelling (LES/DES); basic support for moving geometries (MRF); porous media
- fv2.0 (2016): advanced model for rotating devices (sliding-mesh approach)

Contact information
Licensing & testing:
- www.culises.com: Culises V1.1 released; commercial and academic licensing available; free testing and benchmarking opportunities on FluiDyna GPU servers
- www.aerofluidx.com
Questions & feedback: ax@fluidyna.com, aerofluidx@fluidyna.com, Bjoern.Landmann@FluiDyna.de
Acknowledgement: AmgX team of Nvidia (Joe Eaton, Maxim Naumov, Andrei Schaffer, Marat Arsaev, Alexandre Fender)