PyFR: Bringing Next Generation Computational Fluid Dynamics to GPU Platforms P. E. Vincent! Department of Aeronautics Imperial College London! 25 th March 2014
Overview Motivation Flux Reconstruction Many-Core Hardware PyFR Results Summary
Motivation Current industry standard CFD tools have limited capabilities
Motivation Technology is decades old and designed for solving steady flow problems (using RANS approach)
Motivation Technology is decades old and designed for solving steady flow problems (using RANS approach)
Motivation Need to expand the industrial CFD envelope
Motivation Its not just me who thinks this [1]! [1] Murray Cross, Airbus, Technology Product Leader - Future Simulations (2012)
Motivation Objective of our research is to advance industrial CFD capabilities from their current RANS plateau
Motivation We aim to develop the de facto industry standard technology for affordable (and hence industrially relevant) high-fidelity scale-resolving simulations of unsteady flow phenomena within the vicinity of complex geometric configurations
Motivation Achieved by intelligently leveraging benefits of (and synergies between) high-order Flux Reconstruction (FR) methods for unstructured grids and massively-parallel Many-Core Hardware platforms
Motivation Flux Reconstruction + Many-Core Hardware
Flux Reconstruction Flux Reconstruction (FR) approach to high-order methods was first proposed by Huynh in 2007 [2] High-order accurate in space Works on unstructured grids [2] H. T. Huynh. A flux reconstruction approach to high-order schemes including discontinuous Galerkin methods. AIAA Paper 2007-4079. 2007
Flux Reconstruction What does high-order mean?
Flux Reconstruction What does high-order mean? Order N accuracy implies error of order h N High-order typically refers to N > 2
Flux Reconstruction Why is high-order useful?
Flux Reconstruction Why is high-order useful? 2 nd Order (25,600 DOF) 4 th Order (25,600 DOF) t = 0 t = 0
Flux Reconstruction Why is high-order useful? 2 nd Order (25,600 DOF) 4 th Order (25,600 DOF) t = 0 t = 0
Flux Reconstruction Why is high-order useful? 2 nd Order (25,600 DOF) 4 th Order (25,600 DOF) t = 1 t = 1
Flux Reconstruction Why is high-order useful? 2 nd Order (25,600 DOF) 4 th Order (25,600 DOF) t = 2 t = 2
Flux Reconstruction Why is high-order useful? 2 nd Order (25,600 DOF) 4 th Order (25,600 DOF) t = 3 t = 3
Flux Reconstruction Why is high-order useful? 2 nd Order (25,600 DOF) 4 th Order (25,600 DOF) t = 4 t = 4
Flux Reconstruction Why is high-order useful? 2 nd Order (25,600 DOF) 4 th Order (25,600 DOF) t = 5 t = 5
Motivation Introduction Flux Theory Reconstruction PyFR Results Many-Core Summary Hardware PyFR Results Summary Flux Introduction Reconstruction Why is high-order useful? 2 nd Order (25,600 DOF) 4 th Order (25,600 DOF) t = 20 t = 20
Flux Reconstruction Why is high-order useful? 2 nd Order (25,600 DOF) 4 th Order (25,600 DOF) t = 40 t = 40
Flux Reconstruction Why is high-order useful? 2 nd Order (25,600 DOF) 4 th Order (25,600 DOF) t = 60 t = 60
Flux Reconstruction Why is high-order useful? 2 nd Order (25,600 DOF) 4 th Order (25,600 DOF) t = 180 t = 180
Flux Reconstruction What does unstructured mean?
Flux Reconstruction What does unstructured mean? Unstructured 4th Order Structured 4th Order t = 0 t = 0
Flux Reconstruction What does unstructured mean? Unstructured 4th Order Structured 4th Order t = 1 t = 1
Flux Reconstruction What does unstructured mean? Unstructured 4th Order Structured 4th Order t = 2 t = 2
Flux Reconstruction What does unstructured mean? Unstructured 4th Order Structured 4th Order t = 3 t = 3
Flux Reconstruction What does unstructured mean? Unstructured 4th Order Structured 4th Order t = 4 t = 4
Flux Reconstruction What does unstructured mean? Unstructured 4th Order Structured 4th Order t = 5 t = 5
Flux Reconstruction Why is unstructured useful?
Flux Reconstruction Why is unstructured useful?
Flux Reconstruction Also Unifying (can cast various known + new schemes within FR framework) Significant element locality Abundance of structured compute
Motivation Flux Reconstruction Many-Core Hardware PyFR Results Summary Many-Core Hardware Multi-core CPUs now ubiquitous (with 10 s of cores per device)
Many-Core Hardware Many-core accelerators a hot topic (100s -1000s cores per device) Nvidia K40
Many-Core Hardware Many-core accelerators a hot topic (100s -1000s cores per device) Nvidia K40 Intel Phi AMD S10000
Many-Core Hardware Rapid evolution of hardware Need to exploit massive parallelism Temporal and spatial data locality is important Limited memory per device Exposed to complex memory hierarchy
Many-Core Hardware So, several challenges...
Many-Core Hardware But significant compute available if it can be harnessed
PyFR Flux Reconstruction PyFR + Many-Core Hardware
PyFR Features (v0.1.0 - current stable release) Governing Equations Spatial Discretisation Compressible Euler Compressible Navier Stokes Arbitrary order FR on mixed unstructured grids (tris, quads, hexes) Temporal Discretisation Range of explicit Runge-Kutta schemes Platforms Precision Input Output CPU clusters (C-OpenMP-MPI) Nvidia GPU clusters (CUDA-MPI) Single Double Gmsh Paraview
PyFR Python Outer Layer (Hardware Independent) Python Outer Layer (Hardware Independent) Setup Distributed memory parallelism Outer for loop and calls to Hardware Specific Kernels
PyFR Need to generate the Hardware Specific Kernels Python Outer Layer (Hardware Independent) Setup Distributed memory parallelism Outer for loop and calls to Hardware Specific Kernels
PyFR Templated Kernels (Hardware Independent) Python Outer Layer (Hardware Independent) Setup Distributed memory parallelism Outer for loop and calls to Hardware Specific Kernels Templated Kernels (Hardware Independent) Point-wise non-linear kernels (flux function, Riemann solver) Matrix multiply kernels
PyFR Passed through Mako Derived Templating Engine Python Outer Layer (Hardware Independent) Setup Distributed memory parallelism Outer for loop and calls to Hardware Specific Kernels Templated Kernels (Hardware Independent) Point-wise non-linear kernels (flux function, Riemann solver) Matrix multiply kernels Mako Derived Templating Engine Generates hardware specific kernel code Deals with device level parallelism
Motivation Flux Reconstruction Many-Core Hardware PyFR Results Summary PyFR Generates Hardware Specific Kernels Python Outer Layer (Hardware Independent) Templated Kernels (Hardware Independent) Setup Distributed memory parallelism Outer for loop and calls to Hardware Specific Kernels C/OpenMP Hardware Specific Kernels CUDA Hardware Specific Kernels Point-wise non-linear kernels (flux function, Riemann solver) Matrix multiply kernels OpenCL Hardware Specific Kernels Mako Derived Templating Engine Generates hardware specific kernel code Deals with device level parallelism
Motivation Flux Reconstruction Many-Core Hardware PyFR Results Summary PyFR These can now be called Python Outer Layer (Hardware Independent) Templated Kernels (Hardware Independent) Setup Distributed memory parallelism Outer for loop and calls to Hardware Specific Kernels C/OpenMP Hardware Specific Kernels CUDA Hardware Specific Kernels Point-wise non-linear kernels (flux function, Riemann solver) Matrix multiply kernels OpenCL Hardware Specific Kernels Mako Derived Templating Engine Generates hardware specific kernel code Deals with device level parallelism
PyFR ~5,000 lines of code
PyFR Speed critical kernel is sgemm/dgemm In first instance we can drop in vendor BLAS (cublas etc.) But can we do better than this
Motivation Flux Reconstruction Many-Core Hardware PyFR Results Summary Results Unstructured 5th orderataccurate Macurved = 0.2inRe element space = 3900 hexahedral mesh Cylinder
Results T106c turbine blade Ma = 0.65 Re = 80,000
Results Unstructured curved element hexahedral mesh
Results 4th order accurate in space
Results Good agreement with experimental data Isentropic Mach Number Chord
Results c.f. RANS result from leading commercial software Isentropic Mach Number Chord
Results Double ~500,000 1x10 4th Unstructured Flow ~100 order over hours precision RK4 accurate a deployed hexahedral time-steps 4 x K20c in spoiler space GPUs meshma in = a desktop 0.3 Re = workstation 10,000
Results 3D Navier-Stokes weak scaling over M2090s (Emerald) 1.2 1 0.8 0.6 0.4 0.2 0 0 16 32 48 64 80 96 112 PyFR Ideal
Results 3D Navier-Stokes strong scaling over M2090s (Emerald) 32 28 24 20 16 12 8 4 0 0 4 8 12 16 20 24 28 32 PyFR Ideal
Motivation Flux Reconstruction Many-Core Hardware PyFR Results Summary Results 1.3 billion DOF on 184 x M2090 GPUs (Emerald)
Summary Objective of research is to advance industrial CFD capabilities from their current RANS plateau
Summary We aim to develop the de facto industry standard technology for affordable (and hence industrially relevant) high-fidelity scale-resolving simulations of unsteady flow phenomena within the vicinity of complex geometric configurations
Summary Achieved by intelligently leveraging benefits of (and synergies between) High-Order Flux Reconstruction (FR) methods for unstructured grids and massively-parallel Many-Core Hardware platforms
Summary Web: www.pyfr.org Twitter: @PyFR_Solver
PhD Students Freddie Witherden (Imperial College) Antony Farrington (Imperial College) George Ntemos (Imperial College) Harry Davis (Imperial College)
Emerald The authors would like to acknowledge the work presented here made use of the EMERALD HPC facility provided by the Centre for Innovation
Funding Early Career Fellowship Platform Grant UK Turbulence Consortium 4 x PhD Studentships Hardware Donations
Questions Web: www.imperial.ac.uk/aeronautics/research/vincentlab Twitter: @Vincent_Lab