Unsteady CFD for aerodynamic profiles




Unsteady CFD for aerodynamic profiles
Jean-Marie Le Gouez, Onera, Département MFN
Jean-Matthieu Etancelin, ROMEO
Thanks to Nikolay Markovskiy, DevTech, NVIDIA Research Center, GB

Unsteady CFD for aerodynamic profiles
- Context: state of the art for LES / ZDES unsteady fluid dynamics simulations of aerodynamic profiles
- Prototypes for new-generation flow solvers
- NextFlow GPU prototype: development stages, data models, programming languages, co-processing tools
- Capacity of Tesla GPU clusters for LES simulations
- Performance measurements, tracks for further optimization
- Outlook

General context: CFD at Onera
The recognized platforms for research and industrial applications:
- elsA: aerodynamics and aeroelasticity, in direct and adjoint modes, automatic shape optimization, numerous turbulence models, Zonal Detached Eddy Simulation (ZDES)
- Cedre: combustion and aerothermics, two-phase flows, heat radiation, structural thermics and thermomechanics (Zebulon Onera code)
- Sabrina/Funk: unsteady aerodynamics, optimized for LES and aeroacoustics
Their associated tools for application productivity: links with CAD, overlapping grids, AMR patches, grid deformation tools.
The multi-physics coupling libraries (collaboration with Cerfacs: OpenPalm).
Hundreds of thousands of lines of code, a mix of Fortran, C++ and Python, in heavy use by the aerospace industries.

General context: CFD at Onera
The Cassiopee system, associated with the elsA solver, partly open source.
The expectations of external users:
- Extended simulation domains: effects of the wake on downstream components, blade-vortex interaction on helicopters, thermal loading of composite structures by engine jets
- Models of full systems and not only individual components: multi-stage turbomachinery internal flows, coupling between combustion chamber and turbine aerodynamics
- More multi-scale effects: representation of technological details to improve the overall flow system efficiency: grooves in the walls, local injectors for flow / acoustics control
- Advanced usage of CFD: adjoint modes for automatic shape optimization and grid refinement, uncertainty management, input parameters defined as PDFs

CFD at Onera
It is hard to conduct a deep re-engineering of the existing solvers (a large workload of physical-modelling improvements relies on them), and there are risks to their future scalability towards petaflops computing (expected within 3-5 years).
Development of prototypes is therefore undertaken on the following subjects, aiming at more modular solvers and couplers, each optimized in its field:
- Grid strategies: hybrid structured / unstructured, overlapping, automatically refined, high-order grids (curved faces)
- Numerical schemes and non-linear solvers: Discontinuous Galerkin method (AGHORA project), high-order finite volumes (NextFlow project)
- Multi-physics and multi-model couplings: CWIPI library, OpenPalm with Cerfacs
- HPC: MPI / OpenMP, vectorization on CPU (x86), heterogeneous CPU / GPU hardware architectures

Improvement of predictive capabilities over the last 5 years: RANS / zonal LES of the flow around a high-lift deployed wing, Mach 0.18, Re 1,400,000 based on chord.
- 2009: 2D steady RANS and 3D LES, 7.5 Mpts
- 2014: optimized on a CPU architecture (MPI / OpenMP / vectorization); CPU resources for 70 ms of simulation on the JADE computer (CINES): N_xyz ≈ 2,600 Mpts, 4096 cores / 10,688 domains, T_CPU ≈ 6,200,000 h
[Figures: computed field of dilatation, identification of acoustic sources; LEISA project, DLR / Onera cooperation, ONERA FUNK software]

NextFlow: high-order finite volume for RANS / LES
Demonstration of the feasibility of porting these algorithms to heterogeneous architectures.
[Figure: pressure field on a high-lift configuration]

NextFlow: high-order finite volume for RANS / LES
Demonstration of the feasibility of porting these algorithms to heterogeneous architectures. Starting from a Fortran / MPI base, with management of a complex data model on partitioned grids, the code was ported both to OpenMP / Fortran and to C / CUDA, sharing an interface that transposes the work arrays and manages pointers to CPU and GPU memories (a sketch of such an interface is given after this slide).
- Preliminary work: i,j,k structure of the grid and arrays; choice of the programming model (shared or distributed memory, acceleration by directives or by a dedicated language); choice of C / CUDA / structures
- 1st version: on the fully unstructured grid; tests of fine partitioning relying on advanced cache usage on the GPU
- 2nd version: hierarchical model, with an initial coarse grid that is dynamically refined on the GPU; this ensures coalesced data access to global memory
- 3rd version (ongoing work): 2.5D model, periodic in the spanwise direction, automatic vector processing on the CPU, and a fully data-parallel model for the GPU in the homogeneous third dimension
"That is well said," replied Candide, "but we must write our CUDA."
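As a rough illustration of the shared CPU / GPU interface described on this slide (struct and function names here are hypothetical, not taken from NextFlow), such a layer could own matching host and device pointers and perform the Fortran-side AoS to GPU-side SoA transpose on upload:

#include <cuda_runtime.h>
#include <stdlib.h>

typedef struct {
    int     ncells, nvar;   /* extent: ncells cells, nvar variables per cell    */
    double *host;           /* Fortran layout: host[ivar + nvar*icell]   (AoS)  */
    double *dev;            /* GPU layout:     dev [icell + ncells*ivar] (SoA)  */
} WorkArray;

void work_array_create(WorkArray *w, int ncells, int nvar) {
    w->ncells = ncells;  w->nvar = nvar;
    w->host = (double *)malloc(sizeof(double) * ncells * nvar);
    cudaMalloc((void **)&w->dev, sizeof(double) * ncells * nvar);
}

/* Transpose on the host, then copy the SoA image to the GPU. */
void work_array_upload(WorkArray *w) {
    int n = w->ncells * w->nvar;
    double *tmp = (double *)malloc(sizeof(double) * n);
    for (int c = 0; c < w->ncells; ++c)
        for (int v = 0; v < w->nvar; ++v)
            tmp[c + w->ncells * v] = w->host[v + w->nvar * c];
    cudaMemcpy(w->dev, tmp, sizeof(double) * n, cudaMemcpyHostToDevice);
    free(tmp);
}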

1st approach: block structuration of a regular linear grid
Partition the mesh into small blocks and map them onto the GPU's scalable structure: each mesh block is handled by a streaming multiprocessor (SM), as sketched below.
[Diagram: mesh blocks mapped one-to-one onto SMs]
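A minimal CUDA sketch of this mapping (kernel and array names are illustrative, not from the actual prototype): one CUDA thread block per mesh block, one thread per cell of that block.

__global__ void update_block(const double *q, double *res,
                             const int *block_start, int cells_per_block)
{
    int ib    = blockIdx.x;                      /* one CUDA block per mesh block */
    int icell = block_start[ib] + threadIdx.x;   /* cell handled by this thread   */
    if ((int)threadIdx.x < cells_per_block) {
        /* flux accumulation for cell icell would go here */
        res[icell] = q[icell];                   /* placeholder body */
    }
}

/* Launch with as many CUDA blocks as mesh blocks, one thread per cell:
   update_block<<<n_mesh_blocks, cells_per_block>>>(q_dev, res_dev, start_dev, cells_per_block); */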

Advantage of the block structuration
- Bigger blocks provide better occupancy, less kernel-launch latency, and fewer transfers between blocks
- Smaller blocks provide much more data caching
[Charts: L1 hit rate, fluxes time (normalized) and overall time (normalized) versus block size: 256, 1024, 4096, 24097 cells]
Final speedup with respect to 2 hyperthreaded Westmere CPUs: ~2

NXO-GPU phase 2: imposing a sub-structuration on the grid and data model (inspired by the tessellation mechanism in surface rendering)
- A single grid connectivity for the algorithm
- Optimal for organizing data for coalesced memory access during the algorithm and communication phases
- Each coarse element in a block is allocated to an inner thread (threadIdx.x)
- Hierarchical model for the grid: high-order (quartic polynomial) triangles generated by gmsh are refined on the GPU; the whole fine grid as such can remain unknown to the host CPU (see the sketch after this list)
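A possible shape for the on-device refinement kernel (hypothetical names; the real code would evaluate the quartic element mapping): each thread takes one coarse element of its block and writes its fine sub-cells into a block-ordered slab, so that later kernels read them with coalesced accesses.

__global__ void refine_block(const double *coarse_data,   /* coarse-element data      */
                             double *fine_cells,          /* fine-grid storage on GPU */
                             int coarse_per_block, int nsub)
{
    int ic = blockIdx.x * coarse_per_block + threadIdx.x;  /* coarse element id */
    for (int s = 0; s < nsub; ++s) {
        /* consecutive threads write consecutive words: coalesced stores */
        int ifine = blockIdx.x * coarse_per_block * nsub
                  + s * coarse_per_block + threadIdx.x;
        fine_cells[ifine] = coarse_data[ic];   /* placeholder: real code evaluates
                                                  the quartic mapping at sub-cell s */
    }
}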

Code structure
- Preprocessing: mesh generation, block and generic-refinement generation
- Solver: allocation and initialization of the data structure from the modified mesh file, the computational routine and the time stepping (Fortran), calling GPU allocation and initialization binders, computational binders and a data-fetching binder (C), which launch the CUDA kernels (CUDA C)
- Postprocessing: visualization and data analysis
(A sketch of one such binder follows.)
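A sketch of what one of the C "computational binders" above might look like (function and pointer names are hypothetical): a Fortran-callable C routine that launches a CUDA kernel on device-resident data, so the Fortran driver never touches GPU pointers.

#include <cuda_runtime.h>

__global__ void compute_fluxes_kernel(const double *q, double *flux, int n);

/* device pointers owned by the GPU allocation / initialization binders */
extern double *q_dev, *flux_dev;

extern "C" void nxo_compute_fluxes_(int *ncells)   /* callable from Fortran */
{
    int threads = 256;
    int blocks  = (*ncells + threads - 1) / threads;
    compute_fluxes_kernel<<<blocks, threads>>>(q_dev, flux_dev, *ncells);
    cudaDeviceSynchronize();   /* or defer synchronization to the time-stepping binder */
}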

Version 2: measured efficiency on Tesla C2050 and K20c (with respect to 2 Xeon 5650 CPUs, OpenMP loop-based)
- Results on a K20c: maximum acceleration of 38 with respect to 2 Westmere sockets
- After improving the Westmere CPU efficiency (OpenMP task-based, with the blocks also refined on the CPU), the K20c GPU / CPU acceleration drops to 13 (1 K20c ≈ 150 Westmere cores)
- In fact this method is memory bound, and GPU bandwidth is critical; more CPU optimisation is needed (cache blocking, vectorisation?)
- Flop count: around 80 Gflop/s per K20c. These are valuable flops, not Ax=b flops, but Riemann-solver flops with high-order (4th, 5th) extrapolated values and characteristic splitting; they require very high memory traffic.
Thanks to the NVIDIA DevTech department for their support; my flop is rich.
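A rough arithmetic-intensity check, using the 80 Gflop/s quoted here and the ~150 GB/s sustained bandwidth reported on the later kernel-optimization slide (a back-of-the-envelope estimate, not a figure from the talk):

    AI ≈ 80 Gflop/s ÷ 150 GB/s ≈ 0.53 flop/byte

which is far below the K20c ridge point of roughly 1170 Gflop/s ÷ 208 GB/s ≈ 5.6 flop/byte (peak double-precision rate over peak memory bandwidth), consistent with the statement that the method is memory bound.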

Version 3: 2.5D, periodic spanwise (cshift vectors), multi-GPU / MPI
- High CPU vectorisation in the third, homogeneous direction (all variables are vectors of length 256 to 512)
- Fully data-parallel, coalesced CUDA kernels (a sketch of the layout follows)
- Objective: one billion cells on a cluster with 64 Tesla K20s (40,000 cells x 512 spanwise stations per partition)
- The CPU (Fortran) and GPU (C / CUDA) versions live in the same executable, for efficiency and accuracy comparisons
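A sketch of the corresponding data layout and kernel indexing (illustrative names, with NSPAN standing for the 256-512 spanwise stations): storing the spanwise index fastest and assigning it to threadIdx.x makes every load coalesced, and the periodic "cshift" becomes a modular index.

#define NSPAN 512   /* spanwise stations per partition (assumed value) */

__global__ void spanwise_difference(const double *q, double *dq, int ncells)
{
    int k     = threadIdx.x;   /* spanwise station: fastest-varying index */
    int icell = blockIdx.x;    /* cell in the 2D profile plane            */
    if (icell < ncells && k < NSPAN) {
        int kp = (k + 1) % NSPAN;            /* periodic neighbours (cshift) */
        int km = (k + NSPAN - 1) % NSPAN;
        /* placeholder centred difference in the homogeneous direction */
        dq[k + NSPAN * icell] = 0.5 * (q[kp + NSPAN * icell] - q[km + NSPAN * icell]);
    }
}

/* Launch: spanwise_difference<<<ncells, NSPAN>>>(q_dev, dq_dev, ncells); */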

Version 3: 2.5D, periodic spanwise (cshift vectors), multi-GPU / MPI

Version 3: initial kernel optimization (done by NVIDIA DevTech Great Britain)
[Chart: CPU time for the full time step and the 7 main kernels; left column: Ivy Bridge (8 OpenMP threads), right column: K40]
- Performance strategy: increase occupancy, reduce register use, reduce the number of operations on global memory (one way of doing this is sketched below)
- After some GPU optimizations: acceleration of 14 on one K20c with respect to 8 Ivy Bridge cores, GPU memory bandwidth 150 GB/s
- For LES, GPU memory size is not too much of a problem: 20 million cells stored on a K20 (6 GB), 40 million on a K40
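One common way to act on the "increase occupancy, reduce register use" lever (shown here as a generic sketch, not the actual NextFlow kernels) is to bound the launch configuration so nvcc caps the registers per thread:

__global__ void
__launch_bounds__(256, 4)   /* 256 threads per block, at least 4 resident blocks per SM */
riemann_faces(const double *ql, const double *qr, double *flux, int nfaces)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nfaces) {
        /* keep intermediates in few registers and reuse loaded values
           instead of re-reading global memory */
        double dl = ql[i], dr = qr[i];
        flux[i] = 0.5 * (dl + dr);   /* placeholder flux */
    }
}

/* A whole-file alternative: compile with  nvcc -maxrregcount=64 ... */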

Version 3: ongoing work
- Improve the efficiency of the communications between partitions (MPI / CUDA, GPUDirect) and overlap communications with computations at the center of the partitions (see the sketch after this list)
- Verify the scalability curve from 1 to 128 accelerators
- Choose a strategy for CPU usage: identify algorithmic phases that are less pertinent for the GPU, and specialize the CPU for co-processing tasks: spatial Fourier transforms, management and computation of mesh-refinement indices, preparation of cell metrics for the next time step for moving / deforming grids, computation of iso-value surfaces
- Compare different versions of the flux and Riemann solvers that require fewer registers
- Compare the efficiency and accuracy to standard Fortran / MPI 2nd-order codes, while using larger mesh sizes
- Prepare the specification of the next generation of CFD solvers
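A minimal sketch of the interior / halo overlap mentioned in the first item (stream, kernel and buffer names are hypothetical; a CUDA-aware MPI with GPUDirect is assumed so MPI can take device pointers directly):

#include <mpi.h>
#include <cuda_runtime.h>

__global__ void pack_halo_kernel(const double *q, double *sendbuf, int nhalo);
__global__ void apply_halo_kernel(double *q, const double *recvbuf, int nhalo);
__global__ void compute_interior_kernel(const double *q, double *res, int ninner);

void step_with_overlap(double *q_dev, double *res_dev,
                       double *sendbuf_dev, double *recvbuf_dev,
                       int ninner, int nhalo, int nbr)
{
    cudaStream_t s_halo, s_inner;
    cudaStreamCreate(&s_halo);
    cudaStreamCreate(&s_inner);

    /* pack the halo faces on one stream and start the interior work on the other */
    pack_halo_kernel<<<(nhalo + 255) / 256, 256, 0, s_halo>>>(q_dev, sendbuf_dev, nhalo);
    compute_interior_kernel<<<(ninner + 255) / 256, 256, 0, s_inner>>>(q_dev, res_dev, ninner);

    /* halo exchange overlaps with the interior kernel still running on s_inner */
    cudaStreamSynchronize(s_halo);
    MPI_Sendrecv(sendbuf_dev, nhalo, MPI_DOUBLE, nbr, 0,
                 recvbuf_dev, nhalo, MPI_DOUBLE, nbr, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* apply the received halos, then wait for the interior stream to finish */
    apply_halo_kernel<<<(nhalo + 255) / 256, 256, 0, s_halo>>>(q_dev, recvbuf_dev, nhalo);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s_halo);
    cudaStreamDestroy(s_inner);
}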