Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms



Similar documents
Design and Optimization of OpenFOAM-based CFD Applications for Modern Hybrid and Heterogeneous HPC Platforms. Amani AlOnazi.

Accelerating CFD using OpenFOAM with GPUs

OpenFOAM Workshop. Yağmur Gülkanat Res.Assist.

AeroFluidX: A Next Generation GPU-Based CFD Solver for Engineering Applications

GPGPU accelerated Computational Fluid Dynamics

HPC Deployment of OpenFOAM in an Industrial Setting

High Performance Computing in CST STUDIO SUITE

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

STCE. Outline. Introduction. Applications. Ongoing work. Summary. STCE RWTH-Aachen, Industrial Applications of discrete adjoint OpenFOAM, EuroAD 2014

Turbomachinery CFD on many-core platforms experiences and strategies

Large-Scale Reservoir Simulation and Big Data Visualization

Multicore Parallel Computing with OpenMP

GPGPU acceleration in OpenFOAM

Hierarchically Parallel FE Software for Assembly Structures : FrontISTR - Parallel Performance Evaluation and Its Industrial Applications

HPC enabling of OpenFOAM R for CFD applications

Express Introductory Training in ANSYS Fluent Lecture 1 Introduction to the CFD Methodology

P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE

Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Numerical Calculation of Laminar Flame Propagation with Parallelism Assignment ZERO, CS 267, UC Berkeley, Spring 2015

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

YALES2 porting on the Xeon- Phi Early results

Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

OpenFOAM Opensource and CFD

Scientific Computing Programming with Parallel Objects

Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age

OPEN-SOURCE CFD ANALYSIS OF MULTI-DOMAIN UNSTEADY HEATING WITH NATURAL CONVECTION

OpenFOAM Optimization Tools

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems

and RISC Optimization Techniques for the Hitachi SR8000 Architecture

How To Run A Cdef Simulation

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Retargeting PLAPACK to Clusters with Hardware Accelerators

TWO-DIMENSIONAL FINITE ELEMENT ANALYSIS OF FORCED CONVECTION FLOW AND HEAT TRANSFER IN A LAMINAR CHANNEL FLOW

ME6130 An introduction to CFD 1-1

Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o

Model of a flow in intersecting microchannels. Denis Semyonov

Recent Advances in HPC for Structural Mechanics Simulations

Keys to node-level performance analysis and threading in HPC applications

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

CHAPTER 1 INTRODUCTION

Auto-Tuning TRSM with an Asynchronous Task Assignment Model on Multicore, GPU and Coprocessor Systems

Self Financed One Week Training

MAQAO Performance Analysis and Optimization Tool

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

CastNet: Modelling platform for open source solver technology

Load Balancing Algorithms for Sparse Matrix Kernels on Heterogeneous Platforms

~ Greetings from WSU CAPPLab ~

ultra fast SOM using CUDA

Evaluation of CUDA Fortran for the CFD code Strukti

CastNet: GUI environment for OpenFOAM

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

Jean-Pierre Panziera Teratec 2011

Course Outline for the Masters Programme in Computational Engineering

ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING

A Case Study - Scaling Legacy Code on Next Generation Platforms

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

A Load Balancing Tool for Structured Multi-Block Grid CFD Applications

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

AN INTERFACE STRIP PRECONDITIONER FOR DOMAIN DECOMPOSITION METHODS

Scalability evaluation of barrier algorithms for OpenMP

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

OpenFOAM: Computational Fluid Dynamics. Gauss Siedel iteration : (L + D) * x new = b - U * x old

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

PyFR: Bringing Next Generation Computational Fluid Dynamics to GPU Platforms

Customer Training Material. Lecture 2. Introduction to. Methodology ANSYS FLUENT. ANSYS, Inc. Proprietary 2010 ANSYS, Inc. All rights reserved.

IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud

Simulation of Fluid-Structure Interactions in Aeronautical Applications

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA

HP ProLiant SL270s Gen8 Server. Evaluation Report

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer

The PHI solution. Fujitsu Industry Ready Intel XEON-PHI based solution. SC Denver

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

Advanced discretisation techniques (a collection of first and second order schemes); Innovative algorithms and robust solvers for fast convergence.

Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism

CFD Simulation of Subcooled Flow Boiling using OpenFOAM

Lecture 6 - Boundary Conditions. Applied Computational Fluid Dynamics

Scalability and Classifications

Introduction to CFD Analysis

OpenFOAM at FM LTH. Erdzan Hodzic. FM Seminars: 16-march Division of Fluid Mechanics, Department of Energy Sciences, Lund University

Adaptation of General Purpose CFD Code for Fusion MHD Applications*

Introduction to GPU Programming Languages

Modeling of Earth Surface Dynamics and Related Problems Using OpenFOAM

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations

QCD as a Video Game?

Use of OpenFoam in a CFD analysis of a finger type slug catcher. Dynaflow Conference 2011 January , Rotterdam, the Netherlands

benchmarking Amazon EC2 for high-performance scientific computing

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

GPU Acceleration of the SENSEI CFD Code Suite

Transcription:

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center, KAUST, Thuwal, Saudi Arabia, Heterogeneous Computing Laboratory, UCD, Dublin, Ireland, amani.alonazi@kaust.edu.sa May 21, 2014 Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 1 / 31

Overview 1 Introduction OpenFOAM CFD Package OpenFOAM Applications 2 Performance Analysis laplacianfoam icofoam Conjugate Gradient 3 Proposed Optimizations Hybrid CG Solver Hybrid Pipelined CG Solver Heterogenous Decomposition 4 Experimental Results Heterogenous Decomposition Speedup icofoam Speedup laplacianfoam Hybrid Solvers Performance 5 Conclusion Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 2 / 31

1 Introduction OpenFOAM CFD Package OpenFOAM Applications 2 Performance Analysis laplacianfoam icofoam Conjugate Gradient 3 Proposed Optimizations Hybrid CG Solver Hybrid Pipelined CG Solver Heterogenous Decomposition 4 Experimental Results Heterogenous Decomposition Speedup icofoam Speedup laplacianfoam Hybrid Solvers Performance 5 Conclusion Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 3 / 31

Motivation Hardware changes have to be taken into accounts Parallelism and heterogeneity in modern HW Per-processor performance on heterogeneous systems Algorithms and codes have to be redesigned The heterogeneity of these platforms leads to several challenges and much contemporary attention is devoted to new software solutions. This trend in the HPC platforms invites redesign of the CFD packages or the algorithms themselves to use these platforms efficiently. I would rather have today s algorithms on yesterday s computers than vice versa. Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 4 / 31

OpenFOAM CFD Package Open source Field Operation And Manipulation (OpenFOAM) is a library written in C++ used to solve PDEs It covers a wide range of solvers employed in CFD, such as Laplace and Poisson equations, incompressible flow, multiphase flow, and user defined models Free, open source, parallel, modular, and flexible Large, user-driven support community Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 5 / 31

OpenFOAM Parallel Approach MPI Bulk Synchronous Parallel Model OpenFOAM uses MPI to provide parallel multi-processors functionality The mesh and its associated fields are divided into subdomains and allocated to separate processors (domain decomposition) Domain Decomposition Methods It provides different utilities for partitioning the domain, such as: Simple Hierarchal METIS SCOTCH The communication between the subdomains is implicitly handled within the package Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 6 / 31

OpenFOAM Selected Applications icofoam The incompressible laminar Navier-Stokes equations can be solved by icofoam, which applies the PISO algorithm in time stepping loop. laplacianfoam The solver is used to find the solution of the Laplacian equation. The equation contains one variable, a passive scalar, for instance, a temperature, T. u = 0 u + (uu) (ν u) = p t p: CG u: Bi-CGSTAB T t 2 (D T T) = 0 T : CG Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 7 / 31

1 Introduction OpenFOAM CFD Package OpenFOAM Applications 2 Performance Analysis laplacianfoam icofoam Conjugate Gradient 3 Proposed Optimizations Hybrid CG Solver Hybrid Pipelined CG Solver Heterogenous Decomposition 4 Experimental Results Heterogenous Decomposition Speedup icofoam Speedup laplacianfoam Hybrid Solvers Performance 5 Conclusion Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 8 / 31

Performance Analysis: Platform A single node consisting of two sockets, each socket is connected to a GPU device (Tesla C2050) The socket has 6 dual cores Intel Xeon CPU X5670 @ 2.93GHz Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 9 / 31

laplacianfoam: 3D Heat Equation Heat equation test case solved over a three-dimensional slab geometry using the laplacianfoam. Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 10 / 31

icofoam: Lid-driven Cavity flow The lid-driven cavity flow test case contains the solution of a laminar, isothermal and incompressible flow over a three-dimensional cubic geometry. The top boundary of the cube is a wall that moves in the x direction, whereas the rest are static walls. Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 11 / 31

Conjugate Gradient 1 iteration A single iteration of the conjugate gradient solver using a matrix derived from the 3D heat equation. Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 12 / 31

1 Introduction OpenFOAM CFD Package OpenFOAM Applications 2 Performance Analysis laplacianfoam icofoam Conjugate Gradient 3 Proposed Optimizations Hybrid CG Solver Hybrid Pipelined CG Solver Heterogenous Decomposition 4 Experimental Results Heterogenous Decomposition Speedup icofoam Speedup laplacianfoam Hybrid Solvers Performance 5 Conclusion Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 13 / 31

Proposed Optimizations Hybrid CG Solver Hybrid Pipelined CG Solver Heterogenous Decomposition Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs. Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 14 / 31

Hybrid CG Multicore/MultiGPU CUDA Kernel is based on CUSP and Thrust calls. CSR Format in the GPU and Skyline Format in the CPU Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 15 / 31

Hybrid Pipelined CG Recently introduced by Ghysels and Vanroose Reduces the global communication to 1 time instead of three Offers better scalability at the price of extra computations r 0 = b Ax w 0 = Ar 0 k = 0 while r 2 τ do λ k = dotc(r k, r k ) δ = dotc(w k, r k ) q = SpMV (A, w k ) if (k >0) then β = λ k / λ k 1 α = λ k / (δ - (β λ k )/α) else β = 0 α = λ k / δ end if z = q + β z s = w + β s p = r + β p x = x + α p r = r α s w = w α z end while P. Ghysels and W. Vanroose, Hiding global synchronization latency in the preconditioned conjugate gradient algorithm, Parallel Computing, 2013. Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 16 / 31

Hybrid Pipelined CG Multicore/MultiGPU CUDA Kernel is based on CUSP and Thrust calls. CSR Format in the GPU and Skyline Format in the CPU Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 17 / 31

Heterogenous Decomposition The main idea is to adequately partition and assign the subdomain to the processor in proportion to its performance more powerful processors get larger subdomains Combines the performance model and the METIS/SCOTCH library The load of the processor will be balanced if the number of computations performed by each processor is accordance to its speed on execution the kernel s i (n i ) = n i + 2F i T (n i ) r i (n i ) = s i(n i ) p s i(n i ) n i,new = N r i (n i ) Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 18 / 31

1 Introduction OpenFOAM CFD Package OpenFOAM Applications 2 Performance Analysis laplacianfoam icofoam Conjugate Gradient 3 Proposed Optimizations Hybrid CG Solver Hybrid Pipelined CG Solver Heterogenous Decomposition 4 Experimental Results Heterogenous Decomposition Speedup icofoam Speedup laplacianfoam Hybrid Solvers Performance 5 Conclusion Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 19 / 31

Experimental Results: Heterogenous Decomposition 0.02 0.018 Hybrid CG Hybrid Pipelined CG 0.016 Average Time Per Iteration 0.014 0.012 0.01 0.008 0.006 0.004 0.002 0 Metis Heterogeneous Decomositon Methods Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 20 / 31

Experimental Results: Speedup icofoam I N=100k 3.5 3 Hybrid CG Hybrid Pipelined CG 2.5 Speedup 2 1.5 1 0.5 0 2 4 8 12 16 Processors Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 21 / 31

Experimental Results: Speedup icofoam II N=1M 2 1.8 1.6 Hybrid CG Hybrid Pipelined CG Speedup 1.4 1.2 1 0.8 0.6 0.4 0.2 0 2 4 8 12 16 Processors Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 22 / 31

Experimental Results: Speedup laplacianfoam I N=100k 8 7 Hybrid CG Hybrid Pipelined CG 6 Speedup 5 4 3 2 1 0 2 4 8 12 16 Processors Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 23 / 31

Experimental Results: Speedup laplacianfoam II N=1M 4 3.5 Hybrid CG Hybrid Pipelined CG 3 2.5 Speedup 2 1.5 1 0.5 0 2 4 8 12 16 Processors Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 24 / 31

Experimental Results: Performance 512 Hybrid Pipelined CG Hybrid CG 256 128 MFLOPs/S 64 32 16 8 4 2 1 10 3 10 4 10 5 10 6 10 7 Problem Size Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 25 / 31

Experimental Results: Roofline 512 Peak DP 256 128 Attainable Gflop/s 64 32 16 8 4 2 1 DRAM BW PCI BW Stream DRAM BW 0.125 0.25 0.5 1 2 4 8 16 FLOP:BYTE Ratio Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 26 / 31

1 Introduction OpenFOAM CFD Package OpenFOAM Applications 2 Performance Analysis laplacianfoam icofoam Conjugate Gradient 3 Proposed Optimizations Hybrid CG Solver Hybrid Pipelined CG Solver Heterogenous Decomposition 4 Experimental Results Heterogenous Decomposition Speedup icofoam Speedup laplacianfoam Hybrid Solvers Performance 5 Conclusion Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 27 / 31

Conclusion Memory bound applications, such as the OpenFOAM selected solvers, can take better advantage of the full hardware potential, which is now complex, hybrid and heterogeneous, if all resources are taken into accounts in a holistic approach. Vector reduction kernel performs n 1 memory transactions, with n the vector size and cannot be combined with other operations low arithmetic intensity, low memory throughput and poor scalability when increasing number of GPU/CPU. The need for dynamic load balancing scheduling, which adaptively balances the workload during the run-time, by memory-aware work stealing. The experimental results show that the hybrid implementation of both solvers significantly outperforms state-of-the-art implementations of a widely used open source package. Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 28 / 31

Thank You! Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 29 / 31

Backup Slides. Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 30 / 31

PISO Algorithm Pressure Implicit with Splitting of Operators (PISO)* 1 Set the boundary conditions. 2 Solve the discretized momentum equation to compute an intermediate velocity field. 3 Compute the mass fluxes at the cells faces. 4 Solve the pressure equation. 5 Correct the mass fluxes at the cell faces. 6 Correct the velocities on the basis of the new pressure field. 7 Update the boundary conditions. 8 Repeat from 3 for the prescribed number of times. 9 Increase the time step and repeat from 1. *J. H. Ferziger, M. Peric, Computational Methods for Fluid Dynamics, Springer, 3rd Ed., 2001. H. Jasak, Error Analysis and Estimation for the Finite Volume Method with Applications to Fluid Flows, Ph.D. Thesis, Imperial College, London, 1996. http://openfoamwiki.net Amani AlOnazi (KAUST) ParCFD 2014 May 21, 2014 31 / 31