Turbomachinery CFD on many-core platforms experiences and strategies





Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan and Tobias Brandvik Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29 2010

Overview
- The future (the present) is many-core
- GPU hardware
- A strategy for programming GPUs for CFD
- Turbostream: a many-core CFD solver for turbomachinery
- Example simulations
- Conclusions

Many-core

Moore's Law. Source: Intel

Clock speed: the power wall. Source: Tom's Hardware

Today's many-core chips: Intel 4-core CPU, AMD 6-core CPU, NVIDIA 240-core GPU

NVIDIA Tesla GPUs
- C2050: card for a desktop PC; 448 cores, 3 GB; peak 515 GFLOP/s (double precision)
- S2050: 1U rack mount with 4 x C2050 cards; requires a connection to a CPU host, so the typical density is 2 GPUs per 1U
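As a sanity check on the quoted peak (the 1.15 GHz shader clock and the Fermi rate of one double-precision FMA per two cores per cycle are NVIDIA's published figures, not numbers from the slide):

    448 cores x 1.15 GHz x (2 FLOP per FMA) / 2 = 515 GFLOP/s (double precision)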

Many-core summary
- Many-core computing is here today, and the trend is towards ever greater core counts.
- The run-time of a scalar (non-parallel) code is no longer guaranteed to fall with each new computer; it may even increase.
- Codes need to be parallel to take advantage of new hardware.
- The current distributed parallel computing model (MPI) will not work well for many-core CPUs, and may not work at all for GPUs.

GPUs: what do we get from all those cores?

How are so many cores possible on a GPU? Transistor usage: the same transistor budget yields 4 CPU cores or >200 GPU cores. Source: NVIDIA

Why so many cores on a GPU?

GPUs and scientific computing GPUs are designed to apply the same shading function to many pixels simultaneously; in other words, to apply the same function to many data simultaneously. This is what CFD needs!
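To make the pattern concrete, here is a minimal CUDA sketch of "same function, many data": one thread per element, all threads running the same code. The kernel and names are illustrative, not taken from the talk.

    #include <cuda_runtime.h>

    // Every thread applies the same function to one element (SIMT).
    __global__ void scale_cells(const float* in, float* out, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = factor * in[i];   // same operation, many data
    }

    int main()
    {
        const int n = 1 << 20;
        float *in, *out;
        cudaMalloc(&in,  n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        // (buffers left uninitialised here; a real code would copy data in)
        const int threads = 256;
        scale_cells<<<(n + threads - 1) / threads, threads>>>(in, out, 0.5f, n);
        cudaDeviceSynchronize();
        cudaFree(in);
        cudaFree(out);
        return 0;
    }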

GPU performance

GPUs and CFD
Good speedups (> 10x) can be expected for:
- codes that apply the same function to lots of data
- codes where the amount of communication to/from the card is small compared to the time spent computing on the card (which holds up to 4 GB)
Turbomachinery CFD is an excellent fit!

Programming GPUs for CFD

Solving PDEs on multi-core processors
Navier-Stokes; finite volume, explicit time stepping. The work splits into:
- Stencil operations (90% of run-time): second order central differences, Scree scheme; Set flux, Sum flux, Shear stress...
- Non-stencil operations (10% of run-time): Multigrid, Mixing plane, Sliding plane...

Stencil operations on multi-core processors
- A single implementation? e.g. OpenCL, but it still needs to be optimised for each piece of hardware.
- Multiple implementations? e.g. C with SSE for CPUs, CUDA for NVIDIA GPUs, ...
- Alternative: a high-level language for stencil operations (a DSL) and source-to-source compilation.
Brandvik, T. and Pullan, G. (2010) SBLOCK: A Framework for Efficient Stencil-Based PDE Solvers on Multi-Core Platforms, 1st International Workshop on the Frontier of GPU Computing, Bradford, July 2010

Stencil example: structured grid indexing. [Figure: a 7-point stencil centred on (i, j, k) with neighbours (i-1, j, k), (i+1, j, k), (i, j-1, k), (i, j+1, k), (i, j, k-1), (i, j, k+1).]

Stencil example in Fortran

Stencil example: stencil definition
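The Fortran listing and the stencil definition on these two slides were images and did not survive transcription. As a stand-in, here is a minimal CUDA sketch of the kind of 7-point update the indexing diagram describes (a simple smoothing operator; the coefficients, the smoothing factor sf and all names are hypothetical, not the talk's code):

    // Hypothetical 7-point stencil on a structured ni x nj x nk grid.
    __device__ inline int idx3(int i, int j, int k, int ni, int nj)
    {
        return i + ni * (j + nj * k);
    }

    __global__ void smooth7(const float* a, float* b,
                            int ni, int nj, int nk, float sf)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int k = blockIdx.z * blockDim.z + threadIdx.z;

        // Interior points only; boundary conditions are applied elsewhere.
        if (i > 0 && i < ni - 1 && j > 0 && j < nj - 1 && k > 0 && k < nk - 1) {
            float nbr = a[idx3(i-1, j, k, ni, nj)] + a[idx3(i+1, j, k, ni, nj)]
                      + a[idx3(i, j-1, k, ni, nj)] + a[idx3(i, j+1, k, ni, nj)]
                      + a[idx3(i, j, k-1, ni, nj)] + a[idx3(i, j, k+1, ni, nj)];
            b[idx3(i, j, k, ni, nj)] =
                (1.0f - sf) * a[idx3(i, j, k, ni, nj)] + sf * nbr / 6.0f;
        }
    }

In the approach advocated here, only the update expression would be written by hand; the indexing and launch boilerplate come from the framework.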

Source-to-source compilation The stencil definition is transformed at compile-time into code that can run on the chosen processor. The transformation is performed by filling in a pre-defined template with the stencil definition: the same definition combined with a CPU template yields CPU source (.c), with a GPU template yields GPU source (.cu), and with a template for some future processor X yields X source.
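A sketch of the template idea (hypothetical code; a single .cu file is used here so both outputs can be shown together): the one-line update body comes from the stencil definition, while everything around it comes from the per-processor template.

    // CPU template output: plain triple loop around the definition body.
    void stencil_cpu(const float* a, float* b, int ni, int nj, int nk)
    {
        for (int k = 0; k < nk; ++k)
            for (int j = 0; j < nj; ++j)
                for (int i = 1; i < ni - 1; ++i) {
                    int idx = i + ni * (j + nj * k);
                    b[idx] = 0.5f * (a[idx - 1] + a[idx + 1]);  // <- definition body
                }
    }

    // GPU template output: one thread per cell around the same body.
    __global__ void stencil_gpu(const float* a, float* b, int ni, int nj, int nk)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int k = blockIdx.z * blockDim.z + threadIdx.z;
        if (i < 1 || i >= ni - 1 || j >= nj || k >= nk)
            return;
        int idx = i + ni * (j + nj * k);
        b[idx] = 0.5f * (a[idx - 1] + a[idx + 1]);              // <- same body
    }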

Implementation details
There are many optimisation strategies for stencil operations (see the Supercomputing 2008 paper by Datta et al.).
- CPUs: parallelise with pthreads; SSE vectorisation
- GPUs: cyclic queues

CUDA strategy [figures not transcribed]
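The "CUDA strategy" slides were figures and did not transcribe. The "cyclic queues" named above refer to a standard GPU stencil pattern: a thread block owns an (i, j) tile and marches along k, keeping the planes it needs in fast memory so each value is loaded from global memory only once. A minimal sketch of that pattern under stated assumptions (names, the smoothing operator, and the halo-free simplification are all illustrative, not Turbostream's kernels):

    #define TILE 16

    __global__ void stencil_march(const float* a, float* b,
                                  int ni, int nj, int nk, float sf)
    {
        __shared__ float cur[TILE][TILE];        // current k-plane of the tile

        // For brevity we assume ni and nj are multiples of TILE,
        // so no thread takes this early exit (needed for __syncthreads).
        int i = blockIdx.x * TILE + threadIdx.x;
        int j = blockIdx.y * TILE + threadIdx.y;
        if (i >= ni || j >= nj) return;

        long stride = (long)ni * nj;             // distance between k-planes
        long idx = i + (long)ni * j;             // element (i, j, 0)

        float below  = a[idx];                   // k-1 value, kept in a register
        float centre = a[idx + stride];          // k value
        for (int k = 1; k < nk - 1; ++k) {
            idx += stride;                       // idx now points at (i, j, k)
            float above = a[idx + stride];       // k+1 value

            cur[threadIdx.y][threadIdx.x] = centre;   // publish plane to the tile
            __syncthreads();

            // Tile-edge threads skip the update here; a real code would
            // load halo cells or overlap tiles instead.
            if (i > 0 && i < ni - 1 && j > 0 && j < nj - 1 &&
                threadIdx.x > 0 && threadIdx.x < TILE - 1 &&
                threadIdx.y > 0 && threadIdx.y < TILE - 1) {
                float nbr = cur[threadIdx.y][threadIdx.x - 1]
                          + cur[threadIdx.y][threadIdx.x + 1]
                          + cur[threadIdx.y - 1][threadIdx.x]
                          + cur[threadIdx.y + 1][threadIdx.x]
                          + below + above;
                b[idx] = (1.0f - sf) * centre + sf * nbr / 6.0f;
            }
            __syncthreads();                     // before the tile is reused

            below  = centre;                     // rotate the queue
            centre = above;
        }
    }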

Testbed
- CPU 1: Intel Core i7 920 (2.66 GHz)
- CPU 2: AMD Phenom II X4 940 (3.0 GHz)
- GPU: NVIDIA GTX 280

Stencil benchmark [figures not transcribed]

Turbostream

Turbostream
- We have implemented a new solver that can run on both CPUs and GPUs.
- The starting point was an existing solver (Fortran, MPI) called TBLOCK.
- The new solver is called Turbostream.
Brandvik, T. and Pullan, G. (2009) An Accelerated 3D Navier-Stokes Solver for Flows in Turbomachines, ASME IGTI Turbo Expo, Orlando, June 2009

TBLOCK
- Developed by John Denton
- Blocks with arbitrary patch interfaces
- Simple and fast algorithm
- 20,000 lines of Fortran 77; the main solver routines are only 10,000 lines

Turbostream
- 3,000 lines of stencil definitions (~15 different stencil kernels)
- The code generated from the stencil definitions is 15,000 lines
- An additional 5,000 lines of C for boundary conditions, file I/O, etc.
- The source code is very similar to TBLOCK: every subroutine has an equivalent stencil definition

Turbostream performance
Processor        TBLOCK    Turbostream
Intel Nehalem    1.21      1.48
AMD Phenom II    1         0.89
NVIDIA GT200     -         10.2
TBLOCK runs in parallel on all four CPU cores using MPI. Initial results from the latest NVIDIA Fermi GPU show a further 2x speedup.

Darwin GPU cluster Darwin is the central Cambridge cluster. In 2008, NVIDIA made Cambridge a CUDA Centre of Excellence, donating 128 GPUs to Darwin.

Scaling results: unsteady multi-stage turbine
- Baseline: 27 million nodes; largest: 432 million nodes
- Weak scaling: grid size increases with the number of GPUs
- Strong scaling: constant grid size (27 million nodes)

What can this be used for?
- Steady, single passage: 0.5 million nodes, 40 s (1 GPU)
- Steady, 3-stage: 3 million nodes, 7.5 min (1 GPU)
- Unsteady, 3-stage: 120 million nodes, 4 hours per revolution (16 GPUs)

Example simulations

Example simulations
- Three-stage steady turbine simulation (mixing planes)
- Three-stage unsteady compressor simulation (sliding planes)

Three-stage turbine. Turbine from Rosic et al., ASME J. Turbomach. (2006)

Entropy function through the machine: 4 million grid nodes; 10 hours on a single CPU, 8 minutes on four GPUs.

Total pressure loss at Stator 3 exit. [Figure: experiment vs. Turbostream contours.]

Compressor simulation
- Three-stage compressor test rig at Siemens, UK
- 160 million grid nodes
- 5 revolutions needed to obtain a periodic solution (22,500 time steps)
- On 32 NVIDIA GT200 GPUs, each revolution takes 24 hours

Entropy function contours at mid-span

Conclusions

Conclusions
- The switch to multi-core processors enables a step change in performance, but existing codes have to be rewritten.
- The differences between processors make it difficult to hand-code a solver that will run on all of them.
- We suggest a high-level abstraction coupled with source-to-source compilation.

Conclusions
- A new solver called Turbostream, based on Denton's TBLOCK, has been implemented.
- Turbostream running on an NVIDIA GPU is ~10 times faster than TBLOCK on a quad-core Intel or AMD CPU.
- Single blade-row calculations are almost interactive on a desktop (10-30 seconds).
- Multi-stage calculations take a few minutes on a small cluster ($10,000).
- Full-annulus unsteady calculations complete overnight on a modest cluster ($100,000).