LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR. Frédéric Kuznik, frederic.kuznik@insa-lyon.fr

Framework: Introduction; Hardware architecture; CUDA overview; Implementation details; A simple case: the lid-driven cavity problem

Presentation of my activities. CETHIL: Thermal Sciences Center of Lyon. Current work: simulation of fluid flows with heat transfer; use/development of thermal LBM [1,2]; LBM computation using the GPU [3]. Objective: coupling the two approaches efficiently.
[1] A double-population lattice Boltzmann method with non-uniform mesh for the simulation of natural convection in a square cavity, Int. J. Heat and Fluid Flow, vol. 28(5), pp. 862-870, 2007
[2] Numerical prediction of natural convection occurring in building components: a double-population lattice Boltzmann method, Numerical Heat Transfer A, vol. 52(4), pp. 315-335, 2007
[3] LBM based flow simulation using GPU computing processor, Computers & Mathematics with Applications, under review

Introduction. Driven by the market for real-time, high-definition 3D graphics, the GPU has evolved into a highly parallel, multithreaded, multi-core processor (from the CUDA Programming Guide, 06/07/2008). GFLOPS = 10^9 floating-point operations per second.

Introduction. The memory bandwidth of the GPU has also evolved, driven by the demand for faster graphics (from the CUDA Programming Guide, 06/07/2008).

Previous works using GPU & LBM. GPU cluster for high performance computing (Fan et al. 2004): a 32-node cluster of GeForce 5800 Ultra cards. LB stream computing, or over 1 billion lattice updates per second on a single PC (J. Tölke, ICMMES 2007): GeForce 8800 Ultra.

Hardware: NVIDIA GeForce GTX 280. A commercial graphics card (can be fitted in a desktop PC). 240 processors running at 1.35 GHz. 1 GB GDDR3 with 141.7 GB/s bandwidth. Theoretical peak of about 1000 GFLOPS. Price around 500.

GPU hardware architecture. 15 texture processor clusters (TPC). 1 TPC is composed of 2 streaming multiprocessors (SM). 1 SM is composed of 8 streaming processors (SP) and 2 special function units. Each SM has 16 kB of shared memory.

GPU programming interface: NVIDIA CUDA(TM), a C-language programming environment (version 2.0). Definitions: Device = GPU; Host = CPU; Kernel = function that is called from the host and runs on the device. A CUDA kernel is executed by an array of threads.

GPU Programming Interface Akernelisexecutedbyagrid of thread block 1 thread block is executed by 1 multiprocessor A thread block contains a maximum of 1024 threads Threads are executed by processors within a single multiprocessor => Threads from different blocks cannot cooperate 10

Kernel memory access. Registers. Shared memory: on-chip (fast), small (16 kB). Global memory: off-chip, large. The host can read or write global memory only.
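A small, hedged sketch of how these memory spaces appear inside a kernel; the function and variable names and the 256-thread block size are assumptions for illustration:

// Assumes a launch with 256 threads per block.
__global__ void smoothRow(const float *in, float *out, int n)
{
    __shared__ float tile[256];                     // shared memory: on-chip, visible to the block

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // index held in a register (private to the thread)
    float v = (i < n) ? in[i] : 0.0f;               // read from global memory (off-chip)
    tile[threadIdx.x] = v;                          // stage the value in shared memory
    __syncthreads();                                // wait until the whole block has loaded

    float left  = (threadIdx.x > 0)              ? tile[threadIdx.x - 1] : v;
    float right = (threadIdx.x < blockDim.x - 1) ? tile[threadIdx.x + 1] : v;

    if (i < n)
        out[i] = (left + v + right) / 3.0f;         // write the result back to global memory
}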

LBM ALGORITHM. Step 1: collision (local, about 70% of the computational time). Step 2: propagation (neighbours, about 28% of the computational time) + boundary conditions. (From BERNSDORF J., How to make my LB code faster: software planning, implementation and performance tuning, ICMMES'08, Netherlands.)
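For reference, the two steps can be written compactly for the single-relaxation-time (BGK) variant; the simulation presented here uses the MRT model, so this is only an illustrative form:

\[
\text{collision:}\quad f_i^{\star}(\mathbf{x},t) = f_i(\mathbf{x},t) - \frac{1}{\tau}\left[ f_i(\mathbf{x},t) - f_i^{\mathrm{eq}}(\mathbf{x},t) \right]
\]
\[
\text{propagation:}\quad f_i(\mathbf{x}+\mathbf{e}_i\,\Delta t,\; t+\Delta t) = f_i^{\star}(\mathbf{x},t)
\]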

Implementation details: translation of the lattice grid into the CUDA topology (thread blocks within a grid of blocks). 1. Decomposition of the fluid domain into a lattice grid. 2. Indexing of the lattice nodes using the thread ID. 3. Decomposition of the CUDA domain into thread blocks. 4. Execution of the kernel by the thread blocks. A sketch of this mapping follows below.
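One possible CUDA expression of steps 1-4 is sketched here; the 2D block shape (16x16 nodes per thread block), the lattice size and the kernel/variable names are assumptions for illustration, not necessarily the layout of the presented code.

#define NX 1024   // lattice nodes in x (example size)
#define NY 1024   // lattice nodes in y (example size)

// Step 2: each thread identifies one lattice node from its block and thread IDs.
__global__ void initLattice(float *field)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < NX && y < NY)
        field[y * NX + x] = 0.0f;   // e.g. initialise this lattice node
}

// Steps 1, 3 and 4 on the host: the lattice grid is covered by a grid of thread blocks.
void launchInit(float *d_field)
{
    dim3 block(16, 16);                              // one thread block = a 16x16 tile of nodes
    dim3 grid((NX + block.x - 1) / block.x,
              (NY + block.y - 1) / block.y);         // grid of blocks covering the lattice
    initLattice<<<grid, block>>>(d_field);           // the kernel is executed by the thread blocks
}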

Pseudo code. Combine the collision and propagation steps:
for each thread block
    for each thread
        load f_i into shared memory
        compute the collision step
        do the propagation step
    end
end
Exchange information across boundaries
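As an illustration of such a combined kernel, here is a hedged D2Q9 sketch that uses the simpler BGK collision and periodic propagation; the presented code uses the MRT model, shared memory and proper boundary handling, so this only shows the overall structure.

#define NX 256
#define NY 256
#define Q  9

// D2Q9 lattice velocities and weights.
__constant__ int   ex[Q] = { 0, 1, 0,-1, 0, 1,-1,-1, 1 };
__constant__ int   ey[Q] = { 0, 0, 1, 0,-1, 1, 1,-1,-1 };
__constant__ float w [Q] = { 4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                             1.f/36, 1.f/36, 1.f/36, 1.f/36 };

__global__ void collideAndStream(const float *f_in, float *f_out, float tau)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= NX || y >= NY) return;
    int node = y * NX + x;

    // Load the 9 distributions of this node and compute macroscopic values.
    float f[Q], rho = 0.f, ux = 0.f, uy = 0.f;
    for (int i = 0; i < Q; ++i) {
        f[i] = f_in[i * NX * NY + node];
        rho += f[i];
        ux  += f[i] * ex[i];
        uy  += f[i] * ey[i];
    }
    ux /= rho;  uy /= rho;

    // Collision (local) followed immediately by propagation to the neighbour node.
    float usq = ux * ux + uy * uy;
    for (int i = 0; i < Q; ++i) {
        float cu    = ex[i] * ux + ey[i] * uy;
        float feq   = w[i] * rho * (1.f + 3.f * cu + 4.5f * cu * cu - 1.5f * usq);
        float fpost = f[i] - (f[i] - feq) / tau;

        // Periodic wrap for brevity; real boundary conditions are handled separately.
        int xn = (x + ex[i] + NX) % NX;
        int yn = (y + ey[i] + NY) % NY;
        f_out[i * NX * NY + yn * NX + xn] = fpost;
    }
}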

Application: 2D lid-driven cavity. Problem description: the top lid moves with u = u0, v = 0; the other three walls are at rest (u = 0, v = 0). D2Q9 MRT model of d'Humières (1992).

Results, Re = 1000. Reference data from Erturk et al. 2005. Vertical velocity profile along x = 0.5 and horizontal velocity profile along y = 0.5.

Performance in MLUPS (million lattice-node updates per second) versus mesh grid size and number of threads per block:

Mesh grid size | 16 threads | 32 threads | 64 threads | 128 threads | 256 threads
256²           |  71.0      |  71.0      |  71.1      |  71.1       |  70.9
512²           | 284.2      | 282.7      | 284.3      | 284.3       | 283.7
1024²          | 346.4      | 690.2      | 953.0      | 997.5       | 1007.9
2048²          | 240.0      | 583.0      | 803.3      | 961.3       | 1001.5
3072²          | 289.3      | 572.4      | 845.2      | 941.2       | 974.2

Performance: comparison with the CPU. Plot of the GPU/CPU speed-up versus the number of threads per block (CPU = Pentium IV, 3.0 GHz).

Conclusions. The GPU gives a gain of about 180x in calculation time. Multi-GPU servers exist: the NVIDIA Tesla S1070, composed of 4 GTX 280 GPUs (theoretical gain 720x). The evolution of GPUs is not finished! But using double-precision floating point, the gain falls to about 20x.

Outlook. A working group on LBM and GPU: 1 master's student with Prof. Bernard Tourancheau (ENS Lyon / INRIA) and a PhD student in 2009; 1 master's student with Prof. Eric Favier (ENISE DIPI). A working group on thermal LBM? Everybody is welcome to join the workgroup!

THANK YOU FOR YOUR ATTENTION. frederic.kuznik@insa-lyon.fr