Graphic Processing Units: a possible answer to High Performance Computing?



Transcription:

4th ABINIT Developer Workshop, Résidence L'Escandille, Autrans
HPC & Graphic Processing Units: a possible answer to High Performance Computing?
Luigi Genovese, ESRF - Grenoble, 26 March 2009
http://inac.cea.fr/l_sim/ Luigi Genovese Theory Group - ESRF

The code
Interpolating, Daubechies
Different numerical operations performed:
  For each wavefunction (Hamiltonian application)
  Between wavefunctions (linear algebra)
Numerical tools:
  Convolutions with short filters
  BLAS routines
  FFT (Poisson solver)

Separable convolutions
We must calculate

$$F(I_1,I_2,I_3) = \sum_{j_1,j_2,j_3=0}^{L} h_{j_1} h_{j_2} h_{j_3}\, G(I_1-j_1,\, I_2-j_2,\, I_3-j_3)$$
$$\phantom{F(I_1,I_2,I_3)} = \sum_{j_1=0}^{L} h_{j_1} \sum_{j_2=0}^{L} h_{j_2} \sum_{j_3=0}^{L} h_{j_3}\, G(I_1-j_1,\, I_2-j_2,\, I_3-j_3)$$

Application of three successive operations:
  1. $A_3(I_3,i_1,i_2) = \sum_j h_j\, G(i_1,i_2,I_3-j) \quad \forall\, i_1,i_2$
  2. $A_2(I_2,I_3,i_1) = \sum_j h_j\, A_3(I_3,i_1,I_2-j) \quad \forall\, I_3,i_1$
  3. $F(I_1,I_2,I_3) = \sum_j h_j\, A_2(I_2,I_3,I_1-j) \quad \forall\, I_2,I_3$

Main routine: convolution + transposition

$$F(I,a) = \sum_j h_j\, G(a,I-j) \quad \forall\, a$$
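The three successive one-dimensional passes can be sketched in a few lines of plain Python. This is only an illustration of the "convolution + transposition" idea, not the code discussed in the talk: the function names `conv_transpose` and `conv3d_separable`, the flat row-major storage, and the periodic wrap-around at the boundaries are all assumptions made for this example.

```python
def conv_transpose(g, h, n_a, n_i):
    """Compute F(I, a) = sum_j h[j] * G(a, I - j).

    g is a flat row-major array of shape (n_a, n_i).
    The result is a flat row-major array of shape (n_i, n_a),
    i.e. the convolved index becomes the slowest one (transposition).
    Assumption: periodic wrap-around in the convolved index.
    """
    f = [0.0] * (n_i * n_a)
    for a in range(n_a):
        for i_out in range(n_i):
            acc = 0.0
            for j, hj in enumerate(h):
                acc += hj * g[a * n_i + (i_out - j) % n_i]
            f[i_out * n_a + a] = acc
    return f


def conv3d_separable(g, h, n1, n2, n3):
    """Apply the same 1D filter h along all three axes of G(i1, i2, i3).

    g is flat row-major with shape (n1, n2, n3). Each pass convolves the
    fastest index and moves it to the front, so after three passes the
    original index order (I1, I2, I3) is restored.
    """
    a3 = conv_transpose(g, h, n1 * n2, n3)     # A3(I3, i1, i2)
    a2 = conv_transpose(a3, h, n3 * n1, n2)    # A2(I2, I3, i1)
    return conv_transpose(a2, h, n2 * n3, n1)  # F(I1, I2, I3)
```

Because every pass is the same routine called with different dimensions, only one kernel has to be written and tuned, and each output element F(I, a) can be computed independently of the others.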

Evolution of accelerators
Increasing power in recent years
The price (and the power consumption) per GFlops keeps dropping!

Why not use GPUs?
How to do scientific coding on a GPU?
The hardware is designed for...
  graphic calculation (texture, shaders, rendering)
  single precision calculations
Which programming language?

The CUDA programming language
NVidia: the CUDA programming language
  The API is an extension to ANSI C(++): gentle learning curve
  The hardware is designed for a lightweight runtime: high performance with moderate optimisation costs
  The SDK helps the beginner
CUBLAS and CUFFT
  Nvidia provides the CUDA user with pre-built libraries for BLAS and FFT
  They can really be used as black boxes
Since July 2008 NVidia cards fully support double precision (IEEE compliant)

Example: BLAS used in the code
Double precision calculations for N_orb × m matrices (m = 300 kB)
[Figure: speedup (double precision) of DGEMM and DSYRK versus number of orbitals]

Performances, free BC
[Figure: percentage of the run time spent in each code section (Comm, Other, Precond, HamApp, PSolver, sumrho, LinAlg) and total time in seconds (log scale), versus number of atoms]

Performances, periodic BC (8 atoms/core)
[Figure: percentage of the run time spent in each code section (Comm, Other, PureCPU, precond, locham, locden, LinAlg) and total time in seconds (log scale), versus number of cores]

The code on GPU
Main intensive routine: convolution + transposition

$$F(I,a) = \sum_j h_j\, G(a,I-j) \quad \forall\, a$$

Combined set of 1D convolutions: easy to parallelize (in the GPU sense)
Short filters: loop unrolling, fewer registers
Optimal for hiding memory latency with arithmetic
The GPU code makes it possible to use hybrid CPU-GPU supercomputers
  CEA/GENCI Titane machine, hybrid section: 192 Nvidia Tesla over 800 Intel Nehalem cores

One-dimensional convolutions, June 2008
Preliminary results (internship of M. Ospici, LIG - Bull)
[Figure: speedup versus data size (Mb) for G80, GT200 (single precision) and GT200 (double precision)]

All three-dimensional convolution operators, now
Double precision calculations, full 3D operators
[Figure: speedup (double precision) of locden, locham and precond versus wavefunction size (MB)]

Full hybrid code
We can insert it in the full code, in parallel
[Figure: CPU code compared with hybrid code (rel.): percentage of the run time per code section (Comm, Other, PureCPU, precond, locham, locden, LinAlg) and total time in seconds (log scale), versus number of cores]

Around 7 times faster
A lot can still be improved
[Figure: percentage of the run time per code section (Comm, Other, PureCPU, precond, locham, locden, LinAlg) and total time in seconds (log scale), versus number of cores]

Summary and outlook
Considerations:
  Our experience: GPUs may represent a real alternative for speeding up calculations
  Production runs are accessible, not only prototypes
...but...
  Can one draw general conclusions? Probably not...
  How can we estimate the benefit/cost ratio?
    Nature of the numerical operations
    Hot-spot operations (> 80% of the overall time)
    Multi-GPU?
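One rough way to approach the benefit/cost question is Amdahl's law. It is not part of the slides and is brought in here only as a standard first estimate: if a fraction p of the run time sits in the hot-spot operations and only those are accelerated by a factor s, the whole run speeds up by 1 / ((1 - p) + p / s). The function name `overall_speedup` and the sample values below (p = 0.8, taken from the hot-spot threshold above, and s = 20, chosen purely for illustration) are assumptions.

```python
def overall_speedup(hot_fraction, kernel_speedup):
    """Amdahl's-law estimate of the whole-code speedup.

    hot_fraction   -- share of the original run time spent in the
                      accelerated hot-spot operations (between 0 and 1)
    kernel_speedup -- speedup of those operations alone (greater than 0)
    """
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must lie in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / kernel_speedup)


if __name__ == "__main__":
    # Illustrative values only: an 80% hot spot accelerated 20 times.
    print(round(overall_speedup(0.8, 20.0), 2))
```

With these sample values the estimate is roughly 4.2, and it can never exceed 1 / (1 - p) = 5 however fast the accelerated part becomes. This is why the share of time spent in the hot spots matters as much as the raw kernel speedup.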