# Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms

Save this PDF as:

Size: px
Start display at page:

Download "Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms"

## Transcription

1 Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms Björn Rocker Hamburg, June 17th 2010 Engineering Mathematics and Computing Lab (EMCL) KIT University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

2 Motivation Hybrid Mixed Precision Solvers Iterative Refinement Method Numerical Experiments Conclusions and Future Work Computational Fluid Dynamics Linear Systems in Computational Fluid Dynamics Fluid flow problems can be modeled by partial differential equations Often Finite Element Methods are used to find solutions Linearization methods lead to linear system Ax = b Solving the linear system is typically the most time-consuming step in the simulation process. 2/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

3 Motivation Hybrid Mixed Precision Solvers Iterative Refinement Method Numerical Experiments Conclusions and Future Work Acceleration Scenarios Computational Effort Acceleration Concepts Characteristics of the linear problem Use of different precision formats Characteristics of the applied solver Use of coprocessor technology Used floating point format Parallelization of the linear solvers Fundamental Idea of Mixed Precision Solvers: Acceleration without loss of accuracy High precision only in relevant parts Low precision for most of the algorithm Outsource low precision computations on parallel low precision coprocessor Use different precision formats within the Iterative Refinement Method 3/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

4 Iterative Refinement Method Newton s Method: x i+1 = x i ( f (x i )) 1 f (x i ) Idea: Apply Newton s Method to the function f (x) = b Ax x i+1 = x i ( f (x i )) 1 f (x i ) = x i ( A 1 )(b Ax i ) = x i + A 1 (b Ax i ) = x i + A 1 r i {z } =:c i 1: initial guess as starting vector: x 0 2: compute initial residual: r 0 = b Ax 0 3: while ( Ax i b 2 > ε r 0 ) do 4: r i = b Ax i 5: solve: Ac i = r i 6: update solution: x i+1 = x i + c i 7: end while 4/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

5 Iterative Refinement Method Newton s Method: x i+1 = x i ( f (x i )) 1 f (x i ) Idea: Apply Newton s Method to the function f (x) = b Ax x i+1 = x i ( f (x i )) 1 f (x i ) = x i ( A 1 )(b Ax i ) = x i + A 1 (b Ax i ) = x i + A 1 r i {z } =:c i 1: initial guess as starting vector: x 0 2: compute initial residual: r 0 = b Ax 0 3: while ( Ax i b 2 > ε r 0 ) do 4: r i = b Ax i 5: solve: Ac i = r i 6: update solution: x i+1 = x i + c i 7: end while 5/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

6 Iterative Refinement Method Newton s Method: x i+1 = x i ( f (x i )) 1 f (x i ) Idea: Apply Newton s Method to the function f (x) = b Ax x i+1 = x i ( f (x i )) 1 f (x i ) = x i ( A 1 )(b Ax i ) = x i + A 1 (b Ax i ) = x i + A 1 r i {z } =:c i In-Exact Newton Method Apply iterative method to solve Ac i = r i Residual stopping criterion ε inner >> ε E.g. Krylov Subspace Solver 6/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

7 Mixed Precision Iterative Refinement 1: initial guess as starting vector: x 0 2: compute initial residual: r 0 = b Ax 0 3: while ( Ax i b 2 > ε r 0 ) do 4: r i = b Ax i 5: solve: Ac i = r i 6: update solution: x i+1 = x i + c i 7: end while Any error correction solver can be used i.e. Krylov Subspace Methods (CG, GMRES...) 7/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

8 Hardware Platform HC3 Tesla IC1 Processor type Xeon 5540 Xeon 5450 Xeon 5355 Accelerator type - Tesla T10 - Processors per node 2 2 CPUs / 1 GPU 2 Cores per processor 4 8 / Theoretical comp. rate / core 10.1 GFlop/s 12 / 3.9 GFlop/s 10.7 GFlop/s Theoretical comp. rate / node 81 GFlop/s 96 / 933 GFlop/s 85.3 GFlop/s L2-cache per processor 8 MB 8 / - MB 8 MB Nodes 278 / 32 / Memory per node 24 / 48 / 144 GB 32 GB 16 GB Memory full machine 10.3 TB - 32 TB Theoretical comp. rate full machine 27 TFlop/s 1.0 TFlop/s 17.6 TFlop/s Power consumption load 80.8 kw 539 / 187,8 W a 103 kw a 8/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

9 Numerical Experiments Implementation Issues Test Matrices Iterative Refinement Fluid Flow Problem Double precision GMRES Fluid Flow in Venturi Nozzle Mixed precision GMRES multiple dimensions MKL-Based on CPU different condition number CUDA-Based on GPU different sparsity * Kansas City Star, /26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

10 Test-Case CFD Simulation of a Newtonian Fluid through a Venturi Nozzle. Fluid can be described by the (incompressible) Navier-Stokes equations: ρ Du Dt = p + µ u + f, u = 0, (1) Where u is the fluid velocity field, p the pressure field, ρ the constant fluid density, µ its molecular viscosity constant and f combines external forces acting on the fluid. The operator D depicts the non-linear material derivative. 10/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

11 Test-Cases CFD CFD1 CFD2 CFD3 problem: 2D fluid flow dimension: n = sparsity: nnz = storage format: CRS problem: 2D fluid flow dimension: n = sparsity: nnz = storage format: CRS problem: 2D fluid flow dimension: n = sparsity: nnz = storage format: CRS Table: Sparsity plots and properties of the CFD test-matrices. 11/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

12 Computation time CFD 1 Table: Computation time in s for problem CFD 1 based on a GMRES-(10) as inner solver for the error correction method. Computation time [s] CPU-Cores GPU HC3 double HC3 mixed IC1 double IC1 mixed Tesla mixed /26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

13 Computation time CFD 2 Table: Computation time in s for problem CFD 2 based on a GMRES-(10) as inner solver for the error correction method. Computation time [s] CPU-Cores GPU HC3 double HC3 mixed IC1 double IC1 mixed Tesla mixed /26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

14 Computation time CFD 3 Table: Computation time in s for problem CFD 3 based on a GMRES-(10) as inner solver for the error correction method. Computation time [s] CPU-Cores GPU HC3 double HC3 mixed IC1 double IC1 mixed Tesla mixed /26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

15 Energy efficiency Assumptions: E = P t Power Consumption (W) Remarks IC1 514 no variation in consumption HC Wdependingonload CPU-frequency "ondemand", SMT off Tesla 726 node: 539 W, GPU : 187 W * : measurements ** : manufacturer information 15/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

16 Energy Consumption CFD 1 Table: Energy consumption in Wh for problem CFD 1 based on a GMRES-(10) as inner solver for the error correction method. Energy consumption in Wh CPU-Cores GPU HC3 double HC3 mixed IC1 double IC1 mixed Tesla mixed /26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

17 Energy Consumption CFD 1 Figure: Energy consumption as a function of time for solving the CFD1 test-case on HC3, IC1 and Tesla. The inner solver is a GMRES-(10). 17/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

18 Energy Consumption CFD 2 Table: Energy consumption in Wh for problem CFD 2 based on a GMRES-(10) as inner solver for the error correction method. Energy consumption in Wh CPU-Cores GPU HC3 double HC3 mixed IC1 double IC1 mixed Tesla mixed /26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

19 Energy Consumption CFD 2 Figure: Energy consumption as a function of time for solving the CFD2 test-case on HC3, IC1 and Tesla. The inner solver is a GMRES-(10). 19/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

20 Energy Consumption CFD 3 Table: Energy consumption in Wh for problem CFD 3 based on a GMRES-(10) as inner solver for the error correction method. Energy consumption in Wh CPU-Cores GPU HC3 double HC3 mixed IC1 double IC1 mixed Tesla mixed /26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

21 Energy Consumption CFD 3 Figure: Energy consumption as a function of time for solving the CFD3 test-case on HC3, IC1 and Tesla. The inner solver is a GMRES-(10). 21/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

22 Evaluation of GPU-based linear solvers Name Tesla C2050 Tesla C1060 GTX 480 GTX 280a Chip T20 T10 GF100 GT200 Transistors ca. 3 Mrd. ca. 1,4 Mrd. ca. 3 Mrd. ca. 1,4 Mrd. Core frequency 1.15 GHz 1.3 GHz 1.4 GHz 1.3 GHz Shaders (MADD) GFLOPs (single) GFLOPs (double) Memory 3 GB GDDR5 4 GB GDDR3 1.5 GB GDDR5 1 GB GDDR3 Memory Frequency 1.5 GHz 0.8 GHz 1.8 GHz 1.1 GHz Memory Bandwidth 144 GB/s 102 GB/s 177 GB/s 141 GB/s ECC Memory yes no no no Power Consumption 247 W 187 W 250 W 236 W IEEE double/single yes/yes yes/partial yes/yes yes/partial Table: Key system characteristics of the four GPUs used for the tests. Computation rate and memory bandwidth are peak respectively theoretical values. 22/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

23 Performance Experiment setup Computation Time (s) problem solver type C2050 C1060 GTX 480 GTX 280 HC3 IC1 CFD 1 CFD 2 CFD 3 double mixed double mixed double mixed Table: Computation time (s) for problem CFD 1, CFD 2 and CFD 3 based on a GMRES-(30). IC1 and HC3 results on 8 Cores 23/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

24 Rankings Performance Energy consumption 1 GTX 480 (256 s) 1 GTX 480 (17,7 Wh) 2 C2050 (273 s) 2 C2050 (18,7 Wh) 3 GTX 280 (301 s) 3 GTX 280 (19,7 Wh) 4 HC3 (401 s) 4 HC3 (27,2 Wh) 5 C1060 (510 s) 5 C1060 (26,5 Wh) 6 IC1 (896 s) 6 IC1 (127,93 Wh) Performance and energy ranking for problem CFD 2 based on a GMRES-(30). IC1 and HC3 results on 8 Cores. Energy efficiency for GPUs computed without energy consumption for the host. Node for the host for GTX 480 has to consume less then 133 W to be more energy efficient than the HC3! 24/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

25 Conclusion Conclusion Iterative refinement shows high potential in case of computationally expensive problems (e.g. our CFD-Problems) Iterative refinement is able to exploit the excellent low precision performance of accelerator technologies (e.g. GPU) Energy efficiency and performance can be improved by accelerate older computing nodes by GPUs 25/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

26 Future Work Future Work Numerical analysis to optimize choice of floating point format Evaluate more hardware constellations Define more testcases and optimize solvers More accurate power consumption measurements FPGA-Technology offers free choice of Floating Point Formats 26/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods

### Accelerating CFD using OpenFOAM with GPUs

Accelerating CFD using OpenFOAM with GPUs Authors: Saeed Iqbal and Kevin Tubbs The OpenFOAM CFD Toolbox is a free, open source CFD software package produced by OpenCFD Ltd. Its user base represents a wide

### AeroFluidX: A Next Generation GPU-Based CFD Solver for Engineering Applications

AeroFluidX: A Next Generation GPU-Based CFD Solver for Engineering Applications Dr. Bjoern Landmann Dr. Kerstin Wieczorek Stefan Bachschuster 18.03.2015 FluiDyna GmbH, Lichtenbergstr. 8, 85748 Garching

### GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

### Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it

t.diamanti@cineca.it Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate

### GPGPU accelerated Computational Fluid Dynamics

t e c h n i s c h e u n i v e r s i t ä t b r a u n s c h w e i g Carl-Friedrich Gauß Faculty GPGPU accelerated Computational Fluid Dynamics 5th GACM Colloquium on Computational Mechanics Hamburg Institute

### Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age

Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age Xuan Shi GRA: Bowei Xue University of Arkansas Spatiotemporal Modeling of Human Dynamics

### High Performance Computing in CST STUDIO SUITE

High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver

### COMP/CS 605: Intro to Parallel Computing Lecture 01: Parallel Computing Overview (Part 1)

COMP/CS 605: Intro to Parallel Computing Lecture 01: Parallel Computing Overview (Part 1) Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State University

### Large-Scale Reservoir Simulation and Big Data Visualization

Large-Scale Reservoir Simulation and Big Data Visualization Dr. Zhangxing John Chen NSERC/Alberta Innovates Energy Environment Solutions/Foundation CMG Chair Alberta Innovates Technology Future (icore)

### Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis

### Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,

### Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

### YALES2 porting on the Xeon- Phi Early results

YALES2 porting on the Xeon- Phi Early results Othman Bouizi Ghislain Lartigue Innovation and Pathfinding Architecture Group in Europe, Exascale Lab. Paris CRIHAN - Demi-journée calcul intensif, 16 juin

### Building a Top500-class Supercomputing Cluster at LNS-BUAP

Building a Top500-class Supercomputing Cluster at LNS-BUAP Dr. José Luis Ricardo Chávez Dr. Humberto Salazar Ibargüen Dr. Enrique Varela Carlos Laboratorio Nacional de Supercómputo Benemérita Universidad

### Turbomachinery CFD on many-core platforms experiences and strategies

Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29

### LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:

### CFD Implementation with In-Socket FPGA Accelerators

CFD Implementation with In-Socket FPGA Accelerators Ivan Gonzalez UAM Team at DOVRES FuSim-E Programme Symposium: CFD on Future Architectures C 2 A 2 S 2 E DLR Braunschweig 14 th -15 th October 2009 Outline

### GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once

### ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING

ACCELERATING COMMERCIAL LINEAR DYNAMIC AND Vladimir Belsky Director of Solver Development* Luis Crivelli Director of Solver Development* Matt Dunbar Chief Architect* Mikhail Belyi Development Group Manager*

### Experiments in Unstructured Mesh Finite Element CFD Using CUDA

Experiments in Unstructured Mesh Finite Element CFD Using CUDA Graham Markall Software Performance Imperial College London http://www.doc.ic.ac.uk/~grm08 grm08@doc.ic.ac.uk Joint work with David Ham and

### HPC enabling of OpenFOAM R for CFD applications

HPC enabling of OpenFOAM R for CFD applications Towards the exascale: OpenFOAM perspective Ivan Spisso 25-27 March 2015, Casalecchio di Reno, BOLOGNA. SuperComputing Applications and Innovation Department,

### Parallel Computing. Introduction

Parallel Computing Introduction Thorsten Grahs, 14. April 2014 Administration Lecturer Dr. Thorsten Grahs (that s me) t.grahs@tu-bs.de Institute of Scientific Computing Room RZ 120 Lecture Monday 11:30-13:00

### A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 g_suhakaran@vssc.gov.in THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833

### GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy

### OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,

### HP ProLiant SL270s Gen8 Server. Evaluation Report

HP ProLiant SL270s Gen8 Server Evaluation Report Thomas Schoenemeyer, Hussein Harake and Daniel Peter Swiss National Supercomputing Centre (CSCS), Lugano Institute of Geophysics, ETH Zürich schoenemeyer@cscs.ch

### HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance

### Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools

### Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture

White Paper Intel Xeon processor E5 v3 family Intel Xeon Phi coprocessor family Digital Design and Engineering Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture Executive

### Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and

### QCD as a Video Game?

QCD as a Video Game? Sándor D. Katz Eötvös University Budapest in collaboration with Győző Egri, Zoltán Fodor, Christian Hoelbling Dániel Nógrádi, Kálmán Szabó Outline 1. Introduction 2. GPU architecture

### Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers

Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers Haohuan Fu haohuan@tsinghua.edu.cn High Performance Geo-Computing (HPGC) Group Center for Earth System Science Tsinghua University

### Trends in High-Performance Computing for Power Grid Applications

Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views

### Modeling Rotor Wakes with a Hybrid OVERFLOW-Vortex Method on a GPU Cluster

Modeling Rotor Wakes with a Hybrid OVERFLOW-Vortex Method on a GPU Cluster Mark J. Stock, Ph.D., Adrin Gharakhani, Sc.D. Applied Scientific Research, Santa Ana, CA Christopher P. Stone, Ph.D. Computational

### GPU Acceleration of the SENSEI CFD Code Suite

GPU Acceleration of the SENSEI CFD Code Suite Chris Roy, Brent Pickering, Chip Jackson, Joe Derlaga, Xiao Xu Aerospace and Ocean Engineering Primary Collaborators: Tom Scogland, Wu Feng (Computer Science)

### Scalable Distributed Schur Complement Solvers for Internal and External Flow Computations on Many-Core Architectures

Scalable Distributed Schur Complement Solvers for Internal and External Flow Computations on Many-Core Architectures Dr.-Ing. Achim Basermann, Dr. Hans-Peter Kersken, Melven Zöllner** German Aerospace

### Accelerating CST MWS Performance with GPU and MPI Computing. CST workshop series

Accelerating CST MWS Performance with GPU and MPI Computing www.cst.com CST workshop series 2010 1 Hardware Based Acceleration Techniques - Overview - Multithreading GPU Computing Distributed Computing

### LINEAR SYSTEMS. Consider the following example of a linear system:

LINEAR SYSTEMS Consider the following example of a linear system: Its unique solution is x +2x 2 +3x 3 = 5 x + x 3 = 3 3x + x 2 +3x 3 = 3 x =, x 2 =0, x 3 = 2 In general we want to solve n equations in

### Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas

### Autotuning dense linear algebra libraries on GPUs and overview of the MAGMA library

Autotuning dense linear algebra libraries on GPUs and overview of the MAGMA library Rajib Nath, Stan Tomov, Jack Dongarra Innovative Computing Laboratory University of Tennessee, Knoxville Speaker: Emmanuel

### Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?

### The L-CSC cluster: Optimizing power efficiency to become the greenest supercomputer in the world in the Green500 list of November 2014

The L-CSC cluster: Optimizing power efficiency to become the greenest supercomputer in the world in the Green500 list of November 2014 David Rohr 1, Gvozden Nešković 1, Volker Lindenstruth 1,2 DOI: 10.14529/jsfi150304

### GPU Renderfarm with Integrated Asset Management & Production System (AMPS)

GPU Renderfarm with Integrated Asset Management & Production System (AMPS) Tackling two main challenges in CG movie production Presenter: Dr. Chen Quan Multi-plAtform Game Innovation Centre (MAGIC), Nanyang

### Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

### AN INTERFACE STRIP PRECONDITIONER FOR DOMAIN DECOMPOSITION METHODS

AN INTERFACE STRIP PRECONDITIONER FOR DOMAIN DECOMPOSITION METHODS by M. Storti, L. Dalcín, R. Paz Centro Internacional de Métodos Numéricos en Ingeniería - CIMEC INTEC, (CONICET-UNL), Santa Fe, Argentina

### High Performance Matrix Inversion with Several GPUs

High Performance Matrix Inversion on a Multi-core Platform with Several GPUs Pablo Ezzatti 1, Enrique S. Quintana-Ortí 2 and Alfredo Remón 2 1 Centro de Cálculo-Instituto de Computación, Univ. de la República

### Parallelism and Cloud Computing

Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication

### and RISC Optimization Techniques for the Hitachi SR8000 Architecture

1 KONWIHR Project: Centre of Excellence for High Performance Computing Pseudo-Vectorization and RISC Optimization Techniques for the Hitachi SR8000 Architecture F. Deserno, G. Hager, F. Brechtefeld, G.

### A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

Acta Technica Jaurinensis Vol. 3. No. 1. 010 A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906

### Efficient Parallel Graph Exploration on Multi-Core CPU and GPU

Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Pervasive Parallelism Laboratory Stanford University Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun Graph and its Applications Graph Fundamental

### SOLVING LINEAR SYSTEMS

SOLVING LINEAR SYSTEMS Linear systems Ax = b occur widely in applied mathematics They occur as direct formulations of real world problems; but more often, they occur as a part of the numerical analysis

### Home Exam 3: Distributed Video Encoding using Dolphin PCI Express Networks. October 20 th 2015

INF5063: Programming heterogeneous multi-core processors because the OS-course is just to easy! Home Exam 3: Distributed Video Encoding using Dolphin PCI Express Networks October 20 th 2015 Håkon Kvale

### Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures

Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures E Calore, S F Schifano, R Tripiccione Enrico Calore INFN Ferrara, Italy Perspectives of GPU Computing in Physics

### TWO-DIMENSIONAL FINITE ELEMENT ANALYSIS OF FORCED CONVECTION FLOW AND HEAT TRANSFER IN A LAMINAR CHANNEL FLOW

TWO-DIMENSIONAL FINITE ELEMENT ANALYSIS OF FORCED CONVECTION FLOW AND HEAT TRANSFER IN A LAMINAR CHANNEL FLOW Rajesh Khatri 1, 1 M.Tech Scholar, Department of Mechanical Engineering, S.A.T.I., vidisha

### walberla: A software framework for CFD applications on 300.000 Compute Cores

walberla: A software framework for CFD applications on 300.000 Compute Cores J. Götz (LSS Erlangen, jan.goetz@cs.fau.de), K. Iglberger, S. Donath, C. Feichtinger, U. Rüde Lehrstuhl für Informatik 10 (Systemsimulation)

### Analysis and Optimization of Power Consumption in the Iterative Solution of Sparse Linear Systems on Multi-core and Many-core Platforms

Analysis and Optimization of Power Consumption in the Iterative Solution of Sparse Linear Systems on Multi-core and Many-core Platforms H. Anzt, M. Castillo, J. I. Aliaga, J. C. Ferna ndez, V. Heuveline,

### Experiences on using GPU accelerators for data analysis in ROOT/RooFit

Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,

### Performance Tuning of a CFD Code on the Earth Simulator

Applications on HPC Special Issue on High Performance Computing Performance Tuning of a CFD Code on the Earth Simulator By Ken ichi ITAKURA,* Atsuya UNO,* Mitsuo YOKOKAWA, Minoru SAITO, Takashi ISHIHARA

### ultra fast SOM using CUDA

ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A

### Auto-Tuning TRSM with an Asynchronous Task Assignment Model on Multicore, GPU and Coprocessor Systems

Auto-Tuning TRSM with an Asynchronous Task Assignment Model on Multicore, GPU and Coprocessor Systems Murilo Boratto Núcleo de Arquitetura de Computadores e Sistemas Operacionais, Universidade do Estado

### MAGMA: Matrix Algebra on GPU and Multicore Architectures

MAGMA: Matrix Algebra on GPU and Multicore Architectures Presented by Scott Wells Assistant Director Innovative Computing Laboratory (ICL) College of Engineering University of Tennessee, Knoxville Overview

### Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors

Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors Joe Davis, Sandeep Patel, and Michela Taufer University of Delaware Outline Introduction Introduction to GPU programming Why MD

### A Load Balancing Tool for Structured Multi-Block Grid CFD Applications

A Load Balancing Tool for Structured Multi-Block Grid CFD Applications K. P. Apponsah and D. W. Zingg University of Toronto Institute for Aerospace Studies (UTIAS), Toronto, ON, M3H 5T6, Canada Email:

### Intel Xeon Processor E5-2600

Intel Xeon Processor E5-2600 Best combination of performance, power efficiency, and cost. Platform Microarchitecture Processor Socket Chipset Intel Xeon E5 Series Processors and the Intel C600 Chipset

ST810 Advanced Computing Lecture 17: Parallel computing part I Eric B. Laber Hua Zhou Department of Statistics North Carolina State University Mar 13, 2013 Outline computing Hardware computing overview

### Performance Portability Study of Linear Algebra Kernels in OpenCL

Performance Portability Study of Linear Algebra Kernels in OpenCL Karl Rupp 1,2, Philippe Tillet 1, Florian Rudolf 1, Josef Weinbub 1, Ansgar Jüngel 2, Tibor Grasser 1 rupp@iue.tuwien.ac.at @karlrupp 1

### Overview of HPC systems and software available within

Overview of HPC systems and software available within Overview Available HPC Systems Ba Cy-Tera Available Visualization Facilities Software Environments HPC System at Bibliotheca Alexandrina SUN cluster

### Icepak High-Performance Computing at Rockwell Automation: Benefits and Benchmarks

Icepak High-Performance Computing at Rockwell Automation: Benefits and Benchmarks Garron K. Morris Senior Project Thermal Engineer gkmorris@ra.rockwell.com Standard Drives Division Bruce W. Weiss Principal

### Jean-Pierre Panziera Teratec 2011

Technologies for the future HPC systems Jean-Pierre Panziera Teratec 2011 3 petaflop systems : TERA 100, CURIE & IFERC Tera100 Curie IFERC 1.25 PetaFlops 256 TB ory 30 PB disk storage 140 000+ Xeon cores

### GPUs for Scientific Computing

GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

### Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,

### Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

### Modeling of inflatable dams partially filled with fluid and gas considering large deformations and stability

Institute of Mechanics Modeling of inflatable dams partially filled with fluid and gas considering large deformations and stability October 2009 Anne Maurer Marc Hassler Karl Schweizerhof Institute of

### 1 Bull, 2011 Bull Extreme Computing

1 Bull, 2011 Bull Extreme Computing Table of Contents HPC Overview. Cluster Overview. FLOPS. 2 Bull, 2011 Bull Extreme Computing HPC Overview Ares, Gerardo, HPC Team HPC concepts HPC: High Performance

### Graphic Processing Units: a possible answer to High Performance Computing?

4th ABINIT Developer Workshop RESIDENCE L ESCANDILLE AUTRANS HPC & Graphic Processing Units: a possible answer to High Performance Computing? Luigi Genovese ESRF - Grenoble 26 March 2009 http://inac.cea.fr/l_sim/

### ~ Greetings from WSU CAPPLab ~

~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)

### Evaluation of CUDA Fortran for the CFD code Strukti

Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center

### GPGPU acceleration in OpenFOAM

Carl-Friedrich Gauß Faculty GPGPU acceleration in OpenFOAM Northern germany OpenFoam User meeting Braunschweig Institute of Technology Thorsten Grahs Institute of Scientific Computing/move-csc 2nd October

### High Performance Computing: A Review of Parallel Computing with ANSYS solutions. Efficient and Smart Solutions for Large Models

High Performance Computing: A Review of Parallel Computing with ANSYS solutions Efficient and Smart Solutions for Large Models 1 Use ANSYS HPC solutions to perform efficient design variations of large

### OpenFOAM Optimization Tools

OpenFOAM Optimization Tools Henrik Rusche and Aleks Jemcov h.rusche@wikki-gmbh.de and a.jemcov@wikki.co.uk Wikki, Germany and United Kingdom OpenFOAM Optimization Tools p. 1 Agenda Objective Review optimisation

### CUDA programming on NVIDIA GPUs

p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view

### PCIe Over Cable Provides Greater Performance for Less Cost for High Performance Computing (HPC) Clusters. from One Stop Systems (OSS)

PCIe Over Cable Provides Greater Performance for Less Cost for High Performance Computing (HPC) Clusters from One Stop Systems (OSS) PCIe Over Cable PCIe provides greater performance 8 7 6 5 GBytes/s 4

### GeoImaging Accelerator Pansharp Test Results

GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance

### Retargeting PLAPACK to Clusters with Hardware Accelerators

Retargeting PLAPACK to Clusters with Hardware Accelerators Manuel Fogué 1 Francisco Igual 1 Enrique S. Quintana-Ortí 1 Robert van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores.

### Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

### Simulation Platform Overview

Simulation Platform Overview Build, compute, and analyze simulations on demand www.rescale.com CASE STUDIES Companies in the aerospace and automotive industries use Rescale to run faster simulations Aerospace

### P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE

1 P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE JEAN-MARC GRATIEN, JEAN-FRANÇOIS MAGRAS, PHILIPPE QUANDALLE, OLIVIER RICOIS 1&4, av. Bois-Préau. 92852 Rueil Malmaison Cedex. France

### Parallel Programming Survey

Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

### Commercial CFD Software Modelling

Commercial CFD Software Modelling Dr. Nor Azwadi bin Che Sidik Faculty of Mechanical Engineering Universiti Teknologi Malaysia INSPIRING CREATIVE AND INNOVATIVE MINDS 1 CFD Modeling CFD modeling can be

### Gas Handling and Power Consumption of High Solidity Hydrofoils:

Gas Handling and Power Consumption of High Solidity Hydrofoils: Philadelphia Mixing Solution's HS Lightnin's A315 Keith E. Johnson1, Keith T McDermott2, Thomas A. Post3 1Independent Consultant, North Canton,

### Model Order Reduction for Linear Convective Thermal Flow

Model Order Reduction for Linear Convective Thermal Flow Christian Moosmann, Evgenii B. Rudnyi, Andreas Greiner, Jan G. Korvink IMTEK, April 24 Abstract Simulation of the heat exchange between a solid

### GPU Parallel Computing Architecture and CUDA Programming Model

GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel

### Multi Grid for Multi Core

Multi Grid for Multi Core Harald Köstler, Daniel Ritter, Markus Stürmer and U. Rüde (LSS Erlangen, ruede@cs.fau.de) in collaboration with many more Lehrstuhl für Informatik 10 (Systemsimulation) Universität

### NVIDIA GPUs in the Cloud

NVIDIA GPUs in the Cloud 4 EVOLVING CLOUD REQUIREMENTS On premises Off premises Hybrid Cloud Connecting clouds New workloads Components to disrupt 5 GLOBAL CLOUD PLATFORM Unified architecture enabled by

### GPU Programming in Computer Vision

Computer Vision Group Prof. Daniel Cremers GPU Programming in Computer Vision Preliminary Meeting Thomas Möllenhoff, Robert Maier, Caner Hazirbas What you will learn in the practical course Introduction

### A Comparative Study of Conforming and Nonconforming High-Resolution Finite Element Schemes

A Comparative Study of Conforming and Nonconforming High-Resolution Finite Element Schemes Matthias Möller Institute of Applied Mathematics (LS3) TU Dortmund, Germany European Seminar on Computing Pilsen,

### NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get

### Intelligent Heuristic Construction with Active Learning

Intelligent Heuristic Construction with Active Learning William F. Ogilvie, Pavlos Petoumenos, Zheng Wang, Hugh Leather E H U N I V E R S I T Y T O H F G R E D I N B U Space is BIG! Hubble Ultra-Deep Field