Mixed Precision Iterative Refinement Methods: Energy Efficiency on Hybrid Hardware Platforms




Mixed Precision Iterative Refinement Methods: Energy Efficiency on Hybrid Hardware Platforms
Björn Rocker
Hamburg, June 17th 2010
Engineering Mathematics and Computing Lab (EMCL)
KIT - University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
www.kit.edu

Outline: Motivation | Hybrid Mixed Precision Solvers | Iterative Refinement Method | Numerical Experiments | Conclusions and Future Work

Motivation: Linear Systems in Computational Fluid Dynamics
- Fluid flow problems can be modeled by partial differential equations.
- Often Finite Element Methods are used to find solutions.
- Linearization methods lead to a linear system Ax = b.
- Solving the linear system is typically the most time-consuming step in the simulation process.
(Image: www.united-airways.eu)

Acceleration Scenarios

The computational effort depends on:
- the characteristics of the linear problem,
- the characteristics of the applied solver,
- the floating point format used.

Acceleration concepts:
- use of different precision formats,
- use of coprocessor technology,
- parallelization of the linear solvers.

Fundamental idea of mixed precision solvers: acceleration without loss of accuracy
- High precision only in the relevant parts
- Low precision for most of the algorithm
- Outsource the low precision computations to a parallel low precision coprocessor
- Use different precision formats within the Iterative Refinement Method

Iterative Refinement Method

Newton's Method: x_{i+1} = x_i - (∇f(x_i))^{-1} f(x_i)

Idea: apply Newton's Method to the function f(x) = b - Ax:

x_{i+1} = x_i - (∇f(x_i))^{-1} f(x_i) = x_i - (-A^{-1})(b - Ax_i) = x_i + A^{-1}(b - Ax_i) = x_i + A^{-1} r_i =: x_i + c_i

1: initial guess as starting vector: x_0
2: compute initial residual: r_0 = b - A x_0
3: while ||A x_i - b||_2 > ε ||r_0||_2 do
4:   r_i = b - A x_i
5:   solve: A c_i = r_i
6:   update solution: x_{i+1} = x_i + c_i
7: end while
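The loop in steps 1-7 maps directly onto code. Below is a minimal NumPy sketch (not the code from the talk; the function name iterative_refinement, the tolerance handling and the test problem are illustrative) that takes the inner solve as a parameter and mirrors the listing above.

```python
import numpy as np

def iterative_refinement(A, b, inner_solve, eps=1e-10, max_iter=50):
    """Generic iterative refinement loop (steps 1-7 of the listing above).

    inner_solve(A, r) returns an (approximate) correction c with A c ~ r.
    """
    x = np.zeros_like(b)                        # 1: initial guess x_0
    r0_norm = np.linalg.norm(b - A @ x)         # 2: initial residual r_0 = b - A x_0
    for _ in range(max_iter):
        r = b - A @ x                           # 4: r_i = b - A x_i
        if np.linalg.norm(r) <= eps * r0_norm:  # 3: stop when ||A x_i - b||_2 <= eps ||r_0||_2
            break
        c = inner_solve(A, r)                   # 5: solve A c_i = r_i
        x = x + c                               # 6: x_{i+1} = x_i + c_i
    return x

if __name__ == "__main__":
    # With an exact inner solve the loop converges after a single correction.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 100)) + 100.0 * np.eye(100)
    b = rng.standard_normal(100)
    x = iterative_refinement(A, b, lambda A, r: np.linalg.solve(A, r))
    print(np.linalg.norm(A @ x - b))            # ~0 up to rounding
```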


Iterative Refinement Method - In-Exact Newton Method
- Apply an iterative method to solve A c_i = r_i
- Residual stopping criterion with ε_inner >> ε
- E.g. a Krylov subspace solver

Mixed Precision Iterative Refinement
The iterative refinement loop above (steps 1-7) is used with the error correction equation A c_i = r_i solved in low precision, while the residuals and the solution updates are computed in high precision.
- Any error correction solver can be used, e.g. Krylov subspace methods (CG, GMRES, ...).
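A minimal sketch of this mixed precision scheme, assuming SciPy (>= 1.12, where the GMRES tolerance keyword is rtol): the inner GMRES runs on a single precision copy of A with the relaxed tolerance ε_inner, while the residual and the update stay in double precision. The function name and parameters are illustrative; this is not the MKL/CUDA implementation used in the experiments.

```python
import numpy as np
import scipy.sparse.linalg as spla

def mixed_precision_refinement(A, b, eps=1e-10, eps_inner=1e-3,
                               restart=10, max_outer=100):
    """Mixed precision iterative refinement with GMRES(restart) as inner solver.

    The error correction equation A c = r is solved in single precision with a
    relaxed tolerance eps_inner >> eps; residuals and updates stay in double.
    Requires SciPy >= 1.12 for the 'rtol' keyword of gmres.
    """
    A_lo = A.astype(np.float32)                 # low precision copy for the inner solver
    x = np.zeros(A.shape[0], dtype=np.float64)
    r0_norm = np.linalg.norm(b - A @ x)
    for _ in range(max_outer):
        r = b - A @ x                           # high precision residual
        if np.linalg.norm(r) <= eps * r0_norm:
            break
        c, _ = spla.gmres(A_lo, r.astype(np.float32),   # inner solve in single precision
                          rtol=eps_inner, restart=restart)
        x = x + c.astype(np.float64)            # high precision update
    return x
```

With restart=10 this corresponds to the GMRES-(10) error correction solver used in the timing tables below.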

Hardware Platform

                                      HC3                Tesla               IC1
Processor type                        Xeon 5540          Xeon 5450           Xeon 5355
Accelerator type                      -                  Tesla T10           -
Processors per node                   2                  2 CPUs / 1 GPU      2
Cores per processor                   4                  8 / 240             4
Theoretical comp. rate / core         10.1 GFlop/s       12 / 3.9 GFlop/s    10.7 GFlop/s
Theoretical comp. rate / node         81 GFlop/s         96 / 933 GFlop/s    85.3 GFlop/s
L2 cache per processor                8 MB               8 / - MB            8 MB
Nodes                                 278 / 32 / 12      -                   200
Memory per node                       24 / 48 / 144 GB   32 GB               16 GB
Memory, full machine                  10.3 TB            -                   32 TB
Theoretical comp. rate, full machine  27 TFlop/s         1.0 TFlop/s         17.6 TFlop/s
Power consumption under load          80.8 kW            539 / 187.8 W (a)   103 kW

(a) http://www.nvidia.com

Numerical Experiments

Implementation issues:
- Iterative Refinement
- double precision GMRES
- mixed precision GMRES
- MKL-based on the CPU
- CUDA-based on the GPU

Test matrices:
- fluid flow problem: fluid flow in a Venturi nozzle
- multiple dimensions
- different condition numbers
- different sparsity

(Image: Kansas City Star, 1936)

Test Case CFD
Simulation of a Newtonian fluid flowing through a Venturi nozzle. The fluid can be described by the (incompressible) Navier-Stokes equations:

ρ Du/Dt = -∇p + µ Δu + f,    ∇ · u = 0,    (1)

where u is the fluid velocity field, p the pressure field, ρ the constant fluid density, µ its molecular viscosity constant, and f combines the external forces acting on the fluid. The operator D/Dt denotes the non-linear material derivative.

Test Cases CFD

                 CFD1               CFD2               CFD3
problem          2D fluid flow      2D fluid flow      2D fluid flow
dimension        n = 395009         n = 634453         n = 1019967
sparsity         nnz = 3544321      nnz = 5700633      nnz = 9182401
storage format   CRS                CRS                CRS

Table: Sparsity plots and properties of the CFD test matrices.
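CRS (compressed row storage; CSR in SciPy's naming) keeps only the nnz non-zero values plus column indices and row pointers, which is what makes matrices of this size tractable. A tiny illustration of the format, assuming SciPy and unrelated to the actual test matrices:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny matrix stored in CRS/CSR format: three arrays suffice.
A = csr_matrix(np.array([[4.0, 0.0, 1.0],
                         [0.0, 3.0, 0.0],
                         [2.0, 0.0, 5.0]]))
print(A.data)     # non-zero values:         [4. 1. 3. 2. 5.]
print(A.indices)  # column index per value:  [0 2 1 0 2]
print(A.indptr)   # row start offsets:       [0 2 3 5]
print(A.nnz)      # number of non-zeros, cf. the nnz entries in the table above
```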

Computation Time CFD 1

Table: Computation time in s for problem CFD 1, based on a GMRES-(10) as inner solver for the error correction method.

              1 core     4 cores    8 cores    GPU
HC3 double    2267.47    1245.12    776.09
HC3 mixed     886.46     567.51     309.61
IC1 double    3146.61    1656.53    1627.77
IC1 mixed     1378.56    712.83     659.80
Tesla mixed                                    438.13

Computation Time CFD 2

Table: Computation time in s for problem CFD 2, based on a GMRES-(10) as inner solver for the error correction method.

              1 core     4 cores    8 cores    GPU
HC3 double    10765.30   4528.09    3363.44
HC3 mixed     4827.98    2177.19    1648.27
IC1 double    13204.70   6843.66    6673.07
IC1 mixed     5924.32    3495.09    3681.28
Tesla mixed                                    2092.84

Computation Time CFD 3

Table: Computation time in s for problem CFD 3, based on a GMRES-(10) as inner solver for the error correction method.

              1 core     4 cores    8 cores    GPU
HC3 double    62210.70   19954.50   16541.90
HC3 mixed     42919.80   9860.26    8828.28
IC1 double    60214.50   32875.10   32576.50
IC1 mixed     41927.40   19317.00   19836.80
Tesla mixed                                    10316.70

Energy Efficiency

Assumption: E = P · t

        Power consumption (W)   Remarks
IC1     514                     no variation in consumption
HC3     244                     225-244 W depending on load; CPU frequency "ondemand", SMT off
Tesla   726                     node: 539 W, GPU: 187 W

*: measurements    **: manufacturer information
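Under the assumption E = P · t, the energy figures on the following slides follow directly from the timings. A quick check (plain Python, not part of the talk; the slight mismatch comes from the load-dependent HC3 power draw):

```python
def energy_wh(power_watts, time_seconds):
    """E = P * t, converted from joules (W*s) to watt-hours."""
    return power_watts * time_seconds / 3600.0

# HC3, 8 cores, CFD 1 in double precision: 776.09 s at 244 W
print(round(energy_wh(244.0, 776.09), 2))  # ~52.6 Wh, close to the tabulated 52.49 Wh
                                           # (the HC3 draws 225-244 W depending on load)
```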

Energy Consumption CFD 1

Table: Energy consumption in Wh for problem CFD 1, based on a GMRES-(10) as inner solver for the error correction method.

              1 core     4 cores    8 cores    GPU
HC3 double    153.37     84.22      52.49
HC3 mixed     59.96      38.39      20.94
IC1 double    449.27     236.52     232.41
IC1 mixed     196.83     101.78     94.20
Tesla mixed                                    87.36

Figure: Energy consumption as a function of time for solving the CFD1 test case on HC3, IC1 and Tesla. The inner solver is a GMRES-(10).

Energy Consumption CFD 2

Table: Energy consumption in Wh for problem CFD 2, based on a GMRES-(10) as inner solver for the error correction method.

              1 core     4 cores    8 cores    GPU
HC3 double    728.15     306.27     227.50
HC3 mixed     326.56     147.26     111.49
IC1 double    1885.34    977.12     952.77
IC1 mixed     845.86     499.02     525.60
Tesla mixed                                    417.29

Figure: Energy consumption as a function of time for solving the CFD2 test case on HC3, IC1 and Tesla. The inner solver is a GMRES-(10).

Energy Consumption CFD 3

Table: Energy consumption in Wh for problem CFD 3, based on a GMRES-(10) as inner solver for the error correction method.

              1 core     4 cores    8 cores    GPU
HC3 double    4207.86    1349.70    1118.88
HC3 mixed     2903.05    666.94     597.14
IC1 double    8597.29    4693.83    4651.20
IC1 mixed     5986.30    2758.04    2832.25
Tesla mixed                                    2057.04

Figure: Energy consumption as a function of time for solving the CFD3 test case on HC3, IC1 and Tesla. The inner solver is a GMRES-(10).

Evaluation of GPU-based Linear Solvers

                     Tesla C2050         Tesla C1060         GTX 480             GTX 280
Chip                 T20                 T10                 GF100               GT200
Transistors          approx. 3 billion   approx. 1.4 billion approx. 3 billion   approx. 1.4 billion
Core frequency       1.15 GHz            1.3 GHz             1.4 GHz             1.3 GHz
Shaders (MADD)       448                 240                 480                 240
GFLOPs (single)      1030                933                 1345                933
GFLOPs (double)      515                 78                  168                 78
Memory               3 GB GDDR5          4 GB GDDR3          1.5 GB GDDR5        1 GB GDDR3
Memory frequency     1.5 GHz             0.8 GHz             1.8 GHz             1.1 GHz
Memory bandwidth     144 GB/s            102 GB/s            177 GB/s            141 GB/s
ECC memory           yes                 no                  no                  no
Power consumption    247 W               187 W               250 W               236 W
IEEE double/single   yes/yes             yes/partial         yes/yes             yes/partial

Table: Key system characteristics of the four GPUs used for the tests. Computation rates and memory bandwidths are theoretical peak values.

Performance

Problem  Solver   C2050      C1060      GTX 480    GTX 280    HC3        IC1
CFD 1    double   164.84     252.74     145.23     183.37     230.31     482.90
CFD 1    mixed    80.48      129.19     60.98      98.46      91.71      195.59
CFD 2    double   473.38     778.75     456.17     518.49     819.46     1626.00
CFD 2    mixed    273.99     510.38     256.43     301.41     401.57     896.94
CFD 3    double   993.63     1921.64    1145.08    1046.49    2493.33    4909.04
CFD 3    mixed    554.28     1555.36    669.57     697.12     1330.70    2990.09

Table: Computation time (s) for problems CFD 1, CFD 2 and CFD 3, based on a GMRES-(30) as inner solver. IC1 and HC3 results on 8 cores.

Rankings (problem CFD 2, GMRES-(30) as inner solver; IC1 and HC3 results on 8 cores)

Performance:        1. GTX 480 (256 s),   2. C2050 (273 s),   3. GTX 280 (301 s),   4. HC3 (401 s),   5. C1060 (510 s),   6. IC1 (896 s)
Energy consumption: 1. GTX 480 (17.7 Wh), 2. C2050 (18.7 Wh), 3. GTX 280 (19.7 Wh), 4. HC3 (27.2 Wh), 5. C1060 (26.5 Wh), 6. IC1 (127.93 Wh)

- Energy efficiency for the GPUs is computed without the energy consumption of the host.
- The host node for the GTX 480 has to consume less than 133 W to be more energy efficient than the HC3!
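The 133 W bound can be reproduced from the numbers above; a quick check (plain Python, values copied from the rankings):

```python
# Numbers from the CFD 2 rankings above (GMRES-(30), mixed precision).
gpu_energy_wh = 17.7     # GTX 480, GPU-only energy
hc3_energy_wh = 27.2     # HC3 reference on 8 cores
gtx480_time_s = 256.43   # GTX 480 computation time

# Largest average host power for which host + GTX 480 still beat the HC3 in energy:
host_budget_w = (hc3_energy_wh - gpu_energy_wh) * 3600.0 / gtx480_time_s
print(round(host_budget_w, 1))  # ~133.4 W, i.e. the "less than 133 W" bound above
```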

Conclusion
- Iterative refinement shows high potential for computationally expensive problems (e.g. our CFD problems).
- Iterative refinement is able to exploit the excellent low precision performance of accelerator technologies (e.g. GPUs).
- Energy efficiency and performance can be improved by accelerating older compute nodes with GPUs.

Future Work
- Numerical analysis to optimize the choice of the floating point format
- Evaluate more hardware constellations
- Define more test cases and optimize the solvers
- More accurate power consumption measurements
- FPGA technology offers a free choice of floating point formats