Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms
|
|
|
- Mervin Ray
- 10 years ago
- Views:
Transcription
1 Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms Björn Rocker Hamburg, June 17th 2010 Engineering Mathematics and Computing Lab (EMCL) KIT University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
2 Motivation Hybrid Mixed Precision Solvers Iterative Refinement Method Numerical Experiments Conclusions and Future Work Computational Fluid Dynamics Linear Systems in Computational Fluid Dynamics Fluid flow problems can be modeled by partial differential equations Often Finite Element Methods are used to find solutions Linearization methods lead to linear system Ax = b Solving the linear system is typically the most time-consuming step in the simulation process. 2/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
3 Motivation Hybrid Mixed Precision Solvers Iterative Refinement Method Numerical Experiments Conclusions and Future Work Acceleration Scenarios Computational Effort Acceleration Concepts Characteristics of the linear problem Use of different precision formats Characteristics of the applied solver Use of coprocessor technology Used floating point format Parallelization of the linear solvers Fundamental Idea of Mixed Precision Solvers: Acceleration without loss of accuracy High precision only in relevant parts Low precision for most of the algorithm Outsource low precision computations on parallel low precision coprocessor Use different precision formats within the Iterative Refinement Method 3/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
4 Iterative Refinement Method Newton s Method: x i+1 = x i ( f (x i )) 1 f (x i ) Idea: Apply Newton s Method to the function f (x) = b Ax x i+1 = x i ( f (x i )) 1 f (x i ) = x i ( A 1 )(b Ax i ) = x i + A 1 (b Ax i ) = x i + A 1 r i {z } =:c i 1: initial guess as starting vector: x 0 2: compute initial residual: r 0 = b Ax 0 3: while ( Ax i b 2 > ε r 0 ) do 4: r i = b Ax i 5: solve: Ac i = r i 6: update solution: x i+1 = x i + c i 7: end while 4/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
5 Iterative Refinement Method Newton s Method: x i+1 = x i ( f (x i )) 1 f (x i ) Idea: Apply Newton s Method to the function f (x) = b Ax x i+1 = x i ( f (x i )) 1 f (x i ) = x i ( A 1 )(b Ax i ) = x i + A 1 (b Ax i ) = x i + A 1 r i {z } =:c i 1: initial guess as starting vector: x 0 2: compute initial residual: r 0 = b Ax 0 3: while ( Ax i b 2 > ε r 0 ) do 4: r i = b Ax i 5: solve: Ac i = r i 6: update solution: x i+1 = x i + c i 7: end while 5/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
6 Iterative Refinement Method Newton s Method: x i+1 = x i ( f (x i )) 1 f (x i ) Idea: Apply Newton s Method to the function f (x) = b Ax x i+1 = x i ( f (x i )) 1 f (x i ) = x i ( A 1 )(b Ax i ) = x i + A 1 (b Ax i ) = x i + A 1 r i {z } =:c i In-Exact Newton Method Apply iterative method to solve Ac i = r i Residual stopping criterion ε inner >> ε E.g. Krylov Subspace Solver 6/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
7 Mixed Precision Iterative Refinement 1: initial guess as starting vector: x 0 2: compute initial residual: r 0 = b Ax 0 3: while ( Ax i b 2 > ε r 0 ) do 4: r i = b Ax i 5: solve: Ac i = r i 6: update solution: x i+1 = x i + c i 7: end while Any error correction solver can be used i.e. Krylov Subspace Methods (CG, GMRES...) 7/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
8 Hardware Platform HC3 Tesla IC1 Processor type Xeon 5540 Xeon 5450 Xeon 5355 Accelerator type - Tesla T10 - Processors per node 2 2 CPUs / 1 GPU 2 Cores per processor 4 8 / Theoretical comp. rate / core 10.1 GFlop/s 12 / 3.9 GFlop/s 10.7 GFlop/s Theoretical comp. rate / node 81 GFlop/s 96 / 933 GFlop/s 85.3 GFlop/s L2-cache per processor 8 MB 8 / - MB 8 MB Nodes 278 / 32 / Memory per node 24 / 48 / 144 GB 32 GB 16 GB Memory full machine 10.3 TB - 32 TB Theoretical comp. rate full machine 27 TFlop/s 1.0 TFlop/s 17.6 TFlop/s Power consumption load 80.8 kw 539 / 187,8 W a 103 kw a 8/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
9 Numerical Experiments Implementation Issues Test Matrices Iterative Refinement Fluid Flow Problem Double precision GMRES Fluid Flow in Venturi Nozzle Mixed precision GMRES multiple dimensions MKL-Based on CPU different condition number CUDA-Based on GPU different sparsity * Kansas City Star, /26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
10 Test-Case CFD Simulation of a Newtonian Fluid through a Venturi Nozzle. Fluid can be described by the (incompressible) Navier-Stokes equations: ρ Du Dt = p + µ u + f, u = 0, (1) Where u is the fluid velocity field, p the pressure field, ρ the constant fluid density, µ its molecular viscosity constant and f combines external forces acting on the fluid. The operator D depicts the non-linear material derivative. 10/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
11 Test-Cases CFD CFD1 CFD2 CFD3 problem: 2D fluid flow dimension: n = sparsity: nnz = storage format: CRS problem: 2D fluid flow dimension: n = sparsity: nnz = storage format: CRS problem: 2D fluid flow dimension: n = sparsity: nnz = storage format: CRS Table: Sparsity plots and properties of the CFD test-matrices. 11/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
12 Computation time CFD 1 Table: Computation time in s for problem CFD 1 based on a GMRES-(10) as inner solver for the error correction method. Computation time [s] CPU-Cores GPU HC3 double HC3 mixed IC1 double IC1 mixed Tesla mixed /26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
13 Computation time CFD 2 Table: Computation time in s for problem CFD 2 based on a GMRES-(10) as inner solver for the error correction method. Computation time [s] CPU-Cores GPU HC3 double HC3 mixed IC1 double IC1 mixed Tesla mixed /26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
14 Computation time CFD 3 Table: Computation time in s for problem CFD 3 based on a GMRES-(10) as inner solver for the error correction method. Computation time [s] CPU-Cores GPU HC3 double HC3 mixed IC1 double IC1 mixed Tesla mixed /26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
15 Energy efficiency Assumptions: E = P t Power Consumption (W) Remarks IC1 514 no variation in consumption HC Wdependingonload CPU-frequency "ondemand", SMT off Tesla 726 node: 539 W, GPU : 187 W * : measurements ** : manufacturer information 15/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
16 Energy Consumption CFD 1 Table: Energy consumption in Wh for problem CFD 1 based on a GMRES-(10) as inner solver for the error correction method. Energy consumption in Wh CPU-Cores GPU HC3 double HC3 mixed IC1 double IC1 mixed Tesla mixed /26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
17 Energy Consumption CFD 1 Figure: Energy consumption as a function of time for solving the CFD1 test-case on HC3, IC1 and Tesla. The inner solver is a GMRES-(10). 17/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
18 Energy Consumption CFD 2 Table: Energy consumption in Wh for problem CFD 2 based on a GMRES-(10) as inner solver for the error correction method. Energy consumption in Wh CPU-Cores GPU HC3 double HC3 mixed IC1 double IC1 mixed Tesla mixed /26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
19 Energy Consumption CFD 2 Figure: Energy consumption as a function of time for solving the CFD2 test-case on HC3, IC1 and Tesla. The inner solver is a GMRES-(10). 19/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
20 Energy Consumption CFD 3 Table: Energy consumption in Wh for problem CFD 3 based on a GMRES-(10) as inner solver for the error correction method. Energy consumption in Wh CPU-Cores GPU HC3 double HC3 mixed IC1 double IC1 mixed Tesla mixed /26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
21 Energy Consumption CFD 3 Figure: Energy consumption as a function of time for solving the CFD3 test-case on HC3, IC1 and Tesla. The inner solver is a GMRES-(10). 21/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
22 Evaluation of GPU-based linear solvers Name Tesla C2050 Tesla C1060 GTX 480 GTX 280a Chip T20 T10 GF100 GT200 Transistors ca. 3 Mrd. ca. 1,4 Mrd. ca. 3 Mrd. ca. 1,4 Mrd. Core frequency 1.15 GHz 1.3 GHz 1.4 GHz 1.3 GHz Shaders (MADD) GFLOPs (single) GFLOPs (double) Memory 3 GB GDDR5 4 GB GDDR3 1.5 GB GDDR5 1 GB GDDR3 Memory Frequency 1.5 GHz 0.8 GHz 1.8 GHz 1.1 GHz Memory Bandwidth 144 GB/s 102 GB/s 177 GB/s 141 GB/s ECC Memory yes no no no Power Consumption 247 W 187 W 250 W 236 W IEEE double/single yes/yes yes/partial yes/yes yes/partial Table: Key system characteristics of the four GPUs used for the tests. Computation rate and memory bandwidth are peak respectively theoretical values. 22/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
23 Performance Experiment setup Computation Time (s) problem solver type C2050 C1060 GTX 480 GTX 280 HC3 IC1 CFD 1 CFD 2 CFD 3 double mixed double mixed double mixed Table: Computation time (s) for problem CFD 1, CFD 2 and CFD 3 based on a GMRES-(30). IC1 and HC3 results on 8 Cores 23/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
24 Rankings Performance Energy consumption 1 GTX 480 (256 s) 1 GTX 480 (17,7 Wh) 2 C2050 (273 s) 2 C2050 (18,7 Wh) 3 GTX 280 (301 s) 3 GTX 280 (19,7 Wh) 4 HC3 (401 s) 4 HC3 (27,2 Wh) 5 C1060 (510 s) 5 C1060 (26,5 Wh) 6 IC1 (896 s) 6 IC1 (127,93 Wh) Performance and energy ranking for problem CFD 2 based on a GMRES-(30). IC1 and HC3 results on 8 Cores. Energy efficiency for GPUs computed without energy consumption for the host. Node for the host for GTX 480 has to consume less then 133 W to be more energy efficient than the HC3! 24/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
25 Conclusion Conclusion Iterative refinement shows high potential in case of computationally expensive problems (e.g. our CFD-Problems) Iterative refinement is able to exploit the excellent low precision performance of accelerator technologies (e.g. GPU) Energy efficiency and performance can be improved by accelerate older computing nodes by GPUs 25/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
26 Future Work Future Work Numerical analysis to optimize choice of floating point format Evaluate more hardware constellations Define more testcases and optimize solvers More accurate power consumption measurements FPGA-Technology offers free choice of Floating Point Formats 26/26 Hamburg, June 17th 2010 Björn Rocker - Mixed Precision Iterative Refinement Methods
Accelerating CFD using OpenFOAM with GPUs
Accelerating CFD using OpenFOAM with GPUs Authors: Saeed Iqbal and Kevin Tubbs The OpenFOAM CFD Toolbox is a free, open source CFD software package produced by OpenCFD Ltd. Its user base represents a wide
AeroFluidX: A Next Generation GPU-Based CFD Solver for Engineering Applications
AeroFluidX: A Next Generation GPU-Based CFD Solver for Engineering Applications Dr. Bjoern Landmann Dr. Kerstin Wieczorek Stefan Bachschuster 18.03.2015 FluiDyna GmbH, Lichtenbergstr. 8, 85748 Garching
GPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
Introduction to GPGPU. Tiziano Diamanti [email protected]
[email protected] Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate
GPGPU accelerated Computational Fluid Dynamics
t e c h n i s c h e u n i v e r s i t ä t b r a u n s c h w e i g Carl-Friedrich Gauß Faculty GPGPU accelerated Computational Fluid Dynamics 5th GACM Colloquium on Computational Mechanics Hamburg Institute
Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age
Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age Xuan Shi GRA: Bowei Xue University of Arkansas Spatiotemporal Modeling of Human Dynamics
High Performance Computing in CST STUDIO SUITE
High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver
Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011
Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis
COMP/CS 605: Intro to Parallel Computing Lecture 01: Parallel Computing Overview (Part 1)
COMP/CS 605: Intro to Parallel Computing Lecture 01: Parallel Computing Overview (Part 1) Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State University
Large-Scale Reservoir Simulation and Big Data Visualization
Large-Scale Reservoir Simulation and Big Data Visualization Dr. Zhangxing John Chen NSERC/Alberta Innovates Energy Environment Solutions/Foundation CMG Chair Alberta Innovates Technology Future (icore)
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,
Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
YALES2 porting on the Xeon- Phi Early results
YALES2 porting on the Xeon- Phi Early results Othman Bouizi Ghislain Lartigue Innovation and Pathfinding Architecture Group in Europe, Exascale Lab. Paris CRIHAN - Demi-journée calcul intensif, 16 juin
Building a Top500-class Supercomputing Cluster at LNS-BUAP
Building a Top500-class Supercomputing Cluster at LNS-BUAP Dr. José Luis Ricardo Chávez Dr. Humberto Salazar Ibargüen Dr. Enrique Varela Carlos Laboratorio Nacional de Supercómputo Benemérita Universidad
Turbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:
CFD Implementation with In-Socket FPGA Accelerators
CFD Implementation with In-Socket FPGA Accelerators Ivan Gonzalez UAM Team at DOVRES FuSim-E Programme Symposium: CFD on Future Architectures C 2 A 2 S 2 E DLR Braunschweig 14 th -15 th October 2009 Outline
GPU Hardware and Programming Models. Jeremy Appleyard, September 2015
GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once
Parallel Computing. Introduction
Parallel Computing Introduction Thorsten Grahs, 14. April 2014 Administration Lecturer Dr. Thorsten Grahs (that s me) [email protected] Institute of Scientific Computing Room RZ 120 Lecture Monday 11:30-13:00
ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING
ACCELERATING COMMERCIAL LINEAR DYNAMIC AND Vladimir Belsky Director of Solver Development* Luis Crivelli Director of Solver Development* Matt Dunbar Chief Architect* Mikhail Belyi Development Group Manager*
HPC enabling of OpenFOAM R for CFD applications
HPC enabling of OpenFOAM R for CFD applications Towards the exascale: OpenFOAM perspective Ivan Spisso 25-27 March 2015, Casalecchio di Reno, BOLOGNA. SuperComputing Applications and Innovation Department,
A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS
A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 [email protected] THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,
GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy
HP ProLiant SL270s Gen8 Server. Evaluation Report
HP ProLiant SL270s Gen8 Server Evaluation Report Thomas Schoenemeyer, Hussein Harake and Daniel Peter Swiss National Supercomputing Centre (CSCS), Lugano Institute of Geophysics, ETH Zürich [email protected]
Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing
Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance
Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture
White Paper Intel Xeon processor E5 v3 family Intel Xeon Phi coprocessor family Digital Design and Engineering Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture Executive
Computer Graphics Hardware An Overview
Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and
QCD as a Video Game?
QCD as a Video Game? Sándor D. Katz Eötvös University Budapest in collaboration with Győző Egri, Zoltán Fodor, Christian Hoelbling Dániel Nógrádi, Kálmán Szabó Outline 1. Introduction 2. GPU architecture
Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers
Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers Haohuan Fu [email protected] High Performance Geo-Computing (HPGC) Group Center for Earth System Science Tsinghua University
Trends in High-Performance Computing for Power Grid Applications
Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views
Modeling Rotor Wakes with a Hybrid OVERFLOW-Vortex Method on a GPU Cluster
Modeling Rotor Wakes with a Hybrid OVERFLOW-Vortex Method on a GPU Cluster Mark J. Stock, Ph.D., Adrin Gharakhani, Sc.D. Applied Scientific Research, Santa Ana, CA Christopher P. Stone, Ph.D. Computational
GPU Acceleration of the SENSEI CFD Code Suite
GPU Acceleration of the SENSEI CFD Code Suite Chris Roy, Brent Pickering, Chip Jackson, Joe Derlaga, Xiao Xu Aerospace and Ocean Engineering Primary Collaborators: Tom Scogland, Wu Feng (Computer Science)
Scalable Distributed Schur Complement Solvers for Internal and External Flow Computations on Many-Core Architectures
Scalable Distributed Schur Complement Solvers for Internal and External Flow Computations on Many-Core Architectures Dr.-Ing. Achim Basermann, Dr. Hans-Peter Kersken, Melven Zöllner** German Aerospace
Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software
GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas
Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1
Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?
GPU Renderfarm with Integrated Asset Management & Production System (AMPS)
GPU Renderfarm with Integrated Asset Management & Production System (AMPS) Tackling two main challenges in CG movie production Presenter: Dr. Chen Quan Multi-plAtform Game Innovation Centre (MAGIC), Nanyang
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
AN INTERFACE STRIP PRECONDITIONER FOR DOMAIN DECOMPOSITION METHODS
AN INTERFACE STRIP PRECONDITIONER FOR DOMAIN DECOMPOSITION METHODS by M. Storti, L. Dalcín, R. Paz Centro Internacional de Métodos Numéricos en Ingeniería - CIMEC INTEC, (CONICET-UNL), Santa Fe, Argentina
Parallelism and Cloud Computing
Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication
A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster
Acta Technica Jaurinensis Vol. 3. No. 1. 010 A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906
High Performance Matrix Inversion with Several GPUs
High Performance Matrix Inversion on a Multi-core Platform with Several GPUs Pablo Ezzatti 1, Enrique S. Quintana-Ortí 2 and Alfredo Remón 2 1 Centro de Cálculo-Instituto de Computación, Univ. de la República
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Pervasive Parallelism Laboratory Stanford University Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun Graph and its Applications Graph Fundamental
Home Exam 3: Distributed Video Encoding using Dolphin PCI Express Networks. October 20 th 2015
INF5063: Programming heterogeneous multi-core processors because the OS-course is just to easy! Home Exam 3: Distributed Video Encoding using Dolphin PCI Express Networks October 20 th 2015 Håkon Kvale
Analysis and Optimization of Power Consumption in the Iterative Solution of Sparse Linear Systems on Multi-core and Many-core Platforms
Analysis and Optimization of Power Consumption in the Iterative Solution of Sparse Linear Systems on Multi-core and Many-core Platforms H. Anzt, M. Castillo, J. I. Aliaga, J. C. Ferna ndez, V. Heuveline,
and RISC Optimization Techniques for the Hitachi SR8000 Architecture
1 KONWIHR Project: Centre of Excellence for High Performance Computing Pseudo-Vectorization and RISC Optimization Techniques for the Hitachi SR8000 Architecture F. Deserno, G. Hager, F. Brechtefeld, G.
walberla: A software framework for CFD applications on 300.000 Compute Cores
walberla: A software framework for CFD applications on 300.000 Compute Cores J. Götz (LSS Erlangen, [email protected]), K. Iglberger, S. Donath, C. Feichtinger, U. Rüde Lehrstuhl für Informatik 10 (Systemsimulation)
SOLVING LINEAR SYSTEMS
SOLVING LINEAR SYSTEMS Linear systems Ax = b occur widely in applied mathematics They occur as direct formulations of real world problems; but more often, they occur as a part of the numerical analysis
TWO-DIMENSIONAL FINITE ELEMENT ANALYSIS OF FORCED CONVECTION FLOW AND HEAT TRANSFER IN A LAMINAR CHANNEL FLOW
TWO-DIMENSIONAL FINITE ELEMENT ANALYSIS OF FORCED CONVECTION FLOW AND HEAT TRANSFER IN A LAMINAR CHANNEL FLOW Rajesh Khatri 1, 1 M.Tech Scholar, Department of Mechanical Engineering, S.A.T.I., vidisha
Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures
Design and Optimization of a Portable Lattice Boltzmann Code for Heterogeneous Architectures E Calore, S F Schifano, R Tripiccione Enrico Calore INFN Ferrara, Italy Perspectives of GPU Computing in Physics
Experiences on using GPU accelerators for data analysis in ROOT/RooFit
Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,
The L-CSC cluster: Optimizing power efficiency to become the greenest supercomputer in the world in the Green500 list of November 2014
The L-CSC cluster: Optimizing power efficiency to become the greenest supercomputer in the world in the Green500 list of November 2014 David Rohr 1, Gvozden Nešković 1, Volker Lindenstruth 1,2 DOI: 10.14529/jsfi150304
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
Auto-Tuning TRSM with an Asynchronous Task Assignment Model on Multicore, GPU and Coprocessor Systems
Auto-Tuning TRSM with an Asynchronous Task Assignment Model on Multicore, GPU and Coprocessor Systems Murilo Boratto Núcleo de Arquitetura de Computadores e Sistemas Operacionais, Universidade do Estado
Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors
Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors Joe Davis, Sandeep Patel, and Michela Taufer University of Delaware Outline Introduction Introduction to GPU programming Why MD
A Load Balancing Tool for Structured Multi-Block Grid CFD Applications
A Load Balancing Tool for Structured Multi-Block Grid CFD Applications K. P. Apponsah and D. W. Zingg University of Toronto Institute for Aerospace Studies (UTIAS), Toronto, ON, M3H 5T6, Canada Email:
Overview of HPC systems and software available within
Overview of HPC systems and software available within Overview Available HPC Systems Ba Cy-Tera Available Visualization Facilities Software Environments HPC System at Bibliotheca Alexandrina SUN cluster
ST810 Advanced Computing
ST810 Advanced Computing Lecture 17: Parallel computing part I Eric B. Laber Hua Zhou Department of Statistics North Carolina State University Mar 13, 2013 Outline computing Hardware computing overview
Intel Xeon Processor E5-2600
Intel Xeon Processor E5-2600 Best combination of performance, power efficiency, and cost. Platform Microarchitecture Processor Socket Chipset Intel Xeon E5 Series Processors and the Intel C600 Chipset
Icepak High-Performance Computing at Rockwell Automation: Benefits and Benchmarks
Icepak High-Performance Computing at Rockwell Automation: Benefits and Benchmarks Garron K. Morris Senior Project Thermal Engineer [email protected] Standard Drives Division Bruce W. Weiss Principal
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
Jean-Pierre Panziera Teratec 2011
Technologies for the future HPC systems Jean-Pierre Panziera Teratec 2011 3 petaflop systems : TERA 100, CURIE & IFERC Tera100 Curie IFERC 1.25 PetaFlops 256 TB ory 30 PB disk storage 140 000+ Xeon cores
1 Bull, 2011 Bull Extreme Computing
1 Bull, 2011 Bull Extreme Computing Table of Contents HPC Overview. Cluster Overview. FLOPS. 2 Bull, 2011 Bull Extreme Computing HPC Overview Ares, Gerardo, HPC Team HPC concepts HPC: High Performance
GPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles [email protected] Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
Modeling of inflatable dams partially filled with fluid and gas considering large deformations and stability
Institute of Mechanics Modeling of inflatable dams partially filled with fluid and gas considering large deformations and stability October 2009 Anne Maurer Marc Hassler Karl Schweizerhof Institute of
Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
Evaluation of CUDA Fortran for the CFD code Strukti
Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center
~ Greetings from WSU CAPPLab ~
~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)
Graphic Processing Units: a possible answer to High Performance Computing?
4th ABINIT Developer Workshop RESIDENCE L ESCANDILLE AUTRANS HPC & Graphic Processing Units: a possible answer to High Performance Computing? Luigi Genovese ESRF - Grenoble 26 March 2009 http://inac.cea.fr/l_sim/
PCIe Over Cable Provides Greater Performance for Less Cost for High Performance Computing (HPC) Clusters. from One Stop Systems (OSS)
PCIe Over Cable Provides Greater Performance for Less Cost for High Performance Computing (HPC) Clusters from One Stop Systems (OSS) PCIe Over Cable PCIe provides greater performance 8 7 6 5 GBytes/s 4
GPGPU acceleration in OpenFOAM
Carl-Friedrich Gauß Faculty GPGPU acceleration in OpenFOAM Northern germany OpenFoam User meeting Braunschweig Institute of Technology Thorsten Grahs Institute of Scientific Computing/move-csc 2nd October
GeoImaging Accelerator Pansharp Test Results
GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance
CUDA programming on NVIDIA GPUs
p. 1/21 on NVIDIA GPUs Mike Giles [email protected] Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view
OpenFOAM Optimization Tools
OpenFOAM Optimization Tools Henrik Rusche and Aleks Jemcov [email protected] and [email protected] Wikki, Germany and United Kingdom OpenFOAM Optimization Tools p. 1 Agenda Objective Review optimisation
P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE
1 P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE JEAN-MARC GRATIEN, JEAN-FRANÇOIS MAGRAS, PHILIPPE QUANDALLE, OLIVIER RICOIS 1&4, av. Bois-Préau. 92852 Rueil Malmaison Cedex. France
Parallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
Retargeting PLAPACK to Clusters with Hardware Accelerators
Retargeting PLAPACK to Clusters with Hardware Accelerators Manuel Fogué 1 Francisco Igual 1 Enrique S. Quintana-Ortí 1 Robert van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores.
Simulation Platform Overview
Simulation Platform Overview Build, compute, and analyze simulations on demand www.rescale.com CASE STUDIES Companies in the aerospace and automotive industries use Rescale to run faster simulations Aerospace
Binary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
Gas Handling and Power Consumption of High Solidity Hydrofoils:
Gas Handling and Power Consumption of High Solidity Hydrofoils: Philadelphia Mixing Solution's HS Lightnin's A315 Keith E. Johnson1, Keith T McDermott2, Thomas A. Post3 1Independent Consultant, North Canton,
Model Order Reduction for Linear Convective Thermal Flow
Model Order Reduction for Linear Convective Thermal Flow Christian Moosmann, Evgenii B. Rudnyi, Andreas Greiner, Jan G. Korvink IMTEK, April 24 Abstract Simulation of the heat exchange between a solid
GPU Parallel Computing Architecture and CUDA Programming Model
GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel
GPU Programming in Computer Vision
Computer Vision Group Prof. Daniel Cremers GPU Programming in Computer Vision Preliminary Meeting Thomas Möllenhoff, Robert Maier, Caner Hazirbas What you will learn in the practical course Introduction
NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist
NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get
Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer
Res. Lett. Inf. Math. Sci., 2003, Vol.5, pp 1-10 Available online at http://iims.massey.ac.nz/research/letters/ 1 Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer
Hardware Acceleration for CST MICROWAVE STUDIO
Hardware Acceleration for CST MICROWAVE STUDIO Chris Mason Product Manager Amy Dewis Channel Manager Agenda 1. Introduction 2. Why use Hardware Acceleration? 3. Hardware Acceleration Technologies 4. Current
HPC Cluster Decisions and ANSYS Configuration Best Practices. Diana Collier Lead Systems Support Specialist Houston UGM May 2014
HPC Cluster Decisions and ANSYS Configuration Best Practices Diana Collier Lead Systems Support Specialist Houston UGM May 2014 1 Agenda Introduction Lead Systems Support Specialist Cluster Decisions Job
Intelligent Heuristic Construction with Active Learning
Intelligent Heuristic Construction with Active Learning William F. Ogilvie, Pavlos Petoumenos, Zheng Wang, Hugh Leather E H U N I V E R S I T Y T O H F G R E D I N B U Space is BIG! Hubble Ultra-Deep Field
Parallel Computing with MATLAB
Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best
Performance Characteristics of Large SMP Machines
Performance Characteristics of Large SMP Machines Dirk Schmidl, Dieter an Mey, Matthias S. Müller [email protected] Rechen- und Kommunikationszentrum (RZ) Agenda Investigated Hardware Kernel Benchmark
A Fast Double Precision CFD Code using CUDA
A Fast Double Precision CFD Code using CUDA Jonathan M. Cohen *, M. Jeroen Molemaker** *NVIDIA Corporation, Santa Clara, CA 95050, USA (e-mail: [email protected]) **IGPP UCLA, Los Angeles, CA 90095, USA
Fast Multipole Method for particle interactions: an open source parallel library component
Fast Multipole Method for particle interactions: an open source parallel library component F. A. Cruz 1,M.G.Knepley 2,andL.A.Barba 1 1 Department of Mathematics, University of Bristol, University Walk,
Master s in Computational Mathematics. Faculty of Mathematics, University of Waterloo
Master s in Computational Mathematics Faculty of Mathematics, University of Waterloo Master s in Computational Mathematics One-year, interdisciplinary Master s (duration of study 12 months, starts September
The Application of a Black-Box Solver with Error Estimate to Different Systems of PDEs
The Application of a Black-Bo Solver with Error Estimate to Different Systems of PDEs Torsten Adolph and Willi Schönauer Forschungszentrum Karlsruhe Institute for Scientific Computing Karlsruhe, Germany
CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER
CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER Tender Notice No. 3/2014-15 dated 29.12.2014 (IIT/CE/ENQ/COM/HPC/2014-15/569) Tender Submission Deadline Last date for submission of sealed bids is extended
