# High Performance Matrix Inversion with Several GPUs


1 High Performance Matrix Inversion on a Multi-core Platform with Several GPUs
Pablo Ezzatti (1), Enrique S. Quintana-Ortí (2) and Alfredo Remón (2)
(1) Centro de Cálculo-Instituto de Computación, Univ. de la República
(2) Depto. de Ing. y Ciencia de los Computadores, Univ. Jaume I de Castellón
Ayia Napa, February 2011

2 Matrix inversion of large-scale matrices:
Appears in a number of scientific applications, such as model reduction and optimal control
Requires a high computational effort: 2n³ floating-point operations (flops)
Graphics processors (GPUs):
Massively parallel architectures
Good results on the acceleration of linear algebra operations

3 Outline
1 Motivation
2 Matrix inversion
3 Implementations on a multi-core CPU and multiple GPUs
4 Experimental analysis
5 Concluding remarks

4 Matrix inversion
Two different approaches:
1 Based on Gaussian elimination (i.e., the LU factorization)
2 Based on Gauss-Jordan elimination (GJE)
Both approaches present a similar computational cost but different properties

5 Matrix inversion via LU factorization
1 Compute the LU factorization with partial pivoting: PA = LU
2 Invert the upper triangular factor: U → U⁻¹
3 Solve the system XL = U⁻¹ for X
4 Undo the permutations: A⁻¹ := XP
Implementation:
The algorithm sweeps through the matrix four times
Presents a mild load imbalance, due to the work with triangular factors
This is the algorithm implemented by LAPACK
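The four steps above can be sketched with NumPy/SciPy. This is an illustrative dense sketch, not the blocked in-place routine LAPACK actually uses; note that SciPy's `lu` follows the A = PLU convention, so the permutation is undone with Pᵀ rather than P:

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

def inverse_via_lu(A):
    """Invert A following the four steps above (dense, illustrative)."""
    # 1) LU factorization with partial pivoting (SciPy returns A = P @ L @ U)
    P, L, U = lu(A)
    # 2) invert the upper triangular factor: U -> U^{-1}
    Uinv = solve_triangular(U, np.eye(U.shape[0]), lower=False)
    # 3) solve X L = U^{-1} for X, via the transposed triangular system
    X = solve_triangular(L.T, Uinv.T, lower=False).T
    # 4) undo the permutations: with A = P L U, A^{-1} = X P^T
    return X @ P.T
```

Checking `inverse_via_lu(A) @ A` against the identity on a small test matrix confirms the four-step recurrence.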

6 Matrix inversion via GJE
Based on Gauss-Jordan elimination
In essence, a reordering of the operations of Gaussian elimination: same arithmetic cost
Implementation:
The algorithm sweeps through the matrix only once → fewer memory accesses
Most of the computations are highly parallel → more parallelism

7 Matrix inversion via GJE
At a given iteration, the matrix is partitioned into 3×3 blocks:
A00 A01 A02
A10 A11 A12
A20 A21 A22
where A11 is b×b. The current column panel is processed with the unblocked algorithm:
[A01; A11; A21] := GJE_unb([A01; A11; A21])

8 Matrix inversion via GJE
The remaining blocks are then updated with the processed panel:
A00 := A00 + A01 A10
A02 := A02 + A01 A12
A20 := A20 + A21 A10
A22 := A22 + A21 A12
A10 := A11 A10
A12 := A11 A12
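For b = 1 the panel is a single column and the six updates above collapse to one rank-1 update per iteration. A minimal NumPy sketch, with no pivoting (so it assumes nonzero pivots); the real implementations use blocked panels and partial pivoting:

```python
import numpy as np

def gje_inverse_unblocked(A):
    """In-place Gauss-Jordan inversion with b = 1 (no pivoting)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(n):
        p = A[k, k]                  # pivot: the 1x1 block A11
        u = A[:, k].copy()           # current column [A01; A11; A21]
        v = A[k, :].copy()           # current row    [A10  A11  A12]
        A -= np.outer(u, v) / p      # A00, A02, A20, A22 updates (rank-1)
        A[k, :] = v / p              # A10 := A11 A10,  A12 := A11 A12
        A[:, k] = -u / p             # factored panel replaces column k
        A[k, k] = 1.0 / p            # A11 := 1 / A11
    return A
```

On a small symmetric positive definite matrix (where no pivoting is needed) the result matches `numpy.linalg.inv`, which confirms the update formulas.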

9 Matrix inversion via GJE
Move the boundaries for the next iteration, so that the next b×b diagonal block becomes A11

10 Outline
1 Motivation
2 Matrix inversion
3 Implementations on a multi-core CPU and multiple GPUs
4 Experimental analysis
5 Concluding remarks

11 Implementations on a multi-core CPU and multiple GPUs
Implementation GJE_mgpu
Executes every operation on the most convenient device:
GJE_unb on the CPU
Matrix-matrix products on the GPUs
Data distribution:
Data is uniformly distributed among the GPUs (one contiguous block of columns per GPU: GPU 1, GPU 2, GPU 3, GPU 4)
Each GPU performs the operations that involve its respective panel

12 Implementations on a multi-core CPU and multiple GPUs
Implementation GJE_mgpu
All GPUs compute concurrently
The update of the current panel on the CPU is overlapped with some of the updates on the GPUs
The active column panel is first transferred to the CPU, and broadcast to the GPUs after being updated

13 Implementations on a multi-core CPU and multiple GPUs
Implementation GJE_LA
Based on GJE_mgpu, but reorders the operations following a look-ahead approach
The first b columns of the block [A02; A12; A22] are updated and transferred to the CPU

14 Implementations on a multi-core CPU and multiple GPUs
Implementation GJE_LA
The CPU updates [A01; A11; A21] and the newly received block (that is, the blocks [A01; A11; A21] of the next iteration)
Concurrently, the GPUs update the rest of the matrix
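The look-ahead pattern itself is independent of the GPU details. A generic sketch with Python threads, where `factor_cpu` and `update_gpu` are hypothetical stand-ins for the CPU panel factorization and the GPU trailing update; the sketch elides the real data dependency (panel j+1 must first be updated with panel j and sent ahead, as described above):

```python
from concurrent.futures import ThreadPoolExecutor

def lookahead(panels, factor_cpu, update_gpu):
    """While the 'GPUs' apply update j, the 'CPU' already factors panel j+1."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as cpu:
        fut = cpu.submit(factor_cpu, panels[0])     # factor the first panel
        for j in range(len(panels)):
            factored = fut.result()                 # wait for panel j
            if j + 1 < len(panels):
                fut = cpu.submit(factor_cpu, panels[j + 1])  # look ahead
            results.append(update_gpu(factored))    # overlaps with fut
    return results
```

With trivial stand-in functions, e.g. `lookahead([1, 2, 3], lambda x: x + 1, lambda x: x * 2)`, each panel is factored and then updated in order while the next factorization is already in flight.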

15 Implementations on a multi-core CPU and multiple GPUs
Implementation GJE_ML
The optimal algorithmic block sizes for the CPU and the GPUs are significantly different
Using the same block size on both architectures limits the performance of GJE_LA
In this version, the CPU employs a blocked version of GJE instead of the unblocked one
The CPU and the GPUs employ block sizes b and b_c, respectively, allowing each architecture to optimize its performance

16 Implementations on a multi-core CPU and multiple GPUs
Implementation GJE_CD
At any iteration, one of the GPUs requires more time than the rest, leading to load imbalance
We can partially overcome this problem by employing a cyclic distribution:
The matrix is partitioned into blocks of b columns, and the i-th block of columns is stored and updated by the (i mod k)-th GPU, for k GPUs
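A small sketch of the two distributions (the helper names are hypothetical). With a uniform block distribution the active panel, and the uneven work attached to it, sits on the same GPU for nblocks/k consecutive iterations; the cyclic distribution rotates it to a different GPU every iteration:

```python
def block_owner(i, nblocks, k):
    """Uniform distribution: GPU g owns a contiguous chunk of nblocks // k blocks."""
    return i // (nblocks // k)

def cyclic_owner(i, k):
    """Cyclic distribution: the i-th block of b columns goes to GPU (i mod k)."""
    return i % k

nblocks, k = 8, 4
print([block_owner(i, nblocks, k) for i in range(nblocks)])  # [0, 0, 1, 1, 2, 2, 3, 3]
print([cyclic_owner(i, k) for i in range(nblocks)])          # [0, 1, 2, 3, 0, 1, 2, 3]
```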

17 Implementations on a multi-core CPU and multiple GPUs
Implementation GJE_Merge
All GPUs perform 3 operations per iteration
With a minor change to the algorithm, we can reduce this to one operation per iteration, with the following advantages:
Less overhead due to routine invocations
Avoids the matrix-matrix products that involve small blocks

18 Implementations on a multi-core CPU and multiple GPUs
Implementation GJE_Merge
At a given iteration, the matrix is partitioned into 3×3 blocks (A11 is b×b), and the current column panel is processed as before:
[A01; A11; A21] := GJE_unb([A01; A11; A21])

19 Implementations on a multi-core CPU and multiple GPUs
Implementation GJE_Merge
W1 := A10, then A10 := 0
W2 := A12, then A12 := 0
[A00 A02; A10 A12; A20 A22] := [A00 A02; A10 A12; A20 A22] + [A01; A11; A21] [W1 W2]
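The equivalence between the separate updates and the single merged product can be checked numerically. A NumPy sketch for one iteration, where `panel` is a random stand-in for the factored [A01; A11; A21] (the equivalence is purely algebraic, so any panel works; pivoting is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
n, b, k = 6, 2, 2                        # A11 occupies rows/cols k : k+b
A = rng.standard_normal((n, n))
panel = rng.standard_normal((n, b))      # stand-in for the factored panel
rows = np.arange(k, k + b)               # row indices of A10/A11/A12
out = np.r_[0:k, k + b:n]                # row indices of the A00/A20 block rows
cols = np.r_[0:k, k + b:n]               # column indices left/right of the panel
W = A[np.ix_(rows, cols)].copy()         # W = [W1 W2] = [A10 A12]

# separate updates, as on the earlier GJE slide
ref = A.copy()
ref[np.ix_(out, cols)] += panel[out] @ W    # A00, A02, A20, A22 updates
ref[np.ix_(rows, cols)] = panel[rows] @ W   # A10 := A11 W1, A12 := A11 W2

# merged variant: save and zero A10/A12, then one single matrix-matrix product
M = A.copy()
M[np.ix_(rows, cols)] = 0.0
M[:, cols] += panel @ W

assert np.allclose(M, ref)
```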

20 Implementations on a multi-core CPU and multiple GPUs
Implementation GJE_Merge
Move the boundaries for the next iteration, so that the next b×b diagonal block becomes A11

21 Implementations on a multi-core CPU and multiple GPUs
Implementation GJE_Merge
An important performance improvement can be obtained by merging the extra copies with the swap stage required by pivoting: W1 and W2 then contain the blocks A10 and A12 after the pivoting has been applied
This considerably reduces the number of memory accesses and partially hides the overhead introduced by the copies of A10 and A12

22 Outline
1 Motivation
2 Matrix inversion
3 Implementations on a multi-core CPU and multiple GPUs
4 Experimental analysis
5 Concluding remarks

23 Experimental analysis
Platform: Intel Xeon QuadCore with 8 cores (2×4), and an NVIDIA Tesla system with 4 GPUs (4×240 cores, 4×4 GB of memory)
BLAS implementations: MKL 11.1 and NVIDIA CUBLAS 3.0
Results for matrices with 1000 ≤ n ≤ 64000

24 Experimental analysis: matrix inversion on 2 GPUs
(Plot: GFLOPS vs. matrix dimension (×1000) for LAPACK, GJE_mgpu, GJE_LA, GJE_ML, GJE_CD and GJE_Merge)

25 Experimental analysis: matrix inversion on 4 GPUs
(Plot: GFLOPS vs. matrix dimension (×1000) for LAPACK, GJE_mgpu, GJE_LA, GJE_ML, GJE_CD and GJE_Merge)

26 Experimental analysis: variant GJE_Merge on 1, 2, 3 and 4 GPUs
(Plot: GFLOPS vs. matrix dimension (×1000))

27 Outline
1 Motivation
2 Matrix inversion
3 Implementations on a multi-core CPU and multiple GPUs
4 Experimental analysis
5 Concluding remarks

28 Concluding Remarks
We have presented five implementations of matrix inversion based on the GJE method on multiple GPUs
These implementations are between 6 and 12 times faster than LAPACK when using 4 GPUs, and exhibit excellent scalability (nearly linear speed-up)
The use of multiple GPUs:
Reduces the computational time
Increases the amount of memory available, allowing the inversion of larger matrices

29 Future Work
Evaluation of double precision arithmetic on the new NVIDIA Fermi architecture
GJE_Merge is clearly limited by the performance of the CUBLAS routine for matrix-matrix products; other GPU kernels should be evaluated

30 Questions?
