High Performance Matrix Inversion with Several GPUs




High Performance Matrix Inversion on a Multi-core Platform with Several GPUs Pablo Ezzatti 1, Enrique S. Quintana-Ortí 2 and Alfredo Remón 2 1 Centro de Cálculo-Instituto de Computación, Univ. de la República pezzatti@fing.edu.uy 2 Depto. de Ing. y Ciencia de los Computadores, Univ. Jaume I de Castellón {quintana,remon}@icc.uji.es Ayia Napa February 2011

Matrix inversion of large-scale matrices:
- Appears in a number of scientific applications, such as model reduction or optimal control
- Requires a high computational effort: 2n^3 floating-point operations (flops)
Graphics processors (GPUs):
- Massively parallel architectures
- Good results on the acceleration of linear algebra operations

Outline
1 Motivation
2 Matrix inversion
3 Implementations on a multi-core CPU and multiple GPUs
4 Experimental analysis
5 Concluding remarks

Matrix inversion
Two different approaches:
1 Based on Gaussian elimination (i.e., the LU factorization)
2 Based on Gauss-Jordan elimination (GJE)
Both approaches present a similar computational cost but different properties.

Matrix inversion via LU factorization
1 PA = LU
2 U -> U^-1
3 Solve the system X L = U^-1 for X
4 Undo the permutations: A^-1 := X P
Implementation:
- The algorithm sweeps through the matrix four times
- Presents a mild load imbalance, due to the work with triangular factors
- This is the algorithm implemented by LAPACK
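As an illustration only (the real codes call LAPACK's getrf/getri), the four steps above can be sketched in NumPy. The partial-pivoting loop and helper name `inv_via_lu` are ours, not from the talk:

```python
import numpy as np

def inv_via_lu(A):
    """Sketch of the four-step LU-based inversion (illustration only)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    # 1) PA = LU: right-looking factorization with partial pivoting
    LU = A.copy()
    perm = np.arange(n)
    for k in range(n - 1):
        p = k + np.argmax(np.abs(LU[k:, k]))
        LU[[k, p]] = LU[[p, k]]
        perm[[k, p]] = perm[[p, k]]
        LU[k+1:, k] /= LU[k, k]
        LU[k+1:, k+1:] -= np.outer(LU[k+1:, k], LU[k, k+1:])
    L = np.tril(LU, -1) + np.eye(n)   # unit lower triangular factor
    U = np.triu(LU)
    # 2) U -> U^-1 (a general solve here; LAPACK exploits the triangle)
    Uinv = np.linalg.solve(U, np.eye(n))
    # 3) solve X L = U^-1 for X, via the transposed system L^T X^T = U^-T
    X = np.linalg.solve(L.T, Uinv.T).T
    # 4) undo the permutations: A^-1 := X P
    Ainv = np.empty_like(X)
    Ainv[:, perm] = X
    return Ainv
```

Each of the four steps maps onto one sweep over the matrix, which is the source of the extra memory traffic compared with GJE.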

Matrix inversion via GJE
Based on Gauss-Jordan elimination; in essence, it is a reordering of the operations, with the same arithmetic cost.
Implementation:
- The algorithm sweeps through the matrix once -> fewer memory accesses
- Most of the computations are highly parallel -> more parallelism

Matrix inversion via GJE
At a given iteration, the matrix is partitioned into a 3x3 blocking

    A00 A01 A02
    A10 A11 A12
    A20 A21 A22

where A11 is b x b. The current panel is processed with the unblocked algorithm:

    [A01; A11; A21] := GJE_unb([A01; A11; A21])

Matrix inversion via GJE
Then the rest of the matrix is updated (all right-hand sides use the block values from the start of the iteration):

    A00 := A00 + A01 A10
    A02 := A02 + A01 A12
    A20 := A20 + A21 A10
    A22 := A22 + A21 A12
    A10 := A11 A10
    A12 := A11 A12

Matrix inversion via GJE
Move the boundaries for the next iteration: the next b x b block A11 shifts one position down the diagonal.
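The iteration described above can be sketched, in its unblocked form, as an in-place Gauss-Jordan inversion with partial pivoting. This NumPy version (our own sketch, not the authors' code) undoes the row interchanges by swapping columns at the end:

```python
import numpy as np

def gje_inverse(A):
    """In-place Gauss-Jordan inversion with partial pivoting
    (unblocked sketch; the blocked algorithm applies the same
    transformations panel by panel)."""
    A = np.asarray(A, dtype=float).copy()
    n = A.shape[0]
    pivots = np.empty(n, dtype=int)
    for k in range(n):
        # partial pivoting: bring the largest entry of column k to row k
        p = k + np.argmax(np.abs(A[k:, k]))
        pivots[k] = p
        A[[k, p]] = A[[p, k]]
        # eliminate column k, accumulating the inverse in place
        d = A[k, k]
        A[k, :] /= d
        A[k, k] = 1.0 / d
        for i in range(n):
            if i != k:
                f = A[i, k]
                A[i, k] = 0.0
                A[i, :] -= f * A[k, :]
    # the row swaps inverted P A, so undo them as column swaps
    for k in reversed(range(n)):
        A[:, [k, pivots[k]]] = A[:, [pivots[k], k]]
    return A
```

Note how the single sweep replaces the four sweeps of the LU-based approach: every row update is an independent saxpy, which is what makes the bulk of the work map well onto GPUs.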

Outline
1 Motivation
2 Matrix inversion
3 Implementations on a multi-core CPU and multiple GPUs
4 Experimental analysis
5 Concluding remarks

Implementations on a multi-core CPU and multiple GPUs
Implementation GJE_mgpu:
- Executes every operation on the most convenient device: GJE_unb on the CPU, matrix-matrix products on the GPUs
- Data distribution: data is uniformly distributed among the GPUs; each GPU performs the operations that involve its respective panel

Implementation GJE_mgpu:
- All GPUs compute concurrently
- The update of the current panel on the CPU is overlapped with some of the updates on the GPUs
- The active column panel is first transferred to the CPU, and broadcast to the GPUs after being updated

Implementation GJE_LA:
- Based on GJE_mgpu, but reorders the operations applying a look-ahead approach
- The first b columns of the block [A02; A12; A22] are updated and transferred to the CPU

Implementation GJE_LA:
- The CPU updates [A01; A11; A21] and the newly received block (that is, the [A01; A11; A21] blocks of the next iteration)
- Concurrently, the GPUs update the rest of the matrix
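The look-ahead schedule above can be sketched generically: while the accelerator applies the trailing update of iteration k, the CPU already factors the panel of iteration k+1. This skeleton (our own sketch; `factor_panel` and `update_trailing` are hypothetical callables supplied by the caller) uses a worker thread to stand in for the GPUs:

```python
from concurrent.futures import ThreadPoolExecutor

def run_with_lookahead(n_iters, factor_panel, update_trailing):
    """Sketch of a look-ahead schedule: panel k+1 is factored on the
    CPU while the trailing update of iteration k runs on the 'GPU'."""
    log = []  # records the order in which work is issued/completed
    with ThreadPoolExecutor(max_workers=1) as gpu:
        factor_panel(0)                        # CPU: first panel, no overlap
        log.append(('cpu', 0))
        for k in range(n_iters):
            fut = gpu.submit(update_trailing, k)   # GPU: trailing update k
            if k + 1 < n_iters:
                factor_panel(k + 1)                # CPU: next panel, overlapped
                log.append(('cpu', k + 1))
            fut.result()                           # synchronize before moving on
            log.append(('gpu', k))
    return log
```

The log shows that the CPU work of iteration k+1 is always issued before the GPU work of iteration k completes, which is exactly the overlap GJE_LA exploits.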

Implementation GJE_ML:
- The optimal algorithmic block sizes for the CPU and the GPU are significantly different
- Using the same block size on both architectures limits the performance of GJE_LA
- In this version, the CPU employs a blocked version of GJE instead of the unblocked one
- The GPUs and the CPU employ block sizes b and b_c, respectively, allowing each architecture to optimize its performance

Implementation GJE_CD:
- At any iteration, one of the GPUs requires more time than the rest, leading to load imbalance
- This problem can be partially overcome by employing a cyclic distribution
- The matrix is partitioned into blocks of b columns, and the i-th block of columns is stored and updated by the (i mod k)-th GPU, where k is the number of GPUs
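The cyclic mapping is simple to state in code. This sketch (helper names are ours) assigns column blocks to GPUs and shows that each device owns an interleaved, nearly equal share, so the shrinking active part of the matrix stays balanced across devices:

```python
def cyclic_owner(block_index, num_gpus):
    """Block i of b columns is stored and updated by GPU (i mod k)."""
    return block_index % num_gpus

def blocks_of(gpu, n, b, num_gpus):
    """Column blocks assigned to one GPU under the cyclic distribution."""
    num_blocks = (n + b - 1) // b   # ceil(n / b)
    return [i for i in range(num_blocks)
            if cyclic_owner(i, num_gpus) == gpu]
```

With a block distribution, the GPU holding the leftmost columns runs out of work as the iteration advances; with the cyclic distribution every GPU keeps a slice of the active trailing matrix until the end.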

Implementation GJE_Merge:
- All GPUs perform 3 operations per iteration
- With a minor change to the algorithm, this can be reduced to one operation per iteration, with the following advantages:
  - Less overhead due to routine invocations
  - Avoids the matrix-matrix products that involve small blocks

Implementation GJE_Merge:
At a given iteration, with the same 3x3 partitioning (A11 is b x b), the panel is processed with the blocked algorithm:

    [A01; A11; A21] := GJE_blk([A01; A11; A21])

Implementation GJE_Merge:

    W1 := A10;  A10 := 0
    W2 := A12;  A12 := 0

    [ A00 A02 ]      [ A00 A02 ]   [ A01 ]
    [ A10 A12 ]  :=  [ A10 A12 ] + [ A11 ] [ W1 W2 ]
    [ A20 A22 ]      [ A20 A22 ]   [ A21 ]

Implementation GJE_Merge:
Move the boundaries for the next iteration (A11 is b x b).

Implementation GJE_Merge:
An important performance improvement can be obtained by merging the extra copies with the swap stage required by pivoting: W1 and W2 then contain blocks A10 and A12 after the pivoting has been applied. This considerably reduces the number of memory accesses and partially hides the overhead introduced by the copies of A10 and A12.
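The merged update can be sketched in NumPy to show that it reproduces the six separate block updates with a single rank-b matrix-matrix product (on the GPU this becomes one gemm call; the function below is our own sketch and mutates A in place):

```python
import numpy as np

def merged_update(A, k, b):
    """GJE_Merge: one GEMM replaces the six block updates of an iteration.
    Assumes the panel columns k:k+b of A already hold the transformed
    [A01; A11; A21] (block names as in the slides)."""
    panel = A[:, k:k+b].copy()      # [A01; A11; A21]
    W = A[k:k+b, :].copy()          # W1 = A10 and W2 = A12, plus the panel rows
    W[:, k:k+b] = 0.0               # zero panel columns so they stay untouched
    A[k:k+b, :k] = 0.0              # A10 := 0
    A[k:k+b, k+b:] = 0.0            # A12 := 0
    A += panel @ W                  # single rank-b update of the whole matrix
    return A
```

Because W zeroes out the panel columns, the product leaves [A01; A11; A21] intact while updating everything else, including overwriting the zeroed A10 and A12 with A11 W1 and A11 W2.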

Outline
1 Motivation
2 Matrix inversion
3 Implementations on a multi-core CPU and multiple GPUs
4 Experimental analysis
5 Concluding remarks

Experimental analysis
Hardware:

    Processor              #Cores        Frequency (GHz)  L2 cache (MB)  Memory (GB)
    Intel Xeon QuadCore    8 (2x4)       2.27             8              48
    NVIDIA Tesla C1060     960 (4x240)   1.3              -              16 (4x4)

BLAS implementations: MKL 11.1, NVIDIA CUBLAS 3.0
Results for matrices with 1000 <= n <= 64000

Experimental analysis (matrix inversion on 2 GPUs)
[Figure: GFLOPS vs. matrix dimension (x1000, from 0 to 40) for LAPACK and the GJE_mgpu, GJE_LA, GJE_ML, GJE_CD and GJE_Merge variants.]

Experimental analysis (matrix inversion on 4 GPUs)
[Figure: GFLOPS vs. matrix dimension (x1000, from 0 to 40) for LAPACK and the GJE_mgpu, GJE_LA, GJE_ML, GJE_CD and GJE_Merge variants.]

Experimental analysis (variant GJE_Merge on 1, 2, 3 and 4 GPUs)
[Figure: GFLOPS vs. matrix dimension (x1000, from 0 to 70) for GJE_Merge on 1, 2, 3 and 4 GPUs.]

Outline
1 Motivation
2 Matrix inversion
3 Implementations on a multi-core CPU and multiple GPUs
4 Experimental analysis
5 Concluding remarks

Concluding Remarks
- We have presented five implementations of matrix inversion based on the GJE method on multiple GPUs
- These implementations are between 6 and 12 times faster than LAPACK when using 4 GPUs, and exhibit excellent scalability (nearly linear speed-up)
- The use of multiple GPUs:
  - Reduces the computational time
  - Increases the amount of memory available, allowing the inversion of larger matrices

Future Work
- Evaluation of double-precision arithmetic on the new NVIDIA Fermi architecture
- GJE_Merge is clearly limited by the performance of the CUBLAS routine for matrix-matrix products; other GPU kernels should be evaluated

Questions?