# A DAG-based sparse Cholesky solver for multicore architectures. Jonathan Hogg, John Reid, Jennifer Scott. Sparse Days at CERFACS, June 2009


1 A DAG-based sparse Cholesky solver for multicore architectures. Jonathan Hogg, John Reid, Jennifer Scott. Sparse Days at CERFACS, June 2009

2 Outline of talk: how to efficiently solve Ax = b on multicore machines. Introduction. Dense systems. Sparse systems. Future directions and conclusions.

3 Solving systems in parallel. Haven't we been solving linear systems in parallel for years? Yes: large problems on distributed-memory machines. We want to solve medium and large problems on desktop machines: shared memory, complex cache-based architectures. 2 to 6 cores are now in all new machines, and higher core counts will soon be standard. Traditional MPI methods work, but not well.

4 Faster. I have an 8-core machine... I want to go (nearly) 8 times faster.

5 The dense problem. Solve Ax = b with A symmetric and dense, positive definite (indefinite problems require pivoting), and not small (order at least a few hundred).

6 Pen and paper approach. Factorize A = LL^T, then solve Ax = b as Ly = b, L^T x = y.

7 Pen and paper approach. Factorize A = LL^T, then solve Ax = b as Ly = b, L^T x = y. Algorithm, for each column k: L_kk = sqrt(A_kk) (calculate the diagonal element); for rows i > k, L_ik = A_ik L_kk^{-1} (divide the column by the diagonal); update the trailing submatrix, A_{(k+1:n),(k+1:n)} ← A_{(k+1:n),(k+1:n)} − L_{(k+1:n),k} L_{(k+1:n),k}^T.
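
The column-by-column algorithm and the two triangular solves above can be sketched in a few lines of NumPy (a minimal illustration, not the talk's Fortran code; all names here are made up):

```python
import numpy as np

def cholesky_columns(A):
    """Factorize A = L L^T, looping over columns as on the slide."""
    n = A.shape[0]
    L = np.tril(A).astype(float)
    for k in range(n):
        L[k, k] = np.sqrt(L[k, k])            # diagonal element
        L[k+1:, k] /= L[k, k]                 # divide column by diagonal
        # update trailing submatrix: A <- A - l l^T
        L[k+1:, k+1:] -= np.outer(L[k+1:, k], L[k+1:, k])
    return np.tril(L)

def solve_with_factor(L, b):
    """Solve A x = b via L y = b, then L^T x = y."""
    n = L.shape[0]
    y = np.zeros(n)
    for i in range(n):                        # forward substitution
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    x = np.zeros(n)
    for i in reversed(range(n)):              # backward substitution
        x[i] = (y[i] - L[i+1:, i] @ x[i+1:]) / L[i, i]
    return x
```

Element-wise loops like these are exactly what the blocked approach on the next slides replaces with BLAS calls.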

8 Serial approach. Exploit caches: use the algorithm by blocks. It is the same algorithm, but on submatrices rather than elements, and roughly 10 times faster than a naive implementation. Built using the Basic Linear Algebra Subprograms (BLAS).

9 Cholesky by blocks. Factorize the diagonal block: Factor(col).

10 Cholesky by blocks. Solve the column block: Solve(row, col).

11 Cholesky by blocks. Update a block: Update(row, source col, target col).
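
The three block operations compose into a right-looking blocked Cholesky; a minimal NumPy sketch (block size and helper names are illustrative, not from the talk):

```python
import numpy as np

def factor(Akk):
    # Factor(col): dense Cholesky of a diagonal block
    return np.linalg.cholesky(Akk)

def solve(Aik, Lkk):
    # Solve(row, col): L_ik = A_ik L_kk^{-T}
    return np.linalg.solve(Lkk, Aik.T).T

def update(Aij, Lik, Ljk):
    # Update(row, scol, tcol): A_ij <- A_ij - L_ik L_jk^T
    return Aij - Lik @ Ljk.T

def blocked_cholesky(A, nb):
    """Right-looking Cholesky using nb x nb blocks."""
    n = A.shape[0]
    L = A.astype(float).copy()
    for k in range(0, n, nb):
        ke = min(k + nb, n)
        L[k:ke, k:ke] = factor(L[k:ke, k:ke])
        for i in range(ke, n, nb):
            ie = min(i + nb, n)
            L[i:ie, k:ke] = solve(L[i:ie, k:ke], L[k:ke, k:ke])
        for j in range(ke, n, nb):
            je = min(j + nb, n)
            for i in range(j, n, nb):
                ie = min(i + nb, n)
                L[i:ie, j:je] = update(L[i:ie, j:je],
                                       L[i:ie, k:ke], L[j:je, k:ke])
    return np.tril(L)
```

In a real implementation each of the three helpers maps onto a BLAS/LAPACK kernel (potrf, trsm, gemm/syrk), which is where the cache efficiency comes from.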

12 Parallelism mechanisms. MPI: designed for distributed memory; requires substantial changes. OpenMP: designed for shared memory. pthreads: POSIX threads; no Fortran API. Intel Threading Building Blocks (TBB): no Fortran API. Coarrays: not yet widely supported.


16 Traditional approach. Just parallelise the operations: Solve(row, col) can be done in parallel, and Update(row, scol, tcol) is easily split as well. What does this look like?

17 Parallel right-looking (figure).

18 Jack says... Jack Dongarra's suggested aims for multicore: Low granularity: many more tasks than cores. Asynchronicity: don't use tight coupling; only enforce the necessary ordering. Dynamic scheduling: static (precomputed) scheduling is easily upset. Locality of reference: cache/performance ratios will only get worse, so try not to upset them.


22 DAGs. What do we really need to synchronise? Represent each block operation (Factor, Solve, Update) as a task; tasks have dependencies. Represent this as a directed graph: tasks are vertices, dependencies are directed edges. It is acyclic, hence we have a Directed Acyclic Graph (DAG).

23 Task DAG. Factor(col): A_kk = L_kk L_kk^T. Solve(row, col): L_ik = A_ik L_kk^{-T}. Update(row, scol, tcol): A_ij ← A_ij − L_ik L_jk^T.

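
For an N x N grid of blocks, the task DAG can be enumerated explicitly. A sketch (not the authors' code; task keys ('factor', k), ('solve', i, k) and ('update', i, j, k), with i the row, j the target column and k the source column, are illustrative):

```python
def cholesky_dag(N):
    """Build the blocked-Cholesky task DAG: task -> set of prerequisites."""
    deps = {}
    for k in range(N):
        # factor block (k,k) once every earlier column has updated it
        deps[('factor', k)] = {('update', k, k, c) for c in range(k)}
        for i in range(k + 1, N):
            # solve block (i,k) after factor(k) and all earlier updates to it
            deps[('solve', i, k)] = ({('factor', k)} |
                                     {('update', i, k, c) for c in range(k)})
            for j in range(k + 1, i + 1):
                # update block (i,j) from column k needs L_ik and L_jk
                deps[('update', i, j, k)] = {('solve', i, k),
                                             ('solve', j, k)}
    return deps
```

Even for modest N this yields far more tasks than cores (O(N^3) updates), which is exactly the low granularity Dongarra's aims call for.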

25 Profile (figure).

26 Results (figure).


28 Speedup for dense case (figure: speedup against n). The new dense DAG code HSL_MP54 is available in HSL 2007.

29 Sparse case? So far, so dense. What about sparse factorizations?

30 Sparse matrices. A sparse matrix is mostly zero, so only the non-zeros are tracked. The factor L is denser than A; the extra entries are known as fill-in. Fill-in is reduced by preordering A.
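
A tiny illustration of why preordering matters (the "arrowhead" example and all names here are illustrative, not from the talk): with the dense row/column ordered first, the Cholesky factor fills in completely; ordered last, there is no fill-in at all.

```python
import numpy as np

def arrowhead(n, dense_first=True):
    """SPD arrowhead matrix: identity scaled by 4 plus one dense row/column."""
    A = np.eye(n) * 4.0
    r = 0 if dense_first else n - 1
    A[r, :] = 1.0
    A[:, r] = 1.0
    A[r, r] = float(n)
    return A

def nnz_lower(L, tol=1e-12):
    """Count structurally non-zero entries in the lower triangle."""
    return int(np.sum(np.abs(np.tril(L)) > tol))
```

For n = 6, factoring with the dense column first gives a completely dense factor (21 lower-triangular non-zeros), while the reversed ordering gives just the diagonal plus the last row (11 non-zeros): the same matrix, permuted, with radically different fill-in.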

31 Direct methods. Generally comprise four phases. Reorder: find a symmetric permutation P to reduce fill-in. Analyse: predict the non-zero pattern and build the elimination tree. Factorize: perform the numerical factorization using the data structures built in the analyse phase. Solve: use the factors to solve Ax = b. Aim: organise the computations to use dense kernels on submatrices.

32 Elimination and assembly tree. The elimination tree provides a partial ordering for the operations: if U is a descendant of V, then U must be factorized first. To exploit the BLAS, adjacent nodes whose columns have the same (or similar) sparsity structure are combined; the condensed tree is the assembly tree.


34 Factorize phase. Existing parallel approaches usually rely on two levels of parallelism. Tree-level parallelism: the assembly tree specifies only a partial ordering (a parent is processed after its children), so independent subtrees can be processed in parallel. Node-level parallelism: parallelism within the operations at a node, normally used near the root. Our experience: speedups are less than ideal on multicore machines.


37 Sparse DAG. Basic idea: extend the DAG-based approach to the sparse case by adding a new type of task to perform the sparse update operations. Hold each set of contiguous columns of L with (nearly) the same pattern as a dense trapezoidal matrix, referred to as a nodal matrix. Divide the nodal matrix into blocks and perform tasks on the blocks.


39 Tasks in sparse DAG. factorize(diag): computes the dense Cholesky factor L_triang of the triangular part of the block diag on the diagonal; if the block is trapezoidal, also performs a triangular solve of its rectangular part, L_rect ← L_rect L_triang^{-T}. solve(dest, diag): performs a triangular solve of the off-diagonal block dest by the Cholesky factor L_triang of the block diag on its diagonal, L_dest ← L_dest L_triang^{-T}.

40 Tasks in sparse DAG. update_internal(dest, rsrc, csrc): within a nodal matrix, performs the update L_dest ← L_dest − L_rsrc L_csrc^T.

41 Tasks in sparse DAG. update_between(dest, snode, scol): performs the update L_dest ← L_dest − L_rsrc L_csrc^T, where L_dest is a submatrix of the block dest of an ancestor of node snode, and L_rsrc and L_csrc are submatrices of contiguous rows of block column scol of snode.

42 update_between(dest, snode, scol): 1. Form the outer product L_rsrc L_csrc^T into a buffer. 2. Distribute the result into the destination block L_dest.


46 Dependency count. During the analyse phase, calculate the number of tasks to be performed for each block of L. During the factorization, keep a running count of the outstanding tasks for each block. When the count reaches 0 for a block on the diagonal, store a factorize task and decrement the count of each off-diagonal block in its block column by one. When the count reaches 0 for an off-diagonal block, store a solve task and decrement the count of each block awaiting the solve by one; update tasks may then be spawned.
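
A single-threaded sketch of the dependency-count idea (much simplified relative to the HSL code, and purely illustrative): each task carries a count of outstanding prerequisites; when a task completes, it decrements its dependents' counts and any that reach zero become ready.

```python
from collections import deque

def run_by_dependency_count(deps):
    """deps: task -> set of prerequisite tasks (every task is a key).
    Returns one valid execution order."""
    count = {t: len(p) for t, p in deps.items()}   # outstanding prerequisites
    dependents = {t: [] for t in deps}             # reverse edges
    for t, prereqs in deps.items():
        for p in prereqs:
            dependents[p].append(t)
    ready = deque(t for t, c in count.items() if c == 0)
    order = []
    while ready:
        t = ready.popleft()                        # "execute" the task
        order.append(t)
        for d in dependents[t]:
            count[d] -= 1
            if count[d] == 0:                      # all prerequisites done
                ready.append(d)
    return order
```

In the real solver the counts are maintained concurrently and reaching zero spawns the task into the pool described on the next slide, but the bookkeeping is the same.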

47 Task pool. Each cache keeps a small stack of tasks intended for use by the threads sharing that cache. Tasks are added to and drawn from the top of the local stack; if the stack becomes full, the bottom half is moved to the task pool. Tasks in the pool are given priorities: 1. factorize (highest priority); 2. solve; 3. update_internal; 4. update_between (lowest priority).
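
A heap-based stand-in for the prioritised pool (the per-cache local stacks are omitted, and this single-threaded sketch is not the HSL implementation): tasks are drawn in the priority order given on the slide, FIFO within a priority level.

```python
import heapq

# priority order from the slide: factorize first, update_between last
PRIORITY = {'factorize': 1, 'solve': 2, 'update_internal': 3, 'update_between': 4}

class TaskPool:
    def __init__(self):
        self._heap = []
        self._seq = 0                 # tie-breaker: FIFO within a priority

    def add(self, kind, payload=None):
        heapq.heappush(self._heap, (PRIORITY[kind], self._seq, kind, payload))
        self._seq += 1

    def draw(self):
        """Pop the highest-priority task as (kind, payload)."""
        _, _, kind, payload = heapq.heappop(self._heap)
        return (kind, payload)
```

Prioritising factorize and solve tasks keeps the critical path moving: they unlock many update tasks, whereas updates only retire work.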

48 Sparse DAG results. Results on a machine with two Intel E5420 quad-core processors. (Table: factorization time and speedup on 1 and 8 cores for the problems DNVS/thread, GHS_psdef/apache, Koutsovasilis/F, JGD_Trefethen/Trefethen_20000b and ND/nd24k; the numeric values are not reproduced here.)

49 Comparisons with other solvers, one thread

50 Comparisons with other solvers, 8 threads


52 Indefinite case. The sparse DAG approach is very encouraging for multicore architectures. BUT: so far we have only considered the positive definite case. The indefinite case is harder because of pivoting. We are currently looking at block-column dependency counts and at combining factor and solve tasks. We anticipate the speedups will not be quite so good.

53 Code availability. The new sparse DAG code is HSL_MA87. It will shortly be available as part of HSL. If you want to try it out, let us know.


Arithmetic and Algebra of Matrices Math 572: Algebra for Middle School Teachers The University of Montana 1 The Real Numbers 2 Classroom Connection: Systems of Linear Equations 3 Rational Numbers 4 Irrational

### OVERVIEW AND PERFORMANCE ANALYSIS OF THE EPETRA/OSKI MATRIX CLASS IN TRILINOS

CSRI Summer Proceedings 8 OVERVIEW AND PERFORMANCE ANALYSIS OF THE EPETRA/OSKI MATRIX CLASS IN TRILINOS I. KARLIN STUDENT AND J. HU MENTOR Abstract. In this paper, we describe a new matrix class in Epetra

Load Balancing Backtracking, branch & bound and alpha-beta pruning: how to assign work to idle processes without much communication? Additionally for alpha-beta pruning: implementing the young-brothers-wait

### Technical Report. Parallel iterative solution method for large sparse linear equation systems. Rashid Mehmood, Jon Crowcroft. Number 650.

Technical Report UCAM-CL-TR-65 ISSN 1476-2986 Number 65 Computer Laboratory Parallel iterative solution method for large sparse linear equation systems Rashid Mehmood, Jon Crowcroft October 25 15 JJ Thomson

### A linear combination is a sum of scalars times quantities. Such expressions arise quite frequently and have the form

Section 1.3 Matrix Products A linear combination is a sum of scalars times quantities. Such expressions arise quite frequently and have the form (scalar #1)(quantity #1) + (scalar #2)(quantity #2) +...

### Adaptive Time-Dependent CFD on Distributed Unstructured Meshes

Adaptive Time-Dependent CFD on Distributed Unstructured Meshes Chris Walshaw and Martin Berzins School of Computer Studies, University of Leeds, Leeds, LS2 9JT, U K e-mails: chris@scsleedsacuk, martin@scsleedsacuk

### Definition: A square matrix A is block diagonal if A has the form A 1 O O O A 2 O A =

The question we want to answer now is the following: If A is not similar to a diagonal matrix, then what is the simplest matrix that A is similar to? Before we can provide the answer, we will have to introduce

### A distributed CPU-GPU sparse direct solver

A distributed CPU-GPU sparse direct solver Piyush Sao 1, Richard Vuduc 1, and Xiaoye Li 2 1 Georgia Institute of Technology, {piyush3,richie}@gatech.edu 2 Lawrence Berkeley National Laboratory, xsli@lbl.gov

### Modification of the Minimum-Degree Algorithm by Multiple Elimination

Modification of the Minimum-Degree Algorithm by Multiple Elimination JOSEPH W. H. LIU York University The most widely used ordering scheme to reduce fills and operations in sparse matrix computation is

### BLM 413E - Parallel Programming Lecture 3

BLM 413E - Parallel Programming Lecture 3 FSMVU Bilgisayar Mühendisliği Öğr. Gör. Musa AYDIN 14.10.2015 2015-2016 M.A. 1 Parallel Programming Models Parallel Programming Models Overview There are several

### Linear Algebra Notes

Linear Algebra Notes Chapter 19 KERNEL AND IMAGE OF A MATRIX Take an n m matrix a 11 a 12 a 1m a 21 a 22 a 2m a n1 a n2 a nm and think of it as a function A : R m R n The kernel of A is defined as Note