Matrix Multiplication

CPS343 Parallel and High Performance Computing, Spring 2016

Outline

1. Matrix operations
   - Importance
   - Dense and sparse matrices
   - Matrices and arrays
2. Matrix-vector multiplication
   - Row-sweep algorithm
   - Column-sweep algorithm
3. Matrix-matrix multiplication
   - Standard algorithm
   - ijk-forms

Definition of a matrix

A matrix is a rectangular two-dimensional array of numbers. We say a matrix is $m \times n$ if it has $m$ rows and $n$ columns. These values are sometimes called the dimensions of the matrix. Note that, in contrast to Cartesian coordinates, we specify the number of rows (the vertical dimension) first and then the number of columns (the horizontal dimension). In most contexts, the rows and columns are numbered starting with 1. Several programming APIs, however, index rows and columns from 0. We use $a_{ij}$ to refer to the entry in the $i$th row and $j$th column of the matrix $A$.

Matrices are extremely important in HPC

While it's certainly not the case that high performance computing involves only computing with matrices, matrix operations are key to many important HPC applications. Many important applications can be reduced to operations on matrices, including (but not limited to)

1. searching and sorting
2. numerical simulation of physical processes
3. optimization

The list of the top 500 supercomputers in the world (found at http://www.top500.org) is determined by a benchmark program that performs matrix operations. Like most benchmark programs, however, this is just one measure. It does not predict the relative performance of a supercomputer on non-matrix problems, or even on different matrix-based problems.

Dense matrices

The $m \times n$ matrix $A$ is dense if all or most of its entries are nonzero. Storing a dense matrix (sometimes called a full matrix) requires storing all $mn$ elements of the matrix. Usually an array data structure is used to store a dense matrix.

Dense matrix example

[Figure: complete weighted graph on the six nodes A, B, C, D, E, F, with edge weights 1 through 15.]

Find a matrix to represent this complete graph if the $ij$ entry contains the weight of the edge connecting the node corresponding to row $i$ with the node corresponding to column $j$. Use the value 0 if a connection is missing.

$$
\begin{bmatrix}
0 & 1 & 2 & 3 & 4 & 5 \\
1 & 0 & 6 & 7 & 8 & 9 \\
2 & 6 & 0 & 10 & 11 & 12 \\
3 & 7 & 10 & 0 & 13 & 14 \\
4 & 8 & 11 & 13 & 0 & 15 \\
5 & 9 & 12 & 14 & 15 & 0
\end{bmatrix}
$$

Dense matrix example

Notes on the matrix from the dense matrix example:

- This is considered a dense matrix even though it contains zeros.
- This matrix is symmetric, meaning that $a_{ij} = a_{ji}$.
- What would be a good way to store this matrix?
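One possible answer to the storage question can be sketched as follows. This is only an illustration, assuming that the upper triangle (including the diagonal) is packed row by row into a one-dimensional array with 0-based indices; the names `packed_index` and `sym_get` are hypothetical and do not come from the slides.

```c
#include <stddef.h>

/* Offset of entry (i, j), 0-based, of an n x n symmetric matrix whose
   upper triangle is packed row by row into n*(n+1)/2 values. */
size_t packed_index(size_t n, size_t i, size_t j)
{
    if (i > j) {                 /* use symmetry: a(i,j) == a(j,i) */
        size_t t = i;
        i = j;
        j = t;
    }
    /* rows 0..i-1 hold n + (n-1) + ... + (n-i+1) = i*(2n-i+1)/2 values */
    return i * (2 * n - i + 1) / 2 + (j - i);
}

/* Read a(i,j) from the packed array p. */
double sym_get(const double *p, size_t n, size_t i, size_t j)
{
    return p[packed_index(n, i, j)];
}
```

For the $6 \times 6$ example this keeps 21 values instead of 36, at the cost of a little index arithmetic on every access.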

Sparse matrices

A matrix is sparse if most of its entries are zero. Here "most" usually means more than a simple majority: we expect the number of zeros to far exceed the number of nonzeros. It is often most efficient to store only the nonzero entries of a sparse matrix, but this requires that location information also be stored. Arrays and lists are most commonly used to store sparse matrices.

Sparse matrix example

[Figure: weighted graph on the six nodes A, B, C, D, E, F, with six edges of weights 3, 5, 6, 9, 14, and 15.]

Find a matrix to represent this graph if the $ij$ entry contains the weight of the edge connecting the node corresponding to row $i$ with the node corresponding to column $j$. As before, use the value 0 if a connection is missing.

$$
\begin{bmatrix}
0 & 0 & 0 & 3 & 0 & 5 \\
0 & 0 & 6 & 0 & 0 & 9 \\
0 & 6 & 0 & 0 & 0 & 0 \\
3 & 0 & 0 & 0 & 0 & 14 \\
0 & 0 & 0 & 0 & 0 & 15 \\
5 & 9 & 0 & 14 & 15 & 0
\end{bmatrix}
$$

Sparse matrix example

Sometimes it's helpful to leave out the zeros to better see the structure of the matrix:

$$
\begin{bmatrix}
0 & 0 & 0 & 3 & 0 & 5 \\
0 & 0 & 6 & 0 & 0 & 9 \\
0 & 6 & 0 & 0 & 0 & 0 \\
3 & 0 & 0 & 0 & 0 & 14 \\
0 & 0 & 0 & 0 & 0 & 15 \\
5 & 9 & 0 & 14 & 15 & 0
\end{bmatrix}
=
\begin{bmatrix}
 &  &  & 3 &  & 5 \\
 &  & 6 &  &  & 9 \\
 & 6 &  &  &  &  \\
3 &  &  &  &  & 14 \\
 &  &  &  &  & 15 \\
5 & 9 &  & 14 & 15 &
\end{bmatrix}
$$

This matrix is also symmetric. How could it be stored efficiently?
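One common way to store only the nonzeros together with their locations is a list of (row, column, value) triples, often called coordinate format. The sketch below is an illustration under stated assumptions, not the scheme from the lecture: it keeps only the upper triangle of a symmetric matrix, uses 0-based indices, and the names `coo_entry` and `coo_sym_get` are hypothetical.

```c
#include <stddef.h>

/* One stored nonzero: 0-based row index, column index, and value. */
typedef struct {
    int row;
    int col;
    double val;
} coo_entry;

/* Look up a(i,j) in a symmetric matrix whose upper-triangle nonzeros
   are listed in e[0..nnz-1]. Entries that are not stored are zero.
   Linear search, so each lookup costs O(nnz). */
double coo_sym_get(const coo_entry *e, size_t nnz, int i, int j)
{
    if (i > j) {                 /* use symmetry: a(i,j) == a(j,i) */
        int t = i;
        i = j;
        j = t;
    }
    for (size_t k = 0; k < nnz; k++) {
        if (e[k].row == i && e[k].col == j) {
            return e[k].val;
        }
    }
    return 0.0;
}
```

For the example graph, six triples describe the whole $6 \times 6$ matrix. Random lookups are slow in this format, so it suits operations that sweep over all stored entries better than element-by-element access.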

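Banded matrices

An important type of sparse matrix is the banded matrix, in which the nonzeros lie along diagonals close to the main diagonal. Example:

$$
\begin{bmatrix}
3 & 1 & 6 & 0 & 0 & 0 & 0 \\
4 & 8 & 5 & 0 & 0 & 0 & 0 \\
1 & 2 & 1 & 1 & 3 & 0 & 0 \\
0 & 1 & 0 & 4 & 2 & 6 & 0 \\
0 & 0 & 6 & 9 & 5 & 2 & 5 \\
0 & 0 & 0 & 7 & 1 & 8 & 7 \\
0 & 0 & 0 & 0 & 4 & 4 & 9
\end{bmatrix}
=
\begin{bmatrix}
3 & 1 & 6 &  &  &  &  \\
4 & 8 & 5 & 0 &  &  &  \\
1 & 2 & 1 & 1 & 3 &  &  \\
 & 1 & 0 & 4 & 2 & 6 &  \\
 &  & 6 & 9 & 5 & 2 & 5 \\
 &  &  & 7 & 1 & 8 & 7 \\
 &  &  &  & 4 & 4 & 9
\end{bmatrix}
$$

The bandwidth of this matrix is 5.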
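A banded matrix can be stored compactly by keeping only the diagonals inside the band. The following is a minimal sketch, assuming a bandwidth of 5 (two diagonals on each side of the main diagonal) and a layout in which row $i$ of the compact array holds $a_{i,i-2}$ through $a_{i,i+2}$; the names `HALF_BW`, `BW`, and `band_get` are hypothetical.

```c
#include <stdlib.h>

#define HALF_BW 2                /* diagonals on each side of the main one */
#define BW (2 * HALF_BW + 1)     /* total bandwidth: 5 in this sketch */

/* Return a(i,j), 0-based, from band storage. Row i of the matrix keeps
   a(i,i-2)..a(i,i+2) in band[i*BW] .. band[i*BW + 4]; slots that fall
   outside the matrix are unused. Anything outside the band is zero. */
double band_get(const double *band, int i, int j)
{
    int d = j - i;               /* which diagonal: 0 is the main one */
    if (abs(d) > HALF_BW) {
        return 0.0;
    }
    return band[i * BW + d + HALF_BW];
}
```

With this layout the $7 \times 7$ example needs 35 stored values instead of 49, and the saving grows quickly as the matrix gets larger while the bandwidth stays fixed.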

Using two-dimensional arrays

It is natural to use a 2D array to store a dense or banded matrix. Unfortunately, there are a couple of significant issues that complicate this seemingly simple approach.

1. Whether the storage pattern is row-major or column-major is language dependent.
2. It is not possible to dynamically allocate two-dimensional arrays in C and C++, at least not without pointer storage and manipulation overhead.
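The pointer overhead mentioned in the second point can be seen in one common workaround, sketched here as an illustration rather than as the approach used in the course. It assumes `m` and `n` are both positive; the names `matrix_alloc` and `matrix_free` are hypothetical.

```c
#include <stdlib.h>

/* Allocate an m x n matrix (m, n > 0) as one contiguous, zero-filled
   block plus an array of m row pointers, so that a[i][j] works and the
   data stays in row-major order. Returns NULL on failure. */
double **matrix_alloc(size_t m, size_t n)
{
    double **a = malloc(m * sizeof *a);
    if (a == NULL) {
        return NULL;
    }
    a[0] = calloc(m * n, sizeof **a);    /* the data block itself */
    if (a[0] == NULL) {
        free(a);
        return NULL;
    }
    for (size_t i = 1; i < m; i++) {
        a[i] = a[0] + i * n;             /* row i starts i*n elements in */
    }
    return a;
}

/* Release a matrix obtained from matrix_alloc. */
void matrix_free(double **a)
{
    if (a != NULL) {
        free(a[0]);
        free(a);
    }
}
```

Keeping the data in a single block means `a[0]` can be handed to routines that expect a flat array, while the row pointers are the extra storage the slide refers to.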

Row-major storage

Both C and C++ use what is often called a row-major storage pattern for 2D matrices. In C and C++, the last index in a multidimensional array indexes contiguous memory locations. Thus a[i][j] and a[i][j+1] are adjacent in memory.

Example: the matrix

$$
\begin{bmatrix}
0 & 1 & 2 & 3 & 4 \\
5 & 6 & 7 & 8 & 9
\end{bmatrix}
$$

is laid out in memory as

    0 1 2 3 4 5 6 7 8 9

The stride between adjacent elements in the same row is 1. The stride between adjacent elements in the same column is the row length (5 in this example).

Column-major storage

In Fortran, 2D matrices are stored in column-major form. The first index in a multidimensional array indexes contiguous memory locations. Thus a(i,j) and a(i+1,j) are adjacent in memory.

Example: the matrix

$$
\begin{bmatrix}
0 & 1 & 2 & 3 & 4 \\
5 & 6 & 7 & 8 & 9
\end{bmatrix}
$$

is laid out in memory as

    0 5 1 6 2 7 3 8 4 9

The stride between adjacent elements in the same row is the column length (2 in this example), while the stride between adjacent elements in the same column is 1. Notice that if C is used to read a matrix stored by Fortran (or vice versa), the transpose of the matrix will be read.
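The two storage patterns differ only in how a pair of indices is turned into a memory offset. A minimal sketch, assuming 0-based indices and flat arrays; the function names `idx_row_major` and `idx_col_major` are hypothetical.

```c
#include <stddef.h>

/* Offset of element (i, j) of a matrix with n columns stored in
   row-major order (the C convention). */
size_t idx_row_major(size_t i, size_t j, size_t n)
{
    return i * n + j;
}

/* Offset of element (i, j) of a matrix with m rows stored in
   column-major order (the Fortran convention). */
size_t idx_col_major(size_t i, size_t j, size_t m)
{
    return i + j * m;
}
```

Applying `idx_row_major` to a buffer that was filled in column-major order, with the roles of rows and columns swapped, returns the entries of the transpose, which is the effect described above.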

Significance of array ordering

There are two main reasons why HPC programmers need to be aware of this issue:

1. Memory access patterns can have a dramatic impact on performance, especially on modern systems with a complicated memory hierarchy. These code segments access the same elements of an array, but the order of accesses is different.

   Access by rows:

       for (i = 0; i < 2; i++)
           for (j = 0; j < 5; j++)
               a[i][j] = ...

   Access by columns:

       for (j = 0; j < 5; j++)
           for (i = 0; i < 2; i++)
               a[i][j] = ...

2. Many important numerical libraries (e.g. LAPACK) are written in Fortran. To use them with C or C++ the programmer must often work with a transposed matrix.

Row-sweep matrix-vector multiplication

Row-major matrix-vector product $y = Ax$, where $A$ is $M \times N$:

    for (i = 0; i < M; i++) {
        y[i] = 0.0;
        for (j = 0; j < N; j++) {
            y[i] += a[i][j] * x[j];
        }
    }

- Matrix elements are accessed in row-major order.
- There are repeated consecutive updates to y[i]; we can usually assume the compiler will optimize this.
- This is also called the inner product form, since the $i$th entry of $y$ is the result of an inner product between the $i$th row of $A$ and the vector $x$.

Column-sweep matrix-vector multiplication

Column-major matrix-vector product $y = Ax$, where $A$ is $M \times N$:

    for (i = 0; i < M; i++) {
        y[i] = 0.0;
    }
    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            y[i] += a[i][j] * x[j];
        }
    }

- Matrix elements are accessed in column-major order.
- There are repeated updates to y[i], but every element of y is updated once before any element is updated again.
- This is also called the outer product form.

Matrix-vector algorithm comparison

Which of these two algorithms will run faster? Why?

Row-sweep form:

    for (i = 0; i < M; i++) {
        y[i] = 0.0;
        for (j = 0; j < N; j++) {
            y[i] += a[i][j] * x[j];
        }
    }

Column-sweep form:

    for (i = 0; i < M; i++) {
        y[i] = 0.0;
    }
    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            y[i] += a[i][j] * x[j];
        }
    }

Matrix-vector algorithm comparison

Answer: it depends...

Both algorithms carry out the same operations but do so in a different order. In particular, the memory access patterns are quite different. The row-sweep form will typically work better in a language like C or C++, which stores 2D arrays in row-major form. Since Fortran stores 2D arrays column by column, it is usually best to use the column-sweep form there.

Operation counts

To compute the computation rate in MFLOPS or GFLOPS we need to know the number of floating point operations carried out and the elapsed time. Division is usually the most expensive of the four basic operations, followed by multiplication. Addition and subtraction are equivalent in terms of time. We typically count all four operations on floating point numbers when calculating FLOPS but ignore integer operations like array subscript calculations.

In the case of a matrix-vector product, the innermost loop body contains a multiplication and an addition:

$$
y_i = y_i + a_{ij} x_j
$$

In the row-sweep form the outer loop runs $M$ times and the inner loop runs $N$ times for each outer iteration, so the loop body executes $MN$ times and there are a total of $2MN$ FLOPs.
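The arithmetic behind a reported rate can be sketched in a few lines. This is only an illustration: the names `matvec_flops` and `gflops` are hypothetical, and the elapsed time is assumed to have been measured elsewhere and passed in as seconds.

```c
/* Floating point operations in an m x n matrix-vector product:
   one multiplication and one addition per matrix entry. */
double matvec_flops(int m, int n)
{
    return 2.0 * (double)m * (double)n;
}

/* Computation rate in GFLOPS for a given operation count and
   elapsed time in seconds (seconds must be positive). */
double gflops(double flops, double seconds)
{
    return flops / seconds / 1.0e9;
}
```

For example, a $1000 \times 2000$ matrix-vector product performs $2 \cdot 1000 \cdot 2000 = 4 \times 10^6$ floating point operations, so finishing it in 0.5 seconds would correspond to 0.008 GFLOPS, or 8 MFLOPS.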

The textbook algorithm

Consider the problem of multiplying two matrices, $C = AB$.

[Example: numeric matrices $A$ and $B$.]

The standard textbook algorithm to form the product $C$ of the $M \times P$ matrix $A$ and the $P \times N$ matrix $B$ is based on the inner product. The $c_{ij}$ entry in the product is the inner product (or dot product) of the $i$th row of $A$ and the $j$th column of $B$.

The textbook algorithm: ijk-form

Matrix-matrix product pseudocode:

    for i = 1 to M
        for j = 1 to N
            c(i,j) = 0
            for k = 1 to P
                c(i,j) = c(i,j) + a(i,k) * b(k,j)

This is known as the ijk-form of the product due to the loop ordering.

What is the operation count? The number of FLOPs is $2MNP$. For square matrices $M = N = P = n$, so the number of FLOPs is $2n^3$.

The textbook algorithm: ijk-form

In the ijk-form pseudocode:

- Notice that A is accessed row by row but B is accessed column by column.
- The column index for C varies faster than the row index, but both are constant with respect to the inner loop, so this is much less significant.
- Regardless of the language we use (C or Fortran), we have an efficient access pattern for one matrix but not for the other.
- Could we improve things by rearranging the order of operations?
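The access pattern described above becomes visible when the ijk-form is written out in C. The sketch below is an illustration under stated assumptions, not code from the course: the matrices are stored row-major in flat arrays with 0-based indices, and the function name `matmul_ijk` is hypothetical.

```c
/* C = A * B using the ijk loop order. A is m x p, B is p x n, and C is
   m x n, all stored row-major in flat arrays. */
void matmul_ijk(int m, int n, int p,
                const double *a, const double *b, double *c)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;            /* accumulates c(i,j) */
            for (int k = 0; k < p; k++) {
                /* a is read with stride 1, b with stride n */
                sum += a[i * p + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

In the inner loop the index into `a` advances by 1 while the index into `b` advances by `n`, which is the mismatch between the two access patterns that the slide points out.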

The ijk forms

From linear algebra we know that column $j$ of the matrix-matrix product $AB$ is a linear combination of the columns of $A$, using the values in the $j$th column of $B$ as weights. The pseudocode for this is:

    for j = 1 to N
        for i = 1 to M
            c(i,j) = 0.0
        for k = 1 to P
            for i = 1 to M
                c(i,j) = c(i,j) + a(i,k) * b(k,j)

- The loop ordering changes but the innermost statement is unchanged.
- The initialization of values in C is done one column at a time.
- The operation count is still $2MNP$.
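The column-oriented ordering above (the jki-form) can be sketched in C as follows. This is an illustration under stated assumptions: the matrices are stored column-major in flat arrays, as Fortran would store them, indices are 0-based, and the function name `matmul_jki` is hypothetical.

```c
/* C = A * B using the jki loop order. A is m x p, B is p x n, and C is
   m x n, all stored column-major in flat arrays (Fortran layout). */
void matmul_jki(int m, int n, int p,
                const double *a, const double *b, double *c)
{
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            c[i + j * m] = 0.0;          /* clear column j of C */
        }
        for (int k = 0; k < p; k++) {
            double bkj = b[k + j * p];   /* weight for column k of A */
            for (int i = 0; i < m; i++) {
                /* column j of C += bkj * column k of A, both stride 1 */
                c[i + j * m] += a[i + k * m] * bkj;
            }
        }
    }
}
```

With column-major storage the innermost loop walks down one column of A and one column of C, so both are accessed with stride 1, and the entry of B is constant within that loop.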

The ijk forms

Other loop orderings are possible... How many?

How many ways can i, j, and k be arranged? Recall from discrete math that this is a permutation problem. There are three ways to choose the first letter, two ways to choose the second, and one way to choose the third:

$$
3 \cdot 2 \cdot 1 = 6
$$

There are six possible loop orderings. We'll work with all six on Friday.