Notes on Matrix Calculus



Paul L. Fackler
North Carolina State University
September 27, 2005

[Author note: Paul L. Fackler is an Associate Professor in the Department of Agricultural and Resource Economics at North Carolina State University. These notes are copyrighted material. They may be freely copied for individual use but should be appropriately referenced in published work. Mail: Department of Agricultural and Resource Economics, NCSU, Box 8109, Raleigh NC, 27695, USA. E-mail: paul_fackler@ncsu.edu. Web-site: http://www4.ncsu.edu/~pfackler/. © 2005, Paul L. Fackler.]

Matrix calculus is concerned with rules for operating on functions of matrices. For example, suppose that an m × n matrix X is mapped into a p × q matrix Y. We are interested in obtaining expressions for derivatives such as

    ∂Y_ij/∂X_kl,  for all i, j and k, l.

The main difficulty here is keeping track of where things are put. Rather than tracking individual subscripts, it is far better to use a system that orders the results using matrix operations. Matrix calculus makes heavy use of the vec operator and Kronecker products.

The vec operator vectorizes a matrix by stacking its columns (it is conventional to stack columns rather than rows). For example, vectorizing the matrix

    [ 1  2
      3  4
      5  6 ]

produces

    (1, 3, 5, 2, 4, 6)′.

The Kronecker product of two matrices A and B, where A is m × n and B is p × q, is defined as

    A ⊗ B = [ A_11 B  A_12 B  ...  A_1n B
              A_21 B  A_22 B  ...  A_2n B
              ...
              A_m1 B  A_m2 B  ...  A_mn B ],

which is an mp × nq matrix. There is an important relationship between the Kronecker product and the vec operator:

    vec(AXB) = (B′ ⊗ A) vec(X).

This relationship is extremely useful in deriving matrix calculus results.

Another matrix operator related to the vec operator will prove useful. Define T_{m,n} as the matrix that transforms vec(A) into vec(A′):

    T_{m,n} vec(A) = vec(A′).

Note that the size of this matrix is mn × mn. T_{m,n} has a number of special properties. The first is clear from its definition: if T_{m,n} is applied to the vec of an m × n matrix and T_{n,m} is then applied to the result, the original vectorized matrix is recovered:

    T_{n,m} T_{m,n} vec(A) = vec(A).

Thus

    T_{n,m} T_{m,n} = I_{mn}.

The fact that T_{n,m} = T_{m,n}⁻¹ follows directly.
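These identities are easy to spot-check numerically. The following sketch is my own and is not part of the original notes; it assumes Python with NumPy, and the helper names vec and comm are mine. It builds T_{m,n} explicitly and verifies the vec-Kronecker relationship on random matrices:

    import numpy as np

    def vec(M):
        # Column-stacking vectorization: Fortran ("F") order.
        return M.reshape(-1, 1, order="F")

    def comm(m, n):
        # T_{m,n}: the mn x mn permutation with T_{m,n} vec(A) = vec(A').
        T = np.zeros((m * n, m * n))
        for i in range(m):
            for j in range(n):
                T[j + i * n, i + j * m] = 1.0   # vec(A')[j + i*n] = vec(A)[i + j*m]
        return T

    E = np.array([[1, 2], [3, 4], [5, 6]])
    print(vec(E).ravel())                       # [1 3 5 2 4 6], as in the example above

    rng = np.random.default_rng(0)
    A, X, B = rng.random((2, 3)), rng.random((3, 4)), rng.random((4, 5))
    assert np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X))

    m, n = 3, 4
    C = rng.random((m, n))
    assert np.allclose(comm(m, n) @ vec(C), vec(C.T))
    assert np.allclose(comm(n, m) @ comm(m, n), np.eye(m * n))   # T_{n,m} T_{m,n} = I_{mn}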

Perhaps less obvious is that

    T_{m,n}′ = T_{n,m}

(combining these results also shows that T_{m,n} is an orthogonal matrix). The matrix T_{m,n} is a permutation matrix, i.e., it is composed of 0s and 1s, with a single 1 in each row and column. When it premultiplies another matrix, it simply rearranges the ordering of that matrix's rows (postmultiplying by T_{m,n} rearranges columns).

The transpose matrix is also related to the Kronecker product. With A and B defined as above,

    B ⊗ A = T_{p,m} (A ⊗ B) T_{n,q}.

This can be shown by introducing an arbitrary n × q matrix C:

    T_{p,m} (A ⊗ B) T_{n,q} vec(C) = T_{p,m} (A ⊗ B) vec(C′)
                                   = T_{p,m} vec(B C′ A′)
                                   = vec(A C B′)
                                   = (B ⊗ A) vec(C).

This implies that

    ((B ⊗ A) − T_{p,m} (A ⊗ B) T_{n,q}) vec(C) = 0.

Because C is arbitrary, the desired result must hold. An immediate corollary to the above result is that

    (A ⊗ B) T_{n,q} = T_{m,p} (B ⊗ A).

It is also useful to note that T_{1,m} = T_{m,1} = I_m. Thus, if A is 1 × n, then

    (A ⊗ B) T_{n,q} = B ⊗ A.

When working with derivatives of scalars this can result in considerable simplification.
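A numerical spot-check of the transpose relationship and its corollary (again a sketch of mine, with the comm helper from the previous snippet repeated for completeness):

    import numpy as np

    def comm(m, n):
        T = np.zeros((m * n, m * n))
        for i in range(m):
            for j in range(n):
                T[j + i * n, i + j * m] = 1.0
        return T

    rng = np.random.default_rng(1)
    m, n, p, q = 2, 3, 4, 5
    A, B = rng.random((m, n)), rng.random((p, q))
    # B (x) A = T_{p,m} (A (x) B) T_{n,q}
    assert np.allclose(np.kron(B, A), comm(p, m) @ np.kron(A, B) @ comm(n, q))
    # Corollary: (A (x) B) T_{n,q} = T_{m,p} (B (x) A)
    assert np.allclose(np.kron(A, B) @ comm(n, q), comm(m, p) @ np.kron(B, A))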

Turning now to calculus, define the derivative of a function mapping R^n → R^m as the m × n matrix of partial derivatives:

    [Df]_ij = ∂f_i(x)/∂x_j.

For example, the simplest derivative is

    dAx/dx = A.

Using this definition, the usual rules for manipulating derivatives apply naturally, provided one respects the rules of matrix conformability. The summation rule is obvious:

    D[αf(x) + βg(x)] = αDf(x) + βDg(x),

where α and β are scalars.

The chain rule involves matrix multiplication, which requires conformability. Given two functions f : R^n → R^m and g : R^p → R^n, the derivative of the composite function is

    D[f(g(x))] = f′(g(x)) g′(x).

Notice that this satisfies matrix multiplication conformability, whereas the expression g′(x) f′(g(x)) attempts to postmultiply an n × p matrix by an m × n matrix.

To define a product rule, consider the expression f(x)′g(x), where f, g : R^n → R^m. The derivative is the 1 × n vector given by

    D[f(x)′g(x)] = g(x)′f′(x) + f(x)′g′(x).

Notice that no other way of multiplying g by f′ and f by g′ would ensure conformability. A more general version of the product rule is defined below.

The product rule leads to a useful result about quadratic functions:

    dx′Ax/dx = x′A′ + x′A = x′(A + A′).

When A is symmetric this has the very natural form dx′Ax/dx = 2x′A.
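The vector rules can be confirmed by finite differences. A brief sketch of mine (NumPy assumed) for the quadratic-form gradient:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 4
    A, x = rng.random((n, n)), rng.random(n)
    f = lambda z: z @ A @ z                      # f(x) = x'Ax

    eps = 1e-6
    grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                        for e in np.eye(n)])     # central differences
    assert np.allclose(grad_fd, x @ (A + A.T), atol=1e-6)   # x'(A + A')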

These rules define derivatives for vectors. Derivatives of matrices with respect to matrices are defined by vectorizing the matrices, so dA(x)/dx means the same thing as dvec(A(x))/dvec(x). This is where the relationship between the vec operator and Kronecker products is useful. Consider differentiating x′Ax with respect to A (rather than with respect to x, as above):

    dvec(x′Ax)/dvec(A) = d[(x′ ⊗ x′) vec(A)]/dvec(A) = x′ ⊗ x′

(the derivative of an m × n matrix A with respect to itself is I_{mn}).

A more general product rule can be defined. Suppose that f : R^n → R^{m×p} and g : R^n → R^{p×q}, so f(x)g(x) : R^n → R^{m×q}. Using the relationship between the vec and Kronecker product operators,

    vec(I_m f(x) g(x) I_q) = (g(x)′ ⊗ I_m) vec(f(x)) = (I_q ⊗ f(x)) vec(g(x)).

A natural product rule is therefore

    D[f(x)g(x)] = (g(x)′ ⊗ I_m) f′(x) + (I_q ⊗ f(x)) g′(x).

This can be used to determine the derivative of A′A, where A is m × n:

    vec(A′A) = (I_n ⊗ A′) vec(A) = (A′ ⊗ I_n) vec(A′) = (A′ ⊗ I_n) T_{m,n} vec(A).

Thus (using the product rule)

    dA′A/dA = (I_n ⊗ A′) + (A′ ⊗ I_n) T_{m,n}.

This can be simplified somewhat by noting that

    (A′ ⊗ I_n) T_{m,n} = T_{n,n} (I_n ⊗ A′).

Thus

    dA′A/dA = (I_{n²} + T_{n,n})(I_n ⊗ A′).

The product rule is also useful in determining the derivative of a matrix inverse:

    dA⁻¹A/dA = (A′ ⊗ I_n) dA⁻¹/dA + (I_n ⊗ A⁻¹).

But A⁻¹A is identically equal to I, so its derivative is identically 0. Thus

    dA⁻¹/dA = −(A′ ⊗ I_n)⁻¹ (I_n ⊗ A⁻¹) = −(A′⁻¹ ⊗ I_n)(I_n ⊗ A⁻¹) = −(A′⁻¹ ⊗ A⁻¹).

It is also useful to have an expression for the derivative of a determinant. Suppose A is n × n with |A| ≠ 0. The determinant can be written as the product of the ith row of the adjoint of A (denoted A*) with the ith column of A:

    |A| = A*_i A_i.

Recall that the elements of the ith row of A* are not influenced by the elements in the ith column of A, and hence

    ∂|A|/∂A_i = A*_i.

To obtain the derivative with respect to all of the elements of A, we can concatenate the partial derivatives with respect to each column of A:

    d|A|/dA = [A*_1  A*_2  ...  A*_n] = |A| [[A⁻¹]_1  [A⁻¹]_2  ...  [A⁻¹]_n] = |A| vec(A′⁻¹)′,

where [A⁻¹]_i denotes the ith row of A⁻¹. The following result is an immediate consequence:

    d ln|A| / dA = vec(A′⁻¹)′.
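Both the inverse and determinant results can be verified element by element. A finite-difference sketch (mine, not from the notes):

    import numpy as np

    def vec(M):
        return M.reshape(-1, order="F")

    rng = np.random.default_rng(3)
    n = 3
    A = rng.random((n, n)) + n * np.eye(n)       # keep A well-conditioned
    Ainv = np.linalg.inv(A)

    eps = 1e-6
    J = np.zeros((n * n, n * n))                 # dvec(A^-1)/dvec(A)
    g = np.zeros(n * n)                          # d|A|/dvec(A)
    for k in range(n * n):
        dA = np.zeros((n, n))
        dA[np.unravel_index(k, (n, n), order="F")] = eps
        J[:, k] = (vec(np.linalg.inv(A + dA)) - vec(np.linalg.inv(A - dA))) / (2 * eps)
        g[k] = (np.linalg.det(A + dA) - np.linalg.det(A - dA)) / (2 * eps)

    assert np.allclose(J, -np.kron(Ainv.T, Ainv), atol=1e-5)           # -(A'^-1 (x) A^-1)
    assert np.allclose(g, np.linalg.det(A) * vec(Ainv.T), atol=1e-5)   # |A| vec(A'^-1)'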

Matrix differentiation results allow us to compute the derivatives of the solutions to certain classes of equilibrium problems. Consider, for example, the solution x to a linear complementarity problem LCP(M, q), which satisfies

    Mx + q ≥ 0,  x ≥ 0,  x′(Mx + q) = 0.

For each i, either x_i = 0 or the ith element of Mx + q equals 0. Define a diagonal matrix D such that D_ii = 1 if x_i > 0 and 0 otherwise. The solution can then be written as

    x = −M̂⁻¹Dq,  where M̂ = DM + I − D.

It follows that

    ∂x/∂q = −M̂⁻¹D

and, writing x = −(q′D ⊗ I) vec(M̂⁻¹) and applying the chain rule through M̂⁻¹, that

    ∂x/∂M = (∂x/∂M̂)(∂M̂/∂M)
          = (−q′D ⊗ I)(−(M̂′⁻¹ ⊗ M̂⁻¹))(I ⊗ D)
          = q′DM̂′⁻¹ ⊗ M̂⁻¹D
          = x′ ⊗ ∂x/∂q.
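A tiny worked instance (my own construction; the particular M and q are illustrative only): with M = I and q = (−1, 1)′ the LCP solution is x = (1, 0)′, so D = diag(1, 0) and the derivative formulas can be evaluated directly.

    import numpy as np

    M = np.eye(2)
    q = np.array([-1.0, 1.0])
    D = np.diag([1.0, 0.0])                  # x_1 > 0 binds; x_2 = 0 is slack
    Mhat = D @ M + np.eye(2) - D
    x = -np.linalg.solve(Mhat, D @ q)

    w = M @ x + q                            # check the complementarity conditions
    assert np.all(x >= 0) and np.all(w >= 0) and np.isclose(x @ w, 0.0)

    dx_dq = -np.linalg.inv(Mhat) @ D         # a 2 x 2 matrix
    dx_dM = np.kron(x, dx_dq)                # x' (x) dx/dq, a 2 x 4 matrix

Note that these derivatives are valid only locally, while the active set encoded in D stays unchanged.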

Given the prevalence of Kronecker products in matrix derivatives, it would be useful to have rules for computing derivatives of Kronecker products themselves, i.e., dA⊗B/dA and dA⊗B/dB. Because each element of a Kronecker product is the product of one element of A and one element of B, the derivative dA⊗B/dA must be composed of zeros and the elements of B arranged in a certain fashion; similarly, dA⊗B/dB is composed of zeros and the elements of A arranged in a certain fashion.

It can be verified that dA⊗B/dA can be written as

    dA⊗B/dA = I_n ⊗ Ψ,  where  Ψ = [ Ψ_1
                                      Ψ_2
                                      ...
                                      Ψ_q ]  and  Ψ_i = I_m ⊗ B_i,

with B_i denoting the ith column of B (so each Ψ_i is mp × m and Ψ is mpq × m). This can be written more compactly as

    dA⊗B/dA = (I_n ⊗ T_{q,m} ⊗ I_p)(I_{mn} ⊗ vec(B)) = (I_{nq} ⊗ T_{m,p})(I_n ⊗ vec(B) ⊗ I_m).

Similarly, dA⊗B/dB can be written as

    dA⊗B/dB = [ I_q ⊗ Θ_1
                I_q ⊗ Θ_2
                ...
                I_q ⊗ Θ_n ],  where  Θ_j = A_j ⊗ I_p,

with A_j denoting the jth column of A (so each Θ_j is mp × p). This can be written more compactly as

    dA⊗B/dB = (I_n ⊗ T_{q,m} ⊗ I_p)(vec(A) ⊗ I_{pq}) = (T_{n,q} ⊗ I_{mp})(I_q ⊗ vec(A) ⊗ I_p).

Notice that if either A is a row vector (m = 1) or B is a column vector (q = 1), the matrix (I_n ⊗ T_{q,m} ⊗ I_p) = I_{mnpq} and hence can be ignored.
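The compact formula for dA⊗B/dA can be checked by finite differences. The sketch below is mine (the vec and comm helpers from the earlier snippets are repeated so it runs standalone):

    import numpy as np

    def vec(M):
        return M.reshape(-1, order="F")

    def comm(m, n):
        T = np.zeros((m * n, m * n))
        for i in range(m):
            for j in range(n):
                T[j + i * n, i + j * m] = 1.0
        return T

    rng = np.random.default_rng(4)
    m, n, p, q = 2, 3, 3, 2
    A, B = rng.random((m, n)), rng.random((p, q))

    # (I_n (x) T_{q,m} (x) I_p)(I_{mn} (x) vec(B))
    formula = (np.kron(np.eye(n), np.kron(comm(q, m), np.eye(p)))
               @ np.kron(np.eye(m * n), vec(B).reshape(-1, 1)))

    eps = 1e-6
    fd = np.zeros((m * n * p * q, m * n))
    for k in range(m * n):
        dA = np.zeros((m, n))
        dA[np.unravel_index(k, (m, n), order="F")] = eps
        fd[:, k] = (vec(np.kron(A + dA, B)) - vec(np.kron(A - dA, B))) / (2 * eps)

    assert np.allclose(fd, formula, atol=1e-6)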

To illustrate a use for these relationships, consider the second derivative of xx′ with respect to x, an n-vector. First,

    dxx′/dx = x ⊗ I_n + I_n ⊗ x;

hence

    d²xx′/dxdx = (T_{n,n} ⊗ I_n)(I_n ⊗ vec(I_n) + vec(I_n) ⊗ I_n).

Another example is the second derivative of the matrix inverse:

    d²A⁻¹/dAdA = (I_n ⊗ T_{n,n} ⊗ I_n)[(I_{n²} ⊗ vec(A⁻¹)) T_{n,n} + (vec(A′⁻¹) ⊗ I_{n²})](A′⁻¹ ⊗ A⁻¹).

Often, especially in statistical applications, one encounters matrices that are symmetric. It would not make sense to take a derivative with respect to the (i,j)th element of a symmetric matrix while holding the (j,i)th element constant. Generally it is preferable to work with a vectorized version of a symmetric matrix that excludes either the upper or the lower portion of the matrix. The vech operator is typically taken to be the column-wise vectorization with the upper portion excluded:

    vech(A) = (A_11, ..., A_n1, A_22, ..., A_n2, ..., A_{n−1,n−1}, A_{n,n−1}, A_nn)′.

One obtains this by selecting elements of vec(A), and therefore we can write

    vech(A) = S_n vec(A),

where S_n is an n(n+1)/2 × n² matrix of 0s and 1s, with a single 1 in each row. The vech operator can be applied to lower triangular matrices as well; there is no reason to take derivatives with respect to the upper part of a lower triangular matrix (it can also be applied to the transpose of an upper triangular matrix). The vech operator is also important for efficient computer storage of symmetric and triangular matrices.
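A sketch of the selection matrix S_n (the helper name selection is mine; NumPy assumed):

    import numpy as np

    def vec(M):
        return M.reshape(-1, order="F")

    def selection(n):
        # S_n: n(n+1)/2 x n^2, picks the on- and below-diagonal entries
        # of vec(A), column by column.
        rows = []
        for j in range(n):
            for i in range(j, n):
                e = np.zeros(n * n)
                e[i + j * n] = 1.0
                rows.append(e)
        return np.array(rows)

    n = 3
    S = selection(n)
    A = np.arange(1.0, 10.0).reshape(n, n)
    A = (A + A.T) / 2                            # symmetrize
    # vech(A) = (A11, A21, A31, A22, A32, A33)'
    assert np.allclose(S @ vec(A), vec(A)[[0, 1, 2, 4, 5, 8]])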

To illustrate the use of the vech operator in matrix calculus applications, consider an n × n symmetric matrix C defined in terms of a lower triangular matrix L:

    C = LL′.

Using the already familiar methods, it is easy to see that

    dC/dL = (I_n ⊗ L) T_{n,n} + (L ⊗ I_n).

Using the chain rule,

    dvech(C)/dvech(L) = [dvech(C)/dvec(C)] [dvec(C)/dvec(L)] [dvec(L)/dvech(L)] = S_n (dC/dL) S_n′.

Inverting this expression provides an expression for dvech(L)/dvech(C).
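The chain-rule expression can be verified numerically. The sketch below is mine; it repeats the vec, comm, and selection helpers from the earlier snippets and differences vech(L) directly:

    import numpy as np

    def vec(M):
        return M.reshape(-1, order="F")

    def comm(m, n):
        T = np.zeros((m * n, m * n))
        for i in range(m):
            for j in range(n):
                T[j + i * n, i + j * m] = 1.0
        return T

    def selection(n):
        rows = []
        for j in range(n):
            for i in range(j, n):
                e = np.zeros(n * n)
                e[i + j * n] = 1.0
                rows.append(e)
        return np.array(rows)

    n = 3
    rng = np.random.default_rng(5)
    L = np.tril(rng.random((n, n))) + np.eye(n)   # positive diagonal
    S = selection(n)

    dC_dL = np.kron(np.eye(n), L) @ comm(n, n) + np.kron(L, np.eye(n))
    J = S @ dC_dL @ S.T                           # dvech(C)/dvech(L)

    eps = 1e-6
    nh = S.shape[0]                               # n(n+1)/2
    fd = np.zeros((nh, nh))
    for k in range(nh):
        dL = (eps * S.T[:, k]).reshape(n, n, order="F")
        fd[:, k] = S @ (vec((L + dL) @ (L + dL).T)
                        - vec((L - dL) @ (L - dL).T)) / (2 * eps)
    assert np.allclose(fd, J, atol=1e-6)

    dL_dC = np.linalg.inv(J)                      # dvech(L)/dvech(C)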

Related to matrix derivatives is the issue of Taylor expansions of matrix-to-matrix functions. One way to think of matrix derivatives is in terms of multidimensional arrays: an mn × pq matrix can also be thought of as an m × n × p × q 4-dimensional array. The reshape function in MATLAB implements such transformations; the ordering of the individual elements is not changed, only the way the elements are indexed. The dth order Taylor expansion of a function f(X) : R^{m×n} → R^{p×q} at X̄ can be computed in the following way:

    f = vec(f(X̄))
    dX = vec(X − X̄)
    for i = 1 to d {
        f_i = f^(i)(X̄) dX
        for j = 2 to i
            f_i = reshape(f_i, pq(mn)^(i−j), mn) dX / j
        f = f + f_i
    }
    f = reshape(f, p, q)

Here f^(i)(X̄), the ith derivative, is a pq(mn)^(i−1) × mn matrix; each pass through the inner loop contracts one factor of mn and divides by j, so the accumulated term equals the ith differential divided by i!.

The techniques described thus far can be applied to the computation of derivatives of common special functions. First, consider the derivative of a nonnegative integer power A^i of a square matrix A. Application of the product rule leads to the recursive definition

    dA^i/dA = d(A^{i−1}A)/dA = (A′ ⊗ I) dA^{i−1}/dA + (I ⊗ A^{i−1}),

which can also be expressed as a sum of i terms:

    dA^i/dA = Σ_{j=1}^{i} (A′)^{i−j} ⊗ A^{j−1}.

This result can be used to derive an expression for the derivative of the matrix exponential function, which is defined in the usual way in terms of a Taylor series:

    exp(A) = Σ_{i=0}^∞ A^i / i!.

Thus

    d exp(A)/dA = Σ_{i=0}^∞ (1/i!) Σ_{j=1}^{i} (A′)^{i−j} ⊗ A^{j−1}.

The same approach can be applied to the matrix natural logarithm, defined by the series

    ln(A) = −Σ_{i=1}^∞ (1/i) (I − A)^i

(convergent when the spectral radius of I − A is less than 1).
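As a final numerical check (my sketch, NumPy assumed), the power-derivative sum can be compared with finite differences:

    import numpy as np

    def vec(M):
        return M.reshape(-1, order="F")

    rng = np.random.default_rng(6)
    n, i = 3, 4
    A = rng.random((n, n))

    mp = np.linalg.matrix_power
    # sum_{j=1}^{i} (A')^{i-j} (x) A^{j-1}
    formula = sum(np.kron(mp(A.T, i - j), mp(A, j - 1)) for j in range(1, i + 1))

    eps = 1e-6
    fd = np.zeros((n * n, n * n))
    for k in range(n * n):
        dA = np.zeros((n, n))
        dA[np.unravel_index(k, (n, n), order="F")] = eps
        fd[:, k] = (vec(mp(A + dA, i)) - vec(mp(A - dA, i))) / (2 * eps)

    assert np.allclose(fd, formula, atol=1e-5)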

Summary of Operator Results

A is m × n, B is p × q, X is defined conformably.

    A ⊗ B = [ A_11 B  A_12 B  ...  A_1n B
              A_21 B  A_22 B  ...  A_2n B
              ...
              A_m1 B  A_m2 B  ...  A_mn B ]

    (A ⊗ B)(C ⊗ D) = AC ⊗ BD

    (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹

    (A ⊗ B)′ = A′ ⊗ B′

    vec(AXB) = (B′ ⊗ A) vec(X)

    trace(AX) = vec(A′)′ vec(X)

    trace(AX) = trace(XA)

    T_{m,n} vec(A) = vec(A′)

    T_{n,m} T_{m,n} = I_{mn}

    T_{n,m} = T_{m,n}⁻¹

    T_{m,n}′ = T_{n,m}

    B ⊗ A = T_{p,m} (A ⊗ B) T_{n,q}

Summary of Differentiation Results

A is m × n, B is p × q, x is n × 1, X is defined conformably.

    [Df]_ij = ∂f_i(x)/∂x_j

    dAx/dx = A

    D[αf(x) + βg(x)] = αDf(x) + βDg(x)

    D[f(g(x))] = f′(g(x)) g′(x)

    D[f(x)g(x)] = (g(x)′ ⊗ I_m) f′(x) + (I_q ⊗ f(x)) g′(x)   (f : R^n → R^{m×p}, g : R^n → R^{p×q})

    dx′Ax/dx = x′(A + A′)

    dvec(x′Ax)/dvec(A) = x′ ⊗ x′

    dA′A/dA = (I_{n²} + T_{n,n})(I_n ⊗ A′)

    dAA′/dA = (I_{m²} + T_{m,m})(A ⊗ I_m)

    dx′A′Ax/dA = 2 x′ ⊗ x′A′   (when x is a vector)

    dx′AA′x/dA = 2 x′A ⊗ x′    (when x is a vector)

    dAXB/dX = B′ ⊗ A

    dA⁻¹/dA = −(A′⁻¹ ⊗ A⁻¹)

    d ln|A| / dA = vec(A′⁻¹)′

    dtrace(AX)/dX = vec(A′)′

    dA⊗B/dA = (I_n ⊗ T_{q,m} ⊗ I_p)(I_{mn} ⊗ vec(B)) = (I_{nq} ⊗ T_{m,p})(I_n ⊗ vec(B) ⊗ I_m)

    dA⊗B/dB = (I_n ⊗ T_{q,m} ⊗ I_p)(vec(A) ⊗ I_{pq}) = (T_{n,q} ⊗ I_{mp})(I_q ⊗ vec(A) ⊗ I_p)

    dxx′/dx = x ⊗ I_n + I_n ⊗ x

    d²xx′/dxdx = (T_{n,n} ⊗ I_n)(I_n ⊗ vec(I_n) + vec(I_n) ⊗ I_n)

    dA^i/dA = (A′ ⊗ I) dA^{i−1}/dA + (I ⊗ A^{i−1}) = Σ_{j=1}^{i} (A′)^{i−j} ⊗ A^{j−1}

    d exp(A)/dA = Σ_{i=0}^∞ (1/i!) Σ_{j=1}^{i} (A′)^{i−j} ⊗ A^{j−1}