# 1 Singular Value Decomposition (SVD)


## Contents

- 1 Singular Value Decomposition (SVD)
  - 1.1 Singular Vectors
  - 1.2 Singular Value Decomposition (SVD)
  - 1.3 Best Rank k Approximations
  - 1.4 Power Method for Computing the Singular Value Decomposition
  - 1.5 Applications of Singular Value Decomposition
    - 1.5.1 Principal Component Analysis
    - 1.5.2 Clustering a Mixture of Spherical Gaussians
    - 1.5.3 An Application of SVD to a Discrete Optimization Problem
    - 1.5.4 SVD as a Compression Algorithm
    - 1.5.5 Spectral Decomposition
    - 1.5.6 Singular Vectors and ranking documents
  - Bibliographic Notes
  - Exercises

The singular value decomposition of a matrix $A$ is the factorization of $A$ into the product of three matrices, $A = UDV^T$, where the columns of $U$ and $V$ are orthonormal and the matrix $D$ is diagonal with positive real entries. The SVD is useful in many tasks. Here we mention some examples. First, in many applications, the data matrix $A$ is close to a matrix of low rank, and it is useful to find a low-rank matrix that is a good approximation to the data matrix. We will show that from the singular value decomposition of $A$, we can get the matrix $B$ of rank $k$ which best approximates $A$; in fact, we can do this for every $k$. Also, singular value decomposition is defined for all matrices (rectangular or square), unlike the more commonly used spectral decomposition in linear algebra. The reader familiar with eigenvectors and eigenvalues (we do not assume familiarity here) will also realize that we need conditions on the matrix to ensure orthogonality of eigenvectors. In contrast, the columns of $V$ in the singular value decomposition, called the right singular vectors of $A$, always form an orthogonal set with no assumptions on $A$. The columns of $U$ are called the left singular vectors and they also form an orthogonal set. A simple consequence of the orthogonality is that for a square and invertible matrix $A$, the inverse of $A$ is $VD^{-1}U^T$, as the reader can verify.

To gain insight into the SVD, treat the rows of an $n \times d$ matrix $A$ as $n$ points in a $d$-dimensional space, and consider the problem of finding the best $k$-dimensional subspace with respect to the set of points. Here "best" means minimizing the sum of the squares of the perpendicular distances of the points to the subspace. We begin with a special case of the problem where the subspace is 1-dimensional, a line through the origin. We will see later that the best-fitting $k$-dimensional subspace can be found by $k$ applications of the best-fitting-line algorithm.
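To make the factorization concrete, here is a small numerical sketch (assuming NumPy; any linear-algebra library would do). It computes the SVD of a rectangular matrix, reconstructs $A$ from $U$, $D$, $V^T$, and checks the inverse formula $VD^{-1}U^T$ on a square invertible matrix. The matrices are illustrative choices, not from the text.

```python
import numpy as np

# SVD of a rectangular matrix: A = U D V^T.
A = np.array([[3.0, 1.0], [1.0, 3.0], [1.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s holds the singular values
A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rebuilt))  # the factorization reproduces A

# For a square invertible matrix, A^{-1} = V D^{-1} U^T.
B = np.array([[2.0, 1.0], [1.0, 2.0]])
Ub, sb, Vbt = np.linalg.svd(B)
B_inv = Vbt.T @ np.diag(1.0 / sb) @ Ub.T
print(np.allclose(B_inv, np.linalg.inv(B)))
```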
Finding the best-fitting line through the origin with respect to a set of points $\{x_i \mid 1 \le i \le n\}$ in the plane means minimizing the sum of the squared distances of the points to the line. Here distance is measured perpendicular to the line. The problem is called the best least squares fit. In the best least squares fit, one is minimizing the distance to a subspace. An alternative problem is to find the function that best fits some data. Here one variable $y$ is a function of the variables $x_1, x_2, \ldots, x_d$, and one wishes to minimize the vertical distance, i.e., distance in the $y$ direction, to the subspace of the $x_i$, rather than the perpendicular distance to the subspace being fit to the data. Returning to the best least squares fit problem, consider projecting a point $x_i$ onto a line through the origin. Then, by the Pythagorean theorem,

$$x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2 = (\text{length of projection})^2 + (\text{distance of point to line})^2.$$

See Figure 1.1. Thus

$$(\text{distance of point to line})^2 = x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2 - (\text{length of projection})^2.$$

To minimize the sum of the squares of the distances to the line, one could minimize

[Figure 1.1: The projection of the point $x_i$ onto the line through the origin in the direction of $v$; minimizing $\sum_j \alpha_j^2$ (squared distances) is equivalent to maximizing $\sum_j \beta_j^2$ (squared projection lengths).]

$$\sum_{i=1}^n \left(x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2\right)$$

minus the sum of the squares of the lengths of the projections of the points onto the line. However, $\sum_{i=1}^n (x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2)$ is a constant independent of the line, so minimizing the sum of the squares of the distances is equivalent to maximizing the sum of the squares of the lengths of the projections onto the line. Similarly, for best-fit subspaces, we could maximize the sum of the squared lengths of the projections onto the subspace instead of minimizing the sum of squared distances to the subspace.

The reader may wonder why we minimize the sum of squared perpendicular distances to the line. We could alternatively have defined the best-fit line to be the one which minimizes the sum of perpendicular distances to the line. There are examples where this definition gives a different answer than the line minimizing the sum of squared perpendicular distances. [The reader could construct such examples.] The choice of the objective function as the sum of squared distances seems arbitrary, and in a way it is. But the square has many nice mathematical properties. The first of these is the use of the Pythagorean theorem above to show that minimizing the sum of squared distances is equivalent to maximizing the sum of squared projections. We will see that in fact we can use a greedy algorithm to find best-fit $k$-dimensional subspaces (which we will define soon), and for this too the square is important. The reader should also recall from calculus that the best-fit function is also defined in terms of a least-squares fit. There too, the existence of nice mathematical properties is the motivation for the square.

## 1.1 Singular Vectors

We now define the singular vectors of an $n \times d$ matrix $A$. Consider the rows of $A$ as $n$ points in a $d$-dimensional space. Consider the best-fit line through the origin. Let $v$ be a

unit vector along this line. The length of the projection of $a_i$, the $i$-th row of $A$, onto $v$ is $|a_i \cdot v|$. From this we see that the sum of the squared lengths of the projections is $|Av|^2$. The best-fit line is the one maximizing $|Av|^2$ and hence minimizing the sum of the squared distances of the points to the line.

With this in mind, define the first singular vector $v_1$ of $A$, which is a column vector, as the direction of the best-fit line through the origin for the $n$ points in $d$-space that are the rows of $A$:

$$v_1 = \arg\max_{|v|=1} |Av|.$$

The value $\sigma_1(A) = |Av_1|$ is called the first singular value of $A$. Note that $\sigma_1^2$ is the sum of the squares of the projections of the points onto the line determined by $v_1$.

The greedy approach to finding the best-fit 2-dimensional subspace for a matrix $A$ takes $v_1$ as the first basis vector and finds the best 2-dimensional subspace containing $v_1$. The fact that we are using the sum of squared distances again helps: for every 2-dimensional subspace containing $v_1$, the sum of squared lengths of the projections onto the subspace equals the sum of squared projections onto $v_1$ plus the sum of squared projections along a vector perpendicular to $v_1$ in the subspace. Thus, instead of looking for the best 2-dimensional subspace containing $v_1$, look for a unit vector, call it $v_2$, perpendicular to $v_1$ that maximizes $|Av|^2$ among all such unit vectors. Using the same greedy strategy to find the best three- and higher-dimensional subspaces defines $v_3, v_4, \ldots$ in a similar manner. This is captured in the following definitions. There is no a priori guarantee that the greedy algorithm gives the best fit. But, in fact, the greedy algorithm does work and yields the best-fit subspaces of every dimension, as we will show.

The second singular vector $v_2$ is defined by the best-fit line perpendicular to $v_1$:

$$v_2 = \arg\max_{v \perp v_1,\ |v|=1} |Av|.$$

The value $\sigma_2(A) = |Av_2|$ is called the second singular value of $A$.
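The greedy definition can be exercised directly. The sketch below (assuming NumPy; the matrix is an illustrative random example) takes $v_1$ from a full SVD, builds an orthonormal basis $Q$ of the subspace perpendicular to $v_1$, maximizes $|Av|$ over that subspace (the top singular vector of $AQ$), and checks that the result matches the second singular vector and value.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 4))

_, s, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]                               # first singular vector: argmax of |Av| over unit v

# Orthonormal basis for the complement of v1: QR of [v1, I] puts +/-v1 in column 0.
Q_full, _ = np.linalg.qr(np.column_stack([v1, np.eye(4)]))
Q = Q_full[:, 1:]                        # 4 x 3, columns perpendicular to v1

# Greedy step: maximize |Av| over unit v perpendicular to v1, i.e. v = Q w.
_, s2, Wt = np.linalg.svd(A @ Q, full_matrices=False)
v2 = Q @ Wt[0]

print(np.isclose(s2[0], s[1]))           # the maximum value is sigma_2
print(np.isclose(abs(v2 @ Vt[1]), 1.0))  # and the maximizer is +/-v_2
```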
The third singular vector $v_3$ is defined similarly by

$$v_3 = \arg\max_{v \perp v_1, v_2,\ |v|=1} |Av|$$

and so on. The process stops when we have found singular vectors $v_1, v_2, \ldots, v_r$ and

$$\max_{v \perp v_1, v_2, \ldots, v_r,\ |v|=1} |Av| = 0.$$

If instead of finding the $v_1$ that maximized $|Av|$ and then the best-fit 2-dimensional subspace containing $v_1$, we had found the best-fit 2-dimensional subspace directly, we might have

done better. This is not the case. We now give a simple proof that the greedy algorithm indeed finds the best subspaces of every dimension.

**Theorem 1.1** *Let $A$ be an $n \times d$ matrix where $v_1, v_2, \ldots, v_r$ are the singular vectors defined above. For $1 \le k \le r$, let $V_k$ be the subspace spanned by $v_1, v_2, \ldots, v_k$. Then for each $k$, $V_k$ is the best-fit $k$-dimensional subspace for $A$.*

**Proof:** The statement is obviously true for $k = 1$. For $k = 2$, let $W$ be a best-fit 2-dimensional subspace for $A$. For any orthonormal basis $w_1, w_2$ of $W$, $|Aw_1|^2 + |Aw_2|^2$ is the sum of squared lengths of the projections of the rows of $A$ onto $W$. Now, choose an orthonormal basis $w_1, w_2$ of $W$ so that $w_2$ is perpendicular to $v_1$. If $v_1$ is perpendicular to $W$, any unit vector in $W$ will do as $w_2$. If not, choose $w_2$ to be the unit vector in $W$ perpendicular to the projection of $v_1$ onto $W$. Since $v_1$ was chosen to maximize $|Av_1|^2$, it follows that $|Aw_1|^2 \le |Av_1|^2$. Since $v_2$ was chosen to maximize $|Av|^2$ over all $v$ perpendicular to $v_1$, $|Aw_2|^2 \le |Av_2|^2$. Thus

$$|Aw_1|^2 + |Aw_2|^2 \le |Av_1|^2 + |Av_2|^2.$$

Hence, $V_2$ is at least as good as $W$ and so is a best-fit 2-dimensional subspace.

For general $k$, proceed by induction. By the induction hypothesis, $V_{k-1}$ is a best-fit $(k-1)$-dimensional subspace. Suppose $W$ is a best-fit $k$-dimensional subspace. Choose an orthonormal basis $w_1, w_2, \ldots, w_k$ of $W$ so that $w_k$ is perpendicular to $v_1, v_2, \ldots, v_{k-1}$. Then

$$|Aw_1|^2 + |Aw_2|^2 + \cdots + |Aw_{k-1}|^2 \le |Av_1|^2 + |Av_2|^2 + \cdots + |Av_{k-1}|^2$$

since $V_{k-1}$ is an optimal $(k-1)$-dimensional subspace. Since $w_k$ is perpendicular to $v_1, v_2, \ldots, v_{k-1}$, by the definition of $v_k$, $|Aw_k|^2 \le |Av_k|^2$. Thus

$$|Aw_1|^2 + \cdots + |Aw_{k-1}|^2 + |Aw_k|^2 \le |Av_1|^2 + \cdots + |Av_{k-1}|^2 + |Av_k|^2,$$

proving that $V_k$ is at least as good as $W$ and hence is optimal. ∎

Note that the $n$-vector $Av_i$ is really a list of lengths (with signs) of the projections of the rows of $A$ onto $v_i$. Think of $|Av_i| = \sigma_i(A)$ as the "component" of the matrix $A$ along $v_i$.
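Theorem 1.1 can be probed numerically. The sketch below (assuming NumPy; sizes are illustrative) checks that, among many random 2-dimensional subspaces, none captures more total squared projection length than $V_2 = \text{span}(v_1, v_2)$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 5))

_, s, Vt = np.linalg.svd(A, full_matrices=False)
best = s[0] ** 2 + s[1] ** 2          # squared projection length captured by V_2

ok = True
for _ in range(500):
    # Random 2-dimensional subspace: orthonormal columns via QR.
    Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))
    captured = np.sum((A @ Q) ** 2)   # sum of squared projection lengths onto the subspace
    ok = ok and captured <= best + 1e-9
print(ok)  # no random subspace beats the greedy V_2
```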
For this interpretation to make sense, it should be true that adding up the squares of the components of $A$ along each of the $v_i$ gives the square of the "whole content" of the matrix $A$. This is indeed the case and is the matrix analogue of decomposing a vector into its components along orthogonal directions. Consider one row, say $a_j$, of $A$. Since $v_1, v_2, \ldots, v_r$ span the space of all rows of $A$, $a_j \cdot v = 0$ for all $v$ perpendicular to $v_1, v_2, \ldots, v_r$. Thus, for each row $a_j$, $\sum_{i=1}^r (a_j \cdot v_i)^2 = |a_j|^2$. Summing over all rows $j$,

$$\sum_{j=1}^n |a_j|^2 = \sum_{j=1}^n \sum_{i=1}^r (a_j \cdot v_i)^2 = \sum_{i=1}^r \sum_{j=1}^n (a_j \cdot v_i)^2 = \sum_{i=1}^r |Av_i|^2 = \sum_{i=1}^r \sigma_i^2(A).$$

But $\sum_{j=1}^n |a_j|^2 = \sum_{j=1}^n \sum_{k=1}^d a_{jk}^2$, the sum of squares of all the entries of $A$. Thus, the sum of squares of the singular values of $A$ is indeed the square of the "whole content" of $A$, i.e., the sum of squares of all the entries. There is an important norm associated with this quantity, the Frobenius norm of $A$, denoted $\|A\|_F$, defined as

$$\|A\|_F = \sqrt{\sum_{j,k} a_{jk}^2}.$$

**Lemma 1.2** *For any matrix $A$, the sum of squares of the singular values equals the square of the Frobenius norm. That is, $\sum_i \sigma_i^2(A) = \|A\|_F^2$.*

**Proof:** By the preceding discussion. ∎

A matrix $A$ can be described fully by how it transforms the vectors $v_i$. Every vector $v$ can be written as a linear combination of $v_1, v_2, \ldots, v_r$ and a vector perpendicular to all the $v_i$. Now, $Av$ is the same linear combination of $Av_1, Av_2, \ldots, Av_r$ as $v$ is of $v_1, v_2, \ldots, v_r$. So the $Av_1, Av_2, \ldots, Av_r$ form a fundamental set of vectors associated with $A$. We normalize them to length one by

$$u_i = \frac{1}{\sigma_i(A)} Av_i.$$

The vectors $u_1, u_2, \ldots, u_r$ are called the left singular vectors of $A$. The $v_i$ are called the right singular vectors. The SVD theorem (Theorem 1.5) will fully explain the reason for these terms. Clearly, the right singular vectors are orthogonal by definition. We now show that the left singular vectors are also orthogonal and that $A = \sum_{i=1}^r \sigma_i u_i v_i^T$.

**Theorem 1.3** *Let $A$ be a rank $r$ matrix. The left singular vectors of $A$, $u_1, u_2, \ldots, u_r$, are orthogonal.*

**Proof:** The proof is by induction on $r$. For $r = 1$, there is only one $u_i$, so the theorem is trivially true. For the inductive part, consider the matrix

$$B = A - \sigma_1 u_1 v_1^T.$$

The implied algorithm in the definition of singular value decomposition applied to $B$ is identical to a run of the algorithm on $A$ for its second and later singular vectors and singular values. To see this, first observe that $Bv_1 = Av_1 - \sigma_1 u_1 v_1^T v_1 = 0$.
It then follows that the first right singular vector, call it $z$, of $B$ will be perpendicular to $v_1$. For if $z$ had a component $z_1$ along $v_1$, then, since $Bz_1 = 0$ and $|z - z_1| < 1$,

$$\left|B\,\frac{z - z_1}{|z - z_1|}\right| = \frac{|Bz|}{|z - z_1|} > |Bz|,$$

contradicting the arg max definition of $z$. But for any $v$ perpendicular to $v_1$, $Bv = Av$. Thus, the top singular

vector of $B$ is indeed a second singular vector of $A$. Repeating this argument shows that a run of the algorithm on $B$ is the same as a run on $A$ for its second and later singular vectors. This is left as an exercise. Thus, there is a run of the algorithm that finds that $B$ has right singular vectors $v_2, v_3, \ldots, v_r$ and corresponding left singular vectors $u_2, u_3, \ldots, u_r$. By the induction hypothesis, $u_2, u_3, \ldots, u_r$ are orthogonal.

It remains to prove that $u_1$ is orthogonal to the other $u_i$. Suppose not, and for some $i \ge 2$, $u_1^T u_i \ne 0$. Without loss of generality assume that $u_1^T u_i > 0$; the proof is symmetric for the case $u_1^T u_i < 0$. Now, for infinitesimally small $\varepsilon > 0$, the vector

$$A\left(\frac{v_1 + \varepsilon v_i}{|v_1 + \varepsilon v_i|}\right) = \frac{\sigma_1 u_1 + \varepsilon \sigma_i u_i}{\sqrt{1 + \varepsilon^2}}$$

has length at least as large as its component along $u_1$, which is

$$u_1^T\left(\frac{\sigma_1 u_1 + \varepsilon \sigma_i u_i}{\sqrt{1 + \varepsilon^2}}\right) = \left(\sigma_1 + \varepsilon \sigma_i u_1^T u_i\right)\left(1 - \frac{\varepsilon^2}{2} + O(\varepsilon^4)\right) = \sigma_1 + \varepsilon \sigma_i u_1^T u_i - O(\varepsilon^2) > \sigma_1$$

for $\varepsilon$ small enough, a contradiction. Thus, $u_1, u_2, \ldots, u_r$ are orthogonal. ∎

## 1.2 Singular Value Decomposition (SVD)

Let $A$ be an $n \times d$ matrix with singular vectors $v_1, v_2, \ldots, v_r$ and corresponding singular values $\sigma_1, \sigma_2, \ldots, \sigma_r$. Then $u_i = \frac{1}{\sigma_i} Av_i$, for $i = 1, 2, \ldots, r$, are the left singular vectors, and by Theorem 1.5, $A$ can be decomposed into a sum of rank-one matrices as

$$A = \sum_{i=1}^r \sigma_i u_i v_i^T.$$

We first prove a simple lemma stating that two matrices $A$ and $B$ are identical if $Av = Bv$ for all $v$. The lemma states that, in the abstract, a matrix $A$ can be viewed as a transformation that maps a vector $v$ onto $Av$.

**Lemma 1.4** *Matrices $A$ and $B$ are identical if and only if for all vectors $v$, $Av = Bv$.*

**Proof:** Clearly, if $A = B$, then $Av = Bv$ for all $v$. For the converse, suppose that $Av = Bv$ for all $v$. Let $e_i$ be the vector that is all zeros except for the $i$-th component, which has value 1. Now $Ae_i$ is the $i$-th column of $A$, and thus $A = B$ if for each $i$, $Ae_i = Be_i$. ∎
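The facts established so far can all be verified numerically: the normalized vectors $u_i = Av_i/\sigma_i$ are orthonormal (Theorem 1.3), $A$ equals the sum of rank-one matrices $\sigma_i u_i v_i^T$ (Theorem 1.5), and the singular values satisfy Lemma 1.2. A sketch, assuming NumPy and a full-column-rank random matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = len(s)  # a random Gaussian matrix has full column rank, so r = 4 here

# u_i = A v_i / sigma_i, and the u_i are orthonormal (Theorem 1.3).
U_built = np.column_stack([A @ Vt[i] / s[i] for i in range(r)])
print(np.allclose(U_built.T @ U_built, np.eye(r)))

# A equals the sum of rank-one matrices sigma_i u_i v_i^T (Theorem 1.5).
A_sum = sum(s[i] * np.outer(U_built[:, i], Vt[i]) for i in range(r))
print(np.allclose(A, A_sum))

# Sum of squared singular values equals the squared Frobenius norm (Lemma 1.2).
print(np.isclose(np.sum(s ** 2), np.sum(A ** 2)))
```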

[Figure 1.2: The SVD decomposition of an $n \times d$ matrix: $A_{n \times d} = U_{n \times r}\, D_{r \times r}\, V^T_{r \times d}$.]

**Theorem 1.5** *Let $A$ be an $n \times d$ matrix with right singular vectors $v_1, v_2, \ldots, v_r$, left singular vectors $u_1, u_2, \ldots, u_r$, and corresponding singular values $\sigma_1, \sigma_2, \ldots, \sigma_r$. Then*

$$A = \sum_{i=1}^r \sigma_i u_i v_i^T.$$

**Proof:** For each singular vector $v_j$, $Av_j = \sum_{i=1}^r \sigma_i u_i v_i^T v_j$. Since any vector $v$ can be expressed as a linear combination of the singular vectors plus a vector perpendicular to the $v_i$, $Av = \sum_{i=1}^r \sigma_i u_i v_i^T v$, and by Lemma 1.4, $A = \sum_{i=1}^r \sigma_i u_i v_i^T$. ∎

The decomposition is called the singular value decomposition, SVD, of $A$. In matrix notation, $A = UDV^T$, where the columns of $U$ and $V$ consist of the left and right singular vectors, respectively, and $D$ is a diagonal matrix whose diagonal entries are the singular values of $A$. For any matrix $A$, the sequence of singular values is unique, and if the singular values are all distinct, then the sequence of singular vectors is unique as well (up to sign). However, when some set of singular values are equal, the corresponding singular vectors span some subspace. Any set of orthonormal vectors spanning this subspace can be used as the singular vectors.

## 1.3 Best Rank k Approximations

There are two important matrix norms, the Frobenius norm, denoted $\|A\|_F$, and the 2-norm, denoted $\|A\|_2$. The 2-norm of the matrix $A$ is given by

$$\max_{|v|=1} |Av|$$

and thus equals the largest singular value of the matrix.

Let $A$ be an $n \times d$ matrix and think of the rows of $A$ as $n$ points in $d$-dimensional space. The Frobenius norm of $A$ is the square root of the sum of the squared distances of the points to the origin. The 2-norm is the square root of the sum of squared lengths of the projections of the points onto the direction that maximizes this quantity. Let

$$A = \sum_{i=1}^r \sigma_i u_i v_i^T$$

be the SVD of $A$. For $k \in \{1, 2, \ldots, r\}$, let

$$A_k = \sum_{i=1}^k \sigma_i u_i v_i^T$$

be the sum truncated after $k$ terms. It is clear that $A_k$ has rank $k$. Furthermore, $A_k$ is the best rank-$k$ approximation to $A$ when the error is measured in either the 2-norm or the Frobenius norm.

**Lemma 1.6** *The rows of $A_k$ are the projections of the rows of $A$ onto the subspace $V_k$ spanned by the first $k$ singular vectors of $A$.*

**Proof:** Let $a$ be an arbitrary row vector. Since the $v_i$ are orthonormal, the projection of the vector $a$ onto $V_k$ is given by $\sum_{i=1}^k (a \cdot v_i) v_i^T$. Thus, the matrix whose rows are the projections of the rows of $A$ onto $V_k$ is given by $\sum_{i=1}^k A v_i v_i^T$. This last expression simplifies to

$$\sum_{i=1}^k A v_i v_i^T = \sum_{i=1}^k \sigma_i u_i v_i^T = A_k. \quad ∎$$

The matrix $A_k$ is the best rank-$k$ approximation to $A$ in both the Frobenius and the 2-norm. First we show that the matrix $A_k$ is the best rank-$k$ approximation to $A$ in the Frobenius norm.

**Theorem 1.7** *For any matrix $B$ of rank at most $k$,*

$$\|A - A_k\|_F \le \|A - B\|_F.$$

**Proof:** Let $B$ minimize $\|A - B\|_F^2$ among all matrices of rank $k$ or less. Let $V$ be the space spanned by the rows of $B$. The dimension of $V$ is at most $k$. Since $B$ minimizes $\|A - B\|_F^2$, it must be that each row of $B$ is the projection of the corresponding row of $A$ onto $V$:

otherwise, replacing a row of $B$ with the projection of the corresponding row of $A$ onto $V$ does not change $V$, and hence not the rank of $B$, but would reduce $\|A - B\|_F^2$. Since each row of $B$ is the projection of the corresponding row of $A$, it follows that $\|A - B\|_F^2$ is the sum of squared distances of the rows of $A$ to $V$. Since $A_k$ minimizes the sum of squared distances of the rows of $A$ to any $k$-dimensional subspace, by Theorem 1.1 it follows that $\|A - A_k\|_F \le \|A - B\|_F$. ∎

Next we tackle the 2-norm. We first show that the square of the 2-norm of $A - A_k$ is the square of the $(k+1)$-st singular value of $A$.

**Lemma 1.8** $\|A - A_k\|_2^2 = \sigma_{k+1}^2$.

**Proof:** Let $A = \sum_{i=1}^r \sigma_i u_i v_i^T$ be the singular value decomposition of $A$. Then $A_k = \sum_{i=1}^k \sigma_i u_i v_i^T$ and

$$A - A_k = \sum_{i=k+1}^r \sigma_i u_i v_i^T.$$

Let $v$ be the top singular vector of $A - A_k$. Express $v$ as a linear combination of $v_1, v_2, \ldots, v_r$; that is, write $v = \sum_{i=1}^r \alpha_i v_i$. Then

$$(A - A_k)v = \sum_{i=k+1}^r \sigma_i u_i v_i^T \sum_{j=1}^r \alpha_j v_j = \sum_{i=k+1}^r \alpha_i \sigma_i u_i,$$

so that

$$|(A - A_k)v|^2 = \sum_{i=k+1}^r \alpha_i^2 \sigma_i^2.$$

The $v$ maximizing this last quantity, subject to the constraint that $|v|^2 = \sum_{i=1}^r \alpha_i^2 = 1$, occurs when $\alpha_{k+1} = 1$ and the rest of the $\alpha_i$ are 0. Thus, $\|A - A_k\|_2^2 = \sigma_{k+1}^2$, proving the lemma. ∎

Finally, we prove that $A_k$ is the best rank-$k$ 2-norm approximation to $A$.

**Theorem 1.9** *Let $A$ be an $n \times d$ matrix. For any matrix $B$ of rank at most $k$,*

$$\|A - A_k\|_2 \le \|A - B\|_2.$$

**Proof:** If $A$ is of rank $k$ or less, the theorem is obviously true since $\|A - A_k\|_2 = 0$. Thus assume that $A$ is of rank greater than $k$. By Lemma 1.8, $\|A - A_k\|_2^2 = \sigma_{k+1}^2$. Now suppose there is some matrix $B$ of rank at most $k$ such that $B$ is a better 2-norm approximation to $A$ than $A_k$; that is, $\|A - B\|_2 < \sigma_{k+1}$. The null space of $B$, Null$(B)$ (the set of vectors $v$ such that $Bv = 0$), has dimension at least $d - k$. Let $v_1, v_2, \ldots, v_{k+1}$ be the first $k + 1$ singular vectors of $A$. By a dimension argument, it follows that there exists a $z \ne 0$ in

$$\text{Null}(B) \cap \text{Span}\{v_1, v_2, \ldots, v_{k+1}\}.$$

[Indeed, there are $d - k$ independent vectors in Null$(B)$; say $u_1, u_2, \ldots, u_{d-k}$ are any $d - k$ independent vectors in Null$(B)$. Now, $u_1, \ldots, u_{d-k}, v_1, \ldots, v_{k+1}$ are $d + 1$ vectors in $d$-space, and so they are dependent: there are real numbers $\alpha_1, \ldots, \alpha_{d-k}$ and $\beta_1, \ldots, \beta_{k+1}$, not all zero, so that $\sum_{i=1}^{d-k} \alpha_i u_i = \sum_{j=1}^{k+1} \beta_j v_j$. Take $z = \sum_{i=1}^{d-k} \alpha_i u_i$ and observe that $z$ cannot be the zero vector.]

Scale $z$ so that $|z| = 1$. We now show that for this vector $z$, which lies in the span of the first $k + 1$ singular vectors of $A$, $|(A - B)z| \ge \sigma_{k+1}$. Hence the 2-norm of $A - B$ is at least $\sigma_{k+1}$, contradicting the assumption that $\|A - B\|_2 < \sigma_{k+1}$. First,

$$\|A - B\|_2^2 \ge |(A - B)z|^2.$$

Since $Bz = 0$,

$$\|A - B\|_2^2 \ge |Az|^2.$$

Since $z$ is in the Span$\{v_1, v_2, \ldots, v_{k+1}\}$,

$$|Az|^2 = \left|\sum_{i} \sigma_i u_i v_i^T z\right|^2 = \sum_{i=1}^{k+1} \sigma_i^2 \left(v_i^T z\right)^2 \ge \sigma_{k+1}^2 \sum_{i=1}^{k+1} \left(v_i^T z\right)^2 = \sigma_{k+1}^2.$$

It follows that $\|A - B\|_2^2 \ge \sigma_{k+1}^2$, contradicting the assumption that $\|A - B\|_2 < \sigma_{k+1}$. This proves the theorem. ∎

## 1.4 Power Method for Computing the Singular Value Decomposition

Computing the singular value decomposition is an important branch of numerical analysis in which there have been many sophisticated developments over a long period of time. Here we present an in-principle method to establish that the approximate SVD of a matrix $A$ can be computed in polynomial time. The reader is referred to numerical analysis texts for more details. The method we present, called the power method, is simple and is in fact the conceptual starting point for many algorithms. It is easiest to describe first in the case when $A$ is square symmetric and has the same right and left singular vectors, namely,

$$A = \sum_{i=1}^r \sigma_i v_i v_i^T.$$

In this case, we have

$$A^2 = \left(\sum_{i=1}^r \sigma_i v_i v_i^T\right)\left(\sum_{j=1}^r \sigma_j v_j v_j^T\right) = \sum_{i,j=1}^r \sigma_i \sigma_j v_i v_i^T v_j v_j^T = \sum_{i=1}^r \sigma_i^2 v_i v_i^T,$$

where first we just multiplied the two sums and then observed that if $i \ne j$, the dot product $v_i^T v_j$ equals 0 by orthogonality. [Caution: The outer product $v_i v_j^T$ is a matrix

and is not zero even for $i \ne j$.] Similarly, if we take the $k$-th power of $A$, again all the cross terms are zero and we get

$$A^k = \sum_{i=1}^r \sigma_i^k v_i v_i^T.$$

If we had $\sigma_1 > \sigma_2$, we would have

$$\frac{1}{\sigma_1^k} A^k \to v_1 v_1^T.$$

Now we do not know $\sigma_1$ beforehand and cannot find this limit, but if we just take $A^k$ and divide by $\|A^k\|_F$ so that the Frobenius norm is normalized to 1, that matrix will converge to the rank-1 matrix $v_1 v_1^T$, from which $v_1$ may be computed. [This is still an intuitive description, which we will make precise shortly.]

First, we cannot make the assumption that $A$ is square and has the same right and left singular vectors. But $B = AA^T$ satisfies both these conditions. If the SVD of $A$ is $\sum_i \sigma_i u_i v_i^T$, then by direct multiplication

$$B = AA^T = \left(\sum_i \sigma_i u_i v_i^T\right)\left(\sum_j \sigma_j v_j u_j^T\right) = \sum_{i,j} \sigma_i \sigma_j u_i \left(v_i^T v_j\right) u_j^T = \sum_i \sigma_i^2 u_i u_i^T,$$

since $v_i^T v_j$ is the dot product of the two vectors and is zero unless $i = j$. This is the spectral decomposition of $B$. Using the same kind of calculation as above,

$$B^k = \sum_i \sigma_i^{2k} u_i u_i^T.$$

As $k$ increases, for $i > 1$, $\sigma_i^{2k}/\sigma_1^{2k}$ goes to zero and $B^k$ is approximately equal to $\sigma_1^{2k} u_1 u_1^T$, provided that for each $i > 1$, $\sigma_i(A) < \sigma_1(A)$. This suggests a way of finding $\sigma_1$ and $u_1$ by successively powering $B$. But there are two issues. First, if there is a significant gap between the first and second singular values of a matrix, then the above argument applies and the power method will quickly converge to the first left singular vector. Suppose there is no significant gap. In the extreme case, there may be ties for the top singular value. Then the above argument does not work. There are cumbersome ways of overcoming this by assuming a gap between $\sigma_1$ and $\sigma_2$;

[Figure 1.3: The volume of the cylinder of height $\frac{1}{20\sqrt{d}}$ is an upper bound on the volume of the portion of the hemisphere below $x_1 = \frac{1}{20\sqrt{d}}$.]

The volume of the portion of the sphere with $|v_1| \le \alpha$ is less than or equal to $2\alpha V(d-1)$ (recall the notation from Chapter 2), and so

$$\operatorname{Prob}(|v_1| \le \alpha) \le \frac{2\alpha V(d-1)}{V(d)}.$$

Now the volume of the unit-radius sphere is at least twice the volume of the cylinder of height $\frac{1}{\sqrt{d-1}}$ and radius $\sqrt{1 - \frac{1}{d-1}}$. Using $(1-x)^a \ge 1 - ax$,

$$V(d) \ge \frac{2}{\sqrt{d-1}}\, V(d-1)\left(1 - \frac{1}{d-1}\right)^{\frac{d-1}{2}} \ge \frac{V(d-1)}{\sqrt{d-1}},$$

and so

$$\operatorname{Prob}(|v_1| \le \alpha) \le \frac{2\alpha V(d-1)}{V(d-1)/\sqrt{d-1}} = 2\alpha\sqrt{d-1}.$$

Taking $\alpha = \frac{1}{20\sqrt{d}}$, the probability that $|v_1| \ge \frac{1}{20\sqrt{d}}$ is at least $1 - \frac{2\sqrt{d-1}}{20\sqrt{d}} \ge 9/10$. ∎

**Theorem 1.11** *Let $A$ be an $n \times d$ matrix and $x$ a random unit-length vector. Let $V$ be the space spanned by the left singular vectors of $A$ corresponding to singular values greater*

*than $(1 - \varepsilon)\sigma_1$. Let $k$ be $\Omega\!\left(\frac{\ln(d/\varepsilon)}{\varepsilon}\right)$. Let $w$ be the unit vector after $k$ iterations of the power method, namely,*

$$w = \frac{(AA^T)^k x}{\left|(AA^T)^k x\right|}.$$

*The probability that $w$ has a component of at least $\varepsilon$ perpendicular to $V$ is at most $1/10$.*

**Proof:** Let $A = \sum_{i=1}^r \sigma_i u_i v_i^T$ be the SVD of $A$. If the rank of $A$ is less than $d$, then complete $\{u_1, u_2, \ldots, u_r\}$ into a basis $\{u_1, u_2, \ldots, u_d\}$ of $d$-space. Write $x$ in the basis of the $u_i$'s as

$$x = \sum_{i=1}^d c_i u_i.$$

Since $(AA^T)^k = \sum_i \sigma_i^{2k} u_i u_i^T$, it follows that $(AA^T)^k x = \sum_{i=1}^d \sigma_i^{2k} c_i u_i$. For a random unit-length vector $x$ picked independently of $A$, the $u_i$ are fixed vectors, and picking $x$ at random is equivalent to picking random $c_i$. From Lemma 1.10, $|c_1| \ge \frac{1}{20\sqrt{d}}$ with probability at least $9/10$.

Suppose that $\sigma_1, \sigma_2, \ldots, \sigma_m$ are the singular values of $A$ that are greater than or equal to $(1 - \varepsilon)\sigma_1$ and that $\sigma_{m+1}, \ldots, \sigma_r$ are the singular values that are less than $(1 - \varepsilon)\sigma_1$. Now

$$\left|(AA^T)^k x\right|^2 = \left|\sum_{i=1}^d \sigma_i^{2k} c_i u_i\right|^2 = \sum_{i=1}^d \sigma_i^{4k} c_i^2 \ge \sigma_1^{4k} c_1^2 \ge \frac{1}{400\,d}\, \sigma_1^{4k},$$

with probability at least $9/10$. Here we used the fact that a sum of positive quantities is at least as large as its first element. [Note: If we did not choose $x$ at random in the beginning and use Lemma 1.10, $c_1$ could well have been zero and this argument would not work.]

The squared component of $(AA^T)^k x$ perpendicular to the space $V$ is

$$\sum_{i=m+1}^d \sigma_i^{4k} c_i^2 \le (1 - \varepsilon)^{4k} \sigma_1^{4k} \sum_{i=m+1}^d c_i^2 \le (1 - \varepsilon)^{4k} \sigma_1^{4k},$$

since $\sum_{i=1}^d c_i^2 = |x|^2 = 1$. Thus, the component of $w$ perpendicular to $V$ is at most

$$\frac{(1 - \varepsilon)^{2k} \sigma_1^{2k}}{\frac{1}{20\sqrt{d}}\, \sigma_1^{2k}} = O\!\left(\sqrt{d}\,(1 - \varepsilon)^{2k}\right) = O\!\left(\sqrt{d}\, e^{-2\varepsilon k}\right) = O\!\left(\sqrt{d}\, e^{-\Omega(\ln(d/\varepsilon))}\right) = O(\varepsilon),$$

as desired. ∎
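Theorem 1.11's iteration is easy to run. The sketch below (assuming NumPy; sizes and the iteration count are illustrative) starts from a random unit vector — the random start matters, per Lemma 1.10, since a start with no component along $u_1$ would never converge — applies $AA^T$ repeatedly with renormalization, and checks that the result aligns with the top left singular vector.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(6, 4))
B = A @ A.T

# Power method: w = (A A^T)^k x / |(A A^T)^k x|, with x a random unit start.
x = rng.normal(size=6)
w = x / np.linalg.norm(x)
for _ in range(200):          # k iterations; normalizing each step avoids overflow
    w = B @ w
    w /= np.linalg.norm(w)

u1 = np.linalg.svd(A)[0][:, 0]         # top left singular vector, for comparison
print(np.isclose(abs(w @ u1), 1.0))    # w converged to +/-u_1 (assumes a singular-value gap)

sigma1_est = np.sqrt(w @ B @ w)        # Rayleigh quotient recovers sigma_1
print(np.isclose(sigma1_est, np.linalg.svd(A, compute_uv=False)[0]))
```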

## 1.5 Applications of Singular Value Decomposition

### Principal Component Analysis

The traditional use of SVD is in Principal Component Analysis (PCA). PCA is illustrated by an example: customer-product data, where there are $n$ customers buying $d$ products. Let the matrix $A$ with elements $a_{ij}$ represent the probability of customer $i$ purchasing product $j$ (or the amount or utility of product $j$ to customer $i$). One hypothesizes that there are really only $k$ underlying basic factors, like age, income, family size, etc., that determine a customer's purchase behavior. An individual customer's behavior is determined by some weighted combination of these underlying factors. That is, a customer's purchase behavior can be characterized by a $k$-dimensional vector, where $k$ is much smaller than $n$ and $d$. The components of the vector are weights for each of the basic factors. Associated with each basic factor is a vector of probabilities, each component of which is the probability of purchasing a given product by someone whose behavior depends only on that factor. More abstractly, $A$ is an $n \times d$ matrix that can be expressed as the product of two matrices $U$ and $V$, where $U$ is an $n \times k$ matrix expressing the factor weights for each customer and $V$ is a $k \times d$ matrix expressing the purchase probabilities of products corresponding to each factor. One twist is that $A$ may not be exactly equal to $UV$, but only close to it, since there may be noise or random perturbations. Taking the best rank-$k$ approximation $A_k$ from the SVD (as described above) gives us such a $U$ and $V$. In this traditional setting, one assumed that $A$ was available fully and we wished to find $U$ and $V$ in order to identify the basic factors, or, in some applications, to "denoise" $A$ (if we think of $A - UV$ as noise). Now imagine that $n$ and $d$ are very large, on the order of thousands or even millions. Then there is probably little one could do to estimate, or even store, all of $A$. In this setting, we may assume that we are given just a few entries of $A$ and wish to estimate the whole matrix.
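Here is a hedged sketch of the PCA story on synthetic customer-product data (assuming NumPy; the dimensions, rank $k$, and noise level are illustrative choices, not from the text). A low-rank "true" matrix $UV$ plus small noise is approximated by the truncated SVD $A_k$, which recovers most of the structure; the 2-norm error of $A_k$ equals $\sigma_{k+1}$, as Lemma 1.8 states.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, k = 60, 40, 3

# Hidden factors: U_true holds factor weights per customer,
# V_true purchase probabilities per factor.
U_true = rng.random(size=(n, k))
V_true = rng.random(size=(k, d))
A = U_true @ V_true + 0.01 * rng.normal(size=(n, d))   # noisy observations

# Best rank-k approximation A_k via truncated SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

rel_err = np.linalg.norm(A - A_k, 'fro') / np.linalg.norm(A, 'fro')
print(rel_err < 0.05)   # the rank-k approximation captures almost everything

# Lemma 1.8: the 2-norm error of A_k is exactly the (k+1)-st singular value.
print(np.isclose(np.linalg.norm(A - A_k, 2), s[k]))
```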
If $A$ were an arbitrary matrix of size $n \times d$, this would require $\Omega(nd)$ pieces of information and could not be done from a few entries. But again hypothesize that $A$ is a low-rank matrix with added noise. If we now also assume that the given entries are randomly drawn according to some known distribution, then there is a possibility that SVD can be used to estimate the whole of $A$. This area is called collaborative filtering, and one of its uses is to target an ad to a customer based on one or two purchases. We will not be able to describe it here.

### Clustering a Mixture of Spherical Gaussians

In clustering, we are given a set of points in $d$-space, and the task is to partition the points into $k$ subsets (clusters), where each cluster consists of nearby points. Different definitions of the goodness of a clustering lead to different solutions. Clustering is an important area which we will study in detail in Chapter ??. Here we will see how to solve a particular clustering problem using singular value decomposition. In general, a solution to any clustering problem comes up with $k$ cluster centers

[Figure 1.4: Customer-product data: a customers $\times$ factors matrix $U$ times a factors $\times$ products matrix $V$.]

which define the $k$ clusters: a cluster is the set of data points that have a particular cluster center as the closest cluster center. Hence the Voronoi cells of the cluster centers determine the clusters. Using this observation, it is relatively easy to cluster points in two or three dimensions. However, clustering is not so easy in higher dimensions. Many problems have high-dimensional data, and clustering problems are no exception. Clustering problems tend to be NP-hard, so we do not have polynomial-time algorithms to solve them. One way around this is to assume stochastic models of the input data and devise algorithms to cluster under such models.

Mixture models are a very important class of stochastic models. A mixture is a probability density or distribution that is the weighted sum of simple component probability densities. It is of the form

$$w_1 p_1 + w_2 p_2 + \cdots + w_k p_k,$$

where $p_1, p_2, \ldots, p_k$ are the basic densities and $w_1, w_2, \ldots, w_k$ are positive real numbers called weights that add up to one. Clearly, $w_1 p_1 + w_2 p_2 + \cdots + w_k p_k$ is a probability density: it integrates to one.

The model-fitting problem is to fit a mixture of $k$ basic densities to $n$ samples, each sample drawn according to the same mixture distribution. The class of basic densities is known, but the component weights of the mixture are not. Here we deal with the case where the basic densities are all spherical Gaussians. The samples are generated by picking an integer $i$ from the set $\{1, 2, \ldots, k\}$ with probabilities $w_1, w_2, \ldots, w_k$, respectively, then picking a sample according to $p_i$, and repeating the process $n$ times. This process generates $n$ samples according to the mixture, where the set of samples is naturally partitioned into $k$ sets, each set corresponding to one $p_i$. The model-fitting problem consists of two subproblems.
The first subproblem is to cluster the samples into $k$ subsets, where each subset was picked according to one component density. The second subproblem is to fit a distribution to each subset. We discuss

only the clustering problem here. The problem of fitting a single Gaussian to a set of data points is a lot easier and was discussed in Section ?? of Chapter 2.

If the component Gaussians in the mixture have their centers very close together, then the clustering problem is unresolvable. In the limiting case where a pair of component densities are the same, there is no way to distinguish between them. What condition on the inter-center separation will guarantee unambiguous clustering? First, by looking at 1-dimensional examples, it is clear that this separation should be measured in units of the standard deviation, since the density is a function of the number of standard deviations from the mean. In one dimension, if two Gaussians have inter-center separation at least ten times the maximum of their standard deviations, then they hardly overlap. What is the analog of this statement in higher dimensions?

For a $d$-dimensional spherical Gaussian with standard deviation $\sigma$ in each direction¹, it is easy to see that the expected squared distance from the center is $d\sigma^2$. Define the radius $r$ of the Gaussian to be the square root of the average squared distance from the center; so $r = \sqrt{d}\,\sigma$. If the inter-center separation between two spherical Gaussians, both of radius $r$, is at least $2r = 2\sqrt{d}\,\sigma$, then it is easy to see that the densities hardly overlap. But this separation requirement grows with $d$. For problems with large $d$, this would impose a separation requirement not met in practice. The main aim is to answer affirmatively the question: can we show an analog of the 1-dimensional statement for large $d$? In a mixture of $k$ spherical Gaussians in $d$-space, if the centers of each pair of component Gaussians are $\Omega(1)$ standard deviations apart, then we can separate the sample points into the $k$ components (for $k \in O(1)$). The central idea is the following: suppose for a moment that we can (magically) find the subspace spanned by the $k$ centers.
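The central idea can be sketched numerically (assuming NumPy; two clusters and the separation constant are illustrative choices). Sample from two well-separated spherical Gaussians, project onto the best-fit subspace given by the top right singular vectors of the data matrix, and observe that the projected clusters stay roughly $\sigma$-spherical while the centers stay far apart.

```python
import numpy as np

rng = np.random.default_rng(9)
d, k, sigma, n = 30, 2, 1.0, 4000

# Two spherical Gaussians with centers 10*sigma apart.
mu2 = np.zeros(d)
mu2[0] = 10 * sigma
A = np.vstack([rng.normal(0.0, sigma, size=(n, d)),
               rng.normal(mu2, sigma, size=(n, d))])

# Project onto the best-fit k-dimensional subspace (top k right singular vectors).
_, _, Vt = np.linalg.svd(A, full_matrices=False)
P = A @ Vt[:k].T                      # coordinates inside the subspace

c1, c2 = P[:n].mean(axis=0), P[n:].mean(axis=0)
print(np.linalg.norm(c1 - c2) > 8 * sigma)           # separation survives the projection
print(abs(P[:n].std(axis=0).mean() - sigma) < 0.1)   # projected std is still ~sigma
```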
Imagine projecting the sample points onto this subspace. It is easy to see (see Lemma 1.12 below) that the projection of a spherical Gaussian with standard deviation σ is still a spherical Gaussian with standard deviation σ. But in the projection, the inter-center separation remains the same. So in the projection, the Gaussians are distinct provided the inter-center separation (in the whole space) is Ω(√k σ), which is much smaller than the Ω(√d σ) required otherwise when k ≪ d. Interestingly, we will see that the subspace spanned by the k centers is essentially the best-fit k-dimensional subspace, which we can find by Singular Value Decomposition.

¹Since a spherical Gaussian has the same standard deviation in every direction, we call it the standard deviation of the Gaussian.

Lemma 1.12 Suppose p is a d-dimensional spherical Gaussian with center µ and standard deviation σ. The density of p projected onto an arbitrary k-dimensional subspace V is a spherical Gaussian with the same standard deviation.

Proof: Since p is spherical, the projection is independent of the k-dimensional subspace. Pick V to be the subspace spanned by the first k coordinate vectors. For a point x =

1. The best-fit 1-dimensional subspace to a spherical Gaussian is the line through its center and the origin.
2. Any k-dimensional subspace containing this line is a best-fit k-dimensional subspace for the Gaussian.
3. The best-fit k-dimensional subspace for k Gaussians is the subspace containing their centers.

Figure 1.5: Best fit subspace to a spherical Gaussian.

(x₁, x₂, ..., x_d), we will use the notation x′ = (x₁, x₂, ..., x_k) and x″ = (x_{k+1}, x_{k+2}, ..., x_d). The density of the projected Gaussian at the point (x₁, x₂, ..., x_k) is

c e^(−|x′−µ′|²/(2σ²)) ∫ e^(−|x″−µ″|²/(2σ²)) dx″ = c′ e^(−|x′−µ′|²/(2σ²)),

where µ′ and µ″ denote the first k and last d−k coordinates of µ, and c, c′ are normalizing constants. This clearly implies the lemma.

We now show that the top k singular vectors produced by the SVD span the space of the k centers. First, we extend the notion of best fit to probability distributions. Then we show that for a single spherical Gaussian (whose center is not the origin), the best-fit 1-dimensional subspace is the line through the center of the Gaussian and the origin. Next, we show that the best-fit k-dimensional subspace for a single Gaussian (whose center is

not the origin) is any k-dimensional subspace containing the line through the Gaussian's center and the origin. Finally, for k spherical Gaussians, the best-fit k-dimensional subspace is the subspace containing their centers. Thus, the SVD finds the subspace that contains the centers.

Recall that for a set of points, the best-fit line is the line passing through the origin which minimizes the sum of squared distances to the points. We extend this definition from a set of points to probability densities.

Definition 4.1: If p is a probability density in d-space, the best-fit line for p is the line ℓ passing through the origin that minimizes the expected squared (perpendicular) distance to the line, namely,

∫ dist(x, ℓ)² p(x) dx.

Recall that a k-dimensional subspace is the best-fit subspace if the sum of squared distances to it is minimized or, equivalently, the sum of squared lengths of projections onto it is maximized. This was defined for a set of points, but again it can be extended to a density as above.

Definition 4.2: If p is a probability density in d-space and V is a subspace, then the expected squared perpendicular distance of V to p, denoted f(V, p), is given by

f(V, p) = ∫ dist²(x, V) p(x) dx,

where dist(x, V) denotes the perpendicular distance from the point x to the subspace V.

For the uniform density on the unit circle centered at the origin, it is easy to see that any line passing through the origin is a best-fit line for the probability distribution.

Lemma 1.13 Let the probability density p be a spherical Gaussian with center µ ≠ 0. The best-fit 1-dimensional subspace is the line passing through µ and the origin.

Proof: For a randomly chosen x (according to p) and a fixed unit-length vector v,

E[(vᵀx)²] = E[(vᵀ(x − µ) + vᵀµ)²]
          = E[(vᵀ(x − µ))² + 2(vᵀµ)(vᵀ(x − µ)) + (vᵀµ)²]
          = E[(vᵀ(x − µ))²] + 2(vᵀµ) E[vᵀ(x − µ)] + (vᵀµ)²
          = E[(vᵀ(x − µ))²] + (vᵀµ)²
          = σ² + (vᵀµ)²

since E[(vᵀ(x − µ))²] is the variance in the direction v and E[vᵀ(x − µ)] = 0. The lemma follows from the fact that the best-fit line v is the one that maximizes (vᵀµ)², which is maximized when v is aligned with the center µ.

Lemma 1.14 For a spherical Gaussian with center µ, a k-dimensional subspace is a best-fit subspace if and only if it contains µ.

Proof: By symmetry, every k-dimensional subspace through µ has the same expected squared distance to the density. Now, by the SVD procedure, we know that the best-fit k-dimensional subspace contains the best-fit line, i.e., contains µ. Thus, the lemma follows.

This immediately leads to the following theorem.

Theorem 1.15 If p is a mixture of k spherical Gaussians whose centers span a k-dimensional subspace, then the best-fit k-dimensional subspace is the one containing the centers.

Proof: Let p be the mixture w₁p₁ + w₂p₂ + ... + w_k p_k. Let V be any subspace of dimension k or less. Then the expected squared perpendicular distance of V to p is

f(V, p) = ∫ dist²(x, V) p(x) dx
        = Σᵢ₌₁ᵏ wᵢ ∫ dist²(x, V) pᵢ(x) dx
        ≥ Σᵢ₌₁ᵏ wᵢ (squared distance of pᵢ to its best-fit k-dimensional subspace).

Choose V to be the space spanned by the centers of the densities pᵢ. By Lemma 1.14 the last inequality becomes an equality, proving the theorem.

For an infinite set of points drawn according to the mixture, the k-dimensional SVD subspace gives exactly the space of the centers. In reality, we have only a large number of samples drawn according to the mixture. However, it is intuitively clear that as the number of samples increases, the set of sample points approximates the probability density, and so the SVD subspace of the sample is close to the space spanned by the centers. The details of how close it gets as a function of the number of samples are technical and we do not carry them out here.
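To make the preceding discussion concrete, here is a small numerical sketch (ours, not the text's; it assumes numpy is available). It draws samples from k = 3 well-separated spherical Gaussians in d = 50 dimensions, computes the top-k right singular vectors of the sample matrix, and checks that projecting onto their span preserves the inter-center separation while reducing the ambient dimension from d to k.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, sigma, n_per = 50, 3, 1.0, 200

# Three well-separated centers (hypothetical data, 20 standard deviations apart).
centers = np.zeros((k, d))
for i in range(k):
    centers[i, i] = 20.0

# Rows of A are n = k * n_per sample points drawn from the mixture.
A = np.vstack([c + sigma * rng.standard_normal((n_per, d)) for c in centers])

# The top-k right singular vectors approximately span the space of the centers.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt[:k].T                 # d x k orthonormal basis of the best-fit subspace

# Projection onto the subspace preserves the inter-center separation.
orig_gap = np.linalg.norm(centers[0] - centers[1])
proj_gap = np.linalg.norm((centers[0] - centers[1]) @ V)
print(round(orig_gap, 1), round(proj_gap, 1))
```

With this much separation the two printed numbers nearly coincide, while the within-cluster spread in the projection is only on the order of σ√k rather than σ√d.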

1.5.3 An Application of SVD to a Discrete Optimization Problem

In the last example, SVD was used as a dimension reduction technique. It found a k-dimensional subspace (the space of centers) of a d-dimensional space and made the Gaussian clustering problem easier by projecting the data onto the subspace. Here, instead of fitting a model to data, we have an optimization problem. Again, applying dimension reduction to the data makes the problem easier. The use of SVD to solve discrete optimization problems is a relatively new subject with many applications. We start with an important NP-hard problem, the Maximum Cut Problem for a directed graph G(V, E).

The Maximum Cut Problem is to partition the node set V of a directed graph into two subsets S and S̄ so that the number of edges from S to S̄ is maximized. Let A be the adjacency matrix of the graph. With each vertex i, associate an indicator variable xᵢ. The variable xᵢ will be set to 1 for i ∈ S and 0 for i ∈ S̄. The vector x = (x₁, x₂, ..., x_n) is unknown and we are trying to find it (or equivalently the cut), so as to maximize the number of edges across the cut. The number of edges across the cut is precisely

Σᵢ,ⱼ xᵢ(1 − xⱼ)aᵢⱼ.

Thus, the Maximum Cut Problem can be posed as the optimization problem

Maximize Σᵢ,ⱼ xᵢ(1 − xⱼ)aᵢⱼ subject to xᵢ ∈ {0, 1}.

In matrix notation,

Σᵢ,ⱼ xᵢ(1 − xⱼ)aᵢⱼ = xᵀA(1 − x),

where 1 denotes the vector of all 1's. So, the problem can be restated as

Maximize xᵀA(1 − x) subject to xᵢ ∈ {0, 1}.    (1.1)

The SVD is used to solve this problem approximately by computing the SVD of A and replacing A by A_k = Σᵢ₌₁ᵏ σᵢuᵢvᵢᵀ in (1.1) to get

Maximize xᵀA_k(1 − x) subject to xᵢ ∈ {0, 1}.    (1.2)

Note that the matrix A_k is no longer a 0-1 adjacency matrix. We will show that:

1. For each 0-1 vector x, xᵀA_k(1 − x) and xᵀA(1 − x) differ by at most n²/√(k+1). Thus, the maxima in (1.1) and (1.2) differ by at most this amount.

2.
A near-optimal x for (1.2) can be found by exploiting the low rank of A_k; by Item 1, this x is also near-optimal for (1.1), where near-optimal means with additive error of at most n²/√(k+1).
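As a quick sanity check of Item 1, the following sketch (ours, assuming numpy) compares the cut objective xᵀA(1 − x) with its low-rank surrogate xᵀA_k(1 − x) on a random directed graph and confirms the gap stays below n²/√(k+1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 2

# Random directed graph: 0-1 adjacency matrix, no self-loops.
A = (rng.random((n, n)) < 0.3).astype(float)
np.fill_diagonal(A, 0)

# Rank-k approximation A_k from the SVD of A.
U, s, Vt = np.linalg.svd(A)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# Compare the two objectives on random 0-1 vectors x.
bound = n**2 / np.sqrt(k + 1)
worst = 0.0
for _ in range(100):
    x = (rng.random(n) < 0.5).astype(float)
    exact = x @ A @ (1 - x)       # edges from S to its complement
    approx = x @ A_k @ (1 - x)
    worst = max(worst, abs(exact - approx))
print(worst <= bound)             # the gap never exceeds n^2 / sqrt(k+1)
```

In practice the observed gap is far smaller than the worst-case bound, since it is really controlled by n·σ_{k+1}.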

First, we prove Item 1. Since x and 1 − x are 0-1 n-vectors, each has length at most √n. By the definition of the 2-norm,

|(A − A_k)(1 − x)| ≤ √n ‖A − A_k‖₂.

Now, since xᵀ(A − A_k)(1 − x) is the dot product of the vector x with the vector (A − A_k)(1 − x),

|xᵀ(A − A_k)(1 − x)| ≤ n ‖A − A_k‖₂.

By Lemma 1.8, ‖A − A_k‖₂ = σ_{k+1}(A). The inequalities

(k + 1)σ²_{k+1} ≤ σ₁² + σ₂² + ... + σ²_{k+1} ≤ ‖A‖²_F = Σᵢ,ⱼ a²ᵢⱼ ≤ n²

imply that σ²_{k+1} ≤ n²/(k+1), and hence ‖A − A_k‖₂ ≤ n/√(k+1), proving Item 1.

Next, we focus on Item 2. It is instructive to look at the special case when k = 1 and A is approximated by the rank-one matrix A₁. An even more special case, when the left and right singular vectors u and v are required to be identical, is already NP-hard to solve exactly because it subsumes the problem of whether for a set of n integers {a₁, a₂, ..., a_n} there is a partition into two subsets whose sums are equal. So, we look for algorithms that solve the Maximum Cut Problem approximately.

For Item 2, we want to maximize Σᵢ₌₁ᵏ σᵢ(xᵀuᵢ)(vᵢᵀ(1 − x)) over 0-1 vectors x. A piece of notation will be useful. For any S ⊆ {1, 2, ..., n}, write uᵢ(S) for the sum of the coordinates of the vector uᵢ corresponding to elements of the set S, that is, uᵢ(S) = Σ_{j∈S} uᵢⱼ, and similarly for vᵢ. We will maximize Σᵢ₌₁ᵏ σᵢuᵢ(S)vᵢ(S̄) using dynamic programming.

For a subset S of {1, 2, ..., n}, define the 2k-dimensional vector

w(S) = (u₁(S), v₁(S̄), u₂(S), v₂(S̄), ..., u_k(S), v_k(S̄)).

If we had the list of all such vectors, we could compute Σᵢ₌₁ᵏ σᵢuᵢ(S)vᵢ(S̄) for each of them and take the maximum. There are 2ⁿ subsets S, but several S could have the same w(S), and in that case it suffices to list just one of them. Round each coordinate of each uᵢ to the nearest integer multiple of 1/(nk²). Call the rounded vector ũᵢ. Similarly obtain ṽᵢ. Let w̃(S) denote the vector (ũ₁(S), ṽ₁(S̄), ũ₂(S), ṽ₂(S̄), ..., ũ_k(S), ṽ_k(S̄)). We will construct a list of all possible values of the vector w̃(S).
[Again, if several different S's lead to the same vector w̃(S), we will keep only one copy on the list.] The list will be constructed by dynamic programming. For the recursive step, assume we already have a list of all such vectors for S ⊆ {1, 2, ..., i} and wish to construct the list for S′ ⊆ {1, 2, ..., i + 1}. Each S ⊆ {1, 2, ..., i} leads to two possible S′ ⊆ {1, 2, ..., i + 1}, namely S′ = S and S′ = S ∪ {i + 1}. In the first case, the element i + 1 joins the complement, so

w̃(S′) = (ũ₁(S), ṽ₁(S̄) + ṽ_{1,i+1}, ũ₂(S), ṽ₂(S̄) + ṽ_{2,i+1}, ...).

In the second case,

w̃(S′) = (ũ₁(S) + ũ_{1,i+1}, ṽ₁(S̄), ũ₂(S) + ũ_{2,i+1}, ṽ₂(S̄), ...).

We put in these two vectors for each vector in the previous list. Then, crucially, we prune, i.e., eliminate duplicates.

Assume that k is constant. Now we show that the error is at most n²/√(k+1) as claimed. Since uᵢ, vᵢ are unit-length vectors, |uᵢ(S)|, |vᵢ(S̄)| ≤ √n. Also, |ũᵢ(S) − uᵢ(S)| ≤ n · 1/(nk²) = 1/k², and similarly for vᵢ. To bound the error, we use an elementary fact: if a and b are reals with |a|, |b| ≤ M, and we estimate a by a′ and b by b′ so that |a − a′|, |b − b′| ≤ δ ≤ M, then

|ab − a′b′| = |a(b − b′) + b′(a − a′)| ≤ |a| |b − b′| + (|b| + |b − b′|) |a − a′| ≤ 3Mδ.
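The rounding-and-pruning dynamic program above can be sketched in a few lines for the k = 1 case (our illustration, assuming numpy; the brute-force comparison is only feasible because n is tiny here). States are the possible rounded pairs (ũ(S), ṽ(S̄)), stored as integer multiples of the rounding step 1/(nk²):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n, k = 12, 1                         # k = 1: maximize sigma_1 * u(S) * v(S_bar)

A = (rng.random((n, n)) < 0.4).astype(float)
np.fill_diagonal(A, 0)
U, s, Vt = np.linalg.svd(A)
sigma1, u, v = s[0], U[:, 0], Vt[0]

# Round coordinates to integer multiples of 1/(n k^2); store as integers.
step = 1.0 / (n * k * k)
uu = np.rint(u / step).astype(int)
vv = np.rint(v / step).astype(int)

# Dynamic program over elements; a state is one value of (u~(S), v~(S_bar)).
states = {(0, 0)}
for i in range(n):
    nxt = set()
    for a, b in states:
        nxt.add((a + uu[i], b))      # element i joins S
        nxt.add((a, b + vv[i]))      # element i joins the complement
    states = nxt                     # the set automatically prunes duplicates

dp_best = max(sigma1 * (a * step) * (b * step) for a, b in states)

# Brute force over all 2^n cuts for comparison.
bf_best = max(sigma1 * (u @ np.array(t)) * (v @ (1 - np.array(t)))
              for t in product([0.0, 1.0], repeat=n))
print(round(dp_best, 2), round(bf_best, 2))
```

By the elementary fact above, the two printed values differ by at most 3σ₁√n/k; the pruning is what keeps the state list polynomial for fixed k.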

Using this, we get

|Σᵢ₌₁ᵏ σᵢũᵢ(S)ṽᵢ(S̄) − Σᵢ₌₁ᵏ σᵢuᵢ(S)vᵢ(S̄)| ≤ 3kσ₁√n/k² = 3σ₁√n/k ≤ 3n^{3/2}/k,

and this meets the claimed error bound.

Next, we show that the running time is polynomially bounded. First, |ũᵢ(S)|, |ṽᵢ(S̄)| ≤ 2√n. Since ũᵢ(S) and ṽᵢ(S̄) are all integer multiples of 1/(nk²), each coordinate of w̃(S) takes at most 4n^{3/2}k² + 1 possible values, from which it follows that the list of vectors w̃(S) never gets larger than (4n^{3/2}k² + 1)^{2k}, which for fixed k is polynomially bounded. We summarize what we have accomplished.

Theorem 1.16 Given a directed graph G(V, E), a cut of size at least the maximum cut minus O(n²/√k) can be computed in polynomial time for any fixed k.

It would be quite a surprise to have an algorithm that actually achieves the same accuracy in time polynomial in n and k, because this would give an exact max cut in polynomial time.

1.5.4 SVD as a Compression Algorithm

Suppose A is the pixel intensity matrix of a large image. The entry aᵢⱼ gives the intensity of the ijᵗʰ pixel. If A is n × n, the transmission of A requires transmitting O(n²) real numbers. Instead, one could send A_k, that is, the top k singular values σ₁, σ₂, ..., σ_k along with the left and right singular vectors u₁, u₂, ..., u_k and v₁, v₂, ..., v_k. This would require sending O(kn) real numbers instead of O(n²). If k is much smaller than n, this results in savings. For many images, a k much smaller than n can be used to reconstruct the image, provided that a very low-resolution version of the image is sufficient. Thus, one could use SVD as a compression method.

It turns out that in a more sophisticated approach, for certain classes of pictures one could use a fixed basis so that the top (say) hundred singular vectors are sufficient to represent any picture approximately. This means that the space spanned by the top hundred singular vectors is not too different from the space spanned by the top two hundred singular vectors of a given matrix in that class.
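A minimal numpy sketch (ours, on synthetic data rather than a real image) of the O(kn)-versus-O(n²) trade-off: build a nearly-low-rank "image", transmit only the top k singular triplets, and measure the reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 128, 10

# A synthetic nearly-rank-k "image" (a stand-in for real pixel data).
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, n))
A += 0.01 * rng.standard_normal((n, n))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # the rank-k version we would transmit

sent = k + 2 * k * n                        # k singular values + 2k length-n vectors
rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(sent, n * n)                          # O(kn) numbers versus O(n^2)
```

Here 2,570 numbers stand in for 16,384, and because the data is nearly rank k the relative reconstruction error is tiny.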
Compressing these matrices using such a standard basis can save substantially, since the standard basis is transmitted only once and each matrix is then transmitted by sending only its top several hundred singular values for that basis.

1.5.5 Spectral Decomposition

Let B be a square matrix. If the vector x and the scalar λ are such that Bx = λx, then x is an eigenvector of the matrix B and λ is the corresponding eigenvalue. We present here a spectral decomposition theorem for the special case where B is of the form

B = AAᵀ for some (possibly rectangular) matrix A. If A is a real-valued matrix, then B is symmetric and positive semi-definite; that is, xᵀBx ≥ 0 for every vector x. The spectral decomposition theorem holds more generally; the interested reader should consult a linear algebra book.

Theorem 1.17 (Spectral Decomposition) If B = AAᵀ, then B = Σᵢ σᵢ²uᵢuᵢᵀ, where A = Σᵢ σᵢuᵢvᵢᵀ is the singular value decomposition of A.

Proof:

B = AAᵀ = (Σᵢ σᵢuᵢvᵢᵀ)(Σⱼ σⱼuⱼvⱼᵀ)ᵀ
  = Σᵢ,ⱼ σᵢσⱼ uᵢvᵢᵀvⱼuⱼᵀ
  = Σᵢ σᵢ² uᵢuᵢᵀ,

where the last step uses the orthonormality of the vᵢ: vᵢᵀvⱼ equals 1 when i = j and 0 otherwise.

When the σᵢ are all distinct, the uᵢ are the eigenvectors of B and the σᵢ² are the corresponding eigenvalues. If the σᵢ are not distinct, then any vector that is a linear combination of those uᵢ with the same eigenvalue is an eigenvector of B.

1.5.6 Singular Vectors and ranking documents

An important task for a document collection is to rank the documents. Recall the term-document vector representation from Chapter 2. A naïve method would be to rank in order of the total length of the document (which is the sum of the components of its term-document vector). Clearly, this is not a good measure in many cases. This naïve method attaches equal weight to each term and takes the projection (dot product) of the term-document vector in the direction of the all-1's vector. Is there a better weighting of terms, i.e., a better projection direction, which would measure the intrinsic relevance of the document to the collection? A good candidate is the best-fit direction for the collection of term-document vectors, namely the top (left) singular vector of the term-document matrix. An intuitive reason for this is that this direction has the maximum sum of squared projections of the collection and so can be thought of as a synthetic term-document vector best representing the document collection.

Ranking in order of the projection of each document's term vector along the best-fit direction has a nice interpretation in terms of the power method.
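Theorem 1.17 is easy to verify numerically. This sketch (ours, assuming numpy) checks that AAᵀ equals Σᵢ σᵢ²uᵢuᵢᵀ and that each left singular vector uᵢ is an eigenvector of B with eigenvalue σᵢ²:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))      # a rectangular matrix is fine
B = A @ A.T

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# B = sum_i sigma_i^2 u_i u_i^T  (Theorem 1.17)
B_rebuilt = sum(si**2 * np.outer(ui, ui) for si, ui in zip(s, U.T))
print(np.allclose(B, B_rebuilt))     # True

# Each u_i is an eigenvector of B with eigenvalue sigma_i^2.
for si, ui in zip(s, U.T):
    assert np.allclose(B @ ui, si**2 * ui)
```

Since A here is 5 × 3, the sum has only three terms, yet it reconstructs the 5 × 5 matrix B exactly: B has rank at most 3.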
For this, we consider a different example: that of the web with its hypertext links. The World Wide Web can be represented by a directed graph whose nodes correspond to web pages and whose directed edges correspond to hypertext links between pages. Some web pages, called authorities, are the most


### 3. INNER PRODUCT SPACES

. INNER PRODUCT SPACES.. Definition So far we have studied abstract vector spaces. These are a generalisation of the geometric spaces R and R. But these have more structure than just that of a vector space.

### by the matrix A results in a vector which is a reflection of the given

Eigenvalues & Eigenvectors Example Suppose Then So, geometrically, multiplying a vector in by the matrix A results in a vector which is a reflection of the given vector about the y-axis We observe that

### LECTURE 1 I. Inverse matrices We return now to the problem of solving linear equations. Recall that we are trying to find x such that IA = A

LECTURE I. Inverse matrices We return now to the problem of solving linear equations. Recall that we are trying to find such that A = y. Recall: there is a matri I such that for all R n. It follows that

### DETERMINANTS. b 2. x 2

DETERMINANTS 1 Systems of two equations in two unknowns A system of two equations in two unknowns has the form a 11 x 1 + a 12 x 2 = b 1 a 21 x 1 + a 22 x 2 = b 2 This can be written more concisely in

### Factorization Theorems

Chapter 7 Factorization Theorems This chapter highlights a few of the many factorization theorems for matrices While some factorization results are relatively direct, others are iterative While some factorization

### UNIT 2 MATRICES - I 2.0 INTRODUCTION. Structure

UNIT 2 MATRICES - I Matrices - I Structure 2.0 Introduction 2.1 Objectives 2.2 Matrices 2.3 Operation on Matrices 2.4 Invertible Matrices 2.5 Systems of Linear Equations 2.6 Answers to Check Your Progress

### Hypothesis Testing in the Classical Regression Model

LECTURE 5 Hypothesis Testing in the Classical Regression Model The Normal Distribution and the Sampling Distributions It is often appropriate to assume that the elements of the disturbance vector ε within

### 2.5 Complex Eigenvalues

1 25 Complex Eigenvalues Real Canonical Form A semisimple matrix with complex conjugate eigenvalues can be diagonalized using the procedure previously described However, the eigenvectors corresponding

### ADVANCED LINEAR ALGEBRA FOR ENGINEERS WITH MATLAB. Sohail A. Dianat. Rochester Institute of Technology, New York, U.S.A. Eli S.

ADVANCED LINEAR ALGEBRA FOR ENGINEERS WITH MATLAB Sohail A. Dianat Rochester Institute of Technology, New York, U.S.A. Eli S. Saber Rochester Institute of Technology, New York, U.S.A. (g) CRC Press Taylor

### Induction. Margaret M. Fleck. 10 October These notes cover mathematical induction and recursive definition

Induction Margaret M. Fleck 10 October 011 These notes cover mathematical induction and recursive definition 1 Introduction to induction At the start of the term, we saw the following formula for computing

### Lecture 5 - Triangular Factorizations & Operation Counts

LU Factorization Lecture 5 - Triangular Factorizations & Operation Counts We have seen that the process of GE essentially factors a matrix A into LU Now we want to see how this factorization allows us

### Linear Systems. Singular and Nonsingular Matrices. Find x 1, x 2, x 3 such that the following three equations hold:

Linear Systems Example: Find x, x, x such that the following three equations hold: x + x + x = 4x + x + x = x + x + x = 6 We can write this using matrix-vector notation as 4 {{ A x x x {{ x = 6 {{ b General

### Introduction to Flocking {Stochastic Matrices}

Supelec EECI Graduate School in Control Introduction to Flocking {Stochastic Matrices} A. S. Morse Yale University Gif sur - Yvette May 21, 2012 CRAIG REYNOLDS - 1987 BOIDS The Lion King CRAIG REYNOLDS

### Lecture 6. Inverse of Matrix

Lecture 6 Inverse of Matrix Recall that any linear system can be written as a matrix equation In one dimension case, ie, A is 1 1, then can be easily solved as A x b Ax b x b A 1 A b A 1 b provided that

### ECEN 5682 Theory and Practice of Error Control Codes

ECEN 5682 Theory and Practice of Error Control Codes Convolutional Codes University of Colorado Spring 2007 Linear (n, k) block codes take k data symbols at a time and encode them into n code symbols.

### Foundations of Data Science

Foundations of Data Science, by John Hopcroft and Ravindran Kannan, Version 4/9/2013. These notes are a first draft of a book being written by Hopcroft and Kannan and in many places are incomplete.
