0368-3248-01 - Algorithms in Data Mining, Fall 2013
Lecturer: Edo Liberty
Lecture 7: Singular Value Decomposition

Warning: This note may contain typos and other inaccuracies which are usually discussed during class. Please do not cite this note as a reliable source. If you find mistakes, please inform me.

We will see that any matrix $A \in \mathbb{R}^{m \times n}$ (w.l.o.g. $m \le n$) can be written as

$$A = \sum_{l=1}^{m} \sigma_l u_l v_l^T \quad (1)$$
$$\forall l \;\; \sigma_l \in \mathbb{R},\; \sigma_l \ge 0 \quad (2)$$
$$\forall l, l' \;\; \langle u_l, u_{l'} \rangle = \langle v_l, v_{l'} \rangle = \delta_{l,l'} \quad (3)$$

To prove this, consider the matrix $AA^T \in \mathbb{R}^{m \times m}$. Set $u_l$ to be the $l$-th eigenvector of $AA^T$. By definition we have that $AA^T u_l = \lambda_l u_l$. Since $AA^T$ is positive semidefinite we have $\lambda_l \ge 0$. Since $AA^T$ is symmetric we have that $\forall l, l' \;\; \langle u_l, u_{l'} \rangle = \delta_{l,l'}$. Set $\sigma_l = \sqrt{\lambda_l}$ and $v_l = \frac{1}{\sigma_l} A^T u_l$. Now we can compute the following:

$$\langle v_l, v_{l'} \rangle = \frac{1}{\sigma_l \sigma_{l'}} u_l^T A A^T u_{l'} = \frac{\lambda_{l'}}{\sigma_l \sigma_{l'}} \langle u_l, u_{l'} \rangle = \delta_{l,l'}$$

We are only left to show that $A = \sum_{l=1}^{m} \sigma_l u_l v_l^T$. To do that we examine the difference multiplied by a test vector $w = \sum_{i=1}^{m} \alpha_i u_i$:

$$w^T \left( A - \sum_{l=1}^{m} \sigma_l u_l v_l^T \right) = \sum_{i=1}^{m} \alpha_i u_i^T A - \sum_{i=1}^{m} \sum_{l=1}^{m} \delta_{i,l}\, \alpha_i \sigma_l v_l^T = \sum_{i=1}^{m} \alpha_i \sigma_i v_i^T - \sum_{i=1}^{m} \alpha_i \sigma_i v_i^T = 0$$

The vectors $u_l$ and $v_l$ are called the left and right singular vectors of $A$, and the $\sigma_l$ are the singular values of $A$. It is customary to order the singular values in descending order, $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_m \ge 0$. Also, we will denote by $r$ the rank of $A$. Here is another very convenient way to write the fact that $A = \sum_{l=1}^{r} \sigma_l u_l v_l^T$. Let $\Sigma \in \mathbb{R}^{r \times r}$ be a diagonal matrix whose entries are $\Sigma_{i,i} = \sigma_i$, with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$. Let $U \in \mathbb{R}^{m \times r}$ be the matrix whose $i$-th column is the left singular vector of $A$ corresponding to singular value $\sigma_i$. Let $V \in \mathbb{R}^{n \times r}$ be the matrix whose $i$-th column is the right singular vector of $A$ corresponding to singular value $\sigma_i$. We have that $A = U \Sigma V^T$ and that $U^T U = V^T V = I_r$. Note that the sum goes only up to $r$, which is the rank of $A$. Clearly, not summing over zero-valued singular values does not change the sum.
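The claims above are easy to check numerically. The following sketch (not part of the original note) verifies with numpy that the SVD reconstructs $A$ as a sum of rank-one terms and that the singular vectors are orthonormal:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 6                      # w.l.o.g. m <= n
A = rng.standard_normal((m, n))

# Thin SVD: U is m x m here, s holds the singular values, Vt is m x n.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A equals the sum of the rank-one terms sigma_l * u_l * v_l^T.
A_rebuilt = sum(s[l] * np.outer(U[:, l], Vt[l, :]) for l in range(m))
assert np.allclose(A, A_rebuilt)

# The columns of U (and rows of Vt) are orthonormal: U^T U = V^T V = I.
assert np.allclose(U.T @ U, np.eye(m))
assert np.allclose(Vt @ Vt.T, np.eye(m))

# Singular values are nonnegative and sorted in descending order.
assert np.all(s >= 0) and np.all(np.diff(s) <= 0)
```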
Applications of the SVD

1. Determining range, null space and rank (also numerical rank).
2. Matrix approximation.
3. Inverse and pseudo-inverse: If $A = U \Sigma V^T$ and $\Sigma$ is full rank, then $A^{-1} = V \Sigma^{-1} U^T$. If $\Sigma$ is singular, then the pseudo-inverse is given by $A^{\dagger} = V \Sigma^{\dagger} U^T$, where $\Sigma^{\dagger}$ is formed by replacing every nonzero entry by its reciprocal.
4. Least squares: If we need to solve $Ax = b$ in the least-squares sense, then $x_{LS} = V \Sigma^{\dagger} U^T b$.
5. Denoising: Small singular values typically correspond to noise. Take the matrix whose columns are the signals, compute its SVD, zero the small singular values, and reconstruct.
6. Compression: We have the signals as the columns of the matrix $S$, that is, the $i$-th signal is given by $S_i = \sum_{j=1}^{r} (\sigma_j v_{ij}) u_j$. If some of the $\sigma_j$ are small, we can discard them with small error, thus obtaining a compressed representation of each signal. We have to keep the coefficients $\sigma_j v_{ij}$ for each signal and the dictionary, that is, the vectors $u_j$ that correspond to the retained coefficients.

SVD and eigen-decomposition are related, but there are quite a few differences between them.

1. Not every matrix has an eigen-decomposition (not even every square matrix). Any matrix (even a rectangular one) has an SVD.
2. In an eigen-decomposition $A = X \Lambda X^{-1}$, the eigen-basis is not always orthogonal. The basis of singular vectors is always orthogonal.
3. In the SVD we have two singular spaces (right and left).
4. Computing the SVD of a matrix is more numerically stable.

Rank-k approximation in the spectral norm

The following claims that the best approximation to $A$ by a rank deficient matrix is obtained by the top singular values and vectors of $A$. More accurately:

Fact 0.1. Set

$$A_k = \sum_{j=1}^{k} \sigma_j u_j v_j^T.$$

Then,

$$\min_{B \in \mathbb{R}^{m \times n},\; \mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}.$$

Proof.

$$A - A_k = \sum_{j=1}^{r} \sigma_j u_j v_j^T - \sum_{j=1}^{k} \sigma_j u_j v_j^T = \sum_{j=k+1}^{r} \sigma_j u_j v_j^T$$

and thus $\sigma_{k+1}$ is the largest singular value of $A - A_k$, so $\|A - A_k\|_2 = \sigma_{k+1}$.
Alternatively, look at $U^T A_k V = \mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)$, which means that $\mathrm{rank}(A_k) = k$, and that

$$\|A - A_k\|_2 = \|U^T (A - A_k) V\|_2 = \|\mathrm{diag}(0, \ldots, 0, \sigma_{k+1}, \ldots, \sigma_r)\|_2 = \sigma_{k+1}.$$
Let $B$ be an arbitrary matrix with $\mathrm{rank}(B) \le k$. Then it has a null space of dimension at least $n - k$, say $\mathrm{null}(B) = \mathrm{span}(w_1, \ldots, w_{n-k})$. A dimension argument shows that

$$\mathrm{span}(w_1, \ldots, w_{n-k}) \cap \mathrm{span}(v_1, \ldots, v_{k+1}) \ne \{0\}.$$

Let $w$ be a unit vector from the intersection. Since

$$Aw = \sum_{j=1}^{k+1} \sigma_j (v_j^T w)\, u_j,$$

we have

$$\|A - B\|_2^2 \ge \|(A - B)w\|_2^2 = \|Aw\|_2^2 = \sum_{j=1}^{k+1} \sigma_j^2 (v_j^T w)^2 \ge \sigma_{k+1}^2 \sum_{j=1}^{k+1} (v_j^T w)^2 = \sigma_{k+1}^2,$$

since $Bw = 0$, $w \in \mathrm{span}(v_1, \ldots, v_{k+1})$, and the $v_j$ are orthogonal.

Rank-k approximation in the Frobenius norm

The same theorem holds with the Frobenius norm.

Theorem 0.1. Set

$$A_k = \sum_{j=1}^{k} \sigma_j u_j v_j^T.$$

Then,

$$\min_{B \in \mathbb{R}^{m \times n},\; \mathrm{rank}(B) \le k} \|A - B\|_F = \|A - A_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}.$$

Proof. Suppose $A = U \Sigma V^T$. Then, by the unitary invariance of the Frobenius norm,

$$\min_{\mathrm{rank}(B) \le k} \|A - B\|_F^2 = \min_{\mathrm{rank}(B) \le k} \|U \Sigma V^T - U U^T B V V^T\|_F^2 = \min_{\mathrm{rank}(B) \le k} \|\Sigma - U^T B V\|_F^2.$$

Now,

$$\|\Sigma - U^T B V\|_F^2 = \sum_{i=1}^{n} \left( \Sigma_{ii} - (U^T B V)_{ii} \right)^2 + \text{off-diagonal terms}.$$

If $B$ is the best approximating matrix and $U^T B V$ is not diagonal, then write $U^T B V = D + O$, where $D$ is diagonal and $O$ contains the off-diagonal elements. Then the matrix $U D V^T$ is a better approximation, which is a contradiction. Thus, $U^T B V$ must be diagonal. Hence,

$$\|\Sigma - D\|_F^2 = \sum_{i=1}^{n} (\sigma_i - d_i)^2 = \sum_{i=1}^{k} (\sigma_i - d_i)^2 + \sum_{i=k+1}^{n} \sigma_i^2,$$

and this is minimal when $d_i = \sigma_i$ for $i = 1, \ldots, k$. The best approximating matrix is $A_k = U D V^T$, and the approximation error is $\sqrt{\sum_{i=k+1}^{n} \sigma_i^2}$ (only the terms up to $r$ are nonzero).
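Both optimality results can be observed numerically. The following sketch (an illustration, not from the note) truncates the SVD of a random matrix and checks that the spectral and Frobenius errors match the formulas above:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 8, 10, 3
A = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation: keep the top k singular values/vectors.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
assert np.linalg.matrix_rank(A_k) == k

# Spectral-norm error equals sigma_{k+1} (index k in zero-based numpy).
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])

# Frobenius-norm error equals sqrt of the sum of squared discarded values.
assert np.isclose(np.linalg.norm(A - A_k, "fro"), np.sqrt(np.sum(s[k:] ** 2)))
```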
Data mining applications of the SVD

Linear regression in the least-squares sense

In linear regression we aim to find the best linear approximation to a set of observed data. For the $m$ data points $\{x_1, \ldots, x_m\}$, $x_i \in \mathbb{R}^n$, each receiving the value $y_i$, we look for the weight vector $w$ that minimizes:

$$\sum_{i=1}^{m} (x_i^T w - y_i)^2 = \|Aw - y\|_2^2,$$

where $A$ is the matrix that holds the data points as rows, $A_i = x_i^T$.

Proposition 0.1. The vector $w$ that minimizes $\|Aw - y\|_2^2$ is $w = A^{\dagger} y = V \Sigma^{\dagger} U^T y$, for $A = U \Sigma V^T$ and $\Sigma^{\dagger}_{ii} = 1/\Sigma_{ii}$ if $\Sigma_{ii} > 0$ and $0$ otherwise.

Let us define $U_{\parallel}$ and $U_{\perp}$ as the parts of $U$ corresponding to positive and zero singular values of $A$, respectively. Also let $y_{\parallel}$ and $y_{\perp}$ be the two vectors such that $y = y_{\parallel} + y_{\perp}$, $U_{\perp}^T y_{\parallel} = 0$, and $U_{\parallel}^T y_{\perp} = 0$. Since $y_{\parallel}$ and $y_{\perp}$ are orthogonal, we have that

$$\|Aw - y\|_2^2 = \|Aw - y_{\parallel} - y_{\perp}\|_2^2 = \|Aw - y_{\parallel}\|_2^2 + \|y_{\perp}\|_2^2.$$

Now, since $y_{\parallel}$ is in the range of $A$, there is a solution $w$ for which $\|Aw - y_{\parallel}\|_2^2 = 0$. Namely, $w = A^{\dagger} y = V \Sigma^{\dagger} U^T y$ for $A = U \Sigma V^T$. This is because $U \Sigma V^T V \Sigma^{\dagger} U^T y = y_{\parallel}$. Moreover, we get that the minimal cost is exactly $\|y_{\perp}\|_2^2$, which is independent of $w$.

PCA: Optimal squared loss dimension reduction

Given a set of $n$ vectors $x_1, \ldots, x_n$ in $\mathbb{R}^m$, we look for a rank-$k$ projection matrix $P \in \mathbb{R}^{m \times m}$ that minimizes:

$$\sum_{i=1}^{n} \|P x_i - x_i\|_2^2.$$

If we denote by $A$ the matrix whose $i$-th column is $x_i$, then this is equivalent to minimizing $\|PA - A\|_F^2$. Since the best possible rank-$k$ approximation to the matrix $A$ is $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$, the best possible solution is a projection $P$ for which $PA = A_k$. This is achieved by $P = U_k U_k^T$, where $U_k$ is the matrix whose columns are the first $k$ left singular vectors of $A$. If we define $y_i = U_k^T x_i$, we see that the values $y_i \in \mathbb{R}^k$ are optimally fitted to the set of points $x_i$ in the sense that they minimize:

$$\min_{y_1, \ldots, y_n} \min_{\Psi \in \mathbb{R}^{m \times k}} \sum_{i=1}^{n} \|\Psi y_i - x_i\|_2^2.$$

The mapping $x_i \mapsto U_k^T x_i = y_i$ thus reduces the dimension of any set of points $x_1, \ldots, x_n$ in $\mathbb{R}^m$ to a set of points $y_1, \ldots, y_n$ in $\mathbb{R}^k$ optimally in the squared loss sense.
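Both procedures above can be sketched in a few lines of numpy (an illustration, not from the note; the tolerance `1e-12` used to detect zero singular values is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)

# --- Least squares via the pseudo-inverse: w = V Sigma^+ U^T y ---
m, n = 20, 5
A = rng.standard_normal((m, n))          # data points as rows
y = rng.standard_normal(m)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_pinv = np.array([1.0 / x if x > 1e-12 else 0.0 for x in s])
w = Vt.T @ (s_pinv * (U.T @ y))

# Matches numpy's built-in least-squares solver.
w_ref, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(w, w_ref)

# --- PCA: project onto the span of the top-k left singular vectors ---
k = 2
X = rng.standard_normal((6, 30))         # points as columns, in R^6
Uk = np.linalg.svd(X, full_matrices=False)[0][:, :k]
Y = Uk.T @ X                             # reduced points y_i in R^k
X_proj = Uk @ Y                          # P X with P = Uk Uk^T

# The projection error equals the Frobenius error of the rank-k truncation.
s_all = np.linalg.svd(X, compute_uv=False)
assert np.isclose(np.linalg.norm(X - X_proj, "fro"),
                  np.sqrt(np.sum(s_all[k:] ** 2)))
```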
This is commonly referred to as Principal Component Analysis (PCA).

Closest orthogonal matrix

The SVD also allows us to find the orthogonal matrix that is closest to a given matrix. Again, suppose that $A = U \Sigma V^T$ and that $W$ is an orthogonal matrix that minimizes $\|A - W\|_F^2$ among all orthogonal matrices. Now,

$$\|U \Sigma V^T - W\|_F^2 = \|U \Sigma V^T - U U^T W V V^T\|_F^2 = \|\Sigma - \tilde{W}\|_F^2,$$

where $\tilde{W} = U^T W V$ is another orthogonal matrix. We need to find the orthogonal matrix $\tilde{W}$ that is closest to $\Sigma$. Alternatively, we need to minimize $\|\tilde{W}^T \Sigma - I\|_F^2$.
If $U$ is orthogonal and $D$ is diagonal and positive, then by the Cauchy-Schwarz inequality (applied to each row of $U$, whose norm is $1$),

$$\mathrm{trace}(UD) = \sum_{i} \sum_{k} u_{ik} d_{ki} \le \sum_{i} \Big( \sum_{k} u_{ik}^2 \Big)^{1/2} \Big( \sum_{k} d_{ki}^2 \Big)^{1/2} = \sum_{i} \Big( \sum_{k} d_{ki}^2 \Big)^{1/2} = \sum_{i} (d_{ii}^2)^{1/2} = \sum_{i} d_{ii} = \mathrm{trace}(D). \quad (4)$$

Now,

$$\|\tilde{W}^T \Sigma - I\|_F^2 = \mathrm{trace}\left( (\tilde{W}^T \Sigma - I)^T (\tilde{W}^T \Sigma - I) \right) = \mathrm{trace}(\Sigma^T \tilde{W} \tilde{W}^T \Sigma) - 2\,\mathrm{trace}(\tilde{W}^T \Sigma) + n = \|\Sigma\|_F^2 - 2\,\mathrm{trace}(\tilde{W}^T \Sigma) + n.$$

Thus, we need to maximize $\mathrm{trace}(\tilde{W}^T \Sigma)$. But by (4) this is maximized by $\tilde{W} = I$. Thus, the best approximating matrix is $W = U \tilde{W} V^T = U V^T$.

Computing the SVD: The power method

We give a simple algorithm for computing the Singular Value Decomposition of a matrix $A \in \mathbb{R}^{m \times n}$. We start by computing the first singular value $\sigma_1$ and the left and right singular vectors $u_1$ and $v_1$ of $A$, for which $\min_{i<j} \log(\sigma_i/\sigma_j) \ge \lambda$:

1. Generate $x_0$ such that $x_0(i) \sim \mathcal{N}(0, 1)$.
2. $s \leftarrow \log\big(4 \log(2n/\delta)/(\varepsilon\delta)\big)/(2\lambda)$
3. for $i$ in $[1, \ldots, s]$:
4.   $x_i \leftarrow A^T A x_{i-1}$
5. $v_1 \leftarrow x_s / \|x_s\|$
6. $\sigma_1 \leftarrow \|A v_1\|$
7. $u_1 \leftarrow A v_1 / \sigma_1$
8. return $\sigma_1, u_1, v_1$

Let us prove the correctness of this algorithm. First, write each vector $x_i$ as a linear combination of the right singular vectors of $A$, i.e. $x_i = \sum_j \alpha_j^i v_j$. From the fact that the $v_j$ are the eigenvectors of $A^T A$ corresponding to eigenvalues $\sigma_j^2$, we get that $\alpha_j^i = \alpha_j^{i-1} \sigma_j^2$. Thus, $\alpha_j^s = \alpha_j^0 \sigma_j^{2s}$. Looking at the ratio between the coefficients of $v_1$ and $v_i$ for $x_s$, we get that:

$$\frac{\langle x_s, v_1 \rangle}{\langle x_s, v_i \rangle} = \frac{\alpha_1^0}{\alpha_i^0} \left( \frac{\sigma_1}{\sigma_i} \right)^{2s}$$
Demanding that the error in the estimation of $\sigma_1$ is less than $\varepsilon$ gives the requirement on $s$:

$$\frac{\alpha_1^0}{\alpha_i^0} \left( \frac{\sigma_1}{\sigma_i} \right)^{2s} \ge \frac{n}{\varepsilon} \quad (5)$$

$$s \ge \frac{\log\big(n \alpha_i^0/(\varepsilon \alpha_1^0)\big)}{2 \log(\sigma_1/\sigma_i)} \quad (6)$$

From the 2-stability of the Gaussian distribution we have that $\alpha_i^0 \sim \mathcal{N}(0, 1)$. Therefore, $\Pr[\alpha_i^0 > t] \le e^{-t^2}$, which gives that with probability at least $1 - \delta/2$ we have, for all $i$, $\alpha_i^0 \le \sqrt{\log(2n/\delta)}$. Also, $\Pr[|\alpha_1^0| \le \delta/4] \le \delta/2$ (this is because $\Pr[|z| \le t] \le 2t \cdot \max_r \psi_z(r)$ for any distribution, and the normal density takes its maximal value at zero, which is less than $1$). Thus, with probability at least $1 - \delta$ we have, for all $i$,

$$\frac{\alpha_1^0}{\alpha_i^0} \ge \frac{\delta/4}{\sqrt{\log(2n/\delta)}}.$$

Combining all of the above, we get that it is sufficient to set $s = \log\big(4n \log(2n/\delta)/(\varepsilon\delta)\big)/(2\lambda) = O(\log(n/(\varepsilon\delta))/\lambda)$ in order to get $\varepsilon$ precision with probability at least $1 - \delta$.

We now describe how to extend this to a full SVD of $A$. Since we have computed $\sigma_1, u_1, v_1$, we can repeat this procedure on $A - \sigma_1 u_1 v_1^T = \sum_{i=2}^{r} \sigma_i u_i v_i^T$, whose top singular value and vectors are $\sigma_2, u_2, v_2$. Thus, computing the rank-$k$ approximation of $A$ requires $O(mnks) = O(mnk \log(n/(\varepsilon\delta))/\lambda)$ operations. This is because computing $A^T A x$ requires $O(mn)$ operations, and for each of the first $k$ singular values and vectors this is performed $s$ times.

The main problem with this algorithm is that its running time is heavily influenced by the value of $\lambda$. Other variants of this algorithm are much less sensitive to the value of this parameter, but they are out of the scope of this class.
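The power method and the deflation step above can be sketched as follows (an illustration, not from the note; `num_iters` plays the role of $s$ and is chosen as a fixed constant rather than via the $\varepsilon, \delta, \lambda$ bound, and the iterate is normalized inside the loop to avoid overflow):

```python
import numpy as np

def power_method_top_singular(A, num_iters=200, seed=0):
    """Estimate the top singular value/vectors of A by power iteration on A^T A."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[1])   # x_0 with N(0,1) entries
    for _ in range(num_iters):
        x = A.T @ (A @ x)                 # x_i = A^T A x_{i-1}
        x /= np.linalg.norm(x)            # renormalize each step (stability)
    v1 = x
    sigma1 = np.linalg.norm(A @ v1)
    u1 = A @ v1 / sigma1
    return sigma1, u1, v1

# Build a test matrix with a known, well separated spectrum.
rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.standard_normal((5, 5)))
V, _ = np.linalg.qr(rng.standard_normal((7, 5)))
A = U @ np.diag([5.0, 3.0, 2.0, 1.0, 0.5]) @ V.T

s1, u1, v1 = power_method_top_singular(A)
assert np.isclose(s1, 5.0)

# Deflation: subtract sigma_1 u_1 v_1^T and repeat for the next triplet.
s2, _, _ = power_method_top_singular(A - s1 * np.outer(u1, v1))
assert np.isclose(s2, 3.0)
```

Repeating the deflation $k$ times yields the rank-$k$ approximation discussed above.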