Lecture 7: Singular Value Decomposition

0368-3248-01 Algorithms in Data Mining, Fall 2013
Lecturer: Edo Liberty

Warning: This note may contain typos and other inaccuracies which are usually discussed during class. Please do not cite this note as a reliable source. If you find mistakes, please inform me.

We will see that any matrix $A \in \mathbb{R}^{m \times n}$ (w.l.o.g. $m \le n$) can be written as
\[
A = \sum_{\ell=1}^{m} \sigma_\ell u_\ell v_\ell^T \tag{1}
\]
\[
\forall \ell,\ \sigma_\ell \in \mathbb{R},\ \sigma_\ell \ge 0 \tag{2}
\]
\[
\forall \ell, \ell',\ \langle u_\ell, u_{\ell'} \rangle = \langle v_\ell, v_{\ell'} \rangle = \delta(\ell, \ell') \tag{3}
\]
To prove this, consider the matrix $AA^T \in \mathbb{R}^{m \times m}$. Set $u_\ell$ to be the $\ell$-th eigenvector of $AA^T$. By definition we have that $AA^T u_\ell = \lambda_\ell u_\ell$. Since $AA^T$ is positive semidefinite we have $\lambda_\ell \ge 0$. Since $AA^T$ is symmetric we have that $\forall \ell, \ell',\ \langle u_\ell, u_{\ell'} \rangle = \delta(\ell, \ell')$. Set $\sigma_\ell = \sqrt{\lambda_\ell}$ and $v_\ell = \frac{1}{\sigma_\ell} A^T u_\ell$. Now we can compute the following:
\[
\langle v_\ell, v_{\ell'} \rangle = \frac{1}{\sigma_\ell \sigma_{\ell'}} u_\ell^T A A^T u_{\ell'} = \frac{\lambda_{\ell'}}{\sigma_\ell \sigma_{\ell'}} \langle u_\ell, u_{\ell'} \rangle = \delta(\ell, \ell').
\]
We are only left to show that $A = \sum_{\ell=1}^{m} \sigma_\ell u_\ell v_\ell^T$. To do that we examine the difference multiplied by a test vector $w = \sum_{i=1}^{m} \alpha_i u_i$:
\[
w^T \Big( A - \sum_{\ell=1}^{m} \sigma_\ell u_\ell v_\ell^T \Big)
= \sum_i \alpha_i u_i^T \Big( A - \sum_{\ell=1}^{m} \sigma_\ell u_\ell v_\ell^T \Big)
= \sum_i \alpha_i u_i^T A - \sum_i \sum_{\ell=1}^{m} \delta(i,\ell)\, \alpha_i \sigma_\ell v_\ell^T
= \sum_i \alpha_i \sigma_i v_i^T - \sum_i \alpha_i \sigma_i v_i^T = 0,
\]
where we used $u_i^T A = \sigma_i v_i^T$, which follows from the definition of $v_i$.

The vectors $u_\ell$ and $v_\ell$ are called the left and right singular vectors of $A$, and the $\sigma_\ell$ are the singular values of $A$. It is customary to order the singular values in descending order, $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_m \ge 0$. Also, we will denote by $r$ the rank of $A$.

Here is another very convenient way to write the fact that $A = \sum_{\ell=1}^{m} \sigma_\ell u_\ell v_\ell^T$. Let $\Sigma \in \mathbb{R}^{r \times r}$ be a diagonal matrix whose entries are $\Sigma(i,i) = \sigma_i$, with $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r$. Let $U \in \mathbb{R}^{m \times r}$ be the matrix whose $i$-th column is the left singular vector of $A$ corresponding to the singular value $\sigma_i$. Let $V \in \mathbb{R}^{n \times r}$ be the matrix whose $i$-th column is the right singular vector of $A$ corresponding to the singular value $\sigma_i$. We have that $A = U \Sigma V^T$ and that $U^T U = V^T V = I_r$. Note that the sum goes only up to $r$, the rank of $A$; clearly, omitting the zero-valued singular values does not change the sum.
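To make the factorization concrete, here is a minimal NumPy sketch (my addition, not part of the original note) that computes the thin SVD of a random matrix and checks the identities $A = U\Sigma V^T$ and $U^T U = V^T V = I$. The matrix dimensions are arbitrary choices for illustration.

```python
import numpy as np

# A small random matrix with m <= n, as in the text (dimensions chosen arbitrarily).
m, n = 5, 8
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))

# Thin SVD: U is m x m here, s holds the singular values, Vt is m x n.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A equals the sum of the rank-one terms sigma_l * u_l * v_l^T.
A_rebuilt = sum(s[l] * np.outer(U[:, l], Vt[l, :]) for l in range(len(s)))
assert np.allclose(A, A_rebuilt)

# The singular vectors are orthonormal: U^T U = V^T V = I.
assert np.allclose(U.T @ U, np.eye(U.shape[1]))
assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))

print("singular values:", np.round(s, 3))
```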

Applications of the SVD

1. Determining range, null space, and rank (also numerical rank).
2. Matrix approximation.
3. Inverse and pseudo-inverse: If $A = U \Sigma V^T$ and $\Sigma$ is full rank, then $A^{-1} = V \Sigma^{-1} U^T$. If $\Sigma$ is singular, then its pseudo-inverse is given by $A^\dagger = V \Sigma^\dagger U^T$, where $\Sigma^\dagger$ is formed by replacing every nonzero diagonal entry by its reciprocal.
4. Least squares: If we need to solve $Ax = b$ in the least-squares sense, then $x_{LS} = V \Sigma^\dagger U^T b$.
5. Denoising: Small singular values typically correspond to noise. Take the matrix whose columns are the signals, compute its SVD, zero out the small singular values, and reconstruct.
6. Compression: We have signals as the columns of the matrix $S$, that is, the $i$-th signal is given by $S_i = \sum_{j=1}^{r} (\sigma_j v_{ij}) u_j$. If some of the $\sigma_j$ are small, we can discard them with small error, thus obtaining a compressed representation of each signal. We have to keep the coefficients $\sigma_j v_{ij}$ for each signal and the dictionary, that is, the vectors $u_j$ that correspond to the retained coefficients.

The SVD and the eigen-decomposition are related, but there are quite a few differences between them.

1. Not every matrix has an eigen-decomposition (not even every square matrix). Every matrix (even a rectangular one) has an SVD.
2. In an eigen-decomposition $A = X \Lambda X^{-1}$, the eigen-basis is not always orthogonal. The basis of singular vectors is always orthogonal.
3. In the SVD we have two singular spaces (right and left).
4. Computing the SVD of a matrix is more numerically stable.
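As a concrete illustration of items 3 and 4, the following sketch (my addition) builds $\Sigma^\dagger$ by reciprocating only the nonzero singular values and checks the resulting least-squares solution against NumPy's built-in routines. The cutoff used to decide which singular values count as numerically zero is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 4)) @ rng.standard_normal((4, 6))  # rank-deficient 10 x 6 matrix
b = rng.standard_normal(10)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Sigma^dagger: reciprocate the nonzero singular values, leave the (numerical) zeros at zero.
tol = 1e-10 * s[0]                       # arbitrary cutoff for "numerically zero"
s_dag = np.where(s > tol, 1.0 / s, 0.0)

# Pseudo-inverse and least-squares solution x_LS = V Sigma^dagger U^T b.
A_dag = Vt.T @ np.diag(s_dag) @ U.T
x_ls = A_dag @ b

# Agrees with NumPy's pseudo-inverse and minimum-norm least-squares solution (same cutoff).
assert np.allclose(A_dag, np.linalg.pinv(A, rcond=1e-10), atol=1e-8)
assert np.allclose(x_ls, np.linalg.lstsq(A, b, rcond=1e-10)[0], atol=1e-8)
```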

Rank-k approximation in the spectral norm

The following claims that the best approximation of $A$ by a rank-deficient matrix is obtained from the top singular values and vectors of $A$. More accurately:

Fact 0.1. Set
\[
A_k = \sum_{j=1}^{k} \sigma_j u_j v_j^T.
\]
Then,
\[
\min_{B \in \mathbb{R}^{m \times n},\ \mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}.
\]

Proof.
\[
\|A - A_k\|_2 = \Big\| \sum_{j=1}^{r} \sigma_j u_j v_j^T - \sum_{j=1}^{k} \sigma_j u_j v_j^T \Big\|_2 = \Big\| \sum_{j=k+1}^{r} \sigma_j u_j v_j^T \Big\|_2 = \sigma_{k+1},
\]
since $\sigma_{k+1}$ is the largest singular value of $A - A_k$. Alternatively, look at $U^T A_k V = \mathrm{diag}(\sigma_1, \dots, \sigma_k, 0, \dots, 0)$, which means that $\mathrm{rank}(A_k) = k$, and that
\[
\|A - A_k\|_2 = \|U^T (A - A_k) V\|_2 = \|\mathrm{diag}(0, \dots, 0, \sigma_{k+1}, \dots, \sigma_r)\|_2 = \sigma_{k+1}.
\]

Let $B$ be an arbitrary matrix with $\mathrm{rank}(B) \le k$. Then it has a null space of dimension at least $n - k$, say $\mathrm{null}(B) \supseteq \mathrm{span}(w_1, \dots, w_{n-k})$. A dimension argument shows that
\[
\mathrm{span}(w_1, \dots, w_{n-k}) \cap \mathrm{span}(v_1, \dots, v_{k+1}) \ne \{0\}.
\]
Let $w$ be a unit vector from the intersection. Since
\[
A w = \sum_{j=1}^{k+1} \sigma_j (v_j^T w) u_j,
\]
we have
\[
\|A - B\|_2^2 \ge \|(A - B) w\|_2^2 = \|A w\|_2^2 = \sum_{j=1}^{k+1} \sigma_j^2 (v_j^T w)^2 \ge \sigma_{k+1}^2 \sum_{j=1}^{k+1} (v_j^T w)^2 = \sigma_{k+1}^2,
\]
since $w \in \mathrm{span}\{v_1, \dots, v_{k+1}\}$ and the $v_j$ are orthonormal.

Rank-k approximation in the Frobenius norm

The same theorem holds for the Frobenius norm.

Theorem 0.1. Set
\[
A_k = \sum_{j=1}^{k} \sigma_j u_j v_j^T.
\]
Then,
\[
\min_{B \in \mathbb{R}^{m \times n},\ \mathrm{rank}(B) \le k} \|A - B\|_F = \|A - A_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}.
\]

Proof. Suppose $A = U \Sigma V^T$. Then
\[
\min_{\mathrm{rank}(B) \le k} \|A - B\|_F^2 = \min_{\mathrm{rank}(B) \le k} \|U \Sigma V^T - U U^T B V V^T\|_F^2 = \min_{\mathrm{rank}(B) \le k} \|\Sigma - U^T B V\|_F^2.
\]
Now,
\[
\|\Sigma - U^T B V\|_F^2 = \sum_{i=1}^{n} \big( \Sigma_{ii} - (U^T B V)_{ii} \big)^2 + \text{off-diagonal terms}.
\]
If $B$ is the best approximating matrix and $U^T B V$ is not diagonal, then write $U^T B V = D + O$, where $D$ is diagonal and $O$ contains the off-diagonal elements. Then the matrix $U D V^T$ is a better approximation, which is a contradiction. Thus, $U^T B V$ must be diagonal. Hence,
\[
\|\Sigma - D\|_F^2 = \sum_{i=1}^{n} (\sigma_i - d_i)^2 = \sum_{i=1}^{k} (\sigma_i - d_i)^2 + \sum_{i=k+1}^{n} \sigma_i^2,
\]
and this is minimized when $d_i = \sigma_i$ for $i = 1, \dots, k$. The best approximating matrix is $A_k = U D V^T$, and the approximation error is $\sqrt{\sum_{i=k+1}^{n} \sigma_i^2}$.
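A quick numerical check of both statements (my sketch, not part of the note): build $A_k$ from the top $k$ singular triples and compare the spectral and Frobenius errors against $\sigma_{k+1}$ and $\sqrt{\sum_{i>k} \sigma_i^2}$. The matrix size and the choice $k = 3$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 12))
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation: keep the top k singular triples.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Spectral-norm error equals sigma_{k+1} (Fact 0.1).
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])

# Frobenius-norm error equals sqrt of the sum of the discarded sigma_i^2 (Theorem 0.1).
assert np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:] ** 2)))
```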

Data mining applications of the SVD

Linear regression with the least-squares loss

In linear regression we aim to find the best linear approximation to a set of observed data. For the $m$ data points $\{x_1, \dots, x_m\}$, $x_i \in \mathbb{R}^n$, each receiving the target value $y_i$, we look for the weight vector $w$ that minimizes:
\[
\sum_{i=1}^{m} (x_i^T w - y_i)^2 = \|A w - y\|_2^2,
\]
where $A$ is the matrix that holds the data points as rows, $A_i = x_i^T$.

Proposition 0.1. The vector $w$ that minimizes $\|A w - y\|_2^2$ is $w = A^\dagger y = V \Sigma^\dagger U^T y$, for $A = U \Sigma V^T$ and $\Sigma^\dagger_{ii} = 1/\Sigma_{ii}$ if $\Sigma_{ii} > 0$ and $0$ otherwise.

Let us define $U_{+}$ and $U_{0}$ as the parts of $U$ corresponding to the positive and zero singular values of $A$, respectively. Also let $y_{\parallel}$ and $y_{\perp}$ be the two vectors such that $y = y_{\parallel} + y_{\perp}$, $U_{0}^T y_{\parallel} = 0$, and $U_{+}^T y_{\perp} = 0$. Since $y_{\parallel}$ and $y_{\perp}$ are orthogonal, we have that
\[
\|A w - y\|_2^2 = \|A w - y_{\parallel} - y_{\perp}\|_2^2 = \|A w - y_{\parallel}\|_2^2 + \|y_{\perp}\|_2^2.
\]
Now, since $y_{\parallel}$ is in the range of $A$, there is a solution $w$ for which $\|A w - y_{\parallel}\|_2^2 = 0$, namely $w = A^\dagger y = V \Sigma^\dagger U^T y$ for $A = U \Sigma V^T$. This is because $U \Sigma V^T V \Sigma^\dagger U^T y = y_{\parallel}$. Moreover, we get that the minimal cost is exactly $\|y_{\perp}\|_2^2$, which is independent of $w$.

PCA: optimal squared-loss dimension reduction

Given a set of $n$ vectors $x_1, \dots, x_n$ in $\mathbb{R}^m$, we look for a rank-$k$ projection matrix $P \in \mathbb{R}^{m \times m}$ that minimizes:
\[
\sum_{i=1}^{n} \|P x_i - x_i\|_2^2.
\]
If we denote by $A$ the matrix whose $i$-th column is $x_i$, then this is equivalent to minimizing $\|P A - A\|_F^2$. Since the best possible rank-$k$ approximation to the matrix $A$ is $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$, the best possible solution is a projection $P$ for which $P A = A_k$. This is achieved by $P = U_k U_k^T$, where $U_k$ is the matrix holding the first $k$ left singular vectors of $A$. If we define $y_i = U_k^T x_i$, we see that the values $y_i \in \mathbb{R}^k$ are optimally fitted to the set of points $x_i$ in the sense that they minimize:
\[
\min_{y_1, \dots, y_n} \ \min_{\Psi \in \mathbb{R}^{m \times k}} \ \sum_{i} \|\Psi y_i - x_i\|_2^2.
\]
The mapping $x_i \mapsto U_k^T x_i = y_i$ thus reduces the dimension of any set of points $x_1, \dots, x_n$ in $\mathbb{R}^m$ to a set of points $y_1, \dots, y_n$ in $\mathbb{R}^k$ optimally in the squared-loss sense. This is commonly referred to as Principal Component Analysis (PCA).

Closest orthogonal matrix

The SVD also allows us to find the orthogonal matrix that is closest to a given matrix. Again, suppose that $A = U \Sigma V^T$ and let $W$ be the orthogonal matrix that minimizes $\|A - W\|_F^2$ among all orthogonal matrices. Now,
\[
\|U \Sigma V^T - W\|_F^2 = \|U \Sigma V^T - U U^T W V V^T\|_F^2 = \|\Sigma - \widetilde{W}\|_F^2,
\]
where $\widetilde{W} = U^T W V$ is another orthogonal matrix. We need to find the orthogonal matrix $\widetilde{W}$ that is closest to $\Sigma$. Alternatively, we need to minimize $\|\widetilde{W}^T \Sigma - I\|_F^2$.

If $U$ is orthogonal and $D$ is diagonal with nonnegative entries, then
\[
\mathrm{trace}(U D) = \sum_{i,k} u_{ik} d_{ki} \le \sum_{i} \Big( \sum_{k} u_{ik}^2 \Big)^{1/2} \Big( \sum_{k} d_{ki}^2 \Big)^{1/2} = \sum_{i} \Big( \sum_{k} d_{ki}^2 \Big)^{1/2} = \sum_{i} \big( d_{ii}^2 \big)^{1/2} = \sum_{i} d_{ii} = \mathrm{trace}(D), \tag{4}
\]
by the Cauchy-Schwarz inequality applied to each row of $U$, whose rows have unit norm. Now,
\[
\|\widetilde{W}^T \Sigma - I\|_F^2 = \mathrm{trace}\big( (\widetilde{W}^T \Sigma - I)^T (\widetilde{W}^T \Sigma - I) \big)
= \mathrm{trace}(\Sigma \widetilde{W} \widetilde{W}^T \Sigma) - 2\,\mathrm{trace}(\widetilde{W}^T \Sigma) + \mathrm{trace}(I)
= \|\Sigma\|_F^2 - 2\,\mathrm{trace}(\widetilde{W}^T \Sigma) + n.
\]
Thus, we need to maximize $\mathrm{trace}(\widetilde{W}^T \Sigma)$, which by (4) is maximized by $\widetilde{W} = I$. Thus, the best approximating matrix is $W = U V^T$.
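The following sketch (my addition) illustrates the last two subsections: projecting points onto the span of the top-$k$ left singular vectors, and computing the closest orthogonal matrix $U V^T$. The dimensions and the choice of $k$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 6, 50, 2
X = rng.standard_normal((m, n))   # columns are the points x_1, ..., x_n in R^m

# PCA: project onto the span of the top-k left singular vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k = U[:, :k]
Y = U_k.T @ X                     # reduced k-dimensional coordinates y_i
P = U_k @ U_k.T                   # the rank-k projection P = U_k U_k^T

# The projection error matches the discarded spectrum: ||P A - A||_F^2 = sum_{i>k} sigma_i^2.
assert np.isclose(np.linalg.norm(P @ X - X, 'fro') ** 2, np.sum(s[k:] ** 2))

# Closest orthogonal matrix to a square matrix A is W = U V^T.
A = rng.standard_normal((5, 5))
U2, s2, Vt2 = np.linalg.svd(A)
W = U2 @ Vt2
assert np.allclose(W @ W.T, np.eye(5))   # W is indeed orthogonal
```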

Computing the SVD: the power method

We give a simple algorithm for computing the Singular Value Decomposition of a matrix $A \in \mathbb{R}^{m \times n}$. We start by computing the first singular value $\sigma_1$ and the corresponding left and right singular vectors $u_1$ and $v_1$ of $A$, for which $\min_{i < j} \log(\sigma_i / \sigma_j) \ge \lambda$:

1. Generate $x_0$ such that $x_0(i) \sim \mathcal{N}(0, 1)$.
2. $s \leftarrow \log\big(4 n \log(2n/\delta) / (\varepsilon \delta)\big) / (2 \lambda)$
3. for $i$ in $[1, \dots, s]$:
4.     $x_i \leftarrow A^T A x_{i-1}$
5. $v_1 \leftarrow x_s / \|x_s\|$
6. $\sigma_1 \leftarrow \|A v_1\|$
7. $u_1 \leftarrow A v_1 / \sigma_1$
8. return $(\sigma_1, u_1, v_1)$
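Here is a minimal NumPy implementation of this procedure (my sketch, not the note's code; the function name and parameters are mine). For simplicity the iteration count $s$ is taken as a parameter, and each iterate is renormalized, which the pseudocode above omits; this only prevents floating-point overflow and does not change the direction the iterate converges to.

```python
import numpy as np

def power_method_top(A, s, rng=None):
    """Approximate the top singular triple (sigma_1, u_1, v_1) of A by power iteration on A^T A."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(A.shape[1])   # x_0 with i.i.d. N(0, 1) entries
    for _ in range(s):
        x = A.T @ (A @ x)                 # x_i = A^T A x_{i-1}
        x /= np.linalg.norm(x)            # renormalize to avoid overflow
    v1 = x
    sigma1 = np.linalg.norm(A @ v1)
    u1 = A @ v1 / sigma1
    return sigma1, u1, v1

# Sanity check against NumPy's SVD on a random matrix.
rng = np.random.default_rng(4)
A = rng.standard_normal((20, 30))
sigma1, u1, v1 = power_method_top(A, s=300, rng=rng)
assert np.isclose(sigma1, np.linalg.svd(A, compute_uv=False)[0], rtol=1e-4)
```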

Demanding that the error in the estimation of σ 1 is less than ε gives the requirement on s. α 0 1 α 0 i σ1 σ i 2s n ε s logn α0 i /ε α0 1 2 logσ 1 /σ i 5 6 From the two-stability of the gaussian distribution we have that αi 0 N 0, 1. Therefore, Pr[α0 i > t] e t2 which gives that with probability at least 1 δ/2 we have for all i, αi 0 log2n/δ. Also, Pr[ α1 0 δ/4] δ/2 this is because Pr[ z < t] max r Ψ z r 2t for any distribution and the normal distribution function at zero takes it maximal value which is less than 2 Thus, with probability at least 1 δ we have that for all i, α 0 1 α 0 i log2n/δ δ/4. Combining all of the above we get that it is sufficient to set s = log4n log2n/δ/εδ/2λ = Ologn/εδ/λ in order to get ε precision with probability at least 1 δ. We now describe how to extend this to a full SVD of A. Since we have computed σ 1, u 1, v 1, we can repeat this procedure for A σ 1 u 1 v1 T = n i=2 σ iu i vi T. The top singular value and vectors of which are σ 2, u 2, v 2. Thus, computing the rank-k approximation of A requires Omnks = Omnk logn/εδ/λ operations. This is because computing A T Ax requires Omn operations and for each of the first k singular values and vectors this is performed s times. The main problem with this algorithm is that its running time is heavily influenced by the value of λ. Other variants of this algorithm are much less sensitive to the value of this parameter, but are out of the scope of this class. 6