Lecture Topic: Low-Rank Approximations

Transcription

Low-Rank Approximations

We have seen principal component analysis. The extraction of the first principal component can be seen as an approximation of the original matrix by a rank-1 matrix. In this chapter, we will consider problems where a sparse matrix is given and one hopes to find a structured (e.g., low-rank), dense matrix as close as possible to it in some norm.

Jakub Mareček and Seán McGarraghy (UCD) Numerical Analysis and Software November 11, / 1

The Continuing Example

Consider the example of collaborative filtering: suppose we know only some elements $(i, j) \in E$ of a matrix $A \in \mathbb{R}^{m \times n}$, corresponding to ratings by $m$ users of $n$ movies or books. There, the set $\mathcal{M}$ of candidate matrices could be the rank-$r$ matrices, motivated by the best possible transformation to a new coordinate system with $r$ axes, such as "likes horror films" and "likes romantic comedies". Notice that in collaborative filtering, each user may rate only 200 of the movies on offer.

Another Example

One may also consider estimating the positions of sensors from some of their pair-wise distances, which is known as sensor network localisation. In many applications, e.g. in sewers, the sensors do not have a GPS signal, but they have low-power radios, which allow them to estimate their distance from a handful of the closest sensors. From these pair-wise measurements, one wants to recover the positions of all sensors.

Yet Another Example

In the most striking result, we will see that for random rank-$r$ matrices, knowing $O(nr(\log n)^2)$ randomly drawn elements makes it possible to reconstruct the complete matrix of $O(n^2)$ elements without any error, with high probability. This has far-reaching consequences. Consider, for instance, a digital camera: the price of sensors increases with the number of pixels, but many images are naturally low-rank. Although cameras with a single-pixel chip remain a curiosity, super-resolution techniques are widespread in medical imaging, where battery capacity is not a concern.

Key Concepts

A singular value and a pair of singular vectors of $A \in \mathbb{R}^{m \times n}$ are a scalar $\sigma \in \mathbb{R}$, $\sigma \ge 0$, and two non-zero vectors $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^n$ such that $Av = \sigma u$ and $A^T u = \sigma v$.

In a matrix completion problem, with some elements $(i, j) \in E$ of a matrix $A \in \mathbb{R}^{m \times n}$ known, you solve:

$$\min_{M \in \mathbb{R}^{m \times n}} \operatorname{rank}(M) \quad \text{s.t.} \quad M_{i,j} = A_{i,j} \quad \forall (i, j) \in E. \tag{1.1}$$
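
The defining relation of a singular pair can be checked numerically. The following is a minimal sketch, assuming NumPy and a randomly generated example matrix (the matrix, the seed, and the variable names are illustrative and not part of the lecture). It tests $Av = \sigma u$ together with the companion relation $A^T u = \sigma v$, which also holds for the pairs returned by an SVD routine.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))          # hypothetical 4-by-3 example matrix

# NumPy returns A = U @ diag(s) @ Vt with singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

sigma, u, v = s[0], U[:, 0], Vt[0, :]    # leading singular value and vectors

residual_right = np.linalg.norm(A @ v - sigma * u)     # checks A v = sigma u
residual_left = np.linalg.norm(A.T @ u - sigma * v)    # checks A^T u = sigma v
print(sigma >= 0, residual_right, residual_left)
```

Both residuals should be at the level of rounding error; the same check works for any of the other pairs `(s[i], U[:, i], Vt[i, :])`.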

Some Revision

Definition (Orthogonality). Two vectors $u, v \in \mathbb{R}^n$ are orthogonal if and only if their dot product $\sum_{i=1}^n u_i v_i$ is zero. For non-zero vectors, this corresponds to an angle of 90 degrees between them. The columns and rows of an orthogonal matrix $U \in \mathbb{R}^{n \times n}$ are orthogonal unit vectors, i.e., $U^T U = U U^T = I$, where $I$ is the identity matrix.

Some More Intuition

The linear transformation $x \mapsto Qx$, for an orthogonal $Q$, is an isometry, i.e., it preserves the dot product of vectors. Imagine a rotation or a reflection.
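
The isometry property is easy to observe in floating point. This is a small sketch, assuming NumPy; the orthogonal matrix is obtained here from a QR factorisation of a random matrix, which is just one convenient way to generate an example.

```python
import numpy as np

rng = np.random.default_rng(1)
# The Q factor of a QR factorisation of a square matrix is orthogonal.
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))

x = rng.standard_normal(5)
y = rng.standard_normal(5)

orthogonality_error = np.linalg.norm(Q.T @ Q - np.eye(5))   # Q^T Q = I
dot_before = x @ y
dot_after = (Q @ x) @ (Q @ y)                                # (Qx)^T (Qy) = x^T y
print(orthogonality_error, dot_before, dot_after)
```

Since lengths are dot products of a vector with itself, the same computation shows that `np.linalg.norm(Q @ x)` agrees with `np.linalg.norm(x)` up to rounding.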

Some Revision

Definition (Singular values and vectors of a matrix $A \in \mathbb{R}^{m \times n}$). For every matrix $A \in \mathbb{R}^{m \times n}$, there exists a decomposition $A = U \Sigma V^T$, where:

- $U$ is an $m \times m$ orthogonal matrix whose $m$ columns are left-singular vectors of $A$;
- $\Sigma$ is an $m \times n$ matrix with $\Sigma_{i,i} \ge 0$, $i \le \min\{m, n\}$, being the singular values of $A$ and all other elements 0;
- $V$ is an $n \times n$ orthogonal matrix whose $n$ columns are right-singular vectors of $A$.
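
The shapes in this definition can be made concrete with a short sketch, assuming NumPy and an arbitrary random $5 \times 3$ example. Note that `numpy.linalg.svd` returns $V^T$ rather than $V$, and the singular values as a vector, so the rectangular $\Sigma$ is assembled by hand here.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 3
A = rng.standard_normal((m, n))

# full_matrices=True gives the square factors of the definition:
# U is m-by-m, Vt (that is, V transposed) is n-by-n.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Build the m-by-n matrix Sigma with the singular values on its diagonal.
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)

reconstruction_error = np.linalg.norm(U @ Sigma @ Vt - A)
print(U.shape, Sigma.shape, Vt.shape, reconstruction_error)
```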

Some More Intuition

For a square $A$ with $\det(A) > 0$, $\Sigma$ is a scaling matrix and $U$, $V^T$ can be chosen to be rotation matrices. $U \Sigma V^T$ is then a composition of a rotation, a scaling, and another rotation.
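
A numerical SVD routine does not promise rotations: it may return two reflections instead. The sketch below, assuming NumPy and a hand-picked $2 \times 2$ matrix with positive determinant, shows one way to convert such a pair into two rotations without changing the product.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.5, 1.5]])                 # det(A) = 2.5 > 0
U, s, Vt = np.linalg.svd(A)

# det(A) > 0 forces det(U) * det(Vt) = +1, so the two orthogonal factors
# are either both rotations (det +1) or both reflections (det -1).
if np.linalg.det(U) < 0:
    # Flipping the sign of a matching column of U and row of Vt leaves
    # U @ diag(s) @ Vt unchanged and turns both factors into rotations.
    U[:, -1] *= -1
    Vt[-1, :] *= -1

product_error = np.linalg.norm(U @ np.diag(s) @ Vt - A)
print(np.linalg.det(U), np.linalg.det(Vt), product_error)
```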

Some More Intuition

Every matrix $A = U \Sigma V^T$ corresponds to a linear map $T : \mathbb{R}^n \to \mathbb{R}^m$. There are orthonormal bases of $\mathbb{R}^n$ and $\mathbb{R}^m$ such that $T$ maps the $i$-th basis vector of $\mathbb{R}^n$ to a non-negative multiple of the $i$-th basis vector of $\mathbb{R}^m$, for $i = 1, \ldots, \min\{m, n\}$. With respect to these bases, $T$ is represented by a diagonal matrix $\Sigma$ with non-negative real diagonal entries, which are the lengths of the semi-axes of the ellipsoid in $\mathbb{R}^m$ obtained by applying $T$ to the unit sphere in $\mathbb{R}^n$.

Formally, $T(x) := Ax$ for $A = U \Sigma V^T$, $T : \mathbb{R}^n \to \mathbb{R}^m$, with $T(V_i) = \sigma_i U_i$ for all $i = 1, \ldots, \min\{m, n\}$ and $T(V_i) = 0$ for $i > \min\{m, n\}$, where $U_i$ and $V_i$ denote the $i$-th columns of $U$ and $V$.

Singular Values: Perturbation Analysis

Much of the perturbation analysis we have seen for eigenvalues carries over. Let $0 < m \le n$, and let $A, B \in \mathbb{R}^{m \times n}$. The Weyl inequality, for example, reads:

$$\sigma_{i+j-1}(A + B) \le \sigma_i(A) + \sigma_j(B) \quad \text{for all } 1 \le i, j, \; i + j - 1 \le m. \tag{2.1}$$
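
The Weyl inequality can be spot-checked on random data. The following sketch assumes NumPy and two arbitrary $4 \times 6$ matrices; the helper `singular_values` is a name introduced here for illustration. The only subtlety is the shift between the 1-based indices of the formula and the 0-based indices of the arrays.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 6
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, n))

def singular_values(M):
    """Singular values of M in descending order."""
    return np.linalg.svd(M, compute_uv=False)

sA, sB, sAB = singular_values(A), singular_values(B), singular_values(A + B)

# Check sigma_{i+j-1}(A+B) <= sigma_i(A) + sigma_j(B) for all admissible
# 1-based index pairs (i, j); array index = formula index - 1.
weyl_holds = all(
    sAB[i + j - 2] <= sA[i - 1] + sB[j - 1] + 1e-12
    for i in range(1, m + 1)
    for j in range(1, m + 1)
    if i + j - 1 <= m
)
print(weyl_holds)
```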

Some Revision

We have seen a variety of norms of $x \in \mathbb{R}^n$:

Example

$\ell_1$ norm: $$\|x\|_1 := \sum_{i=1}^n |x_i| \tag{3.1}$$

Maximum norm: $$\|x\|_\infty := \max\{|x_1|, \ldots, |x_n|\}. \tag{3.2}$$

Some Revision

Let us consider a new concept, the conjugate norms $\|\cdot\|$ and $\|\cdot\|^*$. By definition,

$$\|z\|^* = \max_{\|y\| \le 1} y^T z. \tag{3.3}$$

In particular, the conjugate of $\|\cdot\|_2$ is $\|\cdot\|_2$ itself, and the conjugate of $\|\cdot\|_1$ is $\|\cdot\|_\infty$.
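
For small vectors, the maximum in the definition of the conjugate norm can be evaluated by brute force, because a linear function attains its maximum over a polytope at a vertex. The sketch below uses plain Python and a hypothetical vector `z`; it confirms that maximising $y^T z$ over the unit ball of the maximum norm gives $\|z\|_1$, and over the unit ball of the $\ell_1$ norm gives the maximum norm of $z$.

```python
from itertools import product

z = [3.0, -1.5, 0.5]   # hypothetical example vector

# Unit ball of the maximum norm: a cube with vertices {-1, +1}^n.
dual_of_max_norm = max(
    sum(yi * zi for yi, zi in zip(y, z))
    for y in product([-1.0, 1.0], repeat=len(z))
)
l1_norm = sum(abs(zi) for zi in z)

# Unit ball of the l1 norm: its vertices are the signed unit coordinate
# vectors, so y^T z is simply plus or minus one entry of z.
dual_of_l1_norm = max(sign * zi for zi in z for sign in (-1.0, 1.0))
max_norm = max(abs(zi) for zi in z)

print(dual_of_max_norm, l1_norm, dual_of_l1_norm, max_norm)
```

The maximising cube vertex is the sign pattern of `z`, which is the usual closed-form argument for this duality.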

Some Revision

Definition (Matrix norm). A function $\|\cdot\|$ on $\mathbb{R}^{m \times n}$ is a matrix norm if and only if:

- $\|A\| \ge 0$;
- $\|A\| = 0$ if and only if $A = 0$;
- $\|\alpha A\| = |\alpha| \, \|A\|$ for all $\alpha \in \mathbb{R}$ and $A \in \mathbb{R}^{m \times n}$;
- $\|A + B\| \le \|A\| + \|B\|$ for all $A, B \in \mathbb{R}^{m \times n}$.

Definition (Trace of $A \in \mathbb{R}^{n \times n}$). $\operatorname{trace}(A) = a_{11} + a_{22} + \cdots + a_{nn} = \sum_{i=1}^n a_{ii}$.

Some Revision

Nuclear norm: $$\|A\|_* := \operatorname{trace}\left(\sqrt{A^T A}\right) = \sum_{i=1}^{\min\{m, n\}} \sigma_i \tag{3.4}$$

Frobenius norm: $$\|A\|_F := \sqrt{\operatorname{trace}(A^T A)} = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 \right)^{1/2} = \sqrt{\sum_{i=1}^{\min\{m, n\}} \sigma_i^2} \tag{3.5}$$

Spectral norm: $$\|A\|_2 := \sqrt{\lambda_{\max}(A^T A)} = \sigma_{\max}(A) \tag{3.6}$$

where $\sqrt{A^T A}$ denotes a positive semidefinite $B$ such that $BB = A^T A$. The Frobenius norm is its own conjugate, and the spectral norm is the conjugate of the nuclear norm.
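
All three norms are available in NumPy and can be compared against their expressions in terms of singular values. This is a brief sketch on an arbitrary random matrix; the `ord` arguments `'nuc'`, `'fro'` and `2` are those of `numpy.linalg.norm`.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 6))
s = np.linalg.svd(A, compute_uv=False)      # singular values of A

nuclear = np.linalg.norm(A, 'nuc')          # sum of singular values
frobenius = np.linalg.norm(A, 'fro')        # sqrt of sum of squared entries
spectral = np.linalg.norm(A, 2)             # largest singular value

print(nuclear - s.sum(),
      frobenius - np.sqrt((s ** 2).sum()),
      frobenius - np.sqrt(np.trace(A.T @ A)),
      spectral - s.max())
```

All printed differences should be at the level of rounding error.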

Some More Understanding

Previously, we have mentioned that all matrix norms are equivalent. For a matrix $A \in \mathbb{R}^{m \times n}$ of rank $r$:

$$\|A\|_2 \le \|A\|_F \le \sqrt{r}\, \|A\|_2$$

$$\|A\|_F \le \|A\|_* \le \sqrt{r}\, \|A\|_F$$
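
One way to see where these bounds come from, sketched here under the assumption that $\sigma_1, \ldots, \sigma_r$ denote the non-zero singular values of $A$: the spectral, Frobenius and nuclear norms are the maximum, $\ell_2$ and $\ell_1$ norms of the vector of singular values, so the inequalities reduce to comparisons of vector norms in $\mathbb{R}^r$.

```latex
\begin{align*}
\|A\|_2 = \max_i \sigma_i
  &\le \Big(\sum_{i=1}^r \sigma_i^2\Big)^{1/2} = \|A\|_F
  \le \big(r \max_i \sigma_i^2\big)^{1/2} = \sqrt{r}\,\|A\|_2, \\
\|A\|_F^2 = \sum_{i=1}^r \sigma_i^2
  &\le \Big(\sum_{i=1}^r \sigma_i\Big)^2 = \|A\|_*^2, \\
\|A\|_* = \sum_{i=1}^r 1 \cdot \sigma_i
  &\le \sqrt{r}\,\Big(\sum_{i=1}^r \sigma_i^2\Big)^{1/2} = \sqrt{r}\,\|A\|_F
  \quad \text{(Cauchy--Schwarz)}.
\end{align*}
```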

Matrix Completion

In general, let us consider:

$$\min_{M \in \mathcal{M}} \|A - M\|_N$$

where $\mathcal{M} \subseteq \mathbb{R}^{m \times n}$ is some subset of $m \times n$ matrices and $\|\cdot\|_N$ is a matrix norm. In particular:

- $\|\cdot\|_2$ or $\|\cdot\|_F$, $\mathcal{M}$ is rank-$r$, $A$ is dense, $M$ is dense: SVD
- $\|\cdot\|_F$, $\mathcal{M}$ is rank-$r$, $A$ is sparse, $M$ is dense: NP-Hard
- various $N$, $\mathcal{M}$ is rank-1 with sparsity: NP-Hard

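Low-Rank Matrices

Theorem (Eckart and Young). Let us have a rank-$r$ matrix $A \in \mathbb{R}^{m \times n}$, $A = U \Sigma V^T = \sum_{i=1}^r \sigma_i u_i v_i^T$. Consider $k < r$ and the so-called truncated singular value decomposition

$$A_k = \sum_{i=1}^k \sigma_i u_i v_i^T.$$

Then

$$\arg\min_{B \in \mathbb{R}^{m \times n},\ \operatorname{rank}(B) \le k} \|A - B\|_F = \arg\min_{B \in \mathbb{R}^{m \times n},\ \operatorname{rank}(B) \le k} \|A - B\|_2 = A_k. \tag{4.1}$$

More visually,

$$A = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix} \begin{bmatrix} V_1 & V_2 \end{bmatrix}^T, \tag{4.2}$$

$$A_k = U_1 \Sigma_1 V_1^T, \tag{4.3}$$

where $\Sigma_1 \in \mathbb{R}^{k \times k}$, $U_1 \in \mathbb{R}^{m \times k}$, and $V_1 \in \mathbb{R}^{n \times k}$.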
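
The truncated SVD and the two error formulas implied by the theorem can be illustrated as follows. This is a sketch assuming NumPy; `truncated_svd` is a helper name introduced here, and the random rank-$k$ competitor `B` only illustrates, and does not prove, optimality.

```python
import numpy as np

def truncated_svd(A, k):
    """Return A_k, the sum of the k leading terms sigma_i u_i v_i^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(6)
A = rng.standard_normal((8, 6))
k = 2
A_k = truncated_svd(A, k)
s = np.linalg.svd(A, compute_uv=False)

spectral_error = np.linalg.norm(A - A_k, 2)        # equals sigma_{k+1}
frobenius_error = np.linalg.norm(A - A_k, 'fro')   # sqrt of sum_{i>k} sigma_i^2

# A random rank-k matrix should not approximate A better than A_k does.
B = rng.standard_normal((8, k)) @ rng.standard_normal((k, 6))
competitor_error = np.linalg.norm(A - B, 'fro')

print(spectral_error - s[k], frobenius_error - np.sqrt((s[k:] ** 2).sum()))
```

With 0-based indexing, `s[k]` is $\sigma_{k+1}$, the first singular value that the truncation discards.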

Low-Rank Matrices

There are a number of proofs. One can use the Weyl inequality:

$$\sigma_{i+j-1}(A + B) \le \sigma_i(A) + \sigma_j(B) \quad \text{for all } 1 \le i, j, \; i + j - 1 \le m.$$

If $B$ has rank at most $k$, then $\sigma_{k+1}(B) = 0$. One applies the inequality to $B$ and $A - B$, with $j = k + 1$. For the spectral norm, $i = 1$ suffices.
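
Spelled out, the argument can be sketched as follows, assuming $m \le n$ so that there are $m$ singular values, writing $A = (A - B) + B$, and letting $A_k$ denote the truncated singular value decomposition of $A$, whose error has singular values $\sigma_{k+1}(A), \ldots, \sigma_m(A)$.

```latex
\begin{align*}
\sigma_{i+k}(A) &= \sigma_{i+(k+1)-1}\big((A - B) + B\big)
  \le \sigma_i(A - B) + \sigma_{k+1}(B) = \sigma_i(A - B),
  \qquad 1 \le i \le m - k, \\
\|A - B\|_2 &= \sigma_1(A - B) \ge \sigma_{k+1}(A) = \|A - A_k\|_2, \\
\|A - B\|_F^2 &= \sum_{i=1}^{m} \sigma_i(A - B)^2
  \ge \sum_{i=1}^{m-k} \sigma_i(A - B)^2
  \ge \sum_{i=1}^{m-k} \sigma_{i+k}(A)^2 = \|A - A_k\|_F^2 .
\end{align*}
```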

Sparse Low-Rank Matrices

Consider again the applications of low-rank matrix reconstruction:

- predicting ratings of movies by individual users, in collaborative filtering, where each user has rated only 200 of the movies on offer, or
- estimating positions of sensors from some of their pair-wise distances, in sensor network localisation, where one may know the distances to 4 or 5 sensors.

They share the property that we know only a very small number of entries of the matrix. Imputing 0 or similar is a bad idea.

Sparse Low-Rank Matrices

Suppose we know only some elements $(i, j) \in E$ of a matrix $A \in \mathbb{R}^{m \times n}$. Assume that there exists only one rank-$r$ matrix $M$ with those entries. Then, the search for the simplest explanation fitting the observed data is:

$$\min_{M \in \mathbb{R}^{m \times n}} \operatorname{rank}(M) \quad \text{s.t.} \quad M_{i,j} = A_{i,j} \quad \forall (i, j) \in E. \tag{5.1}$$

The problem is:

- non-convex in $M$ and very hard;
- easy to reformulate in a number of ways.

Sparse Low-Rank Matrices

Suppose we know only some elements $(i, j) \in E$ of a matrix $A \in \mathbb{R}^{m \times n}$. Consider the fact that a rank-$r$ matrix can be written as $M = XY^T$, $X \in \mathbb{R}^{m \times r}$, $Y \in \mathbb{R}^{n \times r}$, and:

$$\arg\min_{X \in \mathbb{R}^{m \times r},\ Y \in \mathbb{R}^{n \times r}} \sum_{(i,j) \in E} \left( (XY^T)_{i,j} - A_{i,j} \right)^2 \tag{5.2}$$

The problem is:

- non-convex in $(X, Y)$ jointly;
- convex in either $X$ or $Y$ when the other is fixed.
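
The joint non-convexity has a simple witness: $(X, Y)$ and $(-X, -Y)$ give the same product, hence the same objective value, while their midpoint $(0, 0)$ does not. The sketch below assumes NumPy; the function `objective`, the Boolean `mask` encoding $E$, and the random data are illustrative choices made here.

```python
import numpy as np

def objective(X, Y, A, mask):
    """Sum of squared residuals over the observed entries (mask == True)."""
    residual = (X @ Y.T - A) * mask
    return float(np.sum(residual ** 2))

rng = np.random.default_rng(7)
m, n, r = 6, 5, 2
X_true = rng.standard_normal((m, r))
Y_true = rng.standard_normal((n, r))
A = X_true @ Y_true.T
mask = rng.random((m, n)) < 0.6             # hypothetical observation pattern E

f_plus = objective(X_true, Y_true, A, mask)           # zero: exact factors
f_minus = objective(-X_true, -Y_true, A, mask)        # zero: (-X)(-Y)^T = X Y^T
f_mid = objective(0 * X_true, 0 * Y_true, A, mask)    # midpoint of the two points

print(f_plus, f_minus, f_mid)
```

A convex function would satisfy `f_mid <= (f_plus + f_minus) / 2`, which fails here. With `Y` held fixed, on the other hand, the objective is a sum of squares of functions linear in `X`, which is why each sub-problem is an ordinary least-squares problem.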

Sparse Low-Rank Matrices

A rank-$r$ matrix has exactly $r$ non-zero singular values. Rank can hence be seen as the $\ell_0$ norm of the spectrum. Considering we have seen the $\ell_0$ norm being replaced by the $\ell_1$ norm, Fazel proposed to replace rank with the nuclear norm, i.e., the $\ell_1$ norm of the spectrum:

$$\arg\min_{M \in \mathbb{R}^{m \times n}} \|M\|_* \quad \text{subject to} \quad M_{i,j} = A_{i,j} \quad \forall (i, j) \in E \tag{5.3}$$

The problem is:

- convex in $M$ and possible to solve using interior-point methods;
- such that the optimum of the convex problem coincides with the global optimum of the non-convex problem (!) with high probability:

Sparse Low-Rank Matrices

Theorem (Candès and Recht). Let us assume $M \in \mathbb{R}^{m \times n}$ of rank $r$ is sampled from the random orthogonal model. Suppose we observe entries of $M$ with locations $E$ sampled uniformly at random. Then there are numerical constants $C_1$ and $C_2$ such that if

$$|E| \ge C_1\, r\, (\max\{m, n\})^{5/4} \log(\max\{m, n\}), \tag{5.4}$$

the minimiser of the nuclear-norm minimisation problem is unique and equal to $M$ with probability at least $1 - C_2 (\max\{m, n\})^{-3}$.

Sparse Low-Rank Matrices

The result of the previous theorem is of considerable theoretical and practical interest. It has been cited more than 1800 times. Although the nuclear-norm minimisation problem can be approximated to any fixed precision in polynomial time, this is limited in practice to modest sizes, with $n$ of about 1000. Notice that the interior-point method needs to invert the Hessian, where even the matrix variable is $n \times n$. One would hence like to find more efficient algorithms.
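
A rough back-of-the-envelope count shows why the Hessian is the bottleneck. The figures below are only a sketch, assuming $n = 1000$, a dense Hessian, and 8 bytes per entry; actual solvers exploit structure and may need considerably less.

```latex
\begin{align*}
\text{unknowns in the } n \times n \text{ matrix variable:}\quad & n^2 = 1000^2 = 10^6, \\
\text{entries of a dense Hessian:}\quad & (n^2)^2 = 10^{12}, \\
\text{storage at 8 bytes per entry:}\quad & 8 \times 10^{12}\ \text{bytes} = 8\ \text{TB}.
\end{align*}
```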

Sparse Low-Rank Matrices

Alternating Minimisation:

1. Partition $E = E_1 \cup E_2 \cup \cdots \cup E_{k_{\max}}$.
2. Compute the SVD $\sum_{i=1}^{\min\{m,n\}} \sigma_i x_i y_i^T$ considering only $E_1$.
3. Initialise $X^1$ and $Y^1$ from the leading singular triples $(\sigma_i, x_i, y_i)$, rescaled by a factor depending on $mn / |E_1|$.
4. For each iteration $k = 1, \ldots, k_{\max}$, with $k_{\max} \in O(\log n)$:

$$X^{k+1} = \arg\min_{X \in \mathbb{R}^{m \times r}} \sum_{(i,j) \in E_{k+1}} \left( (X (Y^k)^T)_{i,j} - A_{i,j} \right)^2 \tag{5.5}$$

$$Y^{k+1} = \arg\min_{Y \in \mathbb{R}^{n \times r}} \sum_{(i,j) \in E_{k+1}} \left( (X^{k+1} Y^T)_{i,j} - A_{i,j} \right)^2 \tag{5.6}$$

This:

- solves linear least squares twice in each iteration, in dimensions $mr$ and $nr$;
- generally takes $O((mr)^2)$ and $O((nr)^2)$, but for the partially separable structure, it is $O(|E| r^2)$ and $O(|E| r^2)$.
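
The scheme can be sketched in a few lines of NumPy. This is a simplified illustration rather than the algorithm as analysed: it reuses the whole set $E$ in every iteration instead of a fresh block $E_{k+1}$, it initialises only $Y$ from the SVD of the zero-filled observations rescaled by $mn/|E|$, and it runs a fixed number of iterations. The function names, the Boolean `mask` encoding $E$, and the test data are assumptions made here.

```python
import numpy as np

def solve_rows(F, A, mask):
    """Row-wise least squares: row i of the result minimises the squared
    error of F[observed] @ x against A[i, observed]."""
    out = np.zeros((A.shape[0], F.shape[1]))
    for i in range(A.shape[0]):
        cols = np.flatnonzero(mask[i])
        if cols.size:   # rows without observations keep a zero factor
            out[i], *_ = np.linalg.lstsq(F[cols], A[i, cols], rcond=None)
    return out

def alternating_minimisation(A, mask, r, iterations=100):
    m, n = A.shape
    # Spectral initialisation from the zero-filled, rescaled observations.
    p = mask.sum() / (m * n)
    _, s, Vt = np.linalg.svd(np.where(mask, A, 0.0) / p, full_matrices=False)
    Y = Vt[:r].T * np.sqrt(s[:r])
    for _ in range(iterations):
        X = solve_rows(Y, A, mask)          # update X with Y fixed, cf. (5.5)
        Y = solve_rows(X, A.T, mask.T)      # update Y with X fixed, cf. (5.6)
    return X, Y

rng = np.random.default_rng(8)
m, n, r = 30, 20, 2
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-2 ground truth
mask = rng.random((m, n)) < 0.6                                  # observed entries E
X, Y = alternating_minimisation(M, mask, r)
relative_error = np.linalg.norm(X @ Y.T - M) / np.linalg.norm(M)
print(relative_error)
```

Each row of `X` depends only on the observed entries in the corresponding row of `A`, so the least-squares problem in dimension $mr$ splits into $m$ independent problems in dimension $r$; this is the separable structure behind the $O(|E| r^2)$ cost.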

Sparse Low-Rank Matrices

Theorem (Keshavan et al.). Let us assume $M = X Y^T + W$, $X \in \mathbb{R}^{m \times r}$, $Y \in \mathbb{R}^{n \times r}$, $W \in \mathbb{R}^{m \times n}$, with elements of $W$, $X$, and $Y$ being bounded i.i.d. random variables, for $X$, $Y$ zero-mean, and the expectation of $W$ satisfying, among others, $\theta = \sigma_{\max}(\mathbb{E}[W])$ and

$$P\left( |W_{i,j} - \mathbb{E}[W_{i,j}]| \ge t \right) \le 2 \exp\left( -\frac{t^2}{2\omega^2} \right). \tag{5.7}$$

There exist constants $C_1$, $C_2$ such that for $k_{\max} = C_1 \log n$, $|E| \ge C_2 \kappa^8 n r (\log n)^2$, and $E$ uniformly distributed over all sets of size $|E|$, with probability larger than $1 - 1/n^4$, one has:

M (X k (Y k ) T ) F 6 r 2 2k + C 2 rκ 2 (θ + nω ) ɛ (5.8)

where $\kappa = \max\{\sigma_{\min}(X)^{-1}, \sigma_{\min}(Y)^{-1}\}$.

Regularisations of PCA

Alternating minimisation is a very general approach to optimisation problems. For example, consider a generalisation of PCA:

$$\max_{x \in \mathbb{R}^n} \{ \|Ax\|_v : \|x\|_2 \le 1, \ \|x\|_s \le k \}, \tag{6.1}$$

with $v$ being $\ell_2$ and no norm $s$ in standard PCA.

- $v$ of $\ell_1$ norm works better, in terms of perturbation analysis (stability, robustness).
- $s$ such as $\ell_1$ improves interpretability (sparsity in the loading vector) by approximating $\ell_0$.

As we have seen in the previous chapter, the $\ell_1$ norm is non-smooth.

Regularisations

Richtárik et al. summarise 8 possible regularisations of the problem of computing the first PC by combining:

- two norms for measuring variance ($\ell_1$, $\ell_2$), and
- two sparsity-inducing norms (cardinality $\ell_0$ and $\ell_1$),
- used either in a constraint or in a penalty term.

All have the form

$$\mathrm{OPT} = \max_{x \in X} f(x), \tag{6.2}$$

for some $X \subset \mathbb{R}^n$ and objective $f$.

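Regularisations

# | v | s | s use | X | f(x)
1 | L2 | L0 | constraint | $\{x \in \mathbb{R}^n : \|x\|_2 \le 1, \|x\|_0 \le s\}$ | $\|Ax\|_2$
2 | L1 | L0 | constraint | $\{x \in \mathbb{R}^n : \|x\|_2 \le 1, \|x\|_0 \le s\}$ | $\|Ax\|_1$
3 | L2 | L1 | constraint | $\{x \in \mathbb{R}^n : \|x\|_2 \le 1, \|x\|_1 \le \sqrt{s}\}$ | $\|Ax\|_2$
4 | L1 | L1 | constraint | $\{x \in \mathbb{R}^n : \|x\|_2 \le 1, \|x\|_1 \le \sqrt{s}\}$ | $\|Ax\|_1$
5 | L2 | L0 | penalty | $\{x \in \mathbb{R}^n : \|x\|_2 \le 1\}$ | $\|Ax\|_2^2 - \gamma \|x\|_0$
6 | L1 | L0 | penalty | $\{x \in \mathbb{R}^n : \|x\|_2 \le 1\}$ | $\|Ax\|_1^2 - \gamma \|x\|_0$
7 | L2 | L1 | penalty | $\{x \in \mathbb{R}^n : \|x\|_2 \le 1\}$ | $\|Ax\|_2 - \gamma \|x\|_1$
8 | L1 | L1 | penalty | $\{x \in \mathbb{R}^n : \|x\|_2 \le 1\}$ | $\|Ax\|_1 - \gamma \|x\|_1$

Table 1: Eight regularisations of PCA, quoted verbatim from Richtárik et al.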

Regularisations

Let $Y := \{y \in \mathbb{R}^m : \|y\|_2 \le 1\}$ for the $\ell_2$ norm and $Y := \{y \in \mathbb{R}^m : \|y\|_\infty \le 1\}$ for the $\ell_1$ norm, and let $F(x, y)$ be the function obtained from $f(x)$ after replacing $\|Ax\|$ with $y^T A x$ (resp. $\|Ax\|^2$ with $(y^T A x)^2$). Then, in view of the above, (6.2) takes on the equivalent form

$$\mathrm{OPT} = \max_{x \in X} \max_{y \in Y} F(x, y). \tag{6.3}$$

That is, the 8 problems can be reformulated into the form (6.3).

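Regularisations

# | X | Y | F(x, y)
1 | $\{x \in \mathbb{R}^n : \|x\|_2 \le 1, \|x\|_0 \le s\}$ | $\{y \in \mathbb{R}^m : \|y\|_2 \le 1\}$ | $y^T A x$
2 | $\{x \in \mathbb{R}^n : \|x\|_2 \le 1, \|x\|_0 \le s\}$ | $\{y \in \mathbb{R}^m : \|y\|_\infty \le 1\}$ | $y^T A x$
3 | $\{x \in \mathbb{R}^n : \|x\|_2 \le 1, \|x\|_1 \le \sqrt{s}\}$ | $\{y \in \mathbb{R}^m : \|y\|_2 \le 1\}$ | $y^T A x$
4 | $\{x \in \mathbb{R}^n : \|x\|_2 \le 1, \|x\|_1 \le \sqrt{s}\}$ | $\{y \in \mathbb{R}^m : \|y\|_\infty \le 1\}$ | $y^T A x$
5 | $\{x \in \mathbb{R}^n : \|x\|_2 \le 1\}$ | $\{y \in \mathbb{R}^m : \|y\|_2 \le 1\}$ | $(y^T A x)^2 - \gamma \|x\|_0$
6 | $\{x \in \mathbb{R}^n : \|x\|_2 \le 1\}$ | $\{y \in \mathbb{R}^m : \|y\|_\infty \le 1\}$ | $(y^T A x)^2 - \gamma \|x\|_0$
7 | $\{x \in \mathbb{R}^n : \|x\|_2 \le 1\}$ | $\{y \in \mathbb{R}^m : \|y\|_2 \le 1\}$ | $y^T A x - \gamma \|x\|_1$
8 | $\{x \in \mathbb{R}^n : \|x\|_2 \le 1\}$ | $\{y \in \mathbb{R}^m : \|y\|_\infty \le 1\}$ | $y^T A x - \gamma \|x\|_1$

Table 2: Reformulations of the problems from Table 1, quoted verbatim from Richtárik et al.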

Generalising the Power Method

The alternating minimisation scheme, applied here as alternating maximisation to the regularised problem (6.3), is:

$$y^k = \arg\max_{y \in Y} F(x^k, y) \tag{6.4}$$

$$x^{k+1} = \arg\max_{x \in X} F(x, y^k). \tag{6.5}$$

As it turns out, there are closed-form solutions for the two sub-problems for all the variants above. Notice that Hotelling's deflation is no longer guaranteed to work, although there are replacements.
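
As an illustration, here is a sketch of the iteration for one variant only: $\ell_2$ variance with an $\ell_0$ constraint, i.e. $F(x, y) = y^T A x$ with $\|y\|_2 \le 1$, $\|x\|_2 \le 1$, $\|x\|_0 \le s$. The closed forms used are the ones that follow from the Cauchy-Schwarz inequality for this variant: normalise $Ax$ to update $y$, and keep the $s$ largest-magnitude entries of $A^T y$ and normalise to update $x$. The function names, the random starting point, and the fixed iteration count are assumptions made here, and NumPy is assumed.

```python
import numpy as np

def keep_largest(z, s):
    """Zero all but the s largest-magnitude entries of z."""
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-s:]
    out[idx] = z[idx]
    return out

def sparse_pc(A, s, iterations=100, seed=0):
    """Alternating maximisation of y^T A x over the two constraint sets."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[1])
    x /= np.linalg.norm(x)
    for _ in range(iterations):
        y = A @ x
        y /= np.linalg.norm(y)          # (6.4): best y in the l2 unit ball
        x = keep_largest(A.T @ y, s)
        x /= np.linalg.norm(x)          # (6.5): best s-sparse x in the l2 unit ball
    return x

rng = np.random.default_rng(9)
A = rng.standard_normal((20, 10))       # hypothetical data matrix
x = sparse_pc(A, s=3)
print(np.count_nonzero(x), np.linalg.norm(A @ x))
```

With `s` equal to the number of columns, the thresholding step does nothing and the loop reduces to the power method applied to $A^T A$, which is the sense in which the scheme generalises it.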

A Summary

Overall, we have seen that there are NP-Hard problems for which one can retrieve the global optimum with high probability. Leading solvers based on alternating minimisation can tackle gigabyte-sized instances in minutes.
