Linear Algebra Methods for Data Mining


Saara Hyvönen, Spring 2007, University of Helsinki

Overview of some topics covered, and some topics not covered, on this course.

Linear algebra tool kit

- QR, QR iteration
- eigenvalues, eigenvalue decomposition, generalized eigenvalue problems
- singular value decomposition (SVD)
- NMF
- power method (for finding eigenvalues and eigenvectors)

Data mining tasks encountered

- regression
- classification
- clustering
- finding latent variables
- visualization and exploration
- ranking

QR was used for...

- orthogonalizing a set of (basis) vectors: $X = QR$.
- solving the least-squares problem:

$$\|r\|_2^2 = \|b - Ax\|_2^2 = \left\| Q^T b - \begin{pmatrix} R \\ 0 \end{pmatrix} x \right\|_2^2 = \|b_1 - Rx\|_2^2 + \|b_2\|_2^2.$$

Least-squares problems were encountered e.g. when we wish to express a matrix $A \in \mathbb{R}^{m \times n}$ in terms of a set of basis vectors $X \in \mathbb{R}^{m \times k}$, $k < m$.
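As a minimal NumPy sketch (on made-up data, not from the slides), solving the least-squares problem via the reduced QR factorization:

```python
import numpy as np

# Minimal sketch on synthetic data: solve min_x ||b - Ax||_2 via reduced QR.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))     # m x n with m > n
b = rng.standard_normal(100)

Q, R = np.linalg.qr(A)                # reduced QR: Q is 100 x 5, R is 5 x 5
x = np.linalg.solve(R, Q.T @ b)       # solve the triangular system R x = Q^T b

x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_ref)          # agrees with the least-squares solver
```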

Eigenvalues/vectors were encountered in...

PageRank: the eigenvector corresponding to the largest eigenvalue of the Google matrix.
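For illustration, a minimal power-iteration sketch on a tiny hypothetical web graph; the link matrix and the damping factor 0.85 are assumptions, not from the slides:

```python
import numpy as np

# Minimal sketch: power iteration on the Google matrix
# G = alpha * (column-stochastic link matrix) + teleportation term.
links = np.array([[0, 0, 1, 0],
                  [1, 0, 0, 0],
                  [1, 1, 0, 1],
                  [0, 1, 0, 0]], dtype=float)   # links[i, j] = 1 if page j links to i
P = links / links.sum(axis=0)                   # column-stochastic
alpha, n = 0.85, P.shape[0]
G = alpha * P + (1 - alpha) / n * np.ones((n, n))

r = np.ones(n) / n
for _ in range(100):                            # power iteration
    r = G @ r
    r /= r.sum()
print(r)   # PageRank: eigenvector for the dominant eigenvalue (= 1) of G
```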

Linear discriminant analysis: linear discriminants = eigenvectors corresponding to the largest eigenvalues of the generalized eigenvalue problem $S_b w = \lambda S_w w$.
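A minimal sketch of solving the symmetric generalized eigenvalue problem with SciPy; the scatter matrices here are hypothetical stand-ins, not computed from real class data:

```python
import numpy as np
from scipy.linalg import eigh

# Minimal sketch: eigh(a, b) solves the symmetric generalized problem
# S_b w = lambda S_w w, provided S_w is positive definite.
rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3))
S_b = X.T @ X                      # stand-in for the between-class scatter
S_w = np.eye(3) + 0.1 * S_b        # stand-in for the within-class scatter

vals, vecs = eigh(S_b, S_w)        # eigenvalues in ascending order
W = vecs[:, ::-1][:, :2]           # discriminants: top-2 eigenvectors
```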

Spectral clustering: based on running k-means clustering on the matrix obtained from the eigenvectors corresponding to the largest eigenvalues of the graph Laplacian matrix $L = D^{-1/2} A D^{-1/2}$.


Spectral clustering

Use methods from spectral graph partitioning to do clustering. Needed: pairwise distances between data points. These can be thought of as weights of links in a graph: the clustering problem becomes a graph partitioning problem. Unlike with k-means, the clusters need not be convex.

Algorithm

We have $n$ data points $x_1, \ldots, x_n$. We wish to partition them into $k$ disjoint clusters $C_1, \ldots, C_k$.

1. Form the affinity matrix $A \in \mathbb{R}^{n \times n}$ defined by

$$A_{ij} = \begin{cases} \exp(-\|x_i - x_j\|^2 / 2\sigma^2) & \text{if } i \neq j, \\ 0 & \text{if } i = j. \end{cases}$$

2. Define $D$ to be the diagonal matrix whose $i$th diagonal element is the sum of $A$'s $i$th row, and construct the matrix $L = D^{-1/2} A D^{-1/2}$.

3. Find the eigenvectors $v_j$ of $L$ corresponding to the $k$ largest eigenvalues, and form the matrix $V = [v_1\; v_2\; \ldots\; v_k] \in \mathbb{R}^{n \times k}$.

4. Form the matrix $Y$ from $V$ by renormalizing each of $V$'s rows to have unit length.

5. Treating each row of $Y$ as a point in $\mathbb{R}^k$, cluster them into $k$ clusters via k-means (or any other clustering algorithm).

6. If row $i$ of the matrix $Y$ was assigned to cluster $j$, assign the data point $x_i$ to cluster $j$.
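A minimal NumPy sketch of steps 1-6 above, for hypothetical data X (n points as rows); sigma and k are chosen by hand here:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Minimal sketch of the algorithm above (Ng-Jordan-Weiss style).
def spectral_clustering(X, k, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    A = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(A, 0.0)                              # A_ii = 0
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))                       # D^{-1/2} A D^{-1/2}
    vals, vecs = np.linalg.eigh(L)                        # ascending order
    V = vecs[:, -k:]                                      # k largest eigenvalues
    Y = V / np.linalg.norm(V, axis=1, keepdims=True)      # renormalize rows
    _, labels = kmeans2(Y, k, minit='++')                 # k-means on rows of Y
    return labels
```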


SVD was useful for...

- noise reduction. (Figure: plot of singular values by index.)

- data compression: if $A_k = U_k \Sigma_k V_k^T$, where $\Sigma_k$ contains the $k$ first singular values of $A$, and the columns of $U_k$ and $V_k$ are the corresponding (left and right) singular vectors, then

$$\min_{\mathrm{rank}(B) \le k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}.$$
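A minimal sketch verifying this on random data: the spectral-norm error of the rank-$k$ truncation equals the $(k+1)$st singular value:

```python
import numpy as np

# Minimal sketch: build the rank-k truncation A_k from the SVD and check
# that ||A - A_k||_2 equals sigma_{k+1}.
rng = np.random.default_rng(2)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])   # s[k] is sigma_{k+1}
```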

- visualizing data: PCA

- information retrieval, LSI: $A$ is a term-by-document matrix, $q$ a query vector. Instead of doing query matching $q^T A > \mathrm{tol}$ in the full space, compute the SVD of $A$ and use only the $k$ first singular values/vectors. Result: compression, plus (often) better performance in terms of precision vs. recall.
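A minimal sketch of the idea with a tiny hypothetical term-by-document matrix; scoring by cosine similarity in the rank-k space is one common choice, not prescribed by the slide:

```python
import numpy as np

# Minimal sketch: score documents against a query in the rank-k LSI space.
def lsi_scores(A, q, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Dk = np.diag(s[:k]) @ Vt[:k, :]            # documents in k-dim space
    qk = U[:, :k].T @ q                        # project the query
    return (qk @ Dk) / (np.linalg.norm(qk) * np.linalg.norm(Dk, axis=0))

A = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1.0]])   # terms x documents
print(lsi_scores(A, q=np.array([1, 0, 0.0]), k=2))  # keep docs with score > tol
```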

HITS

The HITS algorithm distinguishes between authorities, which contain high-quality information, and hubs, which are comprehensive lists of links to authorities. Form the adjacency matrix of the directed web graph. The hub scores and authority scores are the principal left and right singular vectors of the adjacency matrix.
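A minimal sketch on a hypothetical adjacency matrix; the principal singular vectors are read off from the full SVD:

```python
import numpy as np

# Minimal sketch: A[i, j] = 1 if page i links to page j (made-up graph).
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 1.0]])
U, s, Vt = np.linalg.svd(A)
hubs = np.abs(U[:, 0])          # principal left singular vector
authorities = np.abs(Vt[0])     # principal right singular vector
```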

The power method is used to find the eigenvalue of largest magnitude and the corresponding eigenvector (PageRank!). Subsequent eigenvalues/vectors can be found by using deflation. In the symmetric case:

$$A = \sum_{j=1}^n \lambda_j u_j u_j^T, \qquad \hat{A} = A - \lambda_1 u_1 u_1^T.$$
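A minimal sketch of the power method with one deflation step, for a small symmetric example matrix:

```python
import numpy as np

# Minimal sketch: power iteration for the dominant eigenpair, then
# deflation to expose the next eigenvalue (symmetric case).
def power_method(A, iters=1000):
    x = np.ones(A.shape[0])
    for _ in range(iters):
        x = A @ x
        x /= np.linalg.norm(x)
    return x @ A @ x, x            # Rayleigh quotient, eigenvector

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
lam1, u1 = power_method(A)
A_defl = A - lam1 * np.outer(u1, u1)   # A_hat = A - lambda_1 u_1 u_1^T
lam2, u2 = power_method(A_defl)
```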

Nonnegative matrix factorization

Given a nonnegative matrix $A \in \mathbb{R}^{m \times n}$, we wish to express the matrix as a product of two nonnegative matrices $W \in \mathbb{R}^{m \times k}$ and $H \in \mathbb{R}^{k \times n}$: $A \approx WH$.
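The slides leave the algorithm to the references; here is a minimal sketch of the multiplicative updates of Lee and Seung [4] for the Frobenius-norm objective, on made-up nonnegative data:

```python
import numpy as np

# Minimal sketch of Lee-Seung multiplicative updates minimizing ||A - WH||_F;
# eps guards against division by zero.
def nmf(A, k, iters=500, eps=1e-10):
    rng = np.random.default_rng(0)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

A = np.random.default_rng(1).random((10, 8))   # nonnegative data
W, H = nmf(A, k=3)
print(np.linalg.norm(A - W @ H))               # reconstruction error
```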

(Figure: NMF example.)


Roaming beyond the scope of this course

There are plenty of things related to linear algebra and data mining that we did not cover on this course, e.g.

- tensor SVD, generalized SVD
- kernel methods
- independent component analysis (ICA)
- multidimensional scaling
- canonical correlations

- generalized linear models
- factor analysis, MPCA, ...
- spectral ordering
- ...

Tensors

- vector: one-dimensional array of data
- matrix: two-dimensional array of data
- tensor: $n$-dimensional array of data, e.g. $n = 3$: $A \in \mathbb{R}^{l \times m \times n}$

It is possible to define a Higher Order SVD (HOSVD) for such a 3-mode array or tensor. Applications: psychometrics, chemometrics.

Face recognition using Tensor SVD

- collection of images of $n_p$ persons
- each image is an $m_1 \times m_2$ array: stack the columns to get a vector of length $n_i = m_1 m_2$
- each person has been photographed with $n_e$ different expressions/illuminations
- so we have a tensor $A \in \mathbb{R}^{n_i \times n_e \times n_p}$

HOSVD can be used for face recognition, or e.g. for reducing the effect of illumination.
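A minimal sketch of the HOSVD of such a 3-mode tensor, with hypothetical dimensions; each mode's orthogonal factor comes from the SVD of the corresponding unfolding:

```python
import numpy as np

# Minimal sketch of HOSVD for a hypothetical tensor A in R^{n_i x n_e x n_p}.
def unfold(T, mode):
    # Mode-n unfolding: mode-n fibers become the columns of a matrix.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

A = np.random.default_rng(0).random((16, 11, 10))   # pixels x expressions x persons
factors = [np.linalg.svd(unfold(A, m), full_matrices=False)[0] for m in range(3)]

# Core tensor: multiply A by U_m^T along each mode m.
S = A
for m, U in enumerate(factors):
    S = np.moveaxis(np.tensordot(U.T, np.moveaxis(S, m, 0), axes=1), 0, m)
```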

Data: digitized images of 10 people, 11 expressions. Task: find from the database the closest match to the given images (top row). Below: the closest match using HOSVD. In each case, the right person was identified.

Independent Component Analysis

Consider the cocktail-party problem: in a room, you have two people speaking simultaneously, and two microphones recording the mixture of these speech signals. Each recording is a weighted sum of the speech signals $s_1(t)$ and $s_2(t)$:

$$x_1(t) = a_{11} s_1(t) + a_{12} s_2(t), \qquad x_2(t) = a_{21} s_1(t) + a_{22} s_2(t),$$

where the $a_{ij}$ are some parameters depending on the distances of the microphones from the speakers. How can we recover the original signals $s_1$ and $s_2$ from the recorded signals $x_1$ and $x_2$?
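A minimal sketch of this setup with synthetic sources, using scikit-learn's FastICA (an implementation of the algorithm of Hyvärinen and Oja [2]); the sources and the mixing matrix are made up:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Minimal sketch of the cocktail-party problem with synthetic sources;
# ICA recovers the sources only up to order and scale.
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]    # two non-Gaussian sources
Amix = np.array([[1.0, 0.5], [0.4, 1.0]])           # mixing: x = A s
X = S @ Amix.T                                      # the two recordings

S_hat = FastICA(n_components=2, random_state=0).fit_transform(X)
```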

(Figure from Hastie, Tibshirani, Friedman [5].)

PCA versus ICA

PCA gives uncorrelated components. In the cocktail-party problem this is not the right answer. ICA gives statistically independent components. Variables $y_1$ and $y_2$ are independent if information on the value of $y_1$ does not give any information on the value of $y_2$, and vice versa. Note: the data must be non-Gaussian!

Example: Image separation. Mixtures of images:

ICA produced the following images:


Kernel methods

Idea: take a mapping $\varphi: X \to F$, where $F$ is an inner product space, and map the data $x$ to the (higher-dimensional) feature space: $x \mapsto \varphi(x)$. Then work in the feature space $F$.


In this case: $\varphi: (x_1, x_2) \mapsto (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$. So the inner product in the feature space is

$$\langle \varphi(x), \varphi(x') \rangle = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)\, (x_1'^2, \sqrt{2}\, x_1' x_2', x_2'^2)^T = \langle x, x' \rangle^2 =: k(x, x').$$

So the inner product can be computed in $\mathbb{R}^2$! Here $k$ is the kernel function.
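A minimal numeric check of this identity; the test points are arbitrary:

```python
import numpy as np

# Minimal sketch: the explicit feature map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
# gives the same inner products as the kernel k(x, x') = <x, x'>^2.
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi(x) @ phi(xp), (x @ xp) ** 2)
```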

This is the very idea in kernel methods: you can operate in high-dimensional feature spaces while doing all your (inner product) computations in a lower-dimensional space. All you need is a suitable kernel. A kernel is a function $k$ such that for all $x, y \in X$, $k(x, y) = \langle \varphi(x), \varphi(y) \rangle$, where $\varphi$ is a mapping from $X$ to an (inner product) feature space $F$. There are numerous ways to define kernels. We can use any algorithm that only depends on dot products: after the kernel trick, we are operating in the feature space.

In practice the dimension of the feature space can be huge. If our data consists of images of size $16 \times 16$, and we use as a feature map polynomials of degree $d = 5$, then our feature space has dimension about $10^{10}$! Regardless of the dimension of the feature space, we can compute the inner products in the lower-dimensional space: computation is not a problem. Overfitting? Not a problem (in theory): for the reasons, see the references.
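A quick sanity check of that figure: counting the monomials of degree exactly 5 in $16 \times 16 = 256$ pixel variables:

```python
from math import comb

# Number of monomials of degree exactly 5 in 256 variables: C(256 + 5 - 1, 5).
print(comb(256 + 5 - 1, 5))   # 9525431552, i.e. about 10^10
```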

Kernel methods can be used for

- pattern recognition
- classification (SVM = support vector machines)
- outlier detection, canonical correlations, ...

MDS

- uses pairwise distances between points
- finds a low-dimensional representation of the data in such a way that distances between points are preserved as well as possible
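A minimal sketch of classical (metric) MDS, which recovers coordinates from squared pairwise distances via double centering and an eigendecomposition; the data here are made up:

```python
import numpy as np

# Minimal sketch of classical MDS: double-center the squared-distance
# matrix D2 and embed with the top eigenvectors of the Gram matrix.
def classical_mds(D2, dim=2):
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J                 # Gram matrix of the centered points
    vals, vecs = np.linalg.eigh(B)        # ascending order
    vals, vecs = vals[::-1][:dim], vecs[:, ::-1][:, :dim]
    return vecs * np.sqrt(np.maximum(vals, 0))

# Hypothetical example: points on a line are recovered up to sign and shift.
X = np.array([[0.0], [1.0], [3.0]])
D2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
print(classical_mds(D2, dim=1))
```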

(Figure: MDS of Finnish dialect regions. Legend: 1. Ingria, 2. South-East, 3. E. Savonia, 4. C. Savonia, 5. W. Savonia, 6. S.E. Tavastia, 7. C. Tavastia, 8. E. South-West, 9. N. South-West, 10. Satakunta, 11. S. Ostrobothnia, 12. C. Ostrobothnia, 13. N. Ostrobothnia, 14. Kainuu, 15. Northernmost.)

(Figure: the resulting MDS configuration orders the regions roughly Northernmost, N. Ostrobothnia, Kainuu, C. Ostrobothnia, S. Ostrobothnia, C. Savonia, Satakunta, N. South-West, C. Tavastia, W. Savonia, E. Savonia, E. South-West, S.E. Tavastia, South-East, Ingria.)

Final words

You can get far with a basic linear algebra toolkit. But there remains a world of methods to explore!

References

[1] L. Eldén: Matrix Methods in Data Mining and Pattern Recognition, SIAM, 2007.

[2] A. Hyvärinen and E. Oja: Independent Component Analysis: Algorithms and Applications, Neural Networks 13 (4-5), 2000.

[3] J. Shawe-Taylor and N. Cristianini: Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.

[4] D. Lee and H. S. Seung: Learning the parts of objects with nonnegative matrix factorization, Nature 401, 788, 1999.

[5] T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, 2001.

[6] D. Hand, H. Mannila, P. Smyth: Principles of Data Mining, MIT Press, 2001.
