CSE 494 / CSE-CBS 598 (Fall 2007): Numerical Linear Algebra for Data Exploration. Clustering. Instructor: Jieping Ye

1 Introduction

One important method for data compression and classification is to organize data points in clusters: a cluster is a subset of the data points that are close together under some distance measure. A loose definition of clustering is the process of organizing data into groups whose members are similar in some way. A cluster is therefore a collection of data points that are similar to each other and dissimilar to the data points belonging to other clusters.

One can compute the mean value of each cluster separately and use the means as representatives of the clusters. Equivalently, the means can be used as basis vectors, and all data points are represented by their coordinates with respect to this basis.

Clustering algorithms can be applied in many fields:

- Marketing: finding groups of customers with similar behavior, given a large database of customer data containing their properties and past buying records.
- Biology: classification of plants and animals given their features.
- WWW: document clustering; clustering weblog data to discover groups of similar access patterns.

There are several methods for computing a clustering. One of the most important is the k-means algorithm.

2 K-means Clustering

We assume that we have n data points $\{x_i\}_{i=1}^n \subset \mathbb{R}^m$, which we organize as columns of a matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$. Let $\Pi = \{\pi_j\}_{j=1}^k$ denote a partitioning of the data in X into k clusters:
$$\pi_j = \{\, v \mid x_v \text{ belongs to cluster } j \,\}.$$
The mean, or centroid, of cluster $\pi_j$ is
$$c_j = \frac{1}{n_j} \sum_{v \in \pi_j} x_v,$$
where $n_j$ is the number of elements in $\pi_j$. We describe the K-means algorithm based on the Euclidean distance measure.

The tightness or coherence of cluster $\pi_j$ can be measured as the sum
$$q_j = \sum_{v \in \pi_j} \|x_v - c_j\|_2^2.$$
The closer the vectors are to the centroid, the smaller the value of $q_j$. The quality of a clustering can be measured as the overall coherence
$$Q(\Pi) = \sum_{j=1}^{k} q_j.$$
In the k-means algorithm we seek a partitioning with optimal coherence, in the sense that it solves the minimization problem
$$\min_{\Pi} Q(\Pi).$$

The K-means algorithm:

1. Initialization: choose k initial centroids.
2. Form k clusters by assigning each data point to its closest centroid.
3. Recompute the centroid of each cluster.
4. Repeat steps 2 and 3 until convergence.

The initial partitioning is often chosen randomly. The algorithm usually converges rather quickly, but one cannot guarantee that it finds the global minimum.
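As a minimal sketch of the iteration just described (not code from the course; it assumes NumPy, and the helper names kmeans and coherence are chosen here for illustration):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means on the columns of X (m x n) with Euclidean distance.

    Returns the centroid matrix C (m x k) and the cluster label of each column.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Initialization: pick k distinct data points as initial centroids.
    C = X[:, rng.choice(n, size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Assignment step: squared distance from every point to every centroid.
        d2 = ((X[:, :, None] - C[:, None, :]) ** 2).sum(axis=0)   # n x k
        labels = d2.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        C_new = np.column_stack([
            X[:, labels == j].mean(axis=1) if np.any(labels == j) else C[:, j]
            for j in range(k)
        ])
        if np.allclose(C_new, C):      # centroids stopped moving: converged
            break
        C = C_new
    return C, labels

def coherence(X, C, labels):
    """Overall coherence Q(Pi) = sum_j sum_{v in pi_j} ||x_v - c_j||^2."""
    return float(((X - C[:, labels]) ** 2).sum())
```

Running this from several random initializations and keeping the result with the smallest coherence is a common workaround for the local-minima issue discussed next.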

3 Spectral Relaxation for K-means Clustering

Despite the popularity of K-means clustering, one of its major drawbacks is that it is prone to local minima. Much research has been devoted to computing refined initial points and to adding explicit constraints to the sum-of-squares cost function for K-means clustering so that the search converges to a better local minimum. Zha et al. tackled the problem from a different angle: they formulate the sum-of-squares minimization in K-means as a trace maximization problem with special constraints; relaxing the constraints leads to a maximization problem that possesses a globally optimal solution.

Spectral Relaxation for K-means Clustering. H. Zha, X. He, C. Ding, and H. Simon. NIPS 2001.

3.1 Spectral Relaxation

Recall that the n data points $\{x_i\}_{i=1}^n \subset \mathbb{R}^m$, organized as columns of the matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$, are partitioned into k clusters $\Pi = \{\pi_j\}_{j=1}^k$ with $\pi_j = \{\, v \mid x_v \text{ belongs to cluster } j \,\}$. The mean, or centroid, of cluster $\pi_j$ is
$$c_j = \frac{1}{n_j} \sum_{v \in \pi_j} x_v,$$
where $n_j$ is the number of elements in $\pi_j$.

For a given partition Π, the associated sum-of-squares cost function is defined as
$$Q(\Pi) = \sum_{j=1}^{k} \sum_{v \in \pi_j} \|x_v - c_j\|_2^2.$$
Let e denote the vector of all ones of appropriate length. It is easy to see that $c_j = X_j e / n_j$, where $X_j$ is the data matrix of the j-th cluster. The sum-of-squares cost of the j-th cluster is
$$q_j = \sum_{v \in \pi_j} \|x_v - c_j\|_2^2 = \|X_j - c_j e^T\|_F^2 = \|X_j (I_{n_j} - e e^T/n_j)\|_F^2.$$
Note that $I_{n_j} - e e^T/n_j$ is a projection matrix, i.e.,
$$(I_{n_j} - e e^T/n_j)^2 = I_{n_j} - e e^T/n_j.$$
It follows that
$$q_j = \mathrm{trace}\!\left( X_j (I_{n_j} - e e^T/n_j) X_j^T \right) = \mathrm{trace}\!\left( (I_{n_j} - e e^T/n_j) X_j^T X_j \right).$$
Therefore,
$$Q(\Pi) = \sum_{j=1}^{k} q_j = \sum_{j=1}^{k} \left( \mathrm{trace}(X_j^T X_j) - \frac{e^T X_j^T X_j e}{n_j} \right).$$
Define the n-by-k matrix Y with orthonormal columns as
$$Y = \begin{bmatrix} e/\sqrt{n_1} & & & \\ & e/\sqrt{n_2} & & \\ & & \ddots & \\ & & & e/\sqrt{n_k} \end{bmatrix}. \qquad (1)$$
Then
$$Q(\Pi) = \mathrm{trace}(X^T X) - \mathrm{trace}(Y^T X^T X Y).$$
The minimization of Q(Π) is equivalent to the maximization of $\mathrm{trace}(Y^T X^T X Y)$ with Y of the form in Eq. (1).

Ignoring the special structure of Y and letting it be an arbitrary matrix with orthonormal columns, we obtain the relaxed maximization problem
$$\max_{Y^T Y = I_k} \mathrm{trace}(Y^T X^T X Y).$$
It turns out that this trace maximization problem has a closed-form solution.

Theorem (Ky Fan). Let H be a symmetric matrix with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$ and corresponding eigenvectors $U = [u_1, \ldots, u_n]$. Then
$$\lambda_1 + \cdots + \lambda_k = \max_{Y^T Y = I_k} \mathrm{trace}(Y^T H Y).$$
Moreover, the optimal Y is given by $Y = [u_1, \ldots, u_k] Q$ with Q an arbitrary orthogonal matrix of size k-by-k.
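The following short NumPy check (an illustration only, not part of the notes; the balanced partition and all variable names are choices made here) verifies the identity $Q(\Pi) = \mathrm{trace}(X^T X) - \mathrm{trace}(Y^T X^T X Y)$ for a structured Y of the form (1), and evaluates the relaxed maximum via Ky Fan's theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 5, 40, 3
X = rng.standard_normal((m, n))               # data points as columns
labels = rng.permutation(n) % k               # a balanced partition Pi (no empty cluster)

# Structured Y of Eq. (1): Y[v, j] = 1/sqrt(n_j) if point v lies in cluster j, else 0.
n_j = np.bincount(labels, minlength=k)
Y_pi = np.zeros((n, k))
Y_pi[np.arange(n), labels] = 1.0 / np.sqrt(n_j[labels])

H = X.T @ X
C = np.column_stack([X[:, labels == j].mean(axis=1) for j in range(k)])
Q_pi = ((X - C[:, labels]) ** 2).sum()        # sum-of-squares cost Q(Pi)
assert np.isclose(Q_pi, np.trace(H) - np.trace(Y_pi.T @ H @ Y_pi))

# Relaxation: drop the structure, keep only Y^T Y = I_k; by Ky Fan the maximum
# of trace(Y^T H Y) is the sum of the k largest eigenvalues of H = X^T X.
evals = np.linalg.eigvalsh(H)[::-1]           # eigenvalues in descending order
relaxed_max = evals[:k].sum()
lower_bound = np.trace(H) - relaxed_max       # lower bound on min_Pi Q(Pi), derived below
print(Q_pi, ">=", lower_bound)
```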

We may derive the following lower bound for the minimum of the sum-of-squares cost function:
$$\min_{\Pi} Q(\Pi) \;\geq\; \mathrm{trace}(X^T X) - \max_{Y^T Y = I_k} \mathrm{trace}(Y^T X^T X Y) \;=\; \sum_{i=k+1}^{\min\{m,n\}} \sigma_i^2(X),$$
where $\sigma_i(X)$ is the i-th largest singular value of X. It is easy to see from the above derivation that we can replace X with $X - a e^T$, where a is an arbitrary vector. If we choose a to be the mean of all data points in X, then the relaxed maximization problem is equivalent to Principal Component Analysis (PCA).

Let Y be the n-by-k matrix consisting of the k leading eigenvectors of $X^T X$. Each row of Y corresponds to a data vector. This can be viewed as transforming the original data vectors, which live in an m-dimensional space, into new data vectors that live in a k-dimensional space. One might be tempted to compute the cluster assignment by applying the ordinary K-means method to these data vectors in the reduced-dimensional space.

More references:

K-means Clustering via Principal Component Analysis. Chris Ding and Xiaofeng He. ICML 2004.

A Unified View of Kernel k-means, Spectral Clustering and Graph Partitioning. I.S. Dhillon, Y. Guan, and B. Kulis. UTCS Technical Report #TR-04-25. http://www.cs.utexas.edu/users/kulis/pubs/spectral_techreport.pdf

4 Matrix Approximations using Clustering

Given any partitioning $\Pi = \{\pi_j\}_{j=1}^k$ of the data in X into k clusters, we can approximate a document vector by the closest mean (centroid) vector. In other words, if a document vector is in cluster $\pi_j$, we approximate it by the mean vector $c_j$. This leads to the matrix approximation $X \approx \hat{X}$ such that, for $1 \leq i \leq n$, the i-th column of $\hat{X}$ is the mean vector closest to the data point $x_i$. We can express $\hat{X}$ as a low-rank matrix approximation:
$$\hat{X} = [c_1, c_2, \ldots, c_k] \, I,$$
where $I \in \mathbb{R}^{k \times n}$ indicates the cluster membership: $I_{ij} = 1$ if $x_j$ belongs to the i-th cluster $\pi_i$, and $I_{ij} = 0$ otherwise. Denote $C = [c_1, c_2, \ldots, c_k]$. The matrix approximation $\hat{X}$ has rank at most k. It is thus natural to compare the approximation power of $\hat{X}$ to that of the best possible rank-k approximation of the data matrix X based on the SVD.

Best rank-k approximation: $X_k = U_k \Sigma_k V_k^T$, where $U_k$ and $V_k$ consist of the top k left and right singular vectors of X, respectively, and $\Sigma_k$ contains the top k singular values.

Empirical studies showed that, for each fixed k, the approximation error of the k-truncated SVD is significantly lower than that of $\hat{X}$.
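As an illustration of this comparison (a sketch, not an experiment from the notes; it assumes NumPy and the hypothetical kmeans helper sketched in Section 2), one can compute both rank-k approximations and their Frobenius errors:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 20, 200, 5
X = rng.standard_normal((m, n))

# Clustering-based approximation X_hat: replace each column by its centroid,
# i.e. X_hat = [c_1,...,c_k] times the cluster-indicator matrix.
C, labels = kmeans(X, k)
X_hat = C[:, labels]

# Best rank-k approximation from the truncated SVD.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err_cluster = np.linalg.norm(X - X_hat, 'fro')
err_svd = np.linalg.norm(X - X_k, 'fro')
print(err_cluster, err_svd)   # err_svd <= err_cluster, since X_hat has rank at most k
```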

5 Concept Decompositions

It can be shown that by approximating each document vector by a linear combination of the concept vectors it is possible to obtain significantly better matrix approximations.

Concept Decompositions for Large Sparse Text Data using Clustering. I.S. Dhillon and D.S. Modha. Machine Learning, 2001.

Given any partitioning $\Pi = \{\pi_j\}_{j=1}^k$ of the data in X into k clusters, let $\{c_j\}_{j=1}^k$ denote the k cluster centroids. Define the concept matrix as the m-by-k matrix whose j-th column, for $1 \leq j \leq k$, is the centroid (concept) vector $c_j$, that is, $C = [c_1, c_2, \ldots, c_k]$. Assuming linear independence of the k concept vectors, the concept matrix has rank k.

For any partitioning of the data, we define the corresponding concept decomposition $\tilde{X}_k$ of the data matrix X as the least-squares approximation of X onto the column space of the concept matrix C. We can write the concept decomposition as the m-by-n matrix
$$\tilde{X}_k = C Z^{\star},$$
where $Z^{\star}$ is a k-by-n matrix determined by solving the least-squares problem
$$Z^{\star} = \arg\min_{Z} \|X - C Z\|_F^2.$$
It is well known that a closed-form solution exists for this least-squares problem, namely
$$Z^{\star} = (C^T C)^{-1} C^T X.$$
Although this formula is intuitively pleasing, it does not constitute an efficient and numerically stable way to compute the matrix $Z^{\star}$. Instead, we can use the QR decomposition of the concept matrix. Let $C = QR$ be the thin QR decomposition of C; then
$$Z^{\star} = (R^T R)^{-1} R^T Q^T X = R^{-1} Q^T X.$$
It follows that
$$\tilde{X}_k = C Z^{\star} = Q R (R^T R)^{-1} R^T Q^T X = Q Q^T X.$$
Here we assume that R is nonsingular, i.e., that the k cluster centroids in C are linearly independent.

Exercise: show that the concept decomposition $\tilde{X}_k$ is a better matrix approximation than $\hat{X}$.

5.1 Empirical Observations

The approximation power of concept decompositions, measured in the Frobenius norm, is comparable to that of the best possible approximations by the truncated SVD. An important advantage of concept decompositions is that they are computationally more efficient and require much less memory than the truncated SVD.

When applied to document clustering, concept decompositions produce concept vectors that are localized in the word space, are sparse, and tend towards orthonormality. In contrast, the singular vectors obtained from the SVD are global in the word space and dense. Nonetheless, the subspaces spanned by the concept vectors and by the leading singular vectors are quite close, in the sense of small principal angles between them.
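The QR-based computation can be sketched in a few lines of NumPy (an illustration under the assumption that C has linearly independent columns; the function name is chosen here, not taken from the notes):

```python
import numpy as np

def concept_decomposition(X, C):
    """Least-squares approximation of X onto the column space of the concept matrix C.

    X : m x n data matrix;  C : m x k matrix of cluster centroids with full column rank.
    Returns (Xk, Z) with Xk = C Z* = Q Q^T X, computed via the thin QR decomposition of C.
    """
    Q, R = np.linalg.qr(C, mode='reduced')    # thin QR; R is k x k, assumed nonsingular
    Z = np.linalg.solve(R, Q.T @ X)           # Z* = R^{-1} Q^T X, avoids forming (C^T C)^{-1}
    return C @ Z, Z

# Usage: with C and labels from the earlier K-means sketch,
#   Xk, Z = concept_decomposition(X, C)
# and ||X - Xk||_F is never larger than ||X - C[:, labels]||_F (the exercise above).
```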