Outline: Principal Component Analysis (PCA); Singular Value Decomposition (SVD); Multi-Dimensional Scaling (MDS); Non-linear extensions (Kernel PCA, Isomap).

PCA PCA: Principal Component Analysis (closely related to SVD). PCA finds a linear projection of high-dimensional data into a lower-dimensional subspace such that: (1) the retained variance is maximized, and (2) the least-squares reconstruction error is minimized.

Some PCA/SVD applications LSI: Latent Semantic Indexing. Kleinberg/HITS algorithm (computes hub and authority scores for nodes). Google/PageRank algorithm (random walk with restart). Image compression (eigenfaces). Data visualization (by projecting the data onto 2D).

PCA PCA steps to transform an N×d matrix X into an N×m matrix Y: Center the data (subtract the mean of each column). Calculate the d×d covariance matrix C = 1/(N-1) X^T X (note: different notation from the tutorial). Here C_ij = 1/(N-1) Σ_{q=1..N} X_qi X_qj; C_ii (diagonal) is the variance of variable i, and C_ij (off-diagonal) is the covariance between variables i and j. Calculate the eigenvectors of the covariance matrix (they are orthonormal). Select the m eigenvectors that correspond to the largest m eigenvalues to be the new basis.
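A minimal NumPy sketch of these steps, assuming a generic data matrix X (the names pca and m are illustrative, not from the slides):

```python
import numpy as np

def pca(X, m):
    """Project an N x d data matrix X onto its top-m principal components."""
    Xc = X - X.mean(axis=0)                 # center the data
    C = Xc.T @ Xc / (X.shape[0] - 1)        # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]       # sort eigenvalues in descending order
    V = eigvecs[:, order[:m]]               # top-m eigenvectors as columns
    return Xc @ V, V                        # projected data Y (N x m) and basis V

# Example usage on random data
X = np.random.randn(100, 5)
Y, V = pca(X, m=2)
```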

Eigenvectors If A is a square matrix, a non-zero vector v is an eigenvector of A if there is a scalar λ (the eigenvalue) such that Av = λv. Example: [2 3; 2 1] [3; 2] = [12; 8] = 4 [3; 2]. If we think of the square matrix as a transformation matrix, then multiplying it by an eigenvector does not change the eigenvector's direction, only its length. What are the eigenvectors of the identity matrix?
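A quick NumPy check of this 2×2 example (a verification sketch, not part of the slides):

```python
import numpy as np

A = np.array([[2, 3],
              [2, 1]])
v = np.array([3, 2])

print(A @ v)                 # [12  8] = 4 * v, so v is an eigenvector with eigenvalue 4
print(np.linalg.eig(A)[0])   # eigenvalues of A: 4 and -1
```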

PCA example X : the data matrix with N=11 objects and d=2 dimensions.

PCA example Step 1: subtract the mean and calculate the covariance matrix C: C = [0.716 0.615; 0.615 0.616]

PCA example Step 2: Calculate the eigenvectors and eigenvalues of the covariance matrix: λ1 ≈ 1.28, v1 ≈ [-0.677, -0.735]^T; λ2 ≈ 0.049, v2 ≈ [-0.735, 0.677]^T. Notice that v1 and v2 are orthonormal: ||v1|| = 1, ||v2|| = 1, v1 · v2 = 0.

PCA example Step 3: project the data. Let V = [v1, ..., vm] be the d×m matrix whose columns vi are the eigenvectors corresponding to the largest m eigenvalues. The projected data Y = X V is an N×m matrix. If m = d (more precisely, m = rank(X)), then there is no loss of information!

PCA example Step 3: project the data. λ1 ≈ 1.28, v1 ≈ [-0.677, -0.735]^T; λ2 ≈ 0.049, v2 ≈ [-0.735, 0.677]^T. The eigenvector with the largest eigenvalue is the principal component of the data: if we are allowed to pick only one dimension, the principal component is the best direction (it retains the maximum variance). Our PC is v1 ≈ [-0.677, -0.735]^T.

PCA example Step 3: project the data. If we select only the first PC and reconstruct the data, this is what we get: we lose the variance along the other component (lossy compression!). A projection/reconstruction sketch follows below.
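A short sketch of the projection and reconstruction step. The data here is hypothetical (the slide's 11 actual points are not reproduced in the transcript), so a random correlated 2-D matrix stands in for X:

```python
import numpy as np

# hypothetical correlated 2-D data; any N x 2 matrix would do
X = np.random.randn(11, 2) @ np.array([[1.0, 0.8], [0.0, 0.5]])
Xc = X - X.mean(axis=0)

C = Xc.T @ Xc / (Xc.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(C)
V = eigvecs[:, [-1]]                 # keep only the first principal component (m = 1)

Y = Xc @ V                           # N x 1 projected data
X_rec = Y @ V.T + X.mean(axis=0)     # reconstruction lies on a line: variance along v2 is lost
print(np.mean((X - X_rec) ** 2))     # reconstruction error reflects the discarded variance
```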

Useful properties The covariance matrix is always symmetric: C^T = (1/(N-1) X^T X)^T = 1/(N-1) X^T (X^T)^T = C. The principal components of X are orthonormal: v_i^T v_j = 1 if i = j, and 0 if i ≠ j. If V = [v1, ..., vd] is square (m = d), then V^T = V^(-1); in general V^T V = I.

Useful properties Theorem 1: if a square d×d matrix S is real and symmetric (S = S^T), then S = V Λ V^T, where V = [v1, ..., vd] holds the eigenvectors of S and Λ = diag(λ1, ..., λd) holds the eigenvalues. Proof: S V = V Λ, i.e. [S v1 ... S vd] = [λ1 v1 ... λd vd], which is just the definition of eigenvectors. Hence S = V Λ V^(-1) = V Λ V^T, because V is orthonormal (V^(-1) = V^T).
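A quick numerical check of Theorem 1 on a random symmetric matrix (illustrative only):

```python
import numpy as np

S = np.random.randn(4, 4)
S = (S + S.T) / 2                               # make S symmetric
lam, V = np.linalg.eigh(S)                      # eigenvalues and orthonormal eigenvectors
print(np.allclose(S, V @ np.diag(lam) @ V.T))   # True: S = V Λ V^T
print(np.allclose(V.T @ V, np.eye(4)))          # True: V is orthonormal
```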

Useful properties The projected data: Y = X V. The covariance matrix of Y is C_Y = 1/(N-1) Y^T Y = 1/(N-1) V^T X^T X V = V^T C_X V = V^T (V Λ V^T) V (because the covariance matrix C_X is symmetric) = (V^(-1) V) Λ (V^(-1) V) (because V is orthonormal) = Λ. After the transformation, the covariance matrix becomes diagonal!
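A short check that the covariance of the projected data is numerically diagonal (a sketch on random data):

```python
import numpy as np

X = np.random.randn(200, 3)
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / (Xc.shape[0] - 1)
lam, V = np.linalg.eigh(C)

Y = Xc @ V
C_Y = Y.T @ Y / (Y.shape[0] - 1)
print(np.allclose(C_Y, np.diag(lam)))   # True: C_Y = Λ (diagonal)
```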

PCA (derivation) Find the direction for which the variance is maximized: v1 = argmax_{v1} var(X v1), subject to v1^T v1 = 1. Rewrite in terms of the covariance matrix: var(X v1) = 1/(N-1) (X v1)^T (X v1) = v1^T (1/(N-1) X^T X) v1 = v1^T C v1. Solve via constrained optimization (Lagrange multipliers): L(v1, λ1) = v1^T C v1 + λ1 (1 - v1^T v1).

PCA (derivation) Constrained optimization: L(v1, λ1) = v1^T C v1 + λ1 (1 - v1^T v1). Set the gradient with respect to v1 to zero: dL(v1, λ1)/dv1 = 2 C v1 - 2 λ1 v1 = 0, hence C v1 = λ1 v1. This is the eigenvector problem! Multiplying by v1^T gives λ1 = v1^T C v1: the projection variance is the eigenvalue.

PCA PCA is unsupervised: the projection ignores class labels, so it may be bad for classification!

Outline: Principal Component Analysis (PCA); Singular Value Decomposition (SVD); Multi-Dimensional Scaling (MDS); Non-linear extensions (Kernel PCA, Isomap).

SVD Any N×d matrix X can be expressed as X = U Σ V^T, where: r is the rank of X (the number of linearly independent columns/rows); U is a column-orthonormal N×r matrix; Σ is a diagonal r×r matrix whose singular values σi are sorted in descending order; V is a column-orthonormal d×r matrix.
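A minimal NumPy illustration of this reduced SVD on an arbitrary matrix X (illustrative sketch):

```python
import numpy as np

X = np.random.randn(6, 4)
U, s, Vt = np.linalg.svd(X, full_matrices=False)  # reduced SVD: U is 6x4, s has 4 values, Vt is 4x4

print(np.allclose(X, U @ np.diag(s) @ Vt))   # True: X = U Σ V^T
print(np.allclose(U.T @ U, np.eye(4)))       # True: U is column-orthonormal
print(s)                                     # singular values, sorted in descending order
```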

SVD example (figure: the document-term matrix factored into a doc-to-concept similarity matrix, concept strengths, and a term-to-concept similarity matrix). The rank of this matrix is r = 2 because we have 2 types of documents (CS and Medical documents), i.e. 2 concepts.

SVD example U: the document-to-concept similarity matrix; V: the term-to-concept similarity matrix. Example: U_{1,1} is the weight of the CS concept in document d1, σ1 is the strength of the CS concept, and V_{1,1} is the weight of the term "data" in the CS concept. V_{1,2} = 0 means the term "data" has zero similarity with the 2nd concept (Medical). What does U_{4,1} mean?

PCA and SVD relation Theorem: Let X = U Σ V^T be the SVD of an N×d matrix X and C = 1/(N-1) X^T X be the d×d covariance matrix. The eigenvectors of C are the same as the right singular vectors of X. Proof: X^T X = V Σ U^T U Σ V^T = V Σ Σ V^T = V Σ^2 V^T, so C = V (Σ^2/(N-1)) V^T. But C is symmetric, hence C = V Λ V^T (according to Theorem 1). Therefore the eigenvectors of the covariance matrix are the columns of V (the right singular vectors), and the eigenvalues of C can be computed from the singular values: λ_i = σ_i^2/(N-1).
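A numerical check of this relation on random data (a sketch; sign flips between eigenvectors and singular vectors are expected and handled below):

```python
import numpy as np

X = np.random.randn(50, 4)
Xc = X - X.mean(axis=0)
N = Xc.shape[0]

# Eigendecomposition of the covariance matrix
C = Xc.T @ Xc / (N - 1)
lam, V_eig = np.linalg.eigh(C)
lam, V_eig = lam[::-1], V_eig[:, ::-1]           # sort descending

# SVD of the centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

print(np.allclose(lam, s**2 / (N - 1)))          # True: λ_i = σ_i^2 / (N-1)
print(np.allclose(np.abs(V_eig), np.abs(Vt.T)))  # True up to sign: same right singular vectors
```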

Summary for PCA and SVD Objective: project an N×d data matrix X using the largest m principal components V = [v1, ..., vm]. 1. Zero-mean the columns of X. 2. Apply PCA or SVD to find the principal components of X. PCA: (I) calculate the covariance matrix C = 1/(N-1) X^T X; (II) V corresponds to the top-m eigenvectors of C. SVD: (I) calculate the SVD X = U Σ V^T; (II) V corresponds to the first m right singular vectors. 3. Project the data into an m-dimensional space: Y = X V.

Outline: Principal Component Analysis (PCA); Singular Value Decomposition (SVD); Multi-Dimensional Scaling (MDS); Non-linear extensions (Kernel PCA, Isomap).

MDS Multi-Dimensional Scaling [Cox and Cox, 1994]. MDS gives points in a low-dimensional space such that the Euclidean distances between them best approximate the original distance matrix. Given a distance matrix Δ, map the input points x_i to points z_i such that ||z_i - z_j|| ≈ δ_{i,j}. Classical MDS: the norm ||.|| is the Euclidean distance. Pipeline: distances → inner products (Gram matrix) → embedding. There is a formula to obtain the Gram matrix G from the distance matrix Δ.
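A minimal sketch of classical MDS, assuming a symmetric distance matrix D. The Gram matrix is obtained by double centering, G = -1/2 J D² J with J = I - (1/N) 1 1^T (the standard classical-MDS formula; the slide only alludes to it):

```python
import numpy as np

def classical_mds(D, m=2):
    """Embed points in m dimensions from an N x N distance matrix D."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    G = -0.5 * J @ (D ** 2) @ J                  # Gram (inner product) matrix
    lam, V = np.linalg.eigh(G)
    lam, V = lam[::-1], V[:, ::-1]               # sort eigenvalues in descending order
    lam = np.clip(lam[:m], 0, None)              # keep top-m non-negative eigenvalues
    return V[:, :m] * np.sqrt(lam)               # N x m coordinates

# Example: recover 2-D coordinates from pairwise Euclidean distances
X = np.random.randn(10, 2)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Z = classical_mds(D, m=2)                        # matches X up to rotation/translation
```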

MDS example Given pairwise distances between different cities (Δ matrix), plot the cities on a 2D plane (recover location)!!

PCA and MDS relation Preserving Euclidean distances = retaining the maximum variance. Classical MDS is equivalent to PCA when the distances in the input space are Euclidean. PCA uses the d×d covariance matrix C = 1/(N-1) X^T X; MDS uses the N×N Gram (inner product) matrix G = X X^T. If we have only a distance matrix (we don't know the points in the original space), we cannot perform PCA! Both PCA and MDS are invariant to rotations of the space.

Kernel PCA Kernel PCA [Scholkopf et al. 1998] performs a nonlinear projection. Given inputs (x_1, ..., x_N), kernel PCA computes the principal components in the feature space (φ(x_1), ..., φ(x_N)). It avoids explicitly constructing the covariance matrix in feature space. The kernel trick: formulate the problem in terms of the kernel function k(x, x') = φ(x) · φ(x') without explicitly doing the mapping. Kernel PCA is a non-linear version of MDS: use the Gram matrix in the feature space (a.k.a. the kernel matrix) instead of the Gram matrix in the input space.
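A minimal sketch of kernel PCA with an RBF kernel, using the standard centered-kernel-matrix formulation (the kernel choice, gamma, and the helper names are illustrative assumptions, not from the slides):

```python
import numpy as np

def kernel_pca(X, m=2, gamma=1.0):
    """Project data onto the top-m kernel principal components (RBF kernel)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)                      # N x N kernel (Gram) matrix in feature space
    N = K.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    Kc = J @ K @ J                               # center the kernel matrix
    lam, V = np.linalg.eigh(Kc)
    lam, V = lam[::-1][:m], V[:, ::-1][:, :m]    # top-m eigenvalues/eigenvectors
    return V * np.sqrt(np.clip(lam, 0, None))    # embedding coordinates

Z = kernel_pca(np.random.randn(30, 3), m=2, gamma=0.5)
```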

Kernel PCA (figure: data in the original space vs. in a non-linear feature space).

Isomap Isomap [Tenenbaum et al. 2000] tries to preserve the distances along the data manifold (geodesic distances). We cannot compute geodesic distances without knowing the manifold! Instead, approximate the geodesic distance by the shortest path in the adjacency graph (figure: blue is the true manifold distance, red is the approximating shortest-path distance).

Isomap Construct the neighborhood graph (connect only the k nearest neighbors); the edge weight is the Euclidean distance. Estimate the pairwise geodesic distances by shortest paths (e.g. Dijkstra's algorithm). Feed the resulting distance matrix to MDS. A sketch of these three steps is given below.
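A compact sketch of these steps using SciPy's shortest-path routine and the classical-MDS double-centering step; it assumes the k-NN graph is connected, and k and the data are illustrative:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def isomap(X, m=2, k=5):
    """Embed X in m dimensions via MDS on graph shortest-path (geodesic) distances."""
    D = cdist(X, X)                                   # pairwise Euclidean distances
    # keep only edges to the k nearest neighbors (non-edges marked with infinity)
    W = np.full_like(D, np.inf)
    idx = np.argsort(D, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, idx.ravel()] = D[rows, idx.ravel()]
    G_dist = shortest_path(W, method="D", directed=False)  # Dijkstra shortest paths
    # classical MDS on the geodesic distance matrix
    N = len(X)
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (G_dist ** 2) @ J
    lam, V = np.linalg.eigh(B)
    lam, V = lam[::-1][:m], V[:, ::-1][:, :m]
    return V * np.sqrt(np.clip(lam, 0, None))

Z = isomap(np.random.randn(100, 3), m=2, k=8)
```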

Isomap The Euclidean distances between the output points match the geodesic distances between the inputs on the manifold from which they are sampled.

Related Feature Extraction Techniques Linear projections: Probabilistic PCA [Tipping and Bishop, 1999], Independent Component Analysis (ICA) [Comon, 1994], Random Projections. Nonlinear projections (manifold learning): Locally Linear Embedding (LLE) [Roweis and Saul, 2000], Laplacian Eigenmaps [Belkin and Niyogi, 2003], Hessian Eigenmaps [Donoho and Grimes, 2003], Maximum Variance Unfolding [Weinberger and Saul, 2005].