Spectral Methods for Dimensionality Reduction


Spectral Methods for Dimensionality Reduction Prof. Lawrence Saul Dept of Computer & Information Science University of Pennsylvania UCLA IPAM Tutorial, July 11-14, 2005

Dimensionality reduction Question How can we detect low dimensional structure in high dimensional data? Applications Digital image and speech libraries Neuronal population activities Gene expression microarrays Financial time series

Framework Data representation Inputs are real-valued vectors in a high dimensional space. Linear structure Does the data live in a low dimensional subspace? Nonlinear structure Does the data live on a low dimensional submanifold?

Linear vs nonlinear What computational price must we pay for nonlinear dimensionality reduction?

Spectral methods Matrix analysis Low dimensional structure is revealed by eigenvalues and eigenvectors. Links to spectral graph theory Matrices are derived from sparse weighted graphs. Usefulness Tractable methods can reveal nonlinear structure.

Notation Inputs (high dimensional): $\vec{x}_i \in \mathbb{R}^D$ with $i = 1,2,\dots,n$. Outputs (low dimensional): $\vec{y}_i \in \mathbb{R}^d$ where $d \ll D$. Goals Nearby points remain nearby. Distant points remain distant. (Estimate d.)

Manifold learning Given high dimensional data sampled from a low dimensional submanifold, how to compute a faithful embedding?

Image Manifolds (Seung & Lee, 2000) (Tenenbaum et al, 2000)

Outline Day 1 - linear, nonlinear, and graph-based methods Day 2 - sparse matrix methods Day 3 - semidefinite programming Day 4 - kernel methods

Questions for today How to detect linear structure? - principal components analysis - metric multidimensional scaling How (not) to generalize these methods? - neural network autoencoders - nonmetric multidimensional scaling How to detect nonlinear structure? - graphs as discretized manifolds - Isomap algorithm

Linear method #1 Principal Components Analysis (PCA)

Principal components analysis Does the data mostly lie in a subspace? If so, what is its dimensionality? (Illustrations: D = 2, d = 1 and D = 3, d = 2.)

Maximum variance subspace Assume inputs are centered: $\sum_i \vec{x}_i = 0$. Project into subspace: $\vec{y}_i = P\vec{x}_i$ with $P^2 = P$. Maximize projected variance: $\mathrm{var}(\vec{y}) = \frac{1}{n}\sum_i \|P\vec{x}_i\|^2$.

Matrix diagonalization Covariance matrix: $\mathrm{var}(\vec{y}) = \mathrm{Tr}(PCP^\top)$ with $C = \frac{1}{n}\sum_i \vec{x}_i\vec{x}_i^\top$. Spectral decomposition: $C = \sum_{\alpha=1}^{D} \lambda_\alpha \vec{e}_\alpha \vec{e}_\alpha^\top$ with $\lambda_1 \ge \cdots \ge \lambda_D \ge 0$. Maximum variance projection: $P = \sum_{\alpha=1}^{d} \vec{e}_\alpha \vec{e}_\alpha^\top$ projects into the subspace spanned by the top d eigenvectors.

Interpreting PCA Eigenvectors: principal axes of maximum variance subspace. Eigenvalues: projected variance of inputs along principal axes. Estimated dimensionality: number of significant (nonnegative) eigenvalues.

Example of PCA Eigenvectors and eigenvalues of covariance matrix for n=1600 inputs in D=3 dimensions.

Example: faces Eigenfaces from 7562 images: the top left image is a linear combination of the rest. Sirovich & Kirby (1987); Turk & Pentland (1991)

Another interpretation of PCA: Assume inputs are centered: $\sum_i \vec{x}_i = 0$. Project into subspace: $\vec{y}_i = P\vec{x}_i$ with $P^2 = P$. Minimize reconstruction error: $\mathrm{err}(\vec{y}) = \frac{1}{n}\sum_i \|\vec{x}_i - P\vec{x}_i\|^2$.

Equivalence Minimum reconstruction error: $\mathrm{err}(\vec{y}) = \frac{1}{n}\sum_i \|\vec{x}_i - P\vec{x}_i\|^2$. Maximum variance subspace: $\mathrm{var}(\vec{y}) = \frac{1}{n}\sum_i \|P\vec{x}_i\|^2$. Both models for linear dimensionality reduction yield the same solution.
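
To make the equivalence explicit, here is a short derivation (not from the slides), using the fact that the projection onto the principal subspace satisfies $P^2 = P = P^\top$:

```latex
\|\vec{x}_i - P\vec{x}_i\|^2
  = \|\vec{x}_i\|^2 - 2\,\vec{x}_i^{\top} P\vec{x}_i + \vec{x}_i^{\top} P^{\top} P\vec{x}_i
  = \|\vec{x}_i\|^2 - \|P\vec{x}_i\|^2,
\qquad\text{hence}\qquad
\mathrm{err}(\vec{y}) = \frac{1}{n}\sum_i \|\vec{x}_i\|^2 - \mathrm{var}(\vec{y}).
```

The first term is fixed by the data, so minimizing the reconstruction error selects exactly the maximum variance subspace.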

PCA as linear autoencoder Network Each layer implements a linear transformation. Cost function Minimize reconstruction error through bottleneck: $\mathrm{err}(P) = \frac{1}{n}\sum_i \|\vec{x}_i - P^\top P\vec{x}_i\|^2$.

Summary of PCA 1) Center inputs on origin: $\sum_i \vec{x}_i = \vec{0}$. 2) Compute covariance matrix: $C = \frac{1}{n}\sum_i \vec{x}_i\vec{x}_i^\top$. 3) Diagonalize: $C = \sum_\alpha \lambda_\alpha \vec{e}_\alpha \vec{e}_\alpha^\top$. 4) Project: $\vec{y}_i = P\vec{x}_i$ with $P = \sum_{\alpha=1}^{d} \vec{e}_\alpha \vec{e}_\alpha^\top$. (Illustrations: D = 2, d = 1 and D = 3, d = 2.)
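
A minimal numpy sketch of these four steps (the function and variable names are illustrative, not from the tutorial):

```python
import numpy as np

def pca(X, d):
    """Project an n x D data matrix X onto its top-d principal subspace."""
    X = X - X.mean(axis=0)                 # 1) center inputs on origin
    C = X.T @ X / len(X)                   # 2) covariance matrix (D x D)
    evals, evecs = np.linalg.eigh(C)       # 3) diagonalize (eigh returns ascending order)
    order = np.argsort(evals)[::-1][:d]    #    keep the top-d eigenvalues/eigenvectors
    E = evecs[:, order]                    #    columns are the principal axes e_alpha
    return X @ E, evals[order]             # 4) project: coordinates E^T x_i

# toy usage: 3-dimensional inputs that mostly lie in a 2-dimensional subspace
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2)) @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(1000, 3))
Y, top_evals = pca(X, d=2)
```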

Properties of PCA Strengths Eigenvector method No tuning parameters Non-iterative No local optima Weaknesses Limited to second order statistics Limited to linear projections

So far.. Q: How to detect linear structure? A: Principal components analysis Maximum variance subspace Minimum reconstruction error Linear network autoencoders Q: How (not) to generalize for manifolds?

Nonlinear autoencoder Neural network Each layer parameterizes a nonlinear transformation. Cost function Minimize reconstruction error: $\mathrm{err}(W) = \frac{1}{n}\sum_i \|\vec{x}_i - l_W(h_W(g_W(f_W(\vec{x}_i))))\|^2$.
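
As an illustration only, such an autoencoder can be trained with an off-the-shelf multilayer perceptron by regressing the inputs onto themselves through a narrow bottleneck; this sketch uses scikit-learn's MLPRegressor, and the layer sizes, activation, and iteration count are arbitrary choices, not the tutorial's:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))             # toy high dimensional inputs

# encoder 10 -> 32 -> 2, decoder 2 -> 32 -> 10, trained to reconstruct X from X
net = MLPRegressor(hidden_layer_sizes=(32, 2, 32), activation='tanh',
                   max_iter=5000, random_state=0)
net.fit(X, X)

def encode(net, X, bottleneck_layer=2):
    """Run the encoder half by hand to read out the 2-dimensional codes."""
    h = X
    for W, b in zip(net.coefs_[:bottleneck_layer], net.intercepts_[:bottleneck_layer]):
        h = np.tanh(h @ W + b)             # hidden layers use the tanh activation
    return h

codes = encode(net, X)                     # shape (500, 2)
```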

Properties of neural network Strengths Parameterizes nonlinear mapping (in both directions). Generalizes to new inputs. Weaknesses Many unspecified choices: network size, parameterization, learning rates. Highly nonlinear, iterative optimization with local minima.

Linear vs nonlinear What computational price must we pay for nonlinear dimensionality reduction?

Linear method #2 Metric Multidimensional Scaling (MDS)

Multidimensional scaling Given $n(n-1)/2$ pairwise distances $\Delta_{ij}$, find vectors $\vec{y}_i$ such that $\|\vec{y}_i - \vec{y}_j\| \approx \Delta_{ij}$. (Illustration: a $4 \times 4$ matrix of pairwise distances $\Delta_{ij}$ mapped to points $y_1, y_2, y_3, y_4$.)

Metric Multidimensional Scaling Lemma If $\Delta_{ij}$ denote the Euclidean distances between zero-mean vectors, then the inner products are: $G_{ij} = \frac{1}{2}\left[\frac{1}{n}\sum_k \left(\Delta_{ik}^2 + \Delta_{kj}^2\right) - \Delta_{ij}^2 - \frac{1}{n^2}\sum_{kl} \Delta_{kl}^2\right]$. Optimization Preserve dot products (proxy for distances). Choose vectors $\vec{y}_i$ to minimize: $\mathrm{err}(\vec{y}) = \sum_{ij}\left(G_{ij} - \vec{y}_i \cdot \vec{y}_j\right)^2$.

Matrix diagonalization Gram matrix matching: $\mathrm{err}(\vec{y}) = \sum_{ij}\left(G_{ij} - \vec{y}_i \cdot \vec{y}_j\right)^2$. Spectral decomposition: $G = \sum_{\alpha=1}^{n} \lambda_\alpha \vec{v}_\alpha \vec{v}_\alpha^\top$ with $\lambda_1 \ge \cdots \ge \lambda_n \ge 0$. Optimal approximation: $y_{i\alpha} = \sqrt{\lambda_\alpha}\, v_{\alpha i}$ for $\alpha = 1,2,\dots,d$ with $d \le n$ (scaled, truncated eigenvectors).
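
A compact sketch of metric MDS from a matrix of pairwise distances, following the lemma and the eigendecomposition above (names are illustrative):

```python
import numpy as np

def metric_mds(Delta, d):
    """Embed n points in R^d from an n x n matrix of pairwise distances Delta."""
    n = len(Delta)
    D2 = Delta ** 2
    # Gram matrix via double centering (the lemma above)
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ D2 @ J
    # spectral decomposition; keep the d largest eigenvalues
    evals, evecs = np.linalg.eigh(G)
    order = np.argsort(evals)[::-1][:d]
    lam, V = evals[order], evecs[:, order]
    # scaled, truncated eigenvectors give the embedding y_{i,alpha}
    return V * np.sqrt(np.maximum(lam, 0.0))

# usage: distances among points that actually live in the plane are recovered exactly
pts = np.random.default_rng(0).normal(size=(50, 2))
Delta = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
Y = metric_mds(Delta, d=2)
```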

Interpreting MDS $y_{i\alpha} = \sqrt{\lambda_\alpha}\, v_{\alpha i}$ for $\alpha = 1,2,\dots,d$ with $d \ll n$. Eigenvectors Ordered, scaled, and truncated to yield low dimensional embedding. Eigenvalues Measure how each dimension contributes to dot products. Estimated dimensionality Number of significant (nonnegative) eigenvalues.

Dual matrices Relation to PCA: $C_{\alpha\beta} = \frac{1}{n}\sum_i x_{i\alpha} x_{i\beta}$, the covariance matrix ($D \times D$); $G_{ij} = \vec{x}_i \cdot \vec{x}_j$, the Gram matrix ($n \times n$). Same eigenvalues Matrices share nonzero eigenvalues up to a constant factor. Same results, different computation PCA scales as $O((n+d)D^2)$. MDS scales as $O((D+d)n^2)$.
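
A quick numerical check of this duality (assuming centered data; the spectra agree up to the factor n):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)                         # center the inputs

C = X.T @ X / len(X)                        # covariance matrix, D x D
G = X @ X.T                                 # Gram matrix, n x n

evals_C = np.sort(np.linalg.eigvalsh(C))[::-1]               # all D eigenvalues
evals_G = np.sort(np.linalg.eigvalsh(G))[::-1][:5] / len(X)  # top-D of n, rescaled by n

print(np.allclose(evals_C, evals_G))        # True: same nonzero spectrum up to n
```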

So far.. Q: How to detect linear structure? A1: Principal components analysis A2: Metric multidimensional scaling Q: How (not) to generalize for manifolds?

Nonmetric MDS Transform pairwise distances: $\Delta_{ij} \to g(\Delta_{ij})$. Find vectors $\vec{y}_i$ such that $\|\vec{y}_i - \vec{y}_j\| \approx g(\Delta_{ij})$. (Illustration: the same $4 \times 4$ matrix of pairwise distances mapped to points $y_1, y_2, y_3, y_4$.)

Non-Metric MDS Distance transformation Nonlinear, but monotonic. Preserves rank order of distances. Optimization Preserve transformed distances. Choose vectors $\vec{y}_i$ to minimize: $\mathrm{err}(\vec{y}) = \sum_{ij}\left(g(\Delta_{ij}) - \|\vec{y}_i - \vec{y}_j\|\right)^2$.
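
For completeness, scikit-learn ships an iterative non-metric MDS (SMACOF with a monotonic transformation of the input dissimilarities); a minimal sketch, assuming a precomputed dissimilarity matrix `Delta`:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
pts = rng.normal(size=(60, 3))
Delta = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

# metric=False only asks the embedding to respect the rank order of distances
nmds = MDS(n_components=2, metric=False, dissimilarity='precomputed',
           n_init=4, random_state=0)
Y = nmds.fit_transform(Delta)
print(Y.shape, nmds.stress_)
```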

Properties of non-metric MDS Strengths Relaxes distance constraints. Yields nonlinear embeddings. Weaknesses Highly nonlinear, iterative optimization with local minima. Unclear how to choose distance transformation.

Non-metric MDS for manifolds? Rank ordering of Euclidean distances is NOT preserved in manifold learning. (Illustration: for points A, B, C on a curved manifold, the Euclidean distances give d(a,c) < d(a,b), while the distances along the manifold give d(a,c) > d(a,b).)

Linear vs nonlinear What computational price must we pay for nonlinear dimensionality reduction?

Graph-based method #1 Isometric mapping of data manifolds (ISOMAP) (Tenenbaum, de Silva, & Langford, 2000)

Dimensionality reduction Inputs: $\vec{x}_i \in \mathbb{R}^D$ with $i = 1,2,\dots,n$. Outputs: $\vec{y}_i \in \mathbb{R}^d$ where $d \ll D$. Goals Nearby points remain nearby. Distant points remain distant. (Estimate d.)

Isomap Key idea: Preserve geodesic distances as measured along submanifold. Algorithm in a nutshell: Use geodesic instead of (transformed) Euclidean distances in MDS.

Step 1. Build adjacency graph. Adjacency graph Vertices represent inputs. Undirected edges connect neighbors. Neighborhood selection Many options: k-nearest neighbors, inputs within radius r, prior knowledge. Graph is discretized approximation of submanifold.

Building the graph Computation kNN search scales naively as $O(n^2 D)$. Faster methods exploit data structures. Assumptions 1) Graph is connected. 2) Neighborhoods on graph reflect neighborhoods on manifold. No shortcuts connect different arms of the Swiss roll.

Step 2. Estimate geodesics. Dynamic programming Weight edges by local distances. Compute shortest paths through graph. Geodesic distances Estimate by lengths $\Delta_{ij}$ of shortest paths: denser sampling = better estimates. Computation Dijkstra's algorithm for shortest paths scales as $O(n^2 \log n + n^2 k)$.

Step 3. Embedding. Metric MDS Top d eigenvectors of the Gram matrix yield the embedding. Dimensionality The number of significant eigenvalues yields an estimate of the dimensionality. Computation The top d eigenvectors can be computed in $O(n^2 d)$.

Algorithm Summary 1) k nearest neighbors 2) shortest paths through graph 3) MDS on geodesic distances Impact Much simpler than earlier algorithms for manifold learning. Does it work?
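
The three steps translate almost directly into code; a sketch using scikit-learn for the kNN graph and scipy for the shortest paths (the `metric_mds` helper is the one sketched earlier, and the parameter choices here are arbitrary):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=12, d=2):
    # 1) k nearest neighbors: sparse graph weighted by local Euclidean distances
    W = kneighbors_graph(X, n_neighbors, mode='distance')
    # 2) shortest paths through the graph approximate geodesic distances
    Delta = shortest_path(W, method='D', directed=False)   # Dijkstra
    if np.isinf(Delta).any():
        raise ValueError("graph is not connected; increase n_neighbors")
    # 3) metric MDS on the geodesic distances (see the earlier sketch)
    return metric_mds(Delta, d)

# e.g. Y = isomap(swiss_roll_points, n_neighbors=12, d=2)
```

(scikit-learn also provides a ready-made `sklearn.manifold.Isomap` estimator.)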

Examples Swiss roll: n = 1024, k = 12. Wrist images: n = 2000, k = 6, D = 64².

Examples Face images: n = 698, k = 6. Digit images: n = 1000, r = 4.2, D = 20².

Interpolations A. Faces B. Wrists C. Digits Linear in Isomap feature space. Nonlinear in pixel space.

Properties of Isomap Strengths Polynomial-time optimizations No local minima Non-iterative (one pass thru data) Non-parametric Only heuristic is neighborhood size. Weaknesses Sensitive to shortcuts No out-of-sample extension

Large-scale applications Problem: Too expensive to compute all shortest paths and diagonalize the full n × n Gram matrix. Solution: Only compute shortest paths for a thin rectangular slice of the Gram matrix (distances to a subset of inputs) and diagonalize the small square submatrix within it (see Landmark Isomap, next).

Landmark Isomap Approximation Identify subset of inputs as landmarks. Estimate geodesics to/from landmarks. Apply MDS to landmark distances. Embed non-landmarks by triangulation. Related to the Nyström approximation. Computation Reduced by a factor of l/n for l < n landmarks. Reconstructs the large Gram matrix from a thin rectangular sub-matrix.
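
A hedged sketch of the landmark idea (following de Silva & Tenenbaum's Landmark MDS; the function name and interface are mine, and the geodesic-estimation step is omitted):

```python
import numpy as np

def landmark_mds(D_land, landmark_idx, d=2):
    """D_land: (l, n) distances from the l landmarks to all n inputs."""
    Dl2 = D_land[:, landmark_idx] ** 2          # l x l squared landmark distances
    l = len(landmark_idx)
    # classical MDS on the landmarks only
    J = np.eye(l) - np.ones((l, l)) / l
    B = -0.5 * J @ Dl2 @ J
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:d]
    lam, V = evals[order], evecs[:, order]
    lam = np.maximum(lam, 1e-12)                # guard against tiny/negative eigenvalues
    # triangulate every point from its squared distances to the landmarks
    Lsharp = V.T / np.sqrt(lam)[:, None]        # d x l pseudoinverse transpose
    mean_col = Dl2.mean(axis=0)                 # mean squared distance to each landmark
    Y = -0.5 * Lsharp @ (D_land ** 2 - mean_col[:, None])
    return Y.T                                  # n x d; landmarks recover their MDS coordinates
```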

Example Embedding of a sparse music similarity graph: n = 267K, e = 3.22M, l = 400, τ = 6 minutes. (Platt, 2004)

Theoretical guarantees Asymptotic convergence For data sampled from a submanifold that is isometric to a convex subset of Euclidean space, Isomap will recover the subset up to rotation & translation. (Tenenbaum et al; Donoho & Grimes) Convexity assumption Geodesic distances are not estimated correctly for manifolds with holes

Connected but not convex (Illustration: a 2d region with a hole as Isomap input; images of a teapot rotated through 360°, with the eigenvalues of Isomap.)

Connected but not convex Occlusion Images of two disks, one occluding the other. Locomotion Images of periodic gait.

Linear vs nonlinear What computational price must we pay for nonlinear dimensionality reduction?

Nonlinear dimensionality reduction since 2000 These strengths and weaknesses are typical of graph-based spectral methods for dimensionality reduction. Properties of Isomap Strengths Polynomial-time optimizations No local minima Non-iterative (one pass thru data) Non-parametric Only heuristic is neighborhood size. Weaknesses Sensitive to shortcuts No out-of-sample extension

Spectral Methods Common framework 1) Derive sparse graph from knn. 2) Derive matrix from graph weights. 3) Derive embedding from eigenvectors. Varied solutions Algorithms differ in step 2. Types of optimization: shortest paths, least squares fits, semidefinite programming.
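
To make the recipe concrete, here is a sketch of one member of the family, Laplacian eigenmaps (covered on Day 2), in a simplified, unnormalized variant: step 2 builds the graph Laplacian, and the embedding comes from its bottom nontrivial eigenvectors. Parameter choices are illustrative:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse import csgraph

def laplacian_eigenmap(X, n_neighbors=10, d=2):
    # 1) derive a sparse graph from kNN (symmetrized connectivity weights)
    W = kneighbors_graph(X, n_neighbors, mode='connectivity')
    W = 0.5 * (W + W.T)
    # 2) derive a matrix from the graph weights: the graph Laplacian L = D - W
    L = csgraph.laplacian(W, normed=False)
    # 3) derive the embedding from eigenvectors: smallest eigenvalues,
    #    skipping the constant eigenvector at eigenvalue 0
    evals, evecs = np.linalg.eigh(L.toarray())   # dense for clarity; use sparse solvers at scale
    return evecs[:, 1:d + 1]
```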

Algorithms 2000 Isomap (Tenenbaum, de Silva, & Langford) Locally Linear Embedding (Roweis & Saul) 2002 Laplacian eigenmaps (Belkin & Niyogi) 2003 Hessian LLE (Donoho & Grimes) 2004 Maximum variance unfolding (Weinberger & Saul) (Sun, Boyd, Xiao, & Diaconis) 2005 Conformal eigenmaps (Sha & Saul)

Looking ahead Trade-offs Sparse vs dense eigensystems? Preserving distances vs angles? Connected vs convex sets? Connections Spectral graph theory Convex optimization Differential geometry

Tuesday 2000 Isomap (Tenenbaum, de Silva, & Langford) Locally Linear Embedding (Roweis & Saul) 2002 Laplacian eigenmaps (Belkin & Niyogi) 2003 Hessian LLE (Donoho & Grimes) 2004 Maximum variance unfolding (Weinberger & Saul) (Sun, Boyd, Xiao, & Diaconis) 2005 Conformal eigenmaps (Sha & Saul) Sparse Matrix Methods

Wednesday 2000 Isomap (Tenenbaum, de Silva, & Langford) Locally Linear Embedding (Roweis & Saul) 2002 Laplacian eigenmaps (Belkin & Niyogi) 2003 Hessian LLE (Donoho & Grimes) 2004 Maximum variance unfolding (Weinberger & Saul) (Sun, Boyd, Xiao, & Diaconis) 2005 Conformal eigenmaps (Sha & Saul) Semidefinite Programming

To be continued... See you tomorrow.