Applications of Random Matrices in Spectral Computations and Machine Learning

Applications of Random Matrices in Spectral Computations and Machine Learning. Dimitris Achlioptas, UC Santa Cruz

This talk. Viewpoint: use randomness to transform the data. Random Projections; Fast Spectral Computations; Sampling in Kernel PCA.

The Setting. An n × d data matrix A is multiplied by a d × k projection matrix P (k ≪ d). Output: AP.

The Johnson-Lindenstrauss lemma. Algorithm: project onto a random hyperplane (subspace) of dimension k = O(ε⁻² log n); this preserves all pairwise distances up to a factor of 1 ± ε and succeeds with high probability.
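For reference, a standard quantitative form of the lemma (the textbook statement, not necessarily the exact constants used on the slide):

```latex
% Johnson-Lindenstrauss lemma (standard statement):
% for any 0 < eps < 1 and any n points x_1, ..., x_n in R^d,
% if k >= C * eps^{-2} * log n (C an absolute constant), then a suitably
% scaled random k-dimensional projection P satisfies, w.h.p., for all i, j:
\[
  (1-\varepsilon)\,\|x_i - x_j\|^2 \;\le\; \|P x_i - P x_j\|^2 \;\le\; (1+\varepsilon)\,\|x_i - x_j\|^2 .
\]
```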

Applications: Approximation algorithms [Charikar 02]; Hardness of approximation [Trevisan 97]; Learning mixtures of Gaussians [Arora, Kannan 01]; Approximate nearest-neighbors [Kleinberg 97]; Data-stream computations [Alon et al. 99, Indyk 00]; Min-cost clustering [Schulman 00]; Information Retrieval (LSI) [Papadimitriou et al. 97].

How to pick a random hyperplane. Take P = [r_ij], where the r_ij are independent N(0,1) random variables [Dasgupta Gupta 99] [Indyk Motwani 99] [Johnson Lindenstrauss 82]. Intuition: each column of P points in a uniformly random direction in R^d; each column is an unbiased, independent estimator of the squared length of any fixed vector (via its squared inner product with it); the squared length of the projection is, up to scaling, the average of these estimates (since we take the sum).

With orthonormalization: the estimators are equal and uncorrelated. Without orthonormalization: same thing!

Orthonormality: Take #1 Random vectors in high-dimensional Euclidean space are very nearly orthonormal. Do they have to be uniformly random? Is the Gaussian distribution magical?

JL with binary coins. Take P = [r_ij], where the r_ij are independent random variables with r_ij = +1 or -1, each with probability 1/2. Benefits: much faster in practice (only ± operations, no multiplications); fewer random bits; derandomization; slightly smaller(!) k.

JL with binary coins (continued): preprocessing with a randomized FFT [Ailon, Chazelle 06].
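A minimal sketch of the ±1 projection in NumPy; the 1/√k scaling is the standard normalization, and the function name and test sizes are illustrative choices, not from the talk:

```python
import numpy as np

def random_projection_pm1(A, k, rng=None):
    """Project the rows of the n x d matrix A into k dimensions using a
    +/-1 random matrix, scaled by 1/sqrt(k) so squared lengths are
    preserved in expectation (the 'binary coins' construction)."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    # Each entry of P is +1 or -1 with probability 1/2.
    P = rng.choice([-1.0, 1.0], size=(d, k))
    return A @ P / np.sqrt(k)

# Usage: distances between projected rows approximate the original distances.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((500, 1000))
    B = random_projection_pm1(A, k=200, rng=1)
    orig = np.linalg.norm(A[0] - A[1])
    proj = np.linalg.norm(B[0] - B[1])
    print(orig, proj)  # the two values should be close
```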

Let's at least look at the data

The Setting (recap): an n × d data matrix A, a projection matrix P, output AP.

Low Rank Approximations. Spectral norm: ||M||_2 = max over unit vectors x of ||Mx||. Frobenius norm: ||M||_F = sqrt(sum_ij M_ij^2). The rank-k truncated SVD A_k minimizes both ||A - B||_2 and ||A - B||_F over all rank-k matrices B.
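For concreteness, a minimal NumPy sketch of A_k via the truncated SVD (the standard construction, shown here as a reference implementation rather than the talk's own code):

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A via the truncated SVD; by Eckart-Young
    it minimizes both the spectral and the Frobenius norm of A - B over
    all rank-k matrices B."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Sanity check: the spectral-norm error ||A - A_k||_2 equals sigma_{k+1}(A).
```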

How to compute A_k. Start with a random unit vector x. Repeat until a fixpoint: have each row of A vote for x (compute y = Ax), then synthesize a new candidate by combining the rows of A according to their enthusiasm for x (x ← A^T y, normalized). This is power iteration on A^T A, also known as PCA. Then project A onto the subspace orthogonal to x and repeat for the next direction.
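A minimal sketch of this procedure (power iteration on A^T A with deflation); the convergence tolerance and iteration cap are arbitrary choices, not from the talk:

```python
import numpy as np

def top_k_directions(A, k, iters=200, rng=0):
    """Power iteration on A^T A with deflation: returns the top-k right
    singular vectors of A, one at a time."""
    rng = np.random.default_rng(rng)
    A = np.array(A, dtype=float)
    dirs = []
    for _ in range(k):
        x = rng.standard_normal(A.shape[1])
        x /= np.linalg.norm(x)
        for _ in range(iters):
            y = A @ x            # each row of A "votes" for x
            x_new = A.T @ y      # combine rows according to their enthusiasm
            x_new /= np.linalg.norm(x_new)
            done = np.linalg.norm(x_new - x) < 1e-10
            x = x_new
            if done:
                break
        dirs.append(x)
        A = A - np.outer(A @ x, x)   # project A onto the subspace orthogonal to x
    return np.array(dirs)
```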

PCA for Denoising. Assume we perturb the entries of a matrix A by adding independent Gaussian noise. Claim: if σ is not too big, then the optimal projections for the perturbed matrix are close to those for A. Intuition: the perturbation vectors are nearly orthogonal, so no small subspace accommodates many of them.
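A quick numerical illustration of the claim; the rank, noise level, and the principal-angle check below are arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r, sigma = 500, 300, 5, 0.1

# Low-rank signal plus independent Gaussian noise.
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
A_noisy = A + sigma * rng.standard_normal((n, d))

# Compare the top-r right singular subspaces of A and A + noise:
# the singular values of V_r^T W_r are the cosines of the principal angles.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
_, _, Vt_noisy = np.linalg.svd(A_noisy, full_matrices=False)
cosines = np.linalg.svd(Vt[:r] @ Vt_noisy[:r].T, compute_uv=False)
print(cosines.min())  # near 1 when sigma is small: the optimal projections barely move
```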

Rigorously. Lemma: for any matrices A and … Perspective: for any fixed x we have … w.h.p.
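The exact inequality is not shown above, but a standard perturbation bound of this kind, writing E for the noise matrix (a naming assumption), is Weyl's inequality for singular values: adding E moves no singular value of A by more than the spectral norm of E.

```latex
% Weyl's perturbation inequality for singular values:
% adding a noise matrix E moves no singular value of A by more than ||E||_2.
\[
  \bigl|\sigma_i(A + E) - \sigma_i(A)\bigr| \;\le\; \|E\|_2
  \qquad \text{for all } i .
\]
```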

Two new ideas. (1) A rigorous criterion for choosing k: stop when A - A_k has as much structure as a random matrix. (2) Computation-friendly noise: inject data-dependent noise.

Quantization

Sparsification
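A minimal sketch of the two kinds of data-dependent noise named above, in the spirit of unbiased sparsification and quantization; the keep-probability p and the ±b rounding rule are standard unbiased choices, not necessarily the talk's exact scheme:

```python
import numpy as np

def sparsify(A, p, rng=0):
    """Keep each entry independently with probability p and rescale by 1/p,
    so the result equals A in expectation (unbiased sparsification)."""
    rng = np.random.default_rng(rng)
    mask = rng.random(A.shape) < p
    return np.where(mask, A / p, 0.0)

def quantize(A, rng=0):
    """Round each entry to +b or -b, where b = max |A_ij| (assumes A is not
    all zeros), with probabilities chosen so the result equals A in
    expectation (unbiased quantization)."""
    rng = np.random.default_rng(rng)
    b = np.max(np.abs(A))
    prob_plus = (1.0 + A / b) / 2.0          # P[+b], always in [0, 1]
    signs = np.where(rng.random(A.shape) < prob_plus, 1.0, -1.0)
    return b * signs
```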

Accelerating spectral computations. By injecting sparsification/quantization noise we can accelerate spectral computations: fewer/simpler arithmetic operations; reduced memory footprint. The amount of noise that can be tolerated increases with the redundancy in the data. The L2 error can be quadratically better than Nyström.

Orthonormality: Take #2 Matrices with independent, 0-mean entries are white noise matrices

A scalar analogue: crude quantization at an extremely high rate + low-pass filter = 1-bit CD player ("Bitstream").

Accelerating spectral computations, continued: beyond the approximate computations above, injected sparsification/quantization noise is useful even for exact computations.

Accelerating exact computations

Kernels

Kernels & Support Vector Machines. Red and blue point clouds: which linear separator (hyperplane)? The maximum-margin one. The optimal separator can be expressed via inner products with (a few of) the data points.

Not always linearly separable

Population density

Kernel PCA. We can also compute the SVD of the n × d matrix A via the spectrum of the n × n matrix A A^T. Each entry of A A^T is the inner product of two inputs. Replace the inner product with a kernel function: we then work implicitly in a high-dimensional feature space, where good linear separators exist.
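A minimal sketch of kernel PCA with a Gaussian (RBF) kernel; the kernel choice and the bandwidth gamma are illustrative assumptions, and the double-centering step is the standard one:

```python
import numpy as np

def kernel_pca(X, k, gamma=1.0):
    """Kernel PCA: eigendecompose the (centered) n x n kernel matrix instead
    of working with explicit high-dimensional features."""
    n = X.shape[0]
    # Gaussian kernel: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    # Double-center K (implicitly center the feature vectors).
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J
    # Top-k eigenpairs; eigh returns eigenvalues in ascending order.
    vals, vecs = np.linalg.eigh(Kc)
    vals, vecs = vals[::-1][:k], vecs[:, ::-1][:, :k]
    # Coordinates of the inputs along the first k kernel principal components.
    return vecs * np.sqrt(np.maximum(vals, 0.0))
```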

From linear to non-linear PCA. The ||x - y||^p kernel illustrates how the contours of the first 2 components change from straight lines for p = 2 to non-linear curves for p = 1.5, 1, and 0.5. From Schölkopf and Smola, Learning with Kernels, MIT Press 2002.

Kernel PCA with Gaussian kernels. The contours follow the cluster densities! The first two kernel PCs separate the data nicely. Linear PCA has only 2 components here, but kernel PCA has more, since the feature-space dimension is usually large (in this case infinite).

KPCA in brief. Good news: works directly with non-vectorial inputs; very powerful, e.g. LLE, Isomap, Laplacian Eigenmaps [Ham et al. 03]. Bad news: n² kernel evaluations are too many. Good news: [Shawe-Taylor et al. 03] good generalization and rapid spectral decay.

So, it's enough to sample. In practice, 1% of the data is more than enough; in theory, we can go down to about n·polylog(n) kernel evaluations.
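A minimal illustration of the sampling idea: evaluate only a random fraction p of the kernel entries, rescale by 1/p so the estimate is unbiased, and take the spectrum of the result; the uniform entry sampling and the value of p are assumptions for the sketch, not the talk's exact scheme:

```python
import numpy as np

def sampled_kernel_spectrum(X, k, p=0.01, gamma=1.0, rng=0):
    """Estimate the top-k spectrum of the Gaussian kernel matrix while
    evaluating only about a p fraction of its entries: sample entries,
    rescale by 1/p, and eigendecompose the symmetric sparse estimate."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Choose which entries of the upper triangle to evaluate, then symmetrize.
    mask = np.triu(rng.random((n, n)) < p)
    mask = mask | mask.T
    K_hat = np.zeros((n, n))
    rows, cols = np.nonzero(mask)
    diffs = X[rows] - X[cols]
    K_hat[rows, cols] = np.exp(-gamma * np.sum(diffs**2, axis=1)) / p
    vals, vecs = np.linalg.eigh(K_hat)
    return vals[::-1][:k], vecs[:, ::-1][:, :k]
```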

Important Features are Preserved

Open Problems. How general is this stability under noise? For example, does it hold for Support Vector Machines? When can we prove such stability in a black-box fashion, i.e., as with matrices? Can we exploit it for data privacy?