Spectral Methods for Learning Latent Variable Models: Unsupervised and Supervised Settings
Anima Anandkumar, U.C. Irvine
Learning with Big Data
Data vs. Information
Messy data: missing observations, gross corruptions, outliers.
High-dimensional regime: as data grows, so does the number of variables!
Useful information: low-dimensional structures.
Learning with big data: an ill-posed problem, and computationally challenging. Learning is finding a needle in a haystack.
Principled approaches for finding low-dimensional structures?
How to Model Information Structures? Latent Variable Models
Incorporate hidden or latent variables. Information structures: relationships between the latent variables and the observed data.
Basic approach (mixtures/clusters): the hidden variable is categorical.
Advanced (probabilistic models): the hidden variables have more general distributions; can model mixed-membership/hierarchical groups.
[Figure: graphical model with hidden nodes h1, h2, h3 and observed nodes x1, ..., x5.]
Latent Variable Models (LVMs)
Document modeling. Observed: words. Hidden: topics.
Social network modeling. Observed: social interactions. Hidden: communities, relationships.
Recommendation systems. Observed: recommendations (e.g., reviews). Hidden: user and business attributes.
Unsupervised learning: learn an LVM without labeled examples.
LVM for Feature Engineering
Learn good features/representations for classification tasks, e.g., in computer vision and NLP.
Sparse coding/dictionary learning: sparse representations, low-dimensional hidden structures. A few dictionary elements can compose complicated shapes.
Associative Latent Variable Models
Supervised learning: given labeled examples {(x_i, y_i)}, learn a classifier ŷ = f(x).
Associative/conditional models: p(y | x). Example, logistic regression: E[y | x] = σ(⟨u, x⟩).
Mixture of logistic regressions: E[y | x, h] = g(⟨Uh, x⟩ + ⟨b, h⟩).
Multi-layer/deep network: E[y | x] = σ_d(A_d σ_{d−1}(A_{d−1} σ_{d−2}(⋯ A_2 σ_1(A_1 x)))).
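As a minimal sketch, the multilayer map above can be evaluated directly. The weight matrices A_1, A_2 below are illustrative (not from the talk), and a logistic sigmoid is used for every σ_l:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Evaluate E[y|x] = sigma_d(A_d sigma_{d-1}(... sigma_1(A_1 x)))
    for a list of weight matrices [A_1, ..., A_d], with a sigmoid at
    every layer (the slide allows a different sigma_l per layer)."""
    h = x
    for A in weights:
        h = sigmoid(A @ h)
    return h

# Tiny 2-layer example with illustrative weights
A1 = np.array([[1.0, -1.0], [0.5, 0.5]])
A2 = np.array([[1.0, 1.0]])
y = forward(np.array([2.0, 0.0]), [A1, A2])  # scalar output in (0, 1)
```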
Challenges in Learning LVMs
Computational challenges: maximum-likelihood estimation is NP-hard in most scenarios. In practice, local search approaches such as back-propagation, EM, and Variational Bayes have no consistency guarantees.
Sample complexity: exponential (in the hidden-variable dimension) for many learning methods.
Guaranteed and efficient learning through spectral methods.
Outline
1. Introduction
2. Spectral Methods (Classical Matrix Methods; Beyond Matrices: Tensors)
3. Moment Tensors for Latent Variable Models (Topic Models; Network Community Models; Experimental Results)
4. Moment Tensors in the Supervised Setting
5. Conclusion
Classical Spectral Methods: Matrix PCA and CCA
Unsupervised setting (PCA): for centered samples {x_i}, find a projection P with rank(P) = k that minimizes (1/n) Σ_{i∈[n]} ‖x_i − P x_i‖². Result: eigen-decomposition of S = Cov(X).
Supervised setting (CCA): for centered samples {(x_i, y_i)}, find max_{a,b} (aᵀ Ê[x yᵀ] b) / (√(aᵀ Ê[x xᵀ] a) √(bᵀ Ê[y yᵀ] b)). Result: generalized eigen-decomposition.
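A minimal NumPy sketch of the PCA step above (names are illustrative): the rank-k projector comes from the top-k eigenvectors of the sample covariance.

```python
import numpy as np

def pca_projector(X, k):
    """Top-k PCA projector for samples X (n x d).

    Minimizes (1/n) * sum_i ||x_i - P x_i||^2 over rank-k projections P,
    via the eigen-decomposition of the sample covariance S = Cov(X).
    """
    Xc = X - X.mean(axis=0)                # center the samples
    S = Xc.T @ Xc / len(X)                 # sample covariance (d x d)
    eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    U = eigvecs[:, -k:]                    # top-k eigenvectors
    return U @ U.T                         # rank-k projector P = U U^T

# Example: data concentrated near a 1-D subspace
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1)) @ np.array([[3.0, 0.1, 0.1]]) \
    + 0.01 * rng.normal(size=(500, 3))
P = pca_projector(X, k=1)
# P is (approximately) a rank-1 orthogonal projection onto the signal direction
```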
Shortcomings of Matrix Methods
Learning through spectral clustering: dimension reduction through PCA (on the data matrix), then clustering of the projected vectors (e.g., k-means).
The basic method works only for single memberships, and fails to cluster under small separation.
Efficient learning without separation constraints?
Beyond SVD: Spectral Methods on Tensors
How to learn mixture models without separation constraints? PCA uses the covariance matrix of the data; are higher-order moments helpful?
Unified framework? Moment-based estimation of probabilistic latent variable models?
SVD gives a spectral decomposition of matrices. What are the analogues for tensors?
Moment Matrices and Tensors
Multivariate moments in the unsupervised setting:
M_1 := E[x], M_2 := E[x ⊗ x], M_3 := E[x ⊗ x ⊗ x].
The matrix E[x ⊗ x] ∈ R^{d×d} is a second-order tensor, with E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]. For matrices, E[x ⊗ x] = E[x xᵀ].
The tensor E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor, with E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
Multivariate moments in the supervised setting:
M_1 := E[x], E[y], M_2 := E[x ⊗ y], M_3 := E[x ⊗ x ⊗ y].
Spectral Decomposition of Tensors
Matrix: M_2 = Σ_i λ_i u_i ⊗ v_i = λ_1 u_1 ⊗ v_1 + λ_2 u_2 ⊗ v_2 + ⋯
Tensor: M_3 = Σ_i λ_i u_i ⊗ v_i ⊗ w_i = λ_1 u_1 ⊗ v_1 ⊗ w_1 + λ_2 u_2 ⊗ v_2 ⊗ w_2 + ⋯
u ⊗ v ⊗ w is a rank-1 tensor since its (i_1, i_2, i_3)-th entry is u_{i1} v_{i2} w_{i3}.
How to solve this non-convex problem?
Decomposition of Orthogonal Tensors
M_3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i. Suppose A has orthogonal columns. Then
M_3(I, a_1, a_1) = Σ_i w_i ⟨a_i, a_1⟩² a_i = w_1 a_1.
So the a_i are eigenvectors of the tensor M_3, analogous to matrix eigenvectors: M v = M(I, v) = λ v.
Two problems: How to find eigenvectors of a tensor? And A is not orthogonal in general.
Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ R^{d×d×d}: T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i.
Recall the matrix power method: v ↦ M(I, v) / ‖M(I, v)‖.
Algorithm (tensor power method): v ↦ T(I, v, v) / ‖T(I, v, v)‖.
How do we avoid spurious solutions (not part of the decomposition)? The {v_i} are the only robust fixed points; all other eigenvectors are saddle points. For an orthogonal tensor, there are no spurious local optima!
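A sketch of the tensor power iteration above, with random restarts, on a small orthogonally decomposable tensor (the components and weights are illustrative):

```python
import numpy as np

def tensor_apply(T, v):
    """Contraction T(I, v, v) of a symmetric 3rd-order tensor T with v."""
    return np.einsum('ijk,j,k->i', T, v, v)

def power_method(T, n_restarts=10, n_iters=100, seed=0):
    """Recover one (eigenvalue, eigenvector) pair of an orthogonally
    decomposable tensor T = sum_i lambda_i v_i^{(x3)} via the iteration
    v <- T(I, v, v) / ||T(I, v, v)||, keeping the best over restarts."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        v = rng.normal(size=T.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(n_iters):
            w = tensor_apply(T, v)
            v = w / np.linalg.norm(w)
        lam = tensor_apply(T, v) @ v          # eigenvalue T(v, v, v)
        if best is None or lam > best[0]:
            best = (lam, v)
    return best

# Build an orthogonal tensor with known components e_1, e_2
lams = np.array([3.0, 1.5])
V = np.eye(3)[:, :2]                          # orthonormal columns
T = sum(l * np.einsum('i,j,k->ijk', v, v, v) for l, v in zip(lams, V.T))
lam, v = power_method(T)                      # expected to find lambda = 3, v = e_1
```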
Whitening: Conversion to an Orthogonal Tensor
M_3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i, M_2 = Σ_i w_i a_i ⊗ a_i.
Find a whitening matrix W such that Wᵀ A = V is an orthogonal matrix. When A ∈ R^{d×k} has full column rank, this is an invertible transformation.
Use the pairwise moments M_2 to find W: an SVD of M_2 is needed.
[Figure: whitening maps the components a_1, a_2, a_3 to orthonormal v_1, v_2, v_3.]
Putting It Together
Non-orthogonal tensor: M_3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i, M_2 = Σ_i w_i a_i ⊗ a_i.
Whitening matrix W, then a multilinear transform: T = M_3(W, W, W).
Tensor decomposition: guaranteed non-convex optimization!
For what latent variable models can we obtain M_2 and M_3 in these forms?
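An end-to-end sketch of the whitening step, under illustrative components A and weights w: W is built from the SVD of M_2, and T = M_3(W, W, W) becomes orthogonally decomposable with components v_i = √(w_i) Wᵀ a_i and eigenvalues 1/√(w_i).

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 5, 3
A = rng.normal(size=(d, k))               # non-orthogonal components a_i
w = np.array([0.5, 0.3, 0.2])             # positive weights

M2 = (A * w) @ A.T                        # sum_i w_i a_i a_i^T
M3 = np.einsum('i,ai,bi,ci->abc', w, A, A, A)

# Whitening: W s.t. W^T M2 W = I, from the top-k SVD of M2
U, s, _ = np.linalg.svd(M2)
W = U[:, :k] / np.sqrt(s[:k])             # d x k whitening matrix

T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)   # T = M3(W, W, W)
# T = sum_i lambda_i v_i^{(x3)} with lambda_i = 1/sqrt(w_i) and
# v_i = sqrt(w_i) W^T a_i forming an orthogonal matrix
```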
Types of Latent Variable Models
What is the form of the hidden variables h?
Basic approach (mixtures/clusters): the hidden variable h is categorical.
Advanced (probabilistic models): the hidden variable h has a more general distribution; can model mixed memberships, e.g., via the Dirichlet distribution.
[Figure: graphical model with hidden nodes h1, h2, h3 and observed nodes x1, ..., x5.]
Topic Modeling
Geometric Picture for Topic Models
[Figure: a document's topic-proportions vector h.]
[Figure: a single topic h generates words x_1, x_2, x_3 through the topic-word matrix A.]
Linear model: E[x_i | h] = A h.
Moments for Single-Topic Models
E[x_i | h] = A h, w := E[h]. Learn the topic-word matrix A and the vector w.
[Figure: topic h generates words x_1, ..., x_5 through A.]
Pairwise co-occurrence matrix:
M_2 := E[x_1 ⊗ x_2] = E[E[x_1 ⊗ x_2 | h]] = Σ_{i=1}^k w_i a_i ⊗ a_i.
Triples tensor:
M_3 := E[x_1 ⊗ x_2 ⊗ x_3] = E[E[x_1 ⊗ x_2 ⊗ x_3 | h]] = Σ_{i=1}^k w_i a_i ⊗ a_i ⊗ a_i.
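A Monte Carlo sketch of the pairwise-moment identity above, with an illustrative topic-word matrix A and topic proportions w: sample a topic per document, draw two words i.i.d. from that topic's column, and compare the empirical co-occurrence matrix to Σ_i w_i a_i a_iᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_docs = 4, 2, 200000
A = np.array([[0.6, 0.1],
              [0.2, 0.1],
              [0.1, 0.2],
              [0.1, 0.6]])               # topic-word matrix: columns sum to 1
w = np.array([0.7, 0.3])                # topic proportions w = E[h]

# Each document: draw a topic h ~ w, then words x1, x2 iid from A[:, h]
topics = rng.choice(k, size=n_docs, p=w)
x1 = np.empty(n_docs, dtype=int)
x2 = np.empty(n_docs, dtype=int)
for t in range(k):
    m = topics == t
    x1[m] = rng.choice(d, size=m.sum(), p=A[:, t])
    x2[m] = rng.choice(d, size=m.sum(), p=A[:, t])

M2_hat = np.zeros((d, d))
np.add.at(M2_hat, (x1, x2), 1.0)        # empirical co-occurrence frequencies
M2_hat /= n_docs

M2 = (A * w) @ A.T                      # population moment: sum_i w_i a_i a_i^T
# M2_hat converges to M2 as the number of documents grows
```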
Moments under LDA
M_2 := E[x_1 ⊗ x_2] − (α_0/(α_0 + 1)) E[x_1] ⊗ E[x_1],
M_3 := E[x_1 ⊗ x_2 ⊗ x_3] − (α_0/(α_0 + 2)) E[x_1 ⊗ x_2 ⊗ E[x_1]] − (more stuff ...).
Then M_2 = Σ_i w_i a_i ⊗ a_i and M_3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i.
Three words per document suffice for learning LDA. Similar forms hold for HMMs, ICA, sparse coding, etc.
"Tensor Decompositions for Learning Latent Variable Models" by A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. JMLR 2014.
Network Community Models
[Figure: a graph whose nodes belong to overlapping communities, with pairwise connectivity probabilities (e.g., 0.8 within a community, 0.1 across communities).]
Subgraph Counts as Graph Moments
3-star count tensor:
M_3(a, b, c) = (1/|X|) · (# of common neighbors of a, b, c in X) = (1/|X|) Σ_{x∈X} G(x, a) G(x, b) G(x, c),
i.e., M_3 = (1/|X|) Σ_{x∈X} G_{x,A} ⊗ G_{x,B} ⊗ G_{x,C}.
[Figure: a 3-star centered at x ∈ X with leaves a ∈ A, b ∈ B, c ∈ C.]
"A Tensor Spectral Approach to Learning Mixed Membership Community Models" by A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade. COLT 2013.
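A small sketch of the 3-star count tensor above on a toy adjacency matrix (the graph and the partition X vs. A, B, C are illustrative):

```python
import numpy as np

def three_star_tensor(G, X, A, B, C):
    """M3[a,b,c] = (1/|X|) * sum_{x in X} G[x,a] G[x,b] G[x,c]:
    the normalized count of common neighbors in X of the triple (a, b, c)."""
    GX = G[np.ix_(X, A)], G[np.ix_(X, B)], G[np.ix_(X, C)]
    return np.einsum('xa,xb,xc->abc', *GX) / len(X)

# Toy graph on 6 nodes: X = {0, 1, 2}, leaves A = B = C = {3, 4, 5}
G = np.zeros((6, 6))
edges = [(0, 3), (0, 4), (1, 3), (1, 4), (1, 5), (2, 5)]
for i, j in edges:
    G[i, j] = G[j, i] = 1.0
M3 = three_star_tensor(G, [0, 1, 2], [3, 4, 5], [3, 4, 5], [3, 4, 5])
# M3[0, 1, 2] counts x in X adjacent to all of nodes 3, 4, 5: only x = 1,
# so the normalized count is 1/3
```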
Computational Complexity (k ≪ n)
n = # of nodes, N = # of iterations, k = # of communities, c = # of cores.

         Whiten            STGD         Unwhiten
Space:   O(nk)             O(k²)        O(nk)
Time:    O(nsk/c + k³)     O(Nk³/c)     O(nsk/c)

Whiten: matrix/vector products and SVD. STGD: stochastic tensor gradient descent. Unwhiten: matrix/vector products.
Our approach: O(nsk/c + k³). Embarrassingly parallel and fast!
Tensor Decomposition on GPUs
[Figure: running time (secs, log scale) vs. number of communities k, comparing MATLAB Tensor Toolbox (CPU), CULA Standard Interface (GPU), CULA Device Interface (GPU), and Eigen Sparse (CPU).]
Summary of Results
Datasets: Facebook (users, friendships; n ≈ 20k), Yelp (users, businesses, reviews; n ≈ 40k), DBLP (authors, co-authorships; n ≈ 1 million, subset ≈ 100k).
Error (E) and recovery ratio (R):

Dataset             k̂     Method        Running time    E        R
Facebook (k=360)    500   ours          468             0.0175   100%
Facebook (k=360)    500   variational   86,808          0.0308   100%
Yelp (k=159)        100   ours          287             0.046    86%
Yelp (k=159)        100   variational   N.A.
DBLP sub (k=250)    500   ours          10,157          0.139    89%
DBLP sub (k=250)    500   variational   558,723         16.38    99%
DBLP (k=6000)       100   ours          5,407           0.105    95%

Thanks to Prem Gopalan and David Mimno for providing the variational code.
Experimental Results on Yelp
Lowest-error business categories and largest-weight businesses:

Rank   Category         Business                     Stars   Review counts
1      Latin American   Salvadoreno Restaurant       4.0     36
2      Gluten Free      P.F. Chang's China Bistro    3.5     55
3      Hobby Shops      Make Meaning                 4.5     14
4      Mass Media       KJZZ 91.5FM                  4.0     13
5      Yoga             Sutra Midtown                4.5     31

Bridgeness: distance from the vector [1/k̂, ..., 1/k̂]. Top-5 bridging nodes (businesses):

Business               Categories
Four Peaks Brewing     Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco        Restaurants, Pizza, Phoenix
FEZ                    Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt's Big Breakfast   Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co       Restaurants, Bars, Nightlife, Pubs, Tempe
Moment Tensors for Associative Models
Multivariate moments, many possibilities: E[x ⊗ y], E[x ⊗ x ⊗ y], E[ψ(x) ⊗ y], ...
Feature transformations of the input: x ↦ ψ(x). How to exploit them? Are the moments E[ψ(x) ⊗ y] useful?
If ψ(x) is a matrix/tensor, we have matrix/tensor moments, and we can carry out a spectral decomposition of these moments.
Score Function Features
Higher-order score function: S_m(x) := (−1)^m ∇^(m) p(x) / p(x).
This can be a matrix or a tensor instead of a vector (derivative w.r.t. the parameter or the input).
Form the cross-moments E[y ⊗ S_m(x)]. Extension of Stein's lemma: E[y ⊗ S_m(x)] = E[∇^(m) G(x)] when E[y | x] := G(x).
Spectral decomposition: E[∇^(m) G(x)] = Σ_{j∈[k]} λ_j u_j^{⊗m}.
Can be applied to learning associative latent variable models.
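A numerical sanity check of the first-order case of the identity above, under illustrative choices not from the talk: for standard Gaussian x, the first-order score is S_1(x) = −∇p(x)/p(x) = x, and for a linear G(x) = ⟨u, x⟩ Stein's lemma gives E[y · S_1(x)] = E[∇G(x)] = u.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200000
u = np.array([0.5, -1.0, 2.0])          # illustrative parameter vector

x = rng.normal(size=(n, d))             # x ~ N(0, I), so S_1(x) = x
y = x @ u                               # G(x) = <u, x>, hence grad G(x) = u

moment = (y[:, None] * x).mean(axis=0)  # empirical E[y * S_1(x)]
# Stein's lemma predicts: moment -> E[grad G(x)] = u as n grows
```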
Learning Deep Neural Networks (Realizable Setting)
E[y | x] = σ_d(A_d σ_{d−1}(A_{d−1} σ_{d−2}(⋯ A_2 σ_1(A_1 x)))).
M_3 = E[y ⊗ S_3(x)] = Σ_{i∈[r]} λ_i u_i^{⊗3}, where the u_i = e_iᵀ A_1 are the rows of A_1.
Guaranteed learning of the weights (layer by layer) via tensor decomposition. Similar guarantees for learning mixtures of classifiers.
Automated Extraction of Discriminative Features
Conclusion: Guaranteed Non-Convex Optimization via Tensor Decomposition
Efficient sample and computational complexities; better performance in practice compared to EM, Variational Bayes, etc.
Scalable and embarrassingly parallel: handles large datasets. Efficient performance: perplexity or ground-truth validation.
Related topics. Overcomplete tensor decomposition: neural network, sparse coding, and ICA models tend to be overcomplete (more neurons than input dimensions). Provable non-convex iterative methods: robust PCA, dictionary learning, etc.
My Research Group and Resources
Furong Huang, Majid Janzamin, Hanie Sedghi, Niranjan UN, Forough Arabshahi.
ML summer school lectures available at http://newport.eecs.uci.edu/anandkumar/MLSS.html