Spectral Methods for Learning Latent Variable Models: Unsupervised and Supervised Settings

Transcription

1 Spectral Methods for Learning Latent Variable Models: Unsupervised and Supervised Settings. Anima Anandkumar, U.C. Irvine

2 Learning with Big Data

5 Data vs. Information. Messy data: missing observations, gross corruptions, outliers. High-dimensional regime: as data grows, so does the number of variables. Useful information: low-dimensional structures. Learning with big data is an ill-posed problem: learning is finding a needle in a haystack. Learning with big data is also computationally challenging. Are there principled approaches for finding low-dimensional structures?

8 How to model information structures? Latent variable models incorporate hidden or latent variables. Information structures: relationships between latent variables and observed data. Basic approach (mixtures/clusters): the hidden variable is categorical. Advanced (probabilistic models): hidden variables have more general distributions and can model mixed membership or hierarchical groups. [Figure: graphical model with hidden variables $h_1, h_2, h_3$ and observed variables $x_1, \ldots, x_5$.]

9 Latent Variable Models (LVMs). Document modeling: observed words; hidden topics. Social network modeling: observed social interactions; hidden communities and relationships. Recommendation systems: observed recommendations (e.g., reviews); hidden user and business attributes. Unsupervised learning: learn an LVM without labeled examples.

10 LVM for Feature Engineering. Learn good features/representations for classification tasks, e.g., in computer vision and NLP. Sparse coding/dictionary learning: sparse representations, low-dimensional hidden structures. A few dictionary elements make complicated shapes.

14 Associative Latent Variable Models. Supervised learning: given labeled examples $\{(x_i, y_i)\}$, learn a classifier $\hat{y} = f(x)$. Associative/conditional models: $p(y \mid x)$. Example, logistic regression: $E[y \mid x] = \sigma(\langle u, x \rangle)$. Mixture of logistic regressions: $E[y \mid x, h] = g(\langle Uh, x \rangle + \langle b, h \rangle)$. Multi-layer/deep network: $E[y \mid x] = \sigma_d(A_d\, \sigma_{d-1}(A_{d-1}\, \sigma_{d-2}(\cdots A_2\, \sigma_1(A_1 x))))$.

15 Challenges in Learning LVMs. Computational challenges: maximum likelihood is NP-hard in most scenarios. In practice, local search approaches such as back-propagation, EM, and variational Bayes have no consistency guarantees. Sample complexity: exponential (w.r.t. the hidden variable dimension) for many learning methods. Guaranteed and efficient learning through spectral methods.

16 Outline
1. Introduction
2. Spectral Methods: Classical Matrix Methods; Beyond Matrices: Tensors
3. Moment Tensors for Latent Variable Models: Topic Models; Network Community Models; Experimental Results
4. Moment Tensors in Supervised Setting
5. Conclusion

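18 Classical Spectral Methods: Matrix PCA and CCA. Unsupervised setting (PCA): for centered samples $\{x_i\}$, find a projection $P$ with $\mathrm{Rank}(P) = k$ that solves $\min_P \frac{1}{n} \sum_{i \in [n]} \|x_i - P x_i\|^2$. Result: eigen-decomposition of $S = \mathrm{Cov}(X)$. Supervised setting (CCA): for centered samples $\{x_i, y_i\}$, find $\max_{a,b} \frac{a^\top \hat{E}[x y^\top] b}{\sqrt{a^\top \hat{E}[x x^\top] a \; b^\top \hat{E}[y y^\top] b}}$. Result: generalized eigen-decomposition. [Figure: $x$ and $y$ with their projections $\langle a, x \rangle$ and $\langle b, y \rangle$.]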
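The PCA result on this slide can be illustrated with a small NumPy sketch: the rank-k projection that minimizes the average reconstruction error is built from the top-k eigenvectors of the empirical covariance. The function name `pca_projection` and the synthetic data are assumptions made for illustration; they are not from the talk.

```python
import numpy as np

def pca_projection(X, k):
    """Rank-k orthogonal projection P built from the top-k eigenvectors of Cov(X)."""
    Xc = X - X.mean(axis=0)                 # center the samples
    S = Xc.T @ Xc / Xc.shape[0]             # empirical covariance S = Cov(X)
    eigvals, eigvecs = np.linalg.eigh(S)    # eigh returns eigenvalues in ascending order
    U = eigvecs[:, -k:]                     # eigenvectors of the k largest eigenvalues
    return U @ U.T                          # projection onto their span

# Hypothetical data: two high-variance directions and three low-variance ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) * np.array([5.0, 3.0, 0.1, 0.1, 0.1])
P = pca_projection(X, 2)
```

Here `P` is symmetric and idempotent with trace 2, so it is an orthogonal projection of rank k = 2.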

21 Shortcomings of Matrix Methods. Learning through spectral clustering: dimension reduction through PCA (on the data matrix), then clustering on the projected vectors (e.g., k-means). The basic method works only for single memberships and fails to cluster under small separation. Efficient learning without separation constraints?

23 Beyond SVD: Spectral Methods on Tensors. How to learn mixture models without separation constraints? PCA uses the covariance matrix of the data. Are higher-order moments helpful? Is there a unified framework for moment-based estimation of probabilistic latent variable models? SVD gives the spectral decomposition of matrices. What are the analogues for tensors?

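25 Moment Matrices and Tensors. Multivariate moments in the unsupervised setting: $M_1 := E[x]$, $M_2 := E[x \otimes x]$, $M_3 := E[x \otimes x \otimes x]$. The matrix $E[x \otimes x] \in \mathbb{R}^{d \times d}$ is a second-order tensor with $E[x \otimes x]_{i_1, i_2} = E[x_{i_1} x_{i_2}]$; for matrices, $E[x \otimes x] = E[x x^\top]$. The tensor $E[x \otimes x \otimes x] \in \mathbb{R}^{d \times d \times d}$ is a third-order tensor with $E[x \otimes x \otimes x]_{i_1, i_2, i_3} = E[x_{i_1} x_{i_2} x_{i_3}]$. Multivariate moments in the supervised setting: $M_1 := E[x], E[y]$; $M_2 := E[x \otimes y]$; $M_3 := E[x \otimes x \otimes y]$.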
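As a rough illustration of the unsupervised definitions, the empirical versions of $M_1$, $M_2$ and $M_3$ can be formed with `numpy.einsum`. The helper name `empirical_moments` and the random samples are assumptions for this sketch only.

```python
import numpy as np

def empirical_moments(X):
    """Empirical M1 = E[x], M2 = E[x (x) x], M3 = E[x (x) x (x) x] from rows of X."""
    n = X.shape[0]
    M1 = X.mean(axis=0)                                   # shape (d,)
    M2 = np.einsum('ni,nj->ij', X, X) / n                 # shape (d, d)
    M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n          # shape (d, d, d)
    return M1, M2, M3

# Hypothetical samples: n = 1000 points in d = 4 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
M1, M2, M3 = empirical_moments(X)
```

Entry `M3[i1, i2, i3]` is the sample average of `x[i1] * x[i2] * x[i3]`, matching the index formula on the slide, and `M2` coincides with the sample average of $x x^\top$.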

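27 Spectral Decomposition of Tensors. Matrix: $M_2 = \sum_i \lambda_i\, u_i \otimes v_i$. [Figure: matrix $M_2$ as the sum $\lambda_1 u_1 \otimes v_1 + \lambda_2 u_2 \otimes v_2 + \ldots$] Tensor: $M_3 = \sum_i \lambda_i\, u_i \otimes v_i \otimes w_i$. [Figure: tensor $M_3$ as the sum $\lambda_1 u_1 \otimes v_1 \otimes w_1 + \lambda_2 u_2 \otimes v_2 \otimes w_2 + \ldots$] $u \otimes v \otimes w$ is a rank-1 tensor since its $(i_1, i_2, i_3)$-th entry is $u_{i_1} v_{i_2} w_{i_3}$. How to solve this non-convex problem?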

31 Decomposition of Orthogonal Tensors. $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$. Suppose $A$ has orthogonal (unit-norm) columns. Then $M_3(I, a_1, a_1) = \sum_i w_i \langle a_i, a_1 \rangle^2 a_i = w_1 a_1$. The $a_i$ are eigenvectors of the tensor $M_3$, analogous to matrix eigenvectors: $Mv = M(I, v) = \lambda v$. Two problems: how to find eigenvectors of a tensor? And $A$ is not orthogonal in general.

37 Orthogonal Tensor Power Method. Symmetric orthogonal tensor $T \in \mathbb{R}^{d \times d \times d}$: $T = \sum_{i \in [k]} \lambda_i\, v_i \otimes v_i \otimes v_i$. Recall the matrix power method: $v \mapsto \frac{M(I, v)}{\|M(I, v)\|}$. Algorithm (tensor power method): $v \mapsto \frac{T(I, v, v)}{\|T(I, v, v)\|}$. How do we avoid spurious solutions (not part of the decomposition)? The $\{v_i\}$ are the only robust fixed points; all other eigenvectors are saddle points. For an orthogonal tensor, there are no spurious local optima!
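The update $v \mapsto T(I, v, v) / \|T(I, v, v)\|$ can be sketched in a few lines of NumPy. This is a minimal illustration on a synthetic orthogonal tensor, not the implementation used in the talk; the function names, iteration count, and test tensor are assumptions.

```python
import numpy as np

def tensor_apply(T, v):
    """Compute T(I, v, v): contract the last two modes of T with v."""
    return np.einsum('ijk,j,k->i', T, v, v)

def tensor_power_method(T, n_iter=100, seed=0):
    """Iterate v -> T(I, v, v) / ||T(I, v, v)|| from a random unit vector."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = tensor_apply(T, v)
        v = u / np.linalg.norm(u)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)   # eigenvalue estimate T(v, v, v)
    return lam, v

# Hypothetical orthogonal tensor T = sum_i lams[i] * V[:, i]^{(x)3}.
d, k = 5, 3
rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.normal(size=(d, k)))     # orthonormal columns
lams = np.array([3.0, 2.0, 1.0])
T = np.einsum('r,ir,jr,kr->ijk', lams, V, V, V)
lam, v = tensor_power_method(T)
```

From a random start the iterate settles on one of the columns of `V`, with `lam` equal to the matching entry of `lams`; which column it finds depends on the initialization.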

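38 Whitening: Conversion to Orthogonal Tensor. $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i \otimes a_i$. Find a whitening matrix $W$ s.t. $W^\top A = V$ is an orthogonal matrix. When $A \in \mathbb{R}^{d \times k}$ has full column rank, it is an invertible transformation. [Figure: $W$ maps the non-orthogonal $a_1, a_2, a_3$ to the orthogonal $v_1, v_2, v_3$.] Use the pairwise moments $M_2$ to find $W$; an SVD of $M_2$ is needed.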
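One common way to build $W$ from $M_2$ is sketched below, assuming $M_2$ is symmetric positive semidefinite of rank k, so that its eigendecomposition plays the role of the SVD. In this sketch the whitened vectors are taken as $v_i = \sqrt{w_i}\, W^\top a_i$, which is the scaling under which $W^\top M_2 W = I$. The names and the synthetic $A$, $w$ are illustrative assumptions.

```python
import numpy as np

def whitening_matrix(M2, k):
    """Return W (d x k) with W.T @ M2 @ W = I_k, from the top-k eigenpairs of M2."""
    eigvals, eigvecs = np.linalg.eigh(M2)        # ascending eigenvalues
    U, s = eigvecs[:, -k:], eigvals[-k:]
    return U / np.sqrt(s)                        # scale each eigenvector by 1/sqrt(eigenvalue)

# Hypothetical non-orthogonal components with full column rank.
d, k = 6, 3
rng = np.random.default_rng(0)
A = rng.normal(size=(d, k))
w = np.array([0.5, 0.3, 0.2])
M2 = (A * w) @ A.T                               # sum_i w_i a_i a_i^T
M3 = np.einsum('r,ir,jr,kr->ijk', w, A, A, A)    # sum_i w_i a_i^{(x)3}

W = whitening_matrix(M2, k)
V = (W.T @ A) * np.sqrt(w)                       # columns v_i = sqrt(w_i) W^T a_i
T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)  # multilinear transform M3(W, W, W)
```

With this scaling `V` is an orthogonal k x k matrix, and `T` equals $\sum_i w_i^{-1/2}\, v_i \otimes v_i \otimes v_i$, an orthogonal tensor.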

41 Putting it together. Non-orthogonal tensor: $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i \otimes a_i$. Whitening matrix $W$. Multilinear transform: $T = M_3(W, W, W)$. [Figure: $W$ maps the components $a_1, a_2, a_3$ of tensor $M_3$ to the orthogonal components $v_1, v_2, v_3$ of tensor $T$.] Tensor decomposition: guaranteed non-convex optimization! For what latent variable models can we obtain the $M_2$ and $M_3$ forms?
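The whole pipeline (whiten, run the tensor power method, un-whiten) can be sketched end to end on exact moments. The deflation step and the un-whitening formulas $a_i = \lambda_i (W^\top)^{\dagger} v_i$, $w_i = 1/\lambda_i^2$ follow the standard recipe for this decomposition; the function names, iteration count, and synthetic inputs are assumptions, and no noise handling is included.

```python
import numpy as np

def power_iteration(T, n_iter, rng):
    """Run v -> T(I, v, v) / ||T(I, v, v)|| from a random unit vector."""
    v = rng.normal(size=T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = np.einsum('ijk,j,k->i', T, v, v)
        v = u / np.linalg.norm(u)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)   # eigenvalue T(v, v, v)
    return lam, v

def decompose(M2, M3, k, n_iter=100, seed=0):
    """Recover weights w_i and components a_i from M2 and M3 (exact moments assumed)."""
    rng = np.random.default_rng(seed)
    eigvals, eigvecs = np.linalg.eigh(M2)
    W = eigvecs[:, -k:] / np.sqrt(eigvals[-k:])          # whitening: W.T @ M2 @ W = I
    T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)      # T = M3(W, W, W)
    unwhiten = np.linalg.pinv(W.T)                       # maps whitened vectors back to R^d
    weights, components = [], []
    for _ in range(k):
        lam, v = power_iteration(T, n_iter, rng)
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)   # deflate the recovered component
        weights.append(1.0 / lam ** 2)                   # w_i = 1 / lambda_i^2
        components.append(lam * (unwhiten @ v))          # a_i = lambda_i (W^T)^+ v_i
    return np.array(weights), np.column_stack(components)

# Hypothetical ground truth used to build exact moments.
d, k = 6, 3
A = np.random.default_rng(0).normal(size=(d, k))
w = np.array([0.5, 0.3, 0.2])
M2 = (A * w) @ A.T
M3 = np.einsum('r,ir,jr,kr->ijk', w, A, A, A)
w_hat, A_hat = decompose(M2, M3, k)
```

The recovered pairs come back in an arbitrary order, so they should be compared with `w` and `A` up to a permutation of the columns.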

43 Types of Latent Variable Models. What is the form of the hidden variables $h$? Basic approach (mixtures/clusters): the hidden variable $h$ is categorical. Advanced (probabilistic models): the hidden variable $h$ has a more general distribution and can model mixed memberships, e.g., a Dirichlet distribution. [Figure: graphical model with hidden variables $h_1, h_2, h_3$ and observed variables $x_1, \ldots, x_5$.]

45 Topic Modeling

46 Geometric Picture for Topic Models. [Figure labels: topic proportions vector ($h$); document.]

49 Geometric Picture for Topic Models. Single topic ($h$). Word generation ($x_1, x_2, \ldots$). Linear model: $E[x_i \mid h] = Ah$. [Figure: a single topic $h$ generating words $x_1, x_2, x_3$ through $A$.]

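51 Moments for Single Topic Models. $E[x_i \mid h] = Ah$, $w := E[h]$. Learn the topic-word matrix $A$ and the vector $w$. [Figure: hidden topic $h$ generating words $x_1, \ldots, x_5$, each through $A$.] Pairwise co-occurrence matrix $M_2$: $M_2 := E[x_1 \otimes x_2] = E[E[x_1 \otimes x_2 \mid h]] = \sum_{i=1}^{k} w_i\, a_i \otimes a_i$. Triples tensor $M_3$: $M_3 := E[x_1 \otimes x_2 \otimes x_3] = E[E[x_1 \otimes x_2 \otimes x_3 \mid h]] = \sum_{i=1}^{k} w_i\, a_i \otimes a_i \otimes a_i$.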
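The step from the conditional expectation to the sum over topics can be spelled out as follows. The derivation assumes, as is standard for the single topic model, that $h$ takes values in the basis vectors $\{e_1, \ldots, e_k\}$ with $P(h = e_i) = w_i$ and that the words are conditionally independent given $h$; these assumptions are implied by the slide rather than stated on it.

```latex
\begin{align*}
E[x_1 \otimes x_2 \mid h = e_i]
  &= E[x_1 \mid h = e_i] \otimes E[x_2 \mid h = e_i]
     && \text{(conditional independence)} \\
  &= A e_i \otimes A e_i = a_i \otimes a_i
     && \text{(linear model } E[x_j \mid h] = Ah\text{)} \\
M_2 = E\bigl[E[x_1 \otimes x_2 \mid h]\bigr]
  &= \sum_{i=1}^{k} P(h = e_i)\, a_i \otimes a_i
   = \sum_{i=1}^{k} w_i\, a_i \otimes a_i .
\end{align*}
```

The same argument with three words gives $M_3 = \sum_{i=1}^{k} w_i\, a_i \otimes a_i \otimes a_i$.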

52 Moments under LDA. $M_2 := E[x_1 \otimes x_2] - \frac{\alpha_0}{\alpha_0 + 1} E[x_1] \otimes E[x_1]$. $M_3 := E[x_1 \otimes x_2 \otimes x_3] - \frac{\alpha_0}{\alpha_0 + 2} E[x_1 \otimes x_2 \otimes E[x_1]] - $ more stuff... Then $M_2 = \sum_i w_i\, a_i \otimes a_i$ and $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$. Three words per document suffice for learning LDA. Similar forms hold for HMM, ICA, sparse coding, etc. Reference: "Tensor Decompositions for Learning Latent Variable Models" by A. Anandkumar, R. Ge, D. Hsu, S.M. Kakade and M. Telgarsky. JMLR 2014.

54 Network Community Models

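62 Subgraph Counts as Graph Moments. 3-star count tensor: $M_3(a, b, c) = \frac{1}{|X|}\, \#\text{ of common neighbors in } X = \frac{1}{|X|} \sum_{x \in X} G(x, a)\, G(x, b)\, G(x, c)$. In tensor form, $M_3 = \frac{1}{|X|} \sum_{x \in X} \bigl[ G_{x,A}^\top \otimes G_{x,B}^\top \otimes G_{x,C}^\top \bigr]$. [Figure: node partition $X, A, B, C$ with a 3-star from $x \in X$ to $a \in A$, $b \in B$, $c \in C$.] Reference: "A Tensor Spectral Approach to Learning Mixed Membership Community Models" by A. Anandkumar, R. Ge, D. Hsu, and S.M. Kakade. COLT 2013.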
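The 3-star count can be computed directly from an adjacency matrix. The sketch below assumes a dense 0/1 adjacency matrix `G` and a hypothetical partition of the nodes into index lists `X`, `A`, `B`, `C`; the function name and the random toy graph are illustrative only.

```python
import numpy as np

def three_star_tensor(G, X, A, B, C):
    """M3[a, b, c] = (1/|X|) * sum over x in X of G[x, a] * G[x, b] * G[x, c]."""
    GA = G[np.ix_(X, A)]          # edges from X to A, shape (|X|, |A|)
    GB = G[np.ix_(X, B)]
    GC = G[np.ix_(X, C)]
    return np.einsum('xa,xb,xc->abc', GA, GB, GC) / len(X)

# Hypothetical toy graph: symmetric 0/1 adjacency matrix on 12 nodes.
rng = np.random.default_rng(0)
n = 12
upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1)
G = upper + upper.T
X, A, B, C = [0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]
M3 = three_star_tensor(G, X, A, B, C)
```

Each entry of `M3` is the fraction of nodes in `X` that are common neighbors of the corresponding triple `(a, b, c)`.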

64 Computational Complexity ($k \ll n$). $n$ = # of nodes, $N$ = # of iterations, $k$ = # of communities, $c$ = # of cores.

         Whiten            STGD           Unwhiten
Space    O(nk)             O(k^2)         O(nk)
Time     O(nsk/c + k^3)    O(N k^3 / c)   O(nsk/c)

Whiten: matrix/vector products and SVD. STGD: stochastic tensor gradient descent. Unwhiten: matrix/vector products. Our approach: $O(nsk/c + k^3)$. Embarrassingly parallel and fast!

65 Tensor Decomposition on GPUs. [Figure: running time (secs) versus number of communities $k$, comparing MATLAB Tensor Toolbox (CPU), CULA Standard Interface (GPU), CULA Device Interface (GPU), and Eigen Sparse (CPU).]

66 Summary of Results. [Figure: network types: Facebook (users, friend), Yelp (business, user, reviews), DBLP (author, coauthor).] Facebook: $n \sim 20$k. Yelp: $n \sim 40$k. DBLP (sub): $n \sim 1$ million ($\sim 100$k). Error (E) and recovery ratio (R):

Dataset             k-hat   Method        Running Time   E    R
Facebook (k=360)    500     ours                              %
Facebook (k=360)    500     variational   86,                 %
Yelp (k=159)        100     ours                              %
Yelp (k=159)        100     variational   N.A.
DBLP sub (k=250)    500     ours          10,                 %
DBLP sub (k=250)    500     variational   558,                %
DBLP (k=6000)       100     ours                              %

Thanks to Prem Gopalan and David Mimno for providing variational code.

68 Experimental Results on Yelp. Lowest error business categories and largest weight businesses (columns: Rank, Category, Business, Stars, Review Counts), listed in rank order starting from rank 1:

Category          Business
Latin American    Salvadoreno Restaurant
Gluten Free       P.F. Chang's China Bistro
Hobby Shops       Make Meaning
Mass Media        KJZZ 91.5FM
Yoga              Sutra Midtown

Bridgeness: distance from the vector $[1/\hat{k}, \ldots, 1/\hat{k}]$. Top-5 bridging nodes (businesses):

Business                Categories
Four Peaks Brewing      Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco         Restaurants, Pizza, Phoenix
FEZ                     Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt's Big Breakfast    Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co        Restaurants, Bars, Nightlife, Pubs, Tempe

70 Moment Tensors for Associative Models. Multivariate moments: many possibilities, e.g., $E[x \otimes y]$, $E[x \otimes x \otimes y]$, $E[\psi(x) \otimes y]$, ... Feature transformations of the input: $x \mapsto \psi(x)$. How to exploit them? Are the moments $E[\psi(x) \otimes y]$ useful? If $\psi(x)$ is a matrix/tensor, we have matrix/tensor moments and can carry out spectral decomposition of the moments.

71 Score Function Features. Higher-order score function: $S_m(x) := (-1)^m \frac{\nabla^{(m)} p(x)}{p(x)}$. It can be a matrix or a tensor instead of a vector. The derivative is w.r.t. a parameter or the input. Form the cross-moments $E[y \cdot S_m(x)]$. Extension of Stein's lemma: $E[y \cdot S_m(x)] = E[\nabla^{(m)} G(x)]$ when $E[y \mid x] := G(x)$. Spectral decomposition: $E[\nabla^{(m)} G(x)] = \sum_{j \in [k]} u_j^{\otimes m}$. Can be applied to learning associative latent variable models.
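As a worked special case, take the input to be standard Gaussian, $p(x) \propto \exp(-\|x\|^2/2)$; this choice of input distribution is an assumption made here for illustration and is not specified on the slide. The first two score functions then have closed forms, and the first-order identity reduces to the classical Stein's lemma.

```latex
\begin{align*}
\nabla p(x) &= -x\, p(x)
  &&\Rightarrow\quad S_1(x) = -\frac{\nabla p(x)}{p(x)} = x, \\
\nabla^{(2)} p(x) &= (x x^\top - I)\, p(x)
  &&\Rightarrow\quad S_2(x) = \frac{\nabla^{(2)} p(x)}{p(x)} = x x^\top - I, \\
E[y \cdot S_1(x)] &= E[y\, x] = E[\nabla G(x)],
  &&\quad E[y \cdot S_2(x)] = E[y\,(x x^\top - I)] = E[\nabla^{(2)} G(x)] .
\end{align*}
```

So for Gaussian input the cross-moments of the label with $x$ and with $x x^\top - I$ expose the expected first and second derivatives of $G(x) = E[y \mid x]$.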

72 Learning Deep Neural Networks. Realizable setting: $E[y \mid x] = \sigma_d(A_d\, \sigma_{d-1}(A_{d-1}\, \sigma_{d-2}(\cdots A_2\, \sigma_1(A_1 x))))$. $M_3 = E[y \cdot S_3(x)] = \sum_{i \in [r]} \lambda_i\, u_i^{\otimes 3}$, where $u_i = e_i^\top A_1$ are the rows of $A_1$. Guaranteed learning of the weights (layer by layer) via tensor decomposition. Similar guarantees hold for learning a mixture of classifiers.

73 Automated Extraction of Discriminative Features

75 Conclusion: Guaranteed Non-Convex Optimization. Tensor decomposition: efficient sample and computational complexities; better performance compared to EM, variational Bayes, etc. In practice: scalable and embarrassingly parallel, so it can handle large datasets; efficient performance, validated by perplexity or ground truth. Related topics: overcomplete tensor decomposition (neural networks, sparse coding and ICA models tend to be overcomplete, i.e., more neurons than input dimensions); provable non-convex iterative methods (robust PCA, dictionary learning, etc.).

76 My Research Group and Resources. Furong Huang, Majid Janzamin, Hanie Sedghi, Niranjan UN, Forough Arabshahi. ML summer school lectures available at