Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu

Size: px

Start display at page:

Download "Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu"

Violet Crawford
10 years ago
Views:

1 Learning Gaussian process models from big data Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Machine learning seminar at University of Cambridge, July

2 Data A lot of data

3 Outline Nonlinear tensor models (and stochastic blockmodels) Sparse Gaussian process models

4 Tensor data: multiple aspects Patients Biomarkers Medicines Yi,j,k: the value of the k-th biomarker (i.e., cell population) for the j-th patient after taking the i-th medicine Predict drug response

5 Matrix data: networks people people Yi,j: 1 if node i is linked to node j, 0 otherwise. Discover communities and predict unknown interactions

6 Goals - Predict unknown elements (e.g., drug response and network interactions) - Identify latent multi-aspect groups (communities)

7 Classical Tucker decomposition Generalization of matrix factorization 3D case: core tensor loading matrices U Y Z! Y = G 1 U 2 Z 3 V y ijk = g rst u ir z js v ks r s t Sun et al. 2008

8 Assumptions Complete Continuous Multi-linear

9 Solution Sparse latent Gaussian processes on tensors

10 Latent sparse GP on tensors Patients Biomarkers (i, j, k) u i : Medicines Element u i : z j : v k : (i, j, k) is characterized by Sparse loading vector in latent medicine groups Sparse loading vector in latent patient groups Sparse loading vector in latent biomarker groups

Sparse loading vector in latent medicine groups Sparse loading

11 Separate covariance for each dimension K v Biomarkers -Separate covariance/kernel function for each dimension Kz Patients i r K u (i, r) =k(u i, u r ) Medicines Nonlinear relationship between medicines i and r -The more similar loading vectors, the larger the covariance function value

r) =k(u i, u r ) Medicines Nonlinear relationship between medicines i

12 GP on tensors GP on a tensor: stochastic process in an infinite tensor space Tensor N (F 0; K u, K v, K z )=(2π) 3N 2 s=u,v,z K s N2 2 Evaluations of GP on any tensor of finite size is a tensor-valued Gaussian distribution exp{ 1 2 (F 1 K 1 u 2 K 1 v 3 K 1 z ) F 2 }

N2 2 Evaluations of GP on any tensor of finite size is a

13 Predict unknown tensor elements Kv Patients K v Biomarkers (i, j, k) Medicines K u w (r, s, t) Simple illustration: 1) Based on observed data, estimate loading vectors: [u i, z j, v k ] 2) Compute weights (similarities) between unknown and observed elements: w(ijk, rst) w([u i, z j, v k ], [u r, z s, v t ]) 3) Predict the unknown element: y i,j,k = r,s,t w(ijk, rst)y r,s,t

2) Compute weights (similarities) between unknown and observed elements: w(ijk, rst) w([u i,

14 Graphical model representation U =[u 1,...,u N ] U V Z Sparse loading vectors u i exp( λ u i ) Similarly, sample V and Z F Latent tensor F N(0; K u, K v, K z ) Y U Unknown data Y O Observed data Y O p(y ijk f ijk ) (i,j,k) O p(y ijk f ijk ) : Gaussian for continuous data Probit for binary data Possion for count data

Z F Latent tensor F N(0; K u, K v, K z ) Y U Unknown data Y O Observed data Y O

15 Benefits Handle binary and missing data Discover block/group structures Avoid overfitting: adaptive nonparametric model complexity Model prediction uncertainty Incorporate additional side information Yan, Xu & Qi, 2011; Xu, Yan & Qi 2011

nonparametric model complexity Model prediction uncertainty

16 Algorithm: Variational EM Marginal likelihood log p(y O U, V, Z) + log p(u, V, Z) Variational approximation Iterations

17 Algorithm: explore model structures Example: Trace{(I + K u K v K z ) 1 } Direct computation: Matrix inversion N 3 by N 3 O(N 9 ) Kronecker product operation:

18 Properties of Kronecker product Properties: Eigen-decomposition K = WΛW T If K = K u K v K z then W = W u W v W z Λ = Λ u Λ v Λ z W u : eigenvectors of K u diag{λ u } : eigenvalues of K u

19 Reduced computational complexity Example: Trace{(I + K u K v K z ) 1 } Direct computation Using the new theorem and trace properties O(N 9 ) N N N i=1 j=1 k=1 1 1+λ u i λv j λz k O(N 3 ) = M M M i=1 j=1 k=1 1 1+λ u i λv j λz k O(N 2 M) M<<N

trace properties O(N 9 ) N N N i=1 j=1 k=1 1 1+λ u i λv j

20 2D case: GP stochastic blockmodels - Undirected networks (friend relationships and protein-protein interactions) - Represented by symmetric adjacent matrices Yan, Xu & Qi, UAI 2011; Xu, Yan & Qi AAAI 2011

interactions) - Represented by symmetric adjacent

21 2D: Coauthor networks Membership produced by LEM Area Under Curve AUC values SMGB MMSB LEM Ours Number of latent groups Groups Groups Groups Nodes Membership produced by MMSB Nodes Membership produced by SMGB Nodes NIPS authors Ours Co-authorship dataset: co-authorship links from100 authors who have the largest number of co-authors from NIPS 1-17.

22 3D: Enron s Enron dataset: s from senior management of Enron before its bankruptcy in D tensor representation: Sender-Recipient-Subject Area Under Curve AUC values InfTucker tp InfTucker gp CP TD HOSVD NCP WCP PTD Ours Number of Factors Xu, Yan & Qi ICML 2012

23 4D: Digg Area Under Curve AUC values Ours Number of Factors Digg dataset: Social news from digg.com D tensor representation: user-news-keywords-category AUC values InfTucker tp InfTucker gp CP TD HOSVD NCP WCP PTD 3 5

24 Outline Bayesian nonparametric stochastic blockmodels Sparse Gaussian process models

25 Gaussian process Nonparametric Bayesian prior over functions Computational bottleneck: O(N 3 ) for regression

26 Sparse GP models Low-rank approximation by Nystrom approximation (Williams & Seeger 2001) Summarize data by a few pseudo inputs (Snelson & Ghahramani 2006) Summarize data by a few pseudo data clouds (Qi et al. 2010) A unifying view (Quiñonero-Candela & Rasmussen 2005)

27 Summarization by data clouds (Qi et al. UAI 2010) Exact posterior process: Data cloud approximation: M<<N Local manifold information q(f) GP (f 0,K) M j=1 p(u j f(x)φ(x)dx,λ 1 j ) When φ(x) =δ(x), this approximation reduces to FITC (i.e,. pseudo input approximation).

28 X 2 0 X X 1 Exact GP X 1 Pseudo input (i.e., FITC) X 2 0 X X 1 X 1 SASPA_sphere SASPA (Qi et al. UAI 2010)

29 Key observation Previous approaches: compress data into a sparse representation including pseudo data points or clouds PCA: the optimal compact representation among all orthogonal bases

30 EigenGP: sparse PCA + GP KL expansion of GP prior by Nystrom method: O(N 2 ) -> O(M 2 N) Select eigenfunctions by evidence maximization Qi, Dai & Zhu NIPS submission 2012

31 Eigenfunctions of Gaussian kernel Top four eigenfunctions Selected eigenfunctions

32 RMSE RMSE FULL GP FITC NYSTROM EIGEN GP EIGEN GP* Number of Basis/Rank Number of Basis/Rank Boston Housing (400/506 for training) RMSE of Nystrom: 312.5, 41.68, and Pumadyn-8nm (2000/8192 for training)

33 Classification results on digits 0.02 EIGEN GP* EIGEN GP Classification Errror Number of Basis/Rank EigenGP*: fix the eigenvalues EigenGP: sparsify eigenvalues

34 Classification results Classification Error Rate FITC EP FULL GP EP NYSTROM LAPLACE SOGP EIGEN GP Classification Error Rate FITC EP FULL GP EP NYSTROM LAPLACE SOGP EIGEN GP Classification Error Rate FITC EP FULL GP EP NYSTROM LAPLACE SOGP EIGEN GP Number of Basis/Rank Number of Basis/Rank Number of Basis/Rank Spambase 3 vs 8 5 vs 8

35 Semi-supervised classification Classification Error Rate SEB GR LAPSVM EIGEN GP Classification Error Rate SVM SEB GR LAPSVM EIGEN GP Classification Error Rate SVM SEB GR LAPSVM EIGEN GP Number of Labeled Points Ionospher e (351 points) Number of Labeled Points 20 Newsgroup (1976 points) Number of Labeled Points TDT2 (3672 points)

36 Why EigenGP is better than Nystrom Nystrom method: - Numerical matrix approximation and NOT a valid probabilistic model -- break down when the rank is low. - Always the top eigenvectors - Converges to full GP EigenGP: - Valid low-rank GP models -- robust when the rank is small (i.e., HIGHLY sparse models) - Can choose eigenvectors based on labeled information -- semisupervised learning - Can outperform full GP, esp. for classification (exploring clustering property)

37 Conclusions Latent GP models for graphs and tensors: 20% improvement in prediction accuracy (Xu et al., 2012) EigenGP: fast inference with potential of outperforming full GP on prediction accuracy (Qi et al., 2012)

Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes