Statistical machine learning, high dimension and big data



1 Statistical machine learning, high dimension and big data
S. Gaïffas (CMAP - Ecole Polytechnique), March 14

2 Agenda for today
- Divide and Conquer principle for collaborative filtering
- Graphical modelling, Graphical Gaussian Model

3 Divide and Conquer Principle for Matrix Completion

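4 SVD-based matrix completion
- Unknown matrix $M$ of size $n_1 \times n_2$. Prior: $\mathrm{rank}(M) \ll n_1 \wedge n_2$
- We observe $P_\Omega(M) = \{M_{j,k} : (j,k) \in \Omega\} \in \mathbb{R}^m$
- The basic iteration of a proximal gradient algorithm writes
$$X^{k+1} \leftarrow S_\lambda\big(X^k - \eta_k (P_\Omega(X^k) - P_\Omega(M))\big)$$
where $S_\lambda$ is the spectral soft-thresholding operator
$$S_\lambda(X) = U \,\mathrm{diag}\big[(\sigma_1(X) - \lambda)_+, \ldots, (\sigma_{n_1 \wedge n_2}(X) - \lambda)_+\big]\, V^\top$$
with $X = U \Sigma V^\top$ the SVD of $X$.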
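The iteration above can be sketched in NumPy as follows. This is only an illustration under assumptions that are not on the slide: a full SVD is used instead of a truncated one, the step size is fixed to $\eta_k = 1$, the observed set $\Omega$ is encoded by a boolean `mask`, and the names `spectral_soft_threshold` and `prox_grad_step` are hypothetical.

```python
import numpy as np

def spectral_soft_threshold(X, lam):
    """S_lambda(X): soft-threshold the singular values of X by lam."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def prox_grad_step(X, M_obs, mask, lam, eta):
    """One iteration X <- S_lam(X - eta * (P_Omega(X) - P_Omega(M)))."""
    grad = mask * (X - M_obs)          # zero outside the observed set Omega
    return spectral_soft_threshold(X - eta * grad, lam)

rng = np.random.default_rng(0)
M = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 20))   # rank 3
mask = rng.random(M.shape) < 0.6       # assumed sampling of Omega
M_obs = mask * M                       # P_Omega(M), stored as a matrix

X = np.zeros_like(M)
for _ in range(200):
    X = prox_grad_step(X, M_obs, mask, lam=0.1, eta=1.0)
rel_err = np.linalg.norm(X - M) / np.linalg.norm(M)
```

Each call to `spectral_soft_threshold` costs one SVD of the full matrix, which is the bottleneck discussed on the next slide.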

5 SVD-based matrix completion
- Bottleneck: a truncated SVD is necessary at each iteration
- Best-case complexity is $O(n_1 n_2 k)$ [Lanczos algorithms]
- Algorithms for matrix completion with theoretical guarantees rely on this expensive truncated SVD computation. This does not scale!

Idea: Divide and Conquer principle [Mackey, Talwalkar and Jordan (2011)]
- Divide $M$ into submatrices
- Solve the subproblems: matrix completion of each submatrix (done in parallel)
- Combine the reconstructed submatrices to reconstruct the entire $M$

6 Divide and Conquer Matrix Completion
1. Randomly partition $M$ into $T$ column submatrices
$$M = [C_1 \; C_2 \; \cdots \; C_T] \quad \text{with } C_t \in \mathbb{R}^{n_1 \times p}, \; p = n_2 / T$$
2. Complete each submatrix, giving $\hat C_t$, using trace norm penalization on this subproblem, in parallel [if fully done in parallel, this is $T$ times faster than the single completion of $M$], leading to $[\hat C_1 \; \hat C_2 \; \cdots \; \hat C_T]$
3. Combine them: project this matrix onto the column space of each $\hat C_t$, and average. If $\hat C_t = \hat U_t \hat\Sigma_t \hat V_t^\top$ is the SVD of $\hat C_t$, compute
$$\hat M = \frac{1}{T} \sum_{t=1}^T \hat U_t \hat U_t^\top [\hat C_1 \; \hat C_2 \; \cdots \; \hat C_T]$$
[Note that $\hat U_t \hat U_t^\top$ is the projection matrix onto the space spanned by the columns of $\hat C_t$]
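A minimal sketch of the combine step (step 3) is given below. To keep it short, the completed blocks $\hat C_t$ are replaced by exact column blocks of a low-rank matrix, which is an assumption made for illustration only; the function name `dfc_combine` and its `rank` argument are hypothetical.

```python
import numpy as np

def dfc_combine(C_hats, rank):
    """Average the projections of [C_1 ... C_T] onto each column space."""
    C_full = np.hstack(C_hats)                 # n1 x n2
    M_hat = np.zeros_like(C_full)
    for C_t in C_hats:
        # Leading left singular vectors span the column space of C_t
        U_t = np.linalg.svd(C_t, full_matrices=False)[0][:, :rank]
        M_hat += U_t @ (U_t.T @ C_full)        # cheap order: U (U^T C)
    return M_hat / len(C_hats)

rng = np.random.default_rng(0)
n1, n2, r, T = 40, 30, 2, 3
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
perm = rng.permutation(n2)                     # random column partition
blocks = np.array_split(perm, T)
C_hats = [M[:, idx] for idx in blocks]         # stand-in for completed blocks
M_hat = dfc_combine(C_hats, rank=r)
M_perm = M[:, perm]                            # M with columns in block order
```

In this noiseless setting every block already spans the column space of `M`, so the averaged projection returns `M_perm` up to rounding.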

7 Divide and Conquer Matrix Completion
- Full matrix completion: complexity $O(n_1 n_2 k)$ per iteration (truncated SVD on the full matrix)
- DC matrix completion:
  - at most $O(n_1 p \max_t k_t)$ complexity per iteration (for the truncated SVDs on the subcompletion problems, done in parallel)
  - $O(n_1 k p)$ for the multiplication $\hat U_t \hat U_t^\top [\hat C_1 \; \hat C_2 \; \cdots \; \hat C_T]$, done in parallel, hence $O(n_1 k p T) = O(n_1 n_2 k)$ for the averaging $\frac{1}{T} \sum_{t=1}^T$ (but done only once)
- Warning: $\hat U_t \hat U_t^\top \hat C_j = \hat U_t (\hat U_t^\top \hat C_j)$ is $O(n_1 k p)$, while $\hat U_t \hat U_t^\top \hat C_j = (\hat U_t \hat U_t^\top) \hat C_j$ is $O(n_1^2 p)$
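The warning about the order of the products can be illustrated numerically. The sizes below are arbitrary, and the operation counts are rough multiply-add estimates rather than measured timings.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, p, k = 500, 40, 5
U = np.linalg.qr(rng.standard_normal((n1, k)))[0]   # n1 x k, orthonormal columns
C = rng.standard_normal((n1, p))

cheap = U @ (U.T @ C)     # two products, each of order n1 * k * p
costly = (U @ U.T) @ C    # builds an n1 x n1 projector, then order n1^2 * p

ops_cheap = 2 * n1 * k * p
ops_costly = n1 * n1 * k + n1 * n1 * p
```

Both expressions give the same matrix; only the number of operations differs.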

8 Divide and Conquer Matrix Completion
- Numerical results [Mackey et al. (2011)]
- Almost the same theoretical guarantees as matrix completion on the full matrix

9 Divide and Conquer Matrix Completion
What is behind this? Getting a low-rank approximation using projection onto a random column subsample.
- $M$ is an $n_1 \times n_2$ matrix and $L$ a rank $r$ approximation of $M$. Fix $x > 0$ and $\varepsilon > 0$
- Construct a matrix $C$ of size $n_1 \times p$ that contains columns of $M$ picked at random without replacement
- Compute $C = U_C \Sigma_C V_C^\top$, the SVD of $C$

Then
$$\| M - U_C U_C^\top M \|_F \leq (1 + \varepsilon) \| M - L \|_F$$
with probability $\geq 1 - e^{-x}$ whenever $p \geq c\, r\, \mu_0(V_L) \log(n_1 \wedge n_2)\, x / \varepsilon^2$, where
$$\mu_0(V) = \frac{n_2}{r} \max_{1 \leq i \leq n_2} \| V_{i,\bullet} \|_2^2 = \frac{n_2}{r} \| V \|_{2,\infty}^2$$
with $L = U_L \Sigma_L V_L^\top$ the SVD of $L$.
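The quantity $\| M - U_C U_C^\top M \|_F$ can be computed as in the sketch below. The sizes, the noise level and the helper name `column_projection_error` are assumptions chosen for illustration, and the code does not check the probabilistic bound itself.

```python
import numpy as np

def column_projection_error(M, p, rng):
    """||M - U_C U_C^T M||_F for C made of p columns drawn without replacement."""
    cols = rng.choice(M.shape[1], size=p, replace=False)
    C = M[:, cols]
    U_C, s, _ = np.linalg.svd(C, full_matrices=False)
    U_C = U_C[:, s > 1e-10 * s[0]]        # keep the numerical column space of C
    return np.linalg.norm(M - U_C @ (U_C.T @ M))

rng = np.random.default_rng(0)
n1, n2, r = 60, 80, 3
L = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))   # rank r
M = L + 0.01 * rng.standard_normal((n1, n2))                       # M close to L

err_exact = column_projection_error(L, p=10, rng=rng)   # exactly low-rank case
err_proj = column_projection_error(M, p=20, rng=rng)
err_L = np.linalg.norm(M - L)                            # reference ||M - L||_F
```

When the matrix is exactly of rank $r$ and the sampled columns span its column space, the projection error vanishes up to rounding.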

10 Graphical modelling, Graphical Gaussian Model

11 Graphs

12 Graphs

13 Graphs
Co-occurrence of words

14 Graphs
Relations between artists in the last.fm database

15 Graphs
Evolution of co-voters in the US Senate [Show video]

16 Graphs
Graph
- A graph $G$ consists of a set of vertices $V$ and a set of edges $E$
- We often write $G = (V, E)$
- $E$ is a subset of $V \times V$ containing ordered pairs of distinct vertices. An edge is directed from $j$ to $k$ if $(j, k) \in E$
- Undirected graphs, directed graphs

17 Graphical Models
Graphical Model
- The set $V$ corresponds to a collection of random variables
- Denote $V = \{1, \ldots, p\}$ with $|V| = p$
- $X = (X_1, \ldots, X_p) \sim P$
- The pair $(G, P)$ is a graphical model

18 Graphs, Graphical Models
- Consider an undirected graph $G$ and a graphical model $(G, P)$
- We say that $P$ satisfies the pairwise Markov property with respect to $G = (V, E)$ iff
$$X_j \perp X_k \mid X_{V \setminus \{j,k\}}$$
for any $(j, k) \notin E$, $j \neq k$, namely $X_j$ and $X_k$ are conditionally independent given all the other vertices
- A graphical model satisfying this property is called a conditional independence graph (CIG)

19 Gaussian Graphical Models
- A Gaussian Graphical Model is a CIG with the assumption
$$X = (X_1, \ldots, X_p) \sim N(0, \Sigma)$$
for a positive definite covariance matrix $\Sigma$. The mean is zero to simplify notations
- A well-known result (Lauritzen (1996)):
$$(j, k) \text{ and } (k, j) \notin E \quad \text{iff} \quad X_j \perp X_k \mid X_{V \setminus \{j,k\}} \quad \text{iff} \quad (\Sigma^{-1})_{j,k} = 0 \quad \text{[exerc.]}$$
- The edges can be read on the precision matrix $K = \Sigma^{-1}$:
$$(j, k) \in E \text{ and } (k, j) \in E \quad \text{iff} \quad K_{j,k} \neq 0$$
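As a small illustration of reading the edges on the precision matrix, the sketch below uses a chain graph whose precision matrix is tridiagonal. The matrix values, the tolerance `tol` and the helper name `edges_from_precision` are illustrative assumptions, and vertices are indexed from 0 as usual in Python.

```python
import numpy as np

def edges_from_precision(K, tol=1e-10):
    """Undirected edges (j, k), j < k, such that K[j, k] is nonzero."""
    p = K.shape[0]
    return [(j, k) for j in range(p) for k in range(j + 1, p)
            if abs(K[j, k]) > tol]

# Chain 0 - 1 - 2 - 3: tridiagonal precision matrix
K = np.array([[2.0, -1.0, 0.0, 0.0],
              [-1.0, 2.0, -1.0, 0.0],
              [0.0, -1.0, 2.0, -1.0],
              [0.0, 0.0, -1.0, 2.0]])
Sigma = np.linalg.inv(K)          # the covariance matrix itself is dense
edges = edges_from_precision(K)
```

The covariance `Sigma` has no zero entry here, while `K` has exactly the sparsity pattern of the chain: marginal dependence does not reveal the graph, conditional dependence does.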

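20 Gaussian Graphical Models
- The partial correlation $\rho_{j,k \mid V \setminus \{j,k\}}$ between $X_j$ and $X_k$ conditional on $X_{V \setminus \{j,k\}}$ is given by
$$\rho_{j,k \mid V \setminus \{j,k\}} = -\frac{K_{j,k}}{\sqrt{K_{j,j} K_{k,k}}}$$
- The partial correlation coefficients are tied to regression coefficients: we can write
$$X_j = \beta_{j,k} X_k + \sum_{l \in V \setminus \{j,k\}} \beta_{j,l} X_l + \varepsilon_j$$
where $E[\varepsilon_j] = 0$ and $\varepsilon_j \perp X_{V \setminus \{j\}}$, with
$$\beta_{j,k} = -\frac{K_{j,k}}{K_{j,j}} \quad \text{and} \quad \beta_{k,j} = -\frac{K_{j,k}}{K_{k,k}} \quad \text{[exerc.]}$$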
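The link between the precision matrix and regression coefficients can be checked numerically. The sketch below compares $-K_{j,k}/K_{j,j}$ with the population regression coefficients $\Sigma_{-j,-j}^{-1} \Sigma_{-j,j}$ on an arbitrary positive definite covariance; the function name `partial_correlations` is hypothetical.

```python
import numpy as np

def partial_correlations(K):
    """rho[j, k] = -K[j, k] / sqrt(K[j, j] * K[k, k]) for j != k."""
    d = np.sqrt(np.diag(K))
    rho = -K / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 4 * np.eye(4)      # arbitrary positive definite covariance
K = np.linalg.inv(Sigma)

j = 0
others = [1, 2, 3]
# Population regression of X_j on all the other variables
beta_regression = np.linalg.solve(Sigma[np.ix_(others, others)], Sigma[others, j])
beta_from_K = -K[j, others] / K[j, j]
rho = partial_correlations(K)
```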

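21 Sparse Gaussian Graphical Model
- Suppose that we observe $X_1, \ldots, X_n$ i.i.d. $N(0, \Sigma)$
- Let $\mathbf{X}$ be the $n \times p$ observation matrix with rows $X_i^\top = [X_{i,1} \; \cdots \; X_{i,p}]$
- Estimation of $K = \Sigma^{-1}$ is achieved by maximum likelihood estimation
$$L(\Sigma; \mathbf{X}) = \prod_{i=1}^n \frac{1}{(2\pi)^{p/2} \sqrt{\det \Sigma}} \exp\Big(-\frac{1}{2} X_i^\top \Sigma^{-1} X_i\Big)$$
or
$$L(K; \mathbf{X}) = \prod_{i=1}^n \frac{\sqrt{\det K}}{(2\pi)^{p/2}} \exp\Big(-\frac{1}{2} X_i^\top K X_i\Big)$$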

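22 Gaussian Graphical Models
- Minus log-likelihood is, up to the multiplicative factor $n/2$,
$$\ell(K; \mathbf{X}) = -\log\det K + \langle \hat\Sigma, K \rangle + c$$
where $\hat\Sigma = \frac{1}{n} \sum_{i=1}^n X_i X_i^\top$ is the empirical covariance, $c$ does not depend on $K$, and $\langle A, B \rangle = \mathrm{tr}(A^\top B)$
- Prior assumption: each vertex isn't connected to all others: there are only a few edges in the graph
- Use $\ell_1$-penalization on $K$ to obtain a sparse solution
- Graphical Lasso [Friedman et al (2007), Banerjee et al (2008)]
$$\hat K \in \operatorname*{argmin}_{K : K \succ 0} \Big\{ -\log\det K + \langle \hat\Sigma, K \rangle + \lambda \sum_{1 \leq j < k \leq p} |K_{j,k}| \Big\}$$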
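The penalized objective can be evaluated as in the sketch below, which is only meant to make the three terms concrete. The function name `graphical_lasso_objective` is hypothetical, positive definiteness is tested through a Cholesky factorization, and the data are simulated with an arbitrary seed.

```python
import numpy as np

def graphical_lasso_objective(K, S, lam):
    """-log det K + <S, K> + lam * sum_{j<k} |K[j, k]|, +inf if K is not PD."""
    try:
        chol = np.linalg.cholesky(K)
    except np.linalg.LinAlgError:
        return np.inf                       # outside the cone K > 0
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    off_diag = np.abs(K[np.triu_indices_from(K, k=1)]).sum()
    return -logdet + np.sum(S * K) + lam * off_diag

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
S = X.T @ X / X.shape[0]                    # empirical covariance (zero mean)
K_mle = np.linalg.inv(S)                    # minimizer when lam = 0

obj_mle = graphical_lasso_objective(K_mle, S, lam=0.0)
obj_shifted = graphical_lasso_objective(K_mle + 0.05 * np.eye(4), S, lam=0.0)
obj_identity = graphical_lasso_objective(np.eye(4), S, lam=0.0)
```

With $\lambda = 0$ the objective is minimized at $\hat\Sigma^{-1}$, which is generally not sparse; the $\ell_1$ term is what produces zeros in $\hat K$.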

23 Sparse Gaussian Graphical Model

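24 Sparse Gaussian Graphical Model
How to solve
$$\hat K \in \operatorname*{argmin}_{K \succ 0} \big\{ -\log\det K + \langle \hat\Sigma, K \rangle + \lambda \|K\|_1 \big\} \; ?$$
- It is a convex minimization problem: $-\log\det$ is convex
- $\log\det$ is differentiable, with $\nabla \log\det(X) = X^{-1}$
- Recall that $\max_{\|X\|_\infty \leq 1} \langle X, K \rangle = \|K\|_1$
- The dual problem is
$$\max_{\|X\|_\infty \leq \lambda} \big\{ \log\det(\hat\Sigma + X) + p \big\}$$
and the primal and dual variables are related by $K = (\hat\Sigma + X)^{-1}$
- The duality gap is [Exerc.]
$$\langle K, \hat\Sigma \rangle - p + \lambda \|K\|_1$$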
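The duality gap is cheap to evaluate, as in the sketch below. It assumes that $\|K\|_1$ is the sum of the absolute values of all entries of $K$, diagonal included, which is the convention under which the dual above is stated; the function name `duality_gap` is hypothetical. For a diagonal $\hat\Sigma$ the dual is maximized at $X = \lambda I$, so the gap should vanish at $K = (\hat\Sigma + \lambda I)^{-1}$.

```python
import numpy as np

def duality_gap(K, S, lam):
    """<K, S> - p + lam * ||K||_1, with ||K||_1 the sum of all |K[j, k]|."""
    p = K.shape[0]
    return np.sum(K * S) - p + lam * np.abs(K).sum()

S = np.diag([1.0, 2.0, 0.5])          # diagonal empirical covariance
lam = 0.3
K_opt = np.linalg.inv(S + lam * np.eye(3))
gap_opt = duality_gap(K_opt, S, lam)
```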

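25 Sparse Gaussian Graphical Model
- Rewrite the dual problem as
$$\min_{\|X\|_\infty \leq \lambda} \big\{ -\log\det(\hat\Sigma + X) - p \big\} \quad \Longleftrightarrow \quad \min_{\|X - \hat\Sigma\|_\infty \leq \lambda} -\log\det(X)$$
- This will be optimized recursively by updating over a single row and column of $K$ at a time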

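26 Sparse Gaussian Graphical Model
- Let $X_{-j,-k}$ be the matrix with the $j$-th row and $k$-th column removed, and $X_{-j}$ the $j$-th column with its $j$-th entry removed
- Recall the Schur complement formula
$$\det \begin{bmatrix} A & B \\ C & D \end{bmatrix} = \det(A) \det(D - C A^{-1} B)$$
- Namely
$$\det \begin{bmatrix} K_{-p,-p} & k_{-p} \\ k_{-p}^\top & k_{p,p} \end{bmatrix} = \det(K_{-p,-p}) \det(k_{p,p} - k_{-p}^\top K_{-p,-p}^{-1} k_{-p})$$
- If we are at iteration $k$, update the $j$-th row and column ($j = p$ in the display above) by $k_{-j}^{(k)}$, solution of
$$\min_{\|y - \hat\Sigma_{-j}\|_\infty \leq \lambda} y^\top (K^{(k-1)}_{-j,-j})^{-1} y$$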
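The Schur complement formula for the last row and column can be checked numerically on an arbitrary symmetric positive definite matrix, as in this short sketch (the matrix is random and the variable names are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
B = rng.standard_normal((p, p))
K = B @ B.T + p * np.eye(p)          # symmetric positive definite

K_mm = K[:-1, :-1]                   # K_{-p,-p}
k_m = K[:-1, -1]                     # k_{-p}
k_pp = K[-1, -1]                     # k_{p,p}

lhs = np.linalg.det(K)
rhs = np.linalg.det(K_mm) * (k_pp - k_m @ np.linalg.solve(K_mm, k_m))
```

Since $\det(K_{-p,-p})$ does not depend on $k_{-p}$, maximizing the determinant over the last column amounts to minimizing the quadratic form $k_{-p}^\top K_{-p,-p}^{-1} k_{-p}$, which is the subproblem stated on the slide.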

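27 Sparse Gaussian Graphical Model
- The dual problem
$$\min_{\|y - \hat\Sigma_{-j}\|_\infty \leq \lambda} y^\top (K^{(k-1)}_{-j,-j})^{-1} y$$
is a box-constrained quadratic program
- Its dual is
$$\min_x \; x^\top K^{(k-1)}_{-j,-j} x - \langle \hat\Sigma_{-j}, x \rangle + \lambda \|x\|_1 = \min_x \; \|A x - b\|_2^2 + \lambda \|x\|_1$$
up to an additive constant, with $A = (K^{(k-1)}_{-j,-j})^{1/2}$ and $b = \frac{1}{2} (K^{(k-1)}_{-j,-j})^{-1/2} \hat\Sigma_{-j}$
- Several Lasso problems at each iteration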
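The equivalence between the quadratic form and the Lasso form can be verified numerically: the two objectives differ by the constant $\|b\|_2^2$, so they share the same minimizers. In the sketch below `W` plays the role of $K^{(k-1)}_{-j,-j}$ and `s` the role of $\hat\Sigma_{-j}$; both are random stand-ins, and `sym_power` is a hypothetical helper.

```python
import numpy as np

def sym_power(W, power):
    """W**power for a symmetric positive definite W, via eigendecomposition."""
    vals, vecs = np.linalg.eigh(W)
    return (vecs * vals**power) @ vecs.T

rng = np.random.default_rng(0)
m = 4
B = rng.standard_normal((m, m))
W = B @ B.T + m * np.eye(m)          # stand-in for K^{(k-1)}_{-j,-j}
s = rng.standard_normal(m)           # stand-in for Sigma_hat_{-j}
lam = 0.2

A = sym_power(W, 0.5)
b = 0.5 * sym_power(W, -0.5) @ s

def quadratic_form(x):
    return x @ W @ x - s @ x + lam * np.abs(x).sum()

def lasso_form(x):
    return np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()

x1 = rng.standard_normal(m)
x2 = rng.standard_normal(m)
const1 = lasso_form(x1) - quadratic_form(x1)   # should equal ||b||^2
const2 = lasso_form(x2) - quadratic_form(x2)   # same constant for any x
```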

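28 Sparse Gaussian Graphical Model
Algorithm for graphical Lasso [Block coordinate descent]
- Initialize $\hat K^{(0)} = K^{(0)} = \hat\Sigma + \lambda I$
- For $k \geq 0$ repeat
  - for $j = 1, \ldots, p$ solve
$$\hat x \in \operatorname*{argmin}_x \Big\| (K^{(j-1)}_{-j,-j})^{1/2} x - \frac{1}{2} (K^{(j-1)}_{-j,-j})^{-1/2} \hat\Sigma_{-j} \Big\|_2^2 + \lambda \|x\|_1$$
  - Obtain $K^{(j)}$ by replacing the $j$-th row and column of $K^{(j-1)}$ by $\hat x$
  - Put $\hat K^{(k)} = K^{(p)}$ and $K^{(0)} = \hat K^{(k)}$
  - If $\langle \hat K^{(k)}, \hat\Sigma \rangle - p + \lambda \|\hat K^{(k)}\|_1 \leq \varepsilon$, stop and return $\hat K^{(k)}$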
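A compact NumPy sketch of this block coordinate descent is given below. It is one possible reading of the slide, not a reference implementation, and it rests on three assumptions: the iterate `W` is the dual variable $\hat\Sigma + X$ (a covariance-scale matrix), the $j$-th row and column are set to $2 W_{-j,-j} \hat x$, which is the box-constrained solution recovered from the Lasso solution $\hat x$ through the optimality conditions, and the duality gap is evaluated at $K = W^{-1}$ with $\|K\|_1$ summing all entries. The inner Lasso is solved by plain coordinate descent, and all function names are hypothetical.

```python
import numpy as np

def soft(c, lam):
    """Scalar soft-thresholding."""
    return np.sign(c) * max(abs(c) - lam, 0.0)

def lasso_cd(W, s, lam, n_iter=200):
    """Coordinate descent for min_x x^T W x - <s, x> + lam * ||x||_1."""
    x = np.zeros(len(s))
    for _ in range(n_iter):
        for i in range(len(x)):
            c = s[i] - 2.0 * (W[i] @ x - W[i, i] * x[i])
            x[i] = soft(c, lam) / (2.0 * W[i, i])
    return x

def graphical_lasso(S, lam, eps=1e-6, max_sweeps=100):
    p = S.shape[0]
    W = S + lam * np.eye(p)                 # initialization from the slide
    for _ in range(max_sweeps):
        for j in range(p):
            idx = np.arange(p) != j
            W11 = W[np.ix_(idx, idx)]
            x_hat = lasso_cd(W11, S[idx, j], lam)
            y = 2.0 * W11 @ x_hat           # assumption: box-QP solution from x_hat
            W[idx, j] = y
            W[j, idx] = y
        K = np.linalg.inv(W)                # primal variable K = (S + X)^{-1}
        gap = np.sum(K * S) - p + lam * np.abs(K).sum()
        if gap <= eps:
            break
    return K, W, gap

# Population covariance of a chain graph, used here in place of data
K_true = 2.0 * np.eye(5) - 0.5 * (np.eye(5, k=1) + np.eye(5, k=-1))
S = np.linalg.inv(K_true)
lam = 0.05
K_hat, W_hat, gap = graphical_lasso(S, lam)
```

The stopping rule is the duality gap of slide 24; the dual iterate stays within the box $\|W - \hat\Sigma\|_\infty \leq \lambda$ after each full sweep.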

29 Conclusion

30 What I didn't talk about
- A plethora of other penalizations, optimization algorithms, settings for machine learning
- Lasso is not consistent for variable selection in general. Use the Adaptive Lasso, namely an $\ell_1$-penalization weighted by a previous solution
$$\sum_{j=1}^d \frac{|\theta_j|}{|\tilde\theta_j| + \varepsilon}$$
where $\tilde\theta$ is a previous estimator [Zou et al (2006)]

31 What I didn't talk about
- Fused Lasso for finding change points: use a penalization based on
$$\lambda_1 \|\theta\|_1 + \lambda_{tv} \sum_{j=2}^d |\theta_j - \theta_{j-1}|$$
[decomposition of the proximal operator]

32 What I didn't talk about
- Support Vector Machine: non-linear classification using the Kernel Trick
- Classification trees, CART, Random Forest
- Multiple testing
- Feature screening
- Bayesian Networks
- Deep learning
- Multi-task learning, dictionary learning
- Non-negative matrix factorization
- Spectral Clustering
- Latent Dirichlet Allocation
- among many, many other things...

33 This evening
Don't forget!
