Statistical machine learning, high dimension and big data


Statistical machine learning, high dimension and big data
S. Gaïffas (CMAP - École Polytechnique)
14 March 2014

Agenda for today
- Divide and Conquer principle for collaborative filtering
- Graphical modelling, Graphical Gaussian Model

Divide and Conquer Principle for Matrix Completion

SVD-based matrix completion
Unknown matrix M of size n_1 × n_2. Prior: rank(M) ≪ n_1 ∧ n_2.
We observe P_Ω(M) = {M_{j,k} : (j, k) ∈ Ω} ∈ R^m.
The basic iteration of a proximal gradient algorithm writes
X^{k+1} = S_λ(X^k − η_k (P_Ω(X^k) − P_Ω(M)))
where S_λ is the spectral soft-thresholding operator
S_λ(X) = U diag[(σ_1(X) − λ)_+, ..., (σ_{n_1 ∧ n_2}(X) − λ)_+] V^⊤
with X = U Σ V^⊤ the SVD of X.
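The iteration above can be sketched in a few lines of NumPy. This is a minimal illustration only, not the scalable implementation discussed next: it computes a full SVD at every step, and the function names and the fixed step size η_k = 1 are choices made for the example.

```python
import numpy as np

def spectral_soft_threshold(X, lam):
    """S_lam(X): shrink the singular values of X by lam (at most down to 0)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def proximal_gradient_completion(M, mask, lam=0.1, eta=1.0, n_iter=300):
    """Iterate X^{k+1} = S_lam(X^k - eta * (P_Omega(X^k) - P_Omega(M))).

    `mask` is the 0/1 indicator of the observed set Omega; eta = 1 is a
    valid step size since the gradient of the data-fit term is 1-Lipschitz.
    """
    X = np.zeros_like(M)
    for _ in range(n_iter):
        grad = mask * (X - M)               # P_Omega(X) - P_Omega(M)
        X = spectral_soft_threshold(X - eta * grad, lam)
    return X
```

On a low-rank matrix with most entries observed, a few hundred iterations fit the observed entries closely; the full SVD per iteration is exactly the bottleneck discussed on the next slide.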

SVD-based matrix completion
Bottleneck: a truncated SVD is necessary at each iteration. Best-case complexity is O(n_1 n_2 k) [Lanczos algorithms].
Such algorithms for matrix completion with theoretical guarantees rely on this expensive truncated SVD computation: this does not scale!
Idea: Divide and Conquer principle
- Divide M into submatrices
- Solve the subproblems: matrix completion of each submatrix (done in parallel)
- Combine the reconstructed submatrices for an entire reconstruction of M
[Mackey, Talwalkar and Jordan (2011)]

Divide and Conquer Matrix Completion
1. Randomly partition M into T column submatrices M = [C_1 C_2 ... C_T], with C_t ∈ R^{n_1 × p} and p = n_2/T.
2. Complete each submatrix, using trace-norm penalization on this subproblem, in parallel [if fully done in parallel, this is T times faster than the single completion of M], leading to [Ĉ_1 Ĉ_2 ... Ĉ_T].
3. Combine them: project this matrix onto the column space of each Ĉ_t, and average. If Ĉ_t = Û_t Σ̂_t V̂_t^⊤ is the SVD of Ĉ_t, compute
M̂ = (1/T) Σ_{t=1}^T Û_t Û_t^⊤ [Ĉ_1 Ĉ_2 ... Ĉ_T]
[Note that Û_t Û_t^⊤ is the projection matrix onto the space spanned by the columns of Ĉ_t.]
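Step 3, the combine step, is just a couple of matrix products. A minimal NumPy sketch, assuming the submatrices have already been completed in step 2 (the function name is mine, and the exact-rank setting is only for illustration):

```python
import numpy as np

def dc_combine(C_hats, k):
    """M_hat = (1/T) * sum_t  U_t U_t^T [C_1 ... C_T], where U_t holds the
    top-k left singular vectors of the completed submatrix C_t."""
    C_full = np.hstack(C_hats)            # [C_1 C_2 ... C_T]
    M_hat = np.zeros_like(C_full)
    for C_t in C_hats:
        U_t = np.linalg.svd(C_t, full_matrices=False)[0][:, :k]
        M_hat += U_t @ (U_t.T @ C_full)   # cheap order: O(n_1 k n_2) per term
    return M_hat / len(C_hats)
```

When M is exactly rank k and each block has at least k columns, each block generically spans the column space of M, so the combine step reproduces M exactly.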

Divide and Conquer Matrix Completion
- Full matrix completion: complexity O(n_1 n_2 k) per iteration (truncated SVD on the full matrix)
- DC matrix completion: at most O(n_1 p max_t k_t) complexity per iteration (for the truncated SVDs on the subcompletion problems, done in parallel)
- O(n_1 k p) for each multiplication Û_t Û_t^⊤ [Ĉ_1 Ĉ_2 ... Ĉ_T], done in parallel, hence O(n_1 k p T) = O(n_1 n_2 k) for the averaging (but done only once)
Warning: Û_t Û_t^⊤ Ĉ_j = Û_t (Û_t^⊤ Ĉ_j) is O(n_1 k p), while Û_t Û_t^⊤ Ĉ_j = (Û_t Û_t^⊤) Ĉ_j is O(n_1² p).

Divide and Conquer Matrix Completion
Numerical results [Mackey et al. (2011)], and almost the same theoretical guarantees as matrix completion on the full matrix.

Divide and Conquer Matrix Completion
What is behind this? Getting a low-rank approximation using projection onto a random column subsample.
Let M be an n_1 × n_2 matrix and L a rank-r approximation of M. Fix x > 0 and ε > 0.
- Construct a matrix C of size n_1 × p that contains columns of M picked at random without replacement
- Compute C = U_C Σ_C V_C^⊤, the SVD of C
Then
‖M − U_C U_C^⊤ M‖_F ≤ (1 + ε) ‖M − L‖_F
with probability at least 1 − e^{−x}, whenever p ≥ c r μ_0(V_L) log(n_1 n_2) x / ε², where
μ_0(V) = (n_2/r) max_{1≤i≤n_2} ‖V_{i,·}‖_2² = (n_2/r) ‖V‖_{2,∞}²
with L = U_L Σ_L V_L^⊤ the SVD of L.
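The statement is easy to check numerically: project M onto the span of p random columns and compare with the best rank-r approximation. The matrix sizes, signal strength and noise level below are arbitrary choices for the example.

```python
import numpy as np

def column_projection(M, p, rng):
    """U_C U_C^T M, with C made of p columns of M drawn without replacement."""
    idx = rng.choice(M.shape[1], size=p, replace=False)
    U_C = np.linalg.svd(M[:, idx], full_matrices=False)[0]
    return U_C @ (U_C.T @ M)

rng = np.random.default_rng(0)
r = 3
M = (10 * rng.normal(size=(30, r)) @ rng.normal(size=(r, 20))
     + 0.1 * rng.normal(size=(30, 20)))       # strong rank-3 signal + noise

U, s, Vt = np.linalg.svd(M, full_matrices=False)
L = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]        # best rank-r approximation of M
err_best = np.linalg.norm(M - L)
err_proj = np.linalg.norm(M - column_projection(M, p=10, rng=rng))
# err_proj stays within a small factor of err_best
```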

Graphical modelling, Graphical Gaussian Model

Graphs

Graphs: Co-occurrence of words

Graphs: Relations between artists in the last.fm database

Graphs: Evolution of co-voters in the US Senate [Show video]

Graphs
A graph G consists of a set of vertices V and a set of edges E; we often write G = (V, E).
E is a subset of V × V containing ordered pairs of distinct vertices. An edge is directed from j to k if (j, k) ∈ E.
Undirected graphs, directed graphs.

Graphical Models
The set V corresponds to a collection of random variables: denote V = {1, ..., p} with |V| = p, and X = (X_1, ..., X_p) ∼ P.
The pair (G, P) is a graphical model.

Graphs, Graphical Models
Consider an undirected graph G and a graphical model (G, P). We say that P satisfies the pairwise Markov property with respect to G = (V, E) iff
X_j ⊥ X_k | X_{V∖{j,k}} for any (j, k) ∉ E, j ≠ k,
namely X_j and X_k are conditionally independent given all the other vertices.
A graphical model satisfying this property is called a conditional independence graph (CIG).

Gaussian Graphical Models
A Gaussian Graphical Model is a CIG with the assumption X = (X_1, ..., X_p) ∼ N(0, Σ) for a positive definite covariance matrix Σ (mean is zero to simplify notations).
A well-known result (Lauritzen (1996)): (j, k) ∉ E and (k, j) ∉ E iff X_j ⊥ X_k | X_{V∖{j,k}} iff (Σ^{−1})_{j,k} = 0 [exerc.]
The edges can be read off the precision matrix K = Σ^{−1}: (j, k) ∈ E and (k, j) ∈ E iff K_{j,k} ≠ 0.
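A small numerical illustration of this result, using a chain graph 1 - 2 - 3 - 4 as an arbitrary example: the covariance Σ is dense, so X_1 and X_3 are marginally correlated, yet the zero entry K_{1,3} of the precision matrix encodes their conditional independence given the other variables.

```python
import numpy as np

# Chain graph 1 - 2 - 3 - 4: a tridiagonal (hence sparse) precision matrix K
K = np.array([[ 2., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  2.]])
Sigma = np.linalg.inv(K)

# The covariance is dense: X_1 and X_3 are (marginally) correlated...
print(abs(Sigma[0, 2]) > 1e-6)                    # True
# ...but the missing edge (1, 3) is read off the precision matrix:
print(abs(np.linalg.inv(Sigma)[0, 2]) < 1e-10)    # True
```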

Gaussian Graphical Models
The partial correlation ρ_{j,k|V∖{j,k}} between X_j and X_k conditional on X_{V∖{j,k}} is given by
ρ_{j,k|V∖{j,k}} = −K_{j,k} / √(K_{j,j} K_{k,k})
The partial correlation coefficients are regression coefficients: we can write
X_j = β_{j,k} X_k + Σ_{l ∈ V∖{j,k}} β_{j,l} X_l + ε_j
where E[ε_j] = 0 and ε_j ⊥ X_{V∖{j}}, with β_{j,k} = −K_{j,k}/K_{j,j} and β_{k,j} = −K_{j,k}/K_{k,k} [exerc.]
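The identity between precision-matrix entries and regression coefficients can be verified numerically; the random covariance below is just an example.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 5))
Sigma = G @ G.T + 5 * np.eye(5)   # an arbitrary positive definite covariance
K = np.linalg.inv(Sigma)

j, rest = 0, [1, 2, 3, 4]
# Population regression coefficients of X_j on the remaining variables:
beta_reg = np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[rest, j])
# The same coefficients read off the precision matrix: beta_{j,l} = -K_{l,j}/K_{j,j}
beta_K = -K[rest, j] / K[j, j]
# beta_reg and beta_K coincide
```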

Sparse Gaussian Graphical Model
Suppose that we observe X_1, ..., X_n i.i.d. ∼ N(0, Σ). Put X the n × p observation matrix with rows X_i^⊤ = [X_{i,1} ... X_{i,p}].
Estimation of K = Σ^{−1} is achieved by maximum likelihood:
L(Σ; X) = Π_{i=1}^n (2π)^{−p/2} det(Σ)^{−1/2} exp(−½ X_i^⊤ Σ^{−1} X_i)
or
L(K; X) = Π_{i=1}^n (2π)^{−p/2} det(K)^{1/2} exp(−½ X_i^⊤ K X_i)

Gaussian Graphical Models
The minus log-likelihood is
ℓ(K; X) = −log det K + ⟨Σ̂, K⟩ + c
where c does not depend on K and where ⟨A, B⟩ = tr(A^⊤ B).
Prior assumption: each vertex is not connected to all the others; there are only few edges in the graph.
Use ℓ_1-penalization on K to obtain a sparse solution: the Graphical Lasso [Friedman et al. (2007), Banerjee et al. (2008)]
K̂ ∈ argmin_{K ≻ 0} { −log det K + ⟨Σ̂, K⟩ + λ Σ_{1≤j<k≤p} |K_{j,k}| }
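The penalized objective is easy to write down, and it is convex in K over the positive definite cone; a minimal NumPy version (the function name is mine), which a midpoint check confirms is convex:

```python
import numpy as np

def glasso_objective(K, S, lam):
    """-log det K + <S, K> + lam * sum_{j<k} |K_{j,k}|, for symmetric K > 0.

    S plays the role of the empirical covariance Sigma_hat; since S is
    symmetric, <S, K> = tr(S^T K) = tr(S K).
    """
    sign, logdet = np.linalg.slogdet(K)
    if sign <= 0:
        return np.inf                  # outside the positive definite cone
    off_diag = np.abs(K[np.triu_indices_from(K, k=1)]).sum()
    return -logdet + np.trace(S @ K) + lam * off_diag
```

For real use, scikit-learn ships a graphical-lasso estimator (sklearn.covariance.GraphicalLasso) for this problem.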

Sparse Gaussian Graphical Model

Sparse Gaussian Graphical Model
How to solve
K̂ ∈ argmin_{K ≻ 0} { −log det K + ⟨Σ̂, K⟩ + λ ‖K‖_1 } ?
It is a convex minimization problem:
- −log det is convex
- log det is differentiable, with ∇ log det(X) = X^{−1}
- Recall that max_{‖X‖_∞ ≤ 1} ⟨X, K⟩ = ‖K‖_1
The dual problem is
max_{‖X‖_∞ ≤ λ} { log det(Σ̂ + X) + p }
and the primal and dual variables are related by K = (Σ̂ + X)^{−1}.
The duality gap is [Exerc.]
⟨K, Σ̂⟩ − p + λ ‖K‖_1

Sparse Gaussian Graphical Model
Rewrite the dual problem as
min_{‖X‖_∞ ≤ λ} { −log det(Σ̂ + X) − p }, that is, min_{‖X − Σ̂‖_∞ ≤ λ} −log det(X)
This will be optimized recursively, by updating a single row and column of K at a time.

Sparse Gaussian Graphical Model
Let X_{−j,−k} be the matrix X with its j-th row and k-th column removed, and X_{−j} the j-th column of X with its j-th entry removed.
Recall the Schur complement formula
det [A B; C D] = det(A) det(D − C A^{−1} B)
Namely
det [K_{−p,−p} k_p; k_p^⊤ k_{p,p}] = det(K_{−p,−p}) (k_{p,p} − k_p^⊤ K_{−p,−p}^{−1} k_p)
If we are at iteration k, update the p-th row and column with k_p^{(k)}, the solution of
min_{‖y − Σ̂_{−j}‖_∞ ≤ λ} y^⊤ (K^{(k−1)}_{−j,−j})^{−1} y

Sparse Gaussian Graphical Model
The dual problem
min_{‖y − Σ̂_{−j}‖_∞ ≤ λ} y^⊤ (K^{(k−1)}_{−j,−j})^{−1} y
is a box-constrained quadratic program. Its dual is
min_x x^⊤ K^{(k−1)}_{−j,−j} x − ⟨Σ̂_{−j}, x⟩ + λ ‖x‖_1 = min_x ‖A x − b‖_2² + λ ‖x‖_1 (up to an additive constant)
with A = (K^{(k−1)}_{−j,−j})^{1/2} and b = ½ (K^{(k−1)}_{−j,−j})^{−1/2} Σ̂_{−j}.
Several Lasso problems at each iteration (one per row/column update)!
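The identification with a Lasso problem can be checked numerically: with A = Q^{1/2} and b = ½ Q^{−1/2} s, the quadratic x^⊤ Q x − ⟨s, x⟩ and ‖Ax − b‖_2² differ only by the constant ‖b‖_2², so both problems share the same minimizers. Below, Q and s stand in for K^{(k−1)}_{−j,−j} and Σ̂_{−j}; the data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 4))
Q = G @ G.T + 4 * np.eye(4)        # stands in for K^{(k-1)}_{-j,-j} (SPD)
s = rng.normal(size=4)             # stands in for Sigma_hat_{-j}

w, V = np.linalg.eigh(Q)           # symmetric matrix square root of Q
A = V @ np.diag(np.sqrt(w)) @ V.T  # A = Q^{1/2}
b = 0.5 * np.linalg.solve(A, s)    # b = (1/2) Q^{-1/2} s

x = rng.normal(size=4)
lhs = x @ Q @ x - s @ x                  # smooth part of the first objective
rhs = np.sum((A @ x - b) ** 2) - b @ b   # ||Ax - b||^2 minus its constant
# lhs == rhs for every x
```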

Sparse Gaussian Graphical Model
Algorithm for the graphical Lasso [block coordinate descent]
- Initialize K̂^{(0)} = K^{(0)} = Σ̂ + λI
- For k ≥ 0, repeat: for j = 1, ..., p solve
x̂ ∈ argmin_x ‖(K^{(j−1)}_{−j,−j})^{1/2} x − ½ (K^{(j−1)}_{−j,−j})^{−1/2} Σ̂_{−j}‖_2² + λ ‖x‖_1
and obtain K^{(j)} by replacing the j-th row and column of K^{(j−1)} by x̂
- Put K̂^{(k)} = K^{(p)} and K^{(0)} = K̂^{(k)}
- If ⟨K̂^{(k)}, Σ̂⟩ − p + λ ‖K̂^{(k)}‖_1 ≤ ε, stop and return K̂^{(k)}

Conclusion

What I didn't talk about
A plethora of other penalizations, optimization algorithms and settings for machine learning.
The Lasso is not consistent for variable selection. Use the Adaptive Lasso, namely an ℓ_1-penalization weighted by a previous solution:
Σ_{j=1}^d |θ_j| / (|θ̃_j| + ε)
where θ̃ is a previous estimator [Zou et al. (2006)]

What I didn't talk about
Fused Lasso for finding change points: use a penalization based on
λ_1 ‖θ‖_1 + λ_tv Σ_{j=2}^d |θ_j − θ_{j−1}|
[decomposition of the proximal operator]

What I didn't talk about
- Support Vector Machines: non-linear classification using the kernel trick
- Classification trees, CART, Random Forests
- Multiple testing
- Feature screening
- Bayesian networks
- Deep learning
- Multi-task learning, dictionary learning
- Non-negative matrix factorization
- Spectral clustering
- Latent Dirichlet Allocation
among many, many other things...

This evening: don't forget!