Estimating the Inverse Covariance Matrix of Independent Multivariate Normally Distributed Random Variables

1 Estimating the Inverse Covariance Matrix of Independent Multivariate Normally Distributed Random Variables Dominique Brunet, Hanne Kekkonen, Vitor Nunes, Iryna Sivak Fields-MITACS Thematic Program on Inverse Problems and Imaging Group on Bayesian/Statistical Inverse Problems July 27, 2012

2 Menu

Hanne: Problem Formulation and Motivation; Frequentist and Bayesian Origins; Relation to Imaging
Iryna: The LASSO Methods
Vitor: Newton Methods and Quasi-Newton Methods; The QUIC Trick; Thoughts on Future Work
Dominique: Simulation and Results; Conclusions

3 Bayes formula brings a priori information and measured data together

Statistical Inversion. We consider the linear measurement model $m = Ax + \varepsilon$, where $x \in \mathbb{R}^p$ and $m \in \mathbb{R}^p$ are treated as random variables.

Bayes formula gives us the posterior distribution $\pi(x \mid m)$:
$$\pi(x \mid m) \propto \pi(m \mid x)\,\pi(x).$$
We recover $x$ as a point estimate from $\pi(x \mid m)$.

4 Likelihood distribution describes the measurement

If we assume that the noise is Gaussian, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, we can write the likelihood distribution, which measures the data misfit, as
$$\pi(m \mid x) = \pi_\varepsilon(Ax - m) = \frac{1}{\sigma^p (2\pi)^{p/2}} \exp\!\Big( -\frac{1}{2\sigma^2} \|Ax - m\|_2^2 \Big).$$

5 Prior distribution contains all other information about the unknown x

The prior distribution shouldn't depend on the measurement. The prior distribution should assign a clearly higher probability to those values of x that we expect to see than to unexpected values of x. Designing a good prior distribution is one of the main difficulties in statistical inversion.

6 Gaussian prior distribution

Gaussian model. If we assume that $x$ is also Gaussian, $x \sim \mathcal{N}_p(\mu, \Sigma)$, we can write
$$\pi(x) = \frac{(\det \Sigma^{-1})^{1/2}}{(2\pi)^{p/2}} \exp\!\Big( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \Big).$$
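
Putting the likelihood and the prior together shows where $\Sigma^{-1}$ enters the reconstruction; up to additive constants,
$$-\log \pi(x \mid m) = \frac{1}{2\sigma^2}\,\|Ax - m\|_2^2 + \frac{1}{2}\,(x - \mu)^T \Sigma^{-1} (x - \mu) + \text{const},$$
so the MAP estimate minimizes a Tikhonov-type functional whose penalty term is determined entirely by $\Sigma^{-1}$.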

7 Estimating the inverse covariance matrix $\Sigma^{-1}$

We consider the problem of finding a good estimator for the inverse covariance matrix $\Sigma^{-1}$, with the constraint that certain given pairs of variables are conditionally independent. The conditional independence constraints describe the sparsity pattern of the inverse covariance matrix $\Sigma^{-1}$: a zero entry indicates conditional independence between the corresponding pair of variables.

8 Why the inverse covariance matrix? A simple example

Think of a line of objects connected by springs. If we move the leftmost object $y_1$, the rightmost object $y_5$ of the line will also move, so clearly $y_1$ and $y_5$ are not independent.

9 However, $y_1$ and $y_5$ are conditionally independent given all the other variables. This is because, given the information of how $y_2$ moves, knowing $y_5$ gives us no further information about $y_1$. In fact, $y_i$ is conditionally dependent only on its neighbours.

10 What does this have to do with pictures?

Neighbourhood rule: we are interested in exploring the assumption that every pixel is conditionally dependent only on its neighbours. (Figure: spy plot of the resulting sparsity pattern, nz = 838.) In this case the inverse covariance matrix has non-zero elements only on nine diagonals.
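
The nine diagonals follow directly from the pixel ordering: for a row-major grid of width $n_x$, the neighbours of a pixel sit at index offsets $\{0, \pm 1, \pm(n_x - 1), \pm n_x, \pm(n_x + 1)\}$. A tiny illustrative check (the grid width is arbitrary):

```python
def precision_band_offsets(nx):
    """Diagonal offsets on which the inverse covariance may be non-zero when every
    pixel depends only on its 8 neighbours (row-major ordering, grid width nx)."""
    return sorted({dx + dy * nx for dx in (-1, 0, 1) for dy in (-1, 0, 1)})

print(precision_band_offsets(10))   # nine offsets -> nine non-zero diagonals
# [-11, -10, -9, -1, 0, 1, 9, 10, 11]
```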

11 Log-likelihood function

Likelihood function. We want to estimate the parameters $\Sigma^{-1}$ and $\mu$ of $\mathcal{N}_p(\mu, \Sigma)$, based on $N$ independent samples $x_i$. The likelihood function is
$$\prod_{i=1}^N \pi(x_i \mid \Sigma, \mu) = \frac{(\det \Sigma^{-1})^{N/2}}{(2\pi)^{Np/2}} \exp\!\Big\{ -\frac{1}{2} \sum_{j=1}^N (x_j - \mu)^T \Sigma^{-1} (x_j - \mu) \Big\}.$$

Log-likelihood function. The log-likelihood of the data is, up to a constant,
$$L(\Sigma^{-1}, \mu) = N \log \det \Sigma^{-1} - \sum_{j=1}^N (x_j - \mu)^T \Sigma^{-1} (x_j - \mu).$$

12 How to obtain a sparse $\Sigma^{-1}$

Maximum likelihood estimate. Maximizing the log-likelihood with respect to $\Sigma^{-1}$ leads to the maximum likelihood estimate $\hat{\Sigma}^{-1} = S^{-1}$, which usually isn't sparse (here $S$ is the sample covariance obtained from the data). Moreover, when $p > N$, $S$ is singular and the maximum likelihood estimate cannot be computed.

Penalised log-likelihood function. To find a sparse $\Sigma^{-1}$ we use the penalised log-likelihood function
$$\underset{\Sigma^{-1} \succ 0}{\operatorname{argmin}} \Big\{ -\log \det \Sigma^{-1} + \operatorname{tr}(S \Sigma^{-1}) + \| P \circ \Sigma^{-1} \|_1 \Big\},$$
where $S$ is the sample covariance obtained from the data and $P$ encodes our sparsity constraint.
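
A small NumPy sketch evaluating this penalised objective (the weight matrix P and the toy data are illustrative placeholders):

```python
import numpy as np

def penalised_neg_loglik(Theta, S, P):
    """-log det(Theta) + tr(S Theta) + || P * Theta ||_1 (entrywise product),
    the objective minimized over positive definite Theta = Sigma^{-1}."""
    sign, logdet = np.linalg.slogdet(Theta)
    if sign <= 0:
        return np.inf                      # outside the positive definite cone
    return -logdet + np.trace(S @ Theta) + np.abs(P * Theta).sum()

# toy example: 3 variables, penalise only the off-diagonal entries
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))           # 50 samples
S = np.cov(X, rowvar=False, bias=True)
P = 0.1 * (1.0 - np.eye(3))                # illustrative weights
print(penalised_neg_loglik(np.linalg.inv(S), S, P))
```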

13 In a nutshell

$\pi(x \mid m) \propto \pi(m \mid x)\,\pi(x)$; expect that $x \sim \mathcal{N}_p(\mu, \Sigma)$.

What is $\Sigma^{-1}$? Estimate it with the penalised log-likelihood function, using algorithms to find $\Sigma^{-1}$.

Back to the original inverse problem: solve the original inverse problem using the estimated $\Sigma^{-1}$,
$$\underset{x}{\operatorname{argmin}} \; \|Ax - m\|^2 + (x - \mu)^T \Sigma^{-1} (x - \mu).$$
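
Once $\hat{\Sigma}^{-1}$ is available, the quadratic functional above has a closed-form minimizer: setting its gradient to zero gives $(A^T A + \Sigma^{-1})\,\hat{x} = A^T m + \Sigma^{-1}\mu$, a single linear solve. A small NumPy sketch with an illustrative blurring operator $A$ and a placeholder $\Sigma^{-1}$:

```python
import numpy as np

def map_estimate(A, m, mu, Theta):
    """MAP estimate: minimizes ||Ax - m||^2 + (x - mu)^T Theta (x - mu),
    where Theta = Sigma^{-1} is the (estimated) inverse covariance."""
    lhs = A.T @ A + Theta
    rhs = A.T @ m + Theta @ mu
    return np.linalg.solve(lhs, rhs)

# toy 1-D deblurring example; A, the noise level and the prior are illustrative only
rng = np.random.default_rng(1)
p = 50
A = np.eye(p) + 0.5 * np.eye(p, k=1) + 0.5 * np.eye(p, k=-1)   # simple blur
x_true = np.sin(np.linspace(0, 3 * np.pi, p))
m = A @ x_true + 0.05 * rng.standard_normal(p)
Theta = np.eye(p)                                               # placeholder Sigma^{-1}
x_hat = map_estimate(A, m, np.zeros(p), Theta)
print(np.linalg.norm(x_hat - x_true))
```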

14 Graphical Lasso

$$\Sigma_1^{-1} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{12} & a_{22} & a_{23} \\ a_{13} & a_{23} & a_{33} \end{pmatrix}, \qquad \Sigma_2^{-1} = \begin{pmatrix} b_{11} & 0 & b_{13} \\ 0 & b_{22} & b_{23} \\ b_{13} & b_{23} & b_{33} \end{pmatrix}$$

15 Graphical Lasso

16 Graphical Lasso

17 Graphical Lasso

18 GLASSO

$\Theta := \Sigma^{-1}$,
$$\hat{\Theta} = \underset{\Theta \succ 0}{\operatorname{argmin}} \; -\log \det(\Theta) + \operatorname{tr}(S\Theta) + \lambda \|\Theta\|_1.$$
Optimality condition:
$$\hat{\Theta}^{-1} - S - \lambda \Gamma(\hat{\Theta}) = 0,$$
where $(\Gamma(\hat{\Theta}))_{ij} = \Gamma(\hat{\Theta}_{ij})$ is the subgradient of $|\hat{\Theta}_{ij}|$:
$\Gamma(\hat{\Theta}_{ij}) = \operatorname{sign}(\hat{\Theta}_{ij})$ if $\hat{\Theta}_{ij} \neq 0$, and $\Gamma(\hat{\Theta}_{ij}) \in (-1, 1)$ if $\hat{\Theta}_{ij} = 0$.
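
A quick way to sanity-check a candidate $\hat{\Theta}$ against this optimality condition (tolerances and usage are illustrative):

```python
import numpy as np

def check_glasso_optimality(Theta_hat, S, lam, tol=1e-6):
    """Verify the subgradient condition Theta^{-1} - S - lam * Gamma(Theta) = 0:
    on non-zero entries the residual must equal lam * sign(Theta_ij),
    on zero entries it must lie in [-lam, lam]."""
    R = np.linalg.inv(Theta_hat) - S
    nz = np.abs(Theta_hat) > tol
    ok_nonzero = np.allclose(R[nz], lam * np.sign(Theta_hat[nz]), atol=1e-4)
    ok_zero = np.all(np.abs(R[~nz]) <= lam + 1e-4)
    return ok_nonzero and ok_zero
```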

19 GLASSO

$\Theta := \Sigma^{-1}$, $W := \Theta^{-1}$. Partition
$$\Theta = \begin{pmatrix} \Theta_{11} & \theta_{12} \\ \theta_{21} & \theta_{22} \end{pmatrix}, \quad S = \begin{pmatrix} S_{11} & s_{12} \\ s_{21} & s_{22} \end{pmatrix}, \quad W = \begin{pmatrix} W_{11} & w_{12} \\ w_{21} & w_{22} \end{pmatrix}.$$
Start with $W = S + \lambda I$. We also added thresholding. Cycle around rows/columns till convergence:
solve the QP
$$\hat{\beta} = \underset{\beta \in \mathbb{R}^{p-1}}{\operatorname{argmin}} \; \tfrac{1}{2} \beta^T W_{11} \beta - \beta^T s_{12} + \lambda \|\beta\|_1, \qquad W_{11} \succ 0,$$
using as warm starts the solution from the previous round for this column; update $\hat{w}_{12} = W_{11} \hat{\beta}$ and save $\hat{\beta}$ for this column in the matrix $B$.
For every row/column compute $\hat{\theta}_{22} = (s_{22} + \lambda - \hat{\beta}^T \hat{w}_{12})^{-1}$, $\hat{\theta}_{12} = -\hat{\beta}\, \hat{\theta}_{22}$.
[J. Friedman, T. Hastie, R. Tibshirani, 2007]
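
This block coordinate descent is also available off the shelf; an illustrative usage sketch with scikit-learn's GraphicalLasso estimator (the data and the value of alpha are arbitrary):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso   # scikit-learn's graphical lasso estimator

rng = np.random.default_rng(0)
# sample from a known sparse precision matrix (tridiagonal, so only neighbours interact)
p, N = 20, 500
Theta_true = np.eye(p) + 0.4 * np.eye(p, k=1) + 0.4 * np.eye(p, k=-1)
Sigma_true = np.linalg.inv(Theta_true)
X = rng.multivariate_normal(np.zeros(p), Sigma_true, size=N)

model = GraphicalLasso(alpha=0.05).fit(X)       # alpha plays the role of lambda
Theta_hat = model.precision_
print(np.count_nonzero(np.abs(Theta_hat) > 1e-3))
```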

20 GLASSO

Theorem. The solution to the graphical lasso problem is block diagonal with blocks $C_1, C_2, \ldots, C_K$ if and only if $|S_{ij}| \le \lambda$ for all $i \in C_k$, $j \in C_l$, $k \neq l$. In that case
$$\Theta = \begin{pmatrix} \Theta_1 & & \\ & \ddots & \\ & & \Theta_K \end{pmatrix},$$
where $\Theta_k$ solves the graphical lasso problem applied to the submatrix of $S$ consisting of the entries whose indices are in $C_k$. [D. M. Witten, J. Friedman, N. Simon, 2011]
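
This screening rule is cheap to apply before running any solver; an illustrative sketch that finds the blocks as connected components of the thresholded $|S_{ij}| > \lambda$ graph:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def glasso_blocks(S, lam):
    """Index sets C_1, ..., C_K on which the graphical lasso problem decouples,
    found from the graph of entries with |S_ij| > lam."""
    adjacency = (np.abs(S) > lam).astype(int)
    np.fill_diagonal(adjacency, 0)
    n_blocks, labels = connected_components(csr_matrix(adjacency), directed=False)
    return [np.where(labels == k)[0] for k in range(n_blocks)]

# example: a covariance matrix with two weakly coupled groups of variables
S = np.array([[1.00, 0.60, 0.01, 0.02],
              [0.60, 1.00, 0.03, 0.01],
              [0.01, 0.03, 1.00, 0.70],
              [0.02, 0.01, 0.70, 1.00]])
print(glasso_blocks(S, lam=0.1))    # two blocks: {0, 1} and {2, 3}
```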

21 Problem Formulation

Cost Functional Definition. Our goal is to minimize the following functional over the set of positive definite matrices:
$$\min_{\Sigma^{-1} \succ 0} f(\Sigma^{-1}) = -\log(\det \Sigma^{-1}) + \operatorname{tr}(\Sigma^{-1} S) + \frac{\rho}{2} \|R(\Sigma^{-1})\|_2^2 \qquad (1)$$
Minimizing the previous functional is equivalent to minimizing
$$\min_{\Theta \succ 0} f(\Theta) = -\log(\det \Theta) + \operatorname{tr}(\Theta S) + \frac{\rho}{2} \|R(\Theta)\|_2^2 \qquad (2)$$

22 Model Reduction: Closest Neighbour

Neighbourhood rule: reduce the model by assuming that each pixel is related only to itself and to those pixels with which it shares a boundary or a corner.

This rule reduces the problem dimension from $n = (n_x n_y)^2$ to $m = 5 n_x n_y - 3 n_x - 3 n_y + 2$, where $n_x$ and $n_y$ are the numbers of pixels along the x-axis and y-axis respectively.
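
A quick numerical check of this dimension count (grid sizes chosen arbitrarily): counting the upper-triangular entries of the neighbourhood pattern reproduces the formula.

```python
def count_free_parameters(nx, ny):
    """Number of distinct entries (upper triangle incl. diagonal) allowed to be
    non-zero when each pixel interacts only with itself and its 8 neighbours."""
    count = 0
    for iy in range(ny):
        for ix in range(nx):
            i = iy * nx + ix
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    jx, jy = ix + dx, iy + dy
                    if 0 <= jx < nx and 0 <= jy < ny and jy * nx + jx >= i:
                        count += 1
    return count

for nx, ny in [(5, 5), (7, 10), (10, 10)]:
    formula = 5 * nx * ny - 3 * nx - 3 * ny + 2
    print(nx, ny, count_free_parameters(nx, ny), formula)   # the two counts agree
```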

23 Model Reduction: Problem Reformulation

Transformation. Once $g : \mathbb{R}^m \to \mathbb{R}^{n \times n}$ is defined, the problem becomes an unconstrained optimization problem on $\mathbb{R}^m$. Therefore the minimization problem (2) is written as
$$\min_{v \in \mathbb{R}^m} J(v) = f(g(v)) = -\log(\det g(v)) + \operatorname{tr}(g(v) S) + \frac{\rho}{2} \|R(g(v))\|_2^2 \qquad (3)$$

24 Numerical Approach: Quasi-Newton Method

Expanding $J(v)$ to a quadratic function we get
$$J(v + \Delta) \approx J(v) + \nabla J(v)^T \Delta + \tfrac{1}{2} \Delta^T H(v) \Delta \qquad (4)$$
Therefore the $\Delta$ that minimizes $J(v + \Delta)$ must be the solution of $\Delta = -H^{-1}(v) \nabla J(v)$. So, given an initial guess $v_0$, one can solve the iterative sequence
$$v_k = v_{k-1} - \alpha_k H_k^{-1}(v_{k-1}) \nabla J(v_{k-1}) \qquad (5)$$

25 Gradient Computation

The gradient of $J(v)$ is computed exactly by
$$\nabla J(v) = \nabla f(g(v))\, g'(v) \in \mathbb{R}^m \qquad (6)$$
where $\nabla f(\Theta)$ is given by
$$(\nabla f)_{i,j} = -\Theta^{-1}_{j,i} + S_{j,i} + \rho \Big[ R'(\Theta)\, \tfrac{\partial \Theta}{\partial \sigma_{i,j}} \Big]_{i,j} \qquad (7)$$

Hessian Computation. Since it is computationally expensive to compute the Hessian, one should use an approximation method such as BFGS.
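
A minimal sketch of the reduced-model quasi-Newton idea, under simplifying assumptions: a tridiagonal stand-in for the neighbourhood pattern, $\rho = 0$ (no R penalty), finite-difference gradients instead of formulas (6)-(7), and SciPy's L-BFGS-B in place of a hand-rolled BFGS. All sizes and names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 8
# Allowed sparsity pattern: a simple tridiagonal pattern stands in for the
# pixel-neighbourhood pattern; (rows, cols) index its upper-triangular entries.
mask = np.triu(np.eye(n, dtype=bool) | np.eye(n, k=1, dtype=bool))
rows, cols = np.where(mask)

def g(v):
    """Map the parameter vector v in R^m to a symmetric matrix with the given pattern."""
    Theta = np.zeros((n, n))
    Theta[rows, cols] = v
    Theta[cols, rows] = v
    return Theta

def J(v, S):
    """Objective (3) with rho = 0: -log det g(v) + tr(g(v) S)."""
    Theta = g(v)
    sign, logdet = np.linalg.slogdet(Theta)
    if sign <= 0:
        return 1e10            # crude barrier against leaving the positive definite cone
    return -logdet + np.trace(Theta @ S)

X = rng.standard_normal((200, n))
S = np.cov(X, rowvar=False, bias=True)
v0 = (rows == cols).astype(float)          # start from the identity matrix
res = minimize(J, v0, args=(S,), method='L-BFGS-B')
Theta_hat = g(res.x)
print(res.fun, np.linalg.eigvalsh(Theta_hat).min())
```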

26 Numerical Results, n = 100

Error analysis plots: $\|g(v)S - I\|_2$ and $\|g(v) - \Theta\|_2$.

27 QUIC Algorithm [Cho-Jui Hsieh, University of Texas]

Cost function:
$$\min_{\Theta} J(\Theta) = -\log(\det \Theta) + \operatorname{tr}(\Theta S) + \frac{\rho}{2} \|\Theta\|_1 \qquad (8)$$
Split into a differentiable function and a non-differentiable one:
$$g(\Theta) = -\log(\det \Theta) + \operatorname{tr}(\Theta S) \qquad (9)$$
$$h(\Theta) = \frac{\rho}{2} \|\Theta\|_1 \qquad (10)$$

28 QUIC Algorithm [Cho-Jui Hsieh, University of Texas]

Second-order approximation of $g(\cdot)$:
$$G(\Delta) = g(\Theta + \Delta) \approx g(\Theta) + \langle \nabla g(\Theta), \Delta \rangle + \tfrac{1}{2} \Delta^T H \Delta \qquad (11)$$
New cost functional:
$$\min_{\Delta} F(\Delta) = G(\Delta) + h(\Theta + \Delta) \qquad (12)$$
Choice of coordinate direction:
$$\Delta = \mu \big[ e_j e_i^T + e_i e_j^T \big] \qquad (13)$$
$$\min_{\mu} F(\mu) \qquad (14)$$
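
For reference, for $g(\Theta) = -\log(\det \Theta) + \operatorname{tr}(\Theta S)$ the gradient and Hessian entering this quadratic model are the standard expressions
$$\nabla g(\Theta) = S - \Theta^{-1}, \qquad H = \nabla^2 g(\Theta) = \Theta^{-1} \otimes \Theta^{-1},$$
so evaluating $G(\Delta)$ only requires the current $W = \Theta^{-1}$.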

29 Overview / Future Work

Overview: Model Reduction; Numerical Approach; Numerical Results; QUIC Algorithm.

Future Work: improve memory efficiency; better choice of $x_0$; different rules of reduction; incorporate the model reduction into QUIC.

30 Conclusions: Simulation

Simulation pipeline: start from an inverse covariance test matrix $\Sigma^{-1}$ (figure: spy plot of the test matrix, nz = 838); draw multivariate normal samples $x_1, x_2, \ldots, x_N$; form the sample covariance $S = X X^T / N$, where the columns of $X$ are the samples; run the inverse covariance estimation to obtain $\hat{\Sigma}^{-1}$.
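
A compact sketch of this simulation loop (the test precision matrix below is an arbitrary sparse stand-in, not the one used in the talk):

```python
import numpy as np

rng = np.random.default_rng(42)
d, N = 49, 100                                   # matches the smallest timing experiment

# sparse test precision matrix: tridiagonal, hence positive definite
Theta_true = np.eye(d) + 0.4 * np.eye(d, k=1) + 0.4 * np.eye(d, k=-1)
Sigma_true = np.linalg.inv(Theta_true)

# multivariate normal sampling, X in R^{d x N}
X = rng.multivariate_normal(np.zeros(d), Sigma_true, size=N).T

# sample covariance S = X X^T / N
S = X @ X.T / N

# any of the estimators (GLASSO, QUIC, quasi-Newton) is then run on S;
# as a baseline, compare the unregularised inverse of S with the truth
print(np.linalg.norm(np.linalg.inv(S) - Theta_true, ord='fro'))
```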

31 Timing (s)

Sample data: $X \in \mathbb{R}^{d \times N}$ with $N = 100$. Covariance matrix $S \in \mathbb{R}^{d \times d}$.

Algorithm       d = 49        d = 100     d = 200     d = 400
S               ...           ...         ...         ...
GLASSO          0.01 ± ...    ...         ...         ... ± 0.22
QUIC            0.04 ± ...    ...         ...         ... ± 2.46
quasi-Newton    ...           ...         ...         ...

32 Error Analysis (N = 100)

Plot: Frobenius norm of the estimation error versus $d$, for quasi-Newton, QUIC or GLASSO, and the inverse of $S$.

33 Training Set extracted from

34 Results

Panels: original, noisy ($\sigma$ = 5%), GMRF denoising, TV denoising.

35 Remarks

1. Computed the inverse covariance matrix on subblocks and aggregated blocks;
2. Assumed stationarity;
3. Assumed Gaussianity of the image pixels;
4. Performed very basic (poor) alignment of the training set;
5. Had a small (29) (and unhealthy) training set.

Nevertheless, it performs better than TV denoising!

36 Future Challenges

1. Computational challenge: computing the inverse covariance matrix on full images (e.g. 40,000 x 40,000). Could Iteratively Re-weighted Least Squares Minimization for Sparse Recovery help?
2. Parameter selection: warm start and generalized cross-validation;
3. Non-stationary model: use a bigger and better aligned database;
4. Try the Gaussianity model in a transform domain, or higher order models.

37 THANK YOU FOR YOUR TIME!!!
