The degrees of freedom of the Lasso in underdetermined linear regression models

C. Dossal (1), M. Kachour (2), J. Fadili (2), G. Peyré (3), C. Chesneau (4)

(1) IMB, Université Bordeaux 1
(2) GREYC, ENSICAEN
(3) CNRS and Ceremade, Université Paris-Dauphine
(4) LMNO, Université de Caen

SPARS 11, June 2011, Edinburgh
Contents

1. Motivations
2. Some characteristics of the Lasso
3. The overdetermined case
4. Main results
5. References
1. Motivations
Statistical framework: estimation in the underdetermined (or high-dimensional) linear regression model

$$y = Ax_0 + \varepsilon = \theta_0 + \varepsilon, \qquad (1)$$

where $y \in \mathbb{R}^n$, $A = (a_j)_{j \in \{1,\dots,p\}}$ with $a_j \in \mathbb{R}^n$ and $n < p$, $x_0 \in \mathbb{R}^p$, and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$.

Applications: genomics, econometrics, signal processing, tomography, ...
Strict sparsity assumption: $x_0$ has few non-zero components.

Weak sparsity assumption: $x_0$ can be well approximated by a vector with few non-zero components.

Let $\hat{x} = f(y)$ be an estimator of $x_0$ and $\hat{\mu} = A\hat{x}$. Since $y \sim \mathcal{N}(Ax_0, \sigma^2 I)$, following Efron [2], the degrees of freedom df of $\hat{\mu}$ is defined by

$$\mathrm{df}(\hat{\mu}) = \sum_{i=1}^{n} \frac{\mathrm{cov}(\hat{\mu}_i, y_i)}{\sigma^2}. \qquad (2)$$
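Definition (2) is easy to check numerically for a linear smoother $\hat{\mu} = Sy$, where the degrees of freedom reduce to the known closed form $\mathrm{trace}(S)$. The sketch below is an added illustration, not part of the slides: it uses ridge regression, and all dimensions, the design, and the regularization value are arbitrary assumptions.

```python
# Monte Carlo check of definition (2) for a linear smoother (ridge regression),
# whose degrees of freedom have the known closed form trace(S).
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 20, 1.0          # illustrative sizes
A = rng.standard_normal((n, p))
theta0 = A @ rng.standard_normal(p)

lam = 5.0
S = A @ np.linalg.solve(A.T @ A + lam * np.eye(p), A.T)  # hat matrix: mu = S y
S_theta0 = S @ theta0                                    # E[mu], since E[y] = theta0

trials, cov_sum = 20000, 0.0
for _ in range(trials):
    y = theta0 + sigma * rng.standard_normal(n)
    mu = S @ y
    cov_sum += (mu - S_theta0) @ (y - theta0)  # sums the n covariances in (2)

print(cov_sum / (trials * sigma**2))  # Monte Carlo df
print(np.trace(S))                    # exact df; the two agree up to MC error
```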
Stein's lemma [3]: Let $\hat{x}$ be an estimator of $x_0$. Suppose that $\hat{\mu} = A\hat{x}$ is almost differentiable. Then we have

$$\mathrm{df}(\hat{\mu}) = \mathbb{E}\left[\widehat{\mathrm{df}}(\hat{\mu})\right], \quad \text{where} \quad \widehat{\mathrm{df}}(\hat{\mu}) = \mathrm{div}(\hat{\mu}) = \sum_{i=1}^{n} \frac{\partial \hat{\mu}_i}{\partial y_i}. \qquad (3)$$

We define

$$\mathrm{MSE}(\hat{\mu}) = \mathbb{E}\,\|\hat{\mu} - \theta_0\|_2^2 = \mathbb{E}\,\|A(\hat{x} - x_0)\|_2^2, \qquad (4)$$

$$\mathrm{SURE}(\hat{\mu}) = -n\sigma^2 + \|\hat{\mu} - y\|_2^2 + 2\sigma^2\,\widehat{\mathrm{df}}(\hat{\mu}). \qquad (5)$$

So, according to Stein's lemma, we have

$$\mathrm{MSE}(\hat{\mu}) = \mathbb{E}\left[\mathrm{SURE}(\hat{\mu})\right]. \qquad (6)$$
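To make (3)-(6) concrete, consider soft thresholding, i.e. the Lasso with $A = I_n$: there the divergence in (3) simply counts the entries that survive the threshold. The sketch below is an added illustration with arbitrary sizes and threshold; it verifies numerically that SURE is unbiased for the MSE.

```python
# Numerical check of (6) for soft thresholding (the Lasso with A = I_n),
# where the divergence in (3) counts the nonzero entries of mu.
import numpy as np

rng = np.random.default_rng(1)
n, sigma, lam = 100, 1.0, 1.5
theta0 = np.concatenate([3.0 * np.ones(10), np.zeros(n - 10)])  # sparse mean

trials, mse_sum, sure_sum = 20000, 0.0, 0.0
for _ in range(trials):
    y = theta0 + sigma * rng.standard_normal(n)
    mu = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)  # soft threshold
    df_hat = np.count_nonzero(mu)                       # divergence (3)
    mse_sum += np.sum((mu - theta0) ** 2)               # contribution to (4)
    sure_sum += -n * sigma**2 + np.sum((mu - y) ** 2) + 2 * sigma**2 * df_hat  # (5)

print(mse_sum / trials, sure_sum / trials)  # equal up to Monte Carlo error, cf. (6)
```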
2. Some characteristics of the Lasso
The Lasso estimate [4] amounts to solving

$$P_1(y, \lambda): \quad \min_{x \in \mathbb{R}^p} \frac{1}{2}\|y - Ax\|_2^2 + \lambda\|x\|_1. \qquad (7)$$

Depending on λ, some coefficients are set exactly to zero. The estimator can be computed for very large p.

Let $\tilde{x} = (\tilde{x}_j) \in \mathbb{R}^p$. The support of $\tilde{x}$ and its cardinality are defined by

$$I(\tilde{x}) = \mathrm{supp}(\tilde{x}) = \{j : \tilde{x}_j \neq 0\}, \qquad |I(\tilde{x})| = \mathrm{card}(I(\tilde{x})).$$

We write $\tilde{x}_{I(\tilde{x})} = (\tilde{x}_i)_{i \in I(\tilde{x})}$, $A_{I(\tilde{x})} = (a_i)_{i \in I(\tilde{x})}$ and $\mathrm{sign}(\tilde{x}_{I(\tilde{x})}) = (\mathrm{sign}(\tilde{x}_i))_{i \in I(\tilde{x})} \in \{-1, 1\}^{|I(\tilde{x})|}$.
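In practice (7) can be solved with any Lasso solver. The sketch below is an added illustration using scikit-learn, whose objective is $\frac{1}{2n}\|y - Ax\|_2^2 + \alpha\|x\|_1$, so $\alpha = \lambda/n$ matches (7); the dimensions and regularization value are assumptions.

```python
# Solving P1(y, lambda) with scikit-learn; alpha = lambda / n rescales (7)
# to sklearn's objective (1/(2n))||y - Ax||^2 + alpha ||x||_1.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, lam = 64, 256, 0.5
A = rng.standard_normal((n, p)) / np.sqrt(n)   # underdetermined design, n < p
x0 = np.zeros(p); x0[:5] = 1.0                 # strictly sparse x0
y = A @ x0 + 0.05 * rng.standard_normal(n)

x_hat = Lasso(alpha=lam / n, fit_intercept=False, max_iter=100000).fit(A, y).coef_
I_hat = np.flatnonzero(x_hat)                  # support I(x_hat)
print(I_hat)                                   # few indices; other coefficients are exactly 0
```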
$\tilde{x}$ is a solution of $P_1(y, \lambda)$ if and only if

1. $\langle a_k, y - A\tilde{x} \rangle = \lambda\,\mathrm{sign}(\tilde{x}_k)$, for all $k \in I(\tilde{x})$
2. $|\langle a_j, y - A\tilde{x} \rangle| \leq \lambda$, for all $j \notin I(\tilde{x})$

$\tilde{x}$ is the unique minimizer of $P_1(y, \lambda)$ if

1. $\langle a_k, y - A\tilde{x} \rangle = \lambda\,\mathrm{sign}(\tilde{x}_k)$, for all $k \in I(\tilde{x})$
2. $|\langle a_j, y - A\tilde{x} \rangle| < \lambda$, for all $j \notin I(\tilde{x})$
3. $A_{I(\tilde{x})}$ is full rank

In this case, we have

$$\tilde{x}_{I(\tilde{x})} = \left(A^t_{I(\tilde{x})} A_{I(\tilde{x})}\right)^{-1} \left(A^t_{I(\tilde{x})} y - \lambda\,\mathrm{sign}\left(\tilde{x}_{I(\tilde{x})}\right)\right). \qquad (8)$$

Furthermore, there exists a finite sequence of λ's, called the transition points, such that between two consecutive transition points the support and the sign vector of the solution (8) are constant with respect to λ.
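These conditions are easy to verify numerically on a solver output. The sketch below is an added illustration reusing the toy setup above; it checks both optimality conditions and the closed form (8), with tolerances reflecting the solver's accuracy.

```python
# Checking the optimality conditions and the closed form (8) on a Lasso
# solution returned by scikit-learn (illustrative toy setup).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, lam = 64, 256, 0.5
A = rng.standard_normal((n, p)) / np.sqrt(n)
x0 = np.zeros(p); x0[:5] = 1.0
y = A @ x0 + 0.05 * rng.standard_normal(n)

x_hat = Lasso(alpha=lam / n, fit_intercept=False,
              max_iter=200000, tol=1e-10).fit(A, y).coef_
I = np.flatnonzero(x_hat)
corr = A.T @ (y - A @ x_hat)                     # <a_j, y - A x> for all j

print(np.allclose(corr[I], lam * np.sign(x_hat[I]), atol=1e-6))  # condition 1
print(np.all(np.abs(np.delete(corr, I)) <= lam + 1e-6))          # condition 2

A_I = A[:, I]                                    # full rank for a generic design
x_I = np.linalg.solve(A_I.T @ A_I, A_I.T @ y - lam * np.sign(x_hat[I]))  # (8)
print(np.allclose(x_I, x_hat[I], atol=1e-6))
```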
3. The overdetermined case
Consider the overdetermined linear regression model

$$y = Ax_0 + \varepsilon,$$

where $y \in \mathbb{R}^n$, $x_0 \in \mathbb{R}^p$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, and the columns $(a_j)_{j \in \{1,\dots,p\}}$ are linearly independent, that is, $\mathrm{rank}(A) = p$.

In this case, the Lasso problem $P_1(y, \lambda)$ has a unique solution $\tilde{x}_\lambda$.

Let $K_\lambda \subset \mathbb{R}^n$ be the set such that for all $z \in K_\lambda$, $\lambda$ is not a transition point of $z$.
Theorem (Zou et al. [5]): Let $\lambda > 0$ and $y \in K_\lambda$. Let $\tilde{\mu}_\lambda = A\tilde{x}_\lambda(y)$ be the Lasso response vector. Then we have

$$\mathrm{df}(\tilde{\mu}_\lambda) = \mathbb{E}\left[|\tilde{I}|\right], \qquad (9)$$

where $\tilde{I}$ is the support of $\tilde{x}_\lambda$.
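The theorem can be illustrated by Monte Carlo: estimate the degrees of freedom via definition (2) and compare with the average support size. The sketch below is an added illustration on an arbitrary small overdetermined design.

```python
# Monte Carlo illustration of (9): in the overdetermined case, the average
# Lasso support size matches the degrees of freedom from definition (2).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, sigma, lam = 100, 10, 1.0, 5.0   # n > p: overdetermined
A = rng.standard_normal((n, p))
x0 = np.zeros(p); x0[:3] = 2.0
theta0 = A @ x0

trials, cov_sum, size_sum = 5000, 0.0, 0
for _ in range(trials):
    y = theta0 + sigma * rng.standard_normal(n)
    x_hat = Lasso(alpha=lam / n, fit_intercept=False, max_iter=100000).fit(A, y).coef_
    cov_sum += (A @ x_hat) @ (y - theta0)   # sums the covariances in (2)
    size_sum += np.count_nonzero(x_hat)     # |support|

print(cov_sum / (trials * sigma**2))        # Monte Carlo df
print(size_sum / trials)                    # E|support|: agrees up to MC error
```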
4. Main results
Goal: extend the previous result to the underdetermined case, where the Lasso solution is not necessarily unique.

Remark: if $\tilde{x}_1$ and $\tilde{x}_2$ are solutions of $P_1(y, \lambda)$, then $A\tilde{x}_1 = A\tilde{x}_2$.

Let $\hat{\mu}_\lambda(y)$ be the unique image $A\tilde{x}$ of the solutions of $P_1(y, \lambda)$.
Notations: Let $I \subset \{1, 2, \dots, p\}$ such that $A_I = (a_i)_{i \in I}$ is full rank, $A_I^+ = (A_I^t A_I)^{-1} A_I^t$ the pseudo-inverse of $A_I$, $V_I = \mathrm{span}(a_i)_{i \in I}$, $P_{V_I} = A_I A_I^+$, $P_{V_I^\perp} = I_{n \times n} - P_{V_I}$, and $S \in \{-1, 1\}^{|I|}$.

Let $j \in \{1, 2, \dots, p\}$ and $\lambda > 0$. We define

$$H_{I,j,S} = \left\{u \in \mathbb{R}^n : \langle P_{V_I^\perp}(a_j), u \rangle = \pm\lambda\left(1 - \langle a_j, (A_I^+)^t S \rangle\right)\right\}, \qquad (10)$$

$$\Omega = \{(I, j, S) : a_j \notin V_I\}, \qquad (11)$$

and

$$G_\lambda = \mathbb{R}^n \setminus \bigcup_{(I,j,S) \in \Omega} H_{I,j,S}. \qquad (12)$$

Remark: $K_\lambda \subset G_\lambda$.
Theorem: Let $\lambda > 0$, $y \in G_\lambda$, and let $M_{y,\lambda}$ be the set of all solutions of $P_1(y, \lambda)$. Let $x^\star_\lambda$ be a solution of $P_1(y, \lambda)$ such that $A_{I^\star}$, the matrix associated to its support $I^\star$, is full rank. Then

$$|I^\star| = \min_{\tilde{x}_\lambda \in M_{y,\lambda}} |\mathrm{supp}(\tilde{x}_\lambda)|. \qquad (13)$$

Furthermore, there exists $\varepsilon > 0$ such that for all $z \in B(y, \varepsilon)$, we have

$$\hat{\mu}_\lambda(z) = \hat{\mu}_\lambda(y) + P_{V_{I^\star}}(z - y). \qquad (14)$$
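The local affine behaviour (14) can be observed directly: perturb y slightly and compare the new Lasso response to the predicted one. A sketch under the same toy assumptions as before; the reported discrepancy should be on the order of the solver's accuracy.

```python
# Observing the local affine behaviour (14): for z close to y, the Lasso
# response moves by the projection of (z - y) onto V_I (illustrative setup).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p, lam = 64, 256, 0.5
A = rng.standard_normal((n, p)) / np.sqrt(n)
x0 = np.zeros(p); x0[:5] = 1.0
y = A @ x0 + 0.05 * rng.standard_normal(n)

def lasso_response(v):
    x = Lasso(alpha=lam / n, fit_intercept=False,
              max_iter=200000, tol=1e-12).fit(A, v).coef_
    return A @ x, np.flatnonzero(x)

mu_y, I = lasso_response(y)
P_VI = A[:, I] @ np.linalg.pinv(A[:, I])   # projector onto V_I

z = y + 1e-3 * rng.standard_normal(n)      # small perturbation, z in B(y, eps)
mu_z, _ = lasso_response(z)
print(np.max(np.abs(mu_z - (mu_y + P_VI @ (z - y)))))  # ~ solver accuracy
```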
Corollary 1: Let $\lambda > 0$ and $y \in G_\lambda$. Then we have

$$\mathrm{div}(\hat{\mu}_\lambda(y)) = |I^\star|. \qquad (15)$$

Moreover,

$$\mathrm{df} = \mathbb{E}(|I^\star|). \qquad (16)$$

Corollary 2: Let $\lambda > 0$ and $y \in G_\lambda$. Suppose that $P_1(y, \lambda)$ has a unique solution $\hat{x}_\lambda$, and let $\hat{I}$ be its support. Then we have

$$\mathrm{df} = \mathbb{E}(|\hat{I}|). \qquad (17)$$
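Corollary 1 gives a practical unbiased risk estimate: plug $\widehat{\mathrm{df}} = |I^\star|$ into SURE (5). The sketch below is an added illustration on a generic Gaussian design (for which the solution is unique, with minimal support, almost surely); it compares the average SURE to the true MSE.

```python
# Using df_hat = |support| in SURE (5) for the Lasso in the underdetermined
# case, and comparing E[SURE] to the MSE (4) by Monte Carlo (toy setup).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p, sigma, lam = 64, 256, 0.1, 0.5
A = rng.standard_normal((n, p)) / np.sqrt(n)
x0 = np.zeros(p); x0[:5] = 1.0
theta0 = A @ x0

trials, mse_sum, sure_sum = 2000, 0.0, 0.0
for _ in range(trials):
    y = theta0 + sigma * rng.standard_normal(n)
    x_hat = Lasso(alpha=lam / n, fit_intercept=False, max_iter=100000).fit(A, y).coef_
    mu = A @ x_hat
    mse_sum += np.sum((mu - theta0) ** 2)
    sure_sum += (-n * sigma**2 + np.sum((mu - y) ** 2)
                 + 2 * sigma**2 * np.count_nonzero(x_hat))

print(mse_sum / trials, sure_sum / trials)  # close up to Monte Carlo error
```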
Condition (UC), Dossal [1]: A matrix A satisfies condition (UC) if, for all subsets $I \subset \{1, \dots, p\}$ with $|I| \leq n$ such that $(a_i)_{i \in I}$ are linearly independent, for all indices $j \notin I$ and all sign vectors $S \in \{-1, 1\}^{|I|}$,

$$\left|\langle a_j, (A_I^+)^t S \rangle\right| \neq 1. \qquad (18)$$

Consequences: Suppose that A satisfies condition (UC). Then we have

y ∈ $G_\lambda$ ⟹ $P_1(y, \lambda)$ has a unique solution $\hat{x}_\lambda$.

Therefore $\mathrm{df} = \mathbb{E}(|\hat{I}|)$, where $\hat{I}$ is the support of $\hat{x}_\lambda$.
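For tiny matrices, condition (UC) can be checked by brute force, enumerating all full-rank subsets I, indices j outside I, and sign vectors S in (18). The sketch below is an added illustration under arbitrary assumptions (matrix, sizes, tolerance); the cost grows combinatorially, so this is only feasible for very small p.

```python
# Brute-force check of condition (UC): test (18) over all full-rank subsets I,
# indices j outside I, and sign vectors S (feasible only for tiny p).
import numpy as np
from itertools import combinations, product

def satisfies_UC(A, tol=1e-10):
    n, p = A.shape
    for k in range(1, n + 1):
        for I in combinations(range(p), k):
            A_I = A[:, I]
            if np.linalg.matrix_rank(A_I) < k:
                continue                                  # columns not independent
            pinv_t = np.linalg.pinv(A_I).T                # (A_I^+)^t, n x k
            for j in set(range(p)) - set(I):
                for S in product((-1.0, 1.0), repeat=k):
                    if abs(abs(A[:, j] @ (pinv_t @ np.array(S))) - 1.0) < tol:
                        return False                      # (18) violated
    return True

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 6))       # illustrative sizes
print(satisfies_UC(A))                # a generic Gaussian A passes almost surely
```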
5. References
[1] Dossal, C. (2007). A necessary and sufficient condition for exact recovery by ℓ1 minimization. Technical report, HAL-00164738.
[2] Efron, B. (1986). How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc., 81, 461-470.
[3] Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist., 9, 1135-1151.
[4] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B, 58(1), 267-288.
[5] Zou, H., Hastie, T. and Tibshirani, R. (2007). On the "degrees of freedom" of the Lasso. Ann. Statist., 35(5), 2173-2192.