The degrees of freedom of the Lasso in underdetermined linear regression models



C. Dossal (1), M. Kachour (2), J. Fadili (2), G. Peyré (3), C. Chesneau (4)
(1) IMB, Université Bordeaux 1
(2) GREYC, ENSICAEN
(3) CNRS and Ceremade, Université Paris-Dauphine
(4) LMNO, Université de Caen

SPARS'11, June 2011, Edinburgh

Contents
1. Motivations
2. Some characteristics of the Lasso
3. The overdetermined case
4. Main results
5. References

1. Motivations

Statistical framework: estimation in the underdetermined (or high-dimensional) linear regression model

$$y = Ax_0 + \varepsilon = \theta_0 + \varepsilon, \qquad (1)$$

where $y \in \mathbb{R}^n$, $A = (a_j)_{j \in \{1,\dots,p\}}$ with $a_j \in \mathbb{R}^n$ and $n < p$, $x_0 \in \mathbb{R}^p$, and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$.

Applications: genomics, econometrics, signal processing, tomography, ...

Strict sparsity assumption: $x_0$ has few non-zero components.
Weak sparsity assumption: $x_0$ can be approximated to good quality by a vector with few non-zero components.

Let $\hat{x} = f(y)$ be an estimator of $x_0$ and $\hat{\mu} = A\hat{x}$. Since $y \sim \mathcal{N}(Ax_0, \sigma^2 I)$, according to Efron [2], the degrees of freedom $df$ of $\hat{\mu}$ is defined by

$$df(\hat{\mu}) = \sum_{i=1}^{n} \frac{\mathrm{cov}(\hat{\mu}_i, y_i)}{\sigma^2} \qquad (2)$$
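
Definition (2) can be probed numerically by simulating many noise realizations and estimating the covariance sum, with the Lasso response playing the role of $\hat{\mu}$. Below is a minimal Monte Carlo sketch in Python; numpy and scikit-learn are assumed, the data (A, x0, sigma, lam) are illustrative, and note that scikit-learn's Lasso uses the normalized penalty alpha = λ/n relative to the unnormalized objective of this talk.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, sigma, lam = 50, 100, 1.0, 2.0
A = rng.standard_normal((n, p)) / np.sqrt(n)
x0 = np.zeros(p)
x0[:5] = 3.0                                  # strictly sparse x0
theta0 = A @ x0

def mu_hat(y):
    """Lasso response A x_lambda; sklearn's alpha = lambda / n."""
    model = Lasso(alpha=lam / n, fit_intercept=False, tol=1e-10, max_iter=100_000)
    model.fit(A, y)
    return A @ model.coef_

# df(mu_hat) = sum_i cov(mu_hat_i, y_i) / sigma^2, estimated over replications
reps = 500
Y = theta0 + sigma * rng.standard_normal((reps, n))
M = np.array([mu_hat(y) for y in Y])
df_mc = ((M - M.mean(axis=0)) * (Y - theta0)).sum(axis=1).mean() / sigma**2
print("Monte Carlo df estimate:", df_mc)
```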

Stein's lemma [3]: Let $\hat{x}$ be an estimator of $x_0$. Suppose that $\hat{\mu} = A\hat{x}$ is almost differentiable. Then we have

$$df(\hat{\mu}) = \mathbb{E}\big[\widehat{df}(\hat{\mu})\big], \quad \text{where} \quad \widehat{df}(\hat{\mu}) = \mathrm{div}(\hat{\mu}) = \sum_{i=1}^{n} \frac{\partial \hat{\mu}_i}{\partial y_i} \qquad (3)$$

We define

$$\mathrm{MSE}(\hat{\mu}) = \mathbb{E}\,\|\hat{\mu} - \theta_0\|_2^2 = \mathbb{E}\,\|A(\hat{x} - x_0)\|_2^2 \qquad (4)$$

$$\mathrm{SURE}(\hat{\mu}) = -n\sigma^2 + \|\hat{\mu} - y\|_2^2 + 2\sigma^2\, \widehat{df}(\hat{\mu}) \qquad (5)$$

So, according to Stein's lemma, we have

$$\mathrm{MSE}(\hat{\mu}) = \mathbb{E}\,[\mathrm{SURE}(\hat{\mu})] \qquad (6)$$
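
Stein's identity (3) and SURE (5) can be checked on the same toy data. The sketch below estimates the divergence by finite differences (the Lasso response is differentiable outside a negligible set, so a generic draw of $y$ is fine) and reuses mu_hat, theta0, n, sigma, rng from the previous snippet.

```python
def divergence(mu, y, h=1e-5):
    """Finite-difference estimate of div(mu)(y) = sum_i d mu_i / d y_i."""
    base = mu(y)
    total = 0.0
    for i in range(len(y)):
        e = np.zeros_like(y)
        e[i] = h
        total += (mu(y + e)[i] - base[i]) / h
    return total

y = theta0 + sigma * rng.standard_normal(n)
df_hat = divergence(mu_hat, y)
sure = -n * sigma**2 + np.sum((mu_hat(y) - y)**2) + 2 * sigma**2 * df_hat
print("df_hat =", df_hat, " SURE =", sure)
```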

2. Some characteristics of the Lasso

The Lasso estimate [4] amounts to solving

$$P_1(y, \lambda): \quad \min_{x \in \mathbb{R}^p} \frac{1}{2}\|y - Ax\|_2^2 + \lambda \|x\|_1 \qquad (7)$$

Depending on $\lambda$, some coefficients are set exactly equal to zero. The estimator can be computed for very large $p$.

Let $\bar{x} = (\bar{x}_j) \in \mathbb{R}^p$. The support of $\bar{x}$ and its cardinality are defined by

$$I(\bar{x}) = \mathrm{supp}(\bar{x}) = \{j : \bar{x}_j \neq 0\}, \qquad |I(\bar{x})| = \mathrm{card}(I(\bar{x}))$$

We write $\bar{x}_{I(\bar{x})} = (\bar{x}_i)_{i \in I(\bar{x})}$, $A_{I(\bar{x})} = (a_i)_{i \in I(\bar{x})}$ and $\mathrm{sign}(\bar{x}_{I(\bar{x})}) = (\mathrm{sign}(\bar{x}_i))_{i \in I(\bar{x})} \in \{-1, 1\}^{|I(\bar{x})|}$.

$\bar{x}$ is a solution of $P_1(y, \lambda)$ if and only if
1. $\langle a_k, y - A\bar{x} \rangle = \lambda\, \mathrm{sign}(\bar{x}_k)$, for all $k \in I(\bar{x})$;
2. $|\langle a_j, y - A\bar{x} \rangle| \leq \lambda$, for all $j \notin I(\bar{x})$.

$\bar{x}$ is the unique minimizer of $P_1(y, \lambda)$ if
1. $\langle a_k, y - A\bar{x} \rangle = \lambda\, \mathrm{sign}(\bar{x}_k)$, for all $k \in I(\bar{x})$;
2. $|\langle a_j, y - A\bar{x} \rangle| < \lambda$, for all $j \notin I(\bar{x})$;
3. $A_{I(\bar{x})}$ is full rank.

In that case, on the support we have

$$\bar{x}_{I(\bar{x})} = \big(A_{I(\bar{x})}^t A_{I(\bar{x})}\big)^{-1} A_{I(\bar{x})}^t\, y - \lambda \big(A_{I(\bar{x})}^t A_{I(\bar{x})}\big)^{-1} \mathrm{sign}\big(\bar{x}_{I(\bar{x})}\big) \qquad (8)$$

Furthermore, there exists a finite sequence of $\lambda$'s, called the transition points, such that between two consecutive transition points the support and the sign vector of the optimum (8) are constant with respect to $\lambda$.
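
Both the optimality conditions and the closed form (8) are straightforward to verify on a numerical solution. A sketch reusing A, n, lam, theta0, sigma, rng and the Lasso import from the earlier snippets (support threshold and tolerances are illustrative):

```python
y = theta0 + sigma * rng.standard_normal(n)
model = Lasso(alpha=lam / n, fit_intercept=False, tol=1e-12, max_iter=1_000_000)
model.fit(A, y)
xbar = model.coef_
I = np.flatnonzero(np.abs(xbar) > 1e-8)       # support I(xbar)
corr = A.T @ (y - A @ xbar)                   # <a_j, y - A xbar> for all j
print(np.allclose(corr[I], lam * np.sign(xbar[I]), atol=1e-4))  # condition 1
print(np.all(np.abs(np.delete(corr, I)) <= lam + 1e-4))         # condition 2
# closed form (8) restricted to the support:
AI = A[:, I]
xI = np.linalg.solve(AI.T @ AI, AI.T @ y - lam * np.sign(xbar[I]))
print(np.allclose(xI, xbar[I], atol=1e-4))
```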

3. The overdetermined case

Consider the overdetermined linear regression model

$$y = Ax_0 + \varepsilon,$$

where $y \in \mathbb{R}^n$, $x_0 \in \mathbb{R}^p$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, and $(a_j)_{j \in \{1,\dots,p\}}$ are linearly independent, that is, $\mathrm{rank}(A) = p$.

Here, the Lasso problem $P_1(y, \lambda)$ has a unique solution $\tilde{x}_\lambda$.

Let $K_\lambda \subset \mathbb{R}^n$ be such that for all $z \in K_\lambda$, $\lambda$ is not a transition point.

Theorem (Zou et al. [5]): Let $\lambda > 0$ and $y \in K_\lambda$. Let $\tilde{\mu}_\lambda = A\tilde{x}_\lambda(y)$ be the Lasso response vector. Then we have

$$df(\tilde{\mu}_\lambda) = \mathbb{E}\big[|\tilde{I}|\big], \qquad (9)$$

where $\tilde{I}$ is the support of $\tilde{x}_\lambda$.
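
As a sanity check of (9), a small simulation under an illustrative overdetermined design (here n2 > p2, so rank(A2) = p2 almost surely) can compare the mean support size against the covariance-based df of definition (2); lam, sigma, rng and the Lasso import are reused from above.

```python
n2, p2 = 100, 20
A2 = rng.standard_normal((n2, p2)) / np.sqrt(n2)
x2 = np.zeros(p2)
x2[:3] = 2.0
theta2 = A2 @ x2
Y2 = theta2 + sigma * rng.standard_normal((500, n2))
coefs = np.array([
    Lasso(alpha=lam / n2, fit_intercept=False, tol=1e-10, max_iter=100_000)
    .fit(A2, y).coef_
    for y in Y2
])
supp = (np.abs(coefs) > 1e-8).sum(axis=1)     # |support| per replication
mus = coefs @ A2.T                            # Lasso response vectors
df_mc = ((mus - mus.mean(axis=0)) * (Y2 - theta2)).sum(axis=1).mean() / sigma**2
print("mean |support| =", supp.mean(), " vs covariance df =", df_mc)
```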

4. Main results

Goal: extend the previous result to the underdetermined case, where the Lasso solution is not necessarily unique.

Remark: if $\tilde{x}_1$ and $\tilde{x}_2$ are solutions of $P_1(y, \lambda)$, then $A\tilde{x}_1 = A\tilde{x}_2$.

Let $\hat{\mu}_\lambda(y)$ be the unique image under $A$ of the solutions of $P_1(y, \lambda)$.

Notations: Let $I \subset \{1, 2, \dots, p\}$ be such that $A_I = (a_i)_{i \in I}$ is full rank, and define
- $A_I^+ = (A_I^t A_I)^{-1} A_I^t$, the pseudo-inverse of $A_I$,
- $V_I = \mathrm{span}(a_i)_{i \in I}$, $P_{V_I} = A_I A_I^+$, $P_{V_I^\perp} = I_{n \times n} - P_{V_I}$,
- $S \in \{-1, 1\}^{|I|}$.

Let $j \in \{1, 2, \dots, p\}$ and $\lambda > 0$. We define

$$H_{I,j,S} = \big\{u \in \mathbb{R}^n : \langle P_{V_I^\perp}(a_j), u \rangle = \pm\lambda\big(1 - \langle a_j, (A_I^+)^t S \rangle\big)\big\} \qquad (10)$$

and

$$\Omega = \{(I, j, S) : a_j \notin V_I\} \qquad (11)$$

$$G_\lambda = \mathbb{R}^n \setminus \bigcup_{(I,j,S) \in \Omega} H_{I,j,S} \qquad (12)$$

Remark: $K_\lambda \subset G_\lambda$.

Theorem: Let $\lambda > 0$, $y \in G_\lambda$, and let $M_{y,\lambda}$ be the set of all solutions of $P_1(y, \lambda)$. Let $\tilde{x}_\lambda$ be a solution of $P_1(y, \lambda)$ such that the matrix $A_I$ associated to its support $I$ is full rank. Then

$$|I| = \min_{x \in M_{y,\lambda}} |\mathrm{supp}(x)|. \qquad (13)$$

Furthermore, there exists $\varepsilon > 0$ such that for all $z \in B(y, \varepsilon)$, we have

$$\hat{\mu}_\lambda(z) = \hat{\mu}_\lambda(y) + P_{V_I}(z - y). \qquad (14)$$
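
The local affine behavior (14) can be observed numerically: fit the Lasso at $y$, build the projector onto $V_I$ from the fitted support, perturb $y$ slightly, and compare. A sketch reusing A, n, lam, theta0, sigma, rng, mu_hat from above (the perturbation size is illustrative and assumed small enough to stay in $B(y, \varepsilon)$):

```python
y = theta0 + sigma * rng.standard_normal(n)
m = Lasso(alpha=lam / n, fit_intercept=False, tol=1e-12, max_iter=1_000_000).fit(A, y)
I = np.flatnonzero(np.abs(m.coef_) > 1e-8)
AI = A[:, I]
P_VI = AI @ np.linalg.solve(AI.T @ AI, AI.T)  # orthogonal projector onto V_I
z = y + 1e-4 * rng.standard_normal(n)         # small perturbation
lhs = mu_hat(z)
rhs = mu_hat(y) + P_VI @ (z - y)
print("max deviation from (14):", np.max(np.abs(lhs - rhs)))
```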

Corollary 1: Let $\lambda > 0$ and $y \in G_\lambda$. Then we have

$$\mathrm{div}(\hat{\mu}_\lambda(y)) = |I| \qquad (15)$$

Moreover,

$$df = \mathbb{E}(|I|). \qquad (16)$$

Corollary 2: Let $\lambda > 0$ and $y \in G_\lambda$. Suppose that $P_1(y, \lambda)$ has a unique solution $\hat{x}_\lambda$, and let $\hat{I}$ be its support. Then we have

$$df = \mathbb{E}(|\hat{I}|). \qquad (17)$$
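
Plugging (15) into SURE (5) gives a fully computable unbiased estimate of the risk (4): the divergence term is just the size of the support. A sketch, same toy setup as above:

```python
y = theta0 + sigma * rng.standard_normal(n)
m = Lasso(alpha=lam / n, fit_intercept=False, tol=1e-10, max_iter=100_000).fit(A, y)
mu = A @ m.coef_
size_I = int((np.abs(m.coef_) > 1e-8).sum())  # |I| = div(mu_hat(y)) by Corollary 1
sure = -n * sigma**2 + np.sum((mu - y)**2) + 2 * sigma**2 * size_I
print("SURE =", sure, " true squared error =", np.sum((mu - theta0)**2))
```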

Condition (UC), Dossal [1]: A matrix $A$ satisfies condition (UC) if, for all subsets $I \subset \{1, \dots, p\}$ with $|I| \leq n$ such that $(a_i)_{i \in I}$ are linearly independent, for all indices $j \notin I$ and all sign vectors $S \in \{-1, 1\}^{|I|}$,

$$\big|\langle a_j, (A_I^+)^t S \rangle\big| \neq 1. \qquad (18)$$

Consequences: Suppose that $A$ satisfies condition (UC). Then, for every $y \in G_\lambda$, $P_1(y, \lambda)$ has a unique solution $\hat{x}_\lambda$. Therefore $df = \mathbb{E}(|\hat{I}|)$, where $\hat{I}$ is the support of $\hat{x}_\lambda$.
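
For tiny matrices, condition (UC) can be checked exhaustively by enumerating supports $I$, sign vectors $S$, and outside columns $j$. The brute-force sketch below (exponential cost, purely illustrative) reuses numpy and rng from above; for a generic Gaussian design, equality in (18) occurs with probability zero, so it should return True.

```python
from itertools import combinations, product

def satisfies_UC(A, tol=1e-10):
    """Exhaustive test of (18); feasible only for very small n and p."""
    n, p = A.shape
    for k in range(1, n + 1):
        for I in combinations(range(p), k):
            AI = A[:, I]
            if np.linalg.matrix_rank(AI) < k:
                continue                       # columns not linearly independent
            pinv_t = np.linalg.pinv(AI).T      # (A_I^+)^t
            outside = [j for j in range(p) if j not in I]
            for S in product((-1.0, 1.0), repeat=k):
                v = pinv_t @ np.array(S)
                for j in outside:
                    if abs(abs(A[:, j] @ v) - 1.0) < tol:
                        return False           # |<a_j, (A_I^+)^t S>| = 1
    return True

print(satisfies_UC(rng.standard_normal((4, 6))))
```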

5. References

References
[1] Dossal, C. (2007). A necessary and sufficient condition for exact recovery by ℓ1 minimization. Technical report, HAL-00164738.
[2] Efron, B. (1986). How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc. 81, 461-470.
[3] Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 9(6), 1135-1151.
[4] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B 58(1), 267-288.
[5] Zou, H., Hastie, T. and Tibshirani, R. (2007). On the "degrees of freedom" of the Lasso. Ann. Statist. 35(5), 2173-2192.