SIMULTANEOUS ANALYSIS OF LASSO AND DANTZIG SELECTOR. By Peter J. Bickel, Ya'acov Ritov and Alexandre B. Tsybakov


Submitted to the Annals of Statistics

SIMULTANEOUS ANALYSIS OF LASSO AND DANTZIG SELECTOR

By Peter J. Bickel, Ya'acov Ritov and Alexandre B. Tsybakov

We show that, under a sparsity scenario, the Lasso estimator and the Dantzig selector exhibit similar behavior. For both methods, we derive, in parallel, oracle inequalities for the prediction risk in the general nonparametric regression model, as well as bounds on the $\ell_p$ estimation loss for $1 \le p \le 2$ in the linear model when the number of variables can be much larger than the sample size.

1. Introduction. During the last few years a great deal of attention has been focused on the $\ell_1$-penalized least squares (Lasso) estimator of parameters in high-dimensional linear regression when the number of variables can be much larger than the sample size [8, 9, 11, 17, 18, 20–22, 26, 27]. Quite recently, Candes and Tao [7] have proposed a new estimate for such linear models, the Dantzig selector, for which they establish optimal $\ell_2$ rate properties under a sparsity scenario, i.e., when the number of non-zero components of the true vector of parameters is small. Lasso estimators have also been studied in the nonparametric regression setup [2–5, 12, 13, 19]. In particular, Bunea et al. [2–5] obtain sparsity oracle inequalities for the prediction loss in this context and point out the implications for minimax estimation in classical nonparametric regression settings, as well as for the problem of aggregation of estimators. An analog of the Lasso for density estimation with similar properties (SPADES) is proposed in [6]. Modified versions of Lasso estimators (non-quadratic terms and/or penalties slightly different from $\ell_1$) for nonparametric regression with random design are suggested and studied under prediction loss in [14, 25]. Sparsity oracle inequalities for the Dantzig selector with random design are obtained in [15]. In linear fixed design regression, Meinshausen and Yu [18] establish a bound on the $\ell_2$ loss for the coefficients of the Lasso which is quite different from the bound on the same loss for the Dantzig selector proven in [7].
The main message of this paper is that under a sparsity scenario, the Lasso

Partially supported by NSF grant DMS-0605236, an ISF grant, the France-Berkeley Fund, the grant ANR-06-BLAN-0194 and the European Network of Excellence PASCAL.
AMS 2000 subject classifications: Primary 60K35, 62G08; secondary 62C20, 62G05, 62G20.
Keywords and phrases: Linear models, Model selection, Nonparametric statistics.

and the Dantzig selector exhibit similar behavior, both for linear regression and for nonparametric regression models, for $\ell_2$ prediction loss and for $\ell_p$ loss in the coefficients for $1 \le p \le 2$. All the results of the paper are non-asymptotic.

Let us specialize to the case of linear regression with many covariates, $y = X\beta + w$, where $X$ is the $n \times M$ deterministic design matrix, with $M$ possibly much larger than $n$, and $w$ is a vector of i.i.d. standard normal random variables. This is the situation considered most recently by Candes and Tao [7] and Meinshausen and Yu [18]. Here sparsity specifies that the high-dimensional vector $\beta$ has coefficients that are mostly 0.

We develop general tools to study these two estimators in parallel. For the fixed design Gaussian regression model we recover, as particular cases, sparsity oracle inequalities for the Lasso, as in Bunea et al. [4], and $\ell_2$ bounds for the coefficients of the Dantzig selector, as in Candes and Tao [7]. This is obtained as a consequence of our more general results:

In the nonparametric regression model, we prove sparsity oracle inequalities for the Dantzig selector, that is, bounds on the prediction loss in terms of the best possible (oracle) approximation under the sparsity constraint.

Similar sparsity oracle inequalities are proved for the Lasso in the nonparametric regression model, and this is done under more general assumptions on the design matrix than in [4].

We prove that, for nonparametric regression, the Lasso and the Dantzig selector are approximately equivalent in terms of the prediction loss.

We develop geometrical assumptions which are considerably weaker than those of Candes and Tao [7] for the Dantzig selector and Bunea et al. [4] for the Lasso. In the context of linear regression, where the number of variables is possibly much larger than the sample size, these assumptions imply the result of [7] for the $\ell_2$ loss and generalize it to $\ell_p$ loss, $1 \le p \le 2$, and to prediction loss. Our bounds for the Lasso differ from those for the Dantzig selector only in numerical constants.
We begin, in the next section, by defining the Lasso and Dantzig procedures and the notation. In Section 3 we present our key geometric assumptions. Some sufficient conditions for these assumptions are given in Section 4, where they are also compared to those of [7] and [18], as well as to the ones appearing in [4] and [5]. We note a weakness of our assumptions, and hence of those in the papers we cited, and we discuss a way of slightly remedying them. Sections 5 and 6 give some equivalence results and sparsity oracle inequalities for the Lasso and Dantzig estimators in the general nonparametric regression

model. Section 7 focuses on the linear regression model and includes a final discussion. Two important technical lemmas are given in Appendix B, as well as most of the proofs.

2. Definitions and notation. Let $(Z_1, Y_1), \dots, (Z_n, Y_n)$ be a sample of independent random pairs with
$$Y_i = f(Z_i) + W_i, \quad i = 1, \dots, n,$$
where $f : \mathcal{Z} \to \mathbb{R}$ is an unknown regression function to be estimated, $\mathcal{Z}$ is a Borel subset of $\mathbb{R}^d$, the $Z_i$'s are fixed elements in $\mathcal{Z}$, and the regression errors $W_i$ are Gaussian. Let $\mathcal{F}_M = \{f_1, \dots, f_M\}$ be a finite dictionary of functions $f_j : \mathcal{Z} \to \mathbb{R}$, $j = 1, \dots, M$. We assume throughout that $M \ge 2$.

Depending on the statistical targets, the dictionary $\mathcal{F}_M$ can contain qualitatively different parts. For instance, it can be a collection of basis functions used to approximate $f$ in the nonparametric regression model (e.g., wavelets, splines with fixed knots, step functions). Another example is related to the aggregation problem, where the $f_j$ are estimators arising from $M$ different methods. They can also correspond to $M$ different values of the tuning parameter of the same method. Without much loss of generality, these estimators $f_j$ are treated as fixed functions: the results are viewed as being conditioned on the sample the $f_j$ are based on.

The selection of the dictionary can be very important to make the estimation of $f$ possible. We assume implicitly that $f$ can be well approximated by a member of the span of $\mathcal{F}_M$. However, this is not enough. In this paper, we have in mind the situation where $M \gg n$, and $f$ can be estimated reasonably only because it can be approximated by a linear combination of a small number of members of $\mathcal{F}_M$; in other words, it has a sparse approximation in the span of $\mathcal{F}_M$. But when sparsity is an issue, equivalent bases can have different properties: a function which has a sparse representation in one basis may not have it in another one, even if both of them span the same linear space.

Consider the matrix $X = (f_j(Z_i))_{i,j}$, $i = 1, \dots, n$, $j = 1, \dots, M$, and the vectors $y = (Y_1, \dots, Y_n)^T$, $f = (f(Z_1), \dots, f(Z_n))^T$, $w = (W_1, \dots, W_n)^T$. With this notation,
$$y = f + w.$$
We will write $\|x\|_p$ for the $\ell_p$ norm of $x \in \mathbb{R}^M$, $1 \le p \le \infty$. The notation $\|\cdot\|_n$ stands for the empirical norm:
$$\|g\|_n = \Big( \frac{1}{n} \sum_{i=1}^n g^2(Z_i) \Big)^{1/2}$$

for any $g : \mathcal{Z} \to \mathbb{R}$. We suppose that $\|f_j\|_n \ne 0$, $j = 1, \dots, M$, and set
$$f_{\max} = \max_{1 \le j \le M} \|f_j\|_n, \qquad f_{\min} = \min_{1 \le j \le M} \|f_j\|_n.$$
For any $\beta = (\beta_1, \dots, \beta_M) \in \mathbb{R}^M$, define $f_\beta = \sum_{j=1}^M \beta_j f_j$, or explicitly, $f_\beta(z) = \sum_{j=1}^M \beta_j f_j(z)$, so that the corresponding vector of values is $X\beta$. The estimates we consider are all of the form $f_{\hat\beta}(\cdot)$, where $\hat\beta$ is data determined. Since we consider mainly sparse vectors $\beta$, it will be convenient to define the following. Let
$$M(\beta) = \sum_{j=1}^M I_{\{\beta_j \ne 0\}} = |J(\beta)|$$
denote the number of non-zero coordinates of $\beta$, where $I_{\{\cdot\}}$ denotes the indicator function, $J(\beta) = \{j \in \{1, \dots, M\} : \beta_j \ne 0\}$, and $|J|$ denotes the cardinality of $J$. The value $M(\beta)$ characterizes the sparsity of the vector $\beta$: the smaller $M(\beta)$, the sparser $\beta$. For a vector $\delta \in \mathbb{R}^M$ and a subset $J \subseteq \{1, \dots, M\}$, we denote by $\delta_J$ the vector in $\mathbb{R}^M$ which has the same coordinates as $\delta$ on $J$ and zero coordinates on the complement $J^c$ of $J$.

Introduce the residual sum of squares
$$\hat S(\beta) = \frac{1}{n} \sum_{i=1}^n \{Y_i - f_\beta(Z_i)\}^2$$
for all $\beta \in \mathbb{R}^M$. Define the Lasso solution $\hat\beta_L = (\hat\beta_{1,L}, \dots, \hat\beta_{M,L})$ by
(2.1) $\hat\beta_L = \arg\min_{\beta \in \mathbb{R}^M} \Big\{ \hat S(\beta) + 2r \sum_{j=1}^M \|f_j\|_n |\beta_j| \Big\}$,
where $r > 0$ is some tuning constant, and introduce the corresponding Lasso estimator
(2.2) $\hat f_L(z) = f_{\hat\beta_L}(z) = \sum_{j=1}^M \hat\beta_{j,L} f_j(z)$.
The criterion in (2.1) is convex in $\beta$, so that standard convex optimization procedures can be used to compute $\hat\beta_L$. We refer to [9, 11, 16, 20, 21, 24] for detailed discussion of these optimization problems and fast algorithms.

A necessary and sufficient condition for $\hat\beta_L$ to be a minimizer in (2.1) is that 0 belongs to the subdifferential of the convex function $\beta \mapsto n^{-1}\|y - X\beta\|_2^2 + 2r\|D^{1/2}\beta\|_1$. This implies that the Lasso selector $\hat\beta_L$ satisfies the constraint
(2.3) $\Big\| \frac{1}{n} D^{-1/2} X^T (y - X\hat\beta_L) \Big\|_\infty \le r$,
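For concreteness, the weighted Lasso criterion above can be minimized by proximal gradient descent (ISTA). The following is an illustrative sketch under our own choices (function name, step size, iteration count are ours), not the algorithm used in the paper:

```python
import numpy as np

def lasso_ista(X, y, r, n_iter=20000):
    """Proximal-gradient (ISTA) sketch for the weighted Lasso criterion
    (1/n)||y - X b||_2^2 + 2 r sum_j ||f_j||_n |b_j|."""
    n, M = X.shape
    w = np.sqrt((X ** 2).mean(axis=0))              # empirical column norms ||f_j||_n
    L = 2.0 * np.linalg.eigvalsh(X.T @ X / n)[-1]   # Lipschitz constant of the gradient
    beta = np.zeros(M)
    for _ in range(n_iter):
        grad = (2.0 / n) * (X.T @ (X @ beta - y))   # gradient of the quadratic term
        z = beta - grad / L
        # soft-thresholding step: prox of the weighted l1 penalty
        beta = np.sign(z) * np.maximum(np.abs(z) - 2.0 * r * w / L, 0.0)
    return beta
```

At the minimizer, the constraint $\max_j |X_j^T(y - X\hat\beta_L)|/(n\|f_j\|_n) \le r$ holds, which gives a convenient convergence check.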

where $D$ is the diagonal matrix
$$D = \mathrm{diag}\{\|f_1\|_n^2, \dots, \|f_M\|_n^2\}.$$
The Dantzig estimator of the regression function $f$ is based on a particular solution of (2.3), the Dantzig selector, which is a solution of the minimization problem
(2.4) $\hat\beta_D = \arg\min\Big\{ \|\beta\|_1 :\ \Big\|\frac{1}{n} D^{-1/2} X^T (y - X\beta)\Big\|_\infty \le r \Big\}$.
The Dantzig estimator is defined by
(2.5) $\hat f_D(z) = f_{\hat\beta_D}(z) = \sum_{j=1}^M \hat\beta_{j,D} f_j(z)$,
where $\hat\beta_D = (\hat\beta_{1,D}, \dots, \hat\beta_{M,D})$ is the Dantzig selector. By the definition of the Dantzig selector, we have $\|\hat\beta_D\|_1 \le \|\hat\beta_L\|_1$. The Dantzig selector is computationally feasible, since it reduces to a linear programming problem [7].

Finally, for any $n \ge 1$, $M \ge 2$, we consider the Gram matrix
$$\Psi_n = \frac{1}{n} X^T X = \Big( \frac{1}{n} \sum_{i=1}^n f_j(Z_i) f_{j'}(Z_i) \Big)_{1 \le j, j' \le M},$$
and let $\phi_{\max}$ denote the maximal eigenvalue of $\Psi_n$.

3. Restricted eigenvalue assumptions. We now introduce the key assumptions on the Gram matrix that are needed to guarantee nice statistical properties of the Lasso and Dantzig selector. Under the sparsity scenario we are typically interested in the case where $M > n$, and even $M \gg n$. Then the matrix $\Psi_n$ is degenerate, which can be written as
$$\min_{\delta \in \mathbb{R}^M : \delta \ne 0} \frac{(\delta^T \Psi_n \delta)^{1/2}}{\|\delta\|_2} = \min_{\delta \in \mathbb{R}^M : \delta \ne 0} \frac{\|X\delta\|_2}{\sqrt{n}\, \|\delta\|_2} = 0.$$
Clearly, ordinary least squares does not work in this case, since it requires positive definiteness of $\Psi_n$, i.e.,
(3.1) $\min_{\delta \in \mathbb{R}^M : \delta \ne 0} \dfrac{\|X\delta\|_2}{\sqrt{n}\,\|\delta\|_2} > 0.$
It turns out that the Lasso and Dantzig selector require much weaker assumptions: the minimum in (3.1) can be replaced by the minimum over a
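The reduction of the Dantzig selector to a linear program can be sketched as follows, using the standard reformulation with auxiliary variables $u_j \ge |\beta_j|$ (the function name and choice of SciPy's `linprog` with the HiGHS backend are ours):

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, r):
    """LP sketch of min ||b||_1 s.t. ||D^{-1/2} X^T (y - X b) / n||_inf <= r,
    written in the variables z = (b, u) with |b_j| <= u_j."""
    n, M = X.shape
    w = np.sqrt((X ** 2).mean(axis=0))   # diag(D)^{1/2}, the column norms ||f_j||_n
    G = (X.T @ X / n) / w[:, None]       # D^{-1/2} X^T X / n
    g = (X.T @ y / n) / w                # D^{-1/2} X^T y / n
    c = np.concatenate([np.zeros(M), np.ones(M)])   # objective: sum of u_j
    I = np.eye(M)
    A_ub = np.block([[ G, np.zeros((M, M))],        #  G b          <= r + g
                     [-G, np.zeros((M, M))],        # -G b          <= r - g
                     [ I, -I],                      #  b - u        <= 0
                     [-I, -I]])                     # -b - u        <= 0
    b_ub = np.concatenate([r + g, r - g, np.zeros(2 * M)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * M + [(0, None)] * M,
                  method="highs")
    return res.x[:M]
```

The first two constraint blocks encode $\|g - G\beta\|_\infty \le r$, i.e., the Dantzig constraint; the last two force $u_j \ge |\beta_j|$, so minimizing $\sum_j u_j$ minimizes $\|\beta\|_1$.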

restricted set of vectors, and the norm $\|\delta\|_2$ in the denominator of the condition can be replaced by the $\ell_2$ norm of only a part of $\delta$.

One of the properties of both the Lasso and the Dantzig selector is that, for the linear regression model, both $\delta = \hat\beta_L - \beta^*$ and $\delta = \hat\beta_D - \beta^*$ satisfy, with high probability,
(3.2) $\|\delta_{J^c}\|_1 \le c_0 \|\delta_J\|_1$,
where $J = J(\beta^*)$ is the set of non-zero coefficients of $\beta^*$. For the linear regression model, the vector of Dantzig residuals $\delta$ satisfies (3.2) with probability close to 1 if $c_0 = 1$, cf. (B.9). A similar inequality holds for the vector of Lasso residuals $\delta = \hat\beta_L - \beta^*$, but this time with $c_0 = 3$, and with a probability which is not exactly equal to 1, cf. Corollary B.2.

Now consider, for example, the case where the elements of the Gram matrix $\Psi_n$ are close to those of a positive definite $M \times M$ matrix $\Psi$. Denote by $\varepsilon_n = \max_{i,j} |(\Psi_n - \Psi)_{i,j}|$ the maximal difference between the elements of the two matrices. Then, for any $\delta$ satisfying (3.2), we get
(3.3)
$$\frac{\delta^T \Psi_n \delta}{\|\delta\|_2^2} = \frac{\delta^T \Psi \delta + \delta^T(\Psi_n - \Psi)\delta}{\|\delta\|_2^2} \ge \frac{\delta^T \Psi \delta}{\|\delta\|_2^2} - \frac{\varepsilon_n \|\delta\|_1^2}{\|\delta\|_2^2} \ge \frac{\delta^T \Psi \delta}{\|\delta\|_2^2} - \varepsilon_n \Big( \frac{(1 + c_0)\|\delta_J\|_1}{\|\delta_J\|_2} \Big)^2 \ge \frac{\delta^T \Psi \delta}{\|\delta\|_2^2} - \varepsilon_n (1 + c_0)^2 |J|.$$

Thus, for $\delta$ satisfying (3.2), which are the vectors that we have in mind, and for $\varepsilon_n |J|$ small enough, the LHS of (3.3) is bounded away from 0. It means that we have a kind of restricted positive definiteness, which is valid only for the vectors satisfying (3.2). This suggests the following conditions that will suffice for the main argument of the paper. We refer to these conditions as restricted eigenvalue (RE) assumptions.

Our first RE assumption is:

Assumption RE$(s, c_0)$: For some integer $s$ such that $1 \le s \le M$, and a positive number $c_0$, the following condition holds:
$$\kappa(s, c_0) = \min_{J_0 \subseteq \{1, \dots, M\},\ |J_0| \le s}\ \ \min_{\delta \ne 0,\ \|\delta_{J_0^c}\|_1 \le c_0 \|\delta_{J_0}\|_1} \frac{\|X\delta\|_2}{\sqrt{n}\, \|\delta_{J_0}\|_2} > 0.$$

The integer $s$ here plays the role of an upper bound on the sparsity $M(\beta)$ of a vector of coefficients $\beta$.

Note that if Assumption RE$(s, c_0)$ is satisfied with $c_0 \ge 1$, then
$$\min\{\|X\delta\|_2 : M(\delta) \le 2s,\ \delta \ne 0\} > 0.$$
In words, the square submatrices of size $\le 2s$ of the Gram matrix are necessarily positive definite. Indeed, suppose that for some $\delta \ne 0$ we have simultaneously $M(\delta) \le 2s$ and $X\delta = 0$. Partition $J(\delta)$ into two sets, $J(\delta) = J_0 \cup J_1$, such that $|J_i| \le s$, $i = 0, 1$. Without loss of generality, suppose that $\|\delta_{J_1}\|_1 \le \|\delta_{J_0}\|_1$. Since, clearly, $\delta_{J_1} = \delta_{J_0^c}$ and $c_0 \ge 1$, we have $\|\delta_{J_0^c}\|_1 \le c_0 \|\delta_{J_0}\|_1$. Hence $\kappa(s, c_0) = 0$, a contradiction.

To introduce the second assumption we need some more notation. For integers $s, m$ such that $1 \le s \le M/2$ and $m \ge s$, $s + m \le M$, a vector $\delta \in \mathbb{R}^M$ and a set of indices $J_0 \subseteq \{1, \dots, M\}$ with $|J_0| \le s$, denote by $J_m$ the subset of $\{1, \dots, M\}$ corresponding to the $m$ largest in absolute value coordinates of $\delta$ outside of $J_0$, and define $J_{0m} = J_0 \cup J_m$.

Assumption RE$(s, m, c_0)$:
$$\kappa(s, m, c_0) = \min_{J_0 \subseteq \{1, \dots, M\},\ |J_0| \le s}\ \ \min_{\delta \ne 0,\ \|\delta_{J_0^c}\|_1 \le c_0 \|\delta_{J_0}\|_1} \frac{\|X\delta\|_2}{\sqrt{n}\, \|\delta_{J_{0m}}\|_2} > 0.$$

Note that the only difference between the two assumptions is in the denominators, and $\kappa(s, m, c_0) \le \kappa(s, c_0)$. As written, for fixed $n$, the two assumptions are equivalent. However, asymptotically for large $n$, Assumption RE$(s, c_0)$ is less restrictive than RE$(s, m, c_0)$, since the ratio $\kappa(s, m, c_0)/\kappa(s, c_0)$ may tend to 0 if $s$ and $m$ depend on $n$. For our bounds on the prediction loss and on the $\ell_1$ loss of the Lasso and Dantzig estimators we will only need Assumption RE$(s, c_0)$. Assumption RE$(s, m, c_0)$ will be required exclusively for the bounds on the $\ell_p$ loss with $1 < p \le 2$. Note also that Assumptions RE$(s, c_0)$ and RE$(s, m, c_0)$ imply Assumptions RE$(s', c_0)$ and RE$(s', m, c_0)$, respectively, if $s' \le s$.

4. Discussion of the RE assumptions. There exist several simple sufficient conditions for Assumptions RE$(s, c_0)$ and RE$(s, m, c_0)$ to hold. Here we discuss some of them.

For a real number $1 \le u \le M$, we introduce the following quantities that
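Because the minimization in $\kappa(s, c_0)$ runs over all supports and over the whole cone defined by (3.2), it has no closed form; for toy problems one can obtain a crude Monte Carlo upper bound by sampling feasible directions (sampling can only overestimate a minimum). This is a sketch under our own design choices, not a procedure from the paper:

```python
import numpy as np
from itertools import combinations

def kappa_upper_bound(X, s, c0, n_draws=2000, seed=0):
    """Monte Carlo UPPER bound on the restricted eigenvalue
    kappa(s, c0) = min over |J0| <= s, ||d_{J0^c}||_1 <= c0 ||d_{J0}||_1
    of ||X d||_2 / (sqrt(n) ||d_{J0}||_2)."""
    n, M = X.shape
    rng = np.random.default_rng(seed)
    best = np.inf
    supports = list(combinations(range(M), s))
    for _ in range(n_draws):
        J = list(supports[rng.integers(len(supports))])
        d = np.zeros(M)
        d[J] = rng.standard_normal(s)
        rest = rng.standard_normal(M)
        rest[J] = 0.0
        l1_rest = np.sum(np.abs(rest))
        if l1_rest > 0:
            # rescale so the cone constraint ||d_{J0^c}||_1 <= c0 ||d_{J0}||_1 is tight
            rest *= c0 * np.sum(np.abs(d)) / l1_rest
        d += rest
        ratio = np.linalg.norm(X @ d) / (np.sqrt(n) * np.linalg.norm(d[J]))
        best = min(best, ratio)
    return best
```

When $\Psi_n$ is the identity, every feasible direction gives a ratio of at least 1, so the bound stays at or above the true value $\kappa = 1$.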

we will call restricted eigenvalues:
$$\phi_{\min}(u) = \min_{x \ne 0:\ M(x) \le u} \frac{x^T \Psi_n x}{\|x\|_2^2}, \qquad \phi_{\max}(u) = \max_{x \ne 0:\ M(x) \le u} \frac{x^T \Psi_n x}{\|x\|_2^2}.$$

Denote by $X_J$ the $n \times |J|$ submatrix of $X$ obtained by removing from $X$ the columns that do not correspond to the indices in $J$, and for $1 \le m_1, m_2 \le M$ introduce the following quantities called restricted correlations:
$$\theta_{m_1,m_2} = \max\Big\{ \frac{1}{n}\, c_1^T X_{J_1}^T X_{J_2} c_2 :\ J_1 \cap J_2 = \emptyset,\ |J_i| \le m_i,\ c_i \in \mathbb{R}^{J_i},\ \|c_i\|_2 \le 1,\ i = 1, 2 \Big\}.$$

In Lemma 4.1 below we argue that a sufficient condition for RE$(s, c_0)$ and RE$(s, s, c_0)$ to hold is given, for example, by the following assumption on the Gram matrix.

Assumption 1: Assume $\phi_{\min}(2s) > c_0\, \theta_{s,2s}$ for some integer $1 \le s \le M/2$ and a constant $c_0 > 0$.

This condition with $c_0 = 1$ appeared in [7], in connection with the Dantzig selector. Assumption 1 is more general: we can have here an arbitrary constant $c_0 > 0$, which will allow us to cover not only the Dantzig selector but also the Lasso estimators, and to prove oracle inequalities for the prediction loss when the model is nonparametric.

Our second sufficient condition for RE$(s, c_0)$ and RE$(s, m, c_0)$ does not need bounds on correlations. Only bounds on the minimal and maximal eigenvalues of small submatrices of the Gram matrix $\Psi_n$ are involved.

Assumption 2: Assume $m\, \phi_{\min}(s + m) > c_0^2\, s\, \phi_{\max}(m)$ for some integers $s, m$ such that $1 \le s \le M/2$, $m \ge s$, and $s + m \le M$, and a constant $c_0 > 0$.

Assumption 2 can be viewed as a weakening of the condition on $\phi_{\min}$ in [18]. Indeed, taking $s + m = s \log n$ (we assume w.l.o.g. that $s \log n$ is an integer and $n > 3$) and assuming that $\phi_{\max}(\cdot)$ is uniformly bounded by a constant, we get that Assumption 2 is equivalent to
$$\phi_{\min}(s \log n) > c/\log n$$
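For small $M$, the restricted eigenvalues $\phi_{\min}(u)$ and $\phi_{\max}(u)$ can be computed exactly by enumeration: the extremal Rayleigh quotients over $\{x : M(x) \le u\}$ are attained on supports of size exactly $u$, so it suffices to scan the $u \times u$ principal submatrices of $\Psi_n$. A minimal sketch (function name ours; the enumeration is combinatorial and only feasible for toy sizes):

```python
import numpy as np
from itertools import combinations

def restricted_eigs(Psi, u):
    """Exact phi_min(u) and phi_max(u) of a Gram matrix Psi by enumerating
    all supports of size u and taking extreme eigenvalues of the submatrices."""
    M = Psi.shape[0]
    phi_min, phi_max = np.inf, -np.inf
    for J in combinations(range(M), u):
        evals = np.linalg.eigvalsh(Psi[np.ix_(J, J)])  # sorted ascending
        phi_min = min(phi_min, evals[0])
        phi_max = max(phi_max, evals[-1])
    return phi_min, phi_max
```

For the identity Gram matrix both quantities equal 1 for every $u$, the benchmark case discussed around Theorem 5.1.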

where $c > 0$ is a constant. The corresponding slightly stronger assumption in [18] is stated in asymptotic form (for $s = s_n$): $\liminf_n \phi_{\min}(s \log n) > 0$.

The following two constants are useful when Assumptions 1 and 2 are considered:
$$\kappa_1(s, c_0) = \sqrt{\phi_{\min}(2s)} \Big( 1 - \frac{c_0\, \theta_{s,2s}}{\phi_{\min}(2s)} \Big)$$
and
$$\kappa_2(s, m, c_0) = \sqrt{\phi_{\min}(s+m)} \Big( 1 - c_0 \sqrt{\frac{s\, \phi_{\max}(m)}{m\, \phi_{\min}(s+m)}} \Big).$$

The next lemma shows that if Assumption 1 or 2 is satisfied, then the quadratic form $x^T \Psi_n x$ is positive definite on some restricted sets of vectors $x$. The construction of the lemma is inspired by Candes and Tao [7] and covers, in particular, the corresponding result in [7].

Lemma 4.1. Fix an integer $1 \le s \le M/2$ and a constant $c_0 > 0$.
(i) Let Assumption 1 be satisfied. Then Assumptions RE$(s, c_0)$ and RE$(s, s, c_0)$ hold with $\kappa(s, c_0) = \kappa(s, s, c_0) = \kappa_1(s, c_0)$. Moreover, for any subset $J_0$ of $\{1, \dots, M\}$ with cardinality $|J_0| \le s$, and any $\delta \in \mathbb{R}^M$ such that
(4.1) $\|\delta_{J_0^c}\|_1 \le c_0 \|\delta_{J_0}\|_1$
we have
$$\frac{1}{\sqrt n}\, \|P_{0s} X\delta\|_2 \ge \kappa_1(s, c_0)\, \|\delta_{J_{0s}}\|_2,$$
where $P_{0m}$ denotes the projector on the linear span of the columns of $X_{J_{0m}}$.
(ii) Let Assumption 2 be satisfied. Then Assumptions RE$(s, c_0)$ and RE$(s, m, c_0)$ hold with $\kappa(s, c_0) = \kappa(s, m, c_0) = \kappa_2(s, m, c_0)$. Moreover, for any subset $J_0$ of $\{1, \dots, M\}$ with cardinality $|J_0| \le s$, and any $\delta \in \mathbb{R}^M$ such that (4.1) holds, we have
$$\frac{1}{\sqrt n}\, \|P_{0m} X\delta\|_2 \ge \kappa_2(s, m, c_0)\, \|\delta_{J_{0m}}\|_2.$$

The proof of the lemma is given in Appendix A. There exist other sufficient conditions for Assumptions RE$(s, c_0)$ and RE$(s, m, c_0)$ to hold. We mention here three of them, each implying Assumption RE$(s, c_0)$. The first one is the following.

Assumption 3. For an integer $s$ such that $1 \le s \le M$, we have
$$\phi_{\min}(s) > 2 c_0\, \theta_{s,1} \sqrt{s}$$

where $c_0 > 0$ is a constant. To argue that Assumption 3 implies RE$(s, c_0)$, it suffices to remark that
$$\frac{1}{n}\|X\delta\|_2^2 \ge \frac{1}{n}\,\delta_{J_0}^T X_{J_0}^T X_{J_0} \delta_{J_0} - \frac{2}{n}\,\big|\delta_{J_0}^T X_{J_0}^T X_{J_0^c} \delta_{J_0^c}\big| \ge \phi_{\min}(s)\,\|\delta_{J_0}\|_2^2 - \frac{2}{n}\,\big|\delta_{J_0}^T X_{J_0}^T X_{J_0^c} \delta_{J_0^c}\big|$$
and, if (4.1) holds,
$$\frac{1}{n}\,\big|\delta_{J_0}^T X_{J_0}^T X_{J_0^c} \delta_{J_0^c}\big| \le \|\delta_{J_0^c}\|_1 \max_{j \in J_0^c} \frac{1}{n}\,\big|\delta_{J_0}^T X_{J_0}^T x^{(j)}\big| \le \theta_{s,1} \|\delta_{J_0^c}\|_1 \|\delta_{J_0}\|_2 \le c_0\, \theta_{s,1} \sqrt{s}\, \|\delta_{J_0}\|_2^2,$$
where $x^{(j)}$ denotes the $j$-th column of $X$.

Another type of assumption, related to mutual coherence [8], is discussed in connection to the Lasso in [4, 5]. We state it here in a slightly different form.

Assumption 4. For an integer $s$ such that $1 \le s \le M$, we have
$$\phi_{\min}(s) > 2 c_0\, \theta_{1,1}\, s$$
where $c_0 > 0$ is a constant.

It is easy to see that Assumption 4 implies RE$(s, c_0)$. Indeed, if (4.1) holds,
(4.2) $\dfrac{1}{n}\|X\delta\|_2^2 \ge \dfrac{1}{n}\,\delta_{J_0}^T X_{J_0}^T X_{J_0}\delta_{J_0} - 2\theta_{1,1}\|\delta_{J_0^c}\|_1\|\delta_{J_0}\|_1 \ge \phi_{\min}(s)\|\delta_{J_0}\|_2^2 - 2 c_0 \theta_{1,1}\|\delta_{J_0}\|_1^2 \ge \big(\phi_{\min}(s) - 2 c_0 \theta_{1,1} s\big)\|\delta_{J_0}\|_2^2.$

If all the diagonal elements of the matrix $X^T X/n$ are equal to 1 (and thus $\theta_{1,1}$ coincides with the mutual coherence [8]), a simple sufficient condition for Assumption RE$(s, c_0)$ to hold is given by the following.

Assumption 5. For an integer $s$ such that $1 \le s \le M$, we have
(4.3) $\theta_{1,1} < \dfrac{1}{(1 + 2 c_0)\, s}$,
where $c_0 > 0$ is a constant.

In fact, separating the diagonal and off-diagonal terms of the quadratic form, we get
$$\frac{1}{n}\,\delta_{J_0}^T X_{J_0}^T X_{J_0} \delta_{J_0} \ge \|\delta_{J_0}\|_2^2 - \theta_{1,1}\|\delta_{J_0}\|_1^2 \ge \|\delta_{J_0}\|_2^2\,(1 - \theta_{1,1}\, s).$$
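With normalized columns, $\theta_{1,1}$ is just the largest off-diagonal entry of the Gram matrix in absolute value, so the coherence condition (4.3) is easy to check numerically. A minimal sketch (function name ours):

```python
import numpy as np

def coherence_check(X, s, c0):
    """Compute the mutual coherence theta_{1,1} = max_{j != k} |Psi_{j,k}|
    after normalizing the columns so that diag(X^T X / n) = 1, and test the
    sufficient condition of Assumption 5: theta_{1,1} < 1 / ((1 + 2 c0) s)."""
    n = X.shape[0]
    Xn = X / np.sqrt((X ** 2).mean(axis=0))   # rescale columns to ||f_j||_n = 1
    Psi = Xn.T @ Xn / n
    off = Psi - np.diag(np.diag(Psi))         # zero out the diagonal
    theta = np.max(np.abs(off))
    return theta, theta < 1.0 / ((1.0 + 2.0 * c0) * s)
```

Orthogonal designs give $\theta_{1,1} = 0$ and pass the check for any $s$ and $c_0$; a design with two identical columns gives $\theta_{1,1} = 1$ and fails it, matching the discussion of degenerate Gram matrices above.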

Combining this inequality with (4.2), we see that Assumption RE$(s, c_0)$ is satisfied whenever (4.3) holds.

Unfortunately, Assumption RE$(s, c_0)$ has some weakness. Let, for example, $f_j$, $j = 1, \dots, 2^m$, be the Haar wavelet basis on $[0, 1]$ ($M = 2^m$), and consider $Z_i = i/n$, $i = 1, \dots, n$. If $M \gg n$, it is clear that $\phi_{\min}(1) = 0$, since there are functions $f_j$ on the highest resolution level whose supports (of length $M^{-1}$) contain no points $Z_i$. So, none of Assumptions 1–4 holds. A less severe, although similar, situation arises when we consider step functions: $f_j(z) = I_{\{z < j/M\}}$. It is clear that $\phi_{\min}(2) = O(1/M)$, although sparse representation in this basis is very natural. Intuitively, the problem arises only because we include very high resolution components. Therefore, we may try to restrict the set $J_0$ in RE$(s, c_0)$ to low resolution components, which is quite reasonable because the true or interesting vectors of parameters $\beta$ are often characterized by such $J_0$. This idea is formalized in Section 6, cf. Corollary 6.2; see also a remark after Theorem 7.2 in Section 7.

5. Approximate equivalence. In this section we prove a type of approximate equivalence between the Lasso and the Dantzig selector. It is expressed as closeness of the prediction losses $\|\hat f_D - f\|_n^2$ and $\|\hat f_L - f\|_n^2$ when the number of non-zero components of the Lasso or the Dantzig selector is small as compared to the sample size.

Theorem 5.1. Let $W_i$ be independent $\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Fix $n \ge 1$, $M \ge 2$. Let Assumption RE$(s, 1)$ be satisfied with $1 \le s \le M$. Consider the Dantzig estimator $\hat f_D$ defined by (2.5)–(2.4) with
$$r = A\sigma \sqrt{\frac{\log M}{n}}$$
where $A > 2\sqrt 2$, and the Lasso estimator $\hat f_L$ defined by (2.1)–(2.2) with the same $r$. If $M(\hat\beta_L) \le s$, then with probability at least $1 - M^{1 - A^2/8}$ we have
(5.1) $\big| \|\hat f_D - f\|_n^2 - \|\hat f_L - f\|_n^2 \big| \le 16 A^2 M(\hat\beta_L)\, \sigma^2\, \dfrac{f_{\max}^2}{\kappa^2(s, 1)}\, \dfrac{\log M}{n}.$

Note that the RHS of (5.1) is bounded by a product of three factors (and a numerical constant which, unfortunately, equals at least 128). The first factor, $M(\hat\beta_L)\sigma^2/n \le s\sigma^2/n$, corresponds to the error rate for prediction in regression with $s$ parameters.
The two other factors, $\log M$ and $f_{\max}^2/\kappa^2(s, 1)$, can be regarded as a price to pay for the large number of regressors. If the Gram matrix $\Psi_n$ equals the identity matrix (the white

noise model), then there is only the $\log M$ factor. In the general case, there is another factor, $f_{\max}^2/\kappa^2(s, 1)$, representing the extent to which the Gram matrix is ill-posed for estimation of sparse vectors.

We also have the following result, which we state for simplicity under the assumption that $\|f_j\|_n = 1$, $j = 1, \dots, M$. It gives a bound in the spirit of Theorem 5.1, but with $M(\hat\beta_D)$ rather than $M(\hat\beta_L)$ on the right hand side.

Theorem 5.2. Let the assumptions of Theorem 5.1 hold, but with RE$(s, 5)$ in place of RE$(s, 1)$, and let $\|f_j\|_n = 1$, $j = 1, \dots, M$. If $M(\hat\beta_D) \le s$, then with probability at least $1 - M^{1 - A^2/8}$ we have
(5.2) $\|\hat f_L - f\|_n^2 \le \|\hat f_D - f\|_n^2 + \dfrac{8 A^2 M(\hat\beta_D)\, \sigma^2 \log M}{n\, \kappa^2(s, 5)}.$

Remark. The approximate equivalence is essentially that of the rates, as Theorem 5.1 exhibits. A statement free of $M(\beta)$ holds for linear regression; see the discussion after Theorem 7.2 and Theorem 7.3 below.

6. Oracle inequalities for prediction loss. Here we prove sparsity oracle inequalities for the prediction loss of the Lasso and Dantzig estimators. These inequalities allow us to bound the difference between the prediction errors of the estimators and the best sparse approximation of the regression function (by an oracle that knows the truth, but is constrained by sparsity). The results of this section, together with those of Section 5, show that the distance between the prediction losses of the Dantzig and Lasso estimators is of the same order as the distances between them and their oracle approximations.

A general discussion of sparsity oracle inequalities can be found in [23]. Such inequalities have been recently obtained for Lasso-type estimators in a number of settings [2–6, 14, 25]. In particular, the regression model with fixed design that we study here is considered in [2–4]. The assumptions on the Gram matrix $\Psi_n$ in [2–4] are more restrictive than ours: in those papers either $\Psi_n$ is positive definite or a mutual coherence condition similar to (4.3) is imposed.

Theorem 6.1. Let $W_i$ be independent $\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Fix some $\varepsilon > 0$ and integers $n \ge 1$, $M \ge 2$, $1 \le s \le M$.
Let Assumption RE$(s, 3 + 4/\varepsilon)$ be satisfied. Consider the Lasso estimator $\hat f_L$ defined by (2.1)–(2.2) with
$$r = A\sigma \sqrt{\frac{\log M}{n}}$$

for some $A > 2\sqrt 2$. Then, with probability at least $1 - M^{1 - A^2/8}$, we have
(6.1) $\|\hat f_L - f\|_n^2 \le (1 + \varepsilon) \inf_{\beta \in \mathbb{R}^M :\ M(\beta) \le s} \Big\{ \|f_\beta - f\|_n^2 + \dfrac{C(\varepsilon)\, f_{\max}^2 A^2 \sigma^2}{\kappa^2(s, 3 + 4/\varepsilon)}\, \dfrac{M(\beta) \log M}{n} \Big\}$,
where $C(\varepsilon) > 0$ is a constant depending only on $\varepsilon$.

We now state as a corollary a softer version of Theorem 6.1 that can be used to eliminate the pathologies mentioned at the end of Section 4. For this purpose we define
$$\mathcal{J}_{s,\gamma,c_0} = \Big\{ J_0 \subseteq \{1, \dots, M\} : |J_0| \le s \ \text{ and } \min_{\delta \ne 0:\ \|\delta_{J_0^c}\|_1 \le c_0 \|\delta_{J_0}\|_1} \frac{\|X\delta\|_2}{\sqrt n\, \|\delta_{J_0}\|_2} \ge \gamma \Big\},$$
where $\gamma > 0$ is a constant, and set
$$\Lambda_{s,\gamma,c_0} = \{\beta : J(\beta) \in \mathcal{J}_{s,\gamma,c_0}\}.$$
In a similar way, we define $\mathcal{J}_{s,\gamma,m,c_0}$ and $\Lambda_{s,\gamma,m,c_0}$ corresponding to Assumption RE$(s, m, c_0)$.

Corollary 6.2. Let $W_i$, $s$ and the Lasso estimator $\hat f_L$ be the same as in Theorem 6.1. Then, for all $n \ge 1$, $\varepsilon > 0$, and $\gamma > 0$, with probability at least $1 - M^{1-A^2/8}$, we have
$$\|\hat f_L - f\|_n^2 \le (1 + \varepsilon) \inf_{\beta \in \Lambda_{s,\gamma,\varepsilon}} \Big\{ \|f_\beta - f\|_n^2 + \frac{C(\varepsilon)\, f_{\max}^2 A^2 \sigma^2}{\gamma^2} \Big( \frac{M(\beta) \log M}{n} \Big) \Big\},$$
where $\Lambda_{s,\gamma,\varepsilon} = \{\beta \in \Lambda_{s,\gamma,3+4/\varepsilon} : M(\beta) \le s\}$.

To obtain this corollary, it suffices to observe that the proof of Theorem 6.1 goes through if we drop Assumption RE$(s, 3 + 4/\varepsilon)$, but we assume instead that $\beta \in \Lambda_{s,\gamma,3+4/\varepsilon}$, and we replace $\kappa(s, 3 + 4/\varepsilon)$ by $\gamma$.

We would now like to get a sparsity oracle inequality similar to that of Theorem 6.1 for the Dantzig estimator $\hat f_D$. We will need a mild additional assumption on $f$. This is due to the fact that not every $\beta \in \mathbb{R}^M$ obeys the Dantzig constraint, and thus we cannot assure the key relation (B.9) for all $\beta \in \mathbb{R}^M$. One possibility would be to prove an inequality as (6.1) where the infimum on the right hand side is taken over $\beta$ satisfying not only $M(\beta) \le s$ but also the Dantzig constraint. However, this seems not very intuitive, since

we cannot guarantee that the corresponding $f_\beta$ gives a good approximation of the unknown function $f$. Therefore, we choose another approach (cf. [15]): we consider $f$ satisfying the weak sparsity property relative to the dictionary $f_1, \dots, f_M$. That is, we assume that there exist an integer $s$ and a constant $C_0 < \infty$ such that the set
(6.2) $\Lambda_s = \Big\{ \beta \in \mathbb{R}^M : M(\beta) \le s,\ \|f_\beta - f\|_n^2 \le \dfrac{C_0\, f_{\max}^2\, r^2}{\kappa^2(s, 3 + 4/\varepsilon)}\, M(\beta) \Big\}$
is non-empty. The second inequality in (6.2) says that the bias term $\|f_\beta - f\|_n^2$ cannot be much larger than the variance term $f_{\max}^2 r^2 \kappa^{-2} M(\beta)$, cf. (6.1). Weak sparsity is milder than the sparsity property in the usual sense: the latter means that $f$ admits the exact representation $f = f_{\beta^*}$ for some $\beta^* \in \mathbb{R}^M$, with hopefully small $M(\beta^*) = s$.

Proposition 6.3. Let $W_i$ be independent $\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Fix some $\varepsilon > 0$ and integers $n \ge 1$, $M \ge 2$. Let $f$ obey the weak sparsity assumption for some $C_0 < \infty$ and some $s$ such that $s \max\{C_1(\varepsilon), 1\} \le M$, where
$$C_1(\varepsilon) = 4\big[(1 + \varepsilon)\,C_0 + C(\varepsilon)\big]\, \frac{\phi_{\max}\, f_{\max}^2}{\kappa^2 f_{\min}^2}$$
and $C(\varepsilon)$ is the constant in Theorem 6.1. Suppose further that Assumption RE$(s \max\{C_1(\varepsilon), 1\},\ 3 + 4/\varepsilon)$ is satisfied. Consider the Dantzig estimator $\hat f_D$ defined by (2.5)–(2.4) with
$$r = A\sigma\sqrt{\frac{\log M}{n}}$$
and $A > 2\sqrt 2$. Then, with probability at least $1 - M^{1-A^2/8}$, we have
(6.3) $\|\hat f_D - f\|_n^2 \le (1 + \varepsilon) \inf_{\beta \in \mathbb{R}^M :\ M(\beta) = s} \|f_\beta - f\|_n^2 + C_2(\varepsilon)\, \dfrac{f_{\max}^2 A^2 \sigma^2}{\kappa^2} \Big( \dfrac{s \log M}{n} \Big).$
Here $C_2(\varepsilon) = 16\,C_1(\varepsilon) + C(\varepsilon)$ and $\kappa = \kappa(\max(C_1(\varepsilon), 1)\,s,\ 3 + 4/\varepsilon)$.

Note that the sparsity oracle inequality (6.3) is slightly weaker than the analogous inequality (6.1) for the Lasso: here we have $\inf_{\beta \in \mathbb{R}^M : M(\beta) = s}$ instead of $\inf_{\beta \in \mathbb{R}^M : M(\beta) \le s}$ in (6.1).

7. Special case: parametric estimation in linear regression. In this section we assume that the vector of observations $y = (Y_1, \dots, Y_n)^T$ is of the form
(7.1) $y = X\beta^* + w$,
where $X$ is an $n \times M$ deterministic matrix, $\beta^* \in \mathbb{R}^M$ and $w = (W_1, \dots, W_n)^T$. We consider dimensions $M$ that can be of order $n$ and even much larger. Then $\beta^*$ is, in general, not uniquely defined. For $M > n$, if (7.1) is satisfied for $\beta^* = \beta_0$, there exists an affine space $U = \{\beta^* : X\beta^* = X\beta_0\}$ of vectors satisfying (7.1). The results of this section are valid for any $\beta^*$ such that (7.1) holds. However, we will assume that Assumption RE$(s, c_0)$ holds with $c_0 \ge 1$ and that $M(\beta^*) = s$. Then the set $U \cap \{\beta' : M(\beta') = s\}$ reduces to a single element (cf. Remark 2 at the end of this section). In this sense, there is a unique sparse solution of (7.1).

Our goal in this section, unlike that of the previous ones, is to estimate both $X\beta^*$ for the purpose of prediction and $\beta^*$ itself for the purpose of model selection. We will see that meaningful results are obtained when the sparsity index $M(\beta^*)$ is small.

It will be assumed throughout this section that the diagonal elements of the Gram matrix $\Psi_n = X^T X/n$ are all equal to 1 (this is equivalent to the condition $\|f_j\|_n = 1$, $j = 1, \dots, M$, in the notation of the previous sections). Then the Lasso estimator of $\beta^*$ in (7.1) is defined by
(7.2) $\hat\beta_L = \arg\min_{\beta \in \mathbb{R}^M} \Big\{ \frac{1}{n}\|y - X\beta\|_2^2 + 2r\|\beta\|_1 \Big\}.$
The correspondence between the notation here and that of the previous sections is the following: $\|f_\beta\|_n^2 = \|X\beta\|_2^2/n$, $\|f_\beta - f\|_n^2 = \|X(\beta - \beta^*)\|_2^2/n$, $\|\hat f_L - f\|_n^2 = \|X(\hat\beta_L - \beta^*)\|_2^2/n$.

The Dantzig selector for the linear model (7.1) is defined by
(7.3) $\hat\beta_D = \arg\min_{\beta \in \Lambda} \|\beta\|_1$,
where
$$\Lambda = \Big\{ \beta \in \mathbb{R}^M : \Big\| \frac{1}{n} X^T (y - X\beta) \Big\|_\infty \le r \Big\}$$
is the set of all $\beta$ satisfying the Dantzig constraint. We first get bounds on the rate of convergence of the Dantzig selector.

Theorem 7.1. Let $W_i$ be independent $\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$, let all the diagonal elements of the matrix $X^T X/n$ be equal to 1, and let $M(\beta^*) = s$, where $1 \le s \le M$, $n \ge 1$, $M \ge 2$. Let Assumption RE$(s, 1)$ be satisfied. Consider the Dantzig selector $\hat\beta_D$ defined by (7.3) with
$$r = A\sigma\sqrt{\frac{\log M}{n}}$$
and $A > \sqrt 2$. Then, with probability at least $1 - M^{1 - A^2/2}$, we have
(7.4) $\|\hat\beta_D - \beta^*\|_1 \le \dfrac{8A}{\kappa^2(s, 1)}\, \sigma s \sqrt{\dfrac{\log M}{n}},$
(7.5) $\|X(\hat\beta_D - \beta^*)\|_2^2 \le \dfrac{16A^2}{\kappa^2(s, 1)}\, \sigma^2 s \log M.$
In addition, if Assumption RE$(s, m, 1)$ is satisfied, then with the same probability as above, simultaneously for all $1 < p \le 2$, we have
(7.6) $\|\hat\beta_D - \beta^*\|_p^p \le 8^p \Big\{ 1 + \sqrt{\dfrac{s}{m}} \Big\}^{2(p-1)} \dfrac{s}{\kappa^{2p}(s, m, 1)} \Big( A\sigma \sqrt{\dfrac{\log M}{n}} \Big)^p.$

Note that, since $s \le m$, the factor in curly brackets in (7.6) is bounded by a constant independent of $s$ and $m$. Under Assumption 1 of Section 4 with $c_0 = 1$ (which is less general than RE$(s, s, 1)$, cf. Lemma 4.1(i)), a bound of the form (7.6) for the case $p = 2$ is established by Candes and Tao [7].

Bounds on the rate of convergence of the Lasso selector are quite similar to those obtained in Theorem 7.1. They are given by the following result.

Theorem 7.2. Let $W_i$ be independent $\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Let all the diagonal elements of the matrix $X^T X/n$ be equal to 1, and let $M(\beta^*) = s$, where $1 \le s \le M$, $n \ge 1$, $M \ge 2$. Let Assumption RE$(s, 3)$ be satisfied. Consider the Lasso estimator $\hat\beta_L$ defined by (7.2) with
$$r = A\sigma\sqrt{\frac{\log M}{n}}$$
and $A > 2\sqrt 2$. Then, with probability at least $1 - M^{1 - A^2/8}$, we have
(7.7) $\|\hat\beta_L - \beta^*\|_1 \le \dfrac{16A}{\kappa^2(s, 3)}\, \sigma s \sqrt{\dfrac{\log M}{n}},$
(7.8) $\|X(\hat\beta_L - \beta^*)\|_2^2 \le \dfrac{16A^2}{\kappa^2(s, 3)}\, \sigma^2 s \log M,$
(7.9) $M(\hat\beta_L) \le \dfrac{64\,\phi_{\max}}{\kappa^2(s, 3)}\, s.$

In addition, if Assumption RE$(s, m, 3)$ is satisfied, then with the same probability as above, simultaneously for all $1 < p \le 2$, we have
(7.10) $\|\hat\beta_L - \beta^*\|_p^p \le 16^p \Big\{ 1 + 3\sqrt{\dfrac{s}{m}} \Big\}^{2(p-1)} \dfrac{s}{\kappa^{2p}(s, m, 3)} \Big( A\sigma \sqrt{\dfrac{\log M}{n}} \Big)^p.$

Inequalities of a form similar to (7.7) and (7.8) can be deduced from the results of [3] under more restrictive conditions on the Gram matrix (the mutual coherence assumption, cf. Assumption 5 of Section 4).

Assumption RE$(s, 1)$, respectively RE$(s, 3)$, can be dropped in Theorems 7.1 and 7.2 if we assume that $\beta^* \in \Lambda_{s,\gamma,c_0}$ with $c_0 = 1$ or $c_0 = 3$ as appropriate. Then (7.4), (7.5), or respectively (7.7), (7.8), hold with $\kappa = \gamma$. This is analogous to Corollary 6.2. Similarly, (7.6) and (7.10) hold with $\kappa = \gamma$ if $\beta^* \in \Lambda_{s,\gamma,m,c_0}$ with $c_0 = 1$ or $c_0 = 3$ as appropriate.

Observe that by combining Theorems 7.1 and 7.2 we can immediately get bounds for the differences between the Lasso and the Dantzig selector, $\|\hat\beta_L - \hat\beta_D\|_p^p$ and $\|X(\hat\beta_L - \hat\beta_D)\|_2^2$. Such bounds have the same form as those of Theorems 7.1 and 7.2, up to numerical constants. Another way of estimating these differences follows directly from the proof of Theorem 7.1. It suffices to observe that the only property of $\beta^*$ used in that proof is the fact that $\beta^*$ satisfies the Dantzig constraint, which is also true for the Lasso solution $\hat\beta_L$. So, we can replace $\beta^*$ by $\hat\beta_L$ and $s$ by $M(\hat\beta_L)$ everywhere in Theorem 7.1. Generalizing a bit more, we easily derive the following fact.

Theorem 7.3. The result of Theorem 7.1 remains valid if we replace there $\|\hat\beta_D - \beta^*\|_p^p$ by $\sup\{\|\hat\beta_D - \beta\|_p^p : \beta \in \Lambda, M(\beta) = s\}$ for $1 \le p \le 2$, and $\|X(\hat\beta_D - \beta^*)\|_2^2$ by $\sup\{\|X(\hat\beta_D - \beta)\|_2^2 : \beta \in \Lambda, M(\beta) = s\}$, respectively. Here $\Lambda$ is the set of all vectors satisfying the Dantzig constraint.

Remarks.

1. Theorems 7.1 and 7.2 only give non-asymptotic upper bounds on the loss, with some probability and under some conditions. The probability depends on $M$, and the conditions depend on $n$ and $M$: recall that Assumptions RE$(s, c_0)$ and RE$(s, m, c_0)$ are imposed on the $n \times M$ matrix $X$. To deduce asymptotic convergence (as $n \to \infty$ and/or $M \to \infty$) from Theorems 7.1
and 7.2, we would need some very strong additional properties, such as simultaneous validity of Assumption RE$(s, c_0)$ or RE$(s, m, c_0)$ (with one and the same constant $\kappa$) for infinitely many $n$ and $M$.

2. Note that Assumptions RE$(s, c_0)$ and RE$(s, m, c_0)$ do not imply identifiability of $\beta^*$ in the linear model (7.1). However, the vector $\beta^*$ appearing in the

statements of Theorems 7.1 and 7.2 is uniquely defined, because we suppose there, in addition, that $M(\beta^*) = s$ and $c_0 \ge 1$. Indeed, if there exists a $\beta'$ such that $X\beta' = X\beta^*$ and $M(\beta') = s$, then in view of Assumption RE$(s, c_0)$ with $c_0 \ge 1$ we necessarily have $\beta' = \beta^*$ (cf. the discussion following the definition of RE$(s, c_0)$). On the other hand, Theorem 7.3 applies to certain values of $\beta$ that do not come from the model (7.1) at all.

3. For the smallest value of $A$ (which is $A = 2\sqrt 2$), the constants in the bounds of Theorem 7.2 for the Lasso are larger than the corresponding numerical constants for the Dantzig selector given in Theorem 7.1, again for the smallest admissible value $A = \sqrt 2$. On the contrary, the Dantzig selector has certain defects as compared to the Lasso when the model is nonparametric, as discussed in Section 6. In particular, to obtain sparsity oracle inequalities for the Dantzig selector we need some restrictions on $f$, for example the weak sparsity property. On the other hand, the sparsity oracle inequality (6.1) for the Lasso is valid with no restriction on $f$.

4. The proofs of Theorems 7.1 and 7.2 differ mainly in the value of the tuning constant: $c_0 = 1$ in Theorem 7.1 and $c_0 = 3$ in Theorem 7.2. Note that, since the Lasso solution satisfies the Dantzig constraint, we could have obtained a result similar to Theorem 7.2, though with less accurate numerical constants, by simply conducting the proof of Theorem 7.1 with $c_0 = 3$. However, we act differently: we deduce (B.3) directly from (B.1), and not from (B.25). This is done only for the sake of improving the constants: in fact, using (B.25) with $c_0 = 3$ would yield (B.3) with a doubled constant on the right hand side.

5. For the Dantzig selector in the linear regression model and under Assumption 1 or 2, some further improvement of the constants in the $\ell_p$ bounds for the coefficients can be achieved by applying the general version of Lemma 4.1 with the projector $P_{0m}$ inside. We do not pursue this issue here.

6. All our results are stated with probabilities at least $1 - M^{1 - A^2/2}$ or $1 - M^{1 - A^2/8}$.
These are reasonable (but not the most accurate) lower bounds on the probabilities $P(\mathcal B)$ and $P(\mathcal A)$ respectively: we have chosen them just for readability. Inspection of (B.4) shows that they can be refined to $1 - 2M\Phi(-A\sqrt{\log M})$ and $1 - 2M\Phi(-A\sqrt{\log M}/2)$ respectively, where $\Phi(\cdot)$ is the standard normal c.d.f.

APPENDIX A

Proof of Lemma 4.1. Consider a partition of $J^c$ into subsets of size $m$,

with the last subset of size $\le m$: $J^c = \bigcup_{k=1}^K J_k$, where $K \ge 1$, $|J_k| = m$ for $k = 1, \dots, K-1$ and $|J_K| \le m$, such that $J_k$ is the set of indices corresponding to the $m$ largest in absolute value coordinates of $\delta$ outside $\bigcup_{j=1}^{k-1} J_j$ (for $k < K$), and $J_K$ is the remaining subset. We have
\[
(A.1)\qquad \|P_m X\delta\|_2 \ \ge\ \|P_m X\delta_{J_m}\|_2 - \Big\|P_m X\sum_{k=2}^K \delta_{J_k}\Big\|_2
 \ =\ \|X\delta_{J_m}\|_2 - \Big\|P_m X\sum_{k=2}^K \delta_{J_k}\Big\|_2
 \ \ge\ \|X\delta_{J_m}\|_2 - \sum_{k=2}^K \|P_m X\delta_{J_k}\|_2.
\]
We will prove first part (ii) of the lemma. Since for $k \ge 2$ the vector $\delta_{J_k}$ has only $m$ non-zero components, we obtain
\[
(A.2)\qquad \|P_m X\delta_{J_k}\|_2 \ \le\ \|X\delta_{J_k}\|_2 \ \le\ \sqrt{\phi_{\max}(m)}\,\|\delta_{J_k}\|_2.
\]
Next, as in [7], we observe that $\|\delta_{J_{k+1}}\|_2 \le \|\delta_{J_k}\|_1/\sqrt m$, $k = 1, \dots, K-1$, and therefore
\[
(A.3)\qquad \sum_{k=2}^K \|\delta_{J_k}\|_2 \ \le\ \frac{\|\delta_{J^c}\|_1}{\sqrt m} \ \le\ \frac{c_0\|\delta_J\|_1}{\sqrt m} \ \le\ c_0\sqrt{\frac sm}\,\|\delta_J\|_2 \ \le\ c_0\sqrt{\frac sm}\,\|\delta_{J_m}\|_2,
\]
where we used (4.1). From (A.1)-(A.3) we find
\[
\|X\delta\|_2 \ \ge\ \|X\delta_{J_m}\|_2 - c_0\sqrt{\phi_{\max}(m)}\sqrt{\frac sm}\,\|\delta_{J_m}\|_2
 \ \ge\ \Big(\sqrt{\phi_{\min}(s+m)} - c_0\sqrt{\phi_{\max}(m)}\sqrt{\frac sm}\Big)\|\delta_{J_m}\|_2,
\]
which proves part (ii) of the lemma. The proof of part (i) is analogous. The only difference is that we replace $m$ by $s$ in the above argument, and instead of (A.2) we use the following bound (cf. [7]):
\[
\|P_m X\delta_{J_k}\|_2 \ \le\ \frac{\theta_{s,2s}}{\sqrt{\phi_{\min}(2s)}}\,\|\delta_{J_k}\|_2.
\]
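The chain (A.3) rests on an elementary fact: when the coordinates of a vector are grouped into consecutive blocks of size $m$ in decreasing order of magnitude, each block's $\ell_2$ norm is at most the previous block's $\ell_1$ norm divided by $\sqrt m$. A minimal numpy sketch checking this on simulated data (the helper name `block_decompose` is ours, not the paper's):

```python
import numpy as np

def block_decompose(v, m):
    """Indices of v grouped into blocks of size m, by decreasing |v|."""
    order = np.argsort(-np.abs(v))
    return [order[k:k + m] for k in range(0, len(v), m)]

rng = np.random.default_rng(0)
delta_Jc = rng.standard_normal(200)   # stand-in for the coordinates of delta outside J
m = 10
blocks = block_decompose(delta_Jc, m)

# each block's l2 norm is at most the previous block's l1 norm / sqrt(m)
for prev, cur in zip(blocks, blocks[1:]):
    assert np.linalg.norm(delta_Jc[cur]) <= np.abs(delta_Jc[prev]).sum() / np.sqrt(m) + 1e-12

# summed consequence used in (A.3)
lhs = sum(np.linalg.norm(delta_Jc[b]) for b in blocks[1:])
rhs = np.abs(delta_Jc).sum() / np.sqrt(m)
print(lhs <= rhs + 1e-12)  # True
```

The inequality holds because every entry of block $k+1$ is bounded by the smallest, hence by the average, entry of block $k$.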

APPENDIX B: TWO LEMMATA AND THE PROOFS OF THE RESULTS

Lemma B.1. Fix $M \ge 2$ and $n \ge 1$. Let $W_i$ be independent $\mathcal N(0, \sigma^2)$ random variables with $\sigma^2 > 0$, and let $\hat f_L$ be the Lasso estimator defined by (2.2) with
\[
r = A\sigma\sqrt{\frac{\log M}{n}}
\]
for some $A > 2\sqrt 2$. Then, with probability at least $1 - M^{1-A^2/8}$, we have simultaneously for all $\beta \in \mathbb R^M$:
\[
(B.1)\qquad \|\hat f_L - f\|_n^2 + r\sum_{j=1}^M \|f_j\|_n|\hat\beta_{j,L} - \beta_j|
 \ \le\ \|f_\beta - f\|_n^2 + 4r\sum_{j\in J(\beta)} \|f_j\|_n|\hat\beta_{j,L} - \beta_j|
 \ \le\ \|f_\beta - f\|_n^2 + 4r\sqrt{M(\beta)}\Big(\sum_{j\in J(\beta)} \|f_j\|_n^2|\hat\beta_{j,L} - \beta_j|^2\Big)^{1/2},
\]
and
\[
(B.2)\qquad \Big\|\frac1n X^T(f - X\hat\beta_L)\Big\|_\infty \ \le\ 3rf_{\max}/2.
\]
Furthermore, with the same probability,
\[
(B.3)\qquad M(\hat\beta_L) \ \le\ \frac{4\phi_{\max}}{f_{\min}^2}\,\frac{\|\hat f_L - f\|_n^2}{r^2},
\]
where $\phi_{\max}$ denotes the maximal eigenvalue of the matrix $X^TX/n$.

Proof of Lemma B.1. The result (B.1) is essentially Lemma 1 from [5]. For completeness, we give its proof. Set $r_{n,j} = r\|f_j\|_n$. By definition,
\[
\hat S(\hat\beta_L) + 2\sum_{j=1}^M r_{n,j}|\hat\beta_{j,L}| \ \le\ \hat S(\beta) + 2\sum_{j=1}^M r_{n,j}|\beta_j|
\]
for all $\beta \in \mathbb R^M$, which is equivalent to
\[
\|\hat f_L - f\|_n^2 + 2\sum_{j=1}^M r_{n,j}|\hat\beta_{j,L}| \ \le\ \|f_\beta - f\|_n^2 + 2\sum_{j=1}^M r_{n,j}|\beta_j| + \frac2n\sum_{i=1}^n W_i(\hat f_L - f_\beta)(Z_i).
\]
Define the random variables $V_j = \frac1n\sum_{i=1}^n f_j(Z_i)W_i$, $1 \le j \le M$, and the event
\[
\mathcal A = \bigcap_{j=1}^M \{2|V_j| \le r_{n,j}\}.
\]

Using an elementary bound on the tails of the Gaussian distribution we find that the probability of the complementary event $\mathcal A^c$ satisfies
\[
(B.4)\qquad P\{\mathcal A^c\} \ \le\ \sum_{j=1}^M P\{|V_j| > r_{n,j}/2\} \ \le\ M\,P\{|\eta| \ge r\sqrt n/(2\sigma)\}
 \ \le\ M\exp\Big(-\frac{nr^2}{8\sigma^2}\Big) \ =\ M\exp\Big(-\frac{A^2\log M}8\Big) \ =\ M^{1-A^2/8},
\]
where $\eta \sim \mathcal N(0,1)$. On the event $\mathcal A$ we have
\[
\|\hat f_L - f\|_n^2 \ \le\ \|f_\beta - f\|_n^2 + \sum_{j=1}^M r_{n,j}|\hat\beta_{j,L} - \beta_j| + 2\sum_{j=1}^M r_{n,j}|\beta_j| - 2\sum_{j=1}^M r_{n,j}|\hat\beta_{j,L}|.
\]
Adding the term $\sum_{j=1}^M r_{n,j}|\hat\beta_{j,L} - \beta_j|$ to both sides of this inequality yields, on $\mathcal A$,
\[
\|\hat f_L - f\|_n^2 + \sum_{j=1}^M r_{n,j}|\hat\beta_{j,L} - \beta_j| \ \le\ \|f_\beta - f\|_n^2 + 2\sum_{j=1}^M r_{n,j}\big(|\hat\beta_{j,L} - \beta_j| + |\beta_j| - |\hat\beta_{j,L}|\big).
\]
Now, $|\hat\beta_{j,L} - \beta_j| + |\beta_j| - |\hat\beta_{j,L}| = 0$ for $j \notin J(\beta)$, so that on $\mathcal A$ we get (B.1).

To prove (B.2) it suffices to note that on $\mathcal A$ we have
\[
(B.5)\qquad \Big\|\frac1n D^{-1/2}X^TW\Big\|_\infty \ \le\ r/2.
\]
Now, $y = f + W$, and (B.2) follows from (2.3) and (B.5).

We finally prove (B.3). The necessary and sufficient condition for $\hat\beta_L$ to be the Lasso solution can be written in the form
\[
(B.6)\qquad \frac1n x_{(j)}^T(y - X\hat\beta_L) = r\|f_j\|_n\,\mathrm{sign}(\hat\beta_{j,L}) \quad\text{if } \hat\beta_{j,L} \ne 0, \qquad
 \Big|\frac1n x_{(j)}^T(y - X\hat\beta_L)\Big| \le r\|f_j\|_n \quad\text{if } \hat\beta_{j,L} = 0,
\]
where $x_{(j)}$ denotes the $j$th column of $X$, $j = 1, \dots, M$. Next, (B.5) yields that on $\mathcal A$ we have
\[
(B.7)\qquad \Big|\frac1n x_{(j)}^TW\Big| \ \le\ r\|f_j\|_n/2, \qquad j = 1, \dots, M.
\]
Combining (B.6) and (B.7) we get
\[
(B.8)\qquad \Big|\frac1n x_{(j)}^T(f - X\hat\beta_L)\Big| \ \ge\ r\|f_j\|_n/2 \quad\text{if } \hat\beta_{j,L} \ne 0.
\]

Therefore,
\[
\frac1{n^2}(f - X\hat\beta_L)^TXX^T(f - X\hat\beta_L) \ =\ \frac1{n^2}\sum_{j=1}^M\big(x_{(j)}^T(f - X\hat\beta_L)\big)^2
 \ \ge\ \frac1{n^2}\sum_{j:\,\hat\beta_{j,L}\ne0}\big(x_{(j)}^T(f - X\hat\beta_L)\big)^2
 \ \ge\ M(\hat\beta_L)\,r^2\min_j\|f_j\|_n^2/4 \ \ge\ f_{\min}^2\,M(\hat\beta_L)\,r^2/4.
\]
Since the matrices $X^TX/n$ and $XX^T/n$ have the same maximal eigenvalues,
\[
\frac1{n^2}(f - X\hat\beta_L)^TXX^T(f - X\hat\beta_L) \ \le\ \frac{\phi_{\max}}n\|f - X\hat\beta_L\|_2^2 \ =\ \phi_{\max}\|f - \hat f_L\|_n^2,
\]
and we deduce (B.3) from the last two displays.

Corollary B.2. Let the assumptions of Lemma B.1 be satisfied and $\|f_j\|_n = 1$, $j = 1, \dots, M$. Consider the linear regression model $y = X\beta + W$. Then, with probability at least $1 - M^{1-A^2/8}$, we have
\[
\|\delta_{J^c}\|_1 \ \le\ 3\|\delta_J\|_1,
\]
where $J = J(\beta)$ is the set of non-zero coefficients of $\beta$ and $\delta = \hat\beta_L - \beta$.

Proof. Use the first inequality in (B.1) and the fact that $f = f_\beta$ for the linear regression model.

Lemma B.3. Let $\beta \in \mathbb R^M$ satisfy the Dantzig constraint
\[
\Big\|\frac1n D^{-1/2}X^T(y - X\beta)\Big\|_\infty \ \le\ r,
\]
and set $\delta = \hat\beta_D - \beta$, $J = J(\beta)$. Then
\[
(B.9)\qquad \|\delta_{J^c}\|_1 \ \le\ \|\delta_J\|_1.
\]
Further, let the assumptions of Lemma B.1 be satisfied with $A > \sqrt 2$. Then with probability at least $1 - M^{1-A^2/2}$ we have
\[
(B.10)\qquad \Big\|\frac1n X^T(f - X\hat\beta_D)\Big\|_\infty \ \le\ 2rf_{\max}.
\]

Proof of Lemma B.3. Inequality (B.9) follows immediately from the definition of the Dantzig selector, cf. [7]. To prove (B.10), consider the event
\[
\mathcal B = \Big\{\Big\|\frac1n D^{-1/2}X^TW\Big\|_\infty \le r\Big\} = \bigcap_{j=1}^M\{|V_j| \le r_{n,j}\}.
\]

Analogously to (B.4), $P\{\mathcal B^c\} \le M^{1-A^2/2}$. On the other hand, $y = f + W$, and using the definition of the Dantzig selector it is easy to see that (B.10) is satisfied on $\mathcal B$.

Proof of Theorem 5.1. Set $\delta = \hat\beta_L - \hat\beta_D$. We have
\[
\frac1n\|f - X\hat\beta_L\|_2^2 \ =\ \frac1n\|f - X\hat\beta_D\|_2^2 - \frac2n\delta^TX^T(f - X\hat\beta_D) + \frac1n\|X\delta\|_2^2.
\]
This and (B.10) yield
\[
(B.11)\qquad \|\hat f_D - f\|_n^2 \ \le\ \|\hat f_L - f\|_n^2 + \frac2n\big|\delta^TX^T(f - X\hat\beta_D)\big| - \frac1n\|X\delta\|_2^2
 \ \le\ \|\hat f_L - f\|_n^2 + 4f_{\max}r\|\delta\|_1 - \frac1n\|X\delta\|_2^2,
\]
where the last inequality holds with probability at least $1 - M^{1-A^2/2}$. Since the Lasso solution $\hat\beta_L$ satisfies the Dantzig constraint, we can apply Lemma B.3 with $\beta = \hat\beta_L$, which yields
\[
(B.12)\qquad \|\delta_{J^c}\|_1 \ \le\ \|\delta_J\|_1, \qquad\text{with } J = J(\hat\beta_L).
\]
By Assumption RE($s,1$) we get
\[
(B.13)\qquad \frac1{\sqrt n}\|X\delta\|_2 \ \ge\ \kappa\|\delta_J\|_2, \qquad\text{where } \kappa = \kappa(s,1).
\]
Using (B.12) and (B.13) we obtain
\[
(B.14)\qquad \|\delta\|_1 \ \le\ 2\|\delta_J\|_1 \ \le\ 2M^{1/2}(\hat\beta_L)\|\delta_J\|_2 \ \le\ \frac{2M^{1/2}(\hat\beta_L)}{\kappa\sqrt n}\|X\delta\|_2.
\]
Finally, from (B.11) and (B.14) we get that, with probability at least $1 - M^{1-A^2/2}$,
\[
(B.15)\qquad \|\hat f_D - f\|_n^2 \ \le\ \|\hat f_L - f\|_n^2 + \frac{8f_{\max}rM^{1/2}(\hat\beta_L)}{\kappa\sqrt n}\|X\delta\|_2 - \frac1n\|X\delta\|_2^2
 \ \le\ \|\hat f_L - f\|_n^2 + \frac{16f_{\max}^2r^2M(\hat\beta_L)}{\kappa^2},
\]
where the RHS follows from (B.12), (B.11) and another application of (B.14). This proves one side of the inequality.
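The step "the Lasso solution satisfies the Dantzig constraint" used above is just the Lasso KKT system (B.6): with normalized columns, every coordinate of $X^T(y - X\hat\beta_L)/n$ is bounded by $r$ in absolute value. A small self-contained illustration on simulated data, using our own proximal-gradient (ISTA) Lasso solver rather than the authors' code:

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(X, y, r, n_iter=10000):
    """Minimize (1/n)||y - X b||^2 + 2 r ||b||_1 by proximal gradient descent."""
    n, M = X.shape
    step = n / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    b = np.zeros(M)
    for _ in range(n_iter):
        grad = -2.0 / n * X.T @ (y - X @ b)
        b = soft(b - step * grad, 2 * r * step)
    return b

rng = np.random.default_rng(1)
n, M, s = 100, 50, 5
X = rng.standard_normal((n, M))
X /= np.linalg.norm(X, axis=0) / np.sqrt(n)     # normalize so that ||f_j||_n = 1
beta_star = np.zeros(M); beta_star[:s] = 1.0
y = X @ beta_star + 0.5 * rng.standard_normal(n)

r = 0.3
b_lasso = lasso_ista(X, y, r)
# the Lasso KKT conditions imply the Dantzig constraint ||X^T(y - X b)/n||_inf <= r
residual_corr = np.abs(X.T @ (y - X @ b_lasso) / n).max()
print(residual_corr <= r + 1e-6)  # True
```

The sizes, noise level and tuning value here are made up for illustration; only the constraint check mirrors the argument in the proof.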

To show the other side of the bound on the difference, we act as in (B.11), up to the inversion of the roles of $\hat\beta_L$ and $\hat\beta_D$, and we use (B.2). This yields that, with probability at least $1 - M^{1-A^2/8}$,
\[
(B.16)\qquad \|\hat f_L - f\|_n^2 \ \le\ \|\hat f_D - f\|_n^2 + \frac2n\big|\delta^TX^T(f - X\hat\beta_L)\big| - \frac1n\|X\delta\|_2^2
 \ \le\ \|\hat f_D - f\|_n^2 + 3f_{\max}r\|\delta\|_1 - \frac1n\|X\delta\|_2^2.
\]
This is analogous to (B.11). Paralleling now the proof leading to (B.15) we obtain
\[
(B.17)\qquad \|\hat f_L - f\|_n^2 \ \le\ \|\hat f_D - f\|_n^2 + \frac{9f_{\max}^2r^2M(\hat\beta_L)}{\kappa^2}.
\]
The theorem now follows from (B.15) and (B.17).

Proof of Theorem 5.2. Set again $\delta = \hat\beta_L - \hat\beta_D$. We apply (B.1) with $\beta = \hat\beta_D$, which yields that, with probability at least $1 - M^{1-A^2/8}$,
\[
(B.18)\qquad \|\delta\|_1 \ \le\ 4\|\delta_J\|_1 + \|\hat f_D - f\|_n^2/r,
\]
where now $J = J(\hat\beta_D)$. Consider the two cases: (i) $\|\hat f_D - f\|_n^2 > 2r\|\delta_J\|_1$ and (ii) $\|\hat f_D - f\|_n^2 \le 2r\|\delta_J\|_1$. In case (i), inequality (B.16) with $f_{\max} = 1$ immediately implies $\|\hat f_L - f\|_n^2 \le 10\|\hat f_D - f\|_n^2$, and the theorem follows. In case (ii) we get from (B.18) that $\|\delta\|_1 \le 6\|\delta_J\|_1$, and thus $\|\delta_{J^c}\|_1 \le 5\|\delta_J\|_1$. We can therefore apply Assumption RE($s,5$), which yields, similarly to (B.14),
\[
(B.19)\qquad \|\delta\|_1 \ \le\ 6M^{1/2}(\hat\beta_D)\|\delta_J\|_2 \ \le\ \frac{6M^{1/2}(\hat\beta_D)}{\kappa\sqrt n}\|X\delta\|_2,
\]
where $\kappa = \kappa(s,5)$. Plugging (B.19) into (B.16) we finally get that, in case (ii),
\[
(B.20)\qquad \|\hat f_L - f\|_n^2 \ \le\ \|\hat f_D - f\|_n^2 + \frac{18rM^{1/2}(\hat\beta_D)}{\kappa\sqrt n}\|X\delta\|_2 - \frac1n\|X\delta\|_2^2
 \ \le\ \|\hat f_D - f\|_n^2 + \frac{81r^2M(\hat\beta_D)}{\kappa^2}.
\]
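For completeness of the Lasso/Dantzig comparison: the Dantzig selector itself is computable as a linear program, minimizing $\|\beta\|_1$ subject to $\|X^T(y - X\beta)/n\|_\infty \le r$. A hedged sketch using the standard split $\beta = u - v$ with $u, v \ge 0$ (this LP reformulation and the use of scipy's `linprog` are our choices, not taken from the paper; scipy is assumed available):

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, r):
    """Solve min ||b||_1  s.t.  ||X^T (y - X b) / n||_inf <= r  as a linear program."""
    n, M = X.shape
    G = X.T @ X / n
    c = X.T @ y / n
    # variables z = [u; v] >= 0 with b = u - v; objective sum(u) + sum(v) = ||b||_1
    A_ub = np.block([[G, -G], [-G, G]])
    b_ub = np.concatenate([c + r, r - c])
    res = linprog(np.ones(2 * M), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * M), method="highs")
    z = res.x
    return z[:M] - z[M:]

rng = np.random.default_rng(2)
n, M, s = 80, 40, 4
X = rng.standard_normal((n, M))
X /= np.linalg.norm(X, axis=0) / np.sqrt(n)
beta_star = np.zeros(M); beta_star[:s] = 2.0
y = X @ beta_star + 0.25 * rng.standard_normal(n)

r = 0.2
b_D = dantzig_selector(X, y, r)
feas = np.abs(X.T @ (y - X @ b_D) / n).max()
print(feas <= r + 1e-6)  # feasibility of the solution: True
```

Since the true vector is feasible here with high probability, the returned solution has $\ell_1$ norm no larger than that of the truth, which is the defining property the proofs exploit.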

Proof of Theorem 6.1. Fix an arbitrary $\beta \in \mathbb R^M$ with $M(\beta) \le s$. Set $\delta = D^{1/2}(\hat\beta_L - \beta)$, $J = J(\beta)$. On the event $\mathcal A$, we get from the first line in (B.1) that
\[
(B.21)\qquad \|\hat f_L - f\|_n^2 + r\|\delta\|_1 \ \le\ \|f_\beta - f\|_n^2 + 4r\sum_{j\in J}\|f_j\|_n|\hat\beta_{j,L} - \beta_j| \ =\ \|f_\beta - f\|_n^2 + 4r\|\delta_J\|_1,
\]
and from the second line in (B.1) that
\[
(B.22)\qquad \|\hat f_L - f\|_n^2 \ \le\ \|f_\beta - f\|_n^2 + 4r\sqrt{M(\beta)}\,\|\delta_J\|_2.
\]
Consider separately the cases where
\[
(B.23)\qquad 4r\|\delta_J\|_1 \ \le\ \varepsilon\|f_\beta - f\|_n^2
\]
and
\[
(B.24)\qquad \varepsilon\|f_\beta - f\|_n^2 \ <\ 4r\|\delta_J\|_1.
\]
In case (B.23), the result of the theorem trivially follows from (B.21). So, we will only consider the case (B.24). All the subsequent inequalities are valid on the event $\mathcal A \cap \mathcal A_1$, where $\mathcal A_1$ is defined by (B.24). On this event we get from (B.21) that $\|\delta\|_1 \le 4(1 + 1/\varepsilon)\|\delta_J\|_1$, which implies $\|\delta_{J^c}\|_1 \le (3 + 4/\varepsilon)\|\delta_J\|_1$. We now use Assumption RE($s, 3 + 4/\varepsilon$). This yields
\[
\kappa^2\|\delta_J\|_2^2 \ \le\ \frac1n\|X\delta\|_2^2 \ =\ \frac1n(\hat\beta_L - \beta)^TD^{1/2}X^TXD^{1/2}(\hat\beta_L - \beta)
 \ \le\ \frac{f_{\max}^2}n(\hat\beta_L - \beta)^TX^TX(\hat\beta_L - \beta) \ =\ f_{\max}^2\|\hat f_L - f_\beta\|_n^2,
\]
where $\kappa = \kappa(s, 3 + 4/\varepsilon)$. Combining this with (B.22) we find
\[
\|\hat f_L - f\|_n^2 \ \le\ \|f_\beta - f\|_n^2 + \frac{4rf_{\max}}{\kappa}\sqrt{M(\beta)}\,\|\hat f_L - f_\beta\|_n
 \ \le\ \|f_\beta - f\|_n^2 + \frac{4rf_{\max}}{\kappa}\sqrt{M(\beta)}\,\big(\|\hat f_L - f\|_n + \|f_\beta - f\|_n\big).
\]
This inequality is of the same form as (A.4) in [4]. A standard decoupling argument as in [4], using the inequality $2xy \le x^2/b + by^2$ with $b > 1$, $x = r\kappa^{-1}\sqrt{M(\beta)}$, and $y$ being either $\|\hat f_L - f\|_n$ or $\|f_\beta - f\|_n$, yields
\[
\|\hat f_L - f\|_n^2 \ \le\ \frac{b+1}{b-1}\|f_\beta - f\|_n^2 + \frac{8b^2f_{\max}^2}{(b-1)\kappa^2}\,r^2M(\beta), \qquad b > 1.
\]

Taking $b = 1 + 2/\varepsilon$ in the last display finishes the proof of the theorem.

Proof of Proposition 6.3. Due to the weak sparsity assumption there exists $\bar\beta \in \mathbb R^M$ with $M(\bar\beta) \le s$ such that
\[
\|f_{\bar\beta} - f\|_n^2 \ \le\ C\,\frac{f_{\max}^2r^2}{\kappa^2}\,M(\bar\beta),
\]
where $\kappa = \kappa(s, 3 + 4/\varepsilon)$ is the same as in Theorem 6.1. Using this together with Theorem 6.1 and (B.3) we obtain that, with probability at least $1 - M^{1-A^2/8}$,
\[
M(\hat\beta_L) \ \le\ C_1(\varepsilon)M(\bar\beta) \ \le\ C_1(\varepsilon)s.
\]
This and Theorem 5.1 imply
\[
\|\hat f_D - f\|_n^2 \ \le\ \|\hat f_L - f\|_n^2 + \frac{16C_1(\varepsilon)f_{\max}^2A^2\sigma^2}{\kappa^2}\Big(\frac{s\log M}n\Big),
\]
where now $\kappa = \kappa(\max(C_1(\varepsilon), 1)s,\ 3 + 4/\varepsilon)$. Applying Theorem 6.1 once again we get the result.

Proof of Theorem 7.1. Set $\delta = \hat\beta_D - \beta^*$ and $J = J(\beta^*)$. Using Lemma B.3 with $\beta = \beta^*$ we get that on the event $\mathcal B$ (i.e., with probability at least $1 - M^{1-A^2/2}$): (i) $\|\frac1nX^TX\delta\|_\infty \le 2r$, and (ii) inequality (4.1) holds with $c_0 = 1$. Therefore, on $\mathcal B$ we have
\[
(B.25)\qquad \frac1n\|X\delta\|_2^2 \ =\ \frac1n\delta^TX^TX\delta \ \le\ \Big\|\frac1nX^TX\delta\Big\|_\infty\|\delta\|_1
 \ \le\ 2r\big(\|\delta_J\|_1 + \|\delta_{J^c}\|_1\big) \ \le\ 2(1 + c_0)r\|\delta_J\|_1 \ \le\ 2(1 + c_0)r\sqrt s\,\|\delta_J\|_2 \ =\ 4r\sqrt s\,\|\delta_J\|_2,
\]
since $c_0 = 1$. From Assumption RE($s,1$) we get that $\frac1n\|X\delta\|_2^2 \ge \kappa^2\|\delta_J\|_2^2$, where $\kappa = \kappa(s,1)$. This and (B.25) yield that, on $\mathcal B$,
\[
(B.26)\qquad \frac1n\|X\delta\|_2^2 \ \le\ 16r^2s/\kappa^2, \qquad \|\delta_J\|_2 \ \le\ 4r\sqrt s/\kappa^2.
\]
The first inequality in (B.26) implies (7.5). Next, (7.4) is straightforward in view of the second inequality in (B.26) and of the following relations (with $c_0 = 1$):
\[
(B.27)\qquad \|\delta\|_1 \ =\ \|\delta_J\|_1 + \|\delta_{J^c}\|_1 \ \le\ (1 + c_0)\|\delta_J\|_1 \ \le\ (1 + c_0)\sqrt s\,\|\delta_J\|_2
\]

that hold on $\mathcal B$. It remains to prove (7.6). It is easy to see that the $k$th largest in absolute value element of $\delta_{J^c}$ satisfies $|\delta_{J^c}|_{(k)} \le \|\delta_{J^c}\|_1/k$. Thus
\[
\|\delta_{J_m^c}\|_2^2 \ \le\ \|\delta_{J^c}\|_1^2\sum_{k\ge m+1}\frac1{k^2} \ \le\ \frac{\|\delta_{J^c}\|_1^2}m,
\]
and since (4.1) holds on $\mathcal B$ (with $c_0 = 1$) we find
\[
\|\delta_{J_m^c}\|_2 \ \le\ \frac{\|\delta_{J^c}\|_1}{\sqrt m} \ \le\ \frac{c_0\|\delta_J\|_1}{\sqrt m} \ \le\ c_0\sqrt{\frac sm}\,\|\delta_J\|_2 \ \le\ c_0\sqrt{\frac sm}\,\|\delta_{J_m}\|_2.
\]
Therefore, on $\mathcal B$,
\[
(B.28)\qquad \|\delta\|_2 \ \le\ \Big(1 + c_0\sqrt{\frac sm}\Big)\|\delta_{J_m}\|_2.
\]
On the other hand, it follows from (B.25) that $\frac1n\|X\delta\|_2^2 \le 4r\sqrt s\,\|\delta_{J_m}\|_2$. Combining this inequality with Assumption RE($s, m, 1$) we obtain that, on $\mathcal B$,
\[
\|\delta_{J_m}\|_2 \ \le\ 4r\sqrt s/\kappa^2,
\]
where $\kappa = \kappa(s, m, 1)$. Recalling that $c_0 = 1$ and applying the last inequality together with (B.28) we get
\[
(B.29)\qquad \|\delta\|_2^2 \ \le\ 16\Big(1 + c_0\sqrt{\frac sm}\Big)^2\big(r\sqrt s/\kappa^2\big)^2.
\]
It remains to note that (7.6) is a direct consequence of (7.4) and (B.29). This follows from the fact that the inequalities $\sum_{j=1}^M|a_j| \le b_1$ and $\sum_{j=1}^M a_j^2 \le b_2^2$ imply
\[
\sum_{j=1}^M|a_j|^p \ =\ \sum_{j=1}^M|a_j|^{2-p}|a_j|^{2(p-1)} \ \le\ \Big(\sum_{j=1}^M|a_j|\Big)^{2-p}\Big(\sum_{j=1}^Ma_j^2\Big)^{p-1} \ \le\ b_1^{2-p}b_2^{2(p-1)}, \qquad 1 < p \le 2.
\]

Proof of Theorem 7.2. Set $\delta = \hat\beta_L - \beta^*$ and $J = J(\beta^*)$. Using (B.1), where we put $\beta = \beta^*$, $r_{n,j} \equiv r$ and $\|f_\beta - f\|_n = 0$, we get that, on the event $\mathcal A$,
\[
(B.30)\qquad \frac1n\|X\delta\|_2^2 \ \le\ 4r\sqrt s\,\|\delta_J\|_2,
\]

and (4.1) holds with $c_0 = 3$ on the same event. Thus, by Assumption RE($s,3$) and the last inequality, we obtain that, on $\mathcal A$,
\[
(B.31)\qquad \frac1n\|X\delta\|_2^2 \ \le\ 16r^2s/\kappa^2, \qquad \|\delta_J\|_2 \ \le\ 4r\sqrt s/\kappa^2,
\]
where $\kappa = \kappa(s,3)$. The first inequality here coincides with (7.8). Next, (7.9) follows immediately from (B.31) and (7.8). To show (7.7), it suffices to note that on the event $\mathcal A$ the relations (B.27) hold with $c_0 = 3$; it then remains to apply the second inequality in (B.31). Finally, the proof of (7.10) follows exactly the same lines as that of (7.6): the only difference is that one should set $c_0 = 3$ in (B.28) and (B.29), as well as in the display preceding (B.28).

REFERENCES

[1] Bickel, P.J. (2007). Discussion of "The Dantzig selector: statistical estimation when p is much larger than n", by Candès and Tao. Annals of Statistics, 35.

[2] Bunea, F., Tsybakov, A.B. and Wegkamp, M.H. (2004). Aggregation for regression learning. Preprint LPMA, Universities Paris 6-Paris 7, n. 948, available at arXiv:math.ST/0410214.

[3] Bunea, F., Tsybakov, A.B. and Wegkamp, M.H. (2006). Aggregation and sparsity via $\ell_1$ penalized least squares. Proceedings of 19th Annual Conference on Learning Theory (COLT 2006), Lecture Notes in Artificial Intelligence v. 4005 (Lugosi, G. and Simon, H.U., eds.), Springer-Verlag, Berlin-Heidelberg.

[4] Bunea, F., Tsybakov, A.B. and Wegkamp, M.H. (2007). Aggregation for Gaussian regression. Annals of Statistics, 35.

[5] Bunea, F., Tsybakov, A.B. and Wegkamp, M.H. (2007). Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics.

[6] Bunea, F., Tsybakov, A.B. and Wegkamp, M.H. (2007). Sparse density estimation with $\ell_1$ penalties. Proceedings of 20th Annual Conference on Learning Theory (COLT 2007), Lecture Notes in Artificial Intelligence v. 4539 (N.H. Bshouty and C. Gentile, eds.), Springer-Verlag, Berlin-Heidelberg.

[7] Candès, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, 35.

[8] Donoho, D.L., Elad, M. and Temlyakov, V. (2006). Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. on Information Theory.

[9] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Annals of Statistics.

[10] Friedman, J., Hastie, T., Höfling, H. and Tibshirani, R. (2007). Pathwise coordinate optimization. Annals of Applied Statistics.

[11] Fu, W. and Knight, K. (2000). Asymptotics for Lasso-type estimators. Annals of Statistics.

[12] Greenshtein, E. and Ritov, Y. (2004). Persistency in high dimensional linear predictor-selection and the virtue of over-parametrization. Bernoulli.

[13] Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric estimation. Annals of Statistics.

[14] Koltchinskii, V. (2006). Sparsity in penalized empirical risk minimization. Annales de l'IHP, to appear.

[15] Koltchinskii, V. (2007). Dantzig selector and sparsity oracle inequalities. Manuscript.

[16] Meier, L., van de Geer, S. and Bühlmann, P. (2008). The Group Lasso for logistic regression. J. Royal Statistical Society, Series B.

[17] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Annals of Statistics.

[18] Meinshausen, N. and Yu, B. (2006). Lasso-type recovery of sparse representations for high dimensional data. Ann. Statist., to appear.

[19] Nemirovski, A. (2000). Topics in Non-parametric Statistics. Ecole d'Eté de Probabilités de Saint-Flour XXVIII - 1998, Lecture Notes in Mathematics, v. 1738, Springer: New York.

[20] Osborne, M.R., Presnell, B. and Turlach, B.A. (2000a). On the Lasso and its dual. Journal of Computational and Graphical Statistics.

[21] Osborne, M.R., Presnell, B. and Turlach, B.A. (2000b). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis.

[22] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B.

[23] Tsybakov, A.B. (2006). Discussion of "Regularization in Statistics", by P. Bickel and B. Li. TEST.

[24] Turlach, B.A. (2005). On algorithms for solving least squares problems under an $L_1$ penalty or an $L_1$ constraint. 2004 Proceedings of the American Statistical Association, Statistical Computing Section [CD-ROM], American Statistical Association, Alexandria, VA.

[25] van de Geer, S.A. (2006). High dimensional generalized linear models and the Lasso. Research report No. 133, Seminar für Statistik, ETH, Zürich. Ann. Statist., to appear.

[26] Zhang, C.-H. and Huang, J. (2006). Model-selection consistency of the Lasso in high-dimensional regression. Ann. Statist., to appear.

[27] Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research.

Department of Statistics
University of California at Berkeley, CA, USA
bickel@stat.berkeley.edu

Jerusalem, Israel
yaacov.ritov@gmail.com

Laboratoire de Probabilités et Modèles Aléatoires
Université Paris VI, France
tsybakov@ccr.jussieu.fr


More information

1 Computing the Standard Deviation of Sample Means

1 Computing the Standard Deviation of Sample Means Computig the Stadard Deviatio of Sample Meas Quality cotrol charts are based o sample meas ot o idividual values withi a sample. A sample is a group of items, which are cosidered all together for our aalysis.

More information

Infinite Sequences and Series

Infinite Sequences and Series CHAPTER 4 Ifiite Sequeces ad Series 4.1. Sequeces A sequece is a ifiite ordered list of umbers, for example the sequece of odd positive itegers: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29...

More information

THE HEIGHT OF q-binary SEARCH TREES

THE HEIGHT OF q-binary SEARCH TREES THE HEIGHT OF q-binary SEARCH TREES MICHAEL DRMOTA AND HELMUT PRODINGER Abstract. q biary search trees are obtaied from words, equipped with the geometric distributio istead of permutatios. The average

More information

Overview of some probability distributions.

Overview of some probability distributions. Lecture Overview of some probability distributios. I this lecture we will review several commo distributios that will be used ofte throughtout the class. Each distributio is usually described by its probability

More information

Plug-in martingales for testing exchangeability on-line

Plug-in martingales for testing exchangeability on-line Plug-i martigales for testig exchageability o-lie Valetia Fedorova, Alex Gammerma, Ilia Nouretdiov, ad Vladimir Vovk Computer Learig Research Cetre Royal Holloway, Uiversity of Lodo, UK {valetia,ilia,alex,vovk}@cs.rhul.ac.uk

More information

Section 11.3: The Integral Test

Section 11.3: The Integral Test Sectio.3: The Itegral Test Most of the series we have looked at have either diverged or have coverged ad we have bee able to fid what they coverge to. I geeral however, the problem is much more difficult

More information

Our aim is to show that under reasonable assumptions a given 2π-periodic function f can be represented as convergent series

Our aim is to show that under reasonable assumptions a given 2π-periodic function f can be represented as convergent series 8 Fourier Series Our aim is to show that uder reasoable assumptios a give -periodic fuctio f ca be represeted as coverget series f(x) = a + (a cos x + b si x). (8.) By defiitio, the covergece of the series

More information

1. C. The formula for the confidence interval for a population mean is: x t, which was

1. C. The formula for the confidence interval for a population mean is: x t, which was s 1. C. The formula for the cofidece iterval for a populatio mea is: x t, which was based o the sample Mea. So, x is guarateed to be i the iterval you form.. D. Use the rule : p-value

More information

Output Analysis (2, Chapters 10 &11 Law)

Output Analysis (2, Chapters 10 &11 Law) B. Maddah ENMG 6 Simulatio 05/0/07 Output Aalysis (, Chapters 10 &11 Law) Comparig alterative system cofiguratio Sice the output of a simulatio is radom, the comparig differet systems via simulatio should

More information

S. Tanny MAT 344 Spring 1999. be the minimum number of moves required.

S. Tanny MAT 344 Spring 1999. be the minimum number of moves required. S. Tay MAT 344 Sprig 999 Recurrece Relatios Tower of Haoi Let T be the miimum umber of moves required. T 0 = 0, T = 7 Iitial Coditios * T = T + $ T is a sequece (f. o itegers). Solve for T? * is a recurrece,

More information

Notes on exponential generating functions and structures.

Notes on exponential generating functions and structures. Notes o expoetial geeratig fuctios ad structures. 1. The cocept of a structure. Cosider the followig coutig problems: (1) to fid for each the umber of partitios of a -elemet set, (2) to fid for each the

More information

How To Solve The Homewor Problem Beautifully

How To Solve The Homewor Problem Beautifully Egieerig 33 eautiful Homewor et 3 of 7 Kuszmar roblem.5.5 large departmet store sells sport shirts i three sizes small, medium, ad large, three patters plaid, prit, ad stripe, ad two sleeve legths log

More information

CS103A Handout 23 Winter 2002 February 22, 2002 Solving Recurrence Relations

CS103A Handout 23 Winter 2002 February 22, 2002 Solving Recurrence Relations CS3A Hadout 3 Witer 00 February, 00 Solvig Recurrece Relatios Itroductio A wide variety of recurrece problems occur i models. Some of these recurrece relatios ca be solved usig iteratio or some other ad

More information

Lesson 17 Pearson s Correlation Coefficient

Lesson 17 Pearson s Correlation Coefficient Outlie Measures of Relatioships Pearso s Correlatio Coefficiet (r) -types of data -scatter plots -measure of directio -measure of stregth Computatio -covariatio of X ad Y -uique variatio i X ad Y -measurig

More information

LECTURE 13: Cross-validation

LECTURE 13: Cross-validation LECTURE 3: Cross-validatio Resampli methods Cross Validatio Bootstrap Bias ad variace estimatio with the Bootstrap Three-way data partitioi Itroductio to Patter Aalysis Ricardo Gutierrez-Osua Texas A&M

More information

FIBONACCI NUMBERS: AN APPLICATION OF LINEAR ALGEBRA. 1. Powers of a matrix

FIBONACCI NUMBERS: AN APPLICATION OF LINEAR ALGEBRA. 1. Powers of a matrix FIBONACCI NUMBERS: AN APPLICATION OF LINEAR ALGEBRA. Powers of a matrix We begi with a propositio which illustrates the usefuless of the diagoalizatio. Recall that a square matrix A is diogaalizable if

More information

CHAPTER 3 DIGITAL CODING OF SIGNALS

CHAPTER 3 DIGITAL CODING OF SIGNALS CHAPTER 3 DIGITAL CODING OF SIGNALS Computers are ofte used to automate the recordig of measuremets. The trasducers ad sigal coditioig circuits produce a voltage sigal that is proportioal to a quatity

More information

where: T = number of years of cash flow in investment's life n = the year in which the cash flow X n i = IRR = the internal rate of return

where: T = number of years of cash flow in investment's life n = the year in which the cash flow X n i = IRR = the internal rate of return EVALUATING ALTERNATIVE CAPITAL INVESTMENT PROGRAMS By Ke D. Duft, Extesio Ecoomist I the March 98 issue of this publicatio we reviewed the procedure by which a capital ivestmet project was assessed. The

More information

Confidence Intervals for One Mean

Confidence Intervals for One Mean Chapter 420 Cofidece Itervals for Oe Mea Itroductio This routie calculates the sample size ecessary to achieve a specified distace from the mea to the cofidece limit(s) at a stated cofidece level for a

More information

THIN SEQUENCES AND THE GRAM MATRIX PAMELA GORKIN, JOHN E. MCCARTHY, SANDRA POTT, AND BRETT D. WICK

THIN SEQUENCES AND THE GRAM MATRIX PAMELA GORKIN, JOHN E. MCCARTHY, SANDRA POTT, AND BRETT D. WICK THIN SEQUENCES AND THE GRAM MATRIX PAMELA GORKIN, JOHN E MCCARTHY, SANDRA POTT, AND BRETT D WICK Abstract We provide a ew proof of Volberg s Theorem characterizig thi iterpolatig sequeces as those for

More information

University of California, Los Angeles Department of Statistics. Distributions related to the normal distribution

University of California, Los Angeles Department of Statistics. Distributions related to the normal distribution Uiversity of Califoria, Los Ageles Departmet of Statistics Statistics 100B Istructor: Nicolas Christou Three importat distributios: Distributios related to the ormal distributio Chi-square (χ ) distributio.

More information

A Recursive Formula for Moments of a Binomial Distribution

A Recursive Formula for Moments of a Binomial Distribution A Recursive Formula for Momets of a Biomial Distributio Árpád Béyi beyi@mathumassedu, Uiversity of Massachusetts, Amherst, MA 01003 ad Saverio M Maago smmaago@psavymil Naval Postgraduate School, Moterey,

More information

INFINITE SERIES KEITH CONRAD

INFINITE SERIES KEITH CONRAD INFINITE SERIES KEITH CONRAD. Itroductio The two basic cocepts of calculus, differetiatio ad itegratio, are defied i terms of limits (Newto quotiets ad Riema sums). I additio to these is a third fudametal

More information

Universal coding for classes of sources

Universal coding for classes of sources Coexios module: m46228 Uiversal codig for classes of sources Dever Greee This work is produced by The Coexios Project ad licesed uder the Creative Commos Attributio Licese We have discussed several parametric

More information

Totally Corrective Boosting Algorithms that Maximize the Margin

Totally Corrective Boosting Algorithms that Maximize the Margin Mafred K. Warmuth mafred@cse.ucsc.edu Ju Liao liaoju@cse.ucsc.edu Uiversity of Califoria at Sata Cruz, Sata Cruz, CA 95064, USA Guar Rätsch Guar.Raetsch@tuebige.mpg.de Friedrich Miescher Laboratory of

More information

Perfect Packing Theorems and the Average-Case Behavior of Optimal and Online Bin Packing

Perfect Packing Theorems and the Average-Case Behavior of Optimal and Online Bin Packing SIAM REVIEW Vol. 44, No. 1, pp. 95 108 c 2002 Society for Idustrial ad Applied Mathematics Perfect Packig Theorems ad the Average-Case Behavior of Optimal ad Olie Bi Packig E. G. Coffma, Jr. C. Courcoubetis

More information

THE ARITHMETIC OF INTEGERS. - multiplication, exponentiation, division, addition, and subtraction

THE ARITHMETIC OF INTEGERS. - multiplication, exponentiation, division, addition, and subtraction THE ARITHMETIC OF INTEGERS - multiplicatio, expoetiatio, divisio, additio, ad subtractio What to do ad what ot to do. THE INTEGERS Recall that a iteger is oe of the whole umbers, which may be either positive,

More information

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies ( 3.1.1) Limitations of Experiments. Pseudocode ( 3.1.2) Theoretical Analysis

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies ( 3.1.1) Limitations of Experiments. Pseudocode ( 3.1.2) Theoretical Analysis Ruig Time ( 3.) Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects.

More information

Solutions to Selected Problems In: Pattern Classification by Duda, Hart, Stork

Solutions to Selected Problems In: Pattern Classification by Duda, Hart, Stork Solutios to Selected Problems I: Patter Classificatio by Duda, Hart, Stork Joh L. Weatherwax February 4, 008 Problem Solutios Chapter Bayesia Decisio Theory Problem radomized rules Part a: Let Rx be the

More information

5: Introduction to Estimation

5: Introduction to Estimation 5: Itroductio to Estimatio Cotets Acroyms ad symbols... 1 Statistical iferece... Estimatig µ with cofidece... 3 Samplig distributio of the mea... 3 Cofidece Iterval for μ whe σ is kow before had... 4 Sample

More information

. P. 4.3 Basic feasible solutions and vertices of polyhedra. x 1. x 2

. P. 4.3 Basic feasible solutions and vertices of polyhedra. x 1. x 2 4. Basic feasible solutios ad vertices of polyhedra Due to the fudametal theorem of Liear Programmig, to solve ay LP it suffices to cosider the vertices (fiitely may) of the polyhedro P of the feasible

More information

Approximating Area under a curve with rectangles. To find the area under a curve we approximate the area using rectangles and then use limits to find

Approximating Area under a curve with rectangles. To find the area under a curve we approximate the area using rectangles and then use limits to find 1.8 Approximatig Area uder a curve with rectagles 1.6 To fid the area uder a curve we approximate the area usig rectagles ad the use limits to fid 1.4 the area. Example 1 Suppose we wat to estimate 1.

More information

Convex Bodies of Minimal Volume, Surface Area and Mean Width with Respect to Thin Shells

Convex Bodies of Minimal Volume, Surface Area and Mean Width with Respect to Thin Shells Caad. J. Math. Vol. 60 (1), 2008 pp. 3 32 Covex Bodies of Miimal Volume, Surface Area ad Mea Width with Respect to Thi Shells Károly Böröczky, Károly J. Böröczky, Carste Schütt, ad Gergely Witsche Abstract.

More information

, a Wishart distribution with n -1 degrees of freedom and scale matrix.

, a Wishart distribution with n -1 degrees of freedom and scale matrix. UMEÅ UNIVERSITET Matematisk-statistiska istitutioe Multivariat dataaalys D MSTD79 PA TENTAMEN 004-0-9 LÖSNINGSFÖRSLAG TILL TENTAMEN I MATEMATISK STATISTIK Multivariat dataaalys D, 5 poäg.. Assume that

More information

3. Greatest Common Divisor - Least Common Multiple

3. Greatest Common Divisor - Least Common Multiple 3 Greatest Commo Divisor - Least Commo Multiple Defiitio 31: The greatest commo divisor of two atural umbers a ad b is the largest atural umber c which divides both a ad b We deote the greatest commo gcd

More information

Parallel Coordinate Descent Methods for Big Data Optimization

Parallel Coordinate Descent Methods for Big Data Optimization Parallel Coordiate Descet Methods for Big Data Optimizatio Peter Richtárik Marti Taká School of Mathematics Uiversity of Ediburgh Uited Kigdom November 23, 202 updated o December 4, 202 Abstract I this

More information

CHAPTER 3 THE TIME VALUE OF MONEY

CHAPTER 3 THE TIME VALUE OF MONEY CHAPTER 3 THE TIME VALUE OF MONEY OVERVIEW A dollar i the had today is worth more tha a dollar to be received i the future because, if you had it ow, you could ivest that dollar ad ear iterest. Of all

More information

The following example will help us understand The Sampling Distribution of the Mean. C1 C2 C3 C4 C5 50 miles 84 miles 38 miles 120 miles 48 miles

The following example will help us understand The Sampling Distribution of the Mean. C1 C2 C3 C4 C5 50 miles 84 miles 38 miles 120 miles 48 miles The followig eample will help us uderstad The Samplig Distributio of the Mea Review: The populatio is the etire collectio of all idividuals or objects of iterest The sample is the portio of the populatio

More information

Analysis Notes (only a draft, and the first one!)

Analysis Notes (only a draft, and the first one!) Aalysis Notes (oly a draft, ad the first oe!) Ali Nesi Mathematics Departmet Istabul Bilgi Uiversity Kuştepe Şişli Istabul Turkey aesi@bilgi.edu.tr Jue 22, 2004 2 Cotets 1 Prelimiaries 9 1.1 Biary Operatio...........................

More information

Estimating Probability Distributions by Observing Betting Practices

Estimating Probability Distributions by Observing Betting Practices 5th Iteratioal Symposium o Imprecise Probability: Theories ad Applicatios, Prague, Czech Republic, 007 Estimatig Probability Distributios by Observig Bettig Practices Dr C Lych Natioal Uiversity of Irelad,

More information

Probabilistic Engineering Mechanics. Do Rosenblatt and Nataf isoprobabilistic transformations really differ?

Probabilistic Engineering Mechanics. Do Rosenblatt and Nataf isoprobabilistic transformations really differ? Probabilistic Egieerig Mechaics 4 (009) 577 584 Cotets lists available at ScieceDirect Probabilistic Egieerig Mechaics joural homepage: wwwelseviercom/locate/probegmech Do Roseblatt ad Nataf isoprobabilistic

More information

Lesson 15 ANOVA (analysis of variance)

Lesson 15 ANOVA (analysis of variance) Outlie Variability -betwee group variability -withi group variability -total variability -F-ratio Computatio -sums of squares (betwee/withi/total -degrees of freedom (betwee/withi/total -mea square (betwee/withi

More information

WHEN IS THE (CO)SINE OF A RATIONAL ANGLE EQUAL TO A RATIONAL NUMBER?

WHEN IS THE (CO)SINE OF A RATIONAL ANGLE EQUAL TO A RATIONAL NUMBER? WHEN IS THE (CO)SINE OF A RATIONAL ANGLE EQUAL TO A RATIONAL NUMBER? JÖRG JAHNEL 1. My Motivatio Some Sort of a Itroductio Last term I tought Topological Groups at the Göttige Georg August Uiversity. This

More information

SEQUENCES AND SERIES

SEQUENCES AND SERIES Chapter 9 SEQUENCES AND SERIES Natural umbers are the product of huma spirit. DEDEKIND 9.1 Itroductio I mathematics, the word, sequece is used i much the same way as it is i ordiary Eglish. Whe we say

More information

Chapter 7 - Sampling Distributions. 1 Introduction. What is statistics? It consist of three major areas:

Chapter 7 - Sampling Distributions. 1 Introduction. What is statistics? It consist of three major areas: Chapter 7 - Samplig Distributios 1 Itroductio What is statistics? It cosist of three major areas: Data Collectio: samplig plas ad experimetal desigs Descriptive Statistics: umerical ad graphical summaries

More information

ARTICLE IN PRESS. Statistics & Probability Letters ( ) A Kolmogorov-type test for monotonicity of regression. Cecile Durot

ARTICLE IN PRESS. Statistics & Probability Letters ( ) A Kolmogorov-type test for monotonicity of regression. Cecile Durot STAPRO 66 pp: - col.fig.: il ED: MG PROD. TYPE: COM PAGN: Usha.N -- SCAN: il Statistics & Probability Letters 2 2 2 2 Abstract A Kolmogorov-type test for mootoicity of regressio Cecile Durot Laboratoire

More information