High-dimensional support union recovery in multivariate regression


Guillaume Obozinski
Department of Statistics, UC Berkeley
gobo@stat.berkeley.edu

Martin J. Wainwright
Department of Statistics, Dept. of Electrical Engineering and Computer Science, UC Berkeley
wainwright@stat.berkeley.edu

Michael I. Jordan
Department of Statistics, Department of Electrical Engineering and Computer Science, UC Berkeley
jordan@stat.berkeley.edu

Abstract

We study the behavior of block $\ell_1/\ell_2$ regularization for multivariate regression, where a $K$-dimensional response vector is regressed upon a fixed set of $p$ covariates. The problem of support union recovery is to recover the subset of covariates that are active in at least one of the regression problems. Studying this problem under high-dimensional scaling (where the problem parameters as well as the sample size $n$ tend to infinity simultaneously), our main result is to show that exact recovery is possible once the order parameter given by $\theta_{\ell_1/\ell_2}(n, p, s) := n/[2\psi(B^*) \log(p - s)]$ exceeds a critical threshold. Here $n$ is the sample size, $p$ is the ambient dimension of the regression model, $s$ is the size of the union of supports, and $\psi(B^*)$ is a sparsity-overlap function that measures a combination of the sparsities and overlaps of the $K$ regression coefficient vectors that constitute the model. This sparsity-overlap function reveals that block $\ell_1/\ell_2$ regularization for multivariate regression never harms performance relative to a naive $\ell_1$-approach, and can yield substantial improvements in sample complexity (up to a factor of $K$) when the regression vectors are suitably orthogonal relative to the design. We complement our theoretical results with simulations that demonstrate the sharpness of the result, even for relatively small problems.

1 Introduction

A recent line of research in machine learning has focused on regularization based on block-structured norms.
Such structured norms are well motivated in various settings, among them kernel learning [3, 8], grouped variable selection [12], hierarchical model selection [13], simultaneous sparse approximation [10], and simultaneous feature selection in multi-task learning [7]. Block-norms that compose an $\ell_1$-norm with other norms yield solutions that tend to be sparse like the Lasso, but the structured norm also enforces blockwise sparsity, in the sense that parameters within blocks are more likely to be zero (or non-zero) simultaneously.

The focus of this paper is the model selection consistency of block-structured regularization in the setting of multivariate regression. Our goal is to perform model or variable selection, by which we mean extracting the subset of relevant covariates that are active in at least one regression. We refer to this problem as the support union problem. In line with a large body of recent work in statistical machine learning (e.g., [, 9, 4, ]), our analysis is high-dimensional in nature, meaning that we allow the model dimension $p$ (as well as other structural parameters) to grow along with the sample size $n$. A great deal of work has focused on the case of ordinary $\ell_1$-regularization (Lasso) [,, 4], showing for instance that the Lasso can recover the support of a sparse signal even when $p \gg n$.

Some more recent work has studied consistency issues for block-regularization schemes, including classical analysis ($p$ fixed) of the group Lasso [1], and high-dimensional analysis of the predictive risk of block-regularized logistic regression [5]. Although there have been various empirical demonstrations of the benefits of block regularization, the generalizations of the result of [11] obtained by [6, 4] fail to capture the improvements observed in practice. In this paper, our goal is to understand the following question: under what conditions does block regularization lead to a quantifiable improvement in statistical efficiency, relative to more naive regularization schemes? Here statistical efficiency is assessed in terms of the sample complexity, meaning the minimal sample size $n$ required to recover the support union; we wish to know how this scales as a function of problem parameters. Our main contribution is to provide a function quantifying the benefits of block regularization schemes for the problem of multivariate linear regression, showing in particular that, under suitable structural conditions on the data, the block-norm regularization we consider never harms performance relative to naive $\ell_1$-regularization and can lead to substantial gains in sample complexity.

More specifically, we consider the following problem of multivariate linear regression: a group of $K$ scalar outputs are regressed on the same design matrix $X \in \mathbb{R}^{n \times p}$. Representing the regression coefficients as a $p \times K$ matrix $B^*$, the regression model takes the form

$$Y = XB^* + W, \tag{1}$$

where $Y \in \mathbb{R}^{n \times K}$ and $W \in \mathbb{R}^{n \times K}$ are matrices of observations and zero-mean noise respectively, and $B^*$ has columns $\beta^{*(1)}, \ldots, \beta^{*(K)}$, which are the parameter vectors of each univariate regression. We are interested in recovering the union of the supports of the individual regressions: more specifically, if $S_k = \{ i \in \{1, \ldots, p\} : \beta^{*(k)}_i \neq 0 \}$, we would like to recover $S = \bigcup_k S_k$. The Lasso is often presented as a relaxation of the so-called $\ell_0$ regularization, i.e., the count of the number of non-zero parameter coefficients, an intractable non-convex function.
More generally, block-norm regularizations can be thought of as the relaxation of a non-convex regularization which counts the number of covariates $i$ for which at least one of the univariate regression parameters $\beta^{(k)}_i$ is non-zero. More specifically, let $\beta_i$ denote the $i$th row of $B$, and define, for $q \geq 1$,

$$\|B\|_{\ell_0/\ell_q} = \big| \{ i \in \{1, \ldots, p\} : \|\beta_i\|_q > 0 \} \big| \quad \text{and} \quad \|B\|_{\ell_1/\ell_q} = \sum_{i=1}^{p} \|\beta_i\|_q .$$

All $\ell_0/\ell_q$ norms define the same function, but differ conceptually in that they lead to different $\ell_1/\ell_q$ relaxations. In particular the $\ell_1/\ell_1$ regularization is the same as the usual Lasso. The other conceptually most natural block-norms are $\ell_1/\ell_2$ and $\ell_1/\ell_\infty$. While $\ell_1/\ell_\infty$ is of interest, it seems intuitively to be relevant essentially to situations where the support is exactly the same for all regressions, an assumption that we are not willing to make.

In the current paper, we focus on the $\ell_1/\ell_2$ case and consider the estimator $\hat{B}$ obtained by solving the following disguised second-order cone program:

$$\min_{B \in \mathbb{R}^{p \times K}} \Big\{ \frac{1}{2n} |||Y - XB|||_F^2 + \lambda_n \|B\|_{\ell_1/\ell_2} \Big\}, \tag{2}$$

where $|||M|||_F := \big( \sum_{i,j} m_{ij}^2 \big)^{1/2}$ denotes the Frobenius norm. We study the support union problem under high-dimensional scaling, meaning that the number of observations $n$, the ambient dimension $p$ and the size of the union of supports $s$ can all tend to infinity. The main contribution of this paper is to show that under certain technical conditions on the design and noise matrices, the model selection performance of block-regularized $\ell_1/\ell_2$ regression (2) is governed by the control parameter

$$\theta_{\ell_1/\ell_2}(n, p \,; B^*) := \frac{n}{2\, \psi(B^*, \Sigma_{SS}) \log(p - s)},$$

where $n$ is the sample size, $p$ is the ambient dimension, $s = |S|$ is the size of the union of the supports, and $\psi(\cdot)$ is a sparsity-overlap function defined below. More precisely, the probability of correct support union recovery converges to one for all sequences $(n, p, s, B^*)$ such that the control parameter $\theta_{\ell_1/\ell_2}(n, p \,; B^*)$ exceeds a fixed critical threshold $\theta_{\mathrm{crit}} < +\infty$. Note that $\theta_{\ell_1/\ell_2}$ is a measure of the sample complexity of the problem, that is, the sample size required for exact recovery as a function of the problem parameters.
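To make the estimator (2) concrete, here is a minimal NumPy sketch that evaluates the $\ell_1/\ell_q$ block norm and solves (2) by proximal gradient descent with row-wise (block) soft-thresholding. The paper treats (2) as a second-order cone program and does not prescribe a solver; proximal gradient is swapped in here only because it is short. The function names, the fixed iteration count and the support tolerance are all assumptions made for illustration.

```python
import numpy as np


def l1_lq_norm(B, q=2):
    """Block norm ||B||_{l1/lq}: sum over rows of the l_q norm of each row."""
    return float(np.sum(np.linalg.norm(B, ord=q, axis=1)))


def block_soft_threshold(B, tau):
    """Proximal operator of tau * ||.||_{l1/l2}: shrink every row toward zero."""
    row_norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(row_norms, 1e-12))
    return scale * B


def objective(X, Y, B, lam):
    """Objective of problem (2): (1/2n) ||Y - XB||_F^2 + lam * ||B||_{l1/l2}."""
    n = X.shape[0]
    return 0.5 / n * np.linalg.norm(Y - X @ B, "fro") ** 2 + lam * l1_lq_norm(B, 2)


def solve_block_l1_l2(X, Y, lam, n_iter=2000):
    """Proximal gradient iterations for (2) with constant step size 1/L."""
    n, p = X.shape
    K = Y.shape[1]
    # Lipschitz constant of the gradient of the quadratic term.
    L = np.linalg.norm(X, 2) ** 2 / n
    B = np.zeros((p, K))
    for _ in range(n_iter):
        grad = X.T @ (X @ B - Y) / n
        B = block_soft_threshold(B - grad / L, lam / L)
    return B


def estimated_support(B, tol=1e-8):
    """Indices of rows of B that are not (numerically) zero."""
    return np.flatnonzero(np.linalg.norm(B, axis=1) > tol)
```

Calling `solve_block_l1_l2(X, Y, lam)` followed by `estimated_support` gives an estimate of the support union. Because the shrinkage acts on whole rows, a covariate is either kept for all $K$ regressions or discarded for all of them, which is the blockwise sparsity described above.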
Whereas the ratio $(n / \log p)$ is standard for high-dimensional theory on $\ell_1$-regularization (essentially due to covering numbers of $\ell_1$ balls), the function $\psi(B^*, \Sigma_{SS})$ is a novel and interesting quantity, which measures both the sparsity of the matrix $B^*$, as well as the overlap between the different regression tasks (columns of $B^*$).

In Section 2, we introduce the models and assumptions, define key characteristics of the problem and state our main result and its consequences. Section 3 is devoted to the proof of this main result, with most technical results deferred to the appendix. Section 4 illustrates with simulations the sharpness of our analysis and how quickly the asymptotic regime arises.

1.1 Notations

For a (possibly random) matrix $M \in \mathbb{R}^{p \times K}$, and for parameters $1 \leq a \leq b \leq \infty$, we distinguish the $\ell_a/\ell_b$ block norms from the $(a, b)$-operator norms, defined respectively as

$$\|M\|_{\ell_a/\ell_b} := \Big\{ \sum_{i=1}^{p} \Big( \sum_{k=1}^{K} |m_{ik}|^b \Big)^{a/b} \Big\}^{1/a} \quad \text{and} \quad |||M|||_{a,b} := \sup_{\|x\|_b = 1} \|Mx\|_a, \tag{3}$$

although $\ell_\infty/\ell_p$ norms belong to both families (see Lemma B.0.). For brevity, we denote the spectral norm $|||M|||_{2,2}$ as $|||M|||_2$, and the $\ell_\infty$-operator norm $|||M|||_{\infty,\infty} = \max_i \sum_j |M_{ij}|$ as $|||M|||_\infty$.

2 Main result and some consequences

The analysis of this paper applies to multivariate linear regression problems of the form (1), in which the noise matrix $W \in \mathbb{R}^{n \times K}$ is assumed to consist of i.i.d. elements $W_{ij} \sim N(0, \sigma^2)$. In addition, we assume that the measurement or design matrices $X$ have rows drawn in an i.i.d. manner from a zero-mean Gaussian $N(0, \Sigma)$, where $\Sigma \succ 0$ is a $p \times p$ covariance matrix. Suppose that we partition the full set of covariates into the support set $S$ and its complement $S^c$, with $|S| = s$, $|S^c| = p - s$. Consider the following block decompositions of the regression coefficient matrix, the design matrix and its covariance matrix:

$$B^* = \begin{bmatrix} B^*_S \\ B^*_{S^c} \end{bmatrix}, \qquad X = [X_S \;\; X_{S^c}], \qquad \text{and} \qquad \Sigma = \begin{bmatrix} \Sigma_{SS} & \Sigma_{SS^c} \\ \Sigma_{S^cS} & \Sigma_{S^cS^c} \end{bmatrix}.$$

We use $\beta^*_i$ to denote the $i$th row of $B^*$, and assume that the sparsity of $B^*$ is assessed as follows:

(A0) Sparsity: The matrix $B^*$ has row support $S := \{ i \in \{1, \ldots, p\} \mid \beta^*_i \neq 0 \}$, with $s = |S|$.

In addition, we make the following assumptions about the covariance $\Sigma$ of the design matrix:

(A1) Bounded eigenspectrum: There exists a constant $C_{\min} > 0$ (resp. $C_{\max} < +\infty$) such that all eigenvalues of $\Sigma_{SS}$ (resp. $\Sigma$) are greater than $C_{\min}$ (resp. smaller than $C_{\max}$).
(A2) Mutual incoherence: There exists $\gamma \in (0, 1]$ such that $|||\Sigma_{S^cS} (\Sigma_{SS})^{-1}|||_\infty \leq 1 - \gamma$.

(A3) Self incoherence: There exists a constant $D_{\max}$ such that $|||(\Sigma_{SS})^{-1}|||_\infty \leq D_{\max}$.

Assumption A1 is a standard condition required to prevent excess dependence among elements of the design matrix associated with the support $S$. The mutual incoherence assumption A2 is also well known from previous work on model selection with the Lasso [10, 14]. These assumptions are trivially satisfied by the standard Gaussian ensemble ($\Sigma = I_p$) with $C_{\min} = C_{\max} = D_{\max} = \gamma = 1$. More generally, it can be shown that various matrix classes satisfy these conditions [4, ].

2.1 Statement of main result

With the goal of estimating the union of supports $S$, our main result is a set of sufficient conditions using the following procedure. Solve the block-regularized problem (2) with regularization parameter $\lambda_n > 0$, thereby obtaining a solution $\hat{B} = \hat{B}(\lambda_n)$. Use this solution to compute an estimate of the support union as $\hat{S}(\hat{B}) := \{ i \in \{1, \ldots, p\} \mid \hat{\beta}_i \neq 0 \}$. This estimator is unambiguously defined if the solution $\hat{B}$ is unique, and as part of our analysis, we show that the solution $\hat{B}$ is indeed unique with high probability in the regime of interest. We study the behavior of this estimator for a sequence of linear regressions indexed by the triplet $(n, p, s)$, for which the data follows the general model presented in the previous section with defining parameters $B^*(n)$ and $\Sigma(n)$ satisfying A0-A3. As $(n, p, s)$ tends to infinity, we give conditions on the triplet and properties of $B^*$ for which $\hat{B}$ is unique, and such that $P[\hat{S} = S] \to 1$.

The central objects in our main result are the sparsity-overlap function and the sample complexity parameter, which we define here. For any vector $\beta_i \neq 0$, define $\zeta(\beta_i) := \beta_i / \|\beta_i\|_2$. We extend the function $\zeta$ to any matrix $B_S \in \mathbb{R}^{s \times K}$ with non-zero rows by defining the matrix $\zeta(B_S) \in \mathbb{R}^{s \times K}$ with $i$th row $[\zeta(B_S)]_i = \zeta(\beta_i)$. With this notation, we define the sparsity-overlap function $\psi(B)$ and the sample complexity parameter $\theta_{\ell_1/\ell_2}(n, p \,; B^*)$ as

$$\psi(B) := |||\zeta(B_S)^T (\Sigma_{SS})^{-1} \zeta(B_S)|||_2 \quad \text{and} \quad \theta_{\ell_1/\ell_2}(n, p \,; B^*) := \frac{n}{2\, \psi(B^*) \log(p - s)}. \tag{4}$$

Finally, we use $b^*_{\min} := \min_{i \in S} \|\beta^*_i\|_2$ to denote the minimal $\ell_2$ row-norm of the matrix $B^*_S$. With this notation, we have the following result:

Theorem 1. Consider a random design matrix $X$ drawn with i.i.d. $N(0, \Sigma)$ row vectors, an observation matrix $Y$ specified by model (1), and a regression matrix $B^*$ such that $(b^*_{\min})^2$ decays strictly more slowly than $\frac{f(p)}{n} \max\{s, \log(p - s)\}$, for any function $f(p) \to +\infty$. Suppose that we solve the block-regularized program (2) with regularization parameter $\lambda_n = \Theta\big( \sqrt{f(p) \log(p)/n} \big)$. For any sequence $(n, p, B^*)$ such that the $\ell_1/\ell_2$ control parameter $\theta_{\ell_1/\ell_2}(n, p \,; B^*)$ exceeds the critical threshold $\theta_{\mathrm{crit}}(\Sigma) := C_{\max}/\gamma^2$, then with probability greater than $1 - \exp(-\Theta(\log p))$,
(a) the block-regularized program (2) has a unique solution $\hat{B}$, and
(b) its support set $\hat{S}(\hat{B})$ is equal to the true support union $S$.

Remarks: (i) For the standard Gaussian ensemble ($\Sigma = I_p$), the critical threshold is simply $\theta_{\mathrm{crit}}(\Sigma) = 1$. (ii) A technical condition that we require on the regularization parameter is

$$\frac{\lambda_n^2\, n}{\log(p - s)} \to \infty, \tag{5}$$

which is satisfied by the choice given in the statement.

2.2 Some consequences of Theorem 1

It is interesting to consider some special cases of our main result.
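The sparsity-overlap function and the control parameter in (4) are easy to evaluate numerically. The sketch below assumes that the rows of `B_S` are restricted to the support (so that none of them is zero) and that `Sigma_SS` is the corresponding block of the design covariance; the function names are hypothetical.

```python
import numpy as np


def zeta(B_S):
    """Rescale every row of B_S to unit l2 norm (rows must be non-zero)."""
    norms = np.linalg.norm(B_S, axis=1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("zeta is only defined for matrices with non-zero rows")
    return B_S / norms


def sparsity_overlap(B_S, Sigma_SS):
    """psi(B): spectral norm of zeta(B_S)^T Sigma_SS^{-1} zeta(B_S), as in (4)."""
    Z = zeta(B_S)
    M = Z.T @ np.linalg.solve(Sigma_SS, Z)
    return float(np.linalg.norm(M, 2))


def control_parameter(n, p, B_S, Sigma_SS):
    """theta_{l1/l2}(n, p; B) = n / (2 psi(B) log(p - s)), as in (4)."""
    s = B_S.shape[0]
    return n / (2.0 * sparsity_overlap(B_S, Sigma_SS) * np.log(p - s))


# Two extreme configurations with s = 8, K = 2 and Sigma_SS = I_s.
s = 8
beta = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 1.5, -0.3, 2.0])
B_identical = np.column_stack([beta, beta])                      # K copies of one vector
B_orthogonal = np.column_stack([np.ones(s), np.tile([1.0, -1.0], s // 2)])

psi_identical = sparsity_overlap(B_identical, np.eye(s))
psi_orthogonal = sparsity_overlap(B_orthogonal, np.eye(s))
```

With an identity covariance, `psi_identical` equals $s$ and `psi_orthogonal` equals $s/K$. These are the two ends of the range of $\psi$ for $\Sigma_{SS} = I_s$.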
The simplest special case is the univariate regression problem ($K = 1$), in which case the function $\zeta(\beta^*)$ outputs an $s$-dimensional sign vector with elements $z^*_i = \mathrm{sign}(\beta^*_i)$, so that $\psi(\beta^*) = z^{*T} (\Sigma_{SS})^{-1} z^* = \Theta(s)$. Consequently, the order parameter of block $\ell_1/\ell_2$-regression for univariate regression is given by $\Theta(n/(s \log(p - s)))$, which matches the scaling established in previous work on the Lasso [11].

More generally, given our assumption (A1) on $\Sigma_{SS}$, the sparsity overlap $\psi(B^*)$ always lies in the interval $\big[ \frac{s}{K C_{\max}}, \frac{s}{C_{\min}} \big]$. At the most pessimistic extreme, suppose that $B^* := \beta^* \mathbf{1}_K^T$, that is, $B^*$ consists of $K$ copies of the same coefficient vector $\beta^* \in \mathbb{R}^p$, with support of cardinality $|S| = s$. We then have $[\zeta(B^*)]_{ij} = \mathrm{sign}(\beta^*_i)/\sqrt{K}$, from which we see that $\psi(B^*) = z^{*T} (\Sigma_{SS})^{-1} z^*$, with $z^*$ again the $s$-dimensional sign vector with elements $z^*_i = \mathrm{sign}(\beta^*_i)$, so that there is no benefit in sample complexity relative to the naive strategy of solving separate Lasso problems and constructing the union of individually estimated supports. This might seem a pessimistic result, since under model (1), we essentially have $K$ observations of the coefficient vector $\beta^*$ with the same design matrix but $K$ independent noise realizations. However, the thresholds as well as the rates of convergence in high-dimensional results such as Theorem 1 are not determined by the noise variance, but rather by the number of interfering variables $(p - s)$.

At the most optimistic extreme, consider the case where $\Sigma_{SS} = I_s$ and (for $s > K$) suppose that $B^*$ is constructed such that the columns of the $s \times K$ matrix $\zeta(B^*)$ are all orthogonal and of equal length. Under this condition, we have the following result.

Corollary 1 (Orthonormal tasks). If the columns of the matrix $\zeta(B^*)$ are all orthogonal with equal length and $\Sigma_{SS} = I_{s \times s}$, then the block-regularized problem (2) succeeds in union support recovery once the sample complexity parameter $n / \big( 2 \frac{s}{K} \log(p - s) \big)$ is larger than 1.

For the standard Gaussian ensemble, it is known [11] that the Lasso fails with probability one for all sequences such that $n < (2 - \nu)\, s \log(p - s)$ for any arbitrarily small $\nu > 0$. Consequently, Corollary 1 shows that under suitable conditions on the regression coefficient matrix $B^*$, $\ell_1/\ell_2$ regularization can provide a $K$-fold reduction in the number of samples required for exact support recovery.

As a third illustration, consider, for $\Sigma_{SS} = I_{s \times s}$, the case where the supports $S_k$ of the individual regression problems are all disjoint. The sample complexity parameter for each of the individual Lassos is $n/(2 s_k \log(p - s_k))$ where $|S_k| = s_k$, so that the sample size required to recover the support union from individual Lassos scales as $n = \Theta(\max_k [s_k \log(p - s_k)])$. However, if the supports are all disjoint, then the columns of the matrix $Z^*_S = \zeta(B^*_S)$ are orthogonal, and $Z^{*T}_S Z^*_S = \mathrm{diag}(s_1, \ldots, s_K)$, so that $\psi(B^*) = \max_k s_k$ and the sample complexity is the same. In other words, even though there is no sharing of variables at all, there is surprisingly no penalty from regularizing jointly with the $\ell_1/\ell_2$-norm. However, this is not always true if $\Sigma_{SS} \neq I_{s \times s}$, and in many situations $\ell_1/\ell_2$-regularization can have higher sample complexity than separate Lassos.

3 Proof of Theorem 1

In addition to the previous notation, the proofs use the shorthands $\hat{\Sigma}_{SS} = \frac{1}{n} X_S^T X_S$ and $\hat{\Sigma}_{S^cS} = \frac{1}{n} X_{S^c}^T X_S$, and $\Pi_S = \frac{1}{n} X_S (\hat{\Sigma}_{SS})^{-1} X_S^T$ denotes the orthogonal projection onto the range of $X_S$.

High-level proof outline: At a high level, our proof is based on the notion of what we refer to as a primal-dual witness: we first formulate the problem (2) as a second-order cone program (SOCP), with the same primal variable $B$ as in (2) and a dual variable $Z$ whose rows coincide at optimality with the subgradient of the $\ell_1/\ell_2$ norm.
We then construct a primal matrix $\hat{B}$ along with a dual matrix $\hat{Z}$ such that, under the conditions of Theorem 1, with probability converging to 1:
(a) The pair $(\hat{B}, \hat{Z})$ satisfies the Karush-Kuhn-Tucker (KKT) conditions of the SOCP.
(b) In spite of the fact that for general high-dimensional problems (with $p \gg n$), the SOCP need not have a unique solution a priori, a strict feasibility condition satisfied by the dual variables $\hat{Z}$ guarantees that $\hat{B}$ is the unique optimal solution of (2).
(c) The support union $\hat{S}$ of $\hat{B}$ is identical to the support union $S$ of $B^*$.

At the core of our constructive procedure is the following convex-analytic result, which characterizes an optimal primal-dual pair for which the primal solution $\hat{B}$ correctly recovers the support set $S$:

Lemma 1. Suppose that there exists a primal-dual pair $(\hat{B}, \hat{Z})$ that satisfies the conditions:

$$\hat{Z}_S = \zeta(\hat{B}_S) \tag{6a}$$
$$\hat{\Sigma}_{SS} (\hat{B}_S - B^*_S) - \frac{1}{n} X_S^T W = -\lambda_n \hat{Z}_S \tag{6b}$$
$$\lambda_n \|\hat{Z}_{S^c}\|_{\ell_\infty/\ell_2} := \Big\| \hat{\Sigma}_{S^cS} (\hat{B}_S - B^*_S) - \frac{1}{n} X_{S^c}^T W \Big\|_{\ell_\infty/\ell_2} < \lambda_n \tag{6c}$$
$$\hat{B}_{S^c} = 0. \tag{6d}$$

Then $(\hat{B}, \hat{Z})$ is the unique optimal solution to the block-regularized problem, with $\hat{S}(\hat{B}) = S$ by construction.

Appendix A proves Lemma 1, with the strict feasibility of $\hat{Z}_{S^c}$ given by (6c) to certify uniqueness.

3.1 Construction of primal-dual witness

Based on Lemma 1, we construct the primal-dual pair $(\hat{B}, \hat{Z})$ as follows. First, we set $\hat{B}_{S^c} = 0$, to satisfy condition (6d). Next, we obtain the pair $(\hat{B}_S, \hat{Z}_S)$ by solving a restricted version of (2):

$$\hat{B}_S = \arg\min_{B_S \in \mathbb{R}^{s \times K}} \Big\{ \frac{1}{2n} \Big|\Big|\Big| Y - X \begin{bmatrix} B_S \\ 0_{S^c} \end{bmatrix} \Big|\Big|\Big|_F^2 + \lambda_n \|B_S\|_{\ell_1/\ell_2} \Big\}. \tag{7}$$

Since $s < n$, the empirical covariance (sub)matrix $\hat{\Sigma}_{SS} = \frac{1}{n} X_S^T X_S$ is strictly positive definite with probability one, which implies that the restricted problem (7) is strictly convex and therefore has a unique optimum $\hat{B}_S$. We then choose $\hat{Z}_S$ to be the solution of equation (6b). Since any such matrix $\hat{Z}_S$ is also a dual solution to the SOCP (7), it must be an element of the subdifferential $\partial \|\hat{B}_S\|_{\ell_1/\ell_2}$. It remains to show that this construction satisfies conditions (6a) and (6c). In order to satisfy condition (6a), it suffices to show that $\hat{\beta}_i \neq 0$ for all $i \in S$. From equation (6b) and since $\hat{\Sigma}_{SS}$ is invertible, we may solve as follows:

$$(\hat{B}_S - B^*_S) = (\hat{\Sigma}_{SS})^{-1} \Big[ \frac{1}{n} X_S^T W - \lambda_n \hat{Z}_S \Big] =: U_S. \tag{8}$$

For any row $i \in S$, we have $\|\hat{\beta}_i\|_2 \geq \|\beta^*_i\|_2 - \|U_S\|_{\ell_\infty/\ell_2}$. Thus, to show that no row of $\hat{B}_S$ is identically zero, it suffices to show that the following event occurs with high probability:

$$E(U_S) := \Big\{ \|U_S\|_{\ell_\infty/\ell_2} \leq \frac{1}{2} b^*_{\min} \Big\}. \tag{9}$$

We establish this result later in this section. Turning to condition (6c), by substituting expression (8) for the difference $(\hat{B}_S - B^*_S)$ into equation (6c), we obtain a $(p - s) \times K$ random matrix $V_{S^c}$, whose row $j \in S^c$ is given by

$$V_j := X_j^T \Big( [\Pi_S - I_n] \frac{W}{n} - \lambda_n \frac{X_S}{n} (\hat{\Sigma}_{SS})^{-1} \hat{Z}_S \Big). \tag{10}$$

For condition (6c) to hold with high probability, it is necessary and sufficient that the probability of the event

$$E(V_{S^c}) := \big\{ \|V_{S^c}\|_{\ell_\infty/\ell_2} < \lambda_n \big\} \tag{11}$$

converges to one as $n$ tends to infinity.

Correct inclusion of supporting covariates: We begin by analyzing the probability of $E(U_S)$.

Lemma 2. Under assumption A3 and condition (5) of Theorem 1, with probability $1 - \exp(-\Theta(\log s))$, we have

$$\|U_S\|_{\ell_\infty/\ell_2} \leq O\big( \sqrt{(\log s)/n} \big) + \lambda_n \big( D_{\max} + O\big( \sqrt{s^2/n} \big) \big).$$

This lemma is proved in the Appendix. With the assumed scaling $n = \Omega(s \log(p - s))$, and the assumed slow decrease of $b^*_{\min}$, which we write explicitly as $(b^*_{\min})^2 \geq \frac{1}{\epsilon_n^2} \frac{f(p)}{n} \max\{s, \log(p - s)\}$ for some $\epsilon_n \to 0$, we have

$$\frac{\|U_S\|_{\ell_\infty/\ell_2}}{b^*_{\min}} \leq O(\epsilon_n), \tag{12}$$

so that the conditions of Theorem 1 ensure that $E(U_S)$ occurs with probability converging to one.

Correct exclusion of non-support: Next we analyze the event $E(V_{S^c})$. For simplicity, in the following arguments, we drop the index $S^c$ and write $V$ for $V_{S^c}$.
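The substitution of (8) into (6c) that produces (10) is a purely algebraic identity, so it can be sanity-checked on random matrices. The sketch below uses an arbitrary matrix in place of the dual variable $\hat{Z}_S$, since the identity does not depend on its value. It assumes the normalization $\Pi_S = \frac{1}{n} X_S (\hat{\Sigma}_{SS})^{-1} X_S^T$, takes the support to be the first $s$ columns, and uses a hypothetical function name.

```python
import numpy as np


def witness_matrices(X, W, s, Z_S, lam):
    """Return U_S from (8), the matrix inside the norm in (6c), V from (10), and Pi_S.

    The support S is taken to be the first s columns of X (an arbitrary
    labelling chosen for this illustration).
    """
    n = X.shape[0]
    X_S, X_Sc = X[:, :s], X[:, s:]
    Sigma_SS = X_S.T @ X_S / n                    # empirical covariance on S
    Sigma_ScS = X_Sc.T @ X_S / n                  # empirical cross-covariance
    Pi_S = X_S @ np.linalg.solve(Sigma_SS, X_S.T) / n   # projection onto range(X_S)

    # Equation (8): U_S = Sigma_SS^{-1} [ X_S^T W / n - lam * Z_S ]
    U_S = np.linalg.solve(Sigma_SS, X_S.T @ W / n - lam * Z_S)

    # Matrix whose l_inf/l_2 norm appears in (6c), with U_S substituted in.
    lhs_6c = Sigma_ScS @ U_S - X_Sc.T @ W / n

    # Equation (10), all rows j in S^c at once.
    V = X_Sc.T @ ((Pi_S - np.eye(n)) @ W / n
                  - lam * X_S @ np.linalg.solve(Sigma_SS, Z_S) / n)
    return U_S, lhs_6c, V, Pi_S


rng = np.random.default_rng(0)
n, p, s, K, lam = 50, 12, 4, 3, 0.3
X = rng.standard_normal((n, p))
W = rng.standard_normal((n, K))
Z_S = rng.standard_normal((s, K))   # placeholder for the dual matrix
U_S, lhs_6c, V, Pi_S = witness_matrices(X, W, s, Z_S, lam)

# l_inf/l_2 block norm: largest l2 norm among the rows.
V_block_norm = float(np.max(np.linalg.norm(V, axis=1)))
```

The arrays `lhs_6c` and `V` agree up to floating-point error, and `Pi_S` is symmetric and idempotent, as expected of an orthogonal projection. In an actual witness construction, `Z_S` would instead come from solving the restricted problem (7).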
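In order to show that $\|V\|_{\ell_\infty/\ell_2} < \lambda_n$ with probability converging to one, we make use of the decomposition

$$\frac{1}{\lambda_n} \|V\|_{\ell_\infty/\ell_2} \leq \sum_{i=1}^{3} T_i,$$

where

$$T_1 := \frac{1}{\lambda_n} \big\| E[V \mid X_S] \big\|_{\ell_\infty/\ell_2}, \quad T_2 := \frac{1}{\lambda_n} \big\| E[V \mid X_S, W] - E[V \mid X_S] \big\|_{\ell_\infty/\ell_2}, \quad T_3 := \frac{1}{\lambda_n} \big\| V - E[V \mid X_S, W] \big\|_{\ell_\infty/\ell_2}.$$

Lemma 3. Under assumption A2, $T_1 \leq 1 - \gamma$. Under condition (5) of Theorem 1, $T_2 = o_p(1)$.

Therefore, to show that $\frac{1}{\lambda_n} \|V\|_{\ell_\infty/\ell_2} < 1$ with high probability, it suffices to show that $T_3 < \gamma$ with high probability. Until now, we have not appealed to the sample complexity parameter $\theta_{\ell_1/\ell_2}(n, p \,; B^*)$. In the remainder of this section, we prove that $\theta_{\ell_1/\ell_2}(n, p \,; B^*) > \theta_{\mathrm{crit}}(\Sigma)$ implies that $T_3 < \gamma$ with high probability.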

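Lemma 4. Conditionally on $W$ and $X_S$, we have

$$\big( \|V_j - E[V_j \mid X_S, W]\|_2^2 \;\big|\; W, X_S \big) \overset{d}{=} (\Sigma_{S^c|S})_{jj}\, \xi_j^T M_n \xi_j,$$

where $\Sigma_{S^c|S}$ denotes the conditional covariance of $X_{S^c}$ given $X_S$, $\xi_j \sim N(\vec{0}_K, I_K)$, and the $K \times K$ matrix $M_n = M_n(X_S, W)$ is given by

$$M_n := \frac{\lambda_n^2}{n} \hat{Z}_S^T (\hat{\Sigma}_{SS})^{-1} \hat{Z}_S + \frac{1}{n^2} W^T (I_n - \Pi_S) W. \tag{13}$$

But the covariance matrix $M_n$ is itself concentrated. Indeed,

Lemma 5. Under condition (5) of Theorem 1, for any $\delta > 0$, the following event $T(\delta)$ has probability converging to 1:

$$T(\delta) := \Big\{ |||M_n|||_2 \leq \lambda_n^2 \frac{\psi(B^*)}{n} (1 + \delta) \Big\}. \tag{14}$$

For any fixed $\delta > 0$, we have $P[T_3 \geq \gamma] \leq P[T_3 \geq \gamma \mid T(\delta)] + P[T(\delta)^c]$, but, from Lemma 5, $P[T(\delta)^c] \to 0$, so that it suffices to deal with the first term. Given that $(\Sigma_{S^c|S})_{jj} \leq (\Sigma_{S^cS^c})_{jj} \leq C_{\max}$ for all $j$, on the event $T(\delta)$, we have

$$\max_{j \in S^c} (\Sigma_{S^c|S})_{jj}\, \xi_j^T M_n \xi_j \;\leq\; C_{\max}\, |||M_n|||_2 \max_{j \in S^c} \|\xi_j\|_2^2 \;\leq\; C_{\max}\, \lambda_n^2 \frac{\psi(B^*)}{n} (1 + \delta) \max_{j \in S^c} \|\xi_j\|_2^2$$

and

$$P[T_3 \geq \gamma \mid T(\delta)] \leq P\Big[ \max_{j \in S^c} \|\xi_j\|_2^2 \geq 2 t^*(n, B^*) \Big] \quad \text{with} \quad t^*(n, B^*) := \frac{1}{2} \frac{\gamma^2}{C_{\max}} \frac{n}{\psi(B^*)(1 + \delta)}.$$

Finally, using the union bound and a large deviation bound for $\chi^2$ variates, we get the following condition, which is equivalent to the condition of Theorem 1, $\theta_{\ell_1/\ell_2}(n, p \,; B^*) > \theta_{\mathrm{crit}}(\Sigma)$:

Lemma 6. $P\big[ \max_{j \in S^c} \|\xi_j\|_2^2 \geq 2 t^*(n, B^*) \big] \to 0$ if $t^*(n, B^*) > (1 + \nu) \log(p - s)$ for some $\nu > 0$.

4 Simulations

In this section, we illustrate the sharpness of Theorem 1 and furthermore ascertain how quickly the predicted behavior is observed as $n, p, s$ grow in different regimes, for two regression tasks (i.e., $K = 2$). In the following simulations, the matrix $B^*$ of regression coefficients is designed with entries $\beta^*_{ij}$ in $\{ -1/\sqrt{2}, 1/\sqrt{2} \}$ to yield a desired value of $\psi(B^*)$. The design matrix $X$ is sampled from the standard Gaussian ensemble. Since $|\beta^*_{ij}| = 1/\sqrt{2}$ in this construction, we have $B^*_S = \zeta(B^*_S)$, and $b^*_{\min} = 1$. Moreover, since $\Sigma = I_p$, the sparsity-overlap $\psi(B^*)$ is simply $|||\zeta(B^*)^T \zeta(B^*)|||_2$. From our analysis, the sample complexity parameter $\theta_{\ell_1/\ell_2}$ is controlled by the interference of irrelevant covariates, and not by the variance of a noise component. We consider linear sparsity with $s = \alpha p$, for $\alpha = 1/8$, for various ambient model dimensions $p \in \{32, 256, 1024\}$.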
For each value of $p$, we perform simulations varying the sample size $n$ to match corresponding values of the basic Lasso sample complexity parameter, given by $\theta_{\mathrm{Las}} := n/(2 s \log(p - s))$, in the interval $[0.25, 1.5]$. In each case, we solve the block-regularized problem (2) with sample size $n = 2 \theta_{\mathrm{Las}}\, s \log(p - s)$ using the regularization parameter $\lambda_n = \sqrt{\log(p - s)\,(\log s)/n}$. In all cases, the noise level is set at $\sigma = 0.1$.

For our construction of matrices $B^*$, we choose both $p$ and the scalings for the sparsity so that the obtained values for $s$ are multiples of four, and construct the columns $Z^{(1)*}$ and $Z^{(2)*}$ of the matrix $B^* = \zeta(B^*)$ from copies of vectors of length 4. Denoting by $\otimes$ the usual matrix tensor product, we consider:

Identical regressions: We set $Z^{(1)*} = Z^{(2)*} = \frac{1}{\sqrt{2}} \vec{1}_s$, so that the sparsity-overlap is $\psi(B^*) = s$.

Orthogonal regressions: Here $B^*$ is constructed with $Z^{(1)*} \perp Z^{(2)*}$, so that $\psi(B^*) = \frac{s}{2}$, the most favorable situation. To achieve this, we set $Z^{(1)*} = \frac{1}{\sqrt{2}} \vec{1}_s$ and $Z^{(2)*} = \frac{1}{\sqrt{2}} \vec{1}_{s/2} \otimes (1, -1)^T$.

Intermediate angles: In this intermediate case, the columns $Z^{(1)*}$ and $Z^{(2)*}$ are at a $60^\circ$ angle, which leads to $\psi(B^*) = \frac{3}{4} s$. We set $Z^{(1)*} = \frac{1}{\sqrt{2}} \vec{1}_s$ and $Z^{(2)*} = \frac{1}{\sqrt{2}} \vec{1}_{s/4} \otimes (1, 1, 1, -1)^T$.

Figure 1 shows plots of all three cases and the reference Lasso case for the three different values of the ambient dimension, in the linear sparsity regime described above. Note how the curves all undergo a threshold phenomenon, with the location consistent with the predictions of Theorem 1.
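The three designs above can be reproduced in a few lines of NumPy. The sketch below builds the columns $Z^{(1)*}$ and $Z^{(2)*}$ with Kronecker products, evaluates $\psi(B^*) = |||\zeta(B^*)^T \zeta(B^*)|||_2$ for $\Sigma = I_p$, and computes the sample size and regularization parameter used in the simulations. The function names are hypothetical, and rounding $n$ up to an integer is an assumption, since the text does not say how non-integer sample sizes are handled.

```python
import numpy as np


def simulation_design(s, case):
    """Return the s x 2 matrix [Z1, Z2] for one of the three designs (s multiple of 4)."""
    z1 = np.ones(s) / np.sqrt(2)
    if case == "identical":
        z2 = z1.copy()
    elif case == "orthogonal":
        z2 = np.kron(np.ones(s // 2), [1.0, -1.0]) / np.sqrt(2)
    elif case == "intermediate":
        z2 = np.kron(np.ones(s // 4), [1.0, 1.0, 1.0, -1.0]) / np.sqrt(2)
    else:
        raise ValueError(f"unknown case: {case}")
    return np.column_stack([z1, z2])


def psi_identity_design(Z):
    """Sparsity-overlap for Sigma = I: spectral norm of Z^T Z (rows of Z have unit norm)."""
    return float(np.linalg.norm(Z.T @ Z, 2))


def angle_degrees(Z):
    """Angle between the two columns of Z, in degrees."""
    z1, z2 = Z[:, 0], Z[:, 1]
    cos = z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def sample_size(theta_las, p, s):
    """n = 2 * theta_Las * s * log(p - s), rounded up (the rounding is an assumption)."""
    return int(np.ceil(2.0 * theta_las * s * np.log(p - s)))


def regularization(n, p, s):
    """lambda_n = sqrt(log(p - s) * log(s) / n), as used in the simulations."""
    return float(np.sqrt(np.log(p - s) * np.log(s) / n))


p = 256
s = p // 8
psi_values = {case: psi_identity_design(simulation_design(s, case))
              for case in ("identical", "orthogonal", "intermediate")}
```

For $s = 32$ the dictionary `psi_values` contains $s$, $s/2$ and $3s/4$ for the identical, orthogonal and intermediate designs. In units of $\theta_{\mathrm{Las}}$, Theorem 1 therefore places the thresholds at $1$, $0.5$ and $0.75$ respectively.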

[Figure 1: three panels, from left to right p=32 (s=p/8=4), p=256 (s=p/8=32) and p=1024 (s=p/8=128). Horizontal axis: θ. Vertical axis: P(support correct). Curves: L1, Z1=Z2, ∠(Z1,Z2)=60°, Z1⊥Z2.]

Figure 1. Plots of support recovery probability $P[\hat{S} = S]$ versus the basic $\ell_1$ control parameter $\theta_{\mathrm{Las}} = n/[2 s \log(p - s)]$ for linear sparsity $s = p/8$, and for increasing values of $p \in \{32, 256, 1024\}$ from left to right. Each graph shows four curves corresponding to the case of independent $\ell_1$ regularization (pluses), and for $\ell_1/\ell_2$ regularization, the cases of identical regressions (crosses), intermediate angles (nablas), and orthogonal regressions (squares). As plotted in dotted vertical lines, Theorem 1 predicts that the identical case should succeed for $\theta_{\mathrm{Las}} > 1$ (same as ordinary Lasso), the intermediate case for $\theta_{\mathrm{Las}} > 0.75$, and the orthogonal case for $\theta_{\mathrm{Las}} > 0.50$. The shift of these curves confirms this prediction.

5 Discussion

We studied support union recovery under high-dimensional scaling with the $\ell_1/\ell_2$ regularization, and showed that its sample complexity is determined by the function $\psi(B^*)$. The latter integrates the sparsity of each univariate regression with the overlap of all the supports and the discrepancies between the parameter vectors being estimated. In favorable cases, for $K$ regressions, the sample complexity for $\ell_1/\ell_2$ is $K$ times smaller than that of the Lasso. Moreover, this gain is not obtained at the expense of an assumption of shared support over the data. In fact, for standard Gaussian designs, the regularization seems adaptive in the sense that it does not perform worse than the Lasso for disjoint supports. This is not necessarily the case for more general designs, and in some situations, which need to be characterized in future work, it could do worse than the Lasso.

References

[1] F. Bach. Consistency of the group Lasso and multiple kernel learning. Technical report, INRIA - Département d'Informatique, Ecole Normale Supérieure, 2008.
[2] F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proc. Int. Conf. Machine Learning (ICML). Morgan Kaufmann, 2004.
[3] D. Donoho, M. Elad, and V. M. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Info Theory, 52(1):6-18, January 2006.
[4] H. Liu and J. Zhang. On the $\ell_1$-$\ell_q$ regularized regression. Technical Report arXiv:0802.1517v1, Carnegie Mellon University, 2008.
[5] L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. Technical report, Mathematics Department, Swiss Federal Institute of Technology Zürich, 2007.
[6] Y. Nardi and A. Rinaldo. On the asymptotic properties of the group lasso estimator for linear models. Electronic Journal of Statistics, 2008.
[7] G. Obozinski, B. Taskar, and M. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 2009. To appear.
[8] M. Pontil and C.A. Micchelli. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099-1125, 2005.
[9] P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. SpAM: sparse additive models. In Neural Info. Proc. Systems (NIPS), Vancouver, Canada, December 2007.
[10] J. A. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans. Info Theory, 52(3):1030-1051, March 2006.
[11] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity using $\ell_1$-constrained quadratic programs. Technical Report 709, Department of Statistics, UC Berkeley, 2006.
[12] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society B, 68(1):49-67, 2006.
[13] P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Technical report, Statistics Department, UC Berkeley, 2007.
[14] P. Zhao and B. Yu. Model selection with the lasso. J. of Machine Learning Research.