AMS 2000 subject classification. Primary 62G08, 62G20; secondary 62G99

Transcription

1 VARIABLE SELECTION IN NONPARAMETRIC ADDITIVE MODELS Jia Huag 1, Joel L. Horowitz 2 ad Fegrog Wei 3 1 Uiversity of Iowa, 2 Northwester Uiversity ad 3 Uiversity of West Georgia Abstract We cosider a oparametric additive model of a coditioal mea fuctio i which the umber of variables ad additive compoets may be larger tha the sample size but the umber of o-zero additive compoets is small relative to the sample size. The statistical problem is to determie which additive compoets are o-zero. The additive compoets are approximated by trucated series expasios with B-splie bases. With this approximatio, the problem of compoet selectio becomes that of selectig the groups of coefficiets i the expasio. We apply the adaptive group Lasso to select ozero compoets, usig the group Lasso to obtai a iitial estimator ad reduce the dimesio of the problem. We give coditios uder which the group Lasso selects a model whose umber of compoets is comparable with the uderlyig model, ad the adaptive group Lasso selects the o-zero compoets correctly with probability approachig oe as the sample size icreases ad achieves the optimal rate of covergece. The results of Mote Carlo experimets show that the adaptive group Lasso procedure works well with samples of moderate size. A data example is used to illustrate the applicatio of the proposed method. KEY WORDS: adaptive group Lasso; compoet selectio; high-dimesioal data; oparametric regressio; selectio cosistecy. Short title. Noparametric compoet selectio AMS 2000 subject classificatio. Primary 62G08, 62G20; secodary 62G99 AOS R2 (December 7,

2 1 Itroductio Let (Y i, X i, i = 1,..., be radom vectors that are idepedetly ad idetically distributed as (Y, X, where Y is a respose variable ad X = (X 1,..., X p is a p-dimesioal covariate vector. Cosider the oparametric additive model p Y i = µ + f j (X ij + ε i, (1 j=1 where µ is a itercept term, X ij is the jth compoet of X i, the f j s are ukow fuctios, ad ε i is a uobserved radom variable with mea zero ad fiite variace σ 2. Suppose that some of the additive compoets f j are zero. The problem addressed i this paper is to distiguish the ozero compoets from the zero compoets ad estimate the ozero compoets. We allow the possibility that p is larger tha the sample size, which we represet by lettig p icrease as icreases. We propose a pealized method for variable selectio i (1 ad show that the proposed method ca correctly select the ozero compoets with high probability. There has bee much work o pealized methods for variable selectio ad estimatio with high-dimesioal data. Methods that have bee proposed iclude the bridge estimator (Frak ad Friedma 1993; Huag, Horowitz ad Ma 2008; least absolute shrikage ad selectio operator or Lasso (Tibshirai 1996, the smoothly clipped absolute deviatio (SCAD pealty (Fa ad Li 2001; Fa ad Peg 2004, ad the miimum cocave pealty (Zhag Much progress has bee made i uderstadig the statistical properties of these methods. I particular, may authors have studied the variable selectio, estimatio ad predictio properties of the Lasso i high-dimesioal settigs. See for example, Meishause ad Bühlma (2006; Zhao ad Yu (2006; Zou (2006; Buea, Tsybakov ad Wegkamp (2007; Meishause ad Yu (2008; Huag, Ma ad Zhag (2008; va de Geer (2008 ad Zhag ad Huag (2008, amog others. All these authors assume a liear or other parametric model. I may applicatios, however, there is little a priori justificatio for assumig that the effects of covariates take a liear form or belog to ay other kow, fiite-dimesioal parametric family. For example, i studies of ecoomic developmet, the effects of covariates o the growth of gross domestic product ca be oliear. Similarly, there is evidece of oliearity i the gee expressio data used i the empirical example i Sectio 5. There is a large body of literature o estimatio i oparametric additive models. For example, Stoe (1985, 1986 showed that additive splie estimators achieve the same optimal rate of covergece for a geeral fixed p as for p = 1. Horowitz ad Mamme (2004 ad Horowitz, 2

3 Klemelä, ad Mamme (2006 showed that if p is fixed ad mild regularity coditios hold, the oracle-efficiet estimates of the f j s ca be obtaied by a two-step procedure. Here, oracle efficiecy meas that the estimator of each f j has the same asymptotic distributio that it would have if all the other f j s were kow. However, these papers do ot discuss variable selectio i oparametric additive models. Zhag et al. (2004 ad Li ad Zhag (2006 have ivestigated the use of pealizatio methods i smoothig splie ANOVA with a fixed umber of covariates. Zhag et al. (2004 used a Lasso-type pealty but did ot ivestigate model-selectio cosistecy. Li ad Zhag (2006 proposed the compoet selectio ad smoothig operator (COSSO method for model selectio ad estimatio i multivariate oparametric regressio models. For fixed p, they showed that the COSSO estimator i the additive model coverges at the rate d/(2d+1, where d is the order of smoothess of the compoets. They also showed that, i the special case of a tesor product desig, the COSSO correctly selects the o-zero additive compoets with high probability. Zhag ad Li (2006 cosidered the COSSO for oparametric regressio i expoetial families. Meier, va de Geer, ad Bühlma (2008 treat variable selectio i a oparametric additive model i which the umbers of zero ad o-zero f j s may both be larger tha. They propose a pealized least-squares estimator for variable selectio ad estimatio. They give coditios uder which, with probability approachig 1, their procedure selects a set of f j s cotaiig all the additive compoets whose distace from zero i a certai metric exceeds a specified threshold. However, they do ot establish model-selectio cosistecy of their procedure. Eve asymptotically, the selected set may be larger tha the set of o-zero f j s. Moreover, they impose a compatibility coditio that relates the levels ad smoothess of the f j s. The compatibility coditio does ot have a straightforward, ituitive iterpretatio ad, as they poit out, caot be checked empirically. Ravikumar, Liu, Lafferty ad Wasserma (2009 proposed a pealized approach for variable selectio i oparametric additive models. I their approach, the pealty is imposed o the l 2 orm of the oparametric compoets, as well as the mea value of the compoets to esure idetifiability. I their theoretical results, they require that the eigevalues of a desig matrix be bouded away from zero ad ifiity, where the desig matrix is formed from the basis fuctios for the ozero compoets. It is ot clear whether this coditio holds i geeral, especially whe the umber of ozero compoets diverges with. Aother critical coditio required i the results of Ravikumar et al. (2009 is similar to the irrepresetable coditio of Zhao ad Yu (2007. It is ot clear for what type of basis fuctios this coditio is satisfied. We do ot require such a coditio i our results o selectio cosistecy of the adaptive group Lasso. 3

4 Several other recet papers have also cosidered variable selectio i oparametric models. For example, Wag, Che ad Li (2007 ad Wag ad Xia (2008 cosidered the use of group Lasso ad SCAD methods for model selectio ad estimatio i varyig coefficiet models with a fixed umber of coefficiets ad covariates. Bach (2007 applies what amouts to the group Lasso to a oparametric additive model with a fixed umber of covariates. He established model selectio cosistecy uder coditios that are cosiderably more complicated tha the oes we require for a possibly divergig umber of covariates. I this paper, we propose to use the adaptive group Lasso for variable selectio i (1 based o a splie approximatio to the oparametric compoets. With this approximatio, each oparametric compoet is represeted by a liear combiatio of splie basis fuctios. Cosequetly, the problem of compoet selectio becomes that of selectig the groups of coefficiets i the liear combiatios. It is atural to apply the group Lasso method, sice it is desirable to take ito the groupig structure i the approximatig model. To achieve model selectio cosistecy, we apply the group Lasso iteratively as follows. First, we use the group Lasso to obtai a iitial estimator ad reduce the dimesio of the problem. The we use the adaptive group Lasso to select the fial set of oparametric compoets. The adaptive group Lasso is a simple geeralizatio of the adaptive Lasso (Zou 2006 to the method of the group Lasso (Yua ad Li However, here we apply this approach to oparametric additive modelig. We assume that the umber of o-zero f j s is fixed. This eables us to achieve model selectio cosistecy uder simple assumptios that are easy to iterpret. We do ot have to impose compatibility or irrepresetable coditios, or do we eed to assume coditios o the eigevalues of certai matrices formed from the splie basis fuctios. We show that the group Lasso selects a model whose umber of compoets is bouded with probability approachig oe by a costat that is idepedet of the sample size. The, usig the group Lasso result as the iitial estimator, the adaptive group Lasso selects the correct model with probability approachig 1 ad achieves the optimal rate of covergece for oparametric estimatio of a additive model. The remaider of the paper is orgaized as follows. Sectio 2 describes the group Lasso ad the adaptive group Lasso for variable selectio i oparametric additive models. Sectio 3 presets the asymptotic properties of these methods i large p, small settigs. Sectio 4 presets the results of simulatio studies to evaluate the fiite-sample performace of these methods. Sectio 5 provides a illustrative applicatio, ad Sectio 6 icludes cocludig remarks. Proofs of the results stated i Sectio 3 are give i the Appedix. 4

5 2 Adaptive group Lasso i oparametric additive models We describe a two-step approach that uses the group Lasso for variable selectio based o a splie represetatio of each compoet i additive models. I the first step, we use the stadard group Lasso to achieve a iitial reductio of the dimesio i the model ad obtai a iitial estimator of the oparametric compoets. I the secod step, we use the adaptive group Lasso to achieve cosistet selectio. Suppose that each X j takes values i [a, b] where a < b are fiite umbers. To esure uique idetificatio of the f j s, we assume that Ef j (X j = 0, 1 j p. Let a = ξ 0 < ξ 1 < < ξ K < ξ K+1 = b be a partitio of [a, b] ito K subitervals I Kt = [ξ t, ξ t+1, t = 0,..., K 1 ad I KK = [ξ K, ξ K+1 ], where K K = v with 0 < v < 0.5 is a positive iteger such that max 1 k K+1 ξ k ξ k 1 = O( v. Let S be the space of polyomial splies of degree l 1 cosistig of fuctios s satisfyig: (i the restrictio of s to I Kt is a polyomial of degree l for 1 t K; (ii for l 2 ad 0 l l 2, s is l times cotiuously differetiable o [a, b]. This defiitio is phrased after Stoe (1985, which is a descriptive versio of Schumaker (1981, page 108, Defiitio 4.1. There exists a ormalized B-splie basis {φ k, 1 k m } for S, where m (Schumaker Thus for ay f j S, we ca write K + l m f j (x = β jk φ k (x, 1 j p. (2 k=1 Uder suitable smoothess assumptios, the f j s ca be well approximated by fuctios i S. Accordigly, the variable selectio method described i this paper is based o the represetatio (2. Let a 2 ( m j=1 a j 2 1/2 deote the l2 orm of ay vector a IR m. Let β j = (β j1,..., β jm ad β = (β 1,..., β p. Let w = (w 1,..., w p be a give vector of weights, where 0 w j, 1 j p. Cosider the pealized least squares criterio L (µ, β = i=1 [ Y i µ p m ] 2 p β jk φ k (X ij + λ w j β j 2, (3 j=1 k=1 j=1 where λ is a pealty parameter. We study the estimators that miimize L (µ, β subject to the costraits m β jk φ k (X ij = 0, 1 j p. (4 i=1 k=1 5

6 These ceterig costraits are sample aalogs of the idetifyig restrictio Ef j (X j = 0, 1 j p. We ca covert (3-(4 to a ucostraied optimizatio problem by ceterig the respose ad the basis fuctios. Let φ jk = 1 φ k (X ij, ψ jk (x = φ k (x φ jk. (5 i=1 For simplicity ad without causig cofusio, we simply write ψ k (x = ψ jk (x. Defie Z ij = ( ψ 1 (X ij,..., ψ m (X ij. So Z ij cosists of values of the (cetered basis fuctios at the ith observatio of the jth covariate. Let Z j = (Z 1j,..., Z j be the m desig matrix correspodig to the jth covariate. The total desig matrix is Z = (Z 1,..., Z p. Let Y = (Y 1 Y,..., Y Y. With this otatio, we ca write L (β ; λ = Y Zβ λ p w j β j 2. (6 Here we have dropped µ i the argumet of L. With the ceterig, µ = Y. The miimizig (3 subject to (4 is equivalet to miimizig (6 with respect to β, but the ceterig costraits are ot eeded for (6. We ow describe the two-step approach to compoet selectio i the oparametric additive model (1. Step 1. Compute the group Lasso estimator. Let j=1 L 1 (β, λ 1 = Y Zβ λ 1 p β j 2. This objective fuctio is the special case of (6 that is obtaied by settig w j = 1, 1 j p. The group Lasso estimator is β β (λ 1 = arg mi β L 1 (β ; λ 1. Step 2. Use the group Lasso estimator β to obtai the weights by settig w j = j=1 { βj 1 2, if β j 2 > 0,, if β j 2 = 0. 6

7 The adaptive group Lasso objective fuctio is L 2 (β ; λ 2 = Y Zβ λ 2 p w j β j 2. Here we defie 0 = 0. Thus the compoets ot selected by the group Lasso are ot icluded i Step 2. The adaptive group Lasso estimator is β β (λ 2 = arg mi β L 2 (β ; λ 2. Fially, the adaptive group Lasso estimators of µ ad f j are 3 Mai results µ = Y 1 Y i, fj (x = i=1 m k=1 j=1 β jk ψ k (x, 1 j p. This sectio presets our results o the asymptotic properties of the estimators defied i Steps 1 ad 2 of Sectio 2. Let k be a o-egative iteger, ad let α (0, 1] be such that d = k + α > 0.5. Let F be the class of fuctios f o [0, 1] whose kth derivative f (k exists ad satisfies a Lipschitz coditio of order α: f (k (s f (k (t C s t α for s, t [a, b]. I (1, without loss of geerality, suppose that the first q compoets are ozero, that is, f j (x 0, 1 j q, but f j (x 0, q + 1 j p. Let A 1 = {1,..., q} ad A 0 = {q + 1,..., p}. Defie f 2 = [ b a f 2 (xdx] 1/2 for ay fuctio f, wheever the itegral exists. We make the followig assumptios. (A1 The umber of ozero compoets q is fixed ad there is a costat c f > 0 such that mi 1 j q f j 2 c f. (A2 The radom variables ε 1,..., ε are idepedet ad idetically distributed with Eε i = 0 ad Var(ε i = σ 2. Furthermore, their tail probabilities satisfy P ( ε i > x K exp( Cx 2, i = 1,...,, for all x 0 ad for costats C ad K. (A3 Ef j (X j = 0 ad f j F, j = 1,..., q. (A4 The covariate vector X has a cotiuous desity ad there exist costats C 1 ad C 2 such that the desity fuctio g j of X j satisfies 0 < C 1 g j (x C 2 < o [a, b] for every 1 j p. We ote that (A1, (A3, ad (A4 are stadard coditios for oparametric additive models. They would be eeded to estimate the ozero additive compoets at the optimal l 2 rate of 7

8 covergece o [a, b], eve if q were fixed ad kow. Oly (A2 stregthes the assumptios eeded for oparametric estimatio of a oparametric additive model. While coditio (A1 is reasoable i most applicatios, it would be iterestig to relax this coditio ad ivestigate the case whe the umber of ozero compoets ca also icrease with the sample size. The oly techical reaso that we assume this coditio is related to Lemma 3 give i the appedix, which is cocered with the properties of the smallest ad largest eigevalues of the desig matrix formed from the splie basis fuctios. If this lemma ca be exteded to the case of a diverget umber of compoets, the (A1 ca be relaxed. However, it is clear that there eeds to be restrictio o the umber of ozero compoets to esure model idetificatio. 3.1 Estimatio cosistecy of the group Lasso I this sectio, we cosider the selectio ad estimatio properties of the group Lasso estimator. Defie Ã1 = {j : β j 2 0, 1 j p}. Let A deote the cardiality of ay set A {1,..., p}. Theorem 1 Suppose that (A1 to (A4 hold ad λ 1 C log(pm for a sufficietly large costat C. (i With probability covergig to 1, Ã1 M 1 A 1 = M 1 q for a fiite costat M 1 > 1. (ii If m 2 log(pm / 0 ad (λ 2 1m / 2 0 as, the all the ozero β j, 1 j q, are selected with probability covergig to oe. (iii p j=1 ( m β j β j = O log(pm p ( m ( 1 + O p + O + O m 2d 1 ( 4m 2 λ 2 1 Part (i of Theorem 1 says that, with probability approachig 1, the group Lasso selects a model whose dimesio is a costat multiple of the umber of o-zero additive compoets f j, regardless of the umber of additive compoets that are zero. Part (ii implies that every ozero coefficiet will be selected with high probability. Part (iii shows that the differece betwee the coefficiets i the splie represetatio of the oparametric fuctios i (1 ad their estimators coverges to zero i probability. The rate of covergece is determied by four terms: the stochastic error i estimatig the oparametric compoets (the first term ad the itercept µ (the secod term, the splie approximatio error (the third term ad the bias due to pealizatio (the fourth term. Let f j (x = m j=1 β jk ψ(x, 1 j p. The followig theorem is a cosequece of Theorem

9 Theorem 2 Suppose that (A1 to (A4 hold ad that λ 1 C log(pm for a sufficietly large costat C. The, (i Let Ãf = {j : f j 2 > 0, 1 j p}. There is a costat M 1 > 1 such that, with probability covergig to 1, Ãf M 1 q. (ii If (m log(pm / 0 ad (λ 2 1m / 2 0 as, the all the ozero additive compoets f j, 1 j q, are selected with probability covergig to oe. (iii f ( j f j 2 m log(pm ( 1 ( 1 2 = O p + O p + O + O m 2d where Ã2 = A 1 Ã1. ( 4m λ 2 1 2, j Ã2, Thus uder the coditios of Theorem 2, the group Lasso selects all the ozero additive compoets with high probability. Part (iii of the theorem gives the rate of covergece of the group Lasso estimator of the oparametric compoets. For ay two sequeces {a, b, = 1, 2,..., }, we write a b if there are costats 0 < c 1 < c 2 < such that c 1 a /b c 2 for all sufficietly large. We ow state a useful corollary of Theorem 2. Corollary 1 Suppose that (A1 to (A4 hold. If λ 1 log(pm ad m 1/(2d+1, the, (i If 2d/(2d+1 log(p 0 as, the with probability covergig to oe, all the ozero compoets f j, 1 j q, are selected ad the umber of selected compoets is o more tha M 1 q. (ii f j f j 2 2 = O p ( 2d/(2d+1 log(pm, j Ã2. For the λ 1 ad m give i Corollary 1, the umber of zero compoets ca be as large as exp(o( 2d/(2d+1. For example, if each f j has cotiuous secod derivative (d = 2, the it is exp(o( 4/5, which ca be much larger tha. 3.2 Selectio cosistecy of the adaptive group Lasso We ow cosider the properties of the adaptive group Lasso. We first state a geeral result cocerig the selectio cosistecy of the adaptive group Lasso, assumig a iitial cosistet estimator is available. We the apply to the case whe the group Lasso is used as the iitial estimator. We make the followig assumptios. 9

10 (B1 The iitial estimators β j are r -cosistet at zero: ad there exists a costat c b > 0 such that r max j A 0 β j 2 = O P (1, r, where b 1 = mi j A1 β j 2. P(mi j A 1 β j 2 c b b 1 1, (B2 Let q be the umber of ozero compoets ad s = p q be the umber of zero compoets. Suppose that (a (b 1/2 log 1/2 (s m λ 2 r + m + λ 2m 1/4 1/2 λ 2 r m (2d+1/2 = o(1, = o(1. We state coditio (B1 for a geeral iitial estimator, to highlight the poit that the availability of a r -cosistet estimator at zero is crucial for the adaptive group Lasso to be selectio cosistet. I other words, ay iitial estimator satisfyig (B1 will esure that the adaptive group Lasso (based o this iitial estimator is selectio cosistet, provided that certai regularity coditios are satisfied. We ote that it follows immediately from Theorem 1 that the group Lasso estimator satisfies (B1. We will come back to this poit below. For β ( β 1,..., β p ad β (β 1,..., β p, we say β = 0 β if sg 0 ( β j = sg 0 ( β j, 1 j p, where sg 0 ( x = 1 if x > 0 ad = 0 if x = 0. Theorem 3 Suppose that coditios (B1, (B2 ad (A1-(A4 hold. The (i P( β = 0 β 1. (ii q j=1 ( m β j β j = O p ( m ( 1 + O p + O + O m 2d 1 ( 4m 2 λ This theorem is cocered with the selectio ad estimatio properties of the adaptive group Lasso i terms of β. The followig theorem states the results i terms of the estimators of the oparametric compoets. 10

11 Theorem 4 Suppose that coditios (B1, (B2 ad (A1-(A4 hold. The (i P ( f j 2 > 0, j A 1 ad f j 2 = 0, j A 0 1. (ii q j=1 f ( j f j 2 m ( 1 ( 1 2 = O p + O p + O + O m 2d ( 4m λ Part (i of this theorem states that the adaptive group Lasso ca cosistetly distiguish ozero compoets from zero compoets. Part (ii gives a upper boud o the rate of covergece of the estimator. We ow apply the above results to our proposed procedure described i Sectio 2, i which we first obtai the the group Lasso estimator ad the use it as the iitial estimator i the adaptive group Lasso. By Theorem 1, if λ 1 log(pm ad m 1/(2d+1 for d 1, the the group Lasso estimator satisfies (B1 with r d/(2d+1 / log(pm. I this case, (B2 simplifies to λ 2 = o(1 ad 1/(4d+2 log 1/2 (pm = o(1. (7 (8d+3/(8d+4 λ 2 We summarize the above discussio i the followig corollary. Corollary 2 Let the group Lasso estimator β β (λ 1 with λ 1 log(pm ad m 1/(2d+1 be the iitial estimator i the adaptive group Lasso. Suppose that the coditios of Theorem 1 hold. If λ 2 O( 1/2 ad satisfies (7, the the adaptive group Lasso cosistetly selects the ozero compoets i (1, that is, part (i of Theorem 4 holds. I additio, q f ( j f j 2 2 = O p 2d/(2d+1. j=1 This corollary follows directly from Theorems 1 ad 4. The largest λ 2 allowed is λ 2 = O( 1/2. With this λ 2, the first equatio i (6 is satisfied. Substitute it ito the secod equatio i (6, we obtai p = exp(o( 2d/(2d+1, which is the largest p permitted ad ca be larger tha. Thus, uder the coditios of this corollary, our proposed adaptive group Lasso estimator usig the group Lasso as the iitial estimator is selectio cosistet ad achieves optimal rate of covergece eve whe p is larger tha. Followig model selectio, oracle-efficiet, asymptotically ormal estimators of the o-zero compoets ca be obtaied by usig existig methods. 11

12 4 Simulatio studies We use simulatio to evaluate the performace of the adaptive group Lasso with regard to variable selectio. The geeratig model is y i = f(x i + ε i p f j (x ij + ε i, i = 1,,. (8 j=1 Sice p ca be larger tha, we cosider two ways to select the pealty parameter, the BIC (Schwarz 1978 ad the EBIC (Che ad Che 2008, The BIC is defied as BIC(λ = log(rss λ + df λ log. Here RSS λ is the residual sum of squares for a give λ, ad the degrees of freedom df λ = ˆq λ m, where ˆq λ is the umber of ozero estimated compoets for the give λ. The EBIC is defied as EBIC(λ = log(rss λ + df λ log where 0 ν 1 is a costat. We use ν = ν df λ log p, We have also cosidered two other possible ways of defiig df: (a usig the trace of a liear smoother based o a quadratic approximatio; (b usig the umber of estimated ozero compoets. We have decided to use the defiitio give above based o the results from our simulatios. We ote that the df for the group Lasso of Yua ad Li (2006 requires a iitial (least squares estimator, which is ot available whe p >. Thus their df is ot applicable to our problem. I our simulatio example, we compare the adaptive group Lasso with the group Lasso ad ordiary Lasso. Here the ordiary Lasso estimator is defied as the value that miimizes Y Zβ λ p m β jk. j=1 k=1 This simple applicatio of the Lasso does ot take ito accout the groupig structure i the splie expasios of the compoets. The group Lasso ad the adaptive group Lasso estimates are computed usig the algorithm proposed by Yua ad Li (2006. The ordiary Lasso estimates are computed usig the Lars algorithms (Efro et al The group Lasso is used as the iitial estimate for the adaptive group Lasso. We also compare the results from the oparametric additive modelig with those from the 12

13 stadard liear regressio model with Lasso. We ote that this is ot a fair compariso because the geeratig model is highly oliear. Our purpose is to illustrate that it is ecessary to use oparametric models whe the uderlyig model deviates substatially from liear models i the cotext of variable selectio with high-dimesioal data ad that model misspecificatio ca lead to bad selectio results. Example 1. We geerate data from the model y i = f(x i + ε i p f j (x ij + ε i, i = 1,,, j=1 where f 1 (t = 5t, f 2 (t = 3(2t 1 2, f 3 (t = 4si(2πt/(2 si(2πt, f 4 (t = 6(0.1 si(2πt cos(2πt si(2πt cos(2πt si(2πt 3, ad f 5 (t = = f p (t = 0. Thus the umber of ozero fuctios is q = 4. This geeratig model is the same as Example 1 of Li ad Zhag (2006. However, here we use this model i high-dimesioal settigs. We cosider the cases where p = 1000 ad three differet sample sizes: = 50, 100 ad 200. We use the cubic B-splie with six evely distributed kots for all the fuctios f k. The umber of replicatios i all the simulatios is 400. The covariates are simulated as follows. First, we geerate w i1,, w ip, u i, u i, v i idepedetly from N(0, 1 trucated to the iterval [0, 1], i = 1,,. The we set x ik = (w ik + tu i /(1 + t for k = 1,, 4 ad x ik = (w ik + tv i /(1 + t for k = 5,, p, where the parameter t cotrols the amout of correlatio amog predictors. We have Corr(x ik, x ij = t 2 /(1 + t 2, 1 j 4, 1 k 4 ad Corr(x ik, x ij = t 2 /(1 + t 2, 4 j p, 4 k p, but the covariates of the ozero compoets ad zero compoets are idepedet. We cosider t = 0, 1 i our simulatio. The sigal to oise ratio is defied to be sd(f/sd(ɛ. The error term is chose to be ɛ i N(0, to give a sigal-to-oise ratio (SNR 3.11 : 1. This value is the same as the estimated SNR i the real data example below, which is the square root of the ratio of the sum of estimated compoets squared divided by the sum of residual squared. The results of 400 Mote Carlo replicatios are summarized i Table 1. The colums are the mea umber of variables selected (NV, model error (ER, the percetage of replicatios i which all the correct additive compoets are icluded i the selected model (IN, ad the percetage of replicatios i which precisely the correct compoets are selected (CS. The correspodig stadard errors are i paretheses. The model error is computed as the average of 1 i=1 [ ˆf(x i f(x i ] 2 over the 400 Mote Carlo replicatios, where f is the true coditioal mea fuctio. Table 1 shows that the adaptive group Lasso selects all the o-zero compoets (IN ad selects exactly the correct model (CS more frequetly tha the other methods do. For example, 13

14 with the BIC ad = 200, the percetage of correct selectios (CS by the adaptive group Lasso rages from 65.25% to 81%, which is much higher tha the rages % for the group Lasso ad % for the ordiary Lasso. The adaptive group Lasso ad group Lasso perform better tha the ordiary Lasso i all of the experimets, which illustrates the importace of takig accout of the group structure of the coefficiets of the splie expasio. Correlatio amog covariates icreases the difficulty of compoet selectio, so it is ot surprisig that all methods perform better with idepedet covariates tha with correlated oes. The percetage of correct selectios icreases as the sample size icreases. The liear model with Lasso ever selects the correct model. This illustrates the poor results that ca be produced by a liear model whe the true coditioal mea fuctio is oliear. Table 1 also shows that the model error (ME of the group Lasso is oly slightly larger tha that of the adaptive group Lasso. The models selected by the group Lasso est ad, therefore, have more estimated coefficiets tha the models selected by the adaptive group Lasso. Therefore, the group Lasso estimators of the coditioal mea fuctio have a larger variace ad larger ME. The differeces betwee the MEs of the two methods are small, however, because as ca be see from the NV colum, the models selected by the group Lasso i our experimets have oly slightly more estimated coefficiets tha the models selected by the adaptive group Lasso. Example 2. We ow compare the adaptive group Lasso with the COSSO (Li ad Zhag This compariso is suggested to us by the Associate Editor. Because the COSSO algorithm oly works for the case whe p is smaller tha, we use the same set-up as i Example 1 of Li ad Zhag (2006. I this example, the geeratig model is as i (8 with 4 ozero compoets. Let X j = (W j + tu/(1 + t, j = 1,, p, where W 1,, W p ad U are i.i.d. from N(0, 1, trucated to the iterval [0, 1]. Therefore corr(x j, X k = t 2 /(1 + t 2 for j k. The radom error term ɛ N(0, The SNR is 3:1. We cosider three differet sample sizes = 50, 100 or 200 ad three differet umber of predictors p = 10, 20 or 50. The COSSO estimator is computed usig the Matlab software which is publicly available at hzhag/cosso.html. The COSSO procedure uses either geeralized cross validatio or 5-fold cross validatio. Based the simulatio results of Li ad Zhag (2006 ad our ow simulatios, the COSSO with 5-fold cross validatio has better selectio performace. Thus we compare the adaptive group Lasso with BIC or EBIC with the COSSO with 5-fold cross validatio. The results are give i Table 2. For idepedet predictors, whe = 200 ad p = 10, 20 or 50, the adaptive group Lasso ad COSSO have similar performace i terms of selectio accuracy ad model error. However, for smaller ad larger p, the adaptive group Lasso does sigificatly better. For example, for = 100 ad p = 50, the percetage of correct selectio for the adaptive group Lasso is 81-83%, 14

15 but it is oly 11% for the COSSO. The model error of the adaptive group Lasso is similar to or smaller tha that of the COSSO. I several experimets, the model error of the COSSO is 2 to more tha 7 times larger tha that of the adaptive group Lasso. It is iterestig to ote that whe = 50 ad p = 20 or 50, the adaptive group Lasso still does a descet job i selectig the correct model, but the COSSO does poorly i these two cases. I particular, for = 50 ad p = 50, the COSSO did ot select the exact correct model i all the simulatio rus. For depedet predictors, the compariso is eve mode favorable to the adaptive group Lasso, which performs sigificatly better tha COSSO i terms of both model error ad selectio accuracy i all the cases. 5 Data example We use the data set reported i Scheetz et al. (2006 to illustrate the applicatio of the proposed method i high-dimesioal settigs. For this data set, 120 twelve-week-old male rats were selected for tissue harvestig from the eyes ad for microarray aalysis. The microarrays used to aalyze the RNA from the eyes of these aimals cotai over 31,042 differet probe sets (Affymetric GeeChip Rat Geome Array. The itesity values were ormalized usig the robust multi-chip averagig method (Irizzary et al method to obtai summary expressio values for each probe set. Gee expressio levels were aalyzed o a logarithmic scale. We are iterested i fidig the gees that are related to the gee TRIM32. This gee was recetly foud to cause Bardet-Biedl sydrome (Chiag et al. (2006, which is a geetically heterogeeous disease of multiple orga systems icludig the retia. Although over 30,000 probe sets are represeted o the Rat Geome Array, may of them are ot expressed i the eye tissue ad iitial screeig usig correlatio shows that most probe sets have very low correlatio with TRIM32. I additio, we are expectig oly a small umber of gees to be related to TRIM32. Therefore, we use 500 probe sets that are expressed i the eye ad have highest margial correlatio i the aalysis. Thus the sample size is = 120 (i.e., there are 120 arrays from 120 rats ad p = 500. It is expected that oly a few gees are related to TRIM32. Therefore this is a sparse, high-dimesioal regressio problem. We use the oparametric additive model to model the relatio betwee the expressio of TRIM32 ad those of the 500 gees. We estimate model (1 usig the ordiary Lasso, group Lasso, ad adaptive group Lasso for the oparametric additive model. To compare the results of the oparametric additive model with that of the liear regressio model, we also aalyzed the data usig the liear regressio model with Lasso. We scale the covariates so that their values are 15

16 betwee 0 ad 1 ad use cubic splies with six evely distributed kots to estimate the additive compoets. The pealty parameters i all the methods are chose usig the BIC or EBIC as i the simulatio study. Table 3 lists the probes selected by the group Lasso ad the adaptive group Lasso, idicated by the check sigs. Table 4 shows the umber of variables, the residual sums of squares obtaied with each estimatio method. For the ordiary Lasso with the splie expasio, a variable is cosidered to be selected if ay of the estimated coefficiets of the splie approximatio to its additive compoet are o-zero. Depedig o whether BIC or EBIC is used, the group Lasso selects variables, the adaptive group Lasso selects 15 variables ad the ordiary Lasso with the splie expasio selects variables, the liear model selects 8-14 variables. Table 4 shows that the adaptive group Lasso does better tha the other methods i terms of residual sum of squares (RSS. We have also examied the plots (ot show of the estimated additive compoets obtaied with the group Lasso ad the adaptive group Lasso, respectively. Most are highly oliear, cofirmig the eed for takig ito accout oliearity. I order to evaluate the performace of the methods, we use cross-validatio ad compare the predictio mea square errors (PEs. We radomly partitio the data ito 6 subsets, each set cosistig of 20 observatios. We the fit the model with 5 subsets as traiig set ad calculate the PE for the remaiig set which we cosider as test set. We repeat this process 6 times, cosiderig oe of the 6 subsets as test set every time. We compute the average of the umbers of probes selected ad the predictio errors of these 6 calculatios. The we replicate this process 400 times (this is suggested to us by the Associate Editor. Table 5 gives the average values over 400 replicatios. The adaptive group Lasso has smaller average predictio error tha the group Lasso, the ordiary Lasso ad the liear regressio with Lasso. The ordiary Lasso selects far more probe sets tha the other approaches, but this does ot lead to better predictio performace. Therefore, i this example, the adaptive group Lasso provides the ivestigator a more targeted list of probe sets, which ca serve as a startig poit for further study. It is of iterest to compare the selectio results from the adaptive group Lasso ad the liear regressio model with Lasso. The adaptive group Lasso ad the liear model with Lasso select differet sets of gees. Whe the pealty parameter is chose with the BIC, the adaptive group Lasso selects 5 gees that are ot selected by the liear model with Lasso. I additio, the liear model with Lasso selects 5 gees that are ot selected by the adaptive group Lasso. Whe the pealty parameter is selected with the EBIC, the adaptive group Lasso selects 10 gees that are ot selected by the liear model with Lasso. The estimated effects of may of the gees are oliear, ad the Mote Carlo results of Sectio 4 show that the performace of the liear model with Lasso ca be very poor i the presece of oliearity. Therefore, we iterpret the differeces betwee 16

17 the gee selectios of the adaptive group Lasso ad the liear model with Lasso as evidece that the selectios produced by the liear model are misleadig. 6 Cocludig remarks I this paper, we propose to use the adaptive group Lasso for variable selectio i oparametric additive models i sparse, high-dimesioal settigs. A key requiremet for the adaptive group Lasso to be selectio cosistet is that the iitial estimator is estimatio cosistet ad selects all the importat compoets with high probability. I low-dimesioal settigs, fidig a iitial cosistet estimator is relatively easy ad ca be achieved by may well established approaches such as the additive splie estimators. However, i high-dimesioal settigs, fidig a iitial cosistet estimator is difficult. Uder the coditios stated i Theorem 1, the group Lasso is show to be cosistet ad selects all the importat compoets. Thus the group Lasso ca be used as the iitial estimator i the adaptive Lasso to achieve selectio cosistecy. Followig model selectio, oracle-efficiet, asymptotically ormal estimators of the o-zero compoets ca be obtaied by usig existig methods. Our simulatio results idicate that our procedure works well for variable selectio i the models cosidered. Therefore, the adaptive group Lasso is a useful approach for variable selectio ad estimatio i sparse, high-dimesioal oparametric additive models. Our theoretical results are cocered with a fixed sequece of pealty parameters, which are ot applicable to the case where the pealty parameters are selected based o data drive procedures such as the BIC. This is a importat ad challegig problem that deserves further ivestigatio, but is beyod the scope of this paper. We have oly cosidered liear oparametric additive models. The adaptive group Lasso ca be applied to geeralized oparametric additive models, such as the geeralized logistic oparametric additive model ad other oparametric models with high-dimesioal data. However, more work is eeded to uderstad the properties of this approach i those more complicated models. Ackowledgemets. The authors wish to thak the Editor, Associate Editor ad two aoymous referees for their helpful commets. Jia Huag is supported i part by NIH grat CA ad NSF grat DMS Joel L. Horowitz was supported i part by NSF Grat grat SES

18 7 Appedix: Proofs We first prove the followig lemmas. Deote the cetered versios of S by S 0 j = { f j : f j (x = m k=1 b jk ψ k (x, (β j1,..., β jm IR m where ψ k s are the cetered splie bases defied i (5. }, 1 j p, Lemma 1 Suppose that f F ad Ef(X j = 0. The, uder (A3 ad (A4, there exists a f S 0 j satisfyig f f 2 = O p (m d + m 1/2 1/2. I particular, if we choose m = O( 1/(2d+1, the f f 2 = O p (m d = O p ( d/(2d+1. Proof of Lemma 1. By (A4, for f F, there is a f S such that f f 2 = O(m d. Let f = f 1 i=1 f (X ij. The f S 0 j ad f f f f + P f, where P is the empirical measure of iid radom variables X 1j,..., X j. Cosider P f = (P P f + P (f f. Here we use the liear fuctioal otatio, i.e., P f = fdp, where P is the probability measure of X 1j. For ay ε > 0, the bracketig umber N [] (ε, S 0 j, L 2 (P of S 0 j satisfies log N [] (ε, S 0 j, L 2 (P c 1 m log(1/ε for some costat c 1 > 0 (She ad Wog 1994, page 597. Thus by the maximal iequality, see e.g. Va der Vaart (1998, page 288, (P P f = O p ( 1/2 m 1/2. By (A4, P (f f C 2 f f 2 = O(m d for some costat C 2 > 0. The lemma follows from the triagle iequality. Lemma 2 Suppose that coditios (A2 ad (A4 hold. Let T jk = 1/2 m 1/2 ψ k (X ij ε i, 1 j p, 1 k m, i=1 ad T = max 1 j p,1 k m T jk. The E(T C 1 1/2 m 1/2 log(pm ( 2C2 m 1 log(pm + 4 log(2pm + C 2 m 1 1/2. 18

19 where C 1 ad C 2 are two positive costats. I particular, whe m log(pm / 0, E(T = O(1 log(pm. Proof of Lemma 2. Let s 2 jk = i=1 ψ2 k (X ij. Coditioal o X ij s, T jk s are subgaussia. Let s 2 = max 1 j p,1 k m s 2 jk. By (A2 ad the maximal iequality for subgaussia radom variables (Va der Vaart ad Weller 1996, Lemmas ad 2.2.2, E ( max T jk {Xij, 1 i, 1 j p} C 1 1/2 m 1/2 s log(pm. 1 j p,1 k m Therefore, E ( max T jk C 1 1/2 m 1/2 log(pm E(s, (9 1 j p,1 k m where C 1 > 0 is a costat. By (A4 ad the properties of B-splies, ψ k (X ij φ k (X ij + φ jk 2 ad E(ψ k (X ij 2 C 2 m 1, (10 for a costat C 2 > 0, for every 1 j p ad 1 k m. By (10, ad i=1 E[ψ 2 k(x ij Eψ 2 k(x ij ] 2 4C 2 m 1, (11 max 1 j p,1 k m By Lemma A.1 of Va de Geer (2008, (10 ad (11 imply i=1 i=1 ( E max {ψ 2 1 j p,1 k m k(x ij Eψk(X 2 ij } Therefore, by (12 ad the triagle iequality, Eψ 2 k(x ij C 2 m 1. (12 2C 2 m 1 log(pm + 4 log(2pm. Now sice Es (Es 2 1/2, we have Es 2 2C 2 m 1 log(pm + 4 log(2pm + C 2 m 1. Es ( 2C2 m 1 1/2. log(pm + 4 log(2pm + C 2 m 1 (13 19

20 The lemma follows from (9 ad (13. Deote β A = (β j, j A ad Z A = (Z j, j A. Here β A is a A m 1 vector ad Z A is a A m matrix. Let C A = Z A Z A/. Whe A = {1,..., p}, we simply write C = Z Z/. Let ρ mi (C A ad ρ max (C A be the miimum ad maximum eigevalues of C A, respectively. Lemma 3 Let m = O( γ where 0 < γ < 0.5. Suppose that A is bouded by a fixed costat idepedet of ad p. Let h h m 1. The, uder (A3 ad (A4, with probability covergig to oe, where c 1 ad c 2 are two positive costats. c 1 h ρ mi (C A ρ max (C A c 2 h, Proof of Lemma 3. Without loss of geerality, suppose A = {1,..., k}. The Z A = (Z 1,..., Z q. Let b = (b 1,..., b q, where b j R m. By Lemma 3 of Stoe (1985, Z 1 b Z q b q 2 c 3 ( Z 1 b Z q b q 2 for a certai costat c 3 > 0. By the triagle iequality, Z 1 b Z q b q 2 Z 1 b Z q b q 2. Sice Z A b = Z 1 b Z q b q, the above two iequalities imply that c 3 ( Z 1 b Z q b q 2 Z A b 2 Z 1 b Z q b q 2. Therefore, c 2 3( Z 1 b Z q b q 2 2 Z A b 2 2 2( Z 1 b Z q b q 2 2. (14 Let C j = 1 Z jz j. By Lemma 6.2 of Zhou, She ad Wolf (1998, c 4 h ρ mi (C j ρ max (C j c 5 h, j A. (15 Sice C A = 1 Z A Z A, it follows from (14 that c 2 3 ( b 1 C 1 b b qc q b q b C A b 2 ( b 1C 1 b b qc q b q. 20

21 Therefore, by (15, b 1C 1 b 1 b b qc q b q b 2 2 = b 1C 1 b 1 b b b b qc q b q b q 2 2 b q 2 2 b 2 2 ρ mi (C 1 b b 2 2 c 4 h. + + ρ mi (C q b q 2 2 b 2 2 Similarly, Thus we have The lemma follows. b 1C 1 b 1 b b qc q b q b 2 2 c 2 3c 4 h b C A b b b 2c 5h. c 5 h. Proof of Theorem 1. The proof of parts (i ad (ii essetially follows the proof of Theorem 2.1 of Wei ad Huag (2008. The oly chage that must be made here is that we eed to cosider the approximatio error of the regressio fuctios by splies. Specifically, let ξ = ε + δ, where δ = (δ 1,..., δ with δ i = q j=1 (f 0j(X ij f j (X ij. Sice f 0j f j 2 = O(m d = O( d/(2d+1 for m = 1/(2d+1, we have for some costat C 1 > 0. For ay iteger t, let δ 2 C 1 qm 2d = C 1 q 1/(4d+2, χ t = max A =t max U Ak 2 =1,1 k t ξ V A (s V A (s 2 ad χ t = max A =t max U Ak 2 =1,1 k t ε V A (s V A (s 2 where V A (S A = ξ (Z A (Z A Z A 1 SA (I P A Xβ for N(A = q 1 = m 0, S A = (S A 1,, S A m, S Ak = λ d Ak U Ak ad U Ak 2 = 1. For a sufficietly large costat C 2 > 0, defie Ω t0 = {(Z, ε : x t σc 2 ((t 1m log(pm, t t 0 }, ad Ω t 0 = {(Z, ε : x t σc 2 (t 1m log(pm, t t 0 }, where t

22 As i the proof of Theorem 2.1 of Wei ad Huag (2008, (Z, ε Ω q Ã1 M 1 q, for a costat M 1 > 1. By the triagle ad Cauchy-Schwarz iequalities, ξ V A (s V A (s 2 = ε V A (s + δ V A (s V A (s 2 ε V A (s V A 2 + δ. (16 I the proof of Theorem 2.1 of Wei ad Huag (2008, it is show that P(Ω p 1+c 0 ( 2p exp 1. (17 p 1+c 0 Sice δ V A (s V A (s 2 δ 2 = C 1 q 1 2(2d+1 ad m = O( 1/(2d+1, we have for all t 0 ad sufficietly large, δ 2 C 1 q 1 2(2d+1 σc2 (t 1m log(p. (18 It follows from (16, (17 ad (18 that P(Ω 0 1. This completes the proof of part (i of Theorem 1. Before provig part (ii, we first prove part (iii of Theorem 1. By the defiitio of β ( β 1,..., β p, p p Y Z β λ 1 β j 2 Y Zβ λ 1 β j 2. (19 j=1 j=1 Let A 2 = {j : β j 2 0 or β j 2 0} ad d 2 = A 2. By part (i, d 2 = O p (q. By (19 ad the defiitio of A 2, Y Z A2 βa λ 1 β j 2 Y Z A2 β A λ 1 β j 2. (20 j A 2 j A 2 Let η = Y Zβ. Write Y Z A2 βa2 = Y Zβ Z A2 ( β A2 β A2 = η Z A2 ( β A2 β A2. 22

23 We have Y Z A2 βa2 2 2 = Z A2 ( β A2 β A η Z A2 ( β A2 β A2 + η η. We ca rewrite (20 as Z A2 ( β A2 β A η Z A2 ( β A2 β A2 λ 1 β j 2 λ 1 β j 2. (21 j A 1 j A 1 Now β j 2 β j 2 A 1 β A1 β A1 2 A 1 β A2 β A2 2. (22 j A 1 j A 1 Let ν = Z A2 ( β A2 β A2. Combiig (20, (21 ad (22 to get ν 2 2 2η ν λ 1 A1 β A2 β A2 2. (23 Let η be the projectio of η to the spa of Z A2, that is, η = Z A2 (Z A 2 Z A2 1 Z A 2 η. By the Cauchy-Schwartz iequality, 2 η ν 2 η 2 ν 2 2 η ν 2 2. (24 From (23 ad (24, we have ν η λ 1 A1 β A2 β A2 2. Let c be the smallest eigevalue of Z A 2 Z A2 /. By Lemma 3 ad part (i, c p m 1. Sice ν 2 2 c β A2 β A2 2 2 ad 2ab a 2 + b 2, It follows that c β A2 β A η (2λ 1 A c 2 c β A2 β A β A2 β A η λ2 1 A 1. (25 c 2 c 2 Let f 0 (X i = p j=1 f 0j(X ij ad f 0A (X i = j A f 0j(X ij. Write η i = Y i µ f 0 (X i + (µ Y + f 0 (X i j A 2 Z ijβ j = ε i + (µ Y + f A2 (X i f A2 (X i. 23

24 Sice µ Y 2 = O p ( 1 ad f 0j f j = O(m d, we have η ε O p (1 + O(d 2 m 2d, (26 where ε is the projectio of ε = (ε 1,..., ε to the spa of Z A2. We have Now max Z Aε 2 2 = A: A d 2 ε 2 2 = (Z A 2 Z A2 1/2 Z A 2 ε c Z A 2 ε 2 2. max Z jε 2 2 d 2 m max Z A: A d 2 1 j p,1 k m jkε 2, j A where Z jk = (ψ k (X 1j,..., ψ k (X j. By Lemma 2, max Z 1 j p,1 k m jkε 2 = m 1 max (m / 1/2 Z 1 j p,1 k m jkε 2 = O p (1m 1 log(pm. It follows that, Combiig (25, (26, ad (27, we get ε 2 2 = O p (1 d 2 log(pm c. (27 ( β A2 β A2 2 d2 log(pm ( 1 ( d2 m 2d 2 O p + O c 2 p + O c c Sice d 2 = O p (q, c p m 1 ad c p m 1, we have ( m β A2 β A O log(pm p This completes the proof of part (iii. ( m ( 1 + O p + O + O m 2d 1 + 4λ2 1 A 1. 2 c 2 ( 4m 2 λ 2 1 We ow prove part (ii. Sice f j 2 c f > 0, 1 j q, f j f j 2 = O(m d ad f j 2 f j 2 f j f j 2, we have f j 2 0.5c f for sufficietly large. By a result of de Boor (2001, see also (12 of Stoe (1986, there are positive costats c 6 ad c 7 such that 2. c 6 m 1 β 2 2 f j 2 2 c 7 m 1 β j 2 2. It follows that, β j 2 2 c 1 7 m f j c 1 7 c 2 f m. Therefore, if β j 2 0 but β j 2 = 0, the β j β j c 1 7 c 2 fm. (28 24

25 However, sice (m log(pm / 0 ad (λ 2 1m / 2, (28 cotradicts part (iii. Proof of Theorem 2. By the defiitio of f j, 1 j p, parts (i ad (ii follow from parts (i ad (ii of Theorem 1 directly. Now cosider part (iii. By the properties of splie (de Boor (2001, c 6 m 1 β j β j 2 2 f j f j 2 2 c 7 m 1 β j β j 2 2. Thus By (A3, f ( j f j 2 m log(pm ( 1 ( 1 2 = O p + O p + O + O m 2d ( 4m λ (29 f j f j 2 2 = O(m 2d. (30 Part (iii follows from (29 ad (30. I the proofs below, for ay matrix H, deote its 2-orm by H, which is equal to its largest eigevalue. This orm satisfies the iequality Hx H x for a colum vector x whose dimesio is the same as the umber of the colums of H. Deote β A1 = (β j, j A 1, βa1 = ( β j, j A 1, ad Z A1 = (Z j, j A 1. Defie C A1 = 1 Z A 1 Z A1. Let ρ 1 ad ρ 2 be the smallest ad largest eigevalues of C A1, respectively. Proof of Theorem 3. By the KKT, a ecessary ad sufficiet coditio for β is ( 2Z βj j Y Z β = λ2 w j, β β j j 2 0, j 1, 2 Z j( Y Z β 2 λ 2 w j, β j = 0, j 1. (31 Let ν = (w j βj /(2 β j, j A 1. Defie β A1 = (Z A 1 Z A1 1 (Z A 1 Y λ 2 ν. (32 If β A1 = 0 β A1, the the equatio i (31 holds for β ( β A 1, 0. Thus, sice Z β = Z A1 βa1 for this β ad {Z j, j A 1 } are liearly idepedet, β A1 = 0 β A1 β = 0 β if Z j( Y ZA1 βa1 2 λ 2 w j /2, j A 1. 25

26 This is true if β = 0 β if β j 2 β j 2 < β j 2, j A 1, Z j(y Z A1 βa1 2 λ 2 w j /2, j A 1. Therefore, P( β 0 β P ( β j β j 2 β j 2, j A 1 +P ( Z j(y Z A1 βa1 2 > λ 2 w j /2, j A 1. have Let f 0j (X j = (f 0j (X 1j,..., f 0j (X j ad δ = j A 1 f 0j (X j Z A1 β A1. By Lemma 1, we Let H = I Z A1 (Z A 1 Z A1 1 Z A 1. By (32, 1 δ 2 = O p (qm 2d. (33 β A1 β A1 = 1 C 1 A 1 ( Z A1 (ε + δ λ 2 ν, (34 ad Y Z A1 βa1 = H ε + H δ + λ 2 Z A1 C 1 A 1 ν /. (35 Based o these two equatios, Lemma 5 below shows that P ( β j β j 2 β j 2, j A 1 0, ad Lemma 6 below shows that P ( Z j(y Z A1 βa1 2 > λ 2 w j /2, j A 1 0. These two equatios lead to part (i of the theorem. We ow prove part (ii of Theorem 3. As i (26, for η = Y Zβ ad η 1 = Z A1 (Z A 1 Z A1 1 Z A 1 η, we have η ε O p (1 + O(qm 2d, (36 26

27 where ε 1 is the projectio of ε = (ε 1,..., ε to the spa of Z A1. We have ε = (Z A 1 Z A1 1/2 Z A 1 ε ρ 1 Z A 1 ε 2 2 = O p (1 A 1 ρ 1. (37 Now similarly to the proof of (25, we ca show that Combiig (36, (37 ad (38, we get β A1 β A η λ2 2 A 1. (38 ρ 1 2 ρ 2 1 ( 8 ( 1 ( 1 ( 4λ β A1 β A = O p + O ρ 2 p + O + O 2. 1 ρ 1 m 2d 1 2 ρ 2 1 Sice ρ 1 p m 1, the result follows. The followig lemmas are eeded i the proof of Theorem 3. Lemma 4 For ν = (w j βj /(2 β j, j A 1, uder coditio (B1, Proof of Lemma 4. Write ν 2 = O p (h 2 = O p ( (b 2 1 c b 2 r 1 + qb 1 1. ν 2 = j A 1 w 2 j = j A 1 β j 2 = j A 1 β j 2 β j 2 β j 2 β j 2 + j A 1 β j 1. Uder (B2, j A 1 βj 2 β j 2 β j 2 β j 2 Mc 2 b b 4 1 β β, ad j A 1 β j 2 qb 2 1. The claim follows. Let ρ 3 be the maximum of the largest eigevalues of 1 Z jz j, j A 0, that is, ρ 3 = max j A0 1 Z jz j 2. By Lemma 3, b 1 O(m 1/2, ρ 1 p m 1, ρ 2 p m 1 ad ρ 3 p m 1. (39 Lemma 5 Uder coditios (B1, (B2, (A3 ad (A4, P ( β j β j 2 β j 2, j A 1 0. (40 27

28 Proof of Lemma 5. Let T j be a m qm matrix with the form T j = (0 m,..., 0 m, I m, 0 m,..., 0 m, where O m is a m m matrix of zeros ad I m is a m m idetity matrix, ad I m is at the jth block. By (34, β j β j = 1 T j C 1 A 1 (Z A 1 ε + Z A 1 δ λ 2 ν. By the triagle iequality, β j β j 2 1 T j C 1 A 1 Z A 1 ε T j C 1 A 1 Z A 1 δ λ 2 T j C 1 A 1 ν 2. (41 Let C be a geeric costat idepedet of. The first term o the right-had side By (33, the secod term max 1 T j C 1 A j A 1 Z A 1 ε 2 1 ρ 1 1 Z A 1 ε 2 1 = 1/2 ρ 1 1 1/2 Z A 1 ε 2 = O p (1 1/2 ρ 1 1 m 1/2 (qm 1/2 (42 By Lemma 4, the third term max 1 T j C 1 A j A 1 Z A 1 δ 2 C 1 A Z A 1 Z A1 1/2 2 1 δ 2 1 = O p (1ρ 1 1 ρ 1/2 2 q 1/2 m d. (43 max 1 λ 2 T j C 1 A j A 1 ν 2 λ 2 ρ 1 1 ν 2 = O p (1ρ λ 2 h. (44 1 Thus (40 follows from (39, (42, (43, (44 ad coditio (B2a. Lemma 6 Uder coditios (B1, (B2, (A3 ad (A4, P ( Z j(y Z A1 βa1 2 > λ 2 w j /2, j A 1 0. (45 Proof of Lemma 6. By (35, we have ( Z j Y ZA1 βa1 = Z j H ε + Z jh δ + λ 1 Z jz A1 C 1 A 1 ν. (46 28

29 Recall s = p q is the umber of zero compoets i the model. By Lemma 2, ( E max 1/2 Z j A 1 jh ε 2 O(1 { log(s m } 1/2. (47 Sice w j = β j 1 = O p (r for j A 1 ad by (47, for the first term o the right had side of (46 we have P ( Z jh ε 2 > λ 2 w j /6, j A 1 By (33, the secod term o the right had side of (46 P ( Z jh ε 2 > Cλ 2 r, j A 1 + o(1 ( = P max 1/2 Z j A 1 jh ε 2 > C 1/2 λ 2 r + o(1 O(1 1/2 {log(s m } 1/2 Cλ 2 r + o(1. (48 max Z j A 1 jh δ 2 1/2 max 1 Z j A 1 jz j 1/2 2 H 2 δ 2 = O(1ρ 1/2 3 q 1/2 m d. (49 By Lemma 4, the third term o the right had side of (46 max λ 2 1 Z j Z A1 C 1 A j ia 1 ν 2 λ 2 max 1/2 Z j 2 1/2 Z A1 C 1/2 A 1 j A 1 2 C 1/2 A 1 2 ν 2 1 = λ 2 ρ 1/2 3 ρ 1/2 1 O p (qb 1 1. (50 Therefore, (45 follows from (39, (48, (49, (50 ad coditio (B2b. Proof of Theorem 4. The proof is similar to that of Theorem 2 ad is omitted. Refereces [1] Bach, F. R. (2007. Cosistecy of the group Lasso ad multiple kerel learig. Joural of Machie Learig Research, 9, [2] Buea, F., Tsybakov, A. ad Wegkamp, M. (2007. Sparsity oracle iequalities for the lasso. Electroic Joural of Statististic, [3] Che, J. ad Che, Z. (2008. Exteded Bayesia iformatio criteria for model selectio with large model space. Biometrika, 95, [4] Che, J. ad Che, Z. (2009. Exteded BIC for small--large-p sparse GLM. Available from stachez/cheche.pdf. 29