On the degrees of freedom in shrinkage estimation

Kengo Kato
Graduate School of Economics, University of Tokyo, Hongo, Bunkyo-ku, Tokyo, Japan
October 2007

Abstract

We study the degrees of freedom in shrinkage estimation of the regression coefficients. Generalizing the idea of the Lasso, we consider the problem of estimating the coefficients by the projection of the ordinary least squares estimator onto a closed convex set. An unbiased estimator of the degrees of freedom is then derived in terms of geometric quantities under a smoothness condition on the boundary of the closed convex set. The result presented in this paper is applicable to estimation with a wide class of constraints. As an application, we obtain a $C_p$-type criterion and AIC for selecting the tuning parameter.

Keywords: AIC, degrees of freedom, fused Lasso, group Lasso, Lasso, Mallows' $C_p$, second fundamental form, shrinkage estimation, Stein's lemma, tubal coordinates.

Running title: Degrees of freedom in shrinkage estimation

1 Introduction

In recent years, much attention has been paid to shrinkage methods for estimating the coefficients of a linear model. Compared with ordinary least squares (OLS), shrinkage methods often improve the prediction accuracy. In addition, if the constraint region towards which the estimator is shrunk has edges or corners, some coefficients can be set to exactly zero. To be precise, suppose $y = (y_1, \dots, y_n)'$ is the response vector and $x_j = (x_{1j}, \dots, x_{nj})'$, $j = 1, \dots, p$, are $p$ linearly independent predictors. Let $X = [x_1 \cdots x_p]$ be the design matrix. We consider the linear model
$$ y = X\beta + \epsilon, \qquad (1.1) $$
where $\beta = (\beta_1, \dots, \beta_p)'$ is the coefficient vector and $\epsilon \sim N_n(0, \sigma^2 I_n)$. Without loss of generality, we assume that the predictors are centered so that the intercept is not included in the above linear model.

A canonical example of shrinkage methods is the Lasso (Tibshirani [10]). Let $\|\cdot\|_2$ be the ordinary Euclidean norm: $\|z\|_2 = (z_1^2 + \cdots + z_n^2)^{1/2}$ for $z \in \mathbb{R}^n$. The Lasso estimate is defined as the solution of the following problem:
$$ \min_\beta \|y - X\beta\|_2^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le t, \qquad (1.2) $$
or equivalently
$$ \min_\beta \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^p |\beta_j|, \qquad (1.3) $$
where $t$ and $\lambda$ are non-negative tuning parameters. The Lasso shrinks the coefficients towards zero as $t$ decreases (or $\lambda$ increases). An important feature of the Lasso is that, depending on the tuning parameter, some coefficients are set exactly equal to zero. It should be noted that although (1.2) and (1.3) are equivalent as minimization problems, the solutions of these two problems are different as estimators, since the correspondence between $t$ and $\lambda$ generally depends on the data.

As explained in Efron [2], the degrees of freedom plays an important role in selecting the optimal tuning parameter. The degrees of freedom reflects the model complexity controlled by the shrinkage and corresponds to the penalty term of model selection criteria such as Mallows' $C_p$ (Mallows [4]) and Akaike's information criterion (AIC, Akaike [1]). Recently, Zou et al. [15] showed that, with parametrization (1.3), the number of non-zero coefficients is an unbiased estimator of the degrees of freedom of the Lasso. Their derivation, however, requires the local explicit form of the Lasso estimator and cannot be applied to estimation with a more general restriction.

The Lasso can be viewed as the projection of the OLS estimator onto the diamond-shaped region. For $u, v \in \mathbb{R}^p$, we denote
$$ \langle u, v \rangle = u' V v, \qquad (1.4) $$
where $V = X'X$, and let $\|\cdot\| = \langle \cdot, \cdot \rangle^{1/2}$. Then the Lasso problem (1.2) is rewritten as
$$ \min_\beta \|\hat\beta - \beta\| \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le t, \qquad (1.5) $$
where $\hat\beta$ is the OLS estimator of $\beta$. The natural generalization of the minimization problem (1.5) is
$$ \min_\beta \|\hat\beta - \beta\| \quad \text{subject to} \quad \beta \in K, \qquad (1.6) $$
with a closed convex set $K \subset \mathbb{R}^p$. The solution $\hat\beta_K$ to the problem (1.6) is given by the projection of $\hat\beta$ onto $K$. Since $K$ is closed and convex, $\hat\beta_K$ is uniquely defined. The problem of selecting the optimal tuning parameter is viewed as the problem of selecting the optimal constraint region $K$ among a given collection of closed convex sets. The class of estimation methods considered here includes the Lasso, the fused Lasso (Tibshirani et al. [11]), and the group Lasso (Yuan and Lin [12]).
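As a minimal numerical sketch (not part of the paper), the projection in (1.6) can be computed for the Lasso constraint region with a generic constrained optimizer; the routine below uses NumPy and SciPy, and the function name and data are purely illustrative.

```python
# Projection of the OLS estimator onto K = {beta : sum_j |beta_j| <= t} in the
# <.,.> inner product of (1.4), i.e. problem (1.5)/(1.6).  Illustrative sketch only.
import numpy as np
from scipy.optimize import minimize

def project_onto_K(beta_ols, V, t):
    """Solve min_beta (beta_ols - beta)' V (beta_ols - beta) s.t. sum_j |beta_j| <= t."""
    p = beta_ols.size
    objective = lambda b: (beta_ols - b) @ V @ (beta_ols - b)
    constraint = {"type": "ineq", "fun": lambda b: t - np.abs(b).sum()}
    return minimize(objective, np.zeros(p), method="SLSQP", constraints=[constraint]).x

rng = np.random.default_rng(0)
n, p, t = 50, 5, 2.0
X = rng.standard_normal((n, p))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.standard_normal(n)

V = X.T @ X
beta_ols = np.linalg.solve(V, X.T @ y)      # the OLS estimator beta-hat
beta_K = project_onto_K(beta_ols, V, t)     # the Lasso estimate beta-hat_K for this t
print(np.round(beta_ols, 3), np.round(beta_K, 3))
```

Since $\|\hat\beta - \beta\|^2$ and $\|y - X\beta\|_2^2$ differ only by a term not involving $\beta$, minimizing either objective under the same constraint gives the same $\hat\beta_K$.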

Here we present illustrative examples of the constraint regions of the Lasso and the group Lasso. The left panel of the figure below corresponds to the Lasso constraint $|\beta_1| + |\beta_2| + |\beta_3| \le 1$. The right panel corresponds to the group Lasso constraint $(\beta_1^2 + \beta_2^2)^{1/2} + |\beta_3| \le 1$.

Fig. The constraint regions of the Lasso (left) and the group Lasso (right).

In this paper, we study the degrees of freedom of the fit $\hat\mu_K = X\hat\beta_K$. From Stein's lemma (Stein [8]), an unbiased estimator of the degrees of freedom is given by the divergence of $\hat\mu_K$ with respect to $y$, which coincides with the divergence of $\hat\beta_K$ with respect to $\hat\beta$. However, in general, the estimator $\hat\beta_K$ cannot be expressed in an explicit form. Thus it is often impossible to calculate the divergence directly. To overcome this difficulty, we use the idea of the tubal coordinates (Weyl [14]). By an approach similar to that of Kuriki and Takemura [3], we derive the divergence of the projection onto $K$ in terms of geometric quantities under a regularity condition on the boundary $\partial K$ of $K$. Hence we obtain an unbiased estimator of the degrees of freedom of $\hat\mu_K$. As an application, a $C_p$-type statistic and AIC for $\hat\mu_K$ are also derived.

The organization of this paper is as follows. In Section 2, we briefly review Stein's unbiased risk theory. In Section 3, we first prepare the notation for the geometry of a piecewise smooth boundary of a closed convex set and derive a divergence formula for the projection onto $K$ by the differential geometric approach. An unbiased estimator of the degrees of freedom of $\hat\mu_K$ is provided in Section 3.2. The result presented in this paper seems to be fairly general. In Section 4, we apply our method to obtain unbiased estimators of the degrees of freedom for the Lasso and its variants. Section 5 is devoted to some concluding remarks.

2 Unbiased estimation of the prediction risk

In this section, following Efron [2], we first introduce Stein's unbiased risk estimation theory. The precise definition of the degrees of freedom is given. Then we explain the strategy for deriving an unbiased estimator of the degrees of freedom for the estimator defined by the solution to the minimization problem (1.6).

Given a fit $\hat\mu = \hat\mu(y) = X\hat\beta$, where $\hat\beta$ is an estimator of $\beta$, we focus on the accuracy of $\hat\mu$ in predicting future data. Suppose $y^{new}$ is a new response vector generated from the same distribution as $y$. We shall consider estimating the prediction risk
$$ E\|y^{new} - \hat\mu\|_2^2 / n. \qquad (2.1) $$

Define $\mu = X\beta$. Partitioning $(y_i^{new} - \hat\mu_i)^2$ as
$$ (y_i^{new} - \hat\mu_i)^2 = (y_i^{new} - \mu_i)^2 + 2(y_i^{new} - \mu_i)(\mu_i - \hat\mu_i) + (\mu_i - \hat\mu_i)^2, $$
using
$$ (\mu_i - \hat\mu_i)^2 = (y_i - \hat\mu_i)^2 - (y_i - \mu_i)^2 + 2(y_i - \mu_i)(\hat\mu_i - \mu_i), $$
and substituting into (2.1), we obtain
$$ (y_i^{new} - \hat\mu_i)^2 = (y_i - \hat\mu_i)^2 + 2(y_i - \mu_i)(\hat\mu_i - \mu_i) + (y_i^{new} - \mu_i)^2 - (y_i - \mu_i)^2 + 2(y_i^{new} - \mu_i)(\mu_i - \hat\mu_i). \qquad (2.2) $$
Taking expectations of both sides of equation (2.2), we obtain the decomposition
$$ E\|y^{new} - \hat\mu\|_2^2 = E\|y - \hat\mu\|_2^2 + 2\,\mathrm{df}(\hat\mu)\,\sigma^2, $$
where
$$ \mathrm{df}(\hat\mu) = \sum_{i=1}^n \mathrm{cov}(\hat\mu_i, y_i)/\sigma^2 $$
is called the degrees of freedom of the fit $\hat\mu$. When $\hat\mu$ is given by a linear function of $y$, i.e., $\hat\mu = Sy$ with some matrix $S$ independent of $y$, the degrees of freedom is $\mathrm{df}(\hat\mu) = \mathrm{tr}\, S$, which is a known constant. However, in general it is necessary to estimate $\mathrm{df}(\hat\mu)$. We employ Stein's lemma to accomplish this task.

Lemma 2.1 (Stein's lemma). Suppose $\hat\mu_i : \mathbb{R}^n \to \mathbb{R}$ is absolutely continuous in the $i$-th coordinate for $i = 1, \dots, n$. If $E|\partial\hat\mu_i/\partial y_i| < \infty$ for each $i$, then
$$ \sum_{i=1}^n \mathrm{cov}(\hat\mu_i, y_i)/\sigma^2 = E(\mathrm{div}\,\hat\mu), $$
where $\mathrm{div}\,\hat\mu = \sum_{i=1}^n \partial\hat\mu_i/\partial y_i$.

Therefore an unbiased estimator of the degrees of freedom is given by
$$ \widehat{\mathrm{df}}(\hat\mu) = \mathrm{div}\,\hat\mu, \qquad (2.4) $$
and we can define a $C_p$-type criterion by
$$ C_p(\hat\mu) = \frac{\|y - \hat\mu\|_2^2}{n} + \frac{2\,\widehat{\mathrm{df}}(\hat\mu)\,\sigma^2}{n}, $$
which is an unbiased estimator of the prediction risk.

Let $\hat\beta_K$ be the estimator defined as the solution to the problem (1.6) with a closed convex set $K$. We verify the absolute continuity of $\hat\mu_K$, where $\hat\mu_K = X\hat\beta_K$.

Lemma 2.2. For every $i$, $\hat\mu_{K,i}$ is absolutely continuous in each coordinate and $\partial\hat\mu_{K,i}/\partial y = (\partial\hat\mu_{K,i}/\partial y_1, \dots, \partial\hat\mu_{K,i}/\partial y_n)$ is essentially bounded.

Proof. Since $\hat\beta_K$ is the projection of $\hat\beta$ onto $K$, $\hat\beta_K$ is Lipschitz continuous in $\hat\beta$ (see Webster [13]). Therefore $\hat\mu_K$ is shown to be Lipschitz continuous in $y$, and so is each $\hat\mu_{K,i}$. The absolute continuity and the essential boundedness follow directly from the Lipschitz continuity.

Note that if $\hat\beta_K$ is differentiable in $\hat\beta$, the divergence $\mathrm{div}\,\hat\mu_K$ is the same as the divergence of $\hat\beta_K$ with respect to $\hat\beta$. This can be verified by the chain rule:
$$ \mathrm{div}\,\hat\mu_K = \mathrm{tr}\left( X\, \frac{\partial\hat\beta_K}{\partial\hat\beta'}\, \frac{\partial\hat\beta}{\partial y'} \right) = \mathrm{tr}\left( X\, \frac{\partial\hat\beta_K}{\partial\hat\beta'}\, (X'X)^{-1} X' \right) = \mathrm{tr}\left( \frac{\partial\hat\beta_K}{\partial\hat\beta'} \right), $$
where $\partial\hat\beta_K/\partial\hat\beta'$ is the matrix whose $(i,k)$-th component is $\partial\hat\beta_{K,i}/\partial\hat\beta_k$ and $\partial\hat\beta/\partial y'$ is the matrix whose $(k,j)$-th component is $\partial\hat\beta_k/\partial y_j$. Therefore we only need to calculate the divergence of $\hat\beta_K$ with respect to $\hat\beta$ in order to derive an unbiased estimator of the degrees of freedom $\mathrm{df}(\hat\mu_K)$.

For the normal linear model (1.1), $\hat\beta$ is a complete sufficient statistic for $\beta$ when $\sigma^2$ is known, and $(\hat\beta, y'y)$ is a complete sufficient statistic for $(\beta, \sigma^2)$ when $\sigma^2$ is unknown. In either case, $\widehat{\mathrm{df}}(\hat\mu_K) = \mathrm{tr}(\partial\hat\beta_K/\partial\hat\beta')$ is shown to be the unique uniformly minimum variance unbiased estimator of the degrees of freedom $\mathrm{df}(\hat\mu_K)$, since $\widehat{\mathrm{df}}(\hat\mu_K)$ is a function of $\hat\beta$. Thus, in terms of estimating the degrees of freedom, the analytical estimator $\widehat{\mathrm{df}}(\hat\mu_K)$ is more efficient than cross-validation and related nonparametric methods.

3 Main results

In this section, we first derive a divergence formula for the projection onto $K$ under a smoothness condition on the boundary $\partial K$. As noted in the previous section, this enables us to obtain an unbiased estimator of the degrees of freedom for the shrinkage estimator projected onto $K$. The result presented here is an extension of that of Meyer and Woodroofe [5], which treats the case where $K$ is a convex polyhedral cone.

3.1 Divergence formula

Let $K \subset \mathbb{R}^p$ be a closed convex set. For $x \in \mathbb{R}^p$, $x_K$ denotes the orthogonal projection of $x$ onto $K$ in terms of $\langle\cdot,\cdot\rangle$:
$$ \|x - x_K\| = \min_{z \in K} \|x - z\|. $$
Recall that the inner product $\langle\cdot,\cdot\rangle$ is defined by (1.4). Since $K$ is closed and convex, $x_K$ is uniquely defined. Our main aim is to evaluate the divergence of the projection onto $K$ defined as
$$ f(x) = (f_1(x), \dots, f_p(x))' = x_K. $$
Note that $f$ is Lipschitz continuous (see the proof of Lemma 2.2).
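Although $\hat\beta_K$ rarely has a closed form, the divergence $\mathrm{tr}(\partial\hat\beta_K/\partial\hat\beta')$ can always be approximated numerically. The following sketch (not from the paper; the setting and names are illustrative) does this by finite differences for one case where the projection is explicit, the Euclidean ball $K = \{\|\beta\|_2 \le t\}$ with $V = I_p$, and compares the result with the known divergence of that projection.

```python
# Finite-difference approximation of df-hat = tr(d beta-hat_K / d beta-hat) for the
# projection onto K = {||beta||_2 <= t}, assuming an orthonormal design (V = I_p).
import numpy as np

def project_onto_ball(b, t):
    r = np.linalg.norm(b)
    return b if r <= t else t * b / r

p, t, eps = 5, 2.0, 1e-6
rng = np.random.default_rng(1)
beta_hat = 3.0 * rng.standard_normal(p)     # plays the role of the OLS estimator

df_hat = 0.0
for k in range(p):                          # trace of the Jacobian by finite differences
    e_k = np.eye(p)[k]
    df_hat += (project_onto_ball(beta_hat + eps * e_k, t)[k]
               - project_onto_ball(beta_hat, t)[k]) / eps

# For this K the divergence is known in closed form: p if ||beta-hat||_2 <= t and
# (p - 1) * t / ||beta-hat||_2 otherwise (the radial direction contributes 0 and each
# of the p - 1 tangential directions contributes t / ||beta-hat||_2).
r = np.linalg.norm(beta_hat)
print(df_hat, p if r <= t else (p - 1) * t / r)
```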

Let $\partial K$ be the boundary of $K$. For $s \in \partial K$, the normal cone of $K$ at $s$ is defined by
$$ N(K, s) = \{z - s : z_K = s\}. $$
Depending on the dimension of the normal cone $N(K, s)$, we have a disjoint partition of the boundary $\partial K$ as $\partial K = D_1 \cup \cdots \cup D_p$, where
$$ D_m = \{s \in \partial K : \dim N(K, s) = m\}. $$
Define
$$ E_m = \{x \in \mathbb{R}^p \setminus K : x_K \in D_m\}. $$
Then we have a disjoint partition of $\mathbb{R}^p \setminus K$ as $\mathbb{R}^p \setminus K = E_1 \cup \cdots \cup E_p$. We put a condition on the smoothness of $\partial K$ as in Kuriki and Takemura [3]. $E_m^\circ$ denotes the interior of $E_m$.

Assumption 3.1. $D_m$ is a $(p - m)$-dimensional $C^2$-manifold consisting of a finite number of relatively open connected components. Furthermore, the Lebesgue measure of $E_m \setminus E_m^\circ$ is zero.

Remark 3.1. In Kuriki and Takemura [3], $K$ is called piecewise smooth if $\partial K$ meets Assumption 3.1.

Let $T_s D_m$ be the tangent space of $D_m$ at $s$ and $T_s^\perp D_m$ be the orthogonal complement of $T_s D_m$ in terms of $\langle\cdot,\cdot\rangle$:
$$ T_s^\perp D_m = \{v \in \mathbb{R}^p : \langle v, z \rangle = 0, \ \forall z \in T_s D_m\}. $$
Clearly, $T_s^\perp D_m$ is the affine hull of $N(K, s)$. Following Milnor [6], the normal bundle of $D_m$ is defined as
$$ N_m = \{(s, v) : s \in D_m, \ v \in T_s^\perp D_m\}. $$
It is not difficult to show that $N_m$ is a $p$-dimensional $C^1$-manifold imbedded in $\mathbb{R}^{2p}$. Let us define $\varphi : N_m \to \mathbb{R}^p$ as $\varphi(s, v) = s + v$. Notice that $\varphi$ is a $C^1$-mapping. Then we show the following basic fact.

Lemma 3.1. For each fixed $x \in E_m^\circ$, there exist an $\epsilon$-ball $B_\epsilon = \{x' \in \mathbb{R}^p : \|x' - x\|_2 < \epsilon\} \subset E_m^\circ$ around $x$ with sufficiently small $\epsilon > 0$ and an open neighborhood $W$ of $(x_K, x - x_K)$ in $N_m$ such that $\varphi|_W : W \to B_\epsilon$ is a diffeomorphism and $(\varphi|_W)^{-1}(x') = (x'_K, x' - x'_K)$ for $x' \in B_\epsilon$. In particular, $f$ is continuously differentiable on $E_m^\circ$.

Proof. See Appendix A.1.

To calculate the divergence of $f$ in an explicit form, we introduce the tubal coordinates on $E_m^\circ$. Let $\theta = (\theta_1, \dots, \theta_{p-m})$ be a $C^2$-local coordinate system on $D_m$ and write $s \in D_m$ as $s(\theta) = s(\theta_1, \dots, \theta_{p-m})$. The tangent space $T_{s(\theta)} D_m$ at $s(\theta)$ is spanned by
$$ \left\{ b_a(\theta) = \frac{\partial s(\theta)}{\partial\theta_a}, \ a = 1, \dots, p - m \right\}. $$

Let $\{n_\alpha(\theta), \ \alpha = 1, \dots, m\}$ be an orthonormal basis of $T_{s(\theta)}^\perp D_m$ in terms of $\langle\cdot,\cdot\rangle$. Since the $\{b_a(\theta)\}$ are $C^1$-mappings in $\theta$, we can choose the $\{n_\alpha(\theta)\}$ so as to be of class $C^1$ as well. Hence we know that
$$ (\theta, \tau) \mapsto \Big( s(\theta), \ \sum_{\alpha=1}^m \tau_\alpha n_\alpha(\theta) \Big), \quad \text{with } \tau = (\tau_1, \dots, \tau_m)' \in \mathbb{R}^m, $$
gives a $C^1$-local parametrization of $N_m$. From Lemma 3.1, taking
$$ (\theta, \tau) \mapsto \varphi(\theta, \tau) = s(\theta) + \sum_{\alpha=1}^m \tau_\alpha n_\alpha(\theta) \qquad (3.1) $$
as a $C^1$-local parametrization of $E_m^\circ$, we can express $f$ in the local coordinates $(\theta, \tau)$ as $f(\theta, \tau) = s(\theta)$. Thus the Jacobian matrix of $f$ with respect to $x$ at $x = \varphi(\theta, \tau)$ is given by
$$ [\, b_1(\theta) \ \cdots \ b_{p-m}(\theta) \ \underbrace{0 \ \cdots \ 0}_{m} \,]\, (J\varphi_{(\theta,\tau)})^{-1}, \qquad (3.2) $$
where $J\varphi_{(\theta,\tau)}$ is the Jacobian matrix of $\varphi$ with respect to $(\theta, \tau)$. In particular, the divergence of $f$ with respect to $x$ at $x = \varphi(\theta, \tau)$ is given by the trace of the Jacobian matrix (3.2).

To state our main result, we prepare some notation used in differential geometry: the first fundamental form and the second fundamental form. The first fundamental form of $D_m$ associated with the coordinate system $\theta = (\theta_1, \dots, \theta_{p-m})$ is the symmetric matrix
$$ G(\theta) = (g_{ab}(\theta))_{1 \le a,b \le p-m} \quad \text{with} \quad g_{ab}(\theta) = \langle b_a(\theta), b_b(\theta) \rangle. $$
The second fundamental form of $D_m$ in the normal direction $n_\alpha(\theta)$ is defined as
$$ H_\alpha(\theta) = (h_{ab\alpha}(\theta))_{1 \le a,b \le p-m} \quad \text{with} \quad h_{ab\alpha}(\theta) = -\Big\langle n_\alpha(\theta), \frac{\partial^2 s}{\partial\theta_a \partial\theta_b} \Big\rangle. $$
For $x = \varphi(\theta, \tau)$, we define
$$ H(\theta, \tau) = \Big( \Big\langle x_K - x, \frac{\partial^2 s}{\partial\theta_a \partial\theta_b} \Big\rangle \Big)_{1 \le a,b \le p-m} = \sum_{\alpha=1}^m \tau_\alpha H_\alpha(\theta), \qquad (3.3) $$
which is a positive semi-definite matrix. See Appendix A.2.

Lemma 3.2. The divergence $\mathrm{div}\, f(x) = \sum_{j=1}^p \partial f_j(x)/\partial x_j$ of $f$ at $x \in E_m^\circ$ is given by
$$ \mathrm{div}\, f(x) = \sum_{a=1}^{p-m} \frac{1}{1 + \kappa_a(x)}, $$
where $\kappa_a(x) = \kappa_a(\theta, \tau)$, $a = 1, \dots, p - m$, are the eigenvalues satisfying the equation
$$ |H(\theta, \tau) - \kappa\, G(\theta)| = 0. \qquad (3.4) $$

Proof. We need to evaluate the Jacobian matrix (3.2). In the following calculation, we abbreviate arguments like $b_a = b_a(\theta)$. Since the columns of the Jacobian matrix $J\varphi = [\,\partial\varphi/\partial\theta_1 \ \cdots \ \partial\varphi/\partial\theta_{p-m} \ \ \partial\varphi/\partial\tau_1 \ \cdots \ \partial\varphi/\partial\tau_m\,]$ are given by
$$ \frac{\partial\varphi}{\partial\theta_a} = b_a + \sum_{\alpha=1}^m \tau_\alpha \frac{\partial n_\alpha}{\partial\theta_a}, \qquad \frac{\partial\varphi}{\partial\tau_\beta} = n_\beta, $$
we have
$$ (J\varphi)'\, V\, [\, b_1 \ \cdots \ b_{p-m} \ n_1 \ \cdots \ n_m \,] = \begin{bmatrix} \big( g_{ab} + \sum_{\alpha=1}^m \tau_\alpha \langle \partial n_\alpha/\partial\theta_a, b_b \rangle \big)_{1 \le a,b \le p-m} & \big( \sum_{\alpha=1}^m \tau_\alpha \langle \partial n_\alpha/\partial\theta_a, n_\beta \rangle \big)_{1 \le a \le p-m,\, 1 \le \beta \le m} \\ 0 & I_m \end{bmatrix}. \qquad (3.5) $$
Differentiating both sides of $\langle n_\alpha, b_b \rangle = 0$ with respect to $\theta_a$, we obtain
$$ 0 = \frac{\partial}{\partial\theta_a} \langle n_\alpha, b_b \rangle = \Big\langle \frac{\partial n_\alpha}{\partial\theta_a}, b_b \Big\rangle + \Big\langle n_\alpha, \frac{\partial^2 s}{\partial\theta_a \partial\theta_b} \Big\rangle, $$
and hence
$$ \Big\langle \frac{\partial n_\alpha}{\partial\theta_a}, b_b \Big\rangle = -\Big\langle n_\alpha, \frac{\partial^2 s}{\partial\theta_a \partial\theta_b} \Big\rangle. $$
Thus the right-hand side of (3.5) is written as
$$ \begin{bmatrix} A_{11} & A_{12} \\ 0 & I_m \end{bmatrix}, $$
where the $(p-m) \times (p-m)$ matrix $A_{11}$ and the $(p-m) \times m$ matrix $A_{12}$ are given by
$$ A_{11} = \Big( g_{ab} - \sum_{\alpha=1}^m \tau_\alpha \Big\langle n_\alpha, \frac{\partial^2 s}{\partial\theta_a \partial\theta_b} \Big\rangle \Big)_{1 \le a,b \le p-m} = G(\theta) + H(\theta, \tau), $$
$$ A_{12} = \Big( \sum_{\alpha=1}^m \tau_\alpha \Big\langle \frac{\partial n_\alpha}{\partial\theta_a}, n_\beta \Big\rangle \Big)_{1 \le a \le p-m,\ 1 \le \beta \le m}. $$

Therefore we obtain
$$ (J\varphi)^{-1} = \begin{bmatrix} A_{11} & 0 \\ A_{12}' & I_m \end{bmatrix}^{-1} [\, b_1 \ \cdots \ b_{p-m} \ n_1 \ \cdots \ n_m \,]'\, V. $$
The Jacobian matrix (3.2) is then given by
$$ [\, B \ \ 0 \,]\,(J\varphi)^{-1} = B A_{11}^{-1} B' V = B (G + H)^{-1} B' V = B (B'VB + H)^{-1} B' V, \qquad (3.6) $$
where $G = G(\theta)$, $H = H(\theta, \tau)$, $B = [\, b_1 \ \cdots \ b_{p-m} \,]$ and $N = [\, n_1 \ \cdots \ n_m \,]$. Let $\kappa_1(\theta,\tau), \dots, \kappa_{p-m}(\theta,\tau)$ be the eigenvalues of $H(\theta,\tau)$ with respect to $G(\theta)$, i.e., the solutions of equation (3.4). Then the divergence is written as
$$ \mathrm{tr}\big( B (G + H)^{-1} B' V \big) = \mathrm{tr}\big( (G + H)^{-1} G \big) = \sum_{a=1}^{p-m} \frac{1}{1 + \kappa_a}. $$
Therefore the proof is completed.

Remark 3.2. The local coordinates $(\theta, \tau)$ given in (3.1) are called the tubal coordinates, which are used in Weyl [14] to derive formulas for the volume of tubes.

Remark 3.3. When $K$ is a convex polyhedron, it holds that $B(\theta) \equiv B$ (a constant matrix) and $H(\theta, \tau) \equiv 0$. In this case, the Jacobian matrix (3.6) reduces to the constant projection matrix.

Remark 3.4. In Kuriki and Takemura [3], the average codimension $d(x)$ is defined as
$$ d(x) = m + \mathrm{tr}\big( (I_{p-m} + HG^{-1})^{-1} HG^{-1} \big) = m + \sum_{a=1}^{p-m} \frac{\kappa_a}{1 + \kappa_a} = p - \sum_{a=1}^{p-m} \frac{1}{1 + \kappa_a} $$
for $x \in E_m^\circ$. Hence we have the relation $\mathrm{div}\, f(x) = p - d(x)$, a.e.

3.2 Degrees of freedom

Using Lemma 3.2, we can derive an unbiased estimator of the degrees of freedom $\mathrm{df}(\hat\mu_K)$. We assume that $K$ is a closed convex set satisfying Assumption 3.1. For $\hat\beta \in E_m^\circ$, identifying $x = \hat\beta$ and $x_K = \hat\beta_K$, let $\kappa_{m,1}(\hat\beta), \dots, \kappa_{m,p-m}(\hat\beta)$ be the eigenvalues satisfying (3.4). Formally we define $E_0 = K$ and $\kappa_{0,a}(\hat\beta) \equiv 0$, $a = 1, \dots, p$. Then we obtain the following theorem. Note that $\hat\beta \in E_m$ is equivalent to $\hat\beta \notin K$ and $\hat\beta_K \in D_m$ for $m \ge 1$.

Theorem 3.1. Suppose $K$ is a closed convex set satisfying Assumption 3.1. Then
$$ \widehat{\mathrm{df}}(\hat\mu_K) = \sum_{m=0}^{p} \sum_{a=1}^{p-m} \frac{1}{1 + \kappa_{m,a}(\hat\beta)}\, I(\hat\beta \in E_m) \qquad (3.7) $$
gives an unbiased estimator of the degrees of freedom $\mathrm{df}(\hat\mu_K)$. Here, $I(\cdot)$ is an indicator function.

Hence, a $C_p$-type criterion for $\hat\mu_K$ is given by
$$ C_p(\hat\mu_K) = \frac{\|y - \hat\mu_K\|_2^2}{n} + \frac{2\,\widehat{\mathrm{df}}(\hat\mu_K)\,\sigma^2}{n}, $$
which is an unbiased estimator of the prediction risk $E[\|y^{new} - \hat\mu_K\|_2^2]/n$. Equivalently, we can define AIC for $\hat\mu_K$ as
$$ \mathrm{AIC}(\hat\mu_K) = \frac{\|y - \hat\mu_K\|_2^2}{n\sigma^2} + \frac{2\,\widehat{\mathrm{df}}(\hat\mu_K)}{n}. $$
When $\sigma^2$ is unknown, it is replaced by an unbiased estimate.

In our setting (1.6), $K$ plays the role of a tuning parameter. Practically, we choose the optimal $K$ which minimizes $C_p(\hat\mu_K)$ or $\mathrm{AIC}(\hat\mu_K)$ among a given collection $\mathcal{K}$ of closed convex sets satisfying Assumption 3.1. For instance, $\mathcal{K} = \{\{\beta \in \mathbb{R}^p : \sum_{j=1}^p |\beta_j| \le t\} : t > 0\}$ in the Lasso case. The usefulness of Theorem 3.1 is that it is not required to know the functional form of $\hat\beta_K$ in calculating (3.7). Once we know the numerical values of $\hat\beta$ and $\hat\beta_K$, we can calculate the value of (3.7) through geometric quantities such as the first fundamental form and the second fundamental form. In particular, if $K$ is a convex polyhedron, all the $\kappa_{m,a}$'s turn out to be zero. Therefore, (3.7) is simply expressed as
$$ \widehat{\mathrm{df}}(\hat\mu_K) = \sum_{m=1}^{p} (p - m)\, I(\hat\beta \in E_m), \qquad (3.8) $$
which coincides with the dimension of the face which contains $\hat\beta_K$ as a relatively interior point when $\hat\beta \notin K$.

4 Examples

In this section, we provide unbiased estimators of the degrees of freedom for the Lasso, the fused Lasso, and the group Lasso. Our result is also applicable to order-restricted inference. The degrees of freedom in order-restricted inference is studied in Meyer and Woodroofe [5] in the case where $K$ is a convex polyhedral cone.
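As a small numerical check of Lemma 3.2 and Theorem 3.1 (not from the paper), take $p = 2$, $V = I_2$ and $K$ the disk of radius $t$, whose boundary $s(\theta) = t(\cos\theta, \sin\theta)'$ is a single smooth piece with $m = 1$. The sketch below builds $G$ and $H(\theta, \tau)$ from the definitions, solves the generalized eigenvalue problem (3.4) with scipy.linalg.eigh, and compares $\sum_a 1/(1 + \kappa_a)$ with the divergence $t/\|\hat\beta\|_2$ of the projection onto the disk; the numbers and names are illustrative.

```python
# (3.4) and (3.7) for K = {||beta||_2 <= t} in R^2 with V = I_2:
# s(theta) = t (cos theta, sin theta) parametrizes the boundary piece D_1 (m = 1).
import numpy as np
from scipy.linalg import eigh

t = 1.5
beta_hat = np.array([2.0, 1.0])                       # a point outside K
r = np.linalg.norm(beta_hat)
theta = np.arctan2(beta_hat[1], beta_hat[0])

s = t * np.array([np.cos(theta), np.sin(theta)])      # beta-hat_K, the projection onto K
b = t * np.array([-np.sin(theta), np.cos(theta)])     # tangent vector ds/dtheta
s_tt = -t * np.array([np.cos(theta), np.sin(theta)])  # second derivative d^2 s / dtheta^2

G = np.array([[b @ b]])                               # first fundamental form (1 x 1)
H = np.array([[(s - beta_hat) @ s_tt]])               # H(theta, tau) as in (3.3)

kappa = eigh(H, G, eigvals_only=True)                 # eigenvalues of H with respect to G
df_hat = np.sum(1.0 / (1.0 + kappa))                  # the m = 1 term of (3.7)
print(df_hat, t / r)                                  # both equal t / ||beta-hat||_2
```

For a convex polyhedron $H$ vanishes and the same computation reduces to (3.8).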

4.1 Lasso

For the Lasso, the constraint region is given by
$$ K = \Big\{ \beta \in \mathbb{R}^p : \sum_{j=1}^p |\beta_j| \le t \Big\}. $$
We denote the Lasso estimator by $\hat\beta(t)$ rather than $\hat\beta_K$. Since $K$ is a convex polyhedron, an unbiased estimator $\widehat{\mathrm{df}}(t)$ of the degrees of freedom of $\hat\mu(t) = X\hat\beta(t)$ is given by (3.8). In this case, if $\hat\beta(t) \in D_m$, then the number of zeros in $\hat\beta(t)$ is equal to $m - 1$. Therefore we obtain the expression
$$ \widehat{\mathrm{df}}(t) = \begin{cases} \#\{j : \hat\beta(t)_j \ne 0\} - 1 & \text{if } \sum_{j=1}^p |\hat\beta_j| > t, \\ p & \text{if } \sum_{j=1}^p |\hat\beta_j| \le t. \end{cases} $$
A similar result is presented in Zou et al. [15], although their parametrization is not the same as ours (see the numerical sketch below).

4.2 Fused Lasso

The fused Lasso (Tibshirani et al. [11]) is the shrinkage method with the constraint region
$$ K = \Big\{ \beta \in \mathbb{R}^p : \sum_{j=1}^p |\beta_j| \le t_1, \ \sum_{j=2}^p |\beta_j - \beta_{j-1}| \le t_2 \Big\}. $$
We assume $t_1, t_2 > 0$. Let $\hat\beta(t)$ be the fused Lasso estimator with $t = (t_1, t_2)$. Since $K$ is a convex polyhedron, an unbiased estimator $\widehat{\mathrm{df}}(t)$ of the degrees of freedom of $\hat\mu(t) = X\hat\beta(t)$ is given by (3.8). Define
$$ K_1 = \Big\{ \beta \in \mathbb{R}^p : \sum_{j=1}^p |\beta_j| \le t_1 \Big\} \quad \text{and} \quad K_2 = \Big\{ \beta \in \mathbb{R}^p : \sum_{j=2}^p |\beta_j - \beta_{j-1}| \le t_2 \Big\}. $$
Corresponding to the $2^p$ different possible signs for the $p$ components of $\beta$, $K_1$ is expressed as the solution set of $2^p$ linear inequalities:
$$ K_1 = \{\beta \in \mathbb{R}^p : a_i'\beta \le t_1, \ i = 1, \dots, 2^p\}. $$
Similarly, $K_2$ is expressed as the solution set of $2^{p-1}$ linear inequalities:
$$ K_2 = \{\beta \in \mathbb{R}^p : b_i'\beta \le t_2, \ i = 1, \dots, 2^{p-1}\}. $$
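The Section 4.1 formula makes tuning-parameter selection by the $C_p$ criterion of Theorem 3.1 immediate once $\hat\beta(t)$ is available on a grid of $t$ values. The sketch below (not from the paper) uses a generic constrained solver, so exact zeros are only identified up to a numerical tolerance; a path algorithm such as LARS would give exact zeros. The data, tolerance and grid are illustrative.

```python
# Choosing t for the constrained Lasso by minimizing Cp, with
# df-hat(t) = #{j : beta-hat(t)_j != 0} - 1 when sum_j |beta-hat_j| > t (Section 4.1).
import numpy as np
from scipy.optimize import minimize

def constrained_lasso(X, y, t):
    p = X.shape[1]
    obj = lambda b: np.sum((y - X @ b) ** 2)
    con = {"type": "ineq", "fun": lambda b: t - np.abs(b).sum()}
    return minimize(obj, np.zeros(p), method="SLSQP", constraints=[con]).x

rng = np.random.default_rng(2)
n, p, sigma2, tol = 80, 6, 1.0, 1e-4
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.5]) + rng.standard_normal(n)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

best = None
for t in np.linspace(0.1, 1.2 * np.abs(beta_ols).sum(), 30):
    b = constrained_lasso(X, y, t)
    if np.abs(beta_ols).sum() <= t:
        df = p                                    # beta-hat already lies in K
    else:
        df = int(np.sum(np.abs(b) > tol)) - 1     # numerically non-zero coefficients, minus 1
    Cp = np.sum((y - X @ b) ** 2) / n + 2 * df * sigma2 / n
    if best is None or Cp < best[0]:
        best = (Cp, t, df)
print("Cp-optimal t = %.3f, df-hat = %d" % (best[1], best[2]))
```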

For instance, if $p = 3$, the $a_i$, $i = 1, \dots, 8$, are the eight sign vectors $(\pm 1, \pm 1, \pm 1)'$, and the $b_i$ are
$$ (-1, 0, 1)', \quad (-1, 2, -1)', \quad (1, -2, 1)', \quad (1, 0, -1)'. $$
Each open face of the polytope $K = K_1 \cap K_2$ is of the form
$$ \{\beta \in \mathbb{R}^p : a_i'\beta = t_1, \ i \in I_1; \ b_i'\beta = t_2, \ i \in I_2; \ a_j'\beta < t_1, \ j \in \{1, \dots, 2^p\} \setminus I_1; \ b_j'\beta < t_2, \ j \in \{1, \dots, 2^{p-1}\} \setminus I_2\}, \qquad (4.1) $$
where $I_1 \subset \{1, \dots, 2^p\}$ and $I_2 \subset \{1, \dots, 2^{p-1}\}$. Suppose a nonempty open face $F$ of $K$ is given by (4.1), where the matrix whose column vectors are $a_i$, $i \in I_1$, and $b_i$, $i \in I_2$, is of rank $m$. Then the dimension of $F$ is $p - m$. From these observations, we know that the unbiased estimator $\widehat{\mathrm{df}}(t)$ of $\mathrm{df}(\hat\mu(t))$ is given by
$$ \widehat{\mathrm{df}}(t) = \begin{cases} p - m_1(t) & \text{if } \hat\beta(t) \in \partial K_1, \ \hat\beta(t) \notin \partial K_2 \text{ and } \hat\beta \notin K, \\ p - m_2(t) & \text{if } \hat\beta(t) \in \partial K_2, \ \hat\beta(t) \notin \partial K_1 \text{ and } \hat\beta \notin K, \\ p - m_3(t) & \text{if } \hat\beta(t) \in \partial K_1 \cap \partial K_2 \text{ and } \hat\beta \notin K, \\ p & \text{if } \hat\beta \in K, \end{cases} $$
where
$$ m_1(t) = \#\{j : \hat\beta(t)_j = 0\} + 1, $$
$$ m_2(t) = \#\{j \ge 2 : \hat\beta(t)_j - \hat\beta(t)_{j-1} = 0\} + 1, $$
$$ m_3(t) = \#\{j : \hat\beta(t)_j = 0\} + \#\{j \ge 2 : \hat\beta(t)_j - \hat\beta(t)_{j-1} = 0, \ \hat\beta(t)_{j-1} \ne 0, \ \hat\beta(t)_j \ne 0\} + 2. $$

Remark 4.1. In Tibshirani et al. [11], with the penalization formulation, they propose
$$ p - \#\{j : \hat\beta_j = 0\} - \#\{j \ge 2 : \hat\beta_j - \hat\beta_{j-1} = 0, \ \hat\beta_j \ne 0, \ \hat\beta_{j-1} \ne 0\} $$
as an estimator of the degrees of freedom for the fused Lasso, where $\hat\beta$ is the fused Lasso estimator. They, however, do not present a mathematical proof of the unbiasedness of this estimator.

4.3 Group Lasso

The group Lasso is proposed in Yuan and Lin [12]. The constraint region of the group Lasso is
$$ K = \Big\{ \beta \in \mathbb{R}^p : \sum_{j=1}^J (\beta_{[j]}' V_j \beta_{[j]})^{1/2} \le t \Big\}, $$
where $\beta$ is partitioned as $\beta = (\beta_{[1]}', \dots, \beta_{[J]}')'$ with $\beta_{[j]}$ being a $p_j \times 1$ vector, and $V_j$ is a $p_j \times p_j$ symmetric positive definite matrix. In the subsequent calculation, we assume that $X$ is orthonormal, i.e., $X'X = I_p$, and hence $V = I_p$. For $x \in \mathbb{R}^p$, let $x_{[1]} = (x_1, \dots, x_q)'$. We first treat the case
$$ K = \{x \in \mathbb{R}^p : \|x_{[1]}\|_2 + |x_{q+1}| + \cdots + |x_p| \le t\}, \qquad (4.2) $$

where $\|x_{[1]}\|_2 = (\sum_{j=1}^q x_j^2)^{1/2}$. We focus on the following surface area:
$$ M = \{x \in \mathbb{R}^p : \|x_{[1]}\|_2 + x_{q+1} + \cdots + x_{q+r} = t, \ \|x_{[1]}\|_2 > 0, \ x_{q+1} > 0, \dots, x_{q+r} > 0, \ x_{q+r+1} = \cdots = x_p = 0\}. $$
The set $M$ is a $(q + r - 1)$-dimensional smooth manifold. To introduce a local coordinate system on $M$, we transform $x_{[1]}$ into polar coordinates (Takemura [9]) as
$$ x_{[1]} = \theta_q\, u(\theta_1, \dots, \theta_{q-1}), \quad \text{with} \quad u(\theta_1, \dots, \theta_{q-1}) = \begin{pmatrix} \cos\theta_1 \\ \sin\theta_1 \cos\theta_2 \\ \vdots \\ \sin\theta_1 \sin\theta_2 \cdots \cos\theta_{q-1} \\ \sin\theta_1 \sin\theta_2 \cdots \sin\theta_{q-1} \end{pmatrix}, $$
where $0 \le \theta_i \le \pi$, $i = 1, \dots, q-2$, $0 \le \theta_{q-1} < 2\pi$, and $0 < \theta_q < t$. Then the rest of the variables $x_{q+1}, \dots, x_{q+r}$ must satisfy
$$ x_{q+1} + \cdots + x_{q+r} = t - \theta_q. $$
Let $e_i \in \mathbb{R}^p$ be the vector of which only the $i$-th component is 1 and all other components are zero. Take
$$ b_{q+j} = e_{q+1+j} - e_{q+1}, \quad j = 1, \dots, r - 1. $$
Then $x \in M$ is expressed as
$$ x = x(\theta_1, \dots, \theta_{q+r-1}) = \begin{pmatrix} x_{[1]} \\ x_{q+1} \\ \vdots \\ x_{q+r} \\ 0_{p-q-r} \end{pmatrix} = \theta_q \begin{pmatrix} u(\theta_1, \dots, \theta_{q-1}) \\ 0_{p-q} \end{pmatrix} + (t - \theta_q)\big( e_{q+1} + \theta_{q+1} b_{q+1} + \cdots + \theta_{q+r-1} b_{q+r-1} \big), $$
where $0_i$ is the $i \times 1$ zero vector, and $\theta_{q+1}, \dots, \theta_{q+r-1}$ satisfy $\theta_{q+j} > 0$, $j = 1, \dots, r-1$, and $\sum_{j=1}^{r-1} \theta_{q+j} < 1$. The partial derivative of $u(\theta_1, \dots, \theta_{q-1})$ with respect to $\theta_1$ is given by
$$ \frac{\partial u}{\partial\theta_1}(\theta_1, \dots, \theta_{q-1}) = \begin{pmatrix} -\sin\theta_1 \\ \cos\theta_1 \cos\theta_2 \\ \vdots \\ \cos\theta_1 \sin\theta_2 \cdots \sin\theta_{q-2} \cos\theta_{q-1} \\ \cos\theta_1 \sin\theta_2 \cdots \sin\theta_{q-2} \sin\theta_{q-1} \end{pmatrix} \equiv v(\theta_1, \dots, \theta_{q-1}). $$

Define $v(\theta_i, \dots, \theta_{q-1})$ for $i \ge 2$ in the similar manner. Then we have
$$ \frac{\partial u}{\partial\theta_i}(\theta_1, \dots, \theta_{q-1}) = \sin\theta_1 \cdots \sin\theta_{i-1} \begin{pmatrix} 0_{i-1} \\ v(\theta_i, \dots, \theta_{q-1}) \end{pmatrix}. $$
Thus the tangent space $T_x M$ at $x$ is spanned by the following $q + r - 1$ linearly independent vectors:
$$ \frac{\partial x}{\partial\theta_i} = \theta_q \sin\theta_1 \cdots \sin\theta_{i-1} \begin{pmatrix} 0_{i-1} \\ v(\theta_i, \dots, \theta_{q-1}) \\ 0_{p-q} \end{pmatrix}, \quad i = 1, \dots, q-1, $$
$$ \frac{\partial x}{\partial\theta_q} = \begin{pmatrix} u(\theta_1, \dots, \theta_{q-1}) \\ 0_{p-q} \end{pmatrix} - \big( e_{q+1} + \theta_{q+1} b_{q+1} + \cdots + \theta_{q+r-1} b_{q+r-1} \big), $$
$$ \frac{\partial x}{\partial\theta_{q+j}} = (t - \theta_q)\, b_{q+j}, \quad j = 1, \dots, r - 1. $$
It is easy to see that the orthonormal system $\{n_1, \dots, n_{p-q-r+1}\}$, with
$$ n_1 = \frac{1}{\sqrt{r+1}} \begin{pmatrix} u(\theta_1, \dots, \theta_{q-1}) \\ 1_r \\ 0_{p-q-r} \end{pmatrix} \quad \text{and} \quad n_2 = e_{q+r+1}, \ \dots, \ n_{p-q-r+1} = e_p, $$
gives a basis of $T_x^\perp M$. Here, $1_r = (1, \dots, 1)'$ ($r$ ones). To calculate the second fundamental forms, we evaluate the second partial derivatives of $x$, which are summarized as follows:
$$ \frac{\partial^2 x}{\partial\theta_i^2} = -\theta_q \sin\theta_1 \cdots \sin\theta_{i-1} \begin{pmatrix} 0_{i-1} \\ u(\theta_i, \dots, \theta_{q-1}) \\ 0_{p-q} \end{pmatrix}, \quad i = 1, \dots, q-1, $$
$$ \frac{\partial^2 x}{\partial\theta_i \partial\theta_j} = \theta_q \sin\theta_1 \cdots \cos\theta_i \cdots \sin\theta_{j-1} \begin{pmatrix} 0_{j-1} \\ v(\theta_j, \dots, \theta_{q-1}) \\ 0_{p-q} \end{pmatrix}, \quad 1 \le i < j \le q-1, $$
$$ \frac{\partial^2 x}{\partial\theta_i \partial\theta_q} = \sin\theta_1 \cdots \sin\theta_{i-1} \begin{pmatrix} 0_{i-1} \\ v(\theta_i, \dots, \theta_{q-1}) \\ 0_{p-q} \end{pmatrix}, \quad i = 1, \dots, q-1, $$
$$ \frac{\partial^2 x}{\partial\theta_q \partial\theta_{q+j}} = -b_{q+j}, \quad j = 1, \dots, r-1, $$
$$ \frac{\partial^2 x}{\partial\theta_i \partial\theta_{q+j}} = \frac{\partial^2 x}{\partial\theta_q^2} = \frac{\partial^2 x}{\partial\theta_{q+j}^2} = 0, \quad i = 1, \dots, q-1, \ j = 1, \dots, r-1. $$
Here, $u(\theta_i, \dots, \theta_{q-1})$, $i \ge 2$, are defined in the similar way as $u(\theta_1, \dots, \theta_{q-1})$.

Therefore the second fundamental forms are calculated as follows:
$$ H_1 = \Big( -\Big\langle n_1, \frac{\partial^2 x}{\partial\theta_i \partial\theta_j} \Big\rangle \Big)_{1 \le i,j \le q+r-1} = \frac{1}{\sqrt{r+1}}\, \mathrm{diag}(h_1, \dots, h_{q-1}, 0, \dots, 0), $$
with $h_i = \theta_q \sin^2\theta_1 \cdots \sin^2\theta_{i-1}$, $i = 1, \dots, q-1$, and
$$ H_k = \Big( -\Big\langle n_k, \frac{\partial^2 x}{\partial\theta_i \partial\theta_j} \Big\rangle \Big)_{1 \le i,j \le q+r-1} = 0 $$
for $k = 2, \dots, p-q-r+1$. Since the first fundamental form is given in the form
$$ G = \begin{bmatrix} \mathrm{diag}(\theta_q h_1, \dots, \theta_q h_{q-1}) & 0 \\ 0 & G_{22} \end{bmatrix}, $$
the eigenvalues of $H = \sum_{k=1}^{p-q-r+1} \tau_k H_k = \tau_1 H_1$ with respect to $G$ are given by
$$ \kappa_1 = \cdots = \kappa_{q-1} = \frac{\tau_1'}{\theta_q}, \qquad \kappa_q = \cdots = \kappa_{q+r-1} = 0, $$
where $\tau_1' = \tau_1 / \sqrt{r+1}$.

Returning to the original problem, let $\hat\beta(t)$ be the resulting estimator with the constraint region (4.2). When $\hat\beta(t) \in M$, $\theta_q$ and $\tau_1'$ correspond to $\theta_q = \|\hat\beta(t)_{[1]}\|_2$ and $\tau_1' = \|\hat\beta_{[1]} - \hat\beta(t)_{[1]}\|_2$. Since
$$ \hat\beta(t)_{[1]}'\,(\hat\beta_{[1]} - \hat\beta(t)_{[1]}) = \theta_q\, u(\theta_1, \dots, \theta_{q-1})'\, \tau_1'\, u(\theta_1, \dots, \theta_{q-1}) = \theta_q \tau_1' = \|\hat\beta(t)_{[1]}\|_2\, \|\hat\beta_{[1]} - \hat\beta(t)_{[1]}\|_2, $$
we have $\|\hat\beta(t)_{[1]}\|_2 + \|\hat\beta_{[1]} - \hat\beta(t)_{[1]}\|_2 = \|\hat\beta_{[1]}\|_2$. Thus an unbiased estimator $\widehat{\mathrm{df}}(t)$ of $\mathrm{df}(\hat\mu(t))$, where $\hat\mu(t) = X\hat\beta(t)$, is given by
$$ \widehat{\mathrm{df}}(t) = r + \frac{q-1}{1 + \|\hat\beta_{[1]} - \hat\beta(t)_{[1]}\|_2 / \|\hat\beta(t)_{[1]}\|_2} = r + (q-1)\, \frac{\|\hat\beta(t)_{[1]}\|_2}{\|\hat\beta_{[1]}\|_2}, $$
when $\hat\beta(t) \in M$ and $\hat\beta \notin K$. A similar calculation shows that the entire $\widehat{\mathrm{df}}(t)$ is given by
$$ \widehat{\mathrm{df}}(t) = \begin{cases} I(\|\hat\beta(t)_{[1]}\|_2 > 0)\Big\{ 1 + (q-1)\, \dfrac{\|\hat\beta(t)_{[1]}\|_2}{\|\hat\beta_{[1]}\|_2} \Big\} + \displaystyle\sum_{j=1}^{p-q} I(|\hat\beta(t)_{q+j}| > 0) - 1 & \text{if } \hat\beta \notin K, \\ p & \text{if } \hat\beta \in K. \end{cases} $$

Since $\|\hat\beta_{[1]}\|_2 > 0$ with probability 1, we also have $\widehat{\mathrm{df}}(t)$ as an unbiased estimator of $\mathrm{df}(\hat\mu(t))$, where
$$ \widehat{\mathrm{df}}(t) = \begin{cases} I(\|\hat\beta(t)_{[1]}\|_2 > 0) + \displaystyle\sum_{j=1}^{p-q} I(|\hat\beta(t)_{q+j}| > 0) + (q-1)\, \dfrac{\|\hat\beta(t)_{[1]}\|_2}{\|\hat\beta_{[1]}\|_2} - 1 & \text{if } \hat\beta \notin K, \\ p & \text{if } \hat\beta \in K. \end{cases} $$

Next, for $x \in \mathbb{R}^p$, we write $x = (x_{[1]}', \dots, x_{[J]}')'$ as a partition of $x$, where $x_{[j]}$ is a $p_j \times 1$ vector. Define $\|x_{[j]}\| = (x_{[j]}'x_{[j]})^{1/2}$ for $x \in \mathbb{R}^p$. We consider the group Lasso estimation with the constraint region
$$ K = \Big\{ \beta \in \mathbb{R}^p : \sum_{j=1}^J \|\beta_{[j]}\| \le t \Big\}. $$
Let $\hat\beta(t)$ be the resulting estimator. Assuming that $X$ is orthonormal, an unbiased estimator of the degrees of freedom $\mathrm{df}(\hat\mu(t))$, with $\hat\mu(t) = X\hat\beta(t)$, is given by
$$ \widehat{\mathrm{df}}(t) = \begin{cases} \displaystyle\sum_{j=1}^J I(\|\hat\beta(t)_{[j]}\| > 0) + \sum_{j=1}^J (p_j - 1)\, \dfrac{\|\hat\beta(t)_{[j]}\|}{\|\hat\beta_{[j]}\|} - 1 & \text{if } \hat\beta \notin K, \\ p & \text{if } \hat\beta \in K. \end{cases} $$
The proof is similar to the above and thus omitted.

Remark 4.2. Even when $X$ is not orthonormal, we can calculate the estimator (3.7) of the degrees of freedom numerically.

5 Concluding remarks

In this paper, we have derived an unbiased estimator of the degrees of freedom for the shrinkage estimator towards a closed convex set with piecewise smooth boundary. Formulating the estimation problem as (1.6), we can treat selection criteria for the tuning parameter in recently proposed estimation methods such as the Lasso, the fused Lasso, and the group Lasso in a unified manner. It seems to be necessary to study optimality properties of $C_p$ or AIC in selecting the tuning parameter. For the traditional variable selection problem in the linear model, there is a large literature on properties of model selection criteria (for example, Shao [7]). This topic remains for future research.

Acknowledgment

The author would like to thank Professor Tatsuya Kubokawa for his encouragement and helpful suggestions.

A Appendix

A.1 Proof of Lemma 3.1

Let $x \in E_m^\circ$ be an arbitrary fixed vector. From Remark A.1 below, $(x_K, x - x_K)$ is a regular point of $\varphi$. Thus the inverse function theorem implies that there exists an open neighborhood $U \cap N_m$ of $(x_K, x - x_K)$ in $N_m$ such that $\varphi|_{U \cap N_m} : U \cap N_m \to \varphi(U \cap N_m)$ is a diffeomorphism. Here $U$ is an open set in $\mathbb{R}^{2p}$ containing $(x_K, x - x_K)$. Let $L > 0$ be the Lipschitz constant of $f$. Let us define
$$ B_\epsilon = \{x' \in \mathbb{R}^p : \|x' - x\|_2 < \epsilon\} $$
and
$$ Q_\epsilon = \{z \in \mathbb{R}^{2p} : |z_i - x_{K,i}| < L\epsilon, \ 1 \le i \le p, \ |z_{p+j} - (x_j - x_{K,j})| < (1 + L)\epsilon, \ 1 \le j \le p\}, $$
with $\epsilon > 0$ small enough to have
$$ B_\epsilon \subset \varphi(U \cap N_m), \qquad Q_\epsilon \subset U. $$
Since $f$ is Lipschitz continuous with Lipschitz constant $L$, it holds that, for $x' \in B_\epsilon$,
$$ \|x'_K - x_K\|_2 < L\epsilon, $$
and
$$ \|(x' - x'_K) - (x - x_K)\|_2 \le \|x' - x\|_2 + \|x'_K - x_K\|_2 < (1 + L)\epsilon. $$
Therefore we have $(x'_K, x' - x'_K) \in Q_\epsilon \cap N_m \subset U \cap N_m$ for $x' \in B_\epsilon$. Define $W = (\varphi|_{U \cap N_m})^{-1}(B_\epsilon) \cap Q_\epsilon$. Note that $W$ is an open set in $N_m$. Then it is seen that the diffeomorphism $\varphi|_W : W \to B_\epsilon$ corresponds to the mapping $(x'_K, x' - x'_K) \mapsto x'$.

A.2 Positive semi-definiteness of the matrix (3.3)

We follow the notation used in Section 3.1. Let $s_0 \in D_m$ and $v_0 \in N(K, s_0)$ be arbitrary fixed vectors. We take a $C^2$-local coordinate system $(\theta_1, \dots, \theta_{p-m})$ of $D_m$ around $s_0$ such that $s_0 = s(0, \dots, 0)$. Then we shall show the following fact:

Lemma A.1. The matrix
$$ \Big( -\Big\langle v_0, \frac{\partial^2 s}{\partial\theta_a \partial\theta_b}\Big|_{\theta=0} \Big\rangle \Big)_{1 \le a,b \le p-m} \qquad (A.1) $$
is positive semi-definite.

Proof. Define
$$ L(\theta) = -\langle v_0, s(\theta) - s_0 \rangle $$
in an appropriate neighborhood of $0$. From the characterization of the projection onto a closed convex set (Webster [13]), it follows that $L(\theta) \ge 0$ for all $\theta$ in the neighborhood and $L(0) = 0$. Hence $\theta = 0$ is a minimizer of $L(\theta)$. Noting that
$$ \frac{\partial^2 L}{\partial\theta_a \partial\theta_b}\Big|_{\theta=0} = -\Big\langle v_0, \frac{\partial^2 s}{\partial\theta_a \partial\theta_b}\Big|_{\theta=0} \Big\rangle, $$
the second-order necessary condition for the minimizer ensures that the matrix (A.1) is indeed positive semi-definite.

Remark A.1. From this lemma and Assertion 6.4 of Milnor [6], or from our calculation in the proof of Lemma 3.2, it can be proved that $(x_K, x - x_K)$ with $x \in E_m^\circ$ is a regular point of $\varphi$.

References

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory.

[2] Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. J. Amer. Statist. Assoc. 99.

[3] Kuriki, S. and Takemura, A. (2000). Shrinkage estimation towards a closed convex set with a smooth boundary. J. Multivariate Anal. 75.

[4] Mallows, C. (1973). Some comments on $C_p$. Technometrics 15.

[5] Meyer, M. and Woodroofe, M. (2000). On the degrees of freedom in shape-restricted regression. Ann. Statist. 28.

[6] Milnor, J. (1963). Morse Theory. Ann. Math. Stud. 51, Princeton Univ. Press, Princeton.

[7] Shao, J. (1997). An asymptotic theory for linear model selection (with discussion). Statist. Sinica 7.

[8] Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 9.

[9] Takemura, A. Foundation of Multivariate Statistical Inference (in Japanese). Kyoritsu Shuppan, Tokyo.

[10] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58.

[11] Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 67.

[12] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68.

[13] Webster, R. (1994). Convexity. Oxford Univ. Press, Oxford.

[14] Weyl, H. (1939). On the volume of tubes. Amer. J. Math. 61.

[15] Zou, H., Hastie, T. and Tibshirani, R. On the degrees of freedom of the Lasso. Ann. Statist., to appear.
