Some Sharp Performance Bounds for Least Squares Regression with L1 Regularization
Some Sharp Performance Bounds for Least Squares Regression with L1 Regularization

Tong Zhang
Statistics Department, Rutgers University, NJ
[email protected]

Abstract

We derive sharp performance bounds for least squares regression with L1 regularization, from the perspectives of parameter estimation accuracy and feature selection quality. The main result proved for L1 regularization extends a similar result in [4] for the Dantzig selector. It gives an affirmative answer to an open question in [10]. Moreover, the result leads to an extended view of feature selection which allows less restrictive conditions than some recent work. Based on the theoretical insights, a novel two-stage L1-regularization procedure with selective penalization is analyzed. It is shown that if the target parameter vector can be decomposed as the sum of a sparse parameter vector with large coefficients and another less sparse vector with relatively small coefficients, then the two-stage procedure can lead to improved performance.

1 Introduction

Consider a set of input vectors x_1, ..., x_n ∈ R^d, with corresponding desired output variables y_1, ..., y_n. We use d instead of the more conventional p to denote the data dimensionality, because the symbol p is reserved for another purpose. The task of supervised learning is to estimate the functional relationship y ≈ f(x) between the input x and the output variable y from the training examples {(x_1, y_1), ..., (x_n, y_n)}. In this paper, we consider the linear prediction model f(x) = β^T x, and focus on least squares for simplicity. A commonly used estimation method is L1-regularized empirical risk minimization (aka the Lasso):

    β̂ = arg min_{β ∈ R^d} [ (1/n) Σ_{i=1}^n (β^T x_i − y_i)^2 + λ ‖β‖_1 ],    (1)

where λ ≥ 0 is an appropriate regularization parameter. We are specifically interested in two related themes: parameter estimation accuracy and feature selection quality. A general convergence theorem is established in Section 4, which has two consequences.
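As a concrete illustration of the estimator (1), the following self-contained numpy sketch solves the Lasso objective by cyclic coordinate descent with soft-thresholding. The function name and all settings are ours, not part of the paper; this is an illustration of the estimator being analyzed, not the implementation used in the experiments, and it assumes no all-zero columns in X.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=500):
    """Minimize (1/n)*||X b - y||_2^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n           # (1/n)*||X_j||^2, assumed nonzero
    resid = y.copy()                             # running residual y - X beta
    for _ in range(n_iter):
        for j in range(d):
            # correlation of column j with the partial residual (coordinate j removed)
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # soft-threshold at lam/2 (the subgradient condition of the objective)
            new = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)   # keep the residual consistent
            beta[j] = new
    return beta
```

On a synthetic problem with a few large true coefficients and small Gaussian noise, the estimate recovers the large coefficients up to the usual soft-thresholding shrinkage.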
First, the theorem implies a parameter estimation accuracy result for the standard Lasso that extends the main result for the Dantzig selector in [4]. A detailed comparison is given in Section 6. This result provides an affirmative answer to an open question in [10] concerning whether a bound similar to that of [4] holds for the Lasso. Second, we show in Section 7 that the general theorem in

(Partially supported by NSF grant DMS.)
Section 4 can be used to study the feature selection quality of the Lasso. In this context, we consider an extended view of feature selection in which one selects the features whose estimated coefficients are larger than a certain non-zero threshold; this differs from [21], which only considered a zero threshold. An interesting consequence of our method is the consistency of feature selection even when the irrepresentable condition of [21] (a condition that is necessary in their approach) is violated. Moreover, the combination of our parameter estimation and feature selection results suggests that the standard Lasso might be sub-optimal when the target can be decomposed as a sparse parameter vector with large coefficients plus another less sparse vector with small coefficients. A two-stage selective penalization procedure is proposed in Section 8 to remedy the problem. We obtain a parameter estimation accuracy result for this procedure that can improve the corresponding result for the standard (one-stage) Lasso under appropriate conditions. For simplicity, most results (except for Theorem 4.1) in this paper are stated for the fixed design situation (that is, the x_i are fixed, while the y_i are random). However, with small modifications, they can also be applied to random design.

2 Related Work

In the literature, there are typically three types of results for learning a sparse approximate target vector β̄ = [β̄_1, ..., β̄_d] ∈ R^d such that E(y|x) ≈ β̄^T x.

1. Feature selection accuracy: identify the nonzero coefficients (e.g., [21, 20, 5, 19]), or more generally, identify the features with target coefficients larger than a certain threshold (see Section 7). That is, we are interested in identifying the relevant feature set {j : |β̄_j| > α} for some threshold α ≥ 0.

2. Parameter estimation accuracy: how accurate the estimated parameter is compared to the approximate target β̄, measured in a certain norm (e.g., [4, 1, 20, 14, 2, 18, 9, 5]).
That is, letting β̂ be the estimated parameter, we are interested in developing a bound for ‖β̂ − β̄‖_p for some p. Theorem 4.1 and Theorem 8.1 give such results.

3. Prediction accuracy: the prediction performance of the estimated parameter, both in fixed and random design settings (e.g., [15, 11, 3, 14, 2, 5]). For example, in fixed design, we are interested in a bound for (1/n) Σ_{i=1}^n (β̂^T x_i − E y_i)^2 or the related quantity (1/n) Σ_{i=1}^n ((β̂ − β̄)^T x_i)^2.

In general, good feature selection implies good parameter estimation, and good parameter estimation implies good prediction accuracy. However, the reverse directions do not usually hold. In this paper, we focus on the first two aspects, feature selection and parameter estimation, as well as their inter-relationship. Due to space limitations, the prediction accuracy of the Lasso is not considered in this paper. However, it is a relatively straightforward consequence of the parameter estimation bounds with p = 1 and p = 2. As mentioned in the introduction, one motivation of this paper is to develop a parameter estimation bound for L1 regularization directly comparable to that of the Dantzig selector in [4]. Compared to [4], where a parameter estimation error bound in 2-norm is proved for the Dantzig selector, Theorem 4.1 in Section 4 presents a more general bound in p-norm for p ∈ [1, ∞]. We are particularly interested in a bound in ∞-norm because such a bound immediately induces a result on the feature selection accuracy of the Lasso. This point of view is taken in Section 7, where feature selection is considered. Achieving good feature selection is important because it can be
used to improve the standard one-stage Lasso. In Section 8, we develop this observation, and show that a two-stage method with good feature selection achieves a bound better than that of the one-stage Lasso. Experiments in Section 9 confirm this theoretical observation. Since the development of this paper relies heavily on the parameter estimation accuracy of the Lasso in p-norm, it differs from and complements the earlier work on the Lasso cited above. Among earlier work, prediction accuracy bounds for the Lasso were derived in [2, 3, 5] under mutual incoherence conditions (introduced in [9]) that are generally regarded as stronger than the sparse eigenvalue conditions employed by [4]. This is because it is easier for a random matrix to satisfy sparse eigenvalue conditions than mutual incoherence conditions. A more detailed discussion of this point is given at the end of Section 4. Moreover, the relationships among the different quantities are presented in Section 3. As we shall see from Section 3, mutual incoherence conditions are also stronger than the conditions required for deriving the p-norm estimation error bounds in this paper. Therefore, under appropriate mutual incoherence conditions, our analysis leads to sharp p-norm parameter estimation bounds for all p ∈ [1, ∞] in Corollary 4.1. In comparison, sharp p-norm parameter estimation bounds cannot be directly derived from the prediction error bounds studied in some earlier work. We shall also point out that some 1-norm parameter estimation bounds were established in [2], but not for p > 1. Around the time this paper was written, related parameter estimation error bounds were also obtained in [1] under appropriate sparse eigenvalue conditions, both for the Dantzig selector and for the Lasso. However, those results are only for p ∈ [0, 2] and only for a truly sparse target (that is, E(y|x) = β̄^T x for some sparse β̄).
In particular, their parameter estimation bound with p = 2 does not reproduce the main result of [4], while our bounds in this paper (which allow an approximately sparse target) do. We shall point out that, in addition to parameter estimation error bounds, a prediction error bound that allows an approximately sparse target was also obtained in [1], in a form quite similar to the parameter estimation bounds for the Lasso in this paper and that of the Dantzig selector in [4]. However, that result does not immediately imply a bound on the p-norm parameter estimation error. A similar bound was derived in [5] for the Lasso, but not as elaborate as that in [4]. Another related work is [20], which contains a 2-norm parameter estimation error bound for the Lasso, but in a cruder form than ours. In particular, their result is worse than the form given in [4] as well as the first claim of Theorem 4.1 in this paper. A similar 2-norm parameter estimation error bound, but only for a truly sparse target, can be found in [18]. In [9], the authors derived a 2-norm estimation bound for the Lasso with approximately sparse targets under mutual incoherence conditions but without stochastic noise. We shall note that the noise in their paper is not random, and corresponds to the approximation error in our notation, as discussed in Section 5. Their result is thus weaker than our result in Corollary 5.1. In addition to the above work, prediction error bounds were also obtained in [15, 14, 11] for general loss functions and random design. However, such results cannot be used to derive the p-norm parameter estimation bounds which we consider here.

3 Conditions on the Design Matrix

In order to obtain good bounds, it is necessary to impose conditions on the design matrix which generally specify that small diagonal blocks of the design matrix are non-singular; examples are the mutual incoherence conditions of [9] and the sparse eigenvalue conditions of [20].
The sparse eigenvalue condition is also known as the RIP (restricted isometry property) in the compressed sensing literature; it was first introduced in [7]. Mutual incoherence conditions are usually more restrictive than sparse eigenvalue conditions.
That is, a matrix that satisfies an appropriate mutual incoherence condition will also satisfy the sparse eigenvalue condition needed in our analysis, but the reverse direction does not hold. For example, we will see from the discussion at the end of Section 4 that for random design matrices, more samples are needed to satisfy the mutual incoherence condition than to satisfy the sparse eigenvalue condition. In our analysis, the weaker sparse eigenvalue condition can be used to obtain sharp bounds for the 2-norm parameter estimation error. Since we are interested in general p-norm parameter estimation error bounds, other conditions on the design matrix will be considered. They can be regarded as generalizations of the sparse eigenvalue condition or RIP. All these conditions are weaker than the mutual incoherence condition.

We introduce the following definitions, which specify properties of sub-matrices of a large matrix A. These quantities (when used with the design matrix Â = (1/n) Σ_{i=1}^n x_i x_i^T) will appear in our results.

Definition 3.1 The p-norm of a vector β = [β_1, ..., β_d] ∈ R^d is defined as ‖β‖_p = (Σ_{j=1}^d |β_j|^p)^{1/p}. Let A ∈ R^{d×d} be a positive semi-definite matrix, and let l, k ≥ 1 be such that l + k ≤ d. Let I, J be disjoint subsets of {1, ..., d} with k and l elements respectively. Let A_{I,I} ∈ R^{k×k} be the restriction of A to indices I, and A_{I,J} ∈ R^{k×l} be the restriction of A to indices I on the left and J on the right. Define, for p ∈ [1, ∞]:

    ρ^(p)_{A,k} = sup_{v ∈ R^k, I} ‖A_{I,I} v‖_p / ‖v‖_p,        μ^(p)_{A,k} = inf_{v ∈ R^k, I} ‖A_{I,I} v‖_p / ‖v‖_p,
    θ^(p)_{A,k,l} = sup_{u ∈ R^l, I, J} ‖A_{I,J} u‖_p / ‖u‖_p,    γ^(p)_{A,k,l} = sup_{u ∈ R^l, I, J} ‖A_{I,I}^{-1} A_{I,J} u‖_p / ‖u‖_p.

Moreover, for all v = [v_1, ..., v_k] ∈ R^k, define v^{p−1} = [|v_1|^{p−1} sgn(v_1), ..., |v_k|^{p−1} sgn(v_k)], and

    ω^(p)_{A,k} = inf_{v ∈ R^k, I} max(0, v^T A_{I,I} v^{p−1}) / ‖v‖_p^p,
    π^(p)_{A,k,l} = sup_{v ∈ R^k, u ∈ R^l, I, J} (v^{p−1})^T A_{I,J} u · ‖v‖_p / [ max(0, v^T A_{I,I} v^{p−1}) ‖u‖_p ].

The ratio ρ^(p)_{A,k}/μ^(p)_{A,k} measures how close the k×k diagonal sub-matrices of A are to the identity matrix.
The RIP concept in [7] can be stated in terms of ρ^(2)_{A,k} and μ^(2)_{A,k} in our notation. The quantities θ^(p)_{A,k,l}, γ^(p)_{A,k,l}, and π^(p)_{A,k,l} measure how close to zero the k×l off-diagonal blocks of A are. Note that μ^(2)_{A,k} = ω^(2)_{A,k}, and that μ^(2)_{A,k} and ρ^(2)_{A,k} are the smallest and largest eigenvalues of the k×k diagonal blocks of A. It is easy to see that μ^(p)_{A,k} ≤ ρ^(p)_{A,k}. Moreover, we can also obtain bounds on θ^(2)_{A,k,l}, γ^(2)_{A,k,l}, and π^(2)_{A,k,l} using eigenvalues of sub-matrices of A. The bounds essentially say that if the diagonal sub-matrices of A of size k + l are well-conditioned, then the quantities θ^(2)_{A,k,l}, γ^(2)_{A,k,l}, and π^(2)_{A,k,l} are O(√l).

Proposition 3.1 The following inequalities hold:

    θ^(2)_{A,k,l} ≤ l^{1/2} [ (ρ^(2)_{A,k} − μ^(2)_{A,l+k})(ρ^(2)_{A,l} − μ^(2)_{A,l+k}) ]^{1/2},
    π^(2)_{A,k,l} ≤ (l^{1/2}/2) ( ρ^(2)_{A,l}/μ^(2)_{A,k+l} − 1 ),
    π^(p)_{A,k,l} ≤ θ^(p)_{A,k,l}/ω^(p)_{A,k},        γ^(p)_{A,k,l} ≤ θ^(p)_{A,k,l}/μ^(p)_{A,k},
    θ^(p)_{A,k,l} ≤ k^{max(0, 1/p − 0.5)} θ^(2)_{A,k,l},    γ^(p)_{A,k,l} ≤ k^{max(0, 1/p − 0.5)} γ^(2)_{A,k,l},
    ω^(p)_{A,k}, μ^(p)_{A,k} ≥ min_i A_{i,i} − sup_I ‖A_{I,I} − diag(A_{I,I})‖_p,
where, for a matrix B, diag(B) is the diagonal part of B, and ‖B‖_p = sup_u (‖Bu‖_p / ‖u‖_p). The last inequality in the above proposition shows that μ^(p)_{A,k} > 0 and ω^(p)_{A,k} > 0 when A has a certain diagonal dominance (in p-norm) property for its k×k blocks.

Finally, we state a result which bounds all the quantities defined here using the mutual incoherence concept of [9]. This result shows that mutual incoherence is a stronger notion than all the quantities we employ in this paper. Although more complicated, by using these less restrictive quantities in Theorem 4.1, we obtain stronger results than by using the mutual incoherence condition (Corollary 4.1). For simplicity, we consider a diagonally normalized A such that A_{i,i} = 1 for all i.

Proposition 3.2 Given a matrix A ∈ R^{d×d}, assume that A_{i,i} = 1 for all i. Define the mutual coherence coefficient as M_A = sup_{i ≠ j} |A_{i,j}|. Then the following bounds hold:

    ρ^(p)_{A,k} ≤ 1 + M_A k,
    μ^(p)_{A,k}, ω^(p)_{A,k} ≥ 1 − M_A k,
    θ^(p)_{A,k,l} ≤ M_A k^{1/p} l,
    π^(p)_{A,k,l} ≤ M_A k^{1/p} l / max(0, 1 − M_A k),
    γ^(p)_{A,k,l} ≤ M_A k^{1/p} l / max(0, 1 − M_A k).

4 A General Performance Bound for L1 Regularization

For simplicity, we assume sub-Gaussian noise, as follows. We use x_{i,j} to indicate the j-th component of the vector x_i ∈ R^d.

Assumption 4.1 Assume that, conditioned on {x_i}_{i=1,...,n}, the {y_i}_{i=1,...,n} are independent (but not necessarily identically distributed) sub-Gaussians: there exists a constant σ ≥ 0 such that for all i and all t ∈ R,

    E_{y_i} [ e^{t(y_i − E y_i)} | {x_i}_{i=1,...,n} ] ≤ e^{σ² t²/2}.

Both Gaussian and bounded random variables are sub-Gaussian under the above definition. For example, if a random variable ξ ∈ [a, b], then E_ξ e^{t(ξ − Eξ)} ≤ e^{(b−a)² t²/8}. If a random variable is Gaussian, ξ ∼ N(0, σ²), then E_ξ e^{tξ} ≤ e^{σ² t²/2}. For convenience, we also introduce the following definition.

Definition 4.1 Let β = [β_1, ..., β_d] ∈ R^d and α ≥ 0. We define the set of relevant features with threshold α as

    supp_α(β) = {j : |β_j| > α}.
Moreover, if |β_{(1)}| ≥ ... ≥ |β_{(d)}| are the components of β in descending order of absolute value, then define

    r^(p)_k(β) = ( Σ_{j=k+1}^d |β_{(j)}|^p )^{1/p}

as the p-norm of the d − k smallest components (in absolute value) of β.
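The quantities supp_α(β) and r^(p)_k(β) of Definition 4.1 are straightforward to compute; the small numpy sketch below (helper names are ours, not the paper's) makes the definitions concrete.

```python
import numpy as np

def supp(beta, alpha=0.0):
    """supp_alpha(beta) = {j : |beta_j| > alpha}."""
    return set(np.flatnonzero(np.abs(beta) > alpha))

def r_k(beta, k, p=2.0):
    """p-norm of the d-k smallest components of beta in absolute value."""
    tail = np.sort(np.abs(beta))[: len(beta) - k]   # the d-k smallest magnitudes
    if p == np.inf:
        return float(tail.max()) if tail.size else 0.0
    return float((tail ** p).sum() ** (1.0 / p))
```

For example, with β = [3.0, −0.2, 0.05, 1.0], supp(β, 0.5) is {0, 3} and r_k(β, 2, p=1) is 0.05 + 0.2 = 0.25.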
Consider a target parameter vector β̄ ∈ R^d that is approximately sparse. Note that we do not assume that β̄ is the true target; that is, β̄^T x_i may not equal E y_i. We only assume that this holds approximately, and we are interested in how well we can estimate β̄ using (1). In particular, the approximation quality (or the closeness to the true model) of any β̄ is measured in our analysis by ‖(1/n) Σ_{i=1}^n (β̄^T x_i − E y_i) x_i‖_∞. If β̄^T x_i = E y_i, then the underlying model is y_i = β̄^T x_i + ε_i, where the ε_i are independent sub-Gaussian noises. In the more general case, we only need to assume that ‖(1/n) Σ_{i=1}^n (β̄^T x_i − E y_i) x_i‖_∞ is small for some approximate target β̄. The relationship between this quantity and the least squares approximation error is discussed in Section 5.

The following theorem holds both in fixed design and in random design. The only difference is that in the fixed design situation we may let a = (sup_j Â_{j,j})^{1/2}, so that the condition (sup_j Â_{j,j})^{1/2} ≤ a holds automatically. In order to simplify the claims, our later results are all stated under the fixed design assumption. In the following theorem, the statement "with probability 1 − δ: if X then Y" can also be interpreted as "with probability 1 − δ: either X is false or Y is true". We note that in practice the condition of the theorem can be combinatorially hard to check, since computing the quantities in Definition 3.1 requires searching over all index sets of a fixed cardinality.

Theorem 4.1 Let Assumption 4.1 hold, and let Â = (1/n) Σ_{i=1}^n x_i x_i^T. Let β̂ be the solution of (1). Consider any fixed target vector β̄ ∈ R^d and a positive constant a > 0. Given δ ∈ (0, 1), with probability larger than 1 − δ, the following two claims hold for q = 1, p, and all k, l such that k ≤ l ≤ (d − k)/2, t ∈ (0, 1), and p ∈ [1, ∞]:

If t ≤ 1 − π^(p)_{Â,k,l} k^{1−1/p}/l, λ ≥ (4(2 − t)/t) ( σ a √(2 ln(2d/δ)/n) + ‖(1/n) Σ_{i=1}^n (β̄^T x_i − E y_i) x_i‖_∞ ), and (sup_j Â_{j,j})^{1/2} ≤ a, then

    ‖β̂ − β̄‖_q ≤ (8 k^{1/q−1/p} / (t ω^(p)_{Â,k})) [ ρ^(p)_{Â,k} r^(p)_k(β̄) + k^{1/p} λ ] + (32 k^{1/q−1/p}/t) π^(p)_{Â,k,l} r^(1)_k(β̄) l^{−1} + 4 k^{1/q−1/p} r^(p)_k(β̄) + 4 r^(1)_k(β̄) l^{1/q−1}.

If t ≤ 1 − γ^(p)_{Â,k,l} k^{1−1/p}/l, λ ≥ (4(2 − t)/t) ( σ a √(2 ln(2d/δ)/n) + ‖(1/n) Σ_{i=1}^n (β̄^T x_i − E y_i) x_i‖_∞ ), and (sup_j Â_{j,j})^{1/2} ≤ a, then

    ‖β̂ − β̄‖_q ≤ (8 k^{1/q−1/p}/t) [ 4 γ^(p)_{Â,k,l} l^{−1} r^(1)_k(β̄) + λ (k + l)^{1/p} / μ^(p)_{Â,k+l} ] + 4 r^(1)_k(β̄) l^{1/q−1}.

Although for a fixed p the above theorem only gives bounds for ‖β̂ − β̄‖_q with q = 1, p, this information is sufficient to obtain bounds for general q ∈ [1, ∞]. If q ∈ [1, p], we can use the following interpolation rule (which follows from Hölder's inequality):

    ‖β̂ − β̄‖_q ≤ ‖β̂ − β̄‖_1^{(p−q)/(pq−q)} ‖β̂ − β̄‖_p^{(pq−p)/(pq−q)},

and if q > p, we can use ‖β̂ − β̄‖_q ≤ ‖β̂ − β̄‖_p. Although the estimate obtained when q ≠ p is typically worse than the bound achieved at p = q (assuming that the condition of the theorem can be satisfied at p = q), it may still be useful because the condition for the theorem to apply may be easier to satisfy for certain p.
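The interpolation rule ‖v‖_q ≤ ‖v‖_1^{(p−q)/(pq−q)} ‖v‖_p^{(pq−p)/(pq−q)} for q ∈ [1, p] can be sanity-checked numerically. The sketch below (function name ours) evaluates the right-hand side; note that the two exponents sum to 1, as they must for a Hölder interpolation.

```python
import numpy as np

def interp_bound(v, q, p):
    """Right-hand side of the norm interpolation rule for 1 <= q <= p."""
    n1 = np.abs(v).sum()                       # ||v||_1
    npn = (np.abs(v) ** p).sum() ** (1 / p)    # ||v||_p
    e1 = (p - q) / (p * q - q)                 # exponent on ||v||_1
    ep = (p * q - p) / (p * q - q)             # exponent on ||v||_p; e1 + ep = 1
    return n1 ** e1 * npn ** ep

rng = np.random.default_rng(1)
v = rng.standard_normal(50)
q, p = 1.5, 3.0
lhs = (np.abs(v) ** q).sum() ** (1 / q)        # ||v||_q
assert lhs <= interp_bound(v, q, p) + 1e-9
```

The check holds for any vector, since the rule is an instance of Hölder's inequality rather than a property of the data.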
It is important to note that the first claim is more refined than the second, as it replaces the explicit l-dependent term O(l^{1/p} λ) by the term O(r^(p)_k(β̄)), which does not explicitly depend on l. In order to optimize the bound, we can choose k = |supp_λ(β̄)|, which implies that O(r^(p)_k(β̄)) = O(l^{1/p} λ + r^(p)_{k+l}(β̄)) = O(l^{1/p} λ + l^{1/p−1} r^(1)_k(β̄)). This quantity is dominated by the bound in the second claim. However, if O(r^(p)_k(β̄)) is small, then the first claim is much better when l is large.

The theorem as stated is not intuitive. In order to obtain a more intuitive bound from the first claim of the theorem, we consider a special case under the mutual incoherence condition. The following corollary is a simple consequence of the first claim of Theorem 4.1 (with q = p) and Proposition 3.2. The result shows that the mutual incoherence condition is a stronger assumption than the quantities appearing in our analysis.

Corollary 4.1 Let Assumption 4.1 hold, let Â = (1/n) Σ_{i=1}^n x_i x_i^T, and assume that Â_{j,j} = 1 for all j. Define M_Â = sup_{i ≠ j} |Â_{i,j}|. Let β̂ be the solution of (1). Consider any fixed target vector β̄ ∈ R^d. Given δ ∈ (0, 1), with probability larger than 1 − δ, the following claim holds for all k ≤ l ≤ (d − k)/2, t ∈ (0, 1), p ∈ [1, ∞]: if M_Â (k + l) ≤ (1 − t)/(2 − t) and

    λ ≥ (4(2 − t)/t) ( σ √(2 ln(2d/δ)/n) + ‖(1/n) Σ_{i=1}^n (β̄^T x_i − E y_i) x_i‖_∞ ),

then

    ‖β̂ − β̄‖_p ≤ (8(2 − t)/t) [ 1.5 r^(p)_k(β̄) + k^{1/p} λ ] + 4 r^(p)_k(β̄) + (4(8 − 7t)/t) r^(1)_k(β̄) l^{1/p−1}.

The above result is of the following form:

    ‖β̂ − β̄‖_p = O( k^{1/p} λ + r^(p)_k(β̄) + r^(1)_k(β̄) l^{1/p−1} ),    (2)

where we can let k = |supp_λ(β̄)|, so that k is the number of components with |β̄_j| > λ. Although mutual incoherence is assumed for simplicity, a similar bound holds for any p if we assume that Â is p-diagonally dominant at block size k + l; such an assumption is weaker than mutual incoherence. To our knowledge, none of the earlier work on the Lasso obtained parameter estimation bounds of the form (2).
The first two terms in the bound are what we should expect from the L1-regularization method (1), and thus are unlikely to be significantly improvable (except for the constants): the first term is the variance term, and the second term is needed because L1 regularization tends to shrink the coefficients j ∉ supp_λ(β̄) to zero. Although it is not clear whether the third term can be improved, we note that it becomes small if we can choose a large l. Note that if β̄ is the true parameter (β̄^T x_i = E y_i), then we may take λ = 4(2 − t) t^{−1} σ √(ln(d/δ)/n) in (2). The bound in [4] has a similar form (but with p = 2), which we will compare in the next section.

Note that in Theorem 4.1 one can always take λ sufficiently large, so that the condition on λ is satisfied. Therefore, in order to apply the theorem, one needs either the condition 0 < t ≤ 1 − π^(p)_{Â,k,l} k^{1−1/p}/l or the condition 0 < t ≤ 1 − γ^(p)_{Â,k,l} k^{1−1/p}/l. These require that small diagonal blocks of Â are not nearly singular. As pointed out after Proposition 3.1, the condition for the first claim is typically harder to satisfy. For example, as discussed below, even when p ≥ 2, the requirement γ^(p)_{Â,k,l} k^{1−1/p}/l < 1 can always be satisfied when diagonal sub-blocks of Â of a certain size satisfy some eigenvalue conditions, while this is not true for the condition π^(p)_{Â,k,l} k^{1−1/p}/l < 1.

In the case of p = 2, the condition 0 < 1 − π^(2)_{Â,k,l} k^{0.5}/l can always be satisfied if the small diagonal blocks of Â have eigenvalues bounded from above and below (away from zero). We shall
refer to such a condition as a sparse eigenvalue condition (see also [20, 1]). Indeed, Proposition 3.1 implies that π^(2)_{Â,k,l} k^{0.5}/l ≤ 0.5 (k/l)^{0.5} (ρ^(2)_{A,l}/μ^(2)_{A,k+2l} − 1). Therefore this condition can be satisfied if we can find l such that ρ^(2)_{A,l}/μ^(2)_{A,k+2l} ≤ 1 + 2√(l/k). In particular, if ρ^(2)_{A,l}/μ^(2)_{A,k+2l} ≤ c for a constant c > 0 when l ≥ ck (which is what we will mean by the sparse eigenvalue condition in later discussions), then one can simply take l = ck. For p > 2, a similar claim holds for the condition 0 < 1 − γ^(p)_{Â,k,l} k^{1−1/p}/l. Proposition 3.1 implies that γ^(p)_{Â,k,l} ≤ γ^(2)_{Â,k,l} ≤ θ^(2)_{Â,k,l}/μ^(2)_{Â,k} = O(√l ρ^(2)/μ^(2)). A sparse eigenvalue condition (bounded eigenvalues) at block size of order k^{2−2/p} implies that the condition 0 < 1 − γ^(p)_{Â,k,l} k^{1−1/p}/l can be satisfied with an appropriate choice of l = O(k^{2−2/p}). Under assumptions that are stronger than the sparse eigenvalue condition, one can obtain better and simpler results; this is demonstrated in Corollary 4.1 under the mutual incoherence condition.

Finally, in order to concretely compare the condition of Theorem 4.1 for different p, we consider the random design situation (with random vectors x_i), where each component x_{i,j} is independently drawn from the standard Gaussian distribution N(0, 1) (i = 1, ..., n and j = 1, ..., d). This situation is investigated in the compressed sensing literature, such as [7]. In particular, it was shown that with large probability, the following RIP inequality holds for some constant c > 0 (s ≤ n ≤ d): ρ^(2)_{Â,s} ≤ 1 + c√(s ln d/n) and μ^(2)_{Â,s} ≥ 1 − c√(s ln d/n). Now for p ≥ 2, using Proposition 3.1, it is not hard to show (we skip the detailed derivation here because it is not essential to the main point of this paper) that for k ≤ l: θ^(p)_{Â,k,l} ≤ 4cl√(ln d/n) and ω^(p)_{Â,k+l} ≥ 1 − 2c√(l^{2−2/p} ln d/n). Therefore, the condition on γ^(p)_{Â,k,l} k^{1−1/p}/l in Theorem 4.1 holds with large probability as long as n ≥ 256 c² l^{2−2/p} ln d.
Therefore, in order to apply the theorem with fixed k ≤ l and d, the larger p is, the larger the sample size n has to be. In comparison, the mutual incoherence condition of Corollary 4.1 is satisfied when n ≥ c′ l² ln d for some constant c′ > 0.

5 Noise and Approximation Error

In Theorem 4.1 and Corollary 4.1, we do not assume that β̄ is the true parameter that generates E y_i. The bounds depend on the quantity ‖(1/n) Σ_{i=1}^n (β̄^T x_i − E y_i) x_i‖_∞ to measure how far β̄ is from the true parameter. This quantity may be regarded as an algebraic definition of noise, in that it behaves like stochastic noise. The following proposition shows that if the least squares error achieved by β̄ (which is often called the approximation error) is small, then the algebraic noise level (as defined above) is also small. However, the reverse is not true. For example, for the identity design matrix and a true parameter β̃ with β̃^T x_i = E y_i, the algebraic noise is ‖(1/n) Σ_i (β̄^T x_i − E y_i) x_i‖_∞ = ‖β̄ − β̃‖_∞, but the least squares approximation error is ‖β̄ − β̃‖_2². Therefore, in general, the algebraic noise can be small even when the least squares approximation error is large.

Proposition 5.1 Let Â = (1/n) Σ_{i=1}^n x_i x_i^T, and a = (sup_j Â_{j,j})^{1/2}. Given k ≥ 0, there exists β̄^(k) ∈ R^d such that

    ‖(1/n) Σ_{i=1}^n (β̄^{(k)T} x_i − E y_i) x_i‖_∞ ≤ (a/√k) ( (1/n) Σ_{i=1}^n (β̄^T x_i − E y_i)² )^{1/2},
|supp_0(β̄^(k) − β̄)| ≤ k, and ‖β̄^(k) − β̄‖_2 ≤ 2 ( (1/n) Σ_{i=1}^n (β̄^T x_i − E y_i)² )^{1/2} / μ^(2)_{Â,k}.

This proposition can be combined with Theorem 4.1 or Corollary 4.1 to derive bounds in terms of the approximation error ((1/n) Σ_i (β̄^T x_i − E y_i)²)^{1/2} instead of the algebraic noise ‖(1/n) Σ_i (β̄^T x_i − E y_i) x_i‖_∞. For example, as a simple consequence of Corollary 4.1, we have the following bound. A similar but less general result (with σ = 0 and |supp_0(β̄)| = k) was presented in [9].

Corollary 5.1 Let Assumption 4.1 hold, let Â = (1/n) Σ_{i=1}^n x_i x_i^T, and assume that Â_{j,j} = 1 for all j. Define M_Â = sup_{i ≠ j} |Â_{i,j}|. Let β̂ be the solution of (1). Consider any fixed target vector β̄ ∈ R^d. Given δ ∈ (0, 1), with probability larger than 1 − δ, the following claim holds for all 2k ≤ l ≤ (d − 2k)/2 and t ∈ (0, 1): if M_Â (2k + l) ≤ (1 − t)/(2 − t) and

    λ ≥ (4(2 − t)/t) ( σ √(2 ln(2d/δ)/n) + ε/√k ),

then

    ‖β̂ − β̄‖_2 ≤ (8(2 − t)/t) [ 1.5 r^(2)_k(β̄) + (2k)^{1/2} λ ] + 4 r^(2)_k(β̄) + (4(8 − 7t)/t) r^(1)_k(β̄) l^{−1/2} + 4ε,

where ε = ( (1/n) Σ_{i=1}^n (β̄^T x_i − E y_i)² )^{1/2}.

A similar result holds under the sparse eigenvalue condition. We should point out that in L1 regularization, the behavior of the stochastic noise (σ > 0) is similar to that of the algebraic noise introduced above, but very different from that of the least squares approximation error ε. In particular, the so-called bias of L1 regularization shows up in the stochastic noise term but not in the least squares approximation error term. If we set σ = 0 but ε ≠ 0, our analysis of the two-stage procedure in Section 8 will not improve that of the standard Lasso given in Corollary 5.1, simply because the two-stage procedure does not improve the term involving the approximation error ε. However, the benefit of the two-stage procedure clearly shows up in the stochastic noise term. For this reason, it is important to distinguish the true stochastic noise from the approximation error ε, and to develop an analysis that includes both stochastic noise and approximation error.
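The distinction in this section between the algebraic noise and the least squares approximation error can be made concrete with the identity-design example mentioned in the text. In the numpy sketch below (the construction and the numbers are ours), a target that is wrong by a small amount in every coordinate has small algebraic noise (∞-norm of the coordinate errors) but large approximation error (squared 2-norm of the same errors).

```python
import numpy as np

d = 400
X = np.sqrt(d) * np.eye(d)       # n = d rows, so the Gram matrix A_hat = (1/n) X^T X = I
beta_tilde = np.zeros(d)         # "true" parameter: E y_i = beta_tilde^T x_i = 0
beta_bar = np.full(d, 0.05)      # approximate target: many small coefficient errors
Ey = X @ beta_tilde

# algebraic noise: ||(1/n) sum_i (beta_bar^T x_i - E y_i) x_i||_inf = ||beta_bar - beta_tilde||_inf
alg_noise = np.abs(X.T @ (X @ beta_bar - Ey) / d).max()
# least squares approximation error: (1/n) sum_i (beta_bar^T x_i - E y_i)^2 = ||beta_bar - beta_tilde||_2^2
approx_err = ((X @ beta_bar - Ey) ** 2).mean()
# here alg_noise stays at 0.05 while approx_err is 1.0
```

Increasing d with the same per-coordinate error makes the gap between the two quantities arbitrarily large, which is exactly the point made after Proposition 5.1.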
6 Dantzig Selector versus L1 Regularization

Recently, Candes and Tao proposed an estimator called the Dantzig selector in [4], and proved a very strong performance bound for this new method. However, it was observed in [10, 8, 13, 12] that the performance of the Lasso is comparable to that of the Dantzig selector. Consequently, the authors of [10] asked whether a performance bound similar to that of the Dantzig selector holds for the Lasso as well. In this context, we observe that a simple but important consequence of the first claim of Theorem 4.1 leads to a bound for L1 regularization that reproduces the main result for the Dantzig selector in [4]. We state the result below; it provides an affirmative answer to the above-mentioned open question of [10].

Corollary 6.1 Let Assumption 4.1 hold, let Â = (1/n) Σ_{i=1}^n x_i x_i^T, and let a = (sup_j Â_{j,j})^{1/2}. Consider the true target vector β̄ such that E y = β̄^T x. Let β̂ be the solution of (1). Given δ ∈ (0, 1), with probability larger than 1 − δ, the following claim holds for all
k ≤ l ≤ (d − k)/2. If t = 1 − π^(2)_{Â,k,l} k^{0.5}/l > 0 and λ ≥ 4(2 − t) t^{−1} σ a √(2 ln(2d/δ)/n), then

    ‖β̂ − β̄‖_2 ≤ ( 32 ρ^(2)_{Â,k}/(t μ^(2)_{Â,k}) + 4 ) r^(1)_k(β̄)/√l + ( 8 ρ^(2)_{Â,k}/(t μ^(2)_{Â,k}) + 4 ) r^(2)_k(β̄) + ( 8/(t μ^(2)_{Â,k}) ) √k λ.

Corollary 6.1 is directly comparable to the main result of [4] for the Dantzig selector, which is given by the following estimator:

    β̂_D = arg min_{β ∈ R^d} ‖β‖_1   subject to   sup_j | Σ_{i=1}^n x_{i,j} (x_i^T β − y_i) | ≤ b_D.

Their main result is stated below as Theorem 6.1. It uses a different quantity θ̄_{A,k,l}, defined as

    θ̄_{A,k,l} = sup_{β ∈ R^l, I, J} ‖A_{I,J} β‖_2 / ‖β‖_2,

using the notation of Definition 3.1. It is easy to see that θ^(2)_{A,k,l} ≤ θ̄_{A,k,l} ≤ θ^(2)_{A,k,l} √l.

Theorem 6.1 ([4]) Assume that there exists a vector β̄ ∈ R^d with s non-zero components, such that y_i = β̄^T x_i + ε_i, where the ε_i ∼ N(0, σ²) are iid Gaussian noises. Let Â = (1/n) Σ_{i=1}^n x_i x_i^T, and assume that Â_{j,j} ≤ 1 for all j. Given t_D > 0 and δ ∈ (0, 1), set b_D = √n λ_D σ, with

    λ_D = √( (1 − (ln δ + ln(2√π ln d))/ln d + t_D^{−1}) ln d ).

Let θ_{Â,2s} = max(ρ^(2)_{Â,2s} − 1, 1 − μ^(2)_{Â,2s}). If θ_{Â,2s} + θ̄_{Â,2s,s} < 1 − t_D, then with probability exceeding 1 − δ:

    ‖β̂_D − β̄‖_2² ≤ C₂(θ_{Â,2s}, θ̄_{Â,2s,s})² λ_D² ( (k + 1) σ²/n + r^(2)_k(β̄)² ).

The quantity C₂(a, b) is an explicit function of a and b alone, with denominators of the form 1 − a − b; it remains bounded as long as a + b stays bounded away from 1 (see [4] for the exact expression).

In order to see that Corollary 6.1 is comparable to Theorem 6.1, we shall compare their conditions and consequences. To this end, we can pick any l ∈ [k, s] in Corollary 6.1. We first look at the conditions. The condition required in Theorem 6.1 is θ_{Â,2s} + θ̄_{Â,2s,s} < 1 − t_D, which implies that θ̄_{Â,2s,s} < μ^(2)_{Â,2s} − t_D. This condition is stronger than the condition θ^(2)_{Â,k,l}/√k < μ^(2)_{Â,k} − t_D, which implies the condition t > 0 in Corollary 6.1:

    t = 1 − π^(2)_{Â,k,l} k^{0.5}/l ≥ 1 − θ^(2)_{Â,k,l} k^{0.5}/(μ^(2)_{Â,k} l) ≥ 1 − (μ^(2)_{Â,k} − t_D) k/(μ^(2)_{Â,k} l) = (1 − k/l) + t_D k/(μ^(2)_{Â,k} l) > 0.

Therefore Corollary 6.1 can be applied whenever Theorem 6.1 can be applied. Moreover, as long as t_D > 0, t is never much smaller than t_D, but can be significantly larger (e.g., t > 0.5 when
k ≤ l/2, even if t_D is close to zero). It is also obvious that the condition t > 0 does not imply that t_D > 0. Therefore the condition of Corollary 6.1 is strictly weaker. As discussed in Section 4, if ρ^(2)_{A,l}/μ^(2)_{A,k+2l} ≤ c for a constant c > 0 when l ≥ ck, then the condition t > 0 holds with l = ck.

Next we look at the consequences of the two theorems when both t > 0 and t_D > 0. Ignoring constants, the bound in Theorem 6.1, with λ_D = O(√ln(d/δ)), can be written as: ‖β̂_D − β̄‖_2 = O(√ln(d/δ)) (r^(2)_k(β̄) + σ √(k/n)). However, the proof itself implies a stronger bound of the form

    ‖β̂_D − β̄‖_2 = O( σ √(k ln(d/δ)/n) + r^(2)_k(β̄) ).    (3)

In comparison, in Corollary 6.1 we can pick λ = O(σ √(ln(d/δ)/n)), and then the bound can be written as (with l = s):

    ‖β̂ − β̄‖_2 = O( σ √(k ln(d/δ)/n) + r^(2)_k(β̄) + r^(1)_k(β̄)/√s ).    (4)

Note that we do not have to assume that β̄ only contains s non-zero components. The quantity r^(1)_k(β̄)/√s is no more than r^(2)_k(β̄) under the sparsity of β̄ assumed in Theorem 6.1. It is thus clear that (4) has a more general form than (3).

It was pointed out in [1] that the Lasso and the Dantzig selector are quite similar, and the authors presented a simultaneous analysis of both. Since the explicit parameter estimation bounds in [1] are for the case k = |supp_0(β̄)|, it is natural to ask whether our results (in particular, the first claim of Theorem 4.1) can also be applied to the Dantzig selector, so that a simultaneous analysis similar to that of [1] can be established. Unfortunately, the techniques used in this paper do not immediately give an affirmative answer. This is because a Lasso-specific property is used in our proof of Lemma 10.4, and this property does not hold for the Dantzig selector. However, we conjecture that it may still be possible to prove similar results for the Dantzig selector through different techniques, such as those employed in [4] and [6].
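For readers who want to experiment with the comparison in this section, the Dantzig selector program can be written as a linear program by splitting β = u − v with u, v ≥ 0. The sketch below (function name ours) uses scipy.optimize.linprog; it is an illustration of the optimization problem being compared, not the implementation analyzed in [4].

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, b_D):
    """min ||beta||_1  s.t.  ||X^T (X beta - y)||_inf <= b_D, as an LP in (u, v) with beta = u - v."""
    n, d = X.shape
    G = X.T @ X
    c = np.ones(2 * d)                     # minimize sum(u) + sum(v) = ||beta||_1
    A = np.hstack([G, -G])                 # G @ (u - v) = X^T X beta
    A_ub = np.vstack([A, -A])              # encode |G beta - X^T y| <= b_D componentwise
    b_ub = np.concatenate([b_D + X.T @ y, b_D - X.T @ y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:d], res.x[d:]
    return u - v
```

The LP is always feasible when X has full column rank, since the ordinary least squares solution makes the constraint left-hand side exactly zero.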
7 Feature Selection through Coefficient Thresholding

A fundamental result for L1 regularization is its feature selection consistency property, considered in [17] and more formally analyzed in [21]. It was shown that under a strong irrepresentable condition (introduced in [21]), together with the sparse eigenvalue condition, the set supp_0(β̂), with β̂ estimated using the Lasso, may be able to consistently identify features with coefficients larger than a threshold of order √k λ (with λ = O(σ √(ln(d/δ)/n))). Here, k is the sparsity of the true target. That is, with probability 1 − δ, all coefficients larger than a certain threshold of order O(σ √(k ln(d/δ)/n)) remain non-zero, while all zero coefficients remain zero. It was also shown that a slightly weaker irrepresentable condition is necessary for the Lasso to possess this property. For the Lasso, the √k factor cannot be removed (under the sparse eigenvalue assumption plus the irrepresentable condition) unless additional conditions (such as the mutual incoherence assumption in Corollary 7.1) are imposed. See also [5, 19] for related results without the √k factor. It was acknowledged in [20, 18] that the irrepresentable condition can be quite strong; e.g., it is often more restrictive than the eigenvalue conditions required for Corollary 6.1. This was the motivation for the sparse eigenvalue condition introduced in [20], although such a condition does not necessarily yield
consistent feature selection under the scheme of [21], which employs the set supp_0(β̂) to identify features. However, the limitations of the irrepresentable condition can be removed by considering supp_α(β̂) with α > 0. In this section, we consider a more extended view of feature selection, where a practitioner would like to find relevant features with coefficient magnitude larger than some threshold α that is not necessarily zero. Features with small coefficients are regarded as irrelevant; they are not distinguished from zero for practical purposes. The threshold α can be pre-chosen based on the interests of the practitioner as well as our knowledge of the underlying problem. We are interested in the relationship between the features estimated from the solution β̂ of (1) and the true relevant features obtained from β̄. The following result is a simple consequence of Theorem 4.1, where we use ‖β̂ − β̄‖_p for some large p to approximate ‖β̂ − β̄‖_∞ (which is needed for feature selection). A consequence of the result (see Corollary 7.2) is that using a non-zero threshold α (rather than the zero threshold of [21]), it is possible to achieve consistent feature selection even if the irrepresentable condition in [21] is violated. For clarity, we choose a simplified statement with a sparse target β̄. However, it is easy to see from the proof that, just as in Theorem 4.1, a similar but more complicated statement holds even when the target is not sparse.

Theorem 7.1 Let Assumption 4.1 hold, and let Â = (1/n) Σ_i x_i x_i^T and a = (sup_j Â_{j,j})^{1/2}. Let β̄ ∈ R^d be the true target vector with E y = β̄^T x, and assume that |supp_0(β̄)| = k. Let β̂ be the solution of (1). Given δ ∈ (0, 1), with probability larger than 1 − δ, the following claim is true.
For all ε ∈ (0, 1), if there exist k ≤ l ≤ (d − k)/2, t ∈ (0, 1), and p ∈ [1, ∞] so that:

λ ≥ 4(2 − t) t^{−1} σa √(2 ln(2d/δ)/n);

either 8(εα ω^(p)_{Â,k+l})^{−1} k^{1/p} λ ≤ t ≤ 1 − π^(p)_{Â,k+l,l} k^{1−1/p}/l, or 8(εα µ^(p)_{Â,k+l})^{−1} (k + l)^{1/p} λ ≤ t ≤ 1 − γ^(p)_{Â,k+l,l} k^{1−1/p}/l;

then supp_{(1+ε)α}(β̄) ⊆ supp_α(β̂) ⊆ supp_{(1−ε)α}(β̄).

If either t ≤ 1 − π^(p)_{Â,k+l,l} k^{1−1/p}/l or t ≤ 1 − γ^(p)_{Â,k+l,l} k^{1−1/p}/l, then the result can be applied as long as α is sufficiently large. As we pointed out after Theorem 4.1, if the sparse eigenvalue condition holds at block size of order k^{2−2/p} for some p ≥ 2, then one can take l = O(k^{2−2/p}) so that the condition t ≤ 1 − γ^(p)_{Â,k+l,l} k^{1−1/p}/l is satisfied. This implies that we may take α = O(k^{2/p−2/p²} λ) = O(σ k^{2/p−2/p²} √(ln(d/δ)/n)), assuming that µ^(p)_{Â,k+l} is bounded from below (which holds when Â is p-norm diagonally dominant at size k + l, according to Proposition 3.1). That is, the sparse eigenvalue condition at a certain block size of order k^{2−2/p}, together with the boundedness of µ^(p)_{Â,k+l}, implies that one can distinguish coefficients of magnitude larger than a threshold of order σ k^{2/p−2/p²} √(ln(d/δ)/n) from zero. In particular, if p = ∞, we can distinguish non-zero coefficients of order σ√(ln(d/δ)/n) from zero. For simplicity, we state such a result under the mutual incoherence assumption.

Corollary 7.1 Let Assumption 4.1 hold, and let Â = (1/n) Σ_i x_i x_i^T, and assume that Â_{j,j} = 1 for all j. Define M_Â = sup_{i≠j} |Â_{i,j}|. Let β̂ be the solution of (1). Let β̄ ∈ R^d be the true target vector with E y = β̄^T x, and k = |supp_0(β̄)|. Assume that k M_Â ≤ 0.25 and 3k ≤ d.
Given δ ∈ (0, 1), with probability larger than 1 − δ: if α/32 ≥ λ ≥ 12σ√(2 ln(2d/δ)/n), then supp_{(1+ε)α}(β̄) ⊆ supp_α(β̂) ⊆ supp_{(1−ε)α}(β̄), where ε = 32λ/α.

One can also obtain a formal result on the asymptotic consistency of feature selection. An example is given below. In the description, we allow the problem to vary with the sample size n, and study the asymptotic behavior as n → ∞. Therefore, except for the input vectors x_i, all other quantities such as d, β̄, etc. will be denoted with a subscript n. The input vectors x_i ∈ R^{d_n} also vary with n; however, we drop the subscript n to simplify the notation. The statement of our result is in the same style as a corresponding asymptotic feature selection consistency theorem of [21] for the zero-thresholding scheme supp_0(β̂), which requires the stronger irrepresentable condition in addition to the sparse eigenvalue condition. In contrast, our result employs non-zero thresholding supp_{α_n}(β̂), with an appropriately chosen sequence of decreasing α_n; the result only requires the sparse eigenvalue condition (and for clarity, we only consider p = 2 instead of the general p discussed above) without the need for the irrepresentable condition.

Corollary 7.2 Consider regression problems indexed by the sample size n, and let the corresponding true target vector be β̄_n = [β̄_{n,1}, ..., β̄_{n,d_n}] ∈ R^{d_n}, where E y = β̄_n^T x. Let Assumption 4.1 hold, with σ independent of n. Assume that there exists a > 0, independent of n, such that (1/n) Σ_i x_{i,j}² ≤ a² for all j. Denote by β̂_n the solution β̂ of (1) with λ = 12σa √(2(ln(2d_n) + n^{s′})/n), where s′ ∈ (0, 1). Pick s ∈ (0, 1 − s′), and set α_n = n^{−s/2}. Then as n → ∞, P(supp_{α_n}(β̂_n) ≠ supp_0(β̄_n)) = O(exp(−n^{s′})) if the following conditions hold:

1. β̄_n only has k_n = o(n^{1−s−s′}) non-zero coefficients.
2. k_n ln(d_n) = o(n^{1−s}).
3. 1/min_{j ∈ supp_0(β̄_n)} |β̄_{n,j}| = o(n^{s/2}).
4. Let Â_n = (1/n) Σ_i x_i x_i^T ∈ R^{d_n × d_n}.
There exists a positive integer q_n such that (1 + 2q_n) k_n ≤ d_n, 1/µ^(2)_{Â_n,(1+2q_n)k_n} = O(1), and ρ^(2)_{Â_n,(1+2q_n)k_n} ≤ (1 + q_n) µ^(2)_{Â_n,(1+2q_n)k_n}.

The conditions of the corollary are all standard. Similar conditions have also appeared in [21]. The first condition simply requires β̄_n to be sufficiently sparse; if k_n is of the same order as n, then one cannot obtain meaningful consistency results. The second condition requires that d_n not be too large; in particular, it should be sub-exponential in n, for otherwise our analysis does not lead to consistency. The third condition requires the coefficients β̄_{n,j} to be sufficiently large when j ∈ supp_0(β̄_n). In particular, the condition implies that each feature coefficient β̄_{n,j} needs to be larger than the 2-norm noise level σ√(k_n ln(d_n)/n). If some coefficient β̄_{n,j} is too small, then we cannot distinguish it from the noise. Note that since the 2-norm parameter estimation bound is used here, we have a √k_n factor in the noise level. Under stronger conditions such as mutual incoherence, this √k_n factor can be removed (as shown in Corollary 7.1). Finally, the fourth condition is the sparse eigenvalue assumption; it can also be replaced by some other conditions (such as mutual incoherence). In comparison, [21] employed the zero-threshold scheme with α_n = 0; therefore, in addition to our assumptions, the irrepresentable condition is also required.
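To make the thresholding scheme concrete, the following sketch (with illustrative numbers of our own choosing; `supp` mirrors supp_α in the text) checks the sandwich relation supp_{(1+ε)α}(β̄) ⊆ supp_α(β̂) ⊆ supp_{(1−ε)α}(β̄), which holds whenever ‖β̂ − β̄‖_∞ ≤ εα:

```python
import numpy as np

def supp(beta, alpha=0.0):
    """Indices j with |beta_j| > alpha (supp_alpha in the text)."""
    return set(np.flatnonzero(np.abs(beta) > alpha))

# Hypothetical numbers: a true coefficient vector and a noisy estimate whose
# max-norm error (0.1) is below eps * alpha.
beta_bar = np.array([2.0, 0.9, 0.0, 0.0, -1.5])
beta_hat = np.array([1.9, 0.8, 0.1, -0.05, -1.45])

alpha, eps = 0.5, 0.3  # eps * alpha = 0.15 >= 0.1, so the sandwich holds
assert supp(beta_bar, (1 + eps) * alpha) <= supp(beta_hat, alpha) <= supp(beta_bar, (1 - eps) * alpha)
```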
8 Two-stage L1 Regularization with Selective Penalization

We shall refer to the feature components corresponding to the large coefficients as relevant features, and to the feature components smaller than an appropriately defined cut-off threshold α as irrelevant features. Theorem 7.1 implies that Lasso can be used to approximately identify the set of relevant features supp_α(β̄). This property can be used to improve the standard Lasso. In this context, we observe that, as an estimation method, L1 regularization has two important effects:

1. It shrinks estimated coefficients corresponding to irrelevant features toward zero.
2. It shrinks estimated coefficients corresponding to relevant features toward zero.

While the first effect is desirable, the second is not. In fact, we should avoid shrinking the coefficients corresponding to the relevant features if we can identify these features; in this respect, the standard L1 regularization may have sub-optimal performance. In order to improve on L1, we observe that under appropriate conditions such as those of Theorem 7.1, estimated coefficients corresponding to relevant features tend to be larger than estimated coefficients corresponding to irrelevant features. Therefore, after a first stage of L1 regularization, we can identify the relevant features by picking the components corresponding to the largest coefficients. Those coefficients are over-shrunk in the first stage. This problem can be fixed by applying a second stage of L1 regularization in which we do not penalize the features selected in the first stage. The procedure is described in Figure 1. Its overall effect is to un-shrink the coefficients of the relevant features identified in the first stage. In practice, instead of tuning α, we may also let supp_α(β̂) contain exactly q elements, and simply tune the integer-valued q.
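A minimal numerical sketch of this two-stage scheme (our own illustration, not the paper's implementation): a proximal-gradient (ISTA) Lasso solver that accepts per-coordinate penalty weights, with Stage 2 simply setting the weights of the Stage-1 selected features to zero. The solver, step size, toy design, and iteration count are illustrative choices.

```python
import numpy as np

def weighted_lasso(X, y, lam, weights, iters=500):
    """ISTA for (1/n)||X b - y||_2^2 + lam * sum_j weights_j * |b_j|."""
    n, d = X.shape
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant
    b = np.zeros(d)
    for _ in range(iters):
        z = b - step * (2.0 / n) * X.T @ (X @ b - y)      # gradient step
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam * weights, 0.0)  # prox
    return b

def two_stage(X, y, lam, alpha):
    """Two-stage selective penalization: leave supp_alpha of Stage 1 unpenalized."""
    d = X.shape[1]
    b1 = weighted_lasso(X, y, lam, np.ones(d))            # Stage 1: ordinary Lasso
    w2 = np.where(np.abs(b1) > alpha, 0.0, 1.0)           # zero penalty on selected set
    return weighted_lasso(X, y, lam, w2)                  # Stage 2

# Orthogonal toy design: X = 2*I, so both stages have closed-form solutions.
X = 2.0 * np.eye(4)
y = X @ np.array([2.0, 0.1, 0.0, 1.5])                    # noiseless, for illustration
b2 = two_stage(X, y, lam=1.0, alpha=0.5)                  # recovers [2, 0, 0, 1.5]
```

On this toy design Stage 1 alone returns the shrunk vector [1.5, 0, 0, 1.0], while Stage 2 restores the two large coefficients exactly, which is precisely the un-shrinking effect described above.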
The parameters can then be tuned by cross-validation in sequential order: first, find λ to optimize the Stage-1 prediction accuracy; second, find q to optimize the Stage-2 prediction accuracy. If cross-validation works well, then this tuning method ensures that the two-stage selective penalization procedure is never much worse than the one-stage procedure in practice, because the two are equivalent when q = 0. However, under the right conditions, we can prove a much better bound for the two-stage procedure, as shown in Theorem 8.1.

Tuning parameters: λ, α
Input: training data (x_1, y_1), ..., (x_n, y_n)
Output: parameter vector β̂
Stage 1: Compute β̂ using (1).
Stage 2: Solve the following selective penalization problem, where supp_α(β̂) is computed from the Stage-1 estimate:
β̂ = arg min_{β ∈ R^d} [ (1/n) Σ_i (β^T x_i − y_i)² + λ Σ_{j ∉ supp_α(β̂)} |β_j| ].

Figure 1: Two-stage L1 Regularization with Selective Penalization

A related method, called the relaxed Lasso, was proposed recently by Meinshausen [16]; it is similar to a two-stage Dantzig selector in [4] (also see [12] for a more detailed study). Their idea differs from our proposal in that, in the second stage, the parameter coefficients β_j are forced to be
zero when j ∉ supp_0(β̂). It was pointed out in [16] that if supp_0(β̂) exactly identifies all non-zero components of the target vector, then in the second stage the relaxed Lasso can asymptotically remove the bias of the first-stage Lasso. However, it is not clear what theoretical result can be stated when Lasso cannot exactly identify all relevant features. In the general case, it is not easy to ensure that the relaxed Lasso does not degrade performance when some relevant coefficients become zero in the first stage. In contrast, the two-stage selective penalization procedure in Figure 1 does not require that all relevant features be identified. Consequently, we are able to prove a result for Figure 1 with no counterpart for the relaxed Lasso. For clarity, the result is stated under conditions similar to those of the Dantzig selector in Theorem 6.1, with sparse targets and p = 2 only. Both restrictions can easily be removed with a more complicated version of Theorem 7.1 to deal with non-sparse targets (which can be obtained from Theorem 4.1), as well as the general form of Lemma 10.4 which allows p ≥ 2.

Theorem 8.1 Let Assumption 4.1 hold, and let Â = (1/n) Σ_i x_i x_i^T and a = (sup_j Â_{j,j})^{1/2}. Consider any target vector β̄ ∈ R^d such that E y = β̄^T x, where β̄ contains only s non-zeros. Let k = |supp_λ(β̄)|. Consider the two-stage selective penalization procedure in Figure 1. Given δ ∈ (0, 0.5), with probability larger than 1 − 2δ, the following holds for all s ≤ l ≤ (d − s)/2 and t ∈ (0, 1). Assume that

t ≤ 1 − π^(2)_{Â,s+l,l} k^{0.5}/l;

0.5α ≥ λ ≥ 4(2 − t) t^{−1} σa √(2 ln(2d/δ)/n);

either 16(α ω^(p)_{Â,s+l})^{−1} s^{1/p} λ ≤ t ≤ 1 − π^(p)_{Â,s+l,l} s^{1−1/p}/l, or 16(α µ^(p)_{Â,s+l})^{−1} (s + l)^{1/p} λ ≤ t ≤ 1 − γ^(p)_{Â,s+l,l} s^{1−1/p}/l.

Then

‖β̂ − β̄‖₂ ≤ (8 / (t µ^(2)_{Â,s+l})) [ 5 ρ^(2)_{Â,s+l} r^(2)_k(β̄) + √(k − q) λ + aσ(1 + √(20 ln(1/δ))) √(q/n) ] + 8 r^(2)_k(β̄),

where q = |supp_{1.5α}(β̄)|.

Again, we include a simplification of Theorem 8.1 under the mutual incoherence condition.

Corollary 8.1 Let Assumption 4.1 hold, and let Â = (1/n) Σ_i x_i x_i^T, and assume that Â_{j,j} = 1 for all j.
Define M_Â = sup_{i≠j} |Â_{i,j}|. Consider any target vector β̄ such that E y = β̄^T x, and assume that β̄ contains only s non-zeros, where 3s ≤ d and M_Â s ≤ 1/6. Let k = |supp_λ(β̄)|. Consider the two-stage selective penalization procedure in Figure 1. Given δ ∈ (0, 0.5), with probability larger than 1 − 2δ: if α/48 ≥ λ ≥ 12σ√(2 ln(2d/δ)/n), then

‖β̂ − β̄‖₂ ≤ 24 √(k − q) λ + 24σ (1 + √(20 ln(1/δ))) √(q/n) + 168 r^(2)_k(β̄),

where q = |supp_{1.5α}(β̄)|.
Theorem 8.1 can significantly improve the corresponding one-stage results (see Corollary 6.1 and Theorem 6.1) when r^(2)_k(β̄) ≪ √k λ and k − q ≪ k. The latter condition holds when supp_{1.5α}(β̄) ≈ supp_λ(β̄), i.e., when we can identify most features in supp_λ(β̄). These conditions are satisfied when most non-zero coefficients in supp_λ(β̄) are relatively large in magnitude and the remaining coefficients are small (in 2-norm). That is, the two-stage procedure is superior when the target β̄ can be decomposed as a sparse vector with large coefficients plus another (less sparse) vector with small coefficients. In the extreme case when r^(2)_k(β̄) = 0 and q = k, we obtain ‖β̂ − β̄‖₂ = O(√(k ln(1/δ)/n)) instead of the bound ‖β̂ − β̄‖₂ = O(√(k ln(d/δ)/n)) for the one-stage Lasso. The difference can be significant when d is large.

Finally, we point out that the two-stage selective penalization procedure may be regarded as a two-step approximation to solving the least squares problem with a non-convex regularization:

β̂ = arg min_{β ∈ R^d} [ (1/n) Σ_i (β^T x_i − y_i)² + λ Σ_{j=1}^d min(α, |β_j|) ].

However, for high-dimensional problems, it is not clear whether one can effectively find a good solution under such a non-convex regularization. When d is sufficiently large, one can often find a vector β all of whose non-zero coefficients exceed α in magnitude and which perfectly fits (thus overfits) the data. Such a β is clearly a local minimum of this non-convex objective, since the regularization has no effect locally around such a vector. Therefore, the two-stage L1 approximation procedure in Figure 1 not only preserves the desirable properties of convex programming, but also prevents such a local minimum from contaminating the final solution.

9 Experiments

Although our investigation is mainly theoretical, it is useful to verify whether the two-stage procedure can improve the standard Lasso in practice. In the following, we show with synthetic data and real data that the two-stage procedure can be helpful.
Although more comprehensive experiments are still required, these simple experiments show that the two-stage method is useful at least on datasets with the right properties; this is consistent with our theory. Note that instead of tuning the α parameter in Figure 1, in the following experiments we tune the parameter q = |supp_α(β̂)|, which is more convenient. The standard Lasso corresponds to q = 0.

9.1 Simulation Data

In this experiment, we generate an n × d random matrix with column j corresponding to [x_{1,j}, ..., x_{n,j}], where each entry of the matrix is an independent standard Gaussian N(0, 1). We then normalize the columns so that Σ_i x_{i,j}² = n. A truly sparse target β̄ is generated with k non-zero elements drawn uniformly from [−10, 10]. The observations are y_i = β̄^T x_i + ε_i, where each ε_i ∼ N(0, σ²). In this experiment, we take n = 25, d = 100, k = 5, σ = 1, and repeat the experiment 100 times. The average training error and the parameter estimation error in 2-norm are reported in Figure 2. We compare the performance of the two-stage method for different q versus the regularization parameter λ. Clearly, the training error becomes smaller as q increases. The smallest estimation error for this example is achieved at q = 3. This shows that the two-stage
procedure with appropriately chosen q performs better than the standard Lasso (which corresponds to q = 0).

Figure 2: Performance of the algorithms on simulation data. Left: average training squared error versus λ; Right: parameter estimation error versus λ.

9.2 Real Data

We use real data to illustrate the effectiveness of two-stage L1 regularization. For simplicity, we only report the performance on a single dataset, Boston Housing. This is the housing data for 506 census tracts of Boston from the 1970 census, available from the UCI Machine Learning Database Repository. Each census tract is a data point with 13 features (we add a constant feature, identically one, as the 14th), and the desired output is the housing price. In the experiment, we randomly partition the data into 20 training plus 486 test points. We perform the experiment 100 times, and report the training and test squared error versus the regularization parameter λ for different q. The results are plotted in Figure 3. In this case, q = 1 achieves the best performance. Note that this dataset contains only a small number (d = 14) of features, which is not the case we are most interested in (most other UCI datasets similarly contain only a small number of features). In order to illustrate the advantage of the two-stage method more clearly, we also consider a modified Boston Housing data, where we append 20 random features (similar to the simulation experiments) to the original Boston Housing data, and rerun the experiments. The results are shown in Figure 4. As can be expected, the effect of using q > 0 becomes far more apparent. This again verifies that the two-stage method can be superior to the standard Lasso (q = 0) on some data.
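For reference, the simulation design of Section 9.1 can be reproduced with a few lines (a sketch; the random seed and the choice of numpy generator are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, sigma = 25, 100, 5, 1.0

# Design matrix with i.i.d. N(0,1) entries, columns normalized so sum_i x_{ij}^2 = n.
X = rng.standard_normal((n, d))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)

# Sparse target: k non-zero coefficients drawn uniformly from [-10, 10].
beta_bar = np.zeros(d)
beta_bar[rng.choice(d, size=k, replace=False)] = rng.uniform(-10, 10, size=k)

# Noisy observations y_i = beta_bar^T x_i + eps_i, eps_i ~ N(0, sigma^2).
y = X @ beta_bar + sigma * rng.standard_normal(n)
```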
Figure 3: Performance of the algorithms on the original Boston Housing data. Left: average training squared error versus λ; Right: test squared error versus λ.

10 Proofs

In the proofs, we use the following convention: for a subset I of {1, ..., d} and a vector β ∈ R^d, β_I denotes either the restriction of β to the indices in I, which lies in R^{|I|}, or its embedding into the original space R^d with the components not in I set to zero.

10.1 Proof of Proposition 3.1

Given any v ∈ R^k and u ∈ R^l, without loss of generality we may assume that ‖v‖₂ = 1 and ‖u‖₂ = 1 in the following derivation. Take indices I and J as in the definition, and let I′ = I ∪ J. Given any α ∈ R, let u′^T = [v^T, αu^T] ∈ R^{l+k}. By definition, we have

µ^(2)_{A,l+k} ‖u′‖₂² ≤ u′^T A_{I′,I′} u′ = v^T A_{I,I} v + 2α v^T A_{I,J} u + α² u^T A_{J,J} u.

Let b = v^T A_{I,J} u, c₁ = v^T A_{I,I} v, and c₂ = u^T A_{J,J} u. The above inequality can be written as

µ^(2)_{A,l+k} (1 + α²) ≤ c₁ + 2αb + α² c₂.

By optimizing over α, we obtain (c₁ − µ^(2)_{A,l+k})(c₂ − µ^(2)_{A,l+k}) ≥ b². Therefore b² ≤ (ρ^(2)_{A,l} − µ^(2)_{A,l+k})(ρ^(2)_{A,k} − µ^(2)_{A,l+k}), which implies that (with ‖v‖₂ = ‖u‖₂ = 1)

|v^T A_{I,J} u| / (‖v‖₂ ‖u‖_∞) ≤ |v^T A_{I,J} u| / (‖v‖₂ ‖u‖₂/√l) = √l |b| ≤ √(l (ρ^(2)_{A,l} − µ^(2)_{A,l+k})(ρ^(2)_{A,k} − µ^(2)_{A,l+k})).

Since v and u are arbitrary, this implies the first inequality.
Figure 4: Performance of the algorithms on the modified Boston Housing data. Left: average training squared error versus λ; Right: test squared error versus λ.

The second inequality follows from ‖v‖_p ≤ k^{max(0,1/p−0.5)} ‖v‖₂ for all v ∈ R^k, so that

‖A_{I,J} u‖_p / ‖u‖_∞ ≤ k^{max(0,1/p−0.5)} ‖A_{I,J} u‖₂ / ‖u‖_∞.

From (c₁ − µ^(2)_{A,l+k})(c₂ − µ^(2)_{A,l+k}) ≥ b², we also obtain

4b²/c₁² ≤ 4 c₁^{−1} (1 − µ^(2)_{A,l+k}/c₁)(c₂ − µ^(2)_{A,l+k}) ≤ (c₂ − µ^(2)_{A,l+k})/µ^(2)_{A,l+k} ≤ ρ^(2)_{A,l}/µ^(2)_{A,l+k} − 1.

Note that in the above derivation, we have used 4 µ^(2)_{A,l+k} c₁^{−1} (1 − µ^(2)_{A,l+k}/c₁) ≤ 1. Therefore, with ‖v‖₂ = ‖u‖₂ = 1:

|v^T A_{I,J} u| / (v^T A_{I,I} v ‖u‖_∞) ≤ |v^T A_{I,J} u| / (v^T A_{I,I} v ‖u‖₂/√l) = √l |b|/c₁ ≤ 0.5 l^{1/2} √(ρ^(2)_{A,l}/µ^(2)_{A,l+k} − 1).

Because v and u are arbitrary, we obtain the third inequality.

The fourth inequality follows from max(0, (v^{p−1})^T A_{I,I} v) ≥ ω^(p)_{A,k} ‖v‖_p^p, so that

|(v^{p−1})^T A_{I,J} u| / (max(0, (v^{p−1})^T A_{I,I} v) ‖u‖_∞) ≤ ‖v‖_p^{p−1} ‖A_{I,J} u‖_p / (ω^(p)_{A,k} ‖v‖_p^p ‖u‖_∞) = (ω^(p)_{A,k})^{−1} ‖A_{I,J} u‖_p / (‖v‖_p ‖u‖_∞).

In the above derivation, the second inequality follows from ‖(v/‖v‖_p)^{p−1}‖_{p/(p−1)} = 1 and Hölder's inequality.

The fifth inequality follows from ‖v‖_p ≤ k^{max(0,1/p−0.5)} ‖v‖₂ for all v ∈ R^k, so that

‖A_{I,I}^{−1} A_{I,J} u‖_p / ‖u‖_∞ ≤ k^{max(0,1/p−0.5)} ‖A_{I,I}^{−1} A_{I,J} u‖₂ / ‖u‖_∞.
20 The sixth inequality follows from A 1 I,I v p v p /µ (p) A,k for all v Rk, so that The last inequality is due to A 1 I,I A I,Ju p 1 A I,J u p u µ (p). u A,k A I,I v p v p = (v/ v p) p 1 p/(p 1) A I,I v p v p (vp 1 ) T A I,I v v p p = (vp 1 ) T diag(a I,I )v v p p + (vp 1 ) T (A I,I diag(a I,I ))v v p p min A i,i (v/ v p ) p 1 (A I,I diag(a I,I ))v p p/(p 1) i v p min i A i,i A I,I diag(a I,I ) p. In the above derivation, Hölder s inequality is used to obtain the first two inequalities. The first equality and the last inequality use the fact that (v/ v p ) p 1 p/(p 1) = Proof of Proposition 3.2 Let B R d d be the off-diagonal part of A; that is, A B is the identity matrix. sup i,j B i,j M A. Given any v R k, we have We have B I,I v p M A k 1/p v 1 M A k v p. This implies that B I,I p M A k. Therefore we have This proves the first claim. Moreover, A I,I v p v p (1 + Mk). (1 M A k) 1 B I,I p = min i A i,i A I,I diag(a I,I ) p. We thus obtain the second claim from Proposition 3.1. Now given u R l, since I J =, we have A I,J u p = B I,J u p M A k 1/p u 1 M A k 1/p l u. This implies the third claim. The last two claims follow from Proposition Some Auxiliary Results Lemma 10.1 Consider k, l > 0, and p 1, ]. Given any β, v R d. Let β = β F + β G, where supp 0 (β F ) supp 0 (β G ) = and supp 0 (β F ) = k. Let J be the indices of the l largest components of β G (in absolute values), and I = supp 0 (β F ) J. If supp 0 (v) I, then v T Aβ v T I A I,I β I A I,I v I p/(p 1) γ (p) A,k+l,l β G 1 l 1, max(0, ((v/ v p ) p 1 ) T Aβ) ω (p) A,k+l ( v I p π (p) A,k+l,l l 1 β G 1 ) ρ (p) A,k+l β I v I p. 20
21 Proof In order to prove the first inequality, we may assume without loss of generality that β = β 1,..., β d ], where supp 0 (β F ) = {1, 2,..., k}, and when j > k, β j is arranged in descending order of β j : β k+1 β k+2 β d. Let J 0 = {1,..., k}, and J s = {k + (s 1)l + 1,..., k + sl} (s = 1, 2,...), except the largest index in the last block stops at d. Note that in this definition we require that J 1 = J and I = J 0 J 1. We have β Js β Js 1 1 l 1 when s > 1, which implies that s>1 β J s β G 1 l 1. This gives v T Aβ =v T I A I,I β I + s>1 v T I A I,Js β Js v T I A I,I β I A I,I v I p/(p 1) s>1 A 1 I,I A I,J s β Js p vi T A I,I β I γ (p) A, I,l A I,Iv I p/(p 1) β Js s>1 v T I A I,I β I γ (p) A, I,l A I,Iv I p/(p 1) β G 1 l 1. The first inequality in the above derivation is due to the Hölder s inequality. This proves the first inequality of the lemma. The proof of the second inequality is similar, but with a slightly different estimate. We can assume that the right hand side is positive (the inequality is trivial otherwise). It implies that (v p 1 I ) T A I,I v I > 0 since otherwise ω (p) A, I = 0: (v p 1 ) T Aβ =(v (p 1) I ) T A I,I (β I v I ) + (v (p 1) I ) T A I,I v I + (v p 1 I ) T A I,Js β Js s>1 (v (p 1) I ρ (p) ρ (p) ) T A I,I (β I v I ) + (v (p 1) I ) T A I,I v I A, I v(p 1) I A, I v(p 1) I 1 π (p) A, I,l p/(p 1) β I v I p + (v (p 1) I ) T A I,I v I p/(p 1) β I v I p + ω (p) A, I v I p p ] β Js / v I p s>1 ] 1 π (p) A, I,l l 1 β G 1 / v I p ] 1 π (p) A, I,l l 1 β G 1 / v I p. The second inequality in the above derivation is due to the Hölder s inequality. The last inequality assumes that the right hand side is non-negative. Observe that v (p 1) I p/(p 1) = v I p 1 p, we obtain the second inequality of the lemma. Lemma 10.2 Consider the decomposition of any target vector β = β F + β G such that {1, 2,..., d} = F G and F G =. 
Consider the solution β̂ of the following more general problem instead of (1):

β̂ = arg min_{β ∈ R^d} [ (1/n) Σ_i (β^T x_i − y_i)² + λ Σ_{j ∉ F̂} |β_j| ], (5)

where F̂ ⊆ F. Let Δβ̂ = β̂ − β̄, Â = (1/n) Σ_i x_i x_i^T, and ε̂ = (1/n) Σ_i (β̄^T x_i − y_i) x_i. If we pick λ in (5) sufficiently large that λ > 2‖ε̂‖_∞, then

‖Δβ̂_G‖₁ ≤ ((2‖ε̂‖_∞ + λ)/(λ − 2‖ε̂‖_∞)) (‖Δβ̂_F‖₁ + ‖β̄_G‖₁).
22 Proof We define the derivative of β 1 as sgn(β), where for β = β 1,..., β d ] R d, sgn(β) = sgn(β 1 ),..., sgn(β d )] R d is defined as sgn(β j ) = 1 when β j > 0, sgn(β j ) = 1 when β j < 0, and sgn(β j ) 1, 1] when β j = 0. We start with the following first order condition: 2 n ( ˆβ T x i y i )x i + λg( ˆβ) = 0, where g( ˆβ) = g( ˆβ 1 ),..., g( ˆβ d )], with g( ˆβ j ) = 0 when j ˆF and g( ˆβ j ) = sgn( ˆβ j ) otherwise. This implies that 2Â ˆβ + λg( ˆβ) = 2 ( n β T x i y i )x i. Therefore, for all v R d, we have 2v T Â ˆβ 2v T ˆɛ λv T g( ˆβ). (6) Now, let v = ˆβ in (6), and use the fact that ˆβ T G g( ˆβ G ) = ˆβ T G sgn( ˆβ G ) = ˆβ G 1 as well as g( ˆβ) 1, we obtain 0 2 ˆβ T Â ˆβ 2 ˆβ T ˆɛ λ ˆβ T g( ˆβ) 2 ˆβ 1 ˆɛ λ ˆβ F T g( ˆβ) λ ˆβT G g( ˆβ) + λ β T Gg( ˆβ) 2( ˆβ F 1 + ˆβ G 1 + β G 1 ) ˆɛ + λ ˆβ F 1 λ ˆβ G 1 + λ β G 1 =(2 ˆɛ λ) ˆβ G 1 + (2 ˆɛ + λ)( ˆβ F 1 + β G 1 ). By rearranging the above inequality, we obtain the desired bound. Lemma 10.3 Let the conditions of Lemma 10.2 hold. Let J be the indices of the largest l coefficients (in absolute value) of ˆβ G, and I = F J. If λ 4(2 t)t 1 ˆɛ for some t (0, 1), then p 1, ], ˆβ 1 4k 1 1/p ˆβ I p + 4 β G 1, ˆβ p (1 + 3(k/l) 1 1/p ) ˆβ I p + 4 β G 1 l 1/p 1. Proof The condition on λ implies that (λ + 2 ˆɛ )/(λ 2 ˆɛ ) (4 t)/(4 3t) 3. We have from Lemma 10.2 ˆβ G 1 β G 1 + ˆβ G 1 3 ˆβ F β G 1. Therefore ˆβ ˆβ I ˆβ G 1 /l (3 ˆβ F β G 1 )/l, which implies that ˆβ ˆβ I p ( ˆβ G 1 ˆβ ˆβ I p 1 ) 1/p (3 ˆβ F β G 1 )l 1/p 1. Now, the first inequality in the proof also implies that ˆβ 1 4 ˆβ F β G 1. By combining the previous two inequalities with ˆβ F 1 k 1 1/p ˆβ I p, we obtain the desired bounds. 22
23 Lemma 10.4 Let the conditions of Lemma 10.2 hold. Let J be the indices of the largest l coefficients (in absolute value) of ˆβ G, and I = F J. Assume that  I,I is invertible. If t = 1 π (p),l k1 1/p l 1 > 0, and λ 4(2 t)t 1 ˆɛ, then ˆβ I p 2 tω (p) ρ (p) β ] 8π (p) β G p + (k F 0 ) 1/p,l G 1 /l λ + ˆɛ F0 p + + t β G p, where F 0 is any subset of ˆF. Proof The condition of λ implies that (λ + 2 ˆɛ )/(λ 2 ˆɛ ) (4 t)/(4 3t). Therefore if we let β = β I = ˆβ I + β J, then max(0, (( β I / β I p ) p 1 ) T  ˆβ) + ρ (p) β J p ω (p) ( β I p π (p),l l 1 ˆβ G 1 ) ω (p) β I p π (p),l l 1 ((4 t)(4 3t) 1 ( ˆβ F 1 + β G 1 ) + β G 1 )] ω (p) ( β I p (1 t)(4 t)(4 3t) 1 β I p 4π (p),l l 1 β G 1 ) 0.5tω (p) β I p 4ω (p) π(p),l l 1 β G 1. In the above derivation, the first inequality is due to Lemma 10.1, and the second inequality is due to Lemma The third inequality uses ˆβ F 1 = β F 1 k 1 1/p β I p and (4 t)/(4 3t) 3. The last inequality follows from 1 (1 t)(4 t)(4 3t) 1 0.5t. β p 1 I If ( ) T  ˆβ 0, then the above inequality, together with ˆβ I p β I p + β J p β I p + β G p, already implies the lemma. Therefore in the following, we can assume that (( β I / β I p ) p 1 ) T  ˆβ 0.5tω (p) β I p 4ω (p) π(p),l l 1 β G 1 ρ (p) β J p. Moreover, we obtain from (6) with v = ( ( p 1 β I ) T  ˆβ ( p 1 ˆβ J ) T ˆɛ λ( ( ˆɛ λ/2) p 1 λ ˆβ F F ( (k F 0 ) 1/p λ p 1 β I ) T ˆɛ λ( p 1 ˆβ J ) T g( ˆβ)/2 + ( β p 1 I the following: p 1 β I ) T g( ˆβ)/2 ˆβ p 1 p 1 ˆβ J 1 + ( ˆɛ + λ/2) p 1 ˆβ F 0 ) T ˆɛ ˆβ p 1 F F 0 p/(p 1) + ((k F 0 ) 1/p λ + ˆɛ F0 p ) β I p 1 p. F F 0 ) T ˆɛ λ( ˆβ p 1 ˆβ p 1 F F ( p 1 ˆβ F 0 p/(p 1) ˆɛ F0 p F F 0 ) T g( ˆβ)/2 + ( p 1 ˆβ F 0 ) T ˆɛ p 1 ˆβ F 0 ) T ˆɛ In the above derivation, the second inequality uses g( ˆβ F0 ) = 0; the third inequality uses that fact that g( ˆβ) p 1 1 and ( ˆβ J ) T g( ˆβ) p 1 = ˆβ J 1 ; the fourth inequality uses ˆɛ 0.5λ; and the last inequality uses the fact that β p 1 p/(p 1) = β p p 1. 
Now by combining the above two estimates, together with ˆβ I p β I p + β J p β I p + β G p, we obtain the desired bound. 23
24 Lemma 10.5 Let the conditions of Lemma 10.2 hold. Let J be the indices of the largest l coefficients (in absolute value) of ˆβ G, and I = F J. Assume that ÂI,I is invertible. Let p 1, ]. If t = 1 γ (p),l k1 1/p l 1 > 0, and λ 4(2 t)t 1 ˆɛ, then ˆβ I p 2 t 4γ (p),l l 1 β ] G 1 + λ(k + l) 1/p /µ (p). Proof Consider v R d such that supp 0 (v) I. We have from Lemma 10.1 and (6): v T Â ˆβ I ÂI,Iv I p/(p 1) γ (p),l ˆβ G 1 l 1 v T Â ˆβ v T (ˆɛ + 0.5λg( ˆβ)). Take v such that ÂI,Iv I p/(p 1) = 1 and v T Â ˆβ I = ˆβ I p. We obtain ˆβ I p γ (p),l ( ˆβ G 1 + β G 1 )l 1 Â 1 I,I (ˆɛ I + 0.5λg( ˆβ I )) p. By using Lemma 10.2, (λ+2 ˆɛ )/(λ 2 ˆɛ ) (4 t)/(4 3t) 3, and 1 (1 t)(4 t)/(4 3t) 0.5t, we obtain ˆβ I p γ (p),l l 1 ˆβ G 1 ˆβ I p γ (p),l l 1 (4 t)(4 3t) 1 ( ˆβ F 1 + β G 1 ) ˆβ I p γ (p),l l 1 (4 t)(4 3t) 1 (k 1 1/p ˆβ F p + β G 1 ) ˆβ I p (1 t)(4 t)(4 3t) 1 ˆβ I p 3γ (p),l l 1 β G 1 0.5t ˆβ I p 3γ (p),l l 1 β G 1. Combine the previous two inequalities, we obtain 0.5t ˆβ I p 4γ (p),l β G 1 l 1 + Â 1 I,I (ˆɛ I + 0.5λg( ˆβ I )) p 4γ (p),l β G 1 l 1 + (k + l) 1/p ˆɛ I + 0.5λg( ˆβ I ) /µ (p). Since ˆɛ I + 0.5λg( ˆβ I ) λ, we obtain the desired bound. Proposition 10.1 Consider n independent random variables ξ 1,..., ξ n such that Ee t(ξ i Eξ i ) e σ2 i t2 /2 for all t and i, then ɛ > 0: ( ) P ξ i Eξ i nɛ 2e n2 ɛ 2 /(2 P n σ2 i ). Proof Let s n = (ξ i Eξ i ); then by assumption, E(e tsn + e tsn ) 2e P i σ2 i t2 /2, which implies that P ( s n nɛ)e tnɛ 2e P i σ2 i t2 /2. Now let t = nɛ/ i σ2 i, we obtain the desired bound. 24
25 Proposition 10.2 Consider n independent random variables ξ 1,..., ξ n such that Eξ i = 0 and Ee tξ i e σ2 t 2 /2 for all t and i. Let z 1,..., z n R d be n fixed vectors, and let a n = ( z i 2 2 )1/2. Then, ɛ > 0: ( ) P ξ i z i a n (σ + ɛ) e 2 ɛ2 /(20σ 2). Proof For each i, let ξ i be an identically distributed and independent copy of ξ i, and h( ) is any real-valued function such that h(ξ i ) h(ξ i ) ξ i + ξ i. Then E ξi e t(h(ξ i) E ξ h(ξ i )) t k i =1 + k! E ξ i (h(ξ i ) E ξ i h(ξ i)) k k=2 t k 1 + k! E ξ i ( ξ i + E ξ i ξ i ) k (2t) k 1 + E ξi ξ i k k! k=2 k=2 ] 1 =1 + (2k)! E ξ i 2tξ i 2k 1 + (2k + 1)! E ξ i 2tξ i 2k+1 k= (2k)! E 2tξ i 2k ] (2k)! E 2tξ i 2k 1 + (2k + 2)! E 2tξ i 2k+2 k= k=1 1 (2k)! E 2tξ i 2k = (Ee 2tξ i + Ee 2tξ i 2) (2e 2t2 σ 2 2) e 5t2 σ 2. The second inequality is due to Jensen s inequality. In the third inequality, we have used a 2k+1 /(2k+ 1)! 0.5 a 2k /(2k)!+ a 2k+2 /(2k+2)!. The last inequality can be obtained by comparing the Taylor expansion of the function e x on both sides. Now let s j = E ξj+1,...,ξ n ξ iz i 2. If we regard h(ξ j ) = s j / z j 2 as a function of ξ j (with variables ξ 1,..., ξ j 1 fixed), then s j s j 1 = (h(ξ j ) E ξ j h(ξ j )) z j 2 and h(ξ j ) h(ξ j ) ξ j + ξ j. Therefore from the above inequality, we have E ξj e t(s j s j 1 ) e 5 z j 2 2 t2 σ 2, and E ξ1,...,ξ j e ts j = E ξ1,...,ξ j 1 e ts j 1 E ξj e t(s j s j 1 ) e 5 z j 2 2 σ2 t 2 E ξ1,...,ξ j 1 e ts j 1. By induction, we obtain E ξ1,...,ξ n e tsn e 5σ2 t 2 a 2 n e ts 0, which implies that P (s n s 0 +a n ɛ)e t(s 0+a nɛ) e 5a2 nσ 2 t 2 e ts 0. Let t = ɛ/(10a n σ 2 ), we have P (s n s 0 + a n ɛ) e ɛ2 /(20σ 2). Note that Eξi 2 2 = lim t 0 t 2 (E ξ i e tξ i 2(e σ2t2/2 1) 1) lim t 0 t 2 = σ 2. Therefore, s 0 = E ξ iz i 2 ( Eξ2 i z i 2 2 )1/2 a n σ. This leads to the desired bound Proof of Theorem 4.1 Let F be the indices corresponding to the largest k coefficients of β in absolute value. 
We only need to estimate ‖ε̂‖_∞, and then apply Lemma 10.4, Lemma 10.5, and Lemma 10.3, with F₀ = ∅.
26 By Proposition 10.1, we have ] P sup x 2 i,j na 2 1 and sup (y i Ey i )x i,j ɛ j j n ] 1 d sup P (y i Ey i )x i,j ɛ sup x 2 i,j j n na2 j 2d sup e n2 ɛ 2 /(2σ 2 P n x2 i,j ) (subject to sup x 2 i,j na2 ) j j 2de nɛ2 /(2σ 2 a 2). Therefore with probability larger than 1 δ, if sup n j x2 i,j na2, then 1 2 ln(2d/δ) sup (y i Ey i )x i,j σa. j n n The latter implies that 1 ˆɛ = sup ( j n β 2 ln(2d/δ) T 1 x i y i )x i,j σa + n n ( β T x i E y i )x i. With this bound, the condition of Lemma 10.3 is satisfied. Using k/l 1, we obtain for q = 1, p: ˆβ q 4k 1/q 1/p ˆβ I p + 4 β G 1 l 1/q 1. This estimate together with Lemma 10.4 (let F 0 = ) leads to the first claim of the theorem; with Lemma 10.5, it gives the second claim Proof of Corollary 4.1 The condition MÂ(k + l) (1 t)/(2 t) is equivalent to t 1 MÂ(k + l)/(1 MÂ(k + l)). It implies t 1 MÂ(k + l) 1/p k 1 1/p /(1 MÂ(k + l)). Now using Proposition 3.2, it implies that the condition t 1 π (p),l k1 1/p /l in the first claim of Theorem 4.1 is satisfied. Therefore we have (with q = p) ˆβ p 8 tω (p) ρ (p) r(p) k ( β) ] + k 1/p λ ] + 4r (p) k ( β) 32 + t π(p),l l 1 + 4l 1/p 1 Now, using Proposition 3.2 again, the inequality can be simplified to ˆβ 8 p (1 + MÂ(k + l))r (p) t(1 MÂ(k + l)) k ( β) ] + k 1/p λ + 32M Â (k + l)1/p t(1 MÂ(k + l)) r(1) k ( β) + 4r (p) k ( β) + 4r (1) k ( β)l 1/p 1. r (1) k ( β). Now by using the condition MÂ(k + l)) (1 t)/(2 t) 0.5 to eliminate MÂ, and simplify the result, we obtain the desired bound. 26
10.6 Proof of Proposition 5.1

We construct a sequence β^{(k)} with a greedy algorithm as follows. Let β^{(0)} = β̄, and for k = 1, 2, ..., we perform the following steps:

- j^{(k)} = arg max_j |Σ_i (β^{(k−1)T} x_i − Ey_i) x_{i,j}|.
- α^{(k)} = −Σ_i (β^{(k−1)T} x_i − Ey_i) x_{i,j^{(k)}} / (na²).
- β^{(k)} = β^{(k−1)} + α^{(k)} e_{j^{(k)}}, where e_j ∈ R^d is the vector of zeros, except for the j-th component being one.

The following derivation holds for the above procedure:

Σ_i (β^{(k)T} x_i − Ey_i)²
= Σ_i (β^{(k−1)T} x_i − Ey_i)² + α^{(k)2} Σ_i x_{i,j^{(k)}}² + 2 α^{(k)} Σ_i (β^{(k−1)T} x_i − Ey_i) x_{i,j^{(k)}}
≤ Σ_i (β^{(k−1)T} x_i − Ey_i)² − (Σ_i (β^{(k−1)T} x_i − Ey_i) x_{i,j^{(k)}})² / (na²)
= Σ_i (β^{(k−1)T} x_i − Ey_i)² − sup_j (Σ_i (β^{(k−1)T} x_i − Ey_i) x_{i,j})² / (na²).

In the above derivation, the first equality is simple algebra; the inequality uses the definitions of a and α^{(k)}; and the last equality uses the definition of j^{(k)}. Since Σ_i (β^{(k+1)T} x_i − Ey_i)² ≥ 0, we obtain

Σ_{k'=0}^{k} sup_j (Σ_i (β^{(k')T} x_i − Ey_i) x_{i,j})² ≤ na² Σ_i (β^{(0)T} x_i − Ey_i)².

Therefore, there exists k' ≤ k such that the displayed equation of the proposition holds with β̃^{(k)} = β^{(k')}. Moreover,

μ^{(2)}_{Â,k} ‖β̃^{(k)} − β̄‖_2 ≤ ‖Â^{1/2}(β̃^{(k)} − β̄)‖_2 ≤ ‖Â^{1/2}(β̃^{(k)} − Ey)‖_2 + ‖Â^{1/2}(β̄ − Ey)‖_2 ≤ 2 ‖Â^{1/2}(β̄ − Ey)‖_2.

This finishes the proof.

10.7 Proof of Corollary 5.1

The proof is just a straightforward application of Corollary 4.1, in which we replace β̄ by β̃^{(k)} of Proposition 5.1, and then replace k by 2k. This leads to the bound:

‖β̂ − β̄‖_2 ≤ ‖β̂ − β̃^{(k)}‖_2 + ‖β̄ − β̃^{(k)}‖_2
≤ (8(2 − t)/t) [ 1.5 r^{(2)}_{2k}(β̃^{(k)}) + (2k)^{1/2} λ ] + 4 r^{(2)}_{2k}(β̃^{(k)}) + (4(8 − 7t)/t) r^{(1)}_{2k}(β̃^{(k)}) l^{−1/2} + 2ε/μ^{(2)}_{Â,k}.

Note that from Proposition 3.2, we have 1/μ^{(2)}_{Â,k} ≤ (1 − k M_Â)^{−1/2} < 2. Moreover, since |supp_0(β̃^{(k)} − β̄)| ≤ k, we have r^{(p)}_{2k}(β̃^{(k)}) ≤ r^{(p)}_k(β̄). This leads to the desired bound.
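The greedy construction in the proof of Proposition 5.1 can be sketched as code. This is an illustration only: the proof applies the steps to the unobservable targets Ey_i, so the `target` argument below stands in for them, and the function name and loop structure are my own.

```python
def greedy_path(X, target, beta0, k_max, a):
    """Greedy construction of beta^(k): at each step pick the coordinate j
    with the largest residual correlation and update
    beta_j -= corr_j / (n * a^2), where a bounds sup_{i,j} |x_{i,j}|."""
    n, d = len(X), len(X[0])
    beta = list(beta0)
    path = [list(beta)]
    for _ in range(k_max):
        resid = [sum(b * xij for b, xij in zip(beta, X[i])) - target[i]
                 for i in range(n)]
        corr = [sum(resid[i] * X[i][j] for i in range(n)) for j in range(d)]
        j = max(range(d), key=lambda jj: abs(corr[jj]))
        beta[j] -= corr[j] / (n * a * a)
        path.append(list(beta))
    return path
```

Each step decreases the squared residual by at least corr_j²/(na²), which is exactly the summable quantity the proof exploits.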
10.8 Proof of Corollary 6.1

Note that Proposition 3.1 implies that π^{(2)}_{Â,l} ≤ θ^{(2)}_{Â,l}/ω^{(2)} ≤ ρ^{(2)} √l / ω^{(2)}. Also note that ω^{(2)}_{A,k+l} = μ^{(2)}_{A,k+l}. The first statement of Theorem 4.1 (with q = p = 2) implies the desired bound.

10.9 Proof of Theorem 7.1

Under the conditions of this theorem, we obtain from Theorem 4.1 that with probability 1 − δ, the following two claims hold (with q = p):

- If t ≤ 1 − π^{(p)}_{Â,l} k^{1 − 1/p}/l and λ ≥ 4(2 − t) t^{−1} σa √(2 ln(2d/δ)/n), then ‖β̂ − β̄‖_p ≤ 8 (t ω^{(p)})^{−1} k^{1/p} λ.
- If t ≤ 1 − γ^{(p)}_{Â,l} k^{1 − 1/p}/l and λ ≥ 4(2 − t) t^{−1} σa √(2 ln(2d/δ)/n), then ‖β̂ − β̄‖_p ≤ 8 (t μ^{(p)})^{−1} (k + l)^{1/p} λ.

That is, if there exist l ≥ k, t ∈ (0, 1), and p ∈ [1, ∞] so that λ ≥ 4(2 − t) t^{−1} σa √(2 ln(2d/δ)/n), and either

- t ≤ 1 − π^{(p)}_{Â,l} k^{1 − 1/p}/l and α ≥ 8 ε^{−1} (t ω^{(p)})^{−1} k^{1/p} λ; or
- t ≤ 1 − γ^{(p)}_{Â,l} k^{1 − 1/p}/l and α ≥ 8 ε^{−1} (t μ^{(p)})^{−1} (k + l)^{1/p} λ;

then ‖β̂ − β̄‖_p ≤ εα. Note that the condition ‖β̂ − β̄‖_∞ ≤ εα implies that supp_{(1+ε)α}(β̄) ⊆ supp_α(β̂) ⊆ supp_{(1−ε)α}(β̄). This proves the desired result.

10.10 Proof of Corollary 7.1

We take p = ∞, l = k, and t = 0.5 in Theorem 7.1. Proposition 3.2 implies that ω^{(∞)} ≥ 1 − M_Â(k + l) ≥ 0.5 and π^{(∞)}_{Â,l} ≤ M_Â l/(1 − M_Â(k + l)) ≤ 0.5 under the conditions of the corollary. Therefore, the conditions α ≥ 8 ε^{−1} (t ω^{(∞)})^{−1} λ and t ≤ 1 − π^{(∞)}_{Â,l} k/l in Theorem 7.1 hold.

10.11 Proof of Corollary 7.2

We want to apply Theorem 7.1 with ε = 0.5, Â = Â_n, δ = δ_n = exp(−n^s), t = 0.5, α = α_n = n^{−s/2}, d = d_n, k = k_n, and l = q_n k_n. Then λ = λ_n = 4(2 − t) t^{−1} σa √(2 ln(2d_n/δ_n)/n). The condition ρ^{(2)}_{Â_n,(1+2q_n)k_n} ≤ (1 + q_n) μ^{(2)}_{Â_n,(1+2q_n)k_n} implies that

l/k ≥ ρ^{(2)}_{Â,k+2l}/μ^{(2)}_{Â,k+2l} − 1 ≥ 4 (π^{(2)}_{Â,l})² l^{−1},

where the second inequality is due to Proposition 3.1. Therefore, the condition t ≤ 1 − π^{(2)}_{Â,l} k^{0.5}/l is satisfied. Now, from the conditions k_n = o(n^{1−2s}) and k_n ln(d_n) = o(n^{1−s}), we have lim_{n→∞} √k_n λ_n/α_n = 0. Since (μ^{(2)})^{−1} = O(1), the condition α ≥ 16 (t μ^{(2)})^{−1} √k λ is also satisfied when n is sufficiently large.
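The thresholded supports supp_α(·) used throughout the feature selection arguments above are simple to compute; a sketch (assuming, as the nesting statements suggest, the strict-inequality convention supp_α(β) = { j : |β_j| > α } — the paper's exact convention may differ):

```python
def supp(beta, alpha=0.0):
    """supp_alpha(beta) = { j : |beta_j| > alpha }; alpha = 0 gives supp_0."""
    return {j for j, b in enumerate(beta) if abs(b) > alpha}

beta_hat = [0.9, -0.02, 0.0, 0.6, 0.04]
print(supp(beta_hat, 0.5))                          # selected at alpha = 0.5
print(supp(beta_hat, 0.75) <= supp(beta_hat, 0.5))  # supports are nested
```

The second line illustrates the monotonicity supp_{1.5α}(β) ⊆ supp_α(β) that the proofs rely on.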
Therefore, by Theorem 7.1, when n is sufficiently large,

supp_{1.5α_n}(β̄_n) ⊆ supp_{α_n}(β̂_n) ⊆ supp_{0.5α_n}(β̄_n)

with probability at least 1 − δ_n. Since 1/min_{j ∈ supp_0(β̄_n)} |β̄_{n,j}| = o(n^{s/2}), we know that when n is sufficiently large, min_{j ∈ supp_0(β̄_n)} |β̄_{n,j}| > 2α_n. Thus supp_{1.5α_n}(β̄_n) = supp_{0.5α_n}(β̄_n) = supp_0(β̄_n). This means that supp_0(β̄_n) = supp_{α_n}(β̂_n) with probability at least 1 − δ_n when n is sufficiently large.

10.12 Proof of Theorem 8.1

Let F = supp_λ(β̄). We would like to apply Lemma 10.2, Lemma 10.4, and Lemma 10.3, with F̂ = supp_α(β̂) and F_0 = supp_{1.5α}(β̄). First, Theorem 7.1 implies that with probability larger than 1 − δ,

supp_{1.5α}(β̄) ⊆ supp_α(β̂) ⊆ supp_{0.5α}(β̄) ⊆ F.

Moreover,

ε̂ ≤ aσ √(2 ln(2d/δ)/n).   (7)

This can be seen from the proof of Theorem 4.1 (which directly implies Theorem 7.1). Let z_i = [z_{i,1}, ..., z_{i,d}] ∈ R^d so that z_{i,j} = x_{i,j}/n if j ∈ F_0 and z_{i,j} = 0 otherwise. We thus have ε̂_{F_0} = Σ_i (y_i − Ey_i) z_i. Since each y_i − Ey_i is an independent sub-Gaussian random variable, and Σ_i ‖z_i‖_2² = Σ_{j ∈ F_0} Σ_i (x_{i,j}/n)² ≤ qa²/n, we obtain from Proposition 10.2 that with probability larger than 1 − δ:

‖ε̂_{F_0}‖_2 ≤ aσ (1 + √(20 ln(1/δ))) √(q/n).   (8)

Therefore, with probability exceeding 1 − 2δ, both (7) and (8) hold. Therefore, Lemma 10.4 implies that

‖β̂_I‖_2 ≤ (2/(t μ^{(2)})) [ 4 θ^{(2)}_{Â,l} l^{−1} ‖β̄_G‖_1 + ρ^{(2)} ‖β̄_G‖_2 + √(k − q) λ + ‖ε̂_{F_0}‖_2 ] + ‖β̄_G‖_2
≤ (2/(t μ^{(2)})) [ 5 ρ^{(2)} ‖β̄_G‖_2 + √(k − q) λ + aσ (1 + √(20 ln(1/δ))) √(q/n) ] + ‖β̄_G‖_2.

In the first inequality, we have used π^{(2)}_{Â,l} ≤ θ^{(2)}_{Â,l}/μ^{(2)} (Proposition 3.1). In the second inequality, we have used ‖β̄_G‖_1 ≤ √s ‖β̄_G‖_2 ≤ √l ‖β̄_G‖_2, and θ^{(2)}_{Â,l} ≤ ρ^{(2)} √l (Proposition 3.1). By combining the above estimate with Lemma 10.3, we obtain the desired bound.

10.13 Proof of Corollary 8.1

We take p = ∞, l = s, and t = 0.5 in Theorem 8.1. Proposition 3.2 implies that

π^{(2)}_{Â,l} ≤ M_Â (2s)^{1/2} s/(1 − 2s M_Â) ≤ √(2s)/4,
ω^{(p)}_{Â,s+l} ≥ 1 − 2s M_Â ≥ 2/3, and
π^{(p)}_{Â,l} ≤ M_Â s/(1 − 2s M_Â) ≤ 1/4.

Now it is clear that t ≤ 1 − π^{(2)}_{Â,l} k^{0.5}/l holds. Moreover, the condition α ≥ 16 (ω^{(p)}_{Â,s+l})^{−1} s^{1/p} λ with t ≤ 1 − π^{(p)}_{Â,l} s^{1 − 1/p}/l is also valid.
Therefore, Theorem 8.1 can be applied with μ^{(2)} ≥ 1 − M_Â(k + l) ≥ 2/3 and ρ^{(2)} ≤ 1 + M_Â(k + l) ≤ 4/3.
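The two-stage procedure with selective penalization analyzed in Theorem 8.1 and Corollary 8.1 can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the plain proximal-gradient (ISTA) solver, its crude step size, the strict threshold, and all names and test data are my own choices.

```python
def soft(v, tau):
    # soft-thresholding operator S(v, tau) = sign(v) * max(|v| - tau, 0)
    return v - tau if v > tau else v + tau if v < -tau else 0.0

def lasso_ista(X, y, lam, iters=2000):
    """Proximal gradient (ISTA) for (1/n)||X beta - y||_2^2 + sum_j lam[j]|beta_j|,
    allowing a different L1 penalty lam[j] on each coordinate."""
    n, d = len(X), len(X[0])
    lip = 2.0 * sum(x * x for row in X for x in row) / n  # crude Lipschitz bound
    eta = 1.0 / lip
    beta = [0.0] * d
    for _ in range(iters):
        resid = [sum(b * xij for b, xij in zip(beta, X[i])) - y[i]
                 for i in range(n)]
        grad = [2.0 / n * sum(resid[i] * X[i][j] for i in range(n))
                for j in range(d)]
        beta = [soft(beta[j] - eta * grad[j], eta * lam[j]) for j in range(d)]
    return beta

def two_stage(X, y, lam, alpha):
    """Stage 1: ordinary Lasso; threshold at alpha to select large coefficients.
    Stage 2: re-solve with the L1 penalty removed on the selected set."""
    d = len(X[0])
    stage1 = lasso_ista(X, y, [lam] * d)
    selected = {j for j in range(d) if abs(stage1[j]) > alpha}
    lam2 = [0.0 if j in selected else lam for j in range(d)]
    return lasso_ista(X, y, lam2)

# Orthogonal toy design: the one-stage Lasso shrinks the large coefficient
# (from 2.0 down to 2.0 - lam), while the two-stage fit removes that bias.
X = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
y = [2.0, 2.0, 0.1, 0.1]
print(lasso_ista(X, y, [0.5, 0.5]))  # roughly [1.5, 0.0]
print(two_stage(X, y, 0.5, 0.5))     # roughly [2.0, 0.0]
```

The toy example mirrors the intuition behind the theorem: the penalty is kept only on coordinates believed to be irrelevant, so large coefficients are estimated without the Lasso shrinkage bias.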
11 Conclusion

This paper considers the performance of least squares regression with L_1 regularization from the parameter estimation accuracy and feature selection quality perspectives. To this end, a general theorem is established in Section 4. An important consequence of this theorem is a performance bound for Lasso similar to that of [4] for the Dantzig selector. The detailed comparison is given in Section 6. Our result gives an affirmative answer to an open question in [10] concerning whether a bound similar to that of [4] holds for Lasso. Another important consequence of Theorem 4.1 is the feature selection quality of Lasso using a non-zero thresholding feature selection method, which extends the zero thresholding method considered in [21]. Our method can remove some limitations of [21], as discussed in Section 7. Moreover, we pointed out that the standard (one-stage) Lasso may be sub-optimal under certain conditions. However, the problem can be remedied by combining the parameter estimation and feature selection perspectives of Lasso. In Section 8, a two-stage L_1-regularization procedure with selective penalization was analyzed. In practice, if one is able to appropriately tune the thresholding parameter using cross-validation, then the procedure should not be much worse than the standard one-stage Lasso. Theoretically, it is shown that if the target vector can be decomposed as the sum of a sparse parameter vector with large coefficients and another (less sparse) vector with small coefficients, then the two-stage L_1-regularization procedure can lead to improved performance when d is large.

Finally, we shall point out some limitations of our analysis. First, the procedures considered in this work are not adaptive. For example, in the one-stage method, the regularization parameter λ has to satisfy certain conditions that depend on t and the noise level σ. In feature selection and the two-stage method, the threshold parameter α also needs to satisfy certain conditions.
Although in practice such parameters can be tuned using cross-validation, it remains an interesting problem to come up with a theoretically justified procedure for setting them, which would lead to a so-called adaptive estimation method. Moreover, although the bounds in this paper can be applied in random design situations with small modifications, the results are incomplete for random design because the conditions on the design matrix (which is now random) need to be shown to concentrate at a certain rate. Although a number of such results exist in the random matrix literature, a more general treatment with better integration is still needed in future work.

Acknowledgments

The author would like to thank Trevor Hastie, Terence Tao, and an anonymous reviewer for helpful remarks that clarified differences between L_1 regularization and the Dantzig selector.

References

[1] Peter Bickel, Yaacov Ritov, and Alexandre Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Technical report.
[2] Florentina Bunea, Alexandre Tsybakov, and Marten H. Wegkamp. Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics, 1.
[3] Florentina Bunea, Alexandre B. Tsybakov, and Marten H. Wegkamp. Aggregation for Gaussian regression. Annals of Statistics, 35.
[4] Emmanuel Candes and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, to appear.
[5] Emmanuel J. Candes and Yaniv Plan. Near-ideal model selection by l_1 minimization. Submitted.
[6] Emmanuel J. Candes, Justin Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59.
[7] Emmanuel J. Candes and Terence Tao. Decoding by linear programming. IEEE Trans. on Information Theory, 51.
[8] Emmanuel J. Candes and Terence Tao. Rejoinder: the Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, 35.
[9] David L. Donoho, Michael Elad, and Vladimir N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Info. Theory, 52(1):6-18.
[10] Bradley Efron, Trevor Hastie, and Robert Tibshirani. Discussion of the Dantzig selector. Annals of Statistics.
[11] Sara A. Van De Geer. High dimensional generalized linear models and the Lasso. Technical Report 133, ETH.
[12] Gareth M. James and Peter Radchenko. Generalized Dantzig selector with shrinkage tuning. Biometrika, to appear.
[13] Gareth M. James, Peter Radchenko, and Jinchi Lv. DASSO: Connections between the Dantzig selector and lasso. Journal of the Royal Statistical Society, Series B, to appear.
[14] Vladimir Koltchinskii. The Dantzig selector and sparsity oracle inequalities. Manuscript.
[15] Vladimir Koltchinskii. Sparsity in penalized empirical risk minimization. Annales de l'Institut Henri Poincaré, to appear.
[16] Nicolai Meinshausen. Relaxed Lasso. Computational Statistics and Data Analysis, 52.
[17] Nicolai Meinshausen and Peter Buhlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34.
[18] Nicolai Meinshausen and Bin Yu. Lasso-type recovery of sparse representations for high-dimensional data. Technical report, UC Berkeley.
[19] Martin Wainwright.
Sharp thresholds for high-dimensional and noisy recovery of sparsity. Technical report, Department of Statistics, UC Berkeley.
[20] Cun-Hui Zhang and Jian Huang. Model-selection consistency of the Lasso in high-dimensional linear regression. Technical report, Rutgers University.
[21] Peng Zhao and Bin Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7.
Sum of squares of degrees in a graph
Sum of squares of degrees in a graph Bernardo M. Ábrego 1 Silvia Fernández-Merchant 1 Michael G. Neubauer William Watkins Department of Mathematics California State University, Northridge 18111 Nordhoff
Orthogonal Bases and the QR Algorithm
Orthogonal Bases and the QR Algorithm Orthogonal Bases by Peter J Olver University of Minnesota Throughout, we work in the Euclidean vector space V = R n, the space of column vectors with n real entries
Linear Programming I
Linear Programming I November 30, 2003 1 Introduction In the VCR/guns/nuclear bombs/napkins/star wars/professors/butter/mice problem, the benevolent dictator, Bigus Piguinus, of south Antarctica penguins
Date: April 12, 2001. Contents
2 Lagrange Multipliers Date: April 12, 2001 Contents 2.1. Introduction to Lagrange Multipliers......... p. 2 2.2. Enhanced Fritz John Optimality Conditions...... p. 12 2.3. Informative Lagrange Multipliers...........
The Trip Scheduling Problem
The Trip Scheduling Problem Claudia Archetti Department of Quantitative Methods, University of Brescia Contrada Santa Chiara 50, 25122 Brescia, Italy Martin Savelsbergh School of Industrial and Systems
7 Gaussian Elimination and LU Factorization
7 Gaussian Elimination and LU Factorization In this final section on matrix factorization methods for solving Ax = b we want to take a closer look at Gaussian elimination (probably the best known method
Globally Optimal Crowdsourcing Quality Management
Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University [email protected] Aditya G. Parameswaran University of Illinois (UIUC) [email protected] Jennifer Widom Stanford
ALMOST COMMON PRIORS 1. INTRODUCTION
ALMOST COMMON PRIORS ZIV HELLMAN ABSTRACT. What happens when priors are not common? We introduce a measure for how far a type space is from having a common prior, which we term prior distance. If a type
