# LEAST ANGLE REGRESSION. Stanford University

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 The Annals of Statistics 2004, Vol. 32, No. 2, Institute of Mathematical Statistics, 2004 LEAST ANGLE REGRESSION BY BRADLEY EFRON, 1 TREVOR HASTIE, 2 IAIN JOHNSTONE 3 AND ROBERT TIBSHIRANI 4 Stanford University The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived: (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps us understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm. (3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a C p estimate of prediction error; this allows a principled choice among the range of possible LARS estimates. LARS and its variants are computationally efficient: the paper describes a publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates. 1. Introduction. Automatic model-building algorithms are familiar, and sometimes notorious, in the linear model literature: Forward Selection, Backward Elimination, All Subsets regression and various combinations are used to automatically produce good linear models for predicting a response y on the basis of some measured covariates x 1,x 2,...,x m. Goodness is often defined in terms of prediction accuracy, but parsimony is another important criterion: simpler models are preferred for the sake of scientific insight into the x y relationship. Two promising recent model-building algorithms, the Lasso and Forward Stagewise lin- Received March 2002; revised January Supported in part by NSF Grant DMS and NIH Grant 8R01-EB Supported in part by NSF Grant DMS and NIH Grant R01-EB Supported in part by NSF Grant DMS and NIH Grant R01-EB Supported in part by NSF Grant DMS and NIH Grant 2R01-CA AMS 2000 subject classification. 62J07. Key words and phrases. Lasso, boosting, linear regression, coefficient paths, variable selection. 407

2 408 EFRON, HASTIE, JOHNSTONE AND TIBSHIRANI ear regression, will be discussed here, and motivated in terms of a computationally simpler method called Least Angle Regression. Least Angle Regression (LARS) relates to the classic model-selection method known as Forward Selection, or forward stepwise regression, described in Weisberg [(1980), Section 8.5]: given a collection of possible predictors, we select the one having largest absolute correlation with the response y, sayx j1, and perform simple linear regression of y on x j1. This leaves a residual vector orthogonal to x j1, now considered to be the response. We project the other predictors orthogonally to x j1 and repeat the selection process. After k steps this results in a set of predictors x j1,x j2,...,x jk that are then used in the usual way to construct a k-parameter linear model. Forward Selection is an aggressive fitting technique that can be overly greedy, perhaps eliminating at the second step useful predictors that happen to be correlated with x j1. Forward Stagewise, as described below, is a much more cautious version of Forward Selection, which may take thousands of tiny steps as it moves toward a final model. It turns out, and this was the original motivation for the LARS algorithm, that a simple formula allows Forward Stagewise to be implemented using fairly large steps, though not as large as a classic Forward Selection, greatly reducing the computational burden. The geometry of the algorithm, described in Section 2, suggests the name Least Angle Regression. It then happens that this same geometry applies to another, seemingly quite different, selection method called the Lasso [Tibshirani (1996)]. The LARS Lasso Stagewise connection is conceptually as well as computationally useful. The Lasso is described next, in terms of the main example used in this paper. Table 1 shows a small part of the data for our main example. Ten baseline variables, age, sex, body mass index, average blood pressure and six blood serum measurements, were obtained for each of n = 442 diabetes TABLE 1 Diabetes study: 442 diabetes patients were measured on 10 baseline variables; a prediction model was desired for the response variable, a measure of disease progression one year after baseline AGE SEX BMI BP Serum measurements Response Patient x 1 x 2 x 3 x 4 x 5 x 6 x 7 x 8 x 9 x 10 y

3 LEAST ANGLE REGRESSION 409 patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline. The statisticians were asked to construct a model that predicted response y from covariates x 1,x 2,...,x 10. Two hopes were evident here, that the model would produce accurate baseline predictions of response for future patients and that the form of the model would suggest which covariates were important factors in disease progression. The Lasso is a constrained version of ordinary least squares (OLS). Let x 1, x 2,...,x m be n-vectors representing the covariates, m = 10 and n = 442 in the diabetes study, and let y be the vector of responses for the n cases. By location and scale transformations we can always assume that the covariates have been standardized to have mean 0 and unit length, and that the response has mean 0, n n n (1.1) y i = 0, x ij = 0, xij 2 = 1 forj = 1, 2,...,m. i=1 i=1 This is assumed to be the case in the theory which follows, except that numerical results are expressed in the original units of the diabetes example. A candidate vector of regression coefficients β = ( β 1, β 2,..., β m ) gives prediction vector µ, (1.2) m µ = x β j j = X β [X n m = (x 1, x 2,...,x m )] j=1 with total squared error (1.3) n S( β) = y µ 2 = (y i µ i ) 2. i=1 Let T( β) be the absolute norm of β, (1.4) m T( β) = β j. j=1 The Lasso chooses β by minimizing S( β) subject to a bound t on T( β), (1.5) Lasso: minimize S( β) subject to T( β) t. Quadratic programming techniques can be used to solve (1.5) though we will present an easier method here, closely related to the homotopy method of Osborne, Presnell and Turlach (2000a). The left panel of Figure 1 shows all Lasso solutions β(t) for the diabetes study, as t increases from 0, where β = 0, to t = , where β equals the OLS regression vector, the constraint in (1.5) no longer binding. We see that the Lasso tends to shrink the OLS coefficients toward 0, more so for small values of t. Shrinkage often improves prediction accuracy, trading off decreased variance for increased bias as discussed in Hastie, Tibshirani and Friedman (2001). i=1

4 410 EFRON, HASTIE, JOHNSTONE AND TIBSHIRANI FIG. 1. Estimates of regression coefficients β j, j = 1, 2,...,10, for the diabetes study. (Left panel) Lasso estimates, as a function of t = j β j. The covariates enter the regression equation sequentially as t increases, in order j = 3, 9, 4, 7,...,1. (Right panel) The same plot for Forward Stagewise Linear Regression. The two plots are nearly identical, but differ slightly for large t as shown in the track of covariate 8. The Lasso also has a parsimony property: for any given constraint value t, only a subset of the covariates have nonzero values of β j.att = 1000, for example, only variables 3, 9, 4 and 7 enter the Lasso regression model (1.2). If this model provides adequate predictions, a crucial question considered in Section 4, the statisticians could report these four variables as the important ones. Forward Stagewise Linear Regression, henceforth called Stagewise, is an iterative technique that begins with µ = 0 and builds up the regression function in successive small steps. If µ is the current Stagewise estimate, let c( µ) be the vector of current correlations (1.6) ĉ = c( µ) = X (y µ), so that ĉ j is proportional to the correlation between covariate x j and the current residual vector. The next step of the Stagewise algorithm is taken in the direction of the greatest current correlation, (1.7) ĵ = arg max ĉ j and µ µ + ε sign(ĉĵ) xĵ, with ε some small constant. Small is important here: the big choice ε = ĉĵ leads to the classic Forward Selection technique, which can be overly greedy, impulsively eliminating covariates which are correlated with xĵ.thestagewise procedure is related to boosting and also to Friedman s MART algorithm

5 LEAST ANGLE REGRESSION 411 [Friedman (2001)]; see Section 8, as well as Hastie, Tibshirani and Friedman [(2001), Chapter 10 and Algorithm 10.4]. The right panel of Figure 1 shows the coefficient plot for Stagewise applied to the diabetes data. The estimates were built up in 6000 Stagewise steps [making ε in (1.7) small enough to conceal the Etch-a-Sketch staircase seen in Figure 2, Section 2]. The striking fact is the similarity between the Lasso and Stagewise estimates. Although their definitions look completely different, the results are nearly, but not exactly, identical. The main point of this paper is that both Lasso and Stagewise are variants of a basic procedure called Least Angle Regression, abbreviated LARS (the S suggesting Lasso and Stagewise ). Section 2 describes the LARS algorithm while Section 3 discusses modifications that turn LARS into Lasso or Stagewise, reducing the computational burden by at least an order of magnitude for either one. Sections 5 and 6 verify the connections stated in Section 3. Least Angle Regression is interesting in its own right, its simple structure lending itself to inferential analysis. Section 4 analyzes the degrees of freedom of a LARS regression estimate. This leads to a C p type statistic that suggests which estimate we should prefer among a collection of possibilities like those in Figure 1. A particularly simple C p approximation, requiring no additional computation beyond that for the β vectors, is available for LARS. Section 7 briefly discusses computational questions. An efficient S program for all three methods, LARS, Lasso and Stagewise, is available. Section 8 elaborates on the connections with boosting. 2. The LARS algorithm. Least Angle Regression is a stylized version of the Stagewise procedure that uses a simple mathematical formula to accelerate the computations. Only m steps are required for the full set of solutions, where m is the number of covariates: m = 10 in the diabetes example compared to the 6000 steps used in the right panel of Figure 1. This section describes the LARS algorithm. Modifications of LARS that produce Lasso and Stagewise solutions are discussed in Section 3, and verified in Sections 5 and 6. Section 4 uses the simple structure of LARS to help analyze its estimation properties. The LARS procedure works roughly as follows. As with classic Forward Selection, we start with all coefficients equal to zero, and find the predictor most correlated with the response, say x j1. We take the largest step possible in the direction of this predictor until some other predictor, say x j2, has as much correlation with the current residual. At this point LARS parts company with Forward Selection. Instead of continuing along x j1, LARS proceeds in a direction equiangular between the two predictors until a third variable x j3 earns its way into the most correlated set. LARS then proceeds equiangularly between x j1,x j2 and x j3, that is, along the least angle direction, until a fourth variable enters, and so on.

6 412 EFRON, HASTIE, JOHNSTONE AND TIBSHIRANI The remainder of this section describes the algebra necessary to execute the equiangular strategy. As usual the algebraic details look more complicated than the simple underlying geometry, but they lead to the highly efficient computational algorithm described in Section 7. LARS builds up estimates µ = X β, (1.2), in successive steps, each step adding one covariate to the model, so that after k steps just k of the β j s are nonzero. Figure 2 illustrates the algorithm in the situation with m = 2 covariates, X = (x 1, x 2 ). In this case the current correlations (1.6) depend only on the projection ȳ 2 of y into the linear space L(X) spanned by x 1 and x 2, (2.1) c( µ) = X (y µ) = X (ȳ 2 µ). The algorithm begins at µ 0 = 0 [remembering that the response has had its mean subtracted off, as in (1.1)]. Figure 2 has ȳ 2 µ 0 making a smaller angle with x 1 than x 2,thatis,c 1 ( µ 0 )>c 2 ( µ 0 ). LARS then augments µ 0 in the direction of x 1,to (2.2) µ 1 = µ 0 + γ 1 x 1. Stagewise would choose γ 1 equal to some small value ε, and then repeat the process many times. Classic Forward Selection would take γ 1 large enough to make µ 1 equal ȳ 1, the projection of y into L(x 1 ). LARS uses an intermediate value of γ 1, the value that makes ȳ 2 µ, equally correlated with x 1 and x 2 ;thatis, ȳ 2 µ 1 bisects the angle between x 1 and x 2,soc 1 ( µ 1 ) = c 2 ( µ 1 ). FIG. 2. The LARS algorithm in the case of m = 2 covariates; ȳ 2 is the projection of y into L(x 1, x 2 ). Beginning at µ 0 = 0, the residual vector ȳ 2 µ 0 has greater correlation with x 1 than x 2 ; the next LARS estimate is µ 1 = µ 0 + γ 1 x 1, where γ 1 is chosen such that ȳ 2 µ 1 bisects the angle between x 1 and x 2 ; then µ 2 = µ 1 + γ 2 u 2, where u 2 is the unit bisector; µ 2 = ȳ 2 in the case m = 2, but not for the case m>2; see Figure 4. The staircase indicates a typical Stagewise path. Here LARS gives the Stagewise track as ε 0, but a modification is necessary to guarantee agreement in higher dimensions; see Section 3.2.

7 (2.3) LEAST ANGLE REGRESSION 413 Let u 2 be the unit vector lying along the bisector. The next LARS estimate is µ 2 = µ 1 + γ 2 u 2, with γ 2 chosen to make µ 2 = ȳ 2 in the case m = 2. With m>2 covariates, γ 2 would be smaller, leading to another change of direction, as illustrated in Figure 4. The staircase in Figure 2 indicates a typical Stagewise path. LARS is motivated by the fact that it is easy to calculate the step sizes γ 1, γ 2,... theoretically, short-circuiting the small Stagewise steps. Subsequent LARS steps, beyond two covariates, are taken along equiangular vectors, generalizing the bisector u 2 in Figure 2. We assume that the covariate vectors x 1, x 2,...,x m are linearly independent. ForA a subset of the indices {1, 2,...,m}, define the matrix (2.4) where the signs s j equal ±1. Let X A = ( s j x j ) j A, (2.5) G A = X A X A and A A = (1 A G 1 A 1 A) 1/2, 1 A being a vector of 1 s of length equaling A, the size of A.The (2.6) equiangular vector u A = X A w A where w A = A A G 1 A 1 A, is the unit vector making equal angles, less than 90, with the columns of X A, (2.7) X A u A = A A 1 A and u A 2 = 1. We can now fully describe the LARS algorithm. As with the Stagewise procedure we begin at µ 0 = 0 and build up µ by steps, larger steps in the LARS case. Suppose that µ A is the current LARS estimate and that (2.8) ĉ = X (y µ A ) is the vector of current correlations (1.6). The active set A is the set of indices corresponding to covariates with the greatest absolute current correlations, (2.9) Letting (2.10) Ĉ = max{ ĉ j } and A ={j : ĉ j =Ĉ}. j s j = sign{ĉ j } for j A, we compute X A,A A and u A as in (2.4) (2.6), and also the inner product vector (2.11) a X u A. Then the next step of the LARS algorithm updates µ A,sayto (2.12) µ A+ = µ A + γ u A,

8 414 EFRON, HASTIE, JOHNSTONE AND TIBSHIRANI where (2.13) { + Ĉ ĉj γ = min, j A c A A a j Ĉ + ĉ j A A + a j min + indicates that the minimum is taken over only positive components within each choice of j in (2.13). Formulas (2.12) and (2.13) have the following interpretation: define (2.14) µ(γ ) = µ A + γ u A, for γ>0, so that the current correlation c j (γ ) = x ( ) (2.15) j y µ(γ ) = ĉj γa j. For j A, (2.7) (2.9) yield (2.16) c j (γ ) =Ĉ γa A, showing that all of the maximal absolute current correlations decline equally. For j A c, equating (2.15) with (2.16) shows that c j (γ ) equals the maximal value at γ = (Ĉ ĉ j )/(A A a j ). Likewise c j (γ ), the current correlation for the reversed covariate x j, achieves maximality at (Ĉ + ĉ j)/(a A + a j ). Therefore γ in (2.13) is the smallest positive value of γ such that some new index ĵ joins the active set; ĵ is the minimizing index in (2.13), and the new active set A + is A {ĵ}; the new maximum absolute correlation is Ĉ+ = Ĉ γa A. Figure 3 concerns the LARS analysis of the diabetes data. The complete algorithm required only m = 10 steps of procedure (2.8) (2.13), with the variables } ; FIG. 3. LARS analysis of the diabetes study: (left) estimates of regression coefficients β j, j = 1, 2,...,10; plotted versus β j ; plot is slightly different than either Lasso or Stagewise, Figure 1; (right) absolute current correlations as function of LARS step; variables enter active set (2.9) in order 3, 9, 4, 7,...,1; heavy curve shows maximum current correlation Ĉk declining with k.

9 LEAST ANGLE REGRESSION 415 joining the active set A in the same order as for the Lasso: 3, 9, 4, 7,...,1. Tracks of the regression coefficients β j are nearly but not exactly the same as either the Lasso or Stagewise tracks of Figure 1. The right panel shows the absolute current correlations (2.17) ĉ kj = x j (y µ k 1) for variables j = 1, 2,...,10, as a function of the LARS step k. The maximum correlation (2.18) Ĉ k = max{ ĉ kj } = Ĉk 1 γ k 1 A k 1 declines with k, as it must. At each step a new variable j joins the active set, henceforth having ĉ kj =Ĉk. The sign s j of each x j in (2.4) stays constant as the active set increases. Section 4 makes use of the relationship between Least Angle Regression and Ordinary Least Squares illustrated in Figure 4. Suppose LARS has just completed step k 1, giving µ k 1, and is embarking upon step k. The active set A k, (2.9), will have k members, giving X k, G k,a k and u k as in (2.4) (2.6) (here replacing subscript A with k ). Let ȳ k indicate the projection of y into L(X k ), which, since µ k 1 L(X k 1 ),is (2.19) ȳ k = µ k 1 + X k G 1 k X k (y µ k 1) = µ k 1 + Ĉk u k, A k the last equality following from (2.6) and the fact that the signed current correlations in A k all equal Ĉk, (2.20) X k (y µ k 1) = Ĉk1 A. Since u k is a unit vector, (2.19) says that ȳ k µ k 1 has length (2.21) γ k Ĉk. A k Comparison with (2.12) shows that the LARS estimate µ k lies on the line FIG. 4. At each stage the LARS estimate µ k approaches, but does not reach, the corresponding OLS estimate ȳ k.

10 416 EFRON, HASTIE, JOHNSTONE AND TIBSHIRANI from µ k 1 to ȳ k, (2.22) µ k µ k 1 = γ k γ k (ȳ k µ k 1 ). It is easy to see that γ k, (2.12), is always less than γ k,sothat µ k lies closer than ȳ k to µ k 1. Figure 4 shows the successive LARS estimates µ k always approaching but never reaching the OLS estimates ȳ k. The exception is at the last stage: since A m contains all covariates, (2.13) is not defined. By convention the algorithm takes γ m = γ m = Ĉm/A m,making µ m = ȳ m and β m equal the OLS estimate for the full set of m covariates. The LARS algorithm is computationally thrifty. Organizing the calculations correctly, the computational cost for the entire m steps is of the same order as that required for the usual Least Squares solution for the full set of m covariates. Section 7 describes an efficient LARS program available from the authors. With the modifications described in the next section, this program also provides economical Lasso and Stagewise solutions. 3. Modified versions of Least Angle Regression. Figures 1 and 3 show Lasso, Stagewise and LARS yielding remarkably similar estimates for the diabetes data. The similarity is no coincidence. This section describes simple modifications of the LARS algorithm that produce Lasso or Stagewise estimates. Besides improved computational efficiency, these relationships elucidate the methods rationale: all three algorithms can be viewed as moderately greedy forward stepwise procedures whose forward progress is determined by compromise among the currently most correlated covariates. LARS moves along the most obvious compromise direction, the equiangular vector (2.6), while Lasso and Stagewise put some restrictions on the equiangular strategy The LARS Lasso relationship. The full set of Lasso solutions, as shown for the diabetes study in Figure 1, can be generated by a minor modification of the LARS algorithm (2.8) (2.13). Our main result is described here and verified in Section 5. It closely parallels the homotopy method in the papers by Osborne, Presnell and Turlach (2000a, b), though the LARS approach is somewhat more direct. Let β be a Lasso solution (1.5), with µ = X β. Then it is easy to show that the sign of any nonzero coordinate β j must agree with the sign s j of the current correlation ĉ j = x j (y µ), (3.1) sign( β j ) = sign( ĉ j ) = s j ; see Lemma 8 of Section 5. The LARS algorithm does not enforce restriction (3.1), but it can easily be modified to do so.

11 LEAST ANGLE REGRESSION 417 Suppose we have just completed a LARS step, giving a new active set A as in (2.9), and that the corresponding LARS estimate µ A corresponds to a Lasso solution µ = X β.let (3.2) w A = A A G 1 A 1 A, a vector of length the size of A, and (somewhat abusing subscript notation) define d to be the m-vector equaling s j w Aj for j A and zero elsewhere. Moving in the positive γ direction along the LARS line (2.14), we see that (3.3) µ(γ ) = Xβ(γ ), where β j (γ ) = β j + γ d j for j A. Therefore β j (γ ) will change sign at (3.4) γ j = β j / d j, the first such change occurring at (3.5) γ = min {γ j }, γ j >0 say for covariate x j ; γ equals infinity by definition if there is no γ j > 0. If γ is less than γ, (2.13), then β j (γ ) cannot be a Lasso solution for γ> γ since the sign restriction (3.1) must be violated: β j (γ ) has changed sign while c j (γ ) has not. [The continuous function c j (γ ) cannot change sign within a single LARS step since c j (γ ) =Ĉ γa A > 0, (2.16).] LASSO MODIFICATION. If γ < γ, stop the ongoing LARS step at γ = γ and remove j from the calculation of the next equiangular direction. That is, (3.6) rather than (2.12). µ A+ = µ A + γ u A and A + = A { j} THEOREM 1. Under the Lasso modification, and assuming the one at a time condition discussed below, the LARS algorithm yields all Lasso solutions. The active sets A grow monotonically larger as the original LARS algorithm progresses, but the Lasso modification allows A to decrease. One at a time means that the increases and decreases never involve more than a single index j. This is the usual case for quantitative data and can always be realized by adding a little jitter to the y values. Section 5 discusses tied situations. The Lasso diagram in Figure 1 was actually calculated using the modified LARS algorithm. Modification (3.6) came into play only once, at the arrowed point in the left panel. There A contained all 10 indices while A + = A {7}. Variable7 was restored to the active set one LARS step later, the next and last step then taking β all the way to the full OLS solution. The brief absence of variable 7 had an effect on the tracks of the others, noticeably β 8. The price of using Lasso instead of unmodified LARS comes in the form of added steps, 12 instead of 10 in this example. For the more complicated quadratic model of Section 4, the comparison was 103 Lasso steps versus 64 for LARS.

12 418 EFRON, HASTIE, JOHNSTONE AND TIBSHIRANI 3.2. The LARS Stagewise relationship. The staircase in Figure 2 indicates how the Stagewise algorithm might proceed forward from µ 1, a point of equal current correlations ĉ 1 = ĉ 2, (2.8). The first small step has (randomly) selected index j = 1, taking us to µ 1 + εx 1. Now variable 2 is more correlated, (3.7) x 2 (y µ 1 εx 1 )>x 1 (y µ 1 εx 1 ), forcing j = 2 to be the next Stagewise choice and so on. We will consider an idealized Stagewise procedure in which the step size ε goes to zero. This collapses the staircase along the direction of the bisector u 2 in Figure 2, making the Stagewise and LARS estimates agree. They always agree for m = 2 covariates, but another modification is necessary for LARS to produce Stagewise estimates in general. Section 6 verifies the main result described next. Suppose that the Stagewise procedure has taken N steps of infinitesimal size ε from some previous estimate µ, with (3.8) N j #{steps with selected index j}, j = 1, 2,...,m. It is easy to show, as in Lemma 11 of Section 6, that N j = 0forj not in the active set A defined by the current correlations x j (y µ), (2.9). Letting (3.9) P (N 1,N 2,...,N m )/N, with P A indicating the coordinates of P for j A, the new estimate is (3.10) µ = µ + NεX A P A [(2.4)]. (Notice that the Stagewise steps are taken along the directions s j x j.) The LARS algorithm (2.14) progresses along (3.11) µ A + γx A w A, where w A = A A G 1 A 1 A [(2.6) (3.2)]. Comparing (3.10) with (3.11) shows that LARS cannot agree with Stagewise if w A has negative components, since P A is nonnegative. To put it another way, the direction of Stagewise progress X A P A must lie in the convex cone generated by the columns of X A, { C A = v = } (3.12) s j x j P j,p j 0. j A If u A C A then there is no contradiction between (3.12) and (3.13). If not it seems natural to replace u A with its projection into C A, that is, the nearest point in the convex cone. STAGEWISE MODIFICATION. Proceed as in (2.8) (2.13), except with u A replaced by u ˆB, the unit vector lying along the projection of u A into C A.(See Figure 9 in Section 6.)

13 LEAST ANGLE REGRESSION 419 THEOREM 2. Under the Stagewise modification, the LARS algorithm yields all Stagewise solutions. The vector u ˆB in the Stagewise modification is the equiangular vector (2.6) for the subset B A corresponding to the face of C A into which the projection falls. Stagewise is a LARS type algorithm that allows the active set to decrease by one or more indices. This happened at the arrowed point in the right panel of Figure 1: there the set A ={3, 9, 4, 7, 2, 10, 5, 8} was decreased to B = A {3, 7}. It took a total of 13 modified LARS steps to reach the full OLS solution β m = (X X) 1 X y. The three methods, LARS, Lasso and Stagewise, always reach OLS eventually, but LARS does so in only m steps while Lasso and, especially, Stagewise can take longer. For the m = 64 quadratic model of Section 4, Stagewise took 255 steps. According to Theorem 2 the difference between successive Stagewise modified LARS estimates is (3.13) µ A+ µ A = γ u ˆB = γx ˆB w ˆB, as in (3.13). Since u ˆB exists in the convex cone C A, w ˆB must have nonnegative components. This says that the difference of successive coefficient estimates for coordinate j B satisfies (3.14) sign( β +j β j ) = s j, where s j = sign{x j (y µ)}. We can now make a useful comparison of the three methods: 1. Stagewise successive differences of β j agree in sign with the current correlation ĉ j = x j (y µ); 2. Lasso β j agrees in sign with ĉ j ; 3. LARS no sign restrictions (but see Lemma 4 of Section 5). From this point of view, Lasso is intermediate between the LARS and Stagewise methods. The successive difference property (3.14) makes the Stagewise β j estimates move monotonically away from 0. Reversals are possible only if ĉ j changes sign while β j is resting between two periods of change. This happened to variable 7 in Figure 1 between the 8th and 10th Stagewise-modified LARS steps Simulation study. A small simulation study was carried out comparing the LARS, Lasso and Stagewise algorithms. The X matrix for the simulation was based on the diabetes example of Table 1, but now using a Quadratic Model having m = 64 predictors, including interactions and squares of the 10 original covariates: (3.15) Quadratic Model 10 main effects, 45 interactions, 9 squares,

14 420 EFRON, HASTIE, JOHNSTONE AND TIBSHIRANI the last being the squares of each x j except the dichotomous variable x 2. The true mean vector µ for the simulation was µ = Xβ, whereβ was obtained by running LARS for 10 steps on the original (X, y) diabetes data (agreeing in this case with the 10-step Lassoor Stagewise analysis). Subtracting µ from a centered version of the original y vector of Table 1 gave a vector ε = y µ of n = 442 residuals. The true R 2 for this model, µ 2 /( µ 2 + ε 2 ), equaled simulated response vectors y were generated from the model (3.16) y = µ + ε, with ε = (ε1,ε 2,...,ε n ) a random sample, with replacement, from the components of ε. The LARS algorithm with K = 40 steps was run for each simulated data set (X, y ), yielding a sequence of estimates µ (k), k = 1, 2,...,40, and likewise using the Lasso and Stagewise algorithms. Figure 5 compares the LARS, Lasso and Stagewise estimates. For a given estimate µ define the proportion explained pe( µ) to be (3.17) pe( µ) = 1 µ µ 2 / µ 2, so pe(0) = 0 and pe(µ) = 1. The solid curve graphs the average of pe( µ (k) ) over the 100 simulations, versus step number k for LARS, k = 1, 2,...,40. The corresponding curves are graphed for Lasso and Stagewise, except that the horizontal axis is now the average number of nonzero β j terms composing µ(k). For example, µ (40) averaged nonzero terms with Stagewise, compared to for Lasso and 40 for LARS. Figure 5 s most striking message is that the three algorithms performed almost identically, and rather well. The average proportion explained rises quickly, FIG. 5. Simulation study comparing LARS, Lasso and Stagewise algorithms; 100 replications of model (3.15) (3.16). Solid curve shows average proportion explained, (3.17), for LARS estimates as function of number of steps k = 1, 2,...,40; Lasso and Stagewise give nearly identical results; small dots indicate plus or minus one standard deviation over the 100 simulations. Classic Forward Selection (heavy dashed curve) rises and falls more abruptly.

15 LEAST ANGLE REGRESSION 421 reaching a maximum of at k = 10, and then declines slowly as k grows to 40. The light dots display the small standard deviation of pe( µ (k) ) over the 100 simulations, roughly ±0.02. Stopping at any point between k = 5 and 25 typically gave a µ (k) with true predictive R 2 about 0.40, compared to the ideal value for µ. The dashed curve in Figure 5 tracks the average proportion explained by classic Forward Selection. It rises very quickly, to a maximum of after k = 3steps, and then falls back more abruptly than the LARS Lasso Stagewise curves. This behavior agrees with the characterization of Forward Selection as a dangerously greedy algorithm Other LARS modifications. Here are a few more examples of LARS type model-building algorithms. POSITIVE LASSO. (3.18) Constraint (1.5) can be strengthened to minimize S( β) subject to T( β) t and all β j 0. This would be appropriate if the statisticians or scientists believed that the variables x j must enter the prediction equation in their defined directions. Situation (3.18) is a more difficult quadratic programming problem than (1.5), but it can be solved by a further modification of the Lasso-modified LARS algorithm: change ĉ j to ĉ j at both places in (2.9), set s j = 1 instead of (2.10) and change (2.13) to { } + Ĉ ĉj (3.19) γ = min. j A c A A a j The positive Lasso usually does not converge to the full OLS solution β m,evenfor very large choices of t. The changes above amount to considering the x j as generating half-lines rather than full one-dimensional spaces. A positive Stagewise version can be developed in the same way, and has the property that the β j tracks are always monotone. LARS OLS hybrid. After k steps the LARS algorithm has identified a set A k of covariates, for example, A 4 ={3, 9, 4, 7} in the diabetes study. Instead of β k we might prefer β k, the OLS coefficients based on the linear model with covariates in A k using LARS to find the model but not to estimate the coefficients. Besides looking more familiar, this will always increase the usual empirical R 2 measure of fit (though not necessarily the true fitting accuracy), (3.20) R 2 ( β k ) R 2 ( β k ) = where ρ k = γ k / γ k as in (2.22). 1 ρ2 k ρ k (2 ρ k ) [R2 ( β k ) R 2 ( β k 1 )],

16 422 EFRON, HASTIE, JOHNSTONE AND TIBSHIRANI The increases in R 2 were small in the diabetes example, on the order of 0.01 for k 4 compared with R 2 = 0.50, which is expected from (3.20) since we would usually continue LARS until R 2 ( β k ) R 2 ( β k 1 ) was small. For the same reason β k and β k are likely to lie near each other as they did in the diabetes example. Main effects first. It is straightforward to restrict the order in which variables are allowed to enter the LARS algorithm. For example, having obtained A 4 = {3, 9, 4, 7} for the diabetes study, we might then wish to check for interactions. To do this we begin LARS again, replacing y with y µ 4 and x with the n 6matrix whose columns represent the interactions x 3:9, x 3:4,...,x 4:7. Backward Lasso. The Lasso modified LARS algorithm can be run backward, starting from the full OLS solution β m. Assuming that all the coordinates of β m are nonzero, their signs must agree with the signs s j that the current correlations had during the final LARS step. This allows us to calculate the last equiangular direction u A, (2.4) (2.6). Moving backward from µ m = X β m along the line µ(γ ) = µ m γ u A, we eliminate from the active set the index of the first β j that becomes zero. Continuing backward, we keep track of all coefficients β j and current correlations ĉ j, following essentially the same rules for changing A as in Section 3.1. As in (2.3), (3.5) the calculation of γ and γ is easy. The crucial property of the Lasso that makes backward navigation possible is (3.1), which permits calculation of the correct equiangular direction u A at each step. In this sense Lasso can be just as well thought of as a backward-moving algorithm. This is not the case for LARS or Stagewise, both of which are inherently forward-moving algorithms. 4. Degrees of freedom and C p estimates. Figures 1 and 3 show all possible Lasso, Stagewise or LARS estimates of the vector β for the diabetes data. The scientists want just a single β of course, so we need some rule for selecting among the possibilities. This section concerns a C p -type selection criterion, especially as it applies to the choice of LARS estimate. Let µ = g(y) represent a formula for estimating µ from the data vector y. Here, as usual in regression situations, we are considering the covariate vectors x 1, x 2,...,x m fixed at their observed values. We assume that given the x s, y is generated according to an homoskedastic model (4.1) y (µ,σ 2 I), meaning that the components y i are uncorrelated, with mean µ i and variance σ 2. Taking expectations in the identity (4.2) ( µ i µ i ) 2 = (y i µ i ) 2 (y i µ i ) 2 + 2( µ i µ i )(y i µ i ),

17 and summing over i, yields { µ µ 2 } { y µ 2 (4.3) E = E σ 2 LEAST ANGLE REGRESSION 423 σ 2 } n + 2 n i=1 cov( µ i,y i ) σ 2. The last term of (4.3) leads to a convenient definition of the degrees of freedom for an estimator µ = g(y), n (4.4) df µ,σ 2 = cov( µ i,y i )/σ 2, i=1 and a C p -type risk estimation formula, y µ 2 (4.5) C p ( µ) = σ 2 n + 2df µ,σ 2. If σ 2 and df µ,σ 2 are known, C p ( µ) is an unbiased estimator of the true risk E{ µ µ 2 /σ 2 }. For linear estimators µ = My, model (4.1) makes df µ,σ 2 = trace(m), equaling the usual definition of degrees of freedom for OLS, and coinciding with the proposal of Mallows (1973). Section 6 of Efron and Tibshirani (1997) and Section 7 of Efron (1986) discuss formulas (4.4) and (4.5) and their role in C p, Akaike information criterion (AIC) and Stein s unbiased risk estimated (SURE) estimation theory, a more recent reference being Ye (1998). Practical use of C p formula (4.5) requires preliminary estimates of µ,σ 2 and df µ,σ 2. In the numerical results below, the usual OLS estimates µ and σ 2 from the full OLS model were used to calculate bootstrap estimates of df µ,σ 2; bootstrap samples y and replications µ were then generated according to (4.6) y N( µ, σ 2 ) and µ = g(y ). Independently repeating (4.6) say B times gives straightforward estimates for the covariances in (4.4), Bb=1 µ i ĉov i = (b)[ y i (b) y i ( )] Bb=1, where y y (b) (4.7) ( ) =, B 1 B and then n (4.8) df = ĉov i / σ 2. i=1 Normality is not crucial in (4.6). Nearly the same results were obtained using y = µ + e, where the components of e were resampled from e = y µ. The left panel of Figure 6 shows df k for the diabetes data LARS estimates µ k,k = 1, 2,...,m= 10. It portrays a startlingly simple situation that we will call the simple approximation, (4.9) df ( µ k ) = k.

18 424 EFRON, HASTIE, JOHNSTONE AND TIBSHIRANI FIG. 6. Degrees of freedom for LARS estimates µ k : (left) diabetes study, Table 1, k = 1, 2,...,m = 10; (right) quadratic model (3.15) for the diabetes data, m = 64. Solid line is simple approximation df k = k. Dashed lines are approximate 95% confidence intervals for the bootstrap estimates. Each panel based on B = 500 bootstrap replications. The right panel also applies to the diabetes data, but this time with the quadratic model (3.15), having m = 64 predictors. We see that the simple approximation (4.9) is again accurate within the limits of the bootstrap computation (4.8), where B = 500 replications were divided into 10 groups of 50 each in order to calculate Student-t confidence intervals. If (4.9) can be believed, and we will offer some evidence in its behalf, we can estimate the risk of a k-step LARS estimator µ k by (4.10) C p ( µ k ) = y µ k 2 / σ 2 n + 2k. Theformula,whichisthesameastheC p estimate of risk for an OLS estimator based on a subset of k preselected predictor vectors, has the great advantage of not requiring any further calculations beyond those for the original LARS estimates. The formula applies only to LARS, and not to Lasso or Stagewise. Figure 7 displays C p ( µ k ) as a function of k for the two situations of Figure 6. Minimum C p was achieved at steps k = 7andk = 16, respectively. Both of the minimum C p models looked sensible, their first several selections of important covariates agreeing with an earlier model based on a detailed inspection of the data assisted by medical expertise. The simple approximation becomes a theorem in two cases. THEOREM 3. If the covariate vectors x 1, x 2,...,x m are mutually orthogonal, then the k-step LARS estimate ˆµ k has df ( ˆµ k ) = k. To state the second more general setting we introduce the following condition.

19 LEAST ANGLE REGRESSION 425 FIG. 7. C p estimates of risk (4.10) for the two situations of Figure 6: (left) m = 10 model has smallest C p at k = 7; (right) m = 64 model has smallest C p at k = 16. POSITIVE CONE CONDITION. For all possible subsets X A of the full design matrix X, (4.11) G 1 A 1 A > 0, where the inequality is taken element-wise. The positive cone condition holds if X is orthogonal. It is strictly more general than orthogonality, but counterexamples (such as the diabetes data) show that not all design matrices X satisfy it. It is also easy to show that LARS, Lasso and Stagewise all coincide under the positive cone condition, so the degrees-of-freedom formula applies to them too in this case. THEOREM 4. Under the positive cone condition, df ( ˆµ k ) = k. The proof, which appears later in this section, is an application of Stein s unbiased risk estimate (SURE) [Stein (1981)]. Suppose that g : R n R n is almost differentiable (see Remark A.1 in the Appendix) and set g = n i=1 g i / x i. If y N n (µ,σ 2 I), then Stein s formula states that n (4.12) cov(g i,y i )/σ 2 = E[ g(y)]. i=1 The left-hand side is df (g) for the general estimator g(y). Focusing specifically on LARS, it will turn out that ˆµ k (y) = k in all situations with probability 1, but that the continuity assumptions underlying (4.12) and SURE can fail in certain nonorthogonal cases where the positive cone condition does not hold.

20 426 EFRON, HASTIE, JOHNSTONE AND TIBSHIRANI A range of simulations suggested that the simple approximation is quite accurate even when the x j s are highly correlated and that it requires concerted effort at pathology to make df ( µ k ) much different than k. Stein s formula assumes normality, y N(µ,σ 2 I). A cruder delta method rationale for the simple approximation requires only homoskedasticity, (4.1). The geometry of Figure 4 implies (4.13) µ k = ȳ k cot k ȳ k+1 ȳ k, where cot k is the cotangent of the angle between u k and u k+1, u (4.14) k cot k = u k+1 [1 (u k u k+1) 2 ] 1/2. Let v k be the unit vector orthogonal to L(X b ), the linear space spanned by the first k covariates selected by LARS, and pointing into L(X k+1 ) along the direction of ȳ k+1 ȳ k.fory near y we can reexpress (4.13) as a locally linear transformation, (4.15) µ k = µ k + M k (y y) with M k = P k cot k u k v k, P k being the usual projection matrix from R n into L(X k ); (4.15) holds within a neighborhood of y such that the LARS choices L(X k ) and v k remain the same. The matrix M k has trace(m k ) = k. Since the trace equals the degrees of freedom for linear estimators, the simple approximation (4.9) is seen to be a delta method approximation to the bootstrap estimates (4.6) and (4.7). It is clear that (4.9) df ( µ k ) = k cannot hold for the Lasso, since the degrees of freedom is m for the full model but the total number of steps taken can exceed m. However, we have found empirically that an intuitively plausible result holds: the degrees of freedom is well approximated by the number of nonzero predictors in the model. Specifically, starting at step 0, let l(k) be the index of the last model in the Lasso sequence containing k predictors. Then df ( µ l(k) ) = k. We do not yet have any mathematical support for this claim Orthogonal designs. In the orthogonal case, we assume that x j = e j for j = 1,...,m. The LARS algorithm then has a particularly simple form, reducing to soft thresholding at the order statistics of the data. To be specific, define the soft thresholding operation on a scalar y 1 at threshold t by y 1 t, if y 1 >t, η(y 1 ; t)= 0, if y 1 t, y 1 + t, if y 1 < t. The order statistics of the absolute values of the data are denoted by (4.16) y (1) y (2) y (n) y (n+1) := 0. We note that y m+1,...,y n do not enter into the estimation procedure, and so we may as well assume that m = n.

### Degrees of Freedom and Model Search

Degrees of Freedom and Model Search Ryan J. Tibshirani Abstract Degrees of freedom is a fundamental concept in statistical modeling, as it provides a quantitative description of the amount of fitting performed

### Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

### On the degrees of freedom in shrinkage estimation

On the degrees of freedom in shrinkage estimation Kengo Kato Graduate School of Economics, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan kato ken@hkg.odn.ne.jp October, 2007 Abstract

### Stochastic Inventory Control

Chapter 3 Stochastic Inventory Control 1 In this chapter, we consider in much greater details certain dynamic inventory control problems of the type already encountered in section 1.3. In addition to the

### Section 1.1. Introduction to R n

The Calculus of Functions of Several Variables Section. Introduction to R n Calculus is the study of functional relationships and how related quantities change with each other. In your first exposure to

### Least Squares Estimation

Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

### OPRE 6201 : 2. Simplex Method

OPRE 6201 : 2. Simplex Method 1 The Graphical Method: An Example Consider the following linear program: Max 4x 1 +3x 2 Subject to: 2x 1 +3x 2 6 (1) 3x 1 +2x 2 3 (2) 2x 2 5 (3) 2x 1 +x 2 4 (4) x 1, x 2

### By W.E. Diewert. July, Linear programming problems are important for a number of reasons:

APPLIED ECONOMICS By W.E. Diewert. July, 3. Chapter : Linear Programming. Introduction The theory of linear programming provides a good introduction to the study of constrained maximization (and minimization)

### The Trip Scheduling Problem

The Trip Scheduling Problem Claudia Archetti Department of Quantitative Methods, University of Brescia Contrada Santa Chiara 50, 25122 Brescia, Italy Martin Savelsbergh School of Industrial and Systems

### December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation

### Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

### THE FUNDAMENTAL THEOREM OF ARBITRAGE PRICING

THE FUNDAMENTAL THEOREM OF ARBITRAGE PRICING 1. Introduction The Black-Scholes theory, which is the main subject of this course and its sequel, is based on the Efficient Market Hypothesis, that arbitrages

### Penalized regression: Introduction

Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

### The basic unit in matrix algebra is a matrix, generally expressed as: a 11 a 12. a 13 A = a 21 a 22 a 23

(copyright by Scott M Lynch, February 2003) Brief Matrix Algebra Review (Soc 504) Matrix algebra is a form of mathematics that allows compact notation for, and mathematical manipulation of, high-dimensional

### Some Polynomial Theorems. John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.

Some Polynomial Theorems by John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.com This paper contains a collection of 31 theorems, lemmas,

### We call this set an n-dimensional parallelogram (with one vertex 0). We also refer to the vectors x 1,..., x n as the edges of P.

Volumes of parallelograms 1 Chapter 8 Volumes of parallelograms In the present short chapter we are going to discuss the elementary geometrical objects which we call parallelograms. These are going to

### I. Solving Linear Programs by the Simplex Method

Optimization Methods Draft of August 26, 2005 I. Solving Linear Programs by the Simplex Method Robert Fourer Department of Industrial Engineering and Management Sciences Northwestern University Evanston,

### Numerical Analysis Lecture Notes

Numerical Analysis Lecture Notes Peter J. Olver 5. Inner Products and Norms The norm of a vector is a measure of its size. Besides the familiar Euclidean norm based on the dot product, there are a number

### NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

### Mathematics Course 111: Algebra I Part IV: Vector Spaces

Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 1996-7 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are

### Solution of Linear Systems

Chapter 3 Solution of Linear Systems In this chapter we study algorithms for possibly the most commonly occurring problem in scientific computing, the solution of linear systems of equations. We start

### max cx s.t. Ax c where the matrix A, cost vector c and right hand side b are given and x is a vector of variables. For this example we have x

Linear Programming Linear programming refers to problems stated as maximization or minimization of a linear function subject to constraints that are linear equalities and inequalities. Although the study

### ISOMETRIES OF R n KEITH CONRAD

ISOMETRIES OF R n KEITH CONRAD 1. Introduction An isometry of R n is a function h: R n R n that preserves the distance between vectors: h(v) h(w) = v w for all v and w in R n, where (x 1,..., x n ) = x

### 3. INNER PRODUCT SPACES

. INNER PRODUCT SPACES.. Definition So far we have studied abstract vector spaces. These are a generalisation of the geometric spaces R and R. But these have more structure than just that of a vector space.

### CHAPTER II THE LIMIT OF A SEQUENCE OF NUMBERS DEFINITION OF THE NUMBER e.

CHAPTER II THE LIMIT OF A SEQUENCE OF NUMBERS DEFINITION OF THE NUMBER e. This chapter contains the beginnings of the most important, and probably the most subtle, notion in mathematical analysis, i.e.,

### 5 =5. Since 5 > 0 Since 4 7 < 0 Since 0 0

a p p e n d i x e ABSOLUTE VALUE ABSOLUTE VALUE E.1 definition. The absolute value or magnitude of a real number a is denoted by a and is defined by { a if a 0 a = a if a

### Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### Model selection in R featuring the lasso. Chris Franck LISA Short Course March 26, 2013

Model selection in R featuring the lasso Chris Franck LISA Short Course March 26, 2013 Goals Overview of LISA Classic data example: prostate data (Stamey et. al) Brief review of regression and model selection.

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES. From Exploratory Factor Analysis Ledyard R Tucker and Robert C.

CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES From Exploratory Factor Analysis Ledyard R Tucker and Robert C MacCallum 1997 180 CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES In

Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650

### Introduction to Flocking {Stochastic Matrices}

Supelec EECI Graduate School in Control Introduction to Flocking {Stochastic Matrices} A. S. Morse Yale University Gif sur - Yvette May 21, 2012 CRAIG REYNOLDS - 1987 BOIDS The Lion King CRAIG REYNOLDS

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### Linear Programming. March 14, 2014

Linear Programming March 1, 01 Parts of this introduction to linear programming were adapted from Chapter 9 of Introduction to Algorithms, Second Edition, by Cormen, Leiserson, Rivest and Stein [1]. 1

### Nonlinear Iterative Partial Least Squares Method

Numerical Methods for Determining Principal Component Analysis Abstract Factors Béchu, S., Richard-Plouet, M., Fernandez, V., Walton, J., and Fairley, N. (2016) Developments in numerical treatments for

### Section Notes 3. The Simplex Algorithm. Applied Math 121. Week of February 14, 2011

Section Notes 3 The Simplex Algorithm Applied Math 121 Week of February 14, 2011 Goals for the week understand how to get from an LP to a simplex tableau. be familiar with reduced costs, optimal solutions,

### Chapter 21: The Discounted Utility Model

Chapter 21: The Discounted Utility Model 21.1: Introduction This is an important chapter in that it introduces, and explores the implications of, an empirically relevant utility function representing intertemporal

### State of Stress at Point

State of Stress at Point Einstein Notation The basic idea of Einstein notation is that a covector and a vector can form a scalar: This is typically written as an explicit sum: According to this convention,

### Probability and Statistics

CHAPTER 2: RANDOM VARIABLES AND ASSOCIATED FUNCTIONS 2b - 0 Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be

### Common Core Unit Summary Grades 6 to 8

Common Core Unit Summary Grades 6 to 8 Grade 8: Unit 1: Congruence and Similarity- 8G1-8G5 rotations reflections and translations,( RRT=congruence) understand congruence of 2 d figures after RRT Dilations

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### Practice Problems Solutions

Practice Problems Solutions The problems below have been carefully selected to illustrate common situations and the techniques and tricks to deal with these. Try to master them all; it is well worth it!

### New insights on the mean-variance portfolio selection from de Finetti s suggestions. Flavio Pressacco and Paolo Serafini, Università di Udine

New insights on the mean-variance portfolio selection from de Finetti s suggestions Flavio Pressacco and Paolo Serafini, Università di Udine Abstract: In this paper we offer an alternative approach to

### GRADES 7, 8, AND 9 BIG IDEAS

Table 1: Strand A: BIG IDEAS: MATH: NUMBER Introduce perfect squares, square roots, and all applications Introduce rational numbers (positive and negative) Introduce the meaning of negative exponents for

### Matrix Algebra LECTURE 1. Simultaneous Equations Consider a system of m linear equations in n unknowns: y 1 = a 11 x 1 + a 12 x 2 + +a 1n x n,

LECTURE 1 Matrix Algebra Simultaneous Equations Consider a system of m linear equations in n unknowns: y 1 a 11 x 1 + a 12 x 2 + +a 1n x n, (1) y 2 a 21 x 1 + a 22 x 2 + +a 2n x n, y m a m1 x 1 +a m2 x

### 2.5 Gaussian Elimination

page 150 150 CHAPTER 2 Matrices and Systems of Linear Equations 37 10 the linear algebra package of Maple, the three elementary 20 23 1 row operations are 12 1 swaprow(a,i,j): permute rows i and j 3 3

### Chapter 2 Portfolio Management and the Capital Asset Pricing Model

Chapter 2 Portfolio Management and the Capital Asset Pricing Model In this chapter, we explore the issue of risk management in a portfolio of assets. The main issue is how to balance a portfolio, that

### 5. Multiple regression

5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful

### No: 10 04. Bilkent University. Monotonic Extension. Farhad Husseinov. Discussion Papers. Department of Economics

No: 10 04 Bilkent University Monotonic Extension Farhad Husseinov Discussion Papers Department of Economics The Discussion Papers of the Department of Economics are intended to make the initial results

### Tangent and normal lines to conics

4.B. Tangent and normal lines to conics Apollonius work on conics includes a study of tangent and normal lines to these curves. The purpose of this document is to relate his approaches to the modern viewpoints

### THE DIMENSION OF A VECTOR SPACE

THE DIMENSION OF A VECTOR SPACE KEITH CONRAD This handout is a supplementary discussion leading up to the definition of dimension and some of its basic properties. Let V be a vector space over a field

### Chapter 4 One Dimensional Kinematics

Chapter 4 One Dimensional Kinematics 41 Introduction 1 4 Position, Time Interval, Displacement 41 Position 4 Time Interval 43 Displacement 43 Velocity 3 431 Average Velocity 3 433 Instantaneous Velocity

### Linear Models and Conjoint Analysis with Nonlinear Spline Transformations

Linear Models and Conjoint Analysis with Nonlinear Spline Transformations Warren F. Kuhfeld Mark Garratt Abstract Many common data analysis models are based on the general linear univariate model, including

### t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d).

1. Line Search Methods Let f : R n R be given and suppose that x c is our current best estimate of a solution to P min x R nf(x). A standard method for improving the estimate x c is to choose a direction

### Solving Simultaneous Equations and Matrices

Solving Simultaneous Equations and Matrices The following represents a systematic investigation for the steps used to solve two simultaneous linear equations in two unknowns. The motivation for considering

### Inner Product Spaces

Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and

### Overview of Math Standards

Algebra 2 Welcome to math curriculum design maps for Manhattan- Ogden USD 383, striving to produce learners who are: Effective Communicators who clearly express ideas and effectively communicate with diverse

### Date: April 12, 2001. Contents

2 Lagrange Multipliers Date: April 12, 2001 Contents 2.1. Introduction to Lagrange Multipliers......... p. 2 2.2. Enhanced Fritz John Optimality Conditions...... p. 12 2.3. Informative Lagrange Multipliers...........

### MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS 1. SYSTEMS OF EQUATIONS AND MATRICES 1.1. Representation of a linear system. The general system of m equations in n unknowns can be written a 11 x 1 + a 12 x 2 +

### KEANSBURG SCHOOL DISTRICT KEANSBURG HIGH SCHOOL Mathematics Department. HSPA 10 Curriculum. September 2007

KEANSBURG HIGH SCHOOL Mathematics Department HSPA 10 Curriculum September 2007 Written by: Karen Egan Mathematics Supervisor: Ann Gagliardi 7 days Sample and Display Data (Chapter 1 pp. 4-47) Surveys and

### ARE211, Fall2012. Contents. 2. Linear Algebra Preliminary: Level Sets, upper and lower contour sets and Gradient vectors 1

ARE11, Fall1 LINALGEBRA1: THU, SEP 13, 1 PRINTED: SEPTEMBER 19, 1 (LEC# 7) Contents. Linear Algebra 1.1. Preliminary: Level Sets, upper and lower contour sets and Gradient vectors 1.. Vectors as arrows.

### Lecture 1 Newton s method.

Lecture 1 Newton s method. 1 Square roots 2 Newton s method. The guts of the method. A vector version. Implementation. The existence theorem. Basins of attraction. The Babylonian algorithm for finding

### Factor analysis. Angela Montanari

Factor analysis Angela Montanari 1 Introduction Factor analysis is a statistical model that allows to explain the correlations between a large number of observed correlated variables through a small number

### 24. The Branch and Bound Method

24. The Branch and Bound Method It has serious practical consequences if it is known that a combinatorial problem is NP-complete. Then one can conclude according to the present state of science that no

### Some Notes on Taylor Polynomials and Taylor Series

Some Notes on Taylor Polynomials and Taylor Series Mark MacLean October 3, 27 UBC s courses MATH /8 and MATH introduce students to the ideas of Taylor polynomials and Taylor series in a fairly limited

### Solution to Homework 2

Solution to Homework 2 Olena Bormashenko September 23, 2011 Section 1.4: 1(a)(b)(i)(k), 4, 5, 14; Section 1.5: 1(a)(b)(c)(d)(e)(n), 2(a)(c), 13, 16, 17, 18, 27 Section 1.4 1. Compute the following, if

### Lecture 3: Linear Programming Relaxations and Rounding

Lecture 3: Linear Programming Relaxations and Rounding 1 Approximation Algorithms and Linear Relaxations For the time being, suppose we have a minimization problem. Many times, the problem at hand can

### CHAPTER 17. Linear Programming: Simplex Method

CHAPTER 17 Linear Programming: Simplex Method CONTENTS 17.1 AN ALGEBRAIC OVERVIEW OF THE SIMPLEX METHOD Algebraic Properties of the Simplex Method Determining a Basic Solution Basic Feasible Solution 17.2

### http://www.jstor.org This content downloaded on Tue, 19 Feb 2013 17:28:43 PM All use subject to JSTOR Terms and Conditions

A Significance Test for Time Series Analysis Author(s): W. Allen Wallis and Geoffrey H. Moore Reviewed work(s): Source: Journal of the American Statistical Association, Vol. 36, No. 215 (Sep., 1941), pp.

### 1 Portfolio mean and variance

Copyright c 2005 by Karl Sigman Portfolio mean and variance Here we study the performance of a one-period investment X 0 > 0 (dollars) shared among several different assets. Our criterion for measuring

### MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

### Continued Fractions and the Euclidean Algorithm

Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction

### THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok

THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE Alexer Barvinok Papers are available at http://www.math.lsa.umich.edu/ barvinok/papers.html This is a joint work with J.A. Hartigan

### Smoothing and Non-Parametric Regression

Smoothing and Non-Parametric Regression Germán Rodríguez grodri@princeton.edu Spring, 2001 Objective: to estimate the effects of covariates X on a response y nonparametrically, letting the data suggest

### 1 VECTOR SPACES AND SUBSPACES

1 VECTOR SPACES AND SUBSPACES What is a vector? Many are familiar with the concept of a vector as: Something which has magnitude and direction. an ordered pair or triple. a description for quantities such

### MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

### 1. R In this and the next section we are going to study the properties of sequences of real numbers.

+a 1. R In this and the next section we are going to study the properties of sequences of real numbers. Definition 1.1. (Sequence) A sequence is a function with domain N. Example 1.2. A sequence of real

### Similarity and Diagonalization. Similar Matrices

MATH022 Linear Algebra Brief lecture notes 48 Similarity and Diagonalization Similar Matrices Let A and B be n n matrices. We say that A is similar to B if there is an invertible n n matrix P such that

### x if x 0, x if x < 0.

Chapter 3 Sequences In this chapter, we discuss sequences. We say what it means for a sequence to converge, and define the limit of a convergent sequence. We begin with some preliminary results about the

### The Ideal Class Group

Chapter 5 The Ideal Class Group We will use Minkowski theory, which belongs to the general area of geometry of numbers, to gain insight into the ideal class group of a number field. We have already mentioned

### LINEAR ALGEBRA W W L CHEN

LINEAR ALGEBRA W W L CHEN c W W L Chen, 1997, 2008 This chapter is available free to all individuals, on understanding that it is not to be used for financial gain, and may be downloaded and/or photocopied,

### SECOND DERIVATIVE TEST FOR CONSTRAINED EXTREMA

SECOND DERIVATIVE TEST FOR CONSTRAINED EXTREMA This handout presents the second derivative test for a local extrema of a Lagrange multiplier problem. The Section 1 presents a geometric motivation for the

### Linear Codes. Chapter 3. 3.1 Basics

Chapter 3 Linear Codes In order to define codes that we can encode and decode efficiently, we add more structure to the codespace. We shall be mainly interested in linear codes. A linear code of length

### Factorization Theorems

Chapter 7 Factorization Theorems This chapter highlights a few of the many factorization theorems for matrices While some factorization results are relatively direct, others are iterative While some factorization

### Linear Algebra: Vectors

A Linear Algebra: Vectors A Appendix A: LINEAR ALGEBRA: VECTORS TABLE OF CONTENTS Page A Motivation A 3 A2 Vectors A 3 A2 Notational Conventions A 4 A22 Visualization A 5 A23 Special Vectors A 5 A3 Vector

### MATHEMATICAL INDUCTION AND DIFFERENCE EQUATIONS

1 CHAPTER 6. MATHEMATICAL INDUCTION AND DIFFERENCE EQUATIONS 1 INSTITIÚID TEICNEOLAÍOCHTA CHEATHARLACH INSTITUTE OF TECHNOLOGY CARLOW MATHEMATICAL INDUCTION AND DIFFERENCE EQUATIONS 1 Introduction Recurrence

### On the Interaction and Competition among Internet Service Providers

On the Interaction and Competition among Internet Service Providers Sam C.M. Lee John C.S. Lui + Abstract The current Internet architecture comprises of different privately owned Internet service providers

### discuss how to describe points, lines and planes in 3 space.

Chapter 2 3 Space: lines and planes In this chapter we discuss how to describe points, lines and planes in 3 space. introduce the language of vectors. discuss various matters concerning the relative position

### Lecture L3 - Vectors, Matrices and Coordinate Transformations

S. Widnall 16.07 Dynamics Fall 2009 Lecture notes based on J. Peraire Version 2.0 Lecture L3 - Vectors, Matrices and Coordinate Transformations By using vectors and defining appropriate operations between

### Notes on Factoring. MA 206 Kurt Bryan

The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor

### THREE DIMENSIONAL GEOMETRY

Chapter 8 THREE DIMENSIONAL GEOMETRY 8.1 Introduction In this chapter we present a vector algebra approach to three dimensional geometry. The aim is to present standard properties of lines and planes,

### Math Review. for the Quantitative Reasoning Measure of the GRE revised General Test

Math Review for the Quantitative Reasoning Measure of the GRE revised General Test www.ets.org Overview This Math Review will familiarize you with the mathematical skills and concepts that are important

### Pythagorean Theorem. Overview. Grade 8 Mathematics, Quarter 3, Unit 3.1. Number of instructional days: 15 (1 day = minutes) Essential questions

Grade 8 Mathematics, Quarter 3, Unit 3.1 Pythagorean Theorem Overview Number of instructional days: 15 (1 day = 45 60 minutes) Content to be learned Prove the Pythagorean Theorem. Given three side lengths,

### Using simulation to calculate the NPV of a project

Using simulation to calculate the NPV of a project Marius Holtan Onward Inc. 5/31/2002 Monte Carlo simulation is fast becoming the technology of choice for evaluating and analyzing assets, be it pure financial

### IRREDUCIBLE OPERATOR SEMIGROUPS SUCH THAT AB AND BA ARE PROPORTIONAL. 1. Introduction

IRREDUCIBLE OPERATOR SEMIGROUPS SUCH THAT AB AND BA ARE PROPORTIONAL R. DRNOVŠEK, T. KOŠIR Dedicated to Prof. Heydar Radjavi on the occasion of his seventieth birthday. Abstract. Let S be an irreducible

### (Refer Slide Time: 00:00:56 min)

Numerical Methods and Computation Prof. S.R.K. Iyengar Department of Mathematics Indian Institute of Technology, Delhi Lecture No # 3 Solution of Nonlinear Algebraic Equations (Continued) (Refer Slide