
1 Author(s): Kerby Shedden, Ph.D., 2010

License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License. We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use, share, and adapt it. The citation key on the following slide provides information about how you may share and adapt this material. Copyright holders of content included in this material should contact with any questions, corrections, or clarification regarding the use of content. For more information about how to cite these materials visit

Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis or a replacement for medical evaluation, advice, diagnosis or treatment by a healthcare professional. Please speak to your physician if you have questions about your medical condition. Viewer discretion is advised: Some medical content is graphic and may not be suitable for all viewers.

2 Decomposing Variance

Kerby Shedden
Department of Statistics, University of Michigan
October 5, 2010

3 Law of total variation

For any regression model involving a response $Y$ and a covariate vector $X$, we have

$$\mathrm{var}(Y) = \mathrm{var}_X E(Y|X) + E_X \mathrm{var}(Y|X).$$

Note that this only makes sense if we treat $X$ as being random. We often wish to distinguish these two situations:

The population is homoscedastic: $\mathrm{var}(Y|X)$ does not depend on $X$, so we can simply write $\mathrm{var}(Y|X) = \sigma^2$, and we get $\mathrm{var}(Y) = \mathrm{var}_X E(Y|X) + \sigma^2$.

The population is heteroscedastic: $\mathrm{var}(Y|X)$ is a function $\sigma^2(X)$ with expected value $\sigma^2 = E_X \sigma^2(X)$, and again we get $\mathrm{var}(Y) = \mathrm{var}_X E(Y|X) + \sigma^2$.

If we write $Y = f(X) + \epsilon$ with $E(\epsilon|X) = 0$, then $E(Y|X) = f(X)$, and $\mathrm{var}_X E(Y|X)$ summarizes the variation of $f(X)$ over the marginal distribution of $X$.
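
As a concrete check, here is a minimal Monte Carlo sketch (the heteroscedastic model and all numbers are illustrative assumptions, not from the slides) comparing $\mathrm{var}(Y)$ with the sum of the two terms:

```python
import numpy as np

# Illustrative model: X ~ N(0, 1) and Y | X ~ N(2 + 3X, 1 + X^2), so
# var_X E(Y|X) = 9 and E_X var(Y|X) = E(1 + X^2) = 2, giving var(Y) = 11.
rng = np.random.default_rng(0)
n = 1_000_000
X = rng.normal(size=n)
Y = 2 + 3 * X + np.sqrt(1 + X**2) * rng.normal(size=n)

var_total = Y.var()                    # estimates var(Y)
var_of_mean = (2 + 3 * X).var()        # estimates var_X E(Y|X)
mean_of_var = (1 + X**2).mean()        # estimates E_X var(Y|X)
print(var_total, var_of_mean + mean_of_var)   # both approximately 11
```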

4 Law of total variation

[Figure: conditional and marginal distributions of $Y$ over a range of $X$, with the conditional mean curve $E(Y|X)$.]
Orange curves: conditional distributions of $Y$ given $X$.
Purple curve: marginal distribution of $Y$.
Black dots: conditional means of $Y$ given $X$.

5 Pearson correlation

The population Pearson correlation coefficient of two jointly distributed scalar-valued random variables $X$ and $Y$ is

$$\rho_{XY} \equiv \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}.$$

Given data $Y = (Y_1, \ldots, Y_n)'$ and $X = (X_1, \ldots, X_n)'$, the Pearson correlation coefficient is estimated by

$$\hat\rho_{XY} = \frac{\widehat{\mathrm{cov}}(X, Y)}{\hat\sigma_X \hat\sigma_Y} = \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_i (X_i - \bar X)^2}\sqrt{\sum_i (Y_i - \bar Y)^2}} = \frac{(X - \bar X)'(Y - \bar Y)}{\|X - \bar X\| \cdot \|Y - \bar Y\|}.$$

When we write $Y - \bar Y$ here, this means $Y - \bar Y \mathbf{1}$, where $\mathbf{1}$ is a vector of 1's, and $\bar Y$ is a scalar.
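
The vector form is easy to verify numerically; the following sketch (simulated data, not from the slides) compares it with numpy's built-in correlation:

```python
import numpy as np

# Compute rho_hat = (X - Xbar)'(Y - Ybar) / (||X - Xbar|| ||Y - Ybar||)
# and check it against np.corrcoef on simulated data.
rng = np.random.default_rng(1)
X = rng.normal(size=100)
Y = 0.5 * X + rng.normal(size=100)

Xc, Yc = X - X.mean(), Y - Y.mean()
rho_hat = (Xc @ Yc) / (np.linalg.norm(Xc) * np.linalg.norm(Yc))
print(rho_hat, np.corrcoef(X, Y)[0, 1])   # identical up to rounding
```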

6 Pearson correlation

By the Cauchy-Schwarz inequality,

$$-1 \le \rho_{XY} \le 1, \qquad -1 \le \hat\rho_{XY} \le 1.$$

The sample correlation coefficient is slightly biased, but the bias is so small that it is usually ignored.

7 Pearson correlation and simple linear regression slopes

For the simple linear regression model $Y = \alpha + \beta X + \epsilon$, if we view $X$ as a random variable that is uncorrelated with $\epsilon$, then

$$\mathrm{cov}(X, Y) = \beta \sigma_X^2,$$

and the correlation is

$$\rho_{XY} \equiv \mathrm{cor}(X, Y) = \frac{\beta}{\sqrt{\beta^2 + \sigma^2/\sigma_X^2}}.$$

The sample correlation coefficient is related to the least squares slope estimate:

$$\hat\beta = \frac{\widehat{\mathrm{cov}}(X, Y)}{\hat\sigma_X^2} = \hat\rho_{XY}\frac{\hat\sigma_Y}{\hat\sigma_X}.$$
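
A quick numerical illustration of the slope/correlation identity (the data generating values are arbitrary assumptions for the sketch):

```python
import numpy as np

# Check beta_hat = rho_hat * sigma_hat_Y / sigma_hat_X on simulated data.
rng = np.random.default_rng(2)
X = rng.normal(size=500)
Y = 1.0 + 2.0 * X + rng.normal(size=500)

beta_hat = np.cov(X, Y)[0, 1] / X.var(ddof=1)   # cov-hat(X, Y) / var-hat(X)
rho_hat = np.corrcoef(X, Y)[0, 1]
print(beta_hat, rho_hat * Y.std(ddof=1) / X.std(ddof=1))   # agree exactly
```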

8 Orthogonality between fitted values and residuals

Recall that the fitted values are

$$\hat Y = X\hat\beta = PY,$$

and the residuals are

$$R = Y - \hat Y = (I - P)Y.$$

Since $P(I - P) = 0$, it follows that $\hat Y'R = 0$. Since $\bar R = 0$ (assuming the model contains an intercept), it is equivalent to state that the sample correlation between $R$ and $\hat Y$ is zero, i.e. $\widehat{\mathrm{cor}}(R, \hat Y) = 0$.

9 Coefficient of determination

A descriptive summary of the explanatory power of $X$ for $Y$ is given by the coefficient of determination, also known as the proportion of explained variance, or multiple $R^2$. This is the quantity

$$R^2 \equiv 1 - \frac{\|Y - \hat Y\|^2}{\|Y - \bar Y\|^2} = \frac{\|\hat Y - \bar Y\|^2}{\|Y - \bar Y\|^2} = \frac{\widehat{\mathrm{var}}(\hat Y)}{\widehat{\mathrm{var}}(Y)}.$$

The equivalence between the two expressions follows from the identity

$$\|Y - \bar Y\|^2 = \|Y - \hat Y + \hat Y - \bar Y\|^2 = \|Y - \hat Y\|^2 + \|\hat Y - \bar Y\|^2 + 2(Y - \hat Y)'(\hat Y - \bar Y) = \|Y - \hat Y\|^2 + \|\hat Y - \bar Y\|^2,$$

since the residuals $Y - \hat Y$ are orthogonal to both $\hat Y$ and $\bar Y \mathbf{1}$. It should be clear that $R^2 = 0$ iff $\hat Y = \bar Y$ and $R^2 = 1$ iff $\hat Y = Y$.
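
The following sketch (simulated design and coefficients are illustrative) fits least squares with numpy, confirms the orthogonality used in the identity, and computes $R^2$ from both expressions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
Yhat = X @ beta_hat
R = Y - Yhat
print(Yhat @ R)   # ~0: residuals are orthogonal to the fitted values

ss_tot = np.sum((Y - Y.mean()) ** 2)
r2_from_resid = 1 - np.sum(R ** 2) / ss_tot
r2_from_fit = np.sum((Yhat - Y.mean()) ** 2) / ss_tot
print(r2_from_resid, r2_from_fit)   # equal, by the decomposition above
```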

10 Coefficient of determination

The coefficient of determination is equal to $\widehat{\mathrm{cor}}(\hat Y, Y)^2$. To see this, note that

$$\widehat{\mathrm{cor}}(\hat Y, Y) = \frac{(\hat Y - \bar Y)'(Y - \bar Y)}{\|\hat Y - \bar Y\| \cdot \|Y - \bar Y\|} = \frac{(\hat Y - \bar Y)'(Y - \hat Y + \hat Y - \bar Y)}{\|\hat Y - \bar Y\| \cdot \|Y - \bar Y\|} = \frac{(\hat Y - \bar Y)'(Y - \hat Y) + (\hat Y - \bar Y)'(\hat Y - \bar Y)}{\|\hat Y - \bar Y\| \cdot \|Y - \bar Y\|} = \frac{\|\hat Y - \bar Y\|}{\|Y - \bar Y\|},$$

where the first term in the numerator vanishes because the residuals are orthogonal to the fitted values and sum to zero. Squaring both sides gives $\widehat{\mathrm{cor}}(\hat Y, Y)^2 = \|\hat Y - \bar Y\|^2/\|Y - \bar Y\|^2 = R^2$.

11 Coefficient of determination in simple linear regression

In general,

$$R^2 = \widehat{\mathrm{cor}}(Y, \hat Y)^2 = \frac{\widehat{\mathrm{cov}}(Y, \hat Y)^2}{\widehat{\mathrm{var}}(Y)\,\widehat{\mathrm{var}}(\hat Y)}.$$

In the case of simple linear regression,

$$\widehat{\mathrm{cov}}(Y, \hat Y) = \widehat{\mathrm{cov}}(Y, \hat\alpha + \hat\beta X) = \hat\beta\,\widehat{\mathrm{cov}}(Y, X),$$

and

$$\widehat{\mathrm{var}}(\hat Y) = \widehat{\mathrm{var}}(\hat\alpha + \hat\beta X) = \hat\beta^2\,\widehat{\mathrm{var}}(X).$$

Thus for simple linear regression, $R^2 = \widehat{\mathrm{cor}}(Y, X)^2 = \widehat{\mathrm{cor}}(Y, \hat Y)^2$.

12 Relationship to the F statistic

The F-statistic for the null hypothesis $\beta_1 = \cdots = \beta_p = 0$ is

$$\frac{\|\hat Y - \bar Y\|^2/p}{\|Y - \hat Y\|^2/(n - p - 1)} = \frac{R^2}{1 - R^2} \cdot \frac{n - p - 1}{p},$$

which is an increasing function of $R^2$.
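
A short sketch checking the two forms of the F statistic against each other (design and coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 0.5, 0.0, -0.5]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
Yhat = X @ beta_hat

# F from sums of squares:
f_ss = (np.sum((Yhat - Y.mean()) ** 2) / p) / \
       (np.sum((Y - Yhat) ** 2) / (n - p - 1))
# F from R^2:
r2 = 1 - np.sum((Y - Yhat) ** 2) / np.sum((Y - Y.mean()) ** 2)
f_r2 = (r2 / (1 - r2)) * (n - p - 1) / p
print(f_ss, f_r2)   # same value
```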

13 Adjusted R^2

The sample $R^2$ is an estimate of the population $R^2$:

$$1 - \frac{\mathrm{var}(Y|X)}{\mathrm{var}(Y)}.$$

Since it is a ratio, the plug-in estimate $R^2$ is biased, although the bias is not large unless the sample size is small or the number of covariates is large. The adjusted $R^2$ is an approximately unbiased estimate of the population $R^2$:

$$1 - (1 - R^2)\frac{n - 1}{n - p - 1}.$$

The adjusted $R^2$ is always less than the unadjusted $R^2$. The adjusted $R^2$ is always less than or equal to one, but can be negative.
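
The adjustment is a one-line formula; a direct transcription (with made-up numbers for illustration):

```python
def adjusted_r2(r2, n, p):
    """Approximately unbiased estimate of the population R^2."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# A modest R^2 with many covariates is adjusted down substantially:
print(adjusted_r2(0.30, n=50, p=10))   # about 0.12, well below the raw 0.30
```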

14 The unique variation in one covariate

How much information about $Y$ is present in a covariate $X_k$? This question is not straightforward when the covariates are non-orthogonal, since several covariates may contain overlapping information about $Y$.

Let $X_k^\perp$ be the residual of $X_k$ after regressing it against all other covariates (including the intercept). If $P_{-k}$ is the projection onto $\mathrm{span}(\{X_j, j \ne k\})$, then

$$X_k^\perp = (I - P_{-k})X_k.$$

We could use $\widehat{\mathrm{var}}(X_k^\perp)/\widehat{\mathrm{var}}(X_k)$ to assess how much of the variation in $X_k$ is unique, in that it is not also captured by other predictors. But this measure doesn't involve $Y$, so it can't tell us whether the unique variation in $X_k$ is useful in the regression analysis.

15 The unique regression information in one covariate

To learn how $X_k$ contributes uniquely to the regression, we can consider how introducing $X_k$ to a working regression model affects the $R^2$.

Let $\hat Y_{-k} = P_{-k}Y$ be the fitted values in the model omitting covariate $k$. Let $R^2$ denote the multiple $R^2$ for the full model, and let $R_{-k}^2$ be the multiple $R^2$ for the regression omitting covariate $X_k$.

The value of $R^2 - R_{-k}^2$ is a way to quantify how much unique information about $Y$ in $X_k$ is not captured by the other covariates. This is called the semi-partial $R^2$.

16 Identity involving norms of fitted values and residuals

Before we continue, we will need a simple identity that is often useful. In general, if $A$ and $B$ are orthogonal, then $\|A + B\|^2 = \|A\|^2 + \|B\|^2$. If $A$ and $B - A$ are orthogonal, then

$$\|B\|^2 = \|B - A + A\|^2 = \|B - A\|^2 + \|A\|^2.$$

Thus we have $\|B\|^2 - \|A\|^2 = \|B - A\|^2$.

Applying this fact to regression, we know that the fitted values and residuals are orthogonal. Thus for the regression omitting variable $k$, $\hat Y_{-k}$ and $Y - \hat Y_{-k}$ are orthogonal, so

$$\|Y - \hat Y_{-k}\|^2 = \|Y\|^2 - \|\hat Y_{-k}\|^2.$$

By the same argument, $\|Y - \hat Y\|^2 = \|Y\|^2 - \|\hat Y\|^2$.

17 Improvement in R^2 due to one covariate

Now we can obtain a simple, direct expression for the semi-partial $R^2$. Since $X_k^\perp$ is orthogonal to the other covariates,

$$\hat Y = \hat Y_{-k} + \frac{\langle Y, X_k^\perp \rangle}{\langle X_k^\perp, X_k^\perp \rangle} X_k^\perp,$$

so

$$\|\hat Y\|^2 = \|\hat Y_{-k}\|^2 + \langle Y, X_k^\perp \rangle^2/\|X_k^\perp\|^2.$$

18 Improvement in R^2 due to one covariate

Thus we have

$$R^2 = 1 - \frac{\|Y - \hat Y\|^2}{\|Y - \bar Y\|^2} = 1 - \frac{\|Y\|^2 - \|\hat Y\|^2}{\|Y - \bar Y\|^2} = 1 - \frac{\|Y\|^2 - \|\hat Y_{-k}\|^2 - \langle Y, X_k^\perp \rangle^2/\|X_k^\perp\|^2}{\|Y - \bar Y\|^2} = 1 - \frac{\|Y - \hat Y_{-k}\|^2}{\|Y - \bar Y\|^2} + \frac{\langle Y, X_k^\perp \rangle^2/\|X_k^\perp\|^2}{\|Y - \bar Y\|^2} = R_{-k}^2 + \frac{\langle Y, X_k^\perp \rangle^2/\|X_k^\perp\|^2}{\|Y - \bar Y\|^2}.$$

19 Semi-partial R^2

Thus the semi-partial $R^2$ is

$$R^2 - R_{-k}^2 = \frac{\langle Y, X_k^\perp \rangle^2/\|X_k^\perp\|^2}{\|Y - \bar Y\|^2}.$$

Since $X_k^\perp/\|X_k^\perp\|$ is centered and has length 1, it follows that

$$R^2 - R_{-k}^2 = \widehat{\mathrm{cor}}(Y, X_k^\perp)^2 = \widehat{\mathrm{cor}}(Y, \hat Y_k^\perp)^2,$$

where $\hat Y_k^\perp$ is the fitted value for regressing $Y$ on $X_k^\perp$.

Thus the semi-partial $R^2$ for covariate $k$ has two equivalent interpretations:

It is the improvement in $R^2$ resulting from including covariate $k$ in a working regression model that already contains the other covariates.

It is the $R^2$ for a simple linear regression of $Y$ on $X_k^\perp = (I - P_{-k})X_k$.

20 Partial R^2

The partial $R^2$ is

$$\frac{R^2 - R_{-k}^2}{1 - R_{-k}^2} = \frac{\langle Y, X_k^\perp \rangle^2/\|X_k^\perp\|^2}{\|Y - \hat Y_{-k}\|^2}.$$

The partial $R^2$ for covariate $k$ is the fraction of the maximum possible improvement in $R^2$ that is contributed by covariate $k$.

Let $\hat Y_{-k}$ be the fitted values for regressing $Y$ on all covariates except $X_k$. Since $\hat Y_{-k}'X_k^\perp = 0$,

$$\frac{\langle Y, X_k^\perp \rangle^2}{\|Y - \hat Y_{-k}\|^2 \|X_k^\perp\|^2} = \frac{\langle Y - \hat Y_{-k}, X_k^\perp \rangle^2}{\|Y - \hat Y_{-k}\|^2 \|X_k^\perp\|^2}.$$

The expression on the left is the usual $R^2$ that would be obtained when regressing $Y - \hat Y_{-k}$ on $X_k^\perp$. Thus the partial $R^2$ is the same as the usual $R^2$ for $(I - P_{-k})Y$ regressed on $(I - P_{-k})X_k$.
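
The sketch below (simulated correlated covariates; all names are illustrative) computes the semi-partial and partial $R^2$ for one covariate both from differences of $R^2$'s and from the residualized regressions, confirming the equivalences on the last two slides:

```python
import numpy as np

def r2(X, Y):
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ b
    return 1 - resid @ resid / np.sum((Y - Y.mean()) ** 2)

rng = np.random.default_rng(5)
n = 500
Z = rng.normal(size=(n, 3))
Z[:, 2] += 0.7 * Z[:, 1]                   # make the covariates correlated
X = np.column_stack([np.ones(n), Z])
Y = X @ np.array([1.0, 1.0, 1.0, 1.0]) + rng.normal(size=n)

k = 3                                      # covariate of interest
X_mk = np.delete(X, k, axis=1)             # intercept and other covariates
r2_full, r2_mk = r2(X, Y), r2(X_mk, Y)

# X_k_perp = (I - P_{-k}) X_k: residual of X_k on the other covariates.
g, *_ = np.linalg.lstsq(X_mk, X[:, k], rcond=None)
Xk_perp = X[:, k] - X_mk @ g

semi_partial = r2_full - r2_mk
print(semi_partial, np.corrcoef(Y, Xk_perp)[0, 1] ** 2)        # equal

h, *_ = np.linalg.lstsq(X_mk, Y, rcond=None)
Y_perp = Y - X_mk @ h                      # (I - P_{-k}) Y
partial = (r2_full - r2_mk) / (1 - r2_mk)
print(partial, np.corrcoef(Y_perp, Xk_perp)[0, 1] ** 2)        # equal
```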

21 Decomposition of projection matrices

Suppose $P \in \mathbb{R}^{n \times n}$ is a rank-$d$ projection matrix, and $U$ is an $n \times d$ orthogonal matrix whose columns span $\mathrm{col}(P)$. If we partition $U$ by columns as $U = (U_1 | U_2 | \cdots | U_d)$, then $P = UU'$, so we can write

$$P = \sum_{j=1}^d U_j U_j'.$$

Note that this representation is not unique, since there are different orthogonal bases for $\mathrm{col}(P)$. Each summand $U_j U_j' \in \mathbb{R}^{n \times n}$ is a rank-1 projection matrix onto $U_j$.

22 Decomposition of R^2

Question: In a multiple regression model, how much of the variance in $Y$ is explained by a particular covariate?

Orthogonal case: If the design matrix $X$ is orthogonal ($X'X = I$), the projection $P$ onto $\mathrm{col}(X)$ can be decomposed as

$$P = \sum_{j=0}^p P_j = \frac{\mathbf{1}\mathbf{1}'}{n} + \sum_{j=1}^p X_j X_j',$$

where $X_j$ is the $j^{th}$ column of the design matrix (assuming here that the first column of $X$ is an intercept).

23 Decomposition of R^2 (orthogonal case)

The $n \times n$ rank-1 matrix $P_j = X_j X_j'$ is the projection onto $\mathrm{span}(X_j)$ (and $P_0$ is the projection onto the span of the vector of 1's). Furthermore, by orthogonality, $P_j P_k = 0$ unless $j = k$.

Since by orthogonality

$$\hat Y - \bar Y = \sum_{j=1}^p P_j Y,$$

it follows that

$$\|\hat Y - \bar Y\|^2 = \sum_{j=1}^p \|P_j Y\|^2.$$

Here we are using the fact that if $U_1, \ldots, U_m$ are orthogonal, then $\|U_1 + \cdots + U_m\|^2 = \|U_1\|^2 + \cdots + \|U_m\|^2$.

24 Decomposition of R^2 (orthogonal case)

The $R^2$ for simple linear regression of $Y$ on $X_j$ is

$$R_j^2 \equiv \|\hat Y_j - \bar Y\|^2/\|Y - \bar Y\|^2 = \|P_j Y\|^2/\|Y - \bar Y\|^2,$$

so we see that for orthogonal design matrices,

$$R^2 = \sum_{j=1}^p R_j^2.$$

That is, the overall coefficient of determination is the sum of univariate coefficients of determination for all the explanatory variables.
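
This additivity is easy to demonstrate numerically; the sketch below builds an orthogonal design by centering and orthonormalizing random columns (an illustrative construction, not from the slides):

```python
import numpy as np

def r2(X, Y):
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ b
    return 1 - resid @ resid / np.sum((Y - Y.mean()) ** 2)

rng = np.random.default_rng(6)
n, p = 400, 4
Z = rng.normal(size=(n, p))
Z -= Z.mean(axis=0)                  # make columns orthogonal to the intercept
Q, _ = np.linalg.qr(Z)               # orthonormal (and still centered) columns
X = np.column_stack([np.ones(n), Q])
Y = X @ rng.normal(size=p + 1) + rng.normal(size=n)

r2_full = r2(X, Y)
r2_each = [r2(np.column_stack([np.ones(n), Q[:, j]]), Y) for j in range(p)]
print(r2_full, sum(r2_each))         # equal for an orthogonal design
```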

25 Decomposition of R^2

Non-orthogonal case: If $X$ is not orthogonal, the overall $R^2$ will not be the sum of single covariate $R^2$'s. If we let $R_j^2$ be as above (the $R^2$ values for regressing $Y$ on each $X_j$), then there are two different situations: $\sum_j R_j^2 > R^2$, and $\sum_j R_j^2 < R^2$.

26 Decomposition of R^2

Case 1: $\sum_j R_j^2 > R^2$

It's not surprising that $\sum_j R_j^2$ can be bigger than $R^2$. For example, suppose that

$$Y = X_1 + \epsilon$$

is the data generating model, and $X_2$ is highly correlated with $X_1$ (but is not part of the data generating model).

For the regression of $Y$ on both $X_1$ and $X_2$, the multiple $R^2$ will be $1 - \sigma^2/\mathrm{var}(Y)$ (since $E(Y|X_1, X_2) = E(Y|X_1) = X_1$). The $R^2$ values for $Y$ regressed on either $X_1$ or $X_2$ separately will also be approximately $1 - \sigma^2/\mathrm{var}(Y)$. Thus $R_1^2 + R_2^2 \approx 2R^2$.

27 Decomposition of R^2

Case 2: $\sum_j R_j^2 < R^2$

This is more surprising, and is sometimes called enhancement. As an example, suppose the data generating model is $Y = Z + \epsilon$, but we don't observe $Z$ (for simplicity assume $EZ = 0$). Instead, we observe a value $X_1$ that satisfies $X_1 = Z + X_2$, where $X_2$ has mean 0 and is independent of $Z$ and $\epsilon$.

Since $X_2$ is independent of $Z$ and $\epsilon$, it is also independent of $Y$, thus $R_2^2 \approx 0$ for large $n$.

28 Decomposition of R^2 (enhancement example)

The multiple $R^2$ of $Y$ on $X_1$ and $X_2$ is approximately $\sigma_Z^2/(\sigma_Z^2 + \sigma^2)$ for large $n$, since the fitted values will converge to $\hat Y = X_1 - X_2 = Z$.

To calculate $R_1^2$, first note that for the regression of $Y$ on $X_1$,

$$\hat\beta = \frac{\widehat{\mathrm{cov}}(Y, X_1)}{\widehat{\mathrm{var}}(X_1)} \to \frac{\sigma_Z^2}{\sigma_Z^2 + \sigma_{X_2}^2},$$

and $\hat\alpha \to 0$.

29 Decomposition of R^2 (enhancement example)

Therefore for large $n$,

$$n^{-1}\|Y - \hat Y\|^2 \approx n^{-1}\|Z + \epsilon - \sigma_Z^2 X_1/(\sigma_Z^2 + \sigma_{X_2}^2)\|^2 = n^{-1}\|\sigma_{X_2}^2 Z/(\sigma_Z^2 + \sigma_{X_2}^2) + \epsilon - \sigma_Z^2 X_2/(\sigma_Z^2 + \sigma_{X_2}^2)\|^2 \approx \sigma_{X_2}^4\sigma_Z^2/(\sigma_Z^2 + \sigma_{X_2}^2)^2 + \sigma^2 + \sigma_Z^4\sigma_{X_2}^2/(\sigma_Z^2 + \sigma_{X_2}^2)^2 = \sigma_{X_2}^2\sigma_Z^2/(\sigma_Z^2 + \sigma_{X_2}^2) + \sigma^2.$$

Therefore

$$R_1^2 = 1 - \frac{n^{-1}\|Y - \hat Y\|^2}{n^{-1}\|Y - \bar Y\|^2} \approx 1 - \frac{\sigma_{X_2}^2\sigma_Z^2/(\sigma_Z^2 + \sigma_{X_2}^2) + \sigma^2}{\sigma_Z^2 + \sigma^2} = \frac{\sigma_Z^2}{(\sigma_Z^2 + \sigma^2)(1 + \sigma_{X_2}^2/\sigma_Z^2)}.$$

30 Decomposition of R^2 (enhancement example)

Thus $R_1^2/R^2 \approx 1/(1 + \sigma_{X_2}^2/\sigma_Z^2)$, which is strictly less than one if $\sigma_{X_2}^2 > 0$. Since $R_2^2 \approx 0$, it follows that $R^2 > R_1^2 + R_2^2$.

The reason for this is that while $X_2$ contains no directly useful information about $Y$ (hence $R_2^2 \approx 0$), it can remove the measurement error in $X_1$, making $X_1$ a better predictor of $Z$.
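
A simulation of the enhancement example (the variance values are arbitrary choices for illustration) shows the effect directly, and matches the limits derived above:

```python
import numpy as np

def r2(design, Y):
    b, *_ = np.linalg.lstsq(design, Y, rcond=None)
    resid = Y - design @ b
    return 1 - resid @ resid / np.sum((Y - Y.mean()) ** 2)

rng = np.random.default_rng(7)
n = 200_000
sigma_Z, sigma_X2, sigma = 1.0, 1.0, 0.5
Z = sigma_Z * rng.normal(size=n)
X2 = sigma_X2 * rng.normal(size=n)
X1 = Z + X2                          # observed proxy for Z
Y = Z + sigma * rng.normal(size=n)

one = np.ones(n)
r2_both = r2(np.column_stack([one, X1, X2]), Y)
r2_1 = r2(np.column_stack([one, X1]), Y)
r2_2 = r2(np.column_stack([one, X2]), Y)
print(r2_both, r2_1 + r2_2)          # R^2 clearly exceeds R_1^2 + R_2^2
# With these variances the limits are R^2 -> 0.8, R_1^2 -> 0.4, R_2^2 -> 0.
```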

31 Decomposition of R^2 (enhancement example)

We can also calculate the limiting partial $R^2$ for adding $X_2$ to a model that already contains $X_1$:

$$\frac{\sigma_{X_2}^2}{\sigma_{X_2}^2 + \sigma^2(1 + \sigma_{X_2}^2/\sigma_Z^2)}.$$

32 Partial R^2 example 2

Suppose the design matrix satisfies

$$X'X/n = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & r \\ 0 & r & 1 \end{pmatrix},$$

and the data generating model is

$$Y = X_1 + X_2 + \epsilon$$

with $\mathrm{var}\,\epsilon = \sigma^2$.

33 Partial R^2 example 2

We will calculate the partial $R^2$ for $X_1$, using the fact that the partial $R^2$ is the regular $R^2$ for regressing $(I - P_1)Y$ on $(I - P_1)X_1$, where $P_1$ is the projection onto $\mathrm{span}(\{\mathbf{1}, X_2\})$.

Since this is a simple linear regression, the partial $R^2$ can be expressed

$$\widehat{\mathrm{cor}}((I - P_1)Y, (I - P_1)X_1)^2.$$

34 Partial R^2 example 2

The numerator of the partial $R^2$ is the square of

$$\widehat{\mathrm{cov}}((I - P_1)Y, (I - P_1)X_1) = Y'(I - P_1)X_1/n = (X_1 + X_2 + \epsilon)'(X_1 - rX_2)/n \approx 1 - r^2.$$

The denominator contains two factors. The first is

$$\|(I - P_1)X_1\|^2/n = X_1'(I - P_1)X_1/n = X_1'(X_1 - rX_2)/n \approx 1 - r^2.$$

35 Partial R^2 example 2

The other factor in the denominator is $Y'(I - P_1)Y/n$:

$$Y'(I - P_1)Y/n = (X_1 + X_2)'(I - P_1)(X_1 + X_2)/n + \epsilon'(I - P_1)\epsilon/n + 2\epsilon'(I - P_1)(X_1 + X_2)/n \approx (X_1 + X_2)'(X_1 - rX_2)/n + \sigma^2 \approx 1 - r^2 + \sigma^2.$$

Thus we get that the partial $R^2$ is approximately equal to

$$\frac{1 - r^2}{1 - r^2 + \sigma^2}.$$

If $r = 1$ then the result is zero ($X_1$ has no unique explanatory power), and if $r = 0$, the result is $1/(1 + \sigma^2)$, indicating that after controlling for $X_2$, around a $1/(1 + \sigma^2)$ fraction of the remaining variance is explained by $X_1$ (the rest is due to $\epsilon$).
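
A simulation check of this limit (the values of $r$ and $\sigma$ are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(8)
n, r, sigma = 200_000, 0.6, 1.0
X2 = rng.normal(size=n)
X1 = r * X2 + np.sqrt(1 - r**2) * rng.normal(size=n)   # cor(X1, X2) = r
Y = X1 + X2 + sigma * rng.normal(size=n)

def residualize(v, X2):
    """Residual of v after regressing on {1, X2}, i.e. (I - P_1) v."""
    A = np.column_stack([np.ones(len(v)), X2])
    b, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ b

partial_r2 = np.corrcoef(residualize(Y, X2), residualize(X1, X2))[0, 1] ** 2
print(partial_r2, (1 - r**2) / (1 - r**2 + sigma**2))   # close agreement
```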

36 Summary

Each of the three $R^2$ values can be expressed either in terms of variance ratios (VR), or as a squared correlation coefficient:

Multiple $R^2$: VR form $\|\hat Y - \bar Y\|^2/\|Y - \bar Y\|^2$; correlation form $\widehat{\mathrm{cor}}(\hat Y, Y)^2$.
Semi-partial $R^2$: VR form $R^2 - R_{-k}^2$; correlation form $\widehat{\mathrm{cor}}(Y, X_k^\perp)^2$.
Partial $R^2$: VR form $(R^2 - R_{-k}^2)/(1 - R_{-k}^2)$; correlation form $\widehat{\mathrm{cor}}((I - P_{-k})Y, X_k^\perp)^2$.
