Correlation and Regression
1 Correlation and Regression

Fathers' and daughters' heights. [Histograms of fathers' and daughters' heights, in inches.] Fathers' heights: mean = 67.7, SD = … ; daughters' heights: mean = 63.8, SD = … . Reference: Pearson and Lee (1906), Biometrika 2; … pairs.
2 Fathers' and daughters' heights. [Scatterplot of daughter's height (inches) against father's height (inches); corr = 0.52.] Reference: Pearson and Lee (1906), Biometrika 2; … pairs.

Covariance and correlation. Let X and Y be random variables with µ_X = E(X), µ_Y = E(Y), σ_X = SD(X), σ_Y = SD(Y). For example, sample a father/daughter pair and let X = the father's height and Y = the daughter's height.

Covariance: cov(X, Y) = E{(X - µ_X)(Y - µ_Y)}
Correlation: cor(X, Y) = cov(X, Y) / (σ_X σ_Y)

cov(X, Y) can be any real number; -1 ≤ cor(X, Y) ≤ 1.
3 Examples. [Nine scatterplots of Y against X illustrating different strengths of association: corr = 0, 0.1, 0.3, 0.5, 0.7, 0.9 and negative counterparts such as corr = -0.9.]

Estimated correlation. Consider n pairs of data: (x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_n, y_n). We consider these as independent draws from some bivariate distribution. We estimate the correlation in the underlying distribution by:

r = Σ_i (x_i - x̄)(y_i - ȳ) / √[ Σ_i (x_i - x̄)² · Σ_i (y_i - ȳ)² ]

This is sometimes called the correlation coefficient.
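For illustration, here is a small R sketch (the data are simulated, so the variable names and numbers are made up) checking this formula against R's built-in cor():

## Sample correlation: by the formula and via cor() (simulated data for illustration)
set.seed(1)
n <- 100
x <- rnorm(n, mean = 68, sd = 3)            # e.g. fathers' heights
y <- 0.5 * x + rnorm(n, mean = 30, sd = 2)  # e.g. daughters' heights
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
r_builtin <- cor(x, y)
c(manual = r_manual, builtin = r_builtin)   # the two values agree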
4 Correlation measures linear association. [Three scatterplots with very different shapes.] All three plots have correlation 0.7!

Correlation versus regression. Covariance / correlation: quantifies how two random variables X and Y co-vary. There is typically no particular ordering of the two random variables (e.g., fathers' versus daughters' heights). Regression: assesses the relationship between a predictor X and a response Y; we model E[Y | X]. The values of the predictor are often deliberately chosen, and are therefore not random quantities. We typically assume that we observe the values of the predictor(s) without error.
5 Example. Measurements of degradation of heme with different concentrations of hydrogen peroxide (H2O2), for different types of heme (A and B). [Two panels: OD versus H2O2 concentration, for heme A alone and for hemes A and B.]

Linear regression. [Plot of several straight lines Y = β_0 + β_1 X with different intercepts and slopes; one of them is Y = 0 + 5X.]
6 Linear regression. [Scatterplot of Y against X with a fitted regression line.]

The regression model. Let X be the predictor and Y be the response. Assume we have n observations (x_1, y_1), ..., (x_n, y_n) from X and Y. The simple linear regression model is

y_i = β_0 + β_1 x_i + ε_i,   ε_i iid N(0, σ²).

This implies: E[Y | X] = β_0 + β_1 X.

Interpretation: for two subjects that differ by one unit in X, we expect the responses to differ by β_1.

How do we estimate β_0, β_1, σ²?
7 Fitted values and residuals. We can write

ε_i = y_i - β_0 - β_1 x_i

For a pair of estimates (β̂_0, β̂_1) of the parameters (β_0, β_1), we define the fitted values as

ŷ_i = β̂_0 + β̂_1 x_i

The residuals are

ε̂_i = y_i - ŷ_i = y_i - β̂_0 - β̂_1 x_i

Residuals. [Plot of Y against X with the fitted line; each residual ε̂ is the vertical distance from an observed y to the fitted value ŷ.]
8 Residual sum of squares. For every pair of values for β_0 and β_1 we get a different value of the residual sum of squares,

RSS(β_0, β_1) = Σ_i (y_i - β_0 - β_1 x_i)²

We can view RSS as a function of β_0 and β_1, and we try to minimize this function, i.e. we try to find

(β̂_0, β̂_1) = argmin over (β_0, β_1) of RSS(β_0, β_1)

Not surprisingly, this method is called least squares estimation.

Residual sum of squares. [Surface plot of RSS as a function of β_0 and β_1.]
9 Notation. Assume we have n observations: (x_1, y_1), ..., (x_n, y_n).

x̄ = Σ_i x_i / n
ȳ = Σ_i y_i / n
SXX = Σ_i (x_i - x̄)² = Σ_i x_i² - n x̄²
SYY = Σ_i (y_i - ȳ)² = Σ_i y_i² - n ȳ²
SXY = Σ_i (x_i - x̄)(y_i - ȳ) = Σ_i x_i y_i - n x̄ ȳ
RSS = Σ_i (y_i - ŷ_i)² = Σ_i ε̂_i²

Parameter estimates. The function RSS(β_0, β_1) = Σ_i (y_i - β_0 - β_1 x_i)² is minimized by

β̂_1 = SXY / SXX
β̂_0 = ȳ - β̂_1 x̄
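A minimal R sketch of these formulas, on simulated data (all names and values illustrative); the closed-form estimates should agree with lm():

## Least squares estimates from the closed-form formulas, checked against lm()
set.seed(2)
n <- 25
x <- runif(n, 0, 40)
y <- 0.35 - 0.004 * x + rnorm(n, sd = 0.01)   # illustrative "true" line
SXX <- sum((x - mean(x))^2)
SXY <- sum((x - mean(x)) * (y - mean(y)))
beta1_hat <- SXY / SXX
beta0_hat <- mean(y) - beta1_hat * mean(x)
fit <- lm(y ~ x)
rbind(formula = c(beta0_hat, beta1_hat), lm = coef(fit))  # same estimates
RSS <- sum(resid(fit)^2)
c(manual = RSS / (n - 2), lm = summary(fit)$sigma^2)      # residual mean square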
10 Useful to know. Using the parameter estimates, our best guess for y given any x is

ŷ = β̂_0 + β̂_1 x

Hence, at x = x̄,

β̂_0 + β̂_1 x̄ = ȳ - β̂_1 x̄ + β̂_1 x̄ = ȳ

That means every regression line goes through the point (x̄, ȳ).

Variance estimate. As variance estimate we use

σ̂² = RSS / (n - 2)

This quantity is called the residual mean square. It has the following property:

(n - 2) σ̂² / σ² ~ χ²_{n-2}

In particular, this implies E(σ̂²) = σ².
11 Example (H2O2 concentration). We get x̄ = 21.25, ȳ = 0.27, SXX = …, SXY = -16.48, RSS = …. Therefore

β̂_1 = SXY / SXX ≈ -0.0039,   β̂_0 = 0.27 - (-0.0039 × 21.25) ≈ 0.353,   σ̂ = …

Example. [Plot of OD versus H2O2 concentration with the fitted regression line, Y ≈ 0.353 - 0.0039 X.]
12 Comparing models. We want to test whether β_1 = 0:

H_0: y_i = β_0 + ε_i   versus   H_a: y_i = β_0 + β_1 x_i + ε_i

[Plot of y against x showing the fit under H_a (sloped line) and the fit under H_0 (horizontal line).]

Example. [OD versus H2O2 concentration with both fitted lines: Y ≈ 0.353 - 0.0039 X under H_a, and the horizontal line Y = ȳ = 0.27 under H_0.]
13 Sum of squares.

Under H_a: RSS = Σ_i (y_i - ŷ_i)² = SYY - (SXY)²/SXX = SYY - β̂_1² SXX

Under H_0: Σ_i (y_i - β̂_0)² = Σ_i (y_i - ȳ)² = SYY

Hence SS_reg = SYY - RSS = (SXY)²/SXX

ANOVA

Source                   df      SS       MS                    F
regression on X          1       SS_reg   MS_reg = SS_reg / 1   MS_reg / MSE
residuals (full model)   n - 2   RSS      MSE = RSS / (n - 2)
total                    n - 1   SYY
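In R, this table comes from anova() applied to a fitted lm object; a sketch on simulated data (values illustrative):

## ANOVA table for the simple linear regression (simulated data)
set.seed(3)
x <- runif(24, 0, 40)
y <- 0.35 - 0.004 * x + rnorm(24, sd = 0.01)
fit <- lm(y ~ x)
anova(fit)   # rows: regression on x (1 df) and residuals (n - 2 df)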
14 Example. [ANOVA table for the H2O2 example, with rows for regression on X, residuals for the full model, and total.]

Parameter estimates. One can show that

E(β̂_0) = β_0,   E(β̂_1) = β_1

Var(β̂_0) = σ² (1/n + x̄²/SXX)
Var(β̂_1) = σ² / SXX
Cov(β̂_0, β̂_1) = -σ² x̄ / SXX
Cor(β̂_0, β̂_1) = -x̄ / √(x̄² + SXX/n)

Note: we're thinking of the x's as fixed.
15 Parameter estimates. One can even show that the joint distribution of β̂_0 and β̂_1 is a bivariate normal distribution!

(β̂_0, β̂_1)ᵀ ~ N(β, Σ)   where   β = (β_0, β_1)ᵀ

and

Σ = σ² [ 1/n + x̄²/SXX    -x̄/SXX
         -x̄/SXX           1/SXX  ]

Simulation: coefficients. [Scatterplot of the estimated slope (roughly -0.0034 to -0.0042) against the estimated y-intercept across simulated data sets.]
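The figure can be reproduced, roughly, by simulation: hold the x's fixed, generate many response vectors, and refit the model each time. A sketch (the true parameter values are chosen only for illustration):

## Sampling distribution of the estimated intercept and slope
set.seed(4)
x <- runif(24, 0, 40)                        # fixed design points
beta0 <- 0.35; beta1 <- -0.004; sigma <- 0.01
est <- t(replicate(2000, {
  y <- beta0 + beta1 * x + rnorm(length(x), sd = sigma)
  coef(lm(y ~ x))
}))
colMeans(est)   # close to (beta0, beta1)
cov(est)        # close to sigma^2 * solve(crossprod(cbind(1, x)))
plot(est[, 1], est[, 2], xlab = "intercept", ylab = "slope")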
16 Possible outcomes. [Several simulated data sets of OD versus H2O2 concentration, each with its fitted regression line.]

Confidence intervals. We know that

β̂_0 ~ N( β_0, σ² (1/n + x̄²/SXX) )
β̂_1 ~ N( β_1, σ² / SXX )

We can use those distributions for hypothesis testing and to construct confidence intervals!
17 Statistical inference. We want to test:

H_0: β_1 = β_1*   versus   H_a: β_1 ≠ β_1*   (generally, β_1* is 0)

We use

t = (β̂_1 - β_1*) / se(β̂_1) ~ t_{n-2}   where   se(β̂_1) = √(σ̂² / SXX)

Also,

[ β̂_1 - t_{(1-α/2), n-2} se(β̂_1),  β̂_1 + t_{(1-α/2), n-2} se(β̂_1) ]

is a (1 - α)·100% confidence interval for β_1.

Results. The calculations in the test H_0: β_0 = β_0* versus H_a: β_0 ≠ β_0* are analogous, except that we have to use

se(β̂_0) = √[ σ̂² (1/n + x̄²/SXX) ]

For the example we get the 95% confidence intervals (0.342, 0.364) for the intercept and (…, …) for the slope. Testing whether the intercept (slope) is equal to zero, we obtain 70.7 (-22.0) as test statistic. This corresponds to a p-value of … (…).
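In R, the t statistics appear in summary() and the confidence intervals come from confint(); a sketch on simulated data (so the numbers will not match the 70.7 and -22.0 of the example):

## t tests and confidence intervals for the regression coefficients
set.seed(5)
x <- runif(24, 0, 40)
y <- 0.35 - 0.004 * x + rnorm(24, sd = 0.01)
fit <- lm(y ~ x)
summary(fit)$coefficients   # estimates, SEs, t values, p-values
confint(fit, level = 0.95)  # 95% CIs for intercept and slope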
18 Now how about that. Testing for the slope being equal to zero, we use

t = β̂_1 / se(β̂_1)

For the squared test statistic we get

t² = [ β̂_1 / se(β̂_1) ]² = β̂_1² / (σ̂²/SXX) = β̂_1² SXX / σ̂² = [(SYY - RSS)/1] / [RSS/(n-2)] = MS_reg / MSE = F

The squared t statistic is the same as the F statistic from the ANOVA!

Joint confidence region. A 95% joint confidence region for the two parameters is the set of all values (β_0, β_1) that fulfill

(Δβ_0, Δβ_1) [ n         Σ_i x_i
               Σ_i x_i   Σ_i x_i² ] (Δβ_0, Δβ_1)ᵀ ≤ 2 σ̂² F_{(0.95), 2, n-2}

where Δβ_0 = β_0 - β̂_0 and Δβ_1 = β_1 - β̂_1.
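A rough sketch of checking whether a candidate point (β_0, β_1) falls inside this region, using the quadratic form above (simulated data; the function name in_region is made up):

## Is the point (beta0, beta1) inside the 95% joint confidence region?
set.seed(6)
x <- runif(24, 0, 40)
y <- 0.35 - 0.004 * x + rnorm(24, sd = 0.01)
fit <- lm(y ~ x); n <- length(x)
sigma2_hat <- sum(resid(fit)^2) / (n - 2)
in_region <- function(beta0, beta1) {
  d <- c(beta0, beta1) - coef(fit)                   # (delta beta0, delta beta1)
  M <- matrix(c(n, sum(x), sum(x), sum(x^2)), 2, 2)  # the 2x2 matrix above
  drop(t(d) %*% M %*% d) <= 2 * sigma2_hat * qf(0.95, 2, n - 2)
}
in_region(0.35, -0.004)   # TRUE if the candidate point lies in the region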
19 Joint confidence region. [Plot of the elliptical joint confidence region in the (β̂_0, β̂_1) plane.]

Notation. Assume we have n observations: (x_1, y_1), ..., (x_n, y_n). We previously defined

SXX = Σ_i (x_i - x̄)² = Σ_i x_i² - n x̄²
SYY = Σ_i (y_i - ȳ)² = Σ_i y_i² - n ȳ²
SXY = Σ_i (x_i - x̄)(y_i - ȳ) = Σ_i x_i y_i - n x̄ ȳ

We also define

r_XY = SXY / √(SXX · SYY)

(called the sample correlation).
20 Coefficient of determination. We previously wrote

SS_reg = SYY - RSS = (SXY)² / SXX

Define

R² = SS_reg / SYY = 1 - RSS / SYY

R² is often called the coefficient of determination. Notice that

R² = SS_reg / SYY = (SXY)² / (SXX · SYY) = r²_XY

The Anscombe data. [Four scatterplots with fitted lines; in every panel β̂_0 = 3.0, β̂_1 = 0.5, σ̂² = 13.75, and R² = 0.667, even though the data look completely different.]
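R ships Anscombe's quartet as the built-in data set anscombe, so the identical summaries across visually different data sets are easy to verify:

## Anscombe's quartet: four regressions with essentially identical summaries
fits <- lapply(1:4, function(i)
  lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]]))
t(sapply(fits, coef))                           # intercepts ~3.0, slopes ~0.5
sapply(fits, function(f) summary(f)$r.squared)  # all roughly 0.67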
21 Fathers' and daughters' heights. [Scatterplot of daughter's height (inches) against father's height (inches); corr = 0.52.] Linear regression. [The same scatterplot with a fitted regression line.]

22 Linear regression. [Scatterplot of daughter's height against father's height with the fitted line.] Regression line. [Same plot, highlighting the regression line.] Slope = r · SD(Y) / SD(X).

23 SD line. [Scatterplot with the SD line.] Slope = SD(Y) / SD(X). SD line versus regression line. [Both lines drawn on the scatterplot.] Both lines go through the point (X̄, Ȳ).
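The identity slope = r · SD(Y)/SD(X), and the contrast with the SD line's slope SD(Y)/SD(X), can be checked numerically; a sketch with simulated heights (the SDs used here are illustrative, not the Pearson and Lee values):

## Regression slope equals r * SD(y) / SD(x); the SD line's slope is SD(y) / SD(x)
set.seed(7)
x <- rnorm(1000, 67.7, 2.7)                          # "fathers" (SD made up)
y <- 63.8 + 0.5 * (x - 67.7) + rnorm(1000, 0, 2.3)   # "daughters"
c(lm_slope   = unname(coef(lm(y ~ x))[2]),
  r_sd_ratio = cor(x, y) * sd(y) / sd(x),            # regression-line slope
  sd_ratio   = sd(y) / sd(x))                        # SD-line slope (steeper)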
24 Predicting father's height from daughter's height. [Scatterplots of daughter's height (inches) against father's height (inches) illustrating prediction of the father's height from the daughter's height.]

25 Predicting father's height from daughter's height. [The same scatterplot with the regression line for predicting X from Y.] There are two regression lines! [Scatterplot showing both regression lines: Y on X and X on Y.]
26 The equations.

Regression of y on x (for predicting y from x):
Slope = r · SD(y) / SD(x); goes through the point (x̄, ȳ).
ŷ - ȳ = r · [SD(y)/SD(x)] · (x - x̄)
ŷ = β̂_0 + β̂_1 x   where   β̂_1 = r · SD(y)/SD(x)   and   β̂_0 = ȳ - β̂_1 x̄

Regression of x on y (for predicting x from y):
Slope = r · SD(x) / SD(y); goes through the point (ȳ, x̄).
x̂ - x̄ = r · [SD(x)/SD(y)] · (y - ȳ)
x̂ = β̂_0 + β̂_1 y   where   β̂_1 = r · SD(x)/SD(y)   and   β̂_0 = x̄ - β̂_1 ȳ

Estimating the mean response. [Plot of OD versus H2O2 concentration with the fitted line Y ≈ 0.353 - 0.0039 X.] We can use the regression results to predict the expected response for a new concentration of hydrogen peroxide. But what is its variability?
27 Variability of the mean response. Let ŷ be the predicted mean for some x, i.e. ŷ = β̂_0 + β̂_1 x. Then

E(ŷ) = β_0 + β_1 x
var(ŷ) = σ² [ 1/n + (x - x̄)²/SXX ]

where y = β_0 + β_1 x is the true mean response.

Why?

E(ŷ) = E(β̂_0 + β̂_1 x) = E(β̂_0) + x E(β̂_1) = β_0 + x β_1

var(ŷ) = var(β̂_0 + β̂_1 x)
       = var(β̂_0) + var(β̂_1 x) + 2 cov(β̂_0, β̂_1 x)
       = var(β̂_0) + x² var(β̂_1) + 2x cov(β̂_0, β̂_1)
       = σ² (1/n + x̄²/SXX) + σ² x²/SXX - 2x x̄ σ²/SXX
       = σ² [ 1/n + (x - x̄)²/SXX ]
28 Confidence intervals. Hence

ŷ ± t_{(1-α/2), n-2} · σ̂ · √[ 1/n + (x - x̄)²/SXX ]

is a (1 - α)·100% confidence interval for the mean response given x.

Confidence limits. [Plot of OD versus H2O2 concentration with the fitted line and 95% confidence limits for the mean response.]
29 Prediction. Now assume that we want to calculate an interval for the predicted response y* for a value of x. There are two sources of uncertainty:

(a) the mean response
(b) the natural variation σ²

The variance of ŷ* is

var(ŷ*) = σ² + σ² [ 1/n + (x - x̄)²/SXX ] = σ² [ 1 + 1/n + (x - x̄)²/SXX ]

Prediction intervals. Hence

ŷ* ± t_{(1-α/2), n-2} · σ̂ · √[ 1 + 1/n + (x - x̄)²/SXX ]

is a (1 - α)·100% prediction interval for the predicted response given x. When n is very large, we get roughly ŷ* ± t_{(1-α/2), n-2} · σ̂.
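Both kinds of interval come from predict() applied to an lm fit; a sketch on simulated data (names and values illustrative):

## Confidence interval for the mean response vs. prediction interval for a new y
set.seed(8)
x <- runif(24, 0, 40)
y <- 0.35 - 0.004 * x + rnorm(24, sd = 0.01)
fit <- lm(y ~ x)
new <- data.frame(x = c(10, 25, 40))
predict(fit, new, interval = "confidence", level = 0.95)  # mean response
predict(fit, new, interval = "prediction", level = 0.95)  # new observation (wider)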
30 Prediction intervals. [Plot of OD versus H2O2 concentration with the fitted line, 95% confidence limits for the mean response, and 95% confidence limits for the prediction (wider).]

Span and height. [Scatterplot of height (inches) against span (inches).]
31 With just 100 individuals. [The height-versus-span scatterplot with only 100 individuals.]

Regression for calibration. That prediction interval is for the case that the x's are known without error, while y = β_0 + β_1 x + ε, where ε = error. Another common situation: we have a number of pairs (x, y) to get a calibration line/curve; the x's are basically without error, while the y's have measurement error. We then obtain a new value, y*, and want to estimate the corresponding x*:

y* = β_0 + β_1 x* + ε
32 Example. [Scatterplot of Y against X.] Another example. [A second scatterplot of Y against X.]
33 Regression for calibration.

Data: (x_i, y_i) for i = 1, ..., n, with y_i = β_0 + β_1 x_i + ε_i, ε_i iid Normal(0, σ);
and y*_j for j = 1, ..., m, with y*_j = β_0 + β_1 x* + ε*_j, ε*_j iid Normal(0, σ), for some x*.

Goal: estimate x* and give a 95% confidence interval.

The estimate: obtain β̂_0 and β̂_1 by regressing the y_i on the x_i. Let x̂* = (ȳ* - β̂_0) / β̂_1, where ȳ* = Σ_j y*_j / m.

95% CI for x̂*: Let T denote the 97.5th percentile of the t distribution with n - 2 d.f. Let

g = T / [ β̂_1 / (σ̂ / √SXX) ] = (T σ̂) / (β̂_1 √SXX)

If g ≥ 1, we would fail to reject H_0: β_1 = 0! In this case, the 95% CI for x̂* is (-∞, ∞). If g < 1, our 95% CI is the following:

x̂* + [ (x̂* - x̄) g² ± (T σ̂ / β̂_1) √( (x̂* - x̄)²/SXX + (1 - g²)(1/m + 1/n) ) ] / (1 - g²)

For very large n, this reduces to approximately x̂* ± (T σ̂) / (β̂_1 √m).
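A sketch implementing the estimate and the interval as reconstructed above (so treat it as illustrative rather than definitive); the function name calib_ci and the simulated data are made up:

## Calibration: estimate x* from new y* readings and give a 95% CI
calib_ci <- function(x, y, ystar) {
  n <- length(x); m <- length(ystar)
  fit <- lm(y ~ x)
  b0 <- coef(fit)[1]; b1 <- coef(fit)[2]
  sigma_hat <- summary(fit)$sigma
  SXX <- sum((x - mean(x))^2)
  Tq <- qt(0.975, n - 2)
  xhat <- (mean(ystar) - b0) / b1
  g2 <- (Tq * sigma_hat)^2 / (b1^2 * SXX)   # g^2, robust to the sign of b1
  if (g2 >= 1) return(c(estimate = unname(xhat), lower = -Inf, upper = Inf))
  half <- abs(Tq * sigma_hat / b1) *
    sqrt((xhat - mean(x))^2 / SXX + (1 - g2) * (1/m + 1/n)) / (1 - g2)
  ctr <- xhat + (xhat - mean(x)) * g2 / (1 - g2)
  c(estimate = unname(xhat), lower = unname(ctr - half), upper = unname(ctr + half))
}

## illustrative use with simulated calibration data and 5 new readings at x* = 22
set.seed(9)
x <- runif(30, 0, 40); y <- 0.35 - 0.004 * x + rnorm(30, sd = 0.01)
ystar <- 0.35 - 0.004 * 22 + rnorm(5, sd = 0.01)
calib_ci(x, y, ystar)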
34 Example. [Plot of Y against X.] Another example. [Plot of Y against X.]

35 Infinite m. [Plot of Y against X.] Infinite n. [Plot of Y against X.]
36 Multiple linear regression. A and B. [Plot of OD versus H2O2 concentration for heme types A and B.]

Multiple linear regression. [Four panels showing the possible relationships between the two regression lines: general (different intercepts and slopes), parallel (equal slopes), concurrent (equal intercepts), coincident (identical lines).]
37 Multiple linear regression. A and B. [Plot of OD versus H2O2 concentration for heme types A and B.]

More than one predictor. [Data table with columns Y, X_1, and X_2.] The model with two parallel lines can be described as

Y = β_0 + β_1 X_1 + β_2 X_2 + ε

In other words (or, equations):

Y = β_0 + β_1 X_1 + ε             if X_2 = 0
Y = (β_0 + β_2) + β_1 X_1 + ε     if X_2 = 1
38 Multiple linear regression. A multiple linear regression model has the form

Y = β_0 + β_1 X_1 + ··· + β_k X_k + ε,   ε ~ N(0, σ²)

The predictors (the X's) can be categorical or numerical. Often, all predictors are numerical or all are categorical. And actually, categorical variables are converted into a group of numerical ones.

Interpretation. Let X_1 be the age of a subject (in years).

E[Y] = β_0 + β_1 X_1

Comparing two subjects who differ by one year in age, we expect the responses to differ by β_1. Comparing two subjects who differ by five years in age, we expect the responses to differ by 5 β_1.
39 Interpretation. Let X_1 be the age of a subject (in years), and let X_2 be an indicator for the treatment arm (0/1).

E[Y] = β_0 + β_1 X_1 + β_2 X_2

Comparing two subjects from the same treatment arm who differ by one year in age, we expect the responses to differ by β_1. Comparing two subjects of the same age from the two different treatment arms (X_2 = 1 versus X_2 = 0), we expect the responses to differ by β_2.

Interpretation. Let X_1 be the age of a subject (in years), and let X_2 be an indicator for the treatment arm (0/1).

E[Y] = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_1 X_2

E[Y] = β_0 + β_1 X_1                                              (if X_2 = 0)
E[Y] = β_0 + β_1 X_1 + β_2 + β_3 X_1 = β_0 + β_2 + (β_1 + β_3) X_1   (if X_2 = 1)

Comparing two subjects who differ by one year in age, we expect the responses to differ by β_1 if they are in the control arm (X_2 = 0), and expect the responses to differ by β_1 + β_3 if they are in the treatment arm (X_2 = 1).
40 Estimation. We have the model

y_i = β_0 + β_1 x_i1 + ··· + β_k x_ik + ε_i,   ε_i iid Normal(0, σ²)

We estimate the β's by the values for which

RSS = Σ_i (y_i - ŷ_i)²

is minimized, where ŷ_i = β̂_0 + β̂_1 x_i1 + ··· + β̂_k x_ik (aka "least squares"). We estimate σ by

σ̂ = √[ RSS / (n - (k + 1)) ]

FYI. Calculation of the β̂'s (and their SEs and correlations) is not that complicated, but without matrix algebra the formulas are nasty. Here is what you need to know: the SEs of the β̂'s involve σ and the x's, and the β̂'s are normally distributed. Obtain confidence intervals for the β's using β̂ ± t · ŜE(β̂), where t is a quantile of the t distribution with n - (k+1) d.f. Test H_0: β = 0 using β̂ / ŜE(β̂); compare this to a t distribution with n - (k+1) d.f.
41 The example: a full model. x_1 = [H2O2]; x_2 = 0 or 1, indicating type of heme; y = the OD measurement. The model:

y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_1 X_2 + ε

i.e.,

y = β_0 + β_1 X_1 + ε                       if X_2 = 0
y = (β_0 + β_2) + (β_1 + β_3) X_1 + ε       if X_2 = 1

β_2 = 0: same intercepts. β_3 = 0: same slopes. β_2 = β_3 = 0: same lines.

Results:

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      …          …          …     < 2e-16
x1               …          …          …       …e-15
x2               …          …          …       …
x1:x2            …          …          …       …

Residual standard error: … on 20 degrees of freedom
Multiple R-Squared: 0.98, Adjusted R-squared: …
F-statistic: … on 3 and 20 DF, p-value: < 2.2e-16
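The output above is from R's lm(). Assuming a data frame with columns y, x1 = [H2O2], and x2 = a 0/1 indicator for heme type (simulated here, since the original measurements aren't reproduced), the fit would look roughly like:

## Full model with an interaction: separate intercepts and slopes for types A and B
set.seed(10)
heme <- data.frame(x1 = rep(c(5, 10, 20, 40), times = 6),  # [H2O2], illustrative
                   x2 = rep(0:1, each = 12))               # 0 = type A, 1 = type B
heme$y <- with(heme, 0.35 - 0.004 * x1 - 0.02 * x2 - 0.001 * x1 * x2 +
                 rnorm(24, sd = 0.01))
fit_full <- lm(y ~ x1 * x2, data = heme)   # same as y ~ x1 + x2 + x1:x2
summary(fit_full)$coefficients             # coefficient table like the one above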
42 Testing many parameters. We have the model

y_i = β_0 + β_1 x_i1 + ··· + β_k x_ik + ε_i,   ε_i iid Normal(0, σ²)

We seek to test H_0: β_{r+1} = ··· = β_k = 0. In other words, do we really have just

y_i = β_0 + β_1 x_i1 + ··· + β_r x_ir + ε_i,   ε_i iid Normal(0, σ²)?

What to do:

1. Fit the full model (with all k x's).
2. Calculate the residual sum of squares, RSS_full.
3. Fit the reduced model (with only r x's).
4. Calculate the residual sum of squares, RSS_red.
5. Calculate F = [(RSS_red - RSS_full)/(df_red - df_full)] / [RSS_full/df_full], where df_red = n - r - 1 and df_full = n - k - 1.
6. Under H_0, F ~ F(df_red - df_full, df_full).
43 In particular... Assume the model

y_i = β_0 + β_1 x_i1 + ··· + β_k x_ik + ε_i,   ε_i iid Normal(0, σ²)

We seek to test H_0: β_1 = ··· = β_k = 0 (i.e., none of the x's are related to y).

Full model: all the x's. Reduced model: y = β_0 + ε, so RSS_red = Σ_i (y_i - ȳ)².

F = [ (Σ_i (y_i - ȳ)² - Σ_i (y_i - ŷ_i)²) / k ] / [ Σ_i (y_i - ŷ_i)² / (n - k - 1) ]

Compare this to an F(k, n - k - 1) distribution.

The example. To test β_2 = β_3 = 0:

Analysis of Variance Table
Model 1: y ~ x1
Model 2: y ~ x1 + x2 + x1:x2
  Res.Df   RSS   Df   Sum of Sq    F    Pr(>F)
1    …      …
2    …      …     2       …        …     …e-05
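In R this comparison is anova() applied to the reduced and full fits; a sketch, again on simulated data in place of the original measurements:

## Nested-model F test for H0: beta2 = beta3 = 0 (same line for both heme types)
set.seed(10)
heme <- data.frame(x1 = rep(c(5, 10, 20, 40), times = 6),
                   x2 = rep(0:1, each = 12))
heme$y <- with(heme, 0.35 - 0.004 * x1 - 0.02 * x2 - 0.001 * x1 * x2 +
                 rnorm(24, sd = 0.01))
fit_red  <- lm(y ~ x1, data = heme)        # reduced model
fit_full <- lm(y ~ x1 * x2, data = heme)   # full model
anova(fit_red, fit_full)                   # F test with 2 and 20 df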