BINF 702 Chapter 11 Regression and Correlation Methods (SPRING 2014)


1 BINF 702 Chapter 11 Regression and Correlation Methods (SPRING 2014)

2 Section 11.1 Introduction Example 11.1 Obstetrics Obstetricians sometimes order tests for estriol levels from 24-hour urine specimens taken from pregnant women who are near term, since the level of estriol has been found to be related to the birthweight of the infant. The test can provide indirect evidence of an abnormally small fetus. The relationship between estriol level and birthweight can be quantified by fitting a regression line that relates the two variables. Example 11.2 Hypertension Much discussion has taken place in the literature concerning the familial aggregation of blood pressure. In general, children whose parents have high blood pressure tend to have higher blood pressure than their peers. One way of expressing this relationship is to compute a correlation coefficient relating the blood pressure of parents and children over a large collection of families.

3 Section 11.2 General Concepts Let us return to our consideration of the relationship between estriol level and birthweight. Let x = estriol level and y = birthweight. We might posit a relationship such as Eq. E(y|x) = a + bx. Our regression line is defined as Def. y = a + bx, where a is the y-intercept and b is the slope. Of course, our regression line is not expected to fit exactly; there will be some error associated with the fit: Eq. y = a + bx + e, where e ~ N(0, σ²), x is the independent variable, and y is the dependent variable.

4 Section 11.2 General Concepts A linear regression fit for our birthweight data

5 Section 11.2 General Concepts Some nuances of the fit: the noise level can vary, and the slope b may vary.

6 Section 11.3 Fitting Regression Lines The Method of Least Squares Def. The least-squares line, or estimated regression line, is the line y = a + bx minimizing the sum of squared distances of the sample points from the line, Σ_{i=1}^{n} d_i². We choose this criterion because the math is tractable. Eq. Estimation of the Least-Squares Line: the coefficients of the least-squares line y = a + bx are given by b = L_xy/L_xx and a = (Σ_{i=1}^{n} y_i − b Σ_{i=1}^{n} x_i)/n = ȳ − b x̄.

7 Section 11.3 Fitting Regression Lines The Method of Least Squares Here L_xx = Σ_{i=1}^{n} x_i² − (Σ_{i=1}^{n} x_i)²/n and L_xy = Σ_{i=1}^{n} x_i y_i − (Σ_{i=1}^{n} x_i)(Σ_{i=1}^{n} y_i)/n. Def. The predicted, or average, value of y for a given value of x, as estimated from the fitted regression line, is denoted by ŷ = a + bx.
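As a quick numerical check on these formulas, the corrected sums of squares and the least-squares coefficients can be computed by hand in R. A minimal sketch, assuming x and y are numeric vectors of equal length (for instance the estriol and birthweight vectors es and bw defined on slide 13):

Lxx = sum(x^2) - sum(x)^2/length(x)       # corrected sum of squares of x
Lxy = sum(x*y) - sum(x)*sum(y)/length(x)  # corrected cross-product
b = Lxy/Lxx                 # least-squares slope
a = mean(y) - b*mean(x)     # least-squares intercept
c(a, b)                     # should agree with coef(lm(y ~ x))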

8 Section 11.3 Fitting Regression Lines The Method of Least Squares Regression in R lm {stats} R Documentation Fitting Linear Models Description: lm is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although aov may provide a more convenient interface for these). Usage: lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...)

9 Section 11.3 Fitting Regression Lines The Method of Least Squares Regression in R (The Arguments) formula: a symbolic description of the model to be fit. The details of model specification are given below. data: an optional data frame containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called. subset: an optional vector specifying a subset of observations to be used in the fitting process. weights: an optional vector of weights to be used in the fitting process. If specified, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used. na.action: a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The factory-fresh default is na.omit. Another possible value is NULL, no action.

10 Section 11.3 Fitting Regression Lines The Method of Least Squares Regression in R (The Arguments) method: the method to be used; for fitting, currently only method = "qr" is supported; method = "model.frame" returns the model frame (the same as with model = TRUE, see below). model, x, y, qr: logicals. If TRUE the corresponding components of the fit (the model frame, the model matrix, the response, the QR decomposition) are returned. singular.ok: logical. If FALSE (the default in S but not in R) a singular fit is an error. contrasts: an optional list. See the contrasts.arg of model.matrix.default. offset: this can be used to specify an a priori known component to be included in the linear predictor during fitting. An offset term can be included in the formula instead or as well, and if both are specified their sum is used. ...: additional arguments to be passed to the low level regression fitting functions (see below).

11 Section 11.3 Fitting Regression Lines The Method of Least Squares Regression in R (Some of the Details) Models for lm are specified symbolically. A typical model has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. A terms specification of the form first + second indicates all the terms in first together with all the terms in second with duplicates removed. A specification of the form first:second indicates the set of terms obtained by taking the interactions of all terms in first with all terms in second. The specification first*second indicates the cross of first and second. This is the same as first + second + first:second. If response is a matrix a linear model is fitted to each column of the matrix. See model.matrix for some further details. The terms in the formula will be re-ordered so that main effects come first, followed by the interactions, all second-order, all third-order and so on: to avoid this pass a terms object as the formula. A formula has an implied intercept term. To remove this use either y ~ x - 1 or y ~ 0 + x. See formula for more details of allowed formulae. lm calls the lower level functions lm.fit, etc, see below, for the actual numerical computations. For programming only, you may consider doing likewise. All of weights, subset and offset are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.

12 Section 11.3 Fitting Regression Lines The Method of Least Squares Regression in R (Some of the Details) lm returns an object of class "lm" or for multiple responses of class c("mlm", "lm"). The functions summary and anova are used to obtain and print a summary and analysis of variance table of the results. The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by lm. An object of class "lm" is a list containing at least the following components: coefficients: a named vector of coefficients. residuals: the residuals, that is, response minus fitted values. fitted.values: the fitted mean values. rank: the numeric rank of the fitted linear model. weights: (only for weighted fits) the specified weights. df.residual: the residual degrees of freedom. call: the matched call. terms: the terms object used. contrasts: (only where relevant) the contrasts used. xlevels: (only where relevant) a record of the levels of the factors used in fitting. y: if requested, the response used. x: if requested, the model matrix used. model: if requested (the default), the model frame used.

13 Section 11.3 Fitting Regression Lines The Method of Least Squares Example 11.8 Obstetrics Birthweight as a function of estriol in R.
es = c(7,9,9,12,14,16,16,14,16,16,17,19,21,24,15,16,17,25,27,15,15,15,16,19,18,17,18,20,22,25,24)
bw = c(25,25,25,27,27,27,24,30,30,31,30,31,30,28,32,32,32,32,34,34,34,35,35,34,35,36,37,38,40,39,43)
library(stats)
bw.lm = lm(bw ~ es)
bw.lm$coefficients   # prints the (Intercept) and es estimates
plot(es, bw)
lines(es, bw.lm$coefficients[1] + bw.lm$coefficients[2]*es)   # overlay the fitted line

14 Section 11.4 Inferences About Parameters from Regression Lines Eq. 11.5 Decomposition of the Total Sum of Squares into Regression and Residual Components (check out Figure 11.6): Σ_{i=1}^{n} (y_i − ȳ)² = Σ_{i=1}^{n} (ŷ_i − ȳ)² + Σ_{i=1}^{n} (y_i − ŷ_i)², i.e., Total Sum of Squares = Regression Sum of Squares + Residual Sum of Squares. A good-fitting regression line will have regression components large in absolute value relative to the residual components, whereas the opposite is true for poorly fitting lines.
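The three sums of squares in this decomposition can be recovered from a fitted lm object; a minimal sketch using the bw.lm fit of Example 11.8:

TotSS = sum((bw - mean(bw))^2)               # Total SS
RegSS = sum((fitted(bw.lm) - mean(bw))^2)    # Regression SS
ResSS = sum(residuals(bw.lm)^2)              # Residual SS
all.equal(TotSS, RegSS + ResSS)              # TRUE, up to rounding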

15 F Test for Simple Linear Regression We will use the ratio of the regression sum of squares to the residual sum of squares as a test of the regression; a large ratio indicates a good fit. We are testing H0: b = 0 versus H1: b ≠ 0, where b is the slope of the regression line. Some helpful notation: the regression mean square (Reg MS) is (Reg SS)/k, where k is the number of predictors in the model. The residual mean square (Res MS) is (Res SS)/(n − k − 1), where n − k − 1 is the degrees of freedom of the residual sum of squares (Res df). In the literature Res MS = s²_yx. Reg SS = bL_xy = b²L_xx = L²_xy/L_xx. Res SS = Total SS − Reg SS = L_yy − L²_xy/L_xx.

16 F Test for Simple Linear Regression Eq. F Test for Simple Linear Regression To test H0: b = 0 versus H1: b ≠ 0, use the following procedure: 1) Compute the test statistic F = Reg MS/Res MS = (L²_xy/L_xx) / {[L_yy − L²_xy/L_xx]/(n − 2)}, which follows an F_{1,n−2} distribution under H0. 2) For a two-sided test with significance level α, if F > F_{1,n−2,1−α} then reject H0; if F ≤ F_{1,n−2,1−α} then accept H0. 3) The exact p-value is given by P(F_{1,n−2} > F).

17 F Test for Simple Linear Regression Def. R² is defined as (Reg SS)/(Total SS). Interpretation of R²: R² can be thought of as the proportion of the variance of y that can be explained by the variable x. R² = 1: all of the data points fall on the regression line. R² = 0: x gives no information about the variance of y.
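R² can be read off the fitted model or recomputed from the sums of squares of the previous sketch; a minimal sketch for the obstetrics fit:

RegSS/TotSS                # definition: Reg SS / Total SS
summary(bw.lm)$r.squared   # the same value from the lm summary
cor(bw, fitted(bw.lm))^2   # R^2 as a squared correlation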

18 F Test for Simple Linear Regression The obstetrics data revisited in R
> summary(bw.lm)
Call: lm(formula = bw ~ es)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e-09 ***
es ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: on 29 degrees of freedom
Multiple R-Squared: , Adjusted R-squared:
F-statistic: on 1 and 29 DF, p-value:

19 F Test for Simple Linear Regression Using aov in R to perform the regression fit on the obstetrics data
> summary(aov(bw ~ es))
Df Sum Sq Mean Sq F value Pr(>F)
es ***
Residuals
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

20 t Test for Simple Linear Regression Eq. 11.8 t Test for Simple Linear Regression To test the hypothesis H0: b = 0 versus H1: b ≠ 0, use the following procedure: 1) Compute the test statistic t = b/(s²_yx/L_xx)^{1/2}. 2) For a two-sided test with significance level α, if t > t_{n−2,1−α/2} or t < t_{n−2,α/2} = −t_{n−2,1−α/2}, then reject H0; if −t_{n−2,1−α/2} ≤ t ≤ t_{n−2,1−α/2}, then accept H0. 3) The p-value is given by p = 2 × (area to the left of t under a t_{n−2} distribution) if t < 0, and p = 2 × (area to the right of t under a t_{n−2} distribution) if t ≥ 0.
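The t statistic above can be assembled by hand from the fitted model and compared with the t value printed by summary(bw.lm); a minimal sketch (s2yx stands for the residual mean square s²_yx):

n = length(es)
s2yx = sum(residuals(bw.lm)^2)/(n - 2)   # residual mean square
Lxx = sum(es^2) - sum(es)^2/n
t = coef(bw.lm)["es"]/sqrt(s2yx/Lxx)     # t = b/se(b)
2*pt(-abs(t), df = n - 2)                # two-sided p-value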

21 t Test for Simple Linear Regression The R output of the obstetrics data revisited
> summary(bw.lm)
Call: lm(formula = bw ~ es)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e-09 ***
es ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: on 29 degrees of freedom
Multiple R-Squared: , Adjusted R-squared:
F-statistic: on 1 and 29 DF, p-value:

22 11.5 Interval Estimation for Linear Regression Interval Estimates for Regression Parameters: under certain assumptions, how well can we quantify the uncertainty in our estimates of the slope and y-intercept? Interval Estimation for Predictions Made from Regression Lines: under certain assumptions, how well can we quantify the uncertainty in our predicted values?

23 11.5 Interval Estimation for Linear Regression Interval Estimates for Regression Parameters Eq. Standard Errors of Estimated Parameters in Simple Linear Regression: se(b) = √(s²_yx/L_xx) and se(a) = √[s²_yx(1/n + x̄²/L_xx)].
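These standard errors can be computed directly and checked against the Std. Error column of summary(bw.lm); a minimal sketch reusing s2yx and Lxx from the earlier sketch:

se.b = sqrt(s2yx/Lxx)
se.a = sqrt(s2yx*(1/length(es) + mean(es)^2/Lxx))
c(se.a, se.b)   # compare with summary(bw.lm)$coefficients[, "Std. Error"]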

24 11.5 Interval Estimation for Linear Regression Interval Estimates for Regression Parameters Eq. Two-Sided 100% × (1 − α) Confidence Intervals for the Parameters of a Regression Line: If b and a are, respectively, the estimated slope and intercept of a regression line as given on the previous slide, and se(b) and se(a) are their estimated standard errors, then the two-sided 100% × (1 − α) confidence intervals for b and a are given by b ± t_{n−2,1−α/2} se(b) and a ± t_{n−2,1−α/2} se(a).
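R computes these two-sided confidence intervals directly with confint; for the obstetrics fit:

confint(bw.lm, level = 0.95)   # 95% CIs for (Intercept) and es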

25 Interval Estimates for Regression Parameters Confidence intervals on regression parameters in R
> summary(bw.lm)
Call: lm(formula = bw ~ es)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e-09 ***
es ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: on 29 degrees of freedom
Multiple R-Squared: , Adjusted R-squared:
F-statistic: on 1 and 29 DF, p-value:

26 Interval Estimation for Predictions Made from Regression Lines A pedagogical example Forced expiratory volume (FEV) is a standard measure of pulmonary function. To identify people with abnormal pulmonary function, standards of FEV for normal people must be established. One problem here is that FEV is related to both age and height. Let us focus on boys who are ages 10 to 15 and postulate a regression model of the form FEV = a + b(height) + e. Data were collected on FEV and height for 655 boys in this age group residing in Tecumseh, Michigan. The mean FEV in liters is presented for each of twelve 4-cm height groups in the table below. Find the best-fitting regression line and test it for statistical significance. What proportion of the variance of FEV can be explained by height?

27 Interval Estimation for Predictions Made from Regression Lines Our FEV pedagogical example continued. Table: mean height (cm) and mean FEV (L) for each of the twelve 4-cm height groups.

28 Interval Estimation for Predictions Made from Regression Lines
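The numeric table entries were lost in transcription, but the fit itself is one line in R. A hedged sketch, assuming the twelve group means have been entered into vectors ht (mean height, cm) and fev (mean FEV, L); the name fev.lm is the one used by the predict() calls on the following slides:

fev.lm = lm(fev ~ ht)   # FEV = a + b(height) + e
summary(fev.lm)         # slope, F test, and R^2 for the FEV example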

29 Interval Estimation for Predictions Made from Regression Lines Ex. Pulmonary Function Suppose we wish to use the FEV-height regression line computed previously to develop normal ranges for 10- to 15-year-old boys of particular heights. In particular, consider John H., who is 12 years old and 160 cm tall and whose FEV is 2.5 L. Can his FEV be considered abnormal for his age and height?

30 Interval Estimation for Predictions Made from Regression Lines Eq. Predictions Made from Regression Lines for Individual Observations Suppose we wish to make predictions from a regression line for an individual observation with independent variable x that was not used in constructing the regression line. The distribution of observed y values for the subset of individuals with independent variable x is normal with mean ŷ = a + bx and standard deviation se₁(ŷ) = s_yx √[1 + 1/n + (x − x̄)²/L_xx]. Furthermore, 100% × (1 − α) of the observed values will fall within the interval ŷ ± t_{n−2,1−α/2} se₁(ŷ). This interval is sometimes called a 100% × (1 − α) prediction interval for y.

31 Interval Estimation for Predictions Made from Regression Lines Prediction intervals in R
> new = list(ht=160)
> predict(fev.lm, new, interval='prediction')
fit lwr upr
[1,]
We note that John's observed value of 2.5 does not fall within the prediction interval. John merits follow-up.

32 Interval Estimation for Predictions Made from Regression Lines Suppose we wish to assess the mean FEV value for a large number of boys with the same x value. Eq. Standard Error and Confidence Interval for Predictions Made from Regression Lines for the Average Value of y for a Given x The best estimate of the average value of y for a given x is ŷ = a + bx. Its standard error is given by se₂(ŷ) = s_yx √[1/n + (x − x̄)²/L_xx]. Furthermore, a two-sided 100% × (1 − α) confidence interval for the average value of y is ŷ ± t_{n−2,1−α/2} se₂(ŷ).

33 Interval Estimation for Predictions Made from Regression Lines Confidence intervals in R for the average value of y
> predict(fev.lm, new, interval='confidence')
fit lwr upr
[1,]
This is sometimes referred to within the statistics community as the confidence interval for the regression function.

34 Interval Estimation for Predictions Made from Regression Lines Example

35 11.6 Assessing the Goodness of Fit of Regression Lines Eq. Assumptions Made in Linear-Regression Models 1) For any given value of x, the corresponding value of y has an average value of a + bx, which is a linear function of x. 2) For any given value of x, the corresponding value of y is normally distributed about a + bx with the same variance σ² for any x. 3) For any two data points (x₁, y₁), (x₂, y₂), the error terms e₁, e₂ are independent of each other.

36 11.6 Assessing the Goodness of Fit of Regression Lines The simplest type of diagnostic plot. There may be more variability for larger values of es. Which assumption is this violating?

37 11.6 Assessing the Goodness of Fit of Regression Lines Eq. Standard Deviation of Residuals About the Fitted Regression Line Let (x_i, y_i) be a sample point used in estimating the regression line y = a + bx. If y = a + bx is the estimated regression line, then the residual for the point (x_i, y_i) about the estimated regression line is ê_i = y_i − (a + bx_i), with sd(ê_i) = √{σ̂²[1 − (1/n + (x_i − x̄)²/L_xx)]}. The Studentized residual corresponding to the point (x_i, y_i) is given by ê_i/sd(ê_i).

38 11.6 Assessing the Goodness of Fit of Regression Lines (Regression Diagnostic Plots in R - I)

39 11.6 Assessing the Goodness of Fit of Regression Lines (Regression Diagnostic Plots in R - II)
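The panels shown on these two slides are the standard diagnostics produced by the plot method for lm objects; a minimal sketch for the obstetrics fit:

par(mfrow = c(2, 2))   # 2 x 2 grid of panels
plot(bw.lm)            # residuals vs fitted, normal QQ, scale-location, leverage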

40 11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R) Assessing uniformity of variance and linearity of residual structure.

41 11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R) Assessing normality of residual structure with QQ plots.
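A QQ plot of the residuals can also be drawn directly; a minimal sketch:

qqnorm(residuals(bw.lm))   # sample quantiles vs theoretical normal quantiles
qqline(residuals(bw.lm))   # reference line through the quartiles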

42 11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R) A few EDA-type plots for assessment of normality.

43 11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R) QQ plots for various types of distributions.

44 11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R) Cook's distance for the i-th observation is based on the differences between the predicted responses from the model constructed from all of the data and the predicted responses from the model constructed with the i-th observation set aside. For each observation, the sum of these squared differences is divided by (p+1) times the Residual Mean Square from the full model. Some analysts suggest investigating observations for which Cook's distance is greater than 1; others suggest looking at a dot plot to find extreme values. Cook's distance plots.
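Cook's distances are available from cooks.distance, and plot(model, which = 4) draws the index plot shown on the slide; a minimal sketch:

d = cooks.distance(bw.lm)
which(d > 1)             # observations flagged by the D > 1 rule of thumb
plot(bw.lm, which = 4)   # index plot of Cook's distance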

45 11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R) A pedagogical example. Age is age at first word (x-values) and gesell (y-values) is the Gesell adaptive score.
age = c(15,26,10,9,15,20,18,11,8,20,7,9,10,11,11,10,12,42,17,11,10)
gesell = c(95,71,83,91,102,87,93,100,104,94,113,96,83,84,102,100,105,57,121,86,100)
> plot(gesell ~ age)
> identify(gesell ~ age)
[1]

46 11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R) Gesell example continued

47 11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R)

48 11.7 The Correlation Coefficient The sample correlation coefficient offers an alternative way to measure the linear association between two variables; one can use it rather than the regression coefficient. The sample (Pearson) correlation coefficient is given by r = L_xy/√(L_xx L_yy). Properties of r: r > 0, positively correlated; r < 0, negatively correlated; r = 0, uncorrelated.

49 11.7 The Correlation Coefficient Relationship between the sample correlation coefficient r and the population correlation coefficient ρ: r = [L_xy/(n − 1)] / √{[L_xx/(n − 1)][L_yy/(n − 1)]} = s_xy/(s_x s_y).

50 11.7 The Correlation Coefficient There is actually a simple relationship between the sample correlation coefficient and the regression coefficient: b = r(s_y/s_x). So these two quantities really are just rescaled versions of one another.

51 11.7 The Correlation Coefficient The sample Pearson correlation coefficient, r, in R Example
> es = c(7,9,9,12,14,16,16,14,16,16,17,19,21,24,15,16,17,25,27,15,15,15,16,19,18,17,18,20,22,25,24)
> bw = c(25,25,25,27,27,27,24,30,30,31,30,31,30,28,32,32,32,32,34,34,34,35,35,34,35,36,37,38,40,39,43)
> cor(es, bw, method='pearson')
[1]
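The rescaling relationship b = r(s_y/s_x) from the previous slide is easy to verify numerically with these data; a minimal sketch:

cor(es, bw)*sd(bw)/sd(es)   # r * (s_y/s_x)
coef(bw.lm)["es"]           # the fitted slope b; the same value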

52 11.8 Statistical Inference for Correlation Coefficients: One-Sample t Test for a Correlation Coefficient Eq. One-Sample t Test for a Correlation Coefficient To test the hypothesis H0: ρ = 0 versus H1: ρ ≠ 0, use the following procedure: 1) Compute the sample correlation coefficient r. 2) Compute the test statistic t = r(n − 2)^{1/2}/(1 − r²)^{1/2}, which under H0 follows a t distribution with n − 2 df. 3) For a two-sided level α test, if t > t_{n−2,1−α/2} or t < −t_{n−2,1−α/2}, then reject H0; if −t_{n−2,1−α/2} ≤ t ≤ t_{n−2,1−α/2}, then accept H0. 4) The p-value is given by p = 2 × (area to the left of t under a t_{n−2} distribution) if t < 0, and p = 2 × (area to the right of t under a t_{n−2} distribution) if t ≥ 0. 5) We assume an underlying normal distribution for each of the random variables used to compute r.

53 11.8 Statistical Inference for Correlation Coefficients: One-Sample t Test for a Correlation Coefficient Problem p. 505 in R
> logmort = c(-2.35,-2.20,-2.12,-1.95,-1.85,-1.80,-1.70,-1.58)
> logcig = c(-0.26,-0.03,0.30,0.37,0.40,0.50,0.55,0.55)
> cor(logmort, logcig)
[1]
> cor.test(logmort, logcig)
Pearson's product-moment correlation
data: logmort and logcig
t = , df = 6, p-value =
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
sample estimates: cor

54 11.8 Statistical Inference for Correlation Coefficients: One-Sample z Test for a Correlation Coefficient Eq. One-Sample z Test for a Correlation Coefficient To test the hypothesis H0: ρ = ρ₀ versus H1: ρ ≠ ρ₀, use the following procedure: 1) Compute the sample correlation coefficient r and the z transformation of r. 2) Compute the test statistic λ = (z − z₀)√(n − 3). 3) If λ > z_{1−α/2} or λ < −z_{1−α/2}, reject H0; if −z_{1−α/2} ≤ λ ≤ z_{1−α/2}, accept H0. 4) The exact p-value is given by p = 2 × Φ(λ) if λ ≤ 0, and p = 2 × [1 − Φ(λ)] if λ > 0. 5) Assume an underlying normal distribution for each of the random variables used to compute r and z.

55 11.8 Statistical Inference for Correlation Coefficients: One-Sample z Test for a Correlation Coefficient Here z = (1/2) ln[(1 + r)/(1 − r)], which under H0 is approximately N((1/2) ln[(1 + ρ₀)/(1 − ρ₀)], 1/(n − 3)), and z₀ = (1/2) ln[(1 + ρ₀)/(1 − ρ₀)].

56 11.8 Statistical Inference for Correlation Coefficients: One-Sample z Test for a Correlation Coefficient There is no direct implementation of this test in R, but this method is used to compute confidence intervals when the number of observations is larger than 6 when one calls cor.test.
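Because base R offers no one-sample z test against a specified ρ₀, the test above is short to code directly. A hedged sketch, where the vectors x and y and the null value rho0 are assumptions supplied by the user:

r = cor(x, y); n = length(x)
z = 0.5*log((1 + r)/(1 - r))          # Fisher's z transform (atanh(r))
z0 = 0.5*log((1 + rho0)/(1 - rho0))
lambda = (z - z0)*sqrt(n - 3)         # test statistic
2*pnorm(-abs(lambda))                 # two-sided p-value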

57 11.9 Multiple Regression Consider the example on pg. 466 of the text. Eq. y = a + b₁x₁ + b₂x₂ + e, where y is the systolic blood pressure, x₁ is the birthweight, x₂ is the age in days, and e ~ N(0, σ²). We use the method of least squares to minimize the sum of [y − (a + b₁x₁ + b₂x₂)]². In general, if we have k independent variables x₁, …, x_k, then a linear-regression model relating y to x₁, …, x_k is of the form Eq. y = a + Σ_{j=1}^{k} b_j x_j + e, where e ~ N(0, σ²).

58 11.9 Multiple Regression Def. In the model y = a + Σ_{j=1}^{k} b_j x_j + e, the b_j are referred to as partial regression coefficients.

59 11.9 Multiple Regression Def. The standardized regression coefficient b_s is given by b × (s_x/s_y).

60 Hypothesis Testing Eq. F Test for Testing the Hypothesis H0: b₁ = b₂ = … = b_k = 0 versus H1: at least one of the b_j ≠ 0 in Multiple Regression 1) Fit the regression parameters using the method of least squares, and compute Reg SS and Res SS, where Res SS = Σ_{i=1}^{n} (y_i − ŷ_i)², Reg SS = Total SS − Res SS, Total SS = Σ_{i=1}^{n} (y_i − ȳ)², ŷ_i = a + Σ_{j=1}^{k} b_j x_ij, and x_ij = the jth independent variable for the ith subject, j = 1, …, k; i = 1, …, n.

61 Hypothesis Testing Eq. F Test for Testing the Hypothesis H0: b₁ = b₂ = … = b_k = 0 versus H1: at least one of the b_j ≠ 0 in Multiple Regression 2) Compute Reg MS = Reg SS/k and Res MS = Res SS/(n − k − 1). 3) Compute the test statistic F = Reg MS/Res MS, which follows an F_{k,n−k−1} distribution under H0. 4) For a level α test, if F > F_{k,n−k−1,1−α} then reject H0; if F ≤ F_{k,n−k−1,1−α} then accept H0. 5) The exact p-value is given by the area to the right of F under an F_{k,n−k−1} distribution = P(F_{k,n−k−1} > F).

62 Hypothesis Testing Eq. t Test for Testing the Hypothesis H0: b_l = 0, all other b_j ≠ 0 versus H1: b_l ≠ 0, all other b_j ≠ 0 in Multiple Linear Regression 1) Compute t = b_l/se(b_l). 2) If t < t_{n−k−1,α/2} or t > t_{n−k−1,1−α/2}, then reject H0; if t_{n−k−1,α/2} ≤ t ≤ t_{n−k−1,1−α/2}, then accept H0. 3) The exact p-value is given by 2 × P(t_{n−k−1} > t) if t ≥ 0, and 2 × P(t_{n−k−1} ≤ t) if t < 0.

63 11.9 Multiple Regression (EX in R)
> bwmv = c(135,120,100,105,130,125,125,105,120,90,120,95,120,150,160,125)
> agemv = c(3,4,3,2,4,5,2,3,5,4,2,3,3,4,3,3)
> bpmv = c(89,90,83,77,92,98,82,85,96,95,80,79,86,97,92,88)
> bpmv.lm = lm(bpmv ~ bwmv + agemv)
> summary(bpmv.lm)
Call: lm(formula = bpmv ~ bwmv + agemv)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e-08 ***
bwmv **
agemv e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: on 13 degrees of freedom
Multiple R-Squared: , Adjusted R-squared:
F-statistic: on 2 and 13 DF, p-value: 9.844e-07

64 Regression Diagnostics

65 11.9 Multiple Regression (EX in R)

66 11.9 Multiple Regression (EX in R)
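The diagnostic figures on these slides can be regenerated from the fitted multiple-regression object exactly as for the simple fit; a minimal sketch:

par(mfrow = c(2, 2))
plot(bpmv.lm)   # diagnostics for the two-predictor blood pressure fit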

67 Chapter 11 Homework
