CHAPTER 14 MORE ABOUT REGRESSION


We learned in Chapter 5 that often a straight line describes the pattern of a relationship between two quantitative variables. For instance, in Example 5.1 we explored the relationship between the hand-spans (cm) and heights (inches) of 167 college students, and found that the pattern of the relationship in this sample could be described by the equation

Average hand-span = -3 + 0.35 Height

An equation like the one relating hand-span to height is called a regression equation, and the term simple regression is sometimes used to describe the analysis of a straight-line relationship (linear relationship) between a response variable (y-variable) and an explanatory variable (x-variable). In Chapter 5, we only used regression methods to describe a sample and did not make statistical inferences about the larger population. Now, we consider how to make inferences about a relationship in the population represented by the sample. Some questions involving the population that we might ask when analyzing a relationship are:

1. Does the observed relationship also occur in the population? For example, is the observed relationship between hand-span and height strong enough to conclude that the relationship also holds in the population?
2. For a linear relationship, what is the slope of the regression line in the population? For example, in the larger population, what is the slope of the regression line that connects hand-spans to heights?
3. What is the mean value of the response variable (y) for individuals with a specific value of the explanatory variable (x)? For example, what is the mean hand-span in a population of people 65 inches tall?
4. What interval of values predicts the value of the response variable (y) for an individual with a specific value of the explanatory variable (x)? For example, what interval predicts the hand-span of an individual 65 inches tall?

14.1 Sample and Population Regression Models

A regression model describes the relationship between a quantitative response variable (the y-variable) and one or more explanatory variables (x-variables).
The y-variable is sometimes called the dependent variable, and because regression models may be used to make predictions, the x-variables may be called the predictor variables. The labels response variable and explanatory variable may be used for the variables on the y-axis and x-axis, respectively, even if there is not an obvious way to assign these labels in the usual sense.

Any regression model has two important components. The most obvious component is the equation that describes how the mean value of the y-variable is connected to specific values of the x-variable. The equation stated before for the connection between hand-span and height, Average hand-span = -3 + 0.35 Height, is an example. In this chapter, we focus on linear relationships so a straight-line equation will be used, but it is important to note that some relationships are curvilinear.

The second component of a regression model describes how individuals vary from the regression line. Figure 14.1, which is identical to Figure 5.6, displays the raw data for the sample of n = 167 hand-spans and heights along with the regression line that estimates how the mean hand-span is connected to specific heights. Notice that most individuals vary from the line. When we examine sample data, we will find it useful to estimate the general size of the deviations from the line. When we consider a model for the relationship within the population represented by a sample, we will state assumptions about the distribution of deviations from the line.

If the sample represents a larger population, we need to distinguish between the regression line for the sample and the regression line for the population. The observed data can be used to determine the regression line for the sample, but the regression line for the population can only be imagined. Because we do not observe the whole population, we will not know numerical values for the intercept and slope of the regression line in the population. As in nearly every statistical problem, the statistics from a sample are used to estimate the unknown population parameters, which in this case are the slope and intercept of the regression line.

Figure 14.1 Regression Line Linking Hand-Span and Height for a Sample of College Students

The Regression Line for the Sample

In Chapter 5, we introduced this notation for the regression line that describes sample data: ŷ = b0 + b1x. In any given situation, the sample is used to determine values for b0 and b1.

ŷ is spoken as "y-hat" and it is also referred to either as predicted y or estimated y.
b0 is the intercept of the straight line. The intercept is the value of ŷ when x = 0.
b1 is the slope of the straight line. The slope tells us how much of an increase (or decrease) there is for ŷ when the x-variable increases by one unit. The sign of the slope tells us whether ŷ increases or decreases when x increases. If the slope is 0, there is no linear relationship between x and y because ŷ is the same for all values of x.

The equation describing the relationship between hand-span and height for the sample of college students can be written as ŷ = -3 + 0.35x. In this equation:

ŷ estimates the average hand-span for any specific height x. If height = 70 inches, for instance, ŷ = -3 + 0.35(70) = 21.5 cm.

The intercept is b0 = -3. While necessary for the line, this value does not have a useful statistical interpretation in this example. It estimates the average hand-span for individuals who have height = 0 inches, an impossible height far from the range of the observed heights. It also is an impossible hand-span.
The slope is b1 = 0.35. This value tells us that the average increase in hand-span is 0.35 centimeters for every one-inch increase in height.

Reminder: The Least-Squares Criterion

In Chapter 5, we described the least-squares criterion. This mathematical criterion is used to determine numerical values of the intercept and slope of a sample regression line. The least-squares line is the line, among all possible lines, that has the smallest sum of squared differences between the sample values of y and the corresponding values of ŷ.

Deviations from the Regression Line in the Sample

The terms random error, residual variation, and residual error all are used as synonyms for the term deviation. Most commonly, the word residual is used to describe the deviation of an observed y-value from the sample regression line. A residual is easy to compute. It simply is the difference between the observed y-value for an individual and the value of ŷ determined from the x-value for that individual.

Example 1. Residuals in the Hand-Span and Height Regression

Consider a person 70 inches tall whose hand-span is 23 centimeters. The sample regression line is ŷ = -3 + 0.35x, so ŷ = -3 + 0.35(70) = 21.5 cm for this person. The residual = observed y - predicted y = y - ŷ = 23 - 21.5 = 1.5 cm. Figure 14.2 illustrates this residual.

For an observation y_i in the sample, the residual is e_i = y_i - ŷ_i.
y_i = the value of the response variable for the observation.
ŷ_i = b0 + b1x_i, where x_i is the value of the explanatory variable for the observation.

Technical Note: The sum of the residuals is 0 for any least-squares regression line. The "least squares" formulas for determining the equation always result in Σy = Σŷ, so Σe = 0.

Figure 14.2 Residual for a person 70 inches tall with a hand-span = 23 centimeters. The residual is the difference between observed y = 23 and ŷ = 21.5, the predicted value for a person 70 inches tall.

The Regression Line for the Population

The regression equation for a simple linear relationship in a population can be written as:

E(Y) = β0 + β1x

E(Y) represents the mean or expected value of y for individuals in the population who all have the same particular value of x. Note that ŷ is an estimate of E(Y).
β0 is the intercept of the straight line in the population.
β1 is the slope of the line in the population. Note that if the slope β1 = 0, there is no linear relationship in the population.

Unless we measure the entire population, we cannot know the numerical values of β0 and β1. These are population parameters that we estimate using the corresponding sample statistics. In the hand-span and height example, b1 = 0.35 is a sample statistic that estimates the population parameter β1, and b0 = -3 is a sample statistic that estimates the population parameter β0.

Deviations from the Regression Line in the Population

To make statistical inferences about the population, two assumptions about how the y-values vary from the population regression line are necessary. First, we assume that the general size of the deviation of y-values from the line is the same for all values of the explanatory variable (x), an assumption called the constant variance assumption. This assumption may or may not be correct in any particular situation, and a scatter plot should be examined to see if it is reasonable or not. In Figure 14.1, the constant variance assumption looks reasonable because the magnitude of the deviation from the line appears to be about the same across the range of observed heights.

The second assumption about the population is that for any specific value of x, the distribution of y-values is a normal distribution. Equivalently, this assumption is that deviations from the population regression line have a normal curve distribution. Figure 14.3 illustrates this assumption along with the other elements of the population regression model for a linear relationship. The line E(Y) = β0 + β1x describes the mean of y, and the normal curves describe deviations from the mean.

Figure 14.3 Regression Model for Population

Summary of the Simple Regression Model

A useful format for expressing the components of the population regression model is

Y = MEAN + DEVIATION

This conceptual equation states that for any individual, the value of the response variable (y) can be constructed by combining two components:

The MEAN, which in the population is the line E(Y) = β0 + β1x if the relationship is linear. There are other possible relationships, such as curvilinear, a special case of which is a quadratic relationship, E(Y) = β0 + β1x + β2x². Relationships that are not linear will not be discussed in this book.
The individual's DEVIATION = y - MEAN, which is what is left unexplained after accounting for the mean y-value at that individual's x-value.

This format also applies to the sample, although technically we should use the term "estimated mean" when referring to the sample regression line.

Example 1 Continued. MEAN and DEVIATION for Height and Hand-Span Regression

Recall that the sample regression line for hand-spans and heights is ŷ = -3 + 0.35x. Although it is not likely to be true, let's assume for convenience that this equation also holds in the population. If your height is x = 70 inches and your hand-span is y = 23 cm, then:

MEAN = -3 + 0.35(70) = 21.5,
DEVIATION = y - MEAN = 23 - 21.5 = 1.5, and
y = 23 = MEAN + DEVIATION = 21.5 + 1.5.

In other words, your hand-span is 1.5 cm above the mean for people with your height.
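The decomposition y = MEAN + DEVIATION, and the Technical Note that least-squares residuals sum to 0, can both be checked numerically. The sketch below assumes the sample line ŷ = -3 + 0.35x from the hand-span example. The function names and the five-point data set are hypothetical, invented only to illustrate the arithmetic.

```python
def predicted_hand_span(height_in, b0=-3.0, b1=0.35):
    """Estimated mean hand-span (cm) from the sample line y-hat = b0 + b1 * x."""
    return b0 + b1 * height_in


def least_squares_line(xs, ys):
    """Return (b0, b1) for the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Example 1: a person 70 inches tall with a 23 cm hand-span.
mean_part = predicted_hand_span(70)   # MEAN (about 21.5)
deviation = 23 - mean_part            # DEVIATION, i.e. the residual (about 1.5)

# Hypothetical data (not from the text): residuals of a least-squares fit sum to 0.
heights = [62, 65, 68, 70, 73]
spans = [18.5, 20.0, 21.0, 22.5, 22.0]
b0, b1 = least_squares_line(heights, spans)
residuals = [y - (b0 + b1 * x) for x, y in zip(heights, spans)]
print(mean_part, deviation, abs(sum(residuals)) < 1e-9)
```

The residual sum is compared with a small tolerance rather than with exactly 0 because floating-point arithmetic leaves a tiny rounding error.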

In the theoretical development of procedures for making statistical inferences for a regression model, the collection of all DEVIATIONS in the population is assumed to have a normal distribution with mean 0 and standard deviation σ (so, the variance is σ²). The value of the standard deviation σ is an unknown population parameter that is estimated using the sample. This standard deviation can be interpreted in the usual way that we interpret a standard deviation. It is, roughly, the average distance between individual values of y and the mean of y as described by the regression line. In other words, it is roughly the size of the average deviation across all individuals in the range of x-values.

Keeping the regression notation straight for populations and samples can be confusing. Although we have not yet introduced all relevant notation, a summary at this stage will help you keep it straight.

Simple Linear Regression Model

For (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), a sample of n observations of the explanatory variable x and the response variable y from a large population, the simple linear regression model describing the relationship between y and x is:

Population version
Mean: E(Y) = β0 + β1x
Individual: y_i = β0 + β1x_i + ε_i = E(Y_i) + ε_i
The deviations ε_i are assumed to follow a normal distribution with mean 0 and standard deviation σ.

Sample version
Mean: ŷ = b0 + b1x
Individual: y_i = b0 + b1x_i + e_i = ŷ_i + e_i
where e_i is the residual for individual i. The sample statistics b0 and b1 estimate the population parameters β0 and β1. The mean of the residuals is 0, and the residuals can be used to estimate the population standard deviation σ.

14.2 Estimating the Standard Deviation From the Mean

Recall that the standard deviation in the regression model measures, roughly, the average deviation of y-values from the mean (the regression line). Expressed another way, the standard deviation for regression measures the general size of the residuals. This is an important and useful statistic for describing individual variation in a regression problem, and it also provides information about how accurately the regression equation might predict y-values for individuals. A relatively small standard deviation from the regression line indicates that individual data points generally fall close to the line, so predictions based on the line will be close to the actual values.

The calculation of the estimate of standard deviation is based on the sum of the squared residuals for the sample. This quantity is called the sum of squared errors and is denoted by SSE. Synonyms for sum of squared errors are residual sum of squares or sum of squared residuals. To find the SSE, residuals are calculated for all observations, then the residuals are squared and summed. The standard deviation for the sample is

$s = \sqrt{\dfrac{\text{Sum of Squared Residuals}}{n-2}} = \sqrt{\dfrac{\text{SSE}}{n-2}}$,

and this sample statistic estimates the population standard deviation σ.

Estimating the Standard Deviation for a Simple Regression Model

$\text{SSE} = \sum (y_i - \hat{y}_i)^2 = \sum e_i^2$

$s = \sqrt{\dfrac{\text{SSE}}{n-2}} = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n-2}}$

The statistic s is an estimate of the population standard deviation σ.

Remember that in the regression context, σ is the standard deviation of the y-values at each x, not the standard deviation of the whole population of y-values.

Example 2. Relationship Between Height and Weight for College Men

Figure 14.4 displays regression results from the Minitab program and a scatter plot for the relationship between y = weight (pounds) and x = height (inches) in a sample of n = 43 men in a Penn State statistics class. The regression line for the sample is ŷ = … + … x, and this line is drawn onto the plot. We see from the plot that there is considerable variation from the line at any given height. The standard deviation, shown in the row of computer output immediately above the plot, is "S = 24.00." This value roughly measures, for any given height, the general size of the deviations of individual weights from the mean weight for the height.

The standard deviation from the regression line can be interpreted in conjunction with the Empirical Rule for bell-shaped data stated in Section 2.7. Recall, for instance, that about 95% of individuals will fall within two standard deviations of the mean. As an example, consider men who are 72 inches tall. For men with this height, the estimated average weight determined from the regression equation is … + …(72) = 186 pounds. The estimated standard deviation from the regression line is s = 24 pounds, so we can estimate that about 95% of men 72 inches tall have weights within 2 × 24 = 48 pounds of 186 pounds, which is 186 ± 48, or 138 to 234 pounds. Think about whether this makes sense for all the men you know who are 72 inches (6 feet) tall.
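The two calculations in this section, s = sqrt(SSE / (n - 2)) and the Empirical Rule range of mean ± 2s, can be sketched as follows. The function names are hypothetical, and the five residuals passed to `regression_sd` are made-up values used only to exercise the formula. The 186-pound mean and s = 24 come from Example 2.

```python
import math


def regression_sd(residuals):
    """s = sqrt(SSE / (n - 2)), where SSE is the sum of squared residuals."""
    n = len(residuals)
    if n <= 2:
        raise ValueError("need more than two observations")
    sse = sum(e ** 2 for e in residuals)
    return math.sqrt(sse / (n - 2))


def empirical_rule_range(mean_at_x, s, k=2):
    """Range expected to hold about 95% of y-values at a given x when k = 2."""
    return mean_at_x - k * s, mean_at_x + k * s


# Example 2: estimated mean weight of 186 lb at 72 inches, with s = 24 lb.
low, high = empirical_rule_range(186, 24)
print(low, high)  # 138 234

# Hypothetical residuals, only to show the SSE / (n - 2) calculation.
print(regression_sd([1.0, -2.0, 0.5, 1.5, -1.0]))
```

The divisor is n - 2 rather than n - 1 because two quantities, the intercept and the slope, are estimated before the residuals are computed.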

Figure 14.4 The Relationship Between Weight and Height for n = 43 College Men

The regression equation is
Weight = … + … Height

Predictor    Coef    SE Coef    T    P
Constant     …       …          …    …
Height       …       …          …    …

S = 24.00    R-Sq = 32.3%    R-Sq(adj) = 30.7%

The Proportion of Variation Explained by x

In Chapter 5, we learned that a statistic denoted as r² is used to measure how well the explanatory variable actually does explain the variation in the response variable. This statistic is also denoted as R² (rather than r²), and the value is commonly expressed as a percent. Researchers typically use the phrase "proportion of variation explained by x" in conjunction with the value of r². For example, if r² = 0.60 (or 60%), the researcher may write that the explanatory variable explains 60% of the variation in the response variable.

The formula for r² presented in Chapter 5 was

$r^2 = \dfrac{\text{SSTO} - \text{SSE}}{\text{SSTO}}$

The quantity SSTO is the sum of squared differences between observed y values and the sample mean ȳ. It measures the size of the deviations of the y-values from the overall mean of y, whereas SSE measures the deviations of the y-values from the predicted values ŷ.

Example 2 Continued. R² for Heights and Weights of College Men

In Figure 14.4, we can find the information "R-Sq = 32.3%" for the relationship between weight and height. A researcher might write "the variable height explains 32.3% of the variation in the weights of college men." This isn't a particularly impressive statistic. As we noted before, there is substantial deviation of individual weights from the regression line, so a prediction of a college man's weight based on height may not be particularly accurate.

Example 3. Driver Age and Highway Sign Reading Distance

In Example 5.2, we examined data for the relationship between y = maximum distance (feet) at which a driver can read a highway sign and x = the age of the driver. There were n = 30 observations in the data set. Figure 14.5 displays Minitab regression output for these data. The equation describing the linear relationship in the sample is

Average distance = … - 3.01 Age

From the output, we learn that the standard deviation from the regression line is s = 49.76 and R-Sq = 64.2%. Roughly, the average deviation from the regression line is about 50 feet, and the proportion of variation in sign reading distances explained by age is 0.642, or 64.2%.

Figure 14.5 Minitab Output: Sign Reading Distance and Driver Age

The regression equation is
Distance = … - 3.01 Age

Predictor    Coef     SE Coef    T    P
Constant     …        …          …    …
Age          -3.01    …          …    …

S = 49.76    R-Sq = 64.2%    R-Sq(adj) = 62.9%

Analysis of Variance
Source            DF    SS       MS    F    P
Regression        …     …        …     …    …
Residual Error    28    69334    …
Total             …     …

Unusual Observations
Obs    Age    Distance    Fit    SE Fit    Residual    St Resid
…      …      …           …      …         …           … R
R denotes an observation with a large standardized residual

The "Analysis of Variance" table provides the pieces needed to compute r² and s:

SSE = 69334
$s = \sqrt{\dfrac{\text{SSE}}{n-2}} = \sqrt{\dfrac{69334}{28}} = 49.76$
SSTO = …
SSTO - SSE = … - 69334 = …
r² = … / … = .642 or 64.2%
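As a check on the arithmetic above, the sketch below recomputes s from the SSE and sample size reported for Example 3. The helper names are hypothetical. Because the SSTO value is not legible in this copy, `r_squared` is exercised with illustrative numbers chosen to give the r² = 0.60 mentioned earlier, not with the sign-distance data.

```python
import math


def sd_from_anova(sse, n):
    """Standard deviation from the regression line: sqrt(SSE / (n - 2))."""
    return math.sqrt(sse / (n - 2))


def r_squared(ssto, sse):
    """Proportion of variation explained by x: (SSTO - SSE) / SSTO."""
    return (ssto - sse) / ssto


# Example 3: SSE = 69334 with n = 30 drivers.
s = sd_from_anova(69334, 30)
print(round(s, 2))  # 49.76

# Illustrative sums of squares only (not from the text).
print(r_squared(100.0, 40.0))
```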

14.3 Inference about the Linear Regression Relationship

When researchers do a regression analysis, they occasionally know based on past research or common sense that the variables are indeed related. In some instances, however, it may be necessary to do a hypothesis test in order to make the generalization that two variables are related in the population represented by the sample. The statistical significance of a linear relationship can be evaluated by testing whether or not the slope is 0. Recall that if the slope is 0 in a simple regression model, the two variables are not related because changes in the x-variable will not lead to changes in the y-variable. The usual null hypothesis and alternative hypotheses about β1, the slope of the population line E(Y) = β0 + β1x, are:

H0: β1 = 0 (the population slope is 0, so y and x are not linearly related.)
Ha: β1 ≠ 0 (the population slope is not 0, so y and x are linearly related.)

The alternative hypothesis may be one-sided or two-sided, although most statistical software uses the two-sided alternative.

The test statistic used to do the hypothesis test is a t statistic with the same general format that we saw in Chapter 13. That format, and its application to this situation, is

$t = \dfrac{\text{sample statistic} - \text{null value}}{\text{standard error}} = \dfrac{b_1 - 0}{\text{s.e.}(b_1)}$

This is a standardized statistic for the difference between the sample slope and 0, the null value. Notice that a large value of the sample slope (either positive or negative) relative to its standard error will give a large value of t. If the mathematical assumptions about the population model described in Section 14.1 are correct, the statistic has a t distribution with n - 2 degrees of freedom. The p-value for the test is determined using that distribution.

Hand calculations of the sample slope and its standard error are cumbersome. Fortunately, the regression analysis of most statistical software includes a t-statistic and a p-value for this significance test.

Technical Note: In case you ever need to compute the values by hand, here are the formulas for the sample slope and its standard error:

$b_1 = r\,\dfrac{s_y}{s_x}$

$\text{s.e.}(b_1) = \dfrac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$, where $s = \sqrt{\dfrac{\text{SSE}}{n-2}}$

In the formula for the sample slope, s_x and s_y are the sample standard deviations of the x and y values respectively, and r is the correlation between x and y.

Example 3 Continued: Driver Age and Highway Sign Reading Distance

Figure 14.5 presents the Minitab output for the regression of sign reading distance and driver age. The sample estimate of the slope is b1 = -3.01. This sample slope is different from 0, but is it different enough to enable us to generalize that a linear relationship exists in the population represented by this sample?

The part of the Minitab output that can be used to test the statistical significance of the relationship is shown in bold in Figure 14.5, and the relevant p-value is underlined (by the authors of this text, not by Minitab). This line of the output provides information about the sample slope, the standard error of the sample slope, the t statistic for testing statistical significance and the p-value for the test of:

H0: β1 = 0 (the population slope is 0, so y and x are not linearly related.)
Ha: β1 ≠ 0 (the population slope is not 0, so y and x are linearly related.)

The test statistic is:

$t = \dfrac{\text{sample statistic} - \text{null value}}{\text{standard error}} = \dfrac{b_1 - 0}{\text{s.e.}(b_1)} = \dfrac{-3.01 - 0}{\text{s.e.}(b_1)} = -7.09$

The p-value is, to 3 decimal places, 0.000. This means the probability is virtually 0 that the observed slope could be as far from 0 or farther than it is if there is no linear relationship in the population. So, as we might expect for these variables, we can conclude that the relationship between the two variables in the sample represents a real relationship in the population.

Confidence Interval for the Population Slope

The significance test of whether or not the population slope is 0 only tells us if we can declare the relationship to be statistically significant. If we decide that the true slope is not 0, we might ask, "What is the value of the slope?" We can answer this question with a confidence interval for β1, the population slope.

The format for this confidence interval is the same as the general format used in Chapters 10 and 12, which is

sample estimate ± multiplier × standard error

The estimate of the population slope β1 is b1, the slope of the least-squares regression line for the sample. As shown already, the standard error formula is complicated and we'll usually rely on statistical software to determine this value. The multiplier will be labeled t* and is determined using a t-distribution with df = n - 2. Table 12.1 can be used to find the multiplier for the desired confidence level.

Formula for Confidence Interval for β1, the Population Slope

A confidence interval for β1 is

b1 ± t* × s.e.(b1)

The multiplier t* is found using a t-distribution with n - 2 degrees of freedom, and is such that the probability between -t* and +t* equals the confidence level for the interval.

Example 3 Continued. 95% Confidence Interval for Slope Between Age and Sign Reading Distance

In Figure 14.5, we see that the estimated slope is b1 = -3.01 and s.e.(b1) = … . There are n = 30 observations so df = 28 for finding t*. For a 95% confidence level, t* = 2.05 (see Table 12.1). The 95% confidence interval for the population slope is

-3.01 ± 2.05 × s.e.(b1)
-3.01 ± 0.87
-3.88 to -2.14

With 95% confidence, we can estimate that in the population of drivers represented by this sample, the mean sign reading distance decreases somewhere between 2.14 and 3.88 feet for each one-year increase in age.
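The test statistic and the confidence interval for the slope use the same two ingredients, b1 and s.e.(b1). The sketch below reproduces the interval for Example 3. The standard error is not legible in this copy, so it is backed out here as b1 / t from the reported slope and t statistic. That step is an assumption made for illustration, and the function names are hypothetical.

```python
def slope_t_statistic(b1, se_b1, null_value=0.0):
    """t = (sample slope - null value) / standard error of the slope."""
    return (b1 - null_value) / se_b1


def slope_confidence_interval(b1, se_b1, t_star):
    """b1 +/- t* x s.e.(b1), with t* taken from a t table for df = n - 2."""
    margin = t_star * se_b1
    return b1 - margin, b1 + margin


b1 = -3.01
t_reported = -7.09        # reported t statistic; negative because the slope is negative
se_b1 = b1 / t_reported   # assumption: standard error recovered as b1 / t

low, high = slope_confidence_interval(b1, se_b1, t_star=2.05)  # df = 28
print(round(low, 2), round(high, 2))  # -3.88 -2.14
```

Because the interval lies entirely below 0, it agrees with the hypothesis test: a slope of 0 is not a plausible value for the population.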

Testing Hypotheses about the Correlation Coefficient

In Chapter 5, we learned that the correlation coefficient is 0 when the regression line is horizontal. In other words, if the slope of the regression line is 0, the correlation is 0. This means that the results of a hypothesis test for the population slope can also be interpreted as applying to equivalent hypotheses about the correlation between x and y in the population.

As we did for the regression model, we use different notation to distinguish between a correlation computed for a sample and a correlation within a population. It is commonplace to use the symbol ρ (pronounced "rho") to represent the correlation between two variables within a population. Using this notation, null and alternative hypotheses of interest are:

H0: ρ = 0 (x and y are not correlated)
Ha: ρ ≠ 0 (x and y are correlated)

The results of the hypothesis test described before for the population slope β1 can be used for these hypotheses as well. If we reject H0: β1 = 0, we also reject H0: ρ = 0. If we decide in favor of Ha: β1 ≠ 0, we also decide in favor of Ha: ρ ≠ 0.

Many statistical software programs, including Minitab, will give a p-value for testing whether the population correlation is 0 or not. This p-value will be the same as the p-value given for testing whether the population slope is 0 or not. In the following Minitab output for the relationship between pulse rate and weight in a sample of 35 college women, notice that … is given as the p-value for testing that the slope is 0 (look under P in the regression results) and for testing that the correlation is 0. Because this is not a small p-value, we cannot reject the null hypotheses for the slope and the correlation.

Regression Analysis: Pulse versus Weight

The regression equation is
Pulse = … + … Weight

Predictor    Coef    SE Coef    T    P
Constant     …       …          …    …
Weight       …       …          …    …

Correlations: Pulse, Weight

Pearson correlation of Pulse and Weight = 0.183
P-Value = …

The Effect of Sample Size on Significance

The size of a sample always affects whether a specific observed result achieves statistical significance. For example, r = .183 is not a statistically significant correlation for a sample size of n = 35, as in the pulse and weight example, but it would be statistically significant if n = 1,000. With very large sample sizes, weak relationships with low correlation values can be statistically significant. The moral of the story here is that with a large sample size, it may not be saying much to say that two variables are significantly related. This only means that we think the correlation is not 0. To assess the practical significance of the result, we should carefully examine the observed strength of the relationship.
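The effect of sample size can be seen by converting the same correlation into a t statistic for two different values of n. The sketch below uses the standard identity t = r sqrt(n - 2) / sqrt(1 - r²), which gives the same value as the t statistic for the slope. The identity is not stated in the text above, and the function name is hypothetical. A value of about 2 serves as a rough two-sided 5% cutoff for both sample sizes.

```python
import math


def t_from_correlation(r, n):
    """t statistic for H0: rho = 0, referred to a t distribution with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


t_small = t_from_correlation(0.183, 35)     # pulse and weight example
t_large = t_from_correlation(0.183, 1000)   # same r with a much larger sample
print(round(t_small, 2), round(t_large, 2))  # 1.07 5.88
```

The correlation is identical in both cases. Only n changes, and the larger sample pushes t well past the cutoff even though the relationship is just as weak.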

14.4 Predicting the Value of Y for an Individual

An important use of a regression equation is to estimate or predict the unknown value of a response variable for an individual with a known specific value of the explanatory variable. Using the data described in Example 3, for instance, we can predict the maximum distance at which an individual can read a highway sign by substituting his or her age for x in the sample regression equation. Consider a person 21 years old. The predicted distance is approximately ŷ = … - 3.01(21) = 514 feet.

There will be variation among 21-year-olds with regard to the sign reading distance, so the predicted distance of 514 feet is not likely to be the exact distance for the next 21-year-old who views the sign. Rather than predicting that the distance will be exactly 514 feet, we should instead predict that the distance will be within a particular interval of values.

A 95% prediction interval for the value of the response variable (y) accounts for the variation among individuals with a particular value of x. This interval can be interpreted in two equivalent ways.

The 95% prediction interval estimates the central 95% of the values of y for members of the population with a specified value of x.
The probability is 0.95 that a randomly selected individual from the population with a specified value of x falls into the corresponding 95% prediction interval.

Notice that a prediction interval differs conceptually from a confidence interval. A confidence interval estimates an unknown population parameter, which is a numerical characteristic or summary of the population. An example in this chapter is a confidence interval for the slope of the population line. A prediction interval, however, does not estimate a parameter; instead it estimates the potential data value for an individual. Equivalently, it describes an interval into which a specified percentage of the population may fall.

As with most regression calculations, the "by hand" formulas for prediction intervals are formidable. Statistical software can be used to create the interval. Figure 14.6 shows Minitab output that includes the 95% prediction intervals for three different ages (21 years old, 30 years old, and 45 years old). The intervals are toward the bottom right side of the display in a column labeled "95% PI" and are highlighted with bold type. (Note: The term "Fit" is a synonym for ŷ, the estimate of the average response at the specific x value.) Here is what we can conclude:

The probability is 0.95 that a randomly selected 21-year-old will read the sign at somewhere between roughly 407 and 620 feet.
The probability is 0.95 that a randomly selected 30-year-old will read the sign at somewhere between roughly 381 and 592 feet.
The probability is 0.95 that a randomly selected 45-year-old will read the sign at somewhere between roughly 338 and 545 feet.

We can also interpret each interval as an estimate of the sign reading distances for the central 95% of a population of drivers with a specified age. For instance, about 95% of all drivers 21 years old will be able to read the sign at a distance somewhere between 407 and 620 feet.

Figure 14.6 Minitab output showing prediction interval of distance

The regression equation is
Distance = … - 3.01 Age

Predictor    Coef     SE Coef    T    P
Constant     …        …          …    …
Age          -3.01    …          …    …

S = 49.76    R-Sq = 64.2%    R-Sq(adj) = 62.9%

Analysis of Variance
Source            DF    SS       MS    F    P
Regression        …     …        …     …    …
Residual Error    28    69334    …
Total             …     …

Unusual Observations
Obs    Age    Distance    Fit    SE Fit    Residual    St Resid
…      …      …           …      …         …           … R
R denotes an observation with a large standardized residual

Predicted Values for New Observations
New Obs    Fit    SE Fit    95.0% CI    95.0% PI
1          …      …         (…, …)      (…, …)
2          …      …         (…, …)      (…, …)
3          …      …         (…, …)      (…, …)

Values of Predictors for New Observations
New Obs    Age
1          21
2          30
3          45

We're not limited to using only 95% prediction intervals. With Minitab, we can describe any central percentage of the population that we wish. For example, here are 50% prediction intervals for the sign reading distance at the three specific ages we considered above.

Age    Fit    50.0% PI
21     …      (…, …)
30     …      (…, …)
45     …      (…, …)

For each specific age, the 50% prediction interval estimates the central 50% of the maximum sign reading distances in a population of drivers with that age. For example, we can estimate that 50% of drivers 21 years old would have a maximum sign reading distance somewhere between about 478 feet and 549 feet. The distances for the other 50% of 21-year-old drivers would be predicted to be outside this range, with 25% beyond 549 feet and 25% below 478 feet.

Interpretation of a Prediction Interval

A prediction interval estimates the value of y for an individual with a particular value of x, or equivalently, the range of values of the response variable for a specified central percentage of a population with a particular value of x.
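Software output of this kind lists a "CI" and a "PI" side by side for each new x-value. A minimal sketch of how both can be computed is given below, using the standard formulas ŷ ± t* × s.e.(fit) for the mean and ŷ ± t* × sqrt(s² + [s.e.(fit)]²) for an individual. The ages and distances are hypothetical values made up for illustration, not the data of Example 3, and the function names are assumptions.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope, plus x-bar and sum of (x - x-bar)^2."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1, x_bar, sxx


def intervals_at(xs, ys, x_new, t_star):
    """Return (fit, CI for the mean of y, PI for an individual y) at x_new."""
    n = len(xs)
    b0, b1, x_bar, sxx = fit_line(xs, ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    fit = b0 + b1 * x_new
    se_fit = s * math.sqrt(1 / n + (x_new - x_bar) ** 2 / sxx)
    ci_half = t_star * se_fit
    pi_half = t_star * math.sqrt(s ** 2 + se_fit ** 2)
    return fit, (fit - ci_half, fit + ci_half), (fit - pi_half, fit + pi_half)


# Hypothetical ages (years) and sign reading distances (feet).
ages = [18, 25, 33, 41, 52, 60, 68, 75]
dists = [510, 490, 480, 430, 420, 400, 360, 350]

# t* = 2.45 is the 95% multiplier for df = n - 2 = 6, read from a t table.
fit, ci, pi = intervals_at(ages, dists, x_new=21, t_star=2.45)
print(round(fit, 1), ci, pi)
```

The prediction interval is always wider than the confidence interval at the same x because it adds the individual variation s² to the uncertainty in the estimated mean.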

Technical Note: The formula for the prediction interval for y at a specific x is:

$\hat{y} \pm t^* \sqrt{s^2 + [\text{s.e.}(\text{fit})]^2}$, where $\text{s.e.}(\text{fit}) = s\sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$

The multiplier t* is found using a t-distribution with n - 2 degrees of freedom, and is such that the probability between -t* and +t* equals the desired level for the interval.

Note: The s.e.(fit), and thus the width of the interval, depends upon how far the specified x-value is from x̄. The further the specific x is from the mean, the wider the interval. When n is large, s.e.(fit) will be small, and the prediction interval will be approximately ŷ ± t*s.

14.5 Estimating the Mean Y at a Specified X

In the previous section, we focused on the estimation of the values of the response variable for individuals. A researcher may instead want to estimate the mean value of the response variable for individuals with a particular value of the explanatory variable. We might ask, "What is the mean weight for college men who are 6 feet tall?" This question only asks about the mean weight in a group with a common height, and it is not concerned with the deviations of individuals from that mean.

In technical terms, we wish to estimate the population mean E(Y) for a specific value of x that is of interest to us. To make this estimate, we use a confidence interval. The format for this confidence interval is again:

sample estimate ± multiplier × standard error

The sample estimate of E(Y) is the value of ŷ determined by substituting the x-value of interest into ŷ = b0 + b1x, the least-squares regression line for the sample. The standard error of ŷ is the s.e.(fit) shown in the Technical Note in the previous section, and its value is usually provided by statistical software. The multiplier is found using a t-distribution with df = n - 2, and Appendix A-3 can be used to determine its value.

Example 2 Revisited. Estimating Mean Weight of College Men at Various Heights

Based on the sample of n = 43 college men in Example 2, let's estimate the mean weight in the population of college men for each of three different heights: 68 inches, 70 inches, and 72 inches. Figure 14.7 shows Minitab output that includes the three different confidence intervals for these three different heights. These intervals are toward the bottom of the display in a column labeled "95% CI." The first entry in that column is the estimate of the population mean weight for men who are 68 inches tall. With 95% confidence, we can estimate that the mean weight of college men 68 inches tall is somewhere between … and … pounds. The second row under "95% CI" contains the information that the 95% confidence interval for the mean weight of college men 70 inches tall is … to … pounds. The 95% confidence interval for the mean weight for men 72 inches tall is … to … pounds.

Again, it is important to realize that the confidence intervals for E(Y) do not describe the variation among individuals. They only are estimates of the mean weights for specific heights. The prediction intervals for individual responses describe the variation among individuals. You may have noticed that 95% prediction intervals, labeled "95% PI," are next to the confidence intervals in the output. Among men 70 inches tall, for instance, we would estimate that 95% of the individual weights would be in the interval from about 122 to about 221 pounds.

Figure 14.7 Minitab Output with Confidence Intervals for Mean Weight

The regression equation is
Weight = … + … Height

Predictor    Coef    SE Coef    T    P
Constant     …       …          …    …
Height       …       …          …    …

S = 24.00    R-Sq = 32.3%    R-Sq(adj) = 30.7%

--- Some Output Omitted ----

Predicted Values for New Observations
New Obs    Fit    SE Fit    95.0% CI    95.0% PI
1          …      …         (…, …)      (…, …)
2          …      …         (…, …)      (…, …)
3          …      …         (…, …)      (…, …)

Values of Predictors for New Observations
New Obs    Height
1          68
2          70
3          72

14.6 Checking Conditions for Using Regression Models for Inference

There are a few conditions that should be at least approximately true when we use a regression model to make an inference about a population. Of the five conditions that follow, the first two are particularly crucial.

Conditions for Linear Regression

1. The form of the equation that links the mean value of y to x must be correct. For instance, we won't make proper inferences if we use a straight line to describe a curved relationship.
2. There should not be any extreme outliers that influence the results unduly.
3. The standard deviation of the values of y from the mean y is the same regardless of the value of the x variable. In other words, y values are similarly spread out at all values of x.
4. For individuals in the population with the same particular value of x, the distribution of the values of y is a normal distribution. Equivalently, the distribution of deviations from the mean value of y is a normal distribution. This condition can be relaxed if the sample size is large.
5. Observations in the sample are independent of each other.

Checking the Conditions with Plots

A scatter plot of the raw data and plots of the residuals provide information about the validity of the assumptions. Remember that a residual is the difference between an observed value and the predicted value for that observation, and that some assumptions made for a linear regression model have to do with how y-values deviate from the regression line. If the properties of the residuals for the sample appear to be consistent with the mathematical assumptions made about deviations within the population, we can use the model to make statistical inferences.

Conditions 1, 2 and 3 can be checked using two useful plots:
• A scatter plot of y versus x for the sample (y vs x)
• A scatter plot of the residuals versus x for the sample (resids vs x)

If Condition 1 holds for a linear relationship, then:
• The plot of y vs x should show points randomly scattered around an imaginary straight line.
• The plot of resids vs x should show points randomly scattered around a horizontal line at resid = 0.

If Condition 2 holds, extreme outliers should not be evident in either plot.

If Condition 3 holds, neither plot should show increasing or decreasing spread in the points as x increases.

Example 2 Continued. Checking the Conditions for the Weight and Height Problem

Figure 14.4 displayed a scatter plot of the weights and heights of n = 43 college men. In that plot, it appears that a straight line is a suitable model for how mean weight is linked to height. In Figure 14.8 there is a plot of the residuals ($e_i$) versus the corresponding values of height for these 43 men. This plot is further evidence that the right model has been used. If the right model has been used, the way in which individuals deviate from the line (residuals) will not be affected by the value of the explanatory variable. The somewhat random-looking blob of points in Figure 14.8 is the way a plot of residuals versus x should look if the right equation for the mean has been used.
Both plots (Figures 14.4 and 14.8) also show that there are no extreme outliers and that the weights have approximately the same variance across the range of heights in the sample. Therefore, Conditions 2 and 3 appear to be met.

Figure 14.8 Plot of Residuals versus X for Example 2. The Absence of a Pattern Indicates the Right Model Has Been Used
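The visual checks described above are normally done by eye, but a few numbers can accompany the plot of residuals versus x. The sketch below is a rough numeric companion and not a formal test. The function names `residuals` and `residual_summary` and the data are hypothetical. It compares the spread of the residuals in the lower and upper halves of the x-values (relevant to Condition 3) and reports the largest residual in units of s (relevant to Condition 2).

```python
import math

def residuals(x, y):
    """Residuals (observed y minus predicted y) from the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def rms(values):
    """Root mean square, used here as a simple measure of spread around 0."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def residual_summary(x, y):
    e = residuals(x, y)
    ordered = [r for _, r in sorted(zip(x, e))]  # residuals ordered by x
    half = len(ordered) // 2
    s = math.sqrt(sum(r * r for r in e) / (len(e) - 2))
    return {
        "spread_low_x": rms(ordered[:half]),    # spread for smaller x-values
        "spread_high_x": rms(ordered[half:]),   # spread for larger x-values
        "largest_standardized": max(abs(r) for r in e) / s,
    }

# Hypothetical sample (heights in inches, weights in pounds)
heights = [66, 67, 68, 69, 70, 71, 72, 73, 74, 75]
weights = [150, 148, 160, 165, 170, 168, 180, 185, 182, 195]
print(residual_summary(heights, weights))
```

Roughly equal spreads in the two halves and no unusually large standardized residual are consistent with Conditions 2 and 3, although the plots remain the primary check.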

Condition 4, which is that deviations from the regression line are normally distributed, is difficult to verify, but it is also the least important of the conditions because the inference procedures for regression are robust. This means that if there are no major outliers or extreme skewness, the inference procedures work well even if the distribution of y-values is not a normal distribution. In Chapters 12 and 13, we saw that confidence intervals and hypothesis tests for a mean or a difference between two means also were robust. To examine the distribution of the deviations from the line, a histogram of the residuals is useful, although for small samples a histogram may not be informative. A more advanced plot called a normal probability plot can also be used to check whether the residuals are normally distributed, but we do not provide the details in this text. Figure 14.9 displays a histogram of the residuals for Example 2. It appears that the residuals are approximately normally distributed, so Condition 4 is met.

Figure 14.9 Histogram of Residuals for Example 2

Condition 5 follows from the data collection process. It is met as long as the units are measured independently. It would not be met if the same individuals were measured across the range of x-values, such as if x = average speed and y = gas mileage were to be measured for multiple tanks of gas on the same cars. More complicated models are needed for dependent observations, and those models will not be discussed in this book.

Corrections When Conditions Are Not Met

There are some steps that can be taken if Conditions 1, 2 or 3 are not met. If Condition 1 is not met, more complicated models can be used. For instance, Figure 14.10 shows a typical plot of residuals that occurs when a straight-line model is used to describe data that are curvilinear. It may help to think of the residuals as prediction errors that would occur if we use the regression line to predict the value of y for the individuals in the sample.
In the plot shown in Figure 14.10, the prediction errors are all negative in the central region of X and nearly all positive for outer values of X. This occurs because the wrong model is being used to make the predictions. A curvilinear model, such as the quadratic model discussed earlier, may be more appropriate.

Figure 14.10 A Residual Plot Indicating the Wrong Model Has Been Used
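The residual pattern described above is easy to reproduce. As a minimal sketch, the code below fits a straight line to hypothetical data that follow the curve y = x² exactly and prints the sign of each residual in order of x. The data and the helper `fit_line` are assumptions made for illustration and are not the data behind Figure 14.10.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical curvilinear data: y = x^2 for x = 0, 1, ..., 10
x = list(range(11))
y = [xi ** 2 for xi in x]

# Fit the (wrong) straight-line model and compute the prediction errors
b0, b1 = fit_line(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Sign of each residual, ordered by x
signs = "".join("+" if e > 0 else "-" for e in resid)
print(signs)  # prints ++-------++
```

The residuals are negative in the central region of x and positive at both ends, which is the systematic pattern that signals a straight line is the wrong model for these data.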

Condition 2, that there are no influential outliers, can be checked graphically with the scatter plot of y versus x and the plot of residuals versus x. The appropriate correction if there are outliers depends on the reason for the outliers. The same considerations and corrective action discussed in Chapter 2 would be taken, depending on the cause of the outlier. For instance, Figure 14.11 shows a scatter plot and a residual plot for the data of Exercise 38 in Chapter 5. A potential outlier is seen in both plots. In this example, the x-variable is weight and the y-variable is time to chug a beverage. The outlier probably represents a legitimate data value. The relationship appears to be linear for weights ranging up to about 210 pounds, but then it appears to change. It could either become quadratic, or it could level off. We do not have enough data to determine what happens for higher weights. The solution in this case would be to remove the outlier, and use the linear regression relationship only for body weights under about 210 pounds. Determining the relationship for higher body weights would require a larger sample of individuals in that range.

Figure 14.11 Scatter Plot and Residual Plot With an Outlier

If either Condition 1 or Condition 3 is not met, a transformation may be required. This is equivalent to using a different model. Fortunately, often the same transformation will correct problems with Conditions 1, 3, and 4. For instance, when the response variable is monetary, such as salaries, it is often more appropriate to use the relationship

$$\ln(y) = b_0 + b_1 x + e$$

In other words, we assume that there is a linear relationship between the natural log of y and the x-values. This is called a log transformation on the y's. We will not pursue transformations further in this book.
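As a brief illustration of the log transformation, the sketch below fits a straight line to ln(y) instead of y. Everything in it is hypothetical: the variable names, the helper `fit_line`, and the salary figures, which are generated without noise from salary = 30000·e^(0.05·years) so that the fitted coefficients can be checked against known values.

```python
import math

def fit_line(x, y):
    """Least-squares intercept b0 and slope b1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical, noise-free salaries that grow 5% (continuously) per year
years = list(range(10))
salary = [30000 * math.exp(0.05 * t) for t in years]

# Log transformation on the y's, then an ordinary straight-line fit
log_salary = [math.log(v) for v in salary]
b0, b1 = fit_line(years, log_salary)

def predict_salary(t):
    """Back-transform a prediction from the log scale to dollars."""
    return math.exp(b0 + b1 * t)

print(round(b1, 4), round(math.exp(b0), 2), round(predict_salary(4), 2))
```

On the log scale the relationship is exactly linear, so the fit recovers the slope 0.05 and the intercept ln(30000). With real data the points would scatter around the line, and the conditions would be checked using residuals from the fit to ln(y).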