Unit 11 Using Linear Regression to Describe Relationships

Objective: To obtain and interpret the slope and intercept of the least squares line for predicting a quantitative response variable from a quantitative explanatory variable.

The correlation between two quantitative variables provides us with a measure of the strength and direction of a linear relationship. Obtaining a correlation does not require us to identify one of the two variables as being predicted from the other. However, there are situations in which we want to predict one variable from another. A response variable or dependent variable is one we predict from one or more other variables; an explanatory variable or independent variable is one from which predictions are made. Typically, we let $Y$ represent the response (dependent) variable and let $X$ represent the explanatory (independent) variable. With the SURVEY DATA, displayed as Data Set 1-1 at the end of Unit 1, we might wish to predict yearly income from age, making $Y$ = "yearly income" the response variable and $X$ = "age" the explanatory variable. With the Dosage and Reaction Time Data of Table 10-2, it is natural to think of predicting reaction time from drug dosage, making $Y$ = "reaction time" the response variable and $X$ = "drug dosage" the explanatory variable.

Regression refers to the prediction of one quantitative response variable from one or more quantitative explanatory variables. We will be primarily concerned with simple linear regression, which refers to the prediction of one response variable from one explanatory variable using the equation of a straight line written in the form $Y = a + bX$. In this equation, $a$ is called the intercept of the line, and $b$ is called the slope of the line.

To illustrate the role played by the slope and intercept, let us suppose that $X$ is temperature measured in degrees Celsius and $Y$ is temperature measured in degrees Fahrenheit. It will always be true that $Y = 32 + 1.8X$, since this is the well-known formula for changing from degrees Celsius to degrees Fahrenheit.
For instance, to convert a temperature of 52 degrees Celsius to degrees Fahrenheit, $X = 52$ is substituted into the equation, from which we find that 52 degrees Celsius is equal to $Y = 32 + 1.8(52) = 125.6$ degrees Fahrenheit.

The intercept of a line ($a$) is the value of $Y$ when $X$ is equal to zero. When degrees Celsius is 0, then the degrees Fahrenheit is $a = 32$, since $32 + 1.8(0) = 32$. Next, we observe that when $X = 1$, $Y = 33.8$; when $X = 2$, $Y = 35.6$; when $X = 3$, $Y = 37.4$; etc., from which we find that each time $X$ is increased by 1 unit, $Y$ is increased by $b = 1.8$ units. The slope of a line represents the amount of change in $Y$ each time $X$ is increased by 1 unit. In this case, each time degrees Celsius is increased by 1 degree, the degrees Fahrenheit is increased by 1.8 degrees.

To graph a line, all we really need to do is plot and connect any two distinct points on the line. For instance, to graph $Y = 32 + 1.8X$, which is done in Figure 11-1, we can connect the points (0, 32) and (100, 212), both of which are on the line.

There is a positive linear relationship between degrees Celsius and degrees Fahrenheit, because the slope is positive, implying that $Y$ increases whenever $X$ increases. With a negative linear relationship, the slope is negative, implying that $Y$ decreases whenever $X$ increases. This is illustrated by the line in Figure 11-2. The equation of this line is $Y = 10.5 - 0.6X$, which has a slope of $b = -0.6$ and an intercept of $a = 10.5$. It is easy to check that each time $X$ is increased by 1, $Y$ is decreased by 0.6, so the slope ($b$) is indeed $-0.6$; it is also easy to see that when $X = 0$, then $Y = 10.5$, which is the intercept ($a$).

Now, let us return to Figure 7-4, which (if we momentarily ignore the line in the figure) is the scatter plot of $X$ = age and $Y$ = yearly income in the SURVEY DATA, displayed as Data Set 1-1 at the end of Unit 1. There appears to be a positive relationship, but this relationship is not a perfect positive one, since the points do not all lie on a straight line. If the points in a scatter plot do not all lie on a straight line, how can we possibly find the equation of a line which will describe this relationship? Since we cannot find the equation of a line which goes through every data point, we do the next best thing by finding the equation of the line which comes closest to all the data points in some sense. One popular method to do this is to find the line which minimizes the sum of the squared vertical distances between the data points and the line; this line is called the least squares line. The line that has been graphed with the scatter plot in Figure 7-4 is the least squares line.

We shall not attempt to compare the merits of the method of least squares with those of competing methods, nor shall we attempt to derive the equation of the line resulting from the method of least squares. We shall simply state the formulas for the slope and intercept to find the least squares line for predicting $Y$ from $X$ with a bivariate data set. The slope in the least squares line can be obtained from $b = r \dfrac{s_y}{s_x}$, where $s_x$ and $s_y$ are the standard deviations of $X$ and $Y$. From this formula, one sees that the slope $b$ of the least squares line and the correlation $r$ must always have the same sign.
Intuitively, this should be obvious, since a positive correlation suggests that $Y$ tends to increase as $X$ increases (implying a positive slope) and a negative correlation suggests that $Y$ tends to decrease as $X$ increases (implying a negative slope). Once the slope of the least squares line is available, the intercept can be obtained from $a = \bar{y} - b\bar{x}$. We can use the equation of the least squares line to make predictions about $Y$ from $X$ or to describe the relationship between $X$ and $Y$.

Let us illustrate how we can obtain the least squares line for predicting $Y$ = yearly income from $X$ = age with the SURVEY DATA, displayed as Data Set 1-1 at the end of Unit 1. Using the calculations from Table 10-1, we have previously found that $\bar{x} = 44.1$, $\bar{y} = 45.4$, $s_x = 11.4691$, $s_y = 15.8953$, and $r = +0.477$. Using the formula for the slope in the least squares line and then the formula for the intercept in the least squares line, we find that

$b = r \dfrac{s_y}{s_x} = (0.477)\dfrac{15.8953}{11.4691} = 0.661$ and $a = \bar{y} - b\bar{x} = 45.4 - (0.661)(44.1) = 16.2$.

We could write the equation of the least squares line as $Y = 16.2 + 0.661X$; however, in order to emphasize what $X$ and $Y$ represent, we might prefer to write the equation as inc = 16.2 + 0.661(age), where inc is an abbreviation for yearly income in thousands of dollars. The slope of this least squares line, $b = 0.661$ thousand dollars, is an estimate for the average amount of change in yearly income accompanying an increase of one year in age; that is, for each one-year increase in age, the yearly income among the voters represented by the SURVEY DATA is estimated to increase on average by 661 dollars. Figure 7-4, which displays the graph of this least squares line on a scatter plot, provides a visual picture of how well the least squares line fits the data. At a later time, we shall discuss ways of deciding whether or not the fit is a good one.

We may use the least squares line to predict yearly income with a given age or to estimate the average yearly income for a given age group, as long as the age is within the range of the ages observed in the data. For example, we estimate the average yearly income for 50-year-old voters represented by the SURVEY DATA to be about 16.2 + 0.661(50) = 49.25 thousand dollars. If we wanted to predict the yearly income for a particular 30-year-old voter from the voters represented by the SURVEY DATA, our prediction would be 16.2 + 0.661(30) = 36.03 thousand dollars (which is of course the same as the estimated average yearly income for 30-year-olds).

As a general rule, estimation and prediction outside the range of the observed values of $X$ should be avoided. For instance, if we blindly substituted an age of $X = 10$ years into the least squares line, we would estimate the average yearly income for 10-year-olds to be 16.2 + 0.661(10) = 22.8 thousand dollars. We hope you agree that this is a totally meaningless estimate! Examination of the SURVEY DATA will reveal that the ages of the voters in the data ranged from 20 to 62 years old. This implies that our least squares line can be applied only to ages in this range. In fact, we should expect the relationship between age and yearly income to be quite different for at least some ages outside the range from 20 to 62 years.

We shall now have you find the least squares line for predicting $Y$ = the reaction time to a particular stimulus (in seconds) from $X$ = the dosage of a certain drug (in grams) with the data of Table 10-2. It should be clear that reaction time is the response variable and dosage is the explanatory variable.
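The SURVEY DATA calculation above and the exercise that follows both rely on the same two formulas, so a small sketch may be useful. The Python below is a minimal illustration and not part of the original unit: the function names `least_squares_from_summary` and `predict` are hypothetical, the summary statistics are the ones quoted from Table 10-1, and the observed age range of 20 to 62 years is taken from the discussion above.

```python
def least_squares_from_summary(x_bar, y_bar, s_x, s_y, r):
    """Return (intercept a, slope b) of the least squares line Y = a + bX."""
    b = r * s_y / s_x          # slope: b = r * s_y / s_x
    a = y_bar - b * x_bar      # intercept: a = y-bar - b * x-bar
    return a, b


def predict(a, b, x, x_min, x_max):
    """Evaluate a + b*x, refusing to extrapolate outside [x_min, x_max]."""
    if not (x_min <= x <= x_max):
        raise ValueError(f"x = {x} is outside the observed range {x_min} to {x_max}")
    return a + b * x


# Summary statistics quoted in the text (income in thousands of dollars).
a, b = least_squares_from_summary(x_bar=44.1, y_bar=45.4,
                                  s_x=11.4691, s_y=15.8953, r=0.477)
a, b = round(a, 1), round(b, 3)   # round the way the text does

print(f"inc = {a} + {b}(age)")
print(predict(a, b, 50, 20, 62))  # estimated average income at age 50

try:
    predict(a, b, 10, 20, 62)     # age 10 lies outside the observed ages
except ValueError as err:
    print("refused:", err)
```

Raising an error for ages outside the observed range is one possible design choice; it simply makes the "do not extrapolate" rule hard to overlook.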
In order to find the least squares line, recall that the calculations done in Table 10-3 were used to find that $\bar{x} = 7$, $\bar{y} = 4.1$, $s_x = \sqrt{5.714}$, $s_y = \sqrt{4.720}$, and $r = -0.9628$. Once you have found the least squares line, graph this line on Figure 10-1, which is a scatter plot of the data. (You should find that the least squares line is rct = 10.225 - 0.875(dg), where rct represents reaction time in seconds, and dg represents dosage in grams; Figure 11-3 is a graph of the least squares line on the scatter plot.)

The slope of this least squares line for predicting reaction time from dosage, $b = -0.875$ seconds, is an estimate for the average amount of change in reaction time accompanying an increase of one gram in dosage; that is, for each one-gram increase in dosage of the drug, the reaction time is estimated to decrease on average by 0.875 seconds. Note that since the relationship between reaction time and drug dosage is a negative one, both the slope $b$ and the correlation $r$ are negative. Use the least squares line to estimate the average reaction time with a drug dosage of 9 grams and to predict the reaction time with a dosage of 7 grams. (You should find that the estimated average reaction time with a drug dosage of 9 grams is 2.35 seconds, and that the predicted reaction time with a dosage of 7 grams is 4.1 seconds.)

Earlier, we stated that using the least squares line for estimation and prediction outside the range of the observed values of $X$ should be avoided. There is one exception to this rule. When our explanatory variable is time, and we are attempting to predict a quantitative variable for a future time period, then of course after observing values of $Y$ for several time periods, our goal is to make predictions for future time periods outside the range of the data. Data observed over a sequence of time periods is called a time series. The first two columns of Table 9-1 are an example of time series data for prices, and the first and third columns of Table 9-1 are an example of time series data for quantities.
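For readers who want to check the reaction-time exercise earlier in this unit, the arithmetic can be laid out as follows. This is a worked sketch rather than part of the original unit; it assumes the Table 10-3 summary values are read as $\bar{x} = 7$, $\bar{y} = 4.1$, $s_x = \sqrt{5.714}$, $s_y = \sqrt{4.720}$, and $r = -0.9628$.

```latex
\begin{align*}
b &= r\,\frac{s_y}{s_x} = (-0.9628)\sqrt{\frac{4.720}{5.714}}
   \approx (-0.9628)(0.9089) \approx -0.875,\\
a &= \bar{y} - b\,\bar{x} = 4.1 - (-0.875)(7) = 4.1 + 6.125 = 10.225,\\
\text{rct at 9 grams} &= 10.225 - 0.875(9) = 10.225 - 7.875 = 2.35 \text{ seconds},\\
\text{rct at 7 grams} &= 10.225 - 0.875(7) = 10.225 - 6.125 = 4.1 \text{ seconds}.
\end{align*}
```

The prediction at 7 grams equals $\bar{y}$ because the intercept formula $a = \bar{y} - b\bar{x}$ forces the least squares line to pass through the point $(\bar{x}, \bar{y})$.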
One may attempt to use a least squares line with time series data to make predictions for the future, but such predictions will usually not be very accurate, because very few quantitative variables change over time in a purely linear fashion. Describing how a quantitative variable changes over time almost always requires the use of a nonlinear relationship. Although soon we shall discuss ways of deciding whether the assumption of a linear relationship is warranted (as opposed to some other type of relationship), we shall not have time to explore the many different nonlinear relationships which exist.

There is one final remark that we shall make concerning the topic of correlation and linear regression. Recall that we previously cautioned against interpreting a strong correlation between two variables as an indication that changes in one variable cause changes in the other variable. In order to study whether or not changes in a variable cause changes in another variable, we must perform a more sophisticated statistical analysis than simply finding a correlation. To establish that a causal effect exists, it is often necessary to be able to control the values of an explanatory variable. For instance, when predicting reaction time from drug dosage with the Dosage and Reaction Time Data of Table 10-2, it seems obvious that the dosages were carefully selected by the experimenter. However, when predicting $Y$ = yearly income from $X$ = age with the SURVEY DATA, displayed as Data Set 1-1 at the end of Unit 1, the ages in the SURVEY DATA look random. With the Dosage and Reaction Time Data, the experimenter had control over the values of $X$ (dosage), whereas with the SURVEY DATA, the experimenter had no control over the ages of the individuals in the study.

Self-Test Problem 11-1. In Self-Test Problem 10-1, the Age and Grip Strength Data, displayed in Table 10-4, is used in a study of the relationship between age and grip strength among right-handed males. Suppose there is interest in the prediction of grip strength from age.
(a) Identify the response variable and the explanatory variable.
(b) In Self-Test Problem 10-1, it was found that $\bar{x} = 18$, $\bar{y} = 62$, $s_x = \sqrt{23.273}$, $s_y = \sqrt{157.091}$, and $r = +0.7698$. Find the equation of the least squares line, and write a one-sentence interpretation of the slope of the least squares line.
(c) Use the least squares line to predict the grip strength of a 20-year-old right-handed male.
(d) Use the least squares line to estimate the average grip strength of 15-year-old right-handed males.
(e) Give an example of an age for which using this least squares line to predict grip strength would be inappropriate.
(f) On average, what is the change in grip strength with a five-year increase in age?
(g) Suppose you are told that a particular right-handed male has a grip strength of 72 lbs.; use the least squares line to estimate this male's age.
(h) Construct a scatter plot of the data, and graph the least squares line on the scatter plot.
(i) Do the values for the explanatory variable in the data look like they were controlled by the experimenter, or do they look random?

Self-Test Problem 11-2. Suppose that time series data is recorded for the monthly sales of ice cream at a particular ice cream parlor. Explain why using a least squares line to predict sales for future months would probably not be very accurate.

Answers to Self-Test Problems

11-1
(a) Grip strength is the response variable and age is the explanatory variable.
(b) The least squares line can be written as grp = 26 + 2(age), where grp represents grip strength in lbs. For each increase of one year in age, grip strength increases on average by about 2 lbs.
(c) The predicted grip strength of a 20-year-old right-handed male is 66 lbs.
(d) The estimated average grip strength of 15-year-old right-handed males is 56 lbs.
(e) It is not appropriate to use this least squares line to predict grip strength for any age outside 11 to 25 years (the age range in the data).
(f) For each increase of five years in age, grip strength increases on average by about $5b = 10$ lbs.
(g) The estimated age is about 23 years.
(h) See Figure 11-4.
(i) The values for the explanatory variable age in the data look random.

11-2
Ice cream sales are not likely to change in a linear fashion from month to month, since sales will tend to be higher in summer months and lower in winter months, resulting in a nonlinear cycle within each year.
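The answer to Self-Test Problem 11-2 can be illustrated with made-up numbers. The sketch below is only an illustration under stated assumptions: the monthly sales figures are synthetic (a constant level of 100 plus a 12-month cosine cycle), not data from any real ice cream parlor, and the helper names `fit_line` and `synthetic_sales` are hypothetical. It fits a least squares line to two years of such sales and then compares the line's forecasts for the third year with the values the same cycle would produce.

```python
import math


def fit_line(xs, ys):
    """Least squares intercept a and slope b computed from raw data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def synthetic_sales(month):
    """Hypothetical sales: level 100 with a yearly cycle peaking at months 0, 12, 24, ..."""
    return 100 + 40 * math.cos(2 * math.pi * month / 12)


train_months = list(range(24))                       # two observed years
train_sales = [synthetic_sales(m) for m in train_months]
a, b = fit_line(train_months, train_sales)

future_months = list(range(24, 36))                  # the year to be forecast
errors = [abs((a + b * m) - synthetic_sales(m)) for m in future_months]
max_error = max(errors)

print(f"fitted line: sales = {a:.1f} + {b:.3f}(month)")
print(f"largest forecast error in year 3: {max_error:.1f}")
```

With these synthetic values the straight line stays close to the overall level of sales and misses the seasonal peaks and troughs, so the largest forecast error is a sizable fraction of the 40-unit seasonal swing.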
Summary

A response variable or dependent variable is one we predict from one or more other variables; an explanatory variable or independent variable is one from which predictions are made. Regression refers to the prediction of one quantitative response variable from one or more quantitative explanatory variables. Simple linear regression refers to the prediction of one response variable from one explanatory variable using the equation of a straight line written in the form $Y = a + bX$. In this equation, $a$ is called the intercept of the line, and $b$ is called the slope of the line. The intercept is the value of $Y$ when $X = 0$; the slope is the change in $Y$ whenever $X$ is increased by one unit. To graph a line, we only need to plot and connect any two distinct points on the line.

One popular method to find the equation of a line which comes closest to all the data points is to use the least squares line, which is the line minimizing the sum of the squared vertical distances between the data points and the line. The slope of the least squares line can be obtained from $b = r \dfrac{s_y}{s_x}$. Once the slope of the least squares line is available, the intercept can be obtained from $a = \bar{y} - b\bar{x}$. We may use the least squares line to predict $Y$ with a given value of $X$ or to estimate the average value of $Y$ for a given value of $X$, as long as the given value of $X$ is within the range of the observed data.

Data observed over a sequence of time periods is called a time series. Predictions for the future made by using a least squares line with time series data are usually not very accurate, because quantitative variables almost always change over time in a nonlinear fashion.
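To tie the summary together, the following sketch computes a least squares line from raw data and checks two claims made in this unit: that the slope equals $r s_y / s_x$, and that nudging the line away from the least squares solution increases the sum of squared vertical distances. The five data points are hypothetical and chosen only to keep the arithmetic simple; the function names are likewise illustrative. Only the Python standard library is used.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r computed from deviations about the means."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (s_xx * s_yy) ** 0.5


def sum_squared_distances(a, b, xs, ys):
    """Sum of squared vertical distances between the points and the line Y = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical bivariate data (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
b = r * stdev(ys) / stdev(xs)      # slope: b = r * s_y / s_x
a = mean(ys) - b * mean(xs)        # intercept: a = y-bar - b * x-bar

best = sum_squared_distances(a, b, xs, ys)
print(f"least squares line: Y = {a:.2f} + {b:.2f}X, sum of squares = {best:.2f}")

# Nudging the intercept or slope in either direction makes the fit worse.
for da, db in [(0.1, 0), (-0.1, 0), (0, 0.05), (0, -0.05)]:
    worse = sum_squared_distances(a + da, b + db, xs, ys)
    print(f"a{da:+.2f}, b{db:+.2f}: sum of squares = {worse:.2f}")
```

Trying a handful of nearby lines is not a proof that the least squares line is optimal, but it does make the "closest in the squared vertical distance sense" idea concrete.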