CHAPTER 5 RELATIONSHIPS BETWEEN QUANTITATIVE VARIABLES

In this chapter, we will learn how to describe the relationship between two quantitative variables. Remember (from Chapter 2) that the terms quantitative variable and measurement variable are synonyms for data that can be recorded as numerical values and then ordered according to those values. The relationship between weight and height is an example of a relationship between two quantitative variables.

The questions we ask about the relationship between two variables often concern specific numerical features of the association. For example, we may want to know how much weight will increase on average for each one-inch increase in height. Or, we may want to estimate what the college grade point average will be for a student whose high school grade point average was 3.5.

We will use three tools to describe, picture, and quantify the relationship between two quantitative variables:

• Scatter plot, a two-dimensional graph of data values.
• Correlation, a statistic that measures the strength and direction of a linear relationship between two quantitative variables.
• Regression equation, an equation that describes the average relationship between a quantitative response variable and an explanatory variable.

5.1 Looking for Patterns with Scatter Plots

A scatter plot is a two-dimensional graph of the measurements for two numerical variables. A point on the graph represents the combination of measurements for an individual observation. The vertical axis, which is called the y-axis, is used to locate the value of one of the variables. The horizontal axis, called the x-axis, is used to locate the values of the other variable.

As we learned in Chapter 2, when looking at relationships we can often identify one of the variables as an explanatory variable that may explain or cause differences in the response variable. The term dependent variable is used as a synonym for response variable. In a scatter plot, the response variable is plotted on the vertical axis (the y-axis), so it may also be called the y variable. The explanatory variable is plotted along the horizontal axis (the x-axis) and may be called the x variable.

Questions to Ask about a Scatter Plot

• What is the average pattern? Does it look like a straight line or is it curved?
• What is the direction of the pattern?
• How much do individual points vary from the average pattern?
• Are there any unusual data points?

Example 1. Height and Hand-Span

Table 5.1 displays the first 12 observations of a data set that includes the heights (in inches) and fully stretched hand-spans (in centimeters) of 167 college students. The data values for all 167 students are the raw data for studying the connection between height and hand-span. Imagine how difficult it would be to see the pattern in the data if all 167 observations were shown in Table 5.1. Even when we just look at the data for 12 students, it takes a while to confirm that there does seem to be a tendency for taller people to have larger hand-spans.

Figure 5.1 is a scatter plot that displays the hand-span and height measurements for all 167 students. The hand-span measurements are plotted along the vertical axis (y) and the height measurements are plotted along the horizontal axis (x). Each point represents the two measurements for an individual.

Table 5.1 Hand-Spans and Height

Ht. (in)   Span (cm)
71         23.5
69         22.0
66         18.5
64         20.5
71         21.0
72         24.0
67         19.5
65         20.5
76         24.5
67         20.0
70         23.0
62         17.0
and so on, for n = 167 observations

Figure 5.1 Hand-Span and Height

We see that taller people tend to have greater hand-span measurements than shorter people do. When two variables tend to increase together, as they do in Figure 5.1, we say that they have a positive association. Another noteworthy characteristic of the graph is that we can describe the general pattern of this relationship with a straight line. In other words, the hand-span and height measurements may have a linear relationship.

Positive and Negative Association

• Two variables have a positive association when the values of one variable tend to increase as the values of the other variable increase.
• Two variables have a negative association when the values of one variable tend to decrease as the values of the other variable increase.

Example 2. Driver Age and the Maximum Legibility Distance of Highway Signs

In a study of the legibility and visibility of highway signs, a Pennsylvania research firm determined the maximum distance at which each of thirty drivers could read a newly designed sign. The thirty participants in the study ranged in age from 18 to 82 years old. The government agency that funded the research hoped to improve highway safety for older drivers, and wanted to examine the relationship between age and the sign legibility distance.

Table 5.2 lists the data and Figure 5.2 shows a scatter plot of the ages and distances. The sign legibility distance is the response variable, so that variable is plotted on the y-axis (the vertical axis). The maximum reading distance tends to decrease as age increases, so there is a negative association between distance and age. This is not a surprising result. As a person gets older, his or her eyesight tends to get worse, so we would expect the distances to decrease with age.

The researchers collected the data to determine numerical estimates for two questions about the relationship:

• How much does the distance decrease when age is increased?
• For drivers of any specific age, what is the average distance at which the sign can be read?

We'll examine these questions in the next section. For now, we simply point out that the pattern in the graph looks linear, so a straight-line equation that links distance to age will help us answer these questions.

Table 5.2 Data Values for Example 2 (distance in feet)

Age   Distance     Age   Distance     Age   Distance
18    510          37    420          68    300
20    590          41    460          70    390
22    560          46    450          71    320
23    510          49    380          72    370
23    460          53    460          73    280
25    490          55    420          74    420
27    560          63    350          75    460
28    510          65    420          77    360
29    460          66    300          79    310
32    410          67    410          82    360
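
The direction of an association can also be checked numerically. The sketch below is one simple way to do it, not the method this chapter uses: it looks at the sign of the summed products of deviations from the two means. The function name association_direction is an assumption made for illustration. The data are the 12 rows of Table 5.1 and the 30 rows of Table 5.2.

```python
def association_direction(xs, ys):
    """Classify direction from the sign of sum((x - mean_x) * (y - mean_y)).

    Hypothetical helper for illustration; a positive sum means the two
    variables tend to be above (or below) their means together.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "none"


# Table 5.1: first 12 of the 167 height (in) and hand-span (cm) observations.
heights = [71, 69, 66, 64, 71, 72, 67, 65, 76, 67, 70, 62]
spans = [23.5, 22.0, 18.5, 20.5, 21.0, 24.0, 19.5, 20.5, 24.5, 20.0, 23.0, 17.0]

# Table 5.2: driver age (years) and maximum legibility distance (feet).
ages = [18, 20, 22, 23, 23, 25, 27, 28, 29, 32,
        37, 41, 46, 49, 53, 55, 63, 65, 66, 67,
        68, 70, 71, 72, 73, 74, 75, 77, 79, 82]
distances = [510, 590, 560, 510, 460, 490, 560, 510, 460, 410,
             420, 460, 450, 380, 460, 420, 350, 420, 300, 410,
             300, 390, 320, 370, 280, 420, 460, 360, 310, 360]

print(association_direction(heights, spans))
print(association_direction(ages, distances))
```

A sign check like this says nothing about whether the pattern is linear or how strong it is, so it complements the scatter plot rather than replacing it.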

Figure 5.2 Driver Age and the Maximum Distance at Which Highway Sign is Read (Source: Adapted from data collected by Last Resource, Inc., Bellefonte, PA)

5.2 Describing Linear Patterns with a Regression Line

Scatter plots show us a lot about a relationship, but we often want more specific numerical descriptions of how the response and explanatory variables are related. Imagine, for example, that we are examining the weights and heights of a sample of college women. We might want to know what the increase in average weight is for each one-inch increase in height. Or, we might want to estimate the average weight for women with a specific height, like 5'10".

Regression analysis is the area of statistics used to examine the relationship between a quantitative response variable and one or more explanatory variables. A key element of regression analysis is the estimation of an equation that describes how, on average, the response variable is related to the explanatory variables. This regression equation can be used to answer the types of questions that we just asked about the weights and heights of college women.

A regression equation can also be used to make predictions. For instance, it might be useful for colleges to have an equation for the connection between verbal SAT score and college grade point average (GPA). They could use that equation to predict the potential GPAs of future students, based on their verbal SAT scores. Some colleges actually do this kind of prediction to decide whom to admit, but they use a collection of variables to predict GPA. The prediction equation for GPA usually includes high school GPA, high school rank, verbal and math SAT scores, and possibly other factors such as a rating of the student's high school or the quality of an application essay.

There are many types of relationships and many types of regression equations. The simplest kind of relationship between two variables is a straight line, and that is the only type we will discuss here. Straight-line relationships occur frequently in practice, so this is a useful and important type of regression equation. Before we use a straight-line regression model, however, we should always examine a scatter plot to verify that the pattern actually is linear. We remind you of the music preference and age example, where a straight line definitely does not describe the pattern of the data.

Interpreting the Regression Equation and Regression Line

When the best equation for describing the relationship between x and y is a straight line, the resulting equation is called the regression line. This line is used for two purposes:

• to estimate the average value of y at any specified value of x
• to predict the value of y for an individual, given that individual's x value

Example 1 Revisited. Describing Height and Hand-Span with a Regression Line

In Figure 5.1, we saw that the relationship between hand-span and height has a straight-line pattern. Figure 5.6 displays the same scatter plot as Figure 5.1, but now a line is shown that describes the average relationship between the two variables. To determine this line, we used statistical software (Minitab) to find the best line for this set of measurements. We'll discuss the criterion for "best" later. Presently, let's focus on what the line tells us about the data.

The line drawn through the scatter plot is the regression line, and it describes how average hand-span is linked to height. For example, when the height is 60 inches, the vertical position of the line is at about 18 centimeters. To see this, locate 60 inches along the horizontal axis (x-axis), look up to the line, and then read the vertical axis to determine the hand-span value. The result is that we can estimate that people 60 inches tall have an average hand-span of about 18 centimeters (roughly 7 inches). We can also use the line to predict the hand-span for an individual whose height is known. For instance, someone 60 inches tall is predicted to have a hand-span of about 18 centimeters.

Figure 5.6 Regression Line Describing Hand-Span and Height

If we estimate the average hand-span at a different height, we can determine how much hand-span changes, on average, when height is varied. Let's use the line to estimate the average hand-span for people who are 70 inches tall. We see that the vertical location of the regression line is somewhere between 21 and 22 centimeters, perhaps about 21.5 centimeters (roughly 8.5 inches). So, when height is increased from 60 inches to 70 inches, average hand-span increases from about 18 centimeters to about 21.5 centimeters.

The average hand-span increased by 3.5 centimeters (about 1.4 inches) when the height was increased by 10 inches. This is a rate of 3.5/10 = 0.35 centimeters per one-inch increase in height, which is the slope of the line. For each one-inch difference in height, there is about a 0.35 centimeter average difference in hand-span.
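
The rate calculation above is just the slope between two points read off the line. As a small sketch, the function name slope_between is an assumption for illustration, and the two points are the approximate readings quoted in the text (60 inches, 18 cm) and (70 inches, 21.5 cm):

```python
def slope_between(point_a, point_b):
    """Change in y divided by change in x between two points on a line."""
    (x1, y1), (x2, y2) = point_a, point_b
    return (y2 - y1) / (x2 - x1)


# Approximate readings from the regression line in Figure 5.6.
rate = slope_between((60, 18.0), (70, 21.5))
print(round(rate, 2))  # centimeters of hand-span per inch of height
```

Any other pair of points on the same straight line would give the same rate, which is why the slope can serve as a single summary of the line's steepness.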

Algebra Reminder

The equation for a straight line relating y and x is:

\[
y = b_0 + b_1 x
\]

where b0 is the "y-intercept" and b1 is the slope. When x = 0, y = y-intercept. The slope of a line can be determined by picking any two points on the line, and then calculating

\[
\text{slope} = \frac{\text{difference between } y \text{ values}}{\text{difference between } x \text{ values}} = \frac{y_2 - y_1}{x_2 - x_1}
\]

The letter y represents the vertical direction and x represents the horizontal direction. The slope tells us how much the y variable changes for each increase of one unit in the x variable.

We ordinarily don't have to read the regression line as we just did. Statistical software will tell us the regression equation, the specific equation used to draw the line. For the hand-span and height relationship, the regression equation determined by statistical software is:

Hand-span = -3 + 0.35 Height

When emphasis is on using the equation to estimate the average hand-spans for specific heights, we may write:

Average Hand-span = -3 + 0.35 Height

When emphasis is on using the equation to predict an individual hand-span, we might instead write:

Predicted Hand-span = -3 + 0.35 Height

In most situations, the correct statistical interpretation of a regression equation is that it estimates the average value of a response variable (y) for individuals with a specific value of the explanatory variable (x). The equation Hand-span = -3 + 0.35 Height tells us how to draw the line, but not all individuals follow this pattern exactly. Look again at Figure 5.6, in which we see that the line describes the overall pattern, but we also see substantial individual deviation from this line.

Let's use the regression equation to estimate the average hand-spans for some specific heights.

For height = 60, average hand-span = -3 + 0.35(60) = -3 + 21 = 18 cm.
For height = 70, average hand-span = -3 + 0.35(70) = -3 + 24.5 = 21.5 cm.

In the equation, the value 0.35 multiplies the height. This value is the slope of the straight line that links hand-span and height. Consistent with our estimates above, the slope in this example tells us that hand-span increases by 0.35 centimeters, on average, for each increase of one inch in height.

We can use the slope to estimate the average difference in hand-span for any difference in height. If we consider two heights that differ by 7 inches, our estimate of the difference in hand-spans would be 7 × 0.35 = 2.45 centimeters, or approximately one inch.

The Equation for the Regression Line

All straight lines can be expressed by the same formula, in which y is the variable on the vertical axis and x is the variable on the horizontal axis. The equation for a regression line is:

\[
\hat{y} = b_0 + b_1 x
\]

In any given situation, the sample is used to determine numbers that replace b0 and b1.

• ŷ is spoken as "y-hat" and it is also referred to either as predicted y or estimated y.
• b0 is the intercept of the straight line. The intercept is the value of y when x = 0.
• b1 is the slope of the straight line. The slope tells us how much of an increase (or decrease) there is for the y variable when the x variable increases by one unit. The sign of the slope tells us whether y increases or decreases when x increases.
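
The hand-span calculations in this section can be sketched in a few lines of Python. The function and constant names are assumptions made for illustration. The intercept is taken as -3, the value consistent with the worked arithmetic (a height of 60 inches gives 18 cm).

```python
INTERCEPT_CM = -3.0          # b0 in the hand-span equation
SLOPE_CM_PER_INCH = 0.35     # b1 in the hand-span equation


def average_hand_span(height_in):
    """Estimated average hand-span (cm) for people of a given height (inches)."""
    return INTERCEPT_CM + SLOPE_CM_PER_INCH * height_in


def average_difference(height_gap_in):
    """Estimated average hand-span difference (cm) for a given height gap (inches)."""
    return SLOPE_CM_PER_INCH * height_gap_in


for height in (60, 70):
    print(height, round(average_hand_span(height), 2))

print(round(average_difference(7), 2))
```

Note that the intercept drops out of average_difference: the difference between two estimates on the same line depends only on the slope.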

Interpreting a Regression Line

• ŷ estimates the average y for a specific value of x. It also can be used as a prediction of the value of y for an individual with a specific value of x.
• The slope of the line estimates the average increase in y for each one-unit increase in x.
• The intercept of the line is the value of y when x = 0. Note that interpreting the intercept in the context of statistical data only makes sense if x = 0 is included in the range of observed x-values.

Example 2 Revisited: Driver Age and the Maximum Legibility Distance of Highway Signs

The regression line ŷ = 577 - 3x describes how the maximum sign legibility distance (the y variable) is related to driver age (the x variable). Statistical software was used to calculate this equation and to create the graph shown in Figure 5.7. Earlier, we asked these two questions about distance and age:

• How much does the distance decrease when age is increased?
• For drivers of any specific age, what is the average distance at which the sign can be read?

The slope of the equation can be used to answer the first question. Remember that the slope is the number that multiplies the x variable, and the sign of the slope indicates the direction of the association. Here, the slope tells us that, on average, the legibility distance decreases 3 feet when age increases by one year. This information can be used to estimate the average change in distance for any difference in ages. For an age increase of 30 years, the estimated decrease in legibility distance is 90 feet because the slope is -3 feet per year.

The question about estimating the average legibility distances for a specific age is answered by using the specific age as the x value in the regression equation. To emphasize this use of the regression line, we write it as:

Average distance = 577 - 3 Age

Here are the results for three different ages:

AGE   AVERAGE DISTANCE
20    577 - 3(20) = 517 feet
50    577 - 3(50) = 427 feet
80    577 - 3(80) = 337 feet

The equation can also be used to predict the distance measurement for an individual driver with a specific age. To emphasize this use of the regression line, we write the equation as:

Predicted distance = 577 - 3 Age

For example, we can predict that the legibility distance for a 20-year-old will be 517 feet and for an 80-year-old will be 337 feet.
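
The same calculations can be sketched in Python. The function name average_distance_ft is an assumption for illustration. The intercept 577 and the slope of -3 feet per year are the values that reproduce the results quoted in the text (517, 427, and 337 feet).

```python
def average_distance_ft(age_years):
    """Estimated average legibility distance (feet) for drivers of a given age."""
    return 577 - 3 * age_years


# Average distance at the three ages used in the text.
results = {age: average_distance_ft(age) for age in (20, 50, 80)}
print(results)

# Estimated change in distance for a 30-year increase in age.
change_over_30_years = average_distance_ft(50) - average_distance_ft(20)
print(change_over_30_years)
```

The negative change reflects the negative slope: each additional year of age lowers the estimated average distance by 3 feet.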

Figure 5.7 Regression Line for Driver Age and Sign Legibility Distance

The Least Squares Line, Errors and Residuals

We can use statistical software to estimate the regression line, but how does the computer find the best equation for a set of data? The most commonly used method is called least squares, and the regression line determined by this method is called the least squares line. The phrase least squares is actually a shortened version of least sum of squared errors. This criterion focuses on the differences between the values of the response variable (y) and the regression line. The response variable is emphasized because we often use the equation to predict that variable for specific values of the explanatory variable (x). Therefore, we should minimize how far off the predictions will be in that direction.

For any given line, we can calculate the predicted value of y for each point in the observed data. To do this for any particular point, we use the observed x value in the equation. We then determine the prediction error for each point. An error is simply the difference between the observed y value and the predicted value ŷ. These errors are squared and added up for all of the points in the sample. The least squares line minimizes the sum of the squared errors.

This terminology is somewhat misleading, since the amount by which an individual differs from the line is seldom due to "errors" in the measurements. A more neutral term for the difference (y - ŷ) is that it is the residual for that individual.

The Least Squares Criterion

When we use a line to predict the values of y, the sum of squared differences between the observed values of y and the predicted values is smaller for the least squares line than it is for any other line.

There is a mathematical solution that produces general formulas for the slope and intercept of the least squares line. These formulas are used by all statistical software, spreadsheet programs, and statistical calculators. To be complete, we include the formulas. You won't need them if you use the computer to do a regression analysis.
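
Before the slope and intercept formulas, it may help to see the least squares criterion itself written in symbols. The sample size n and the index i are notational assumptions added here. For a candidate line with intercept b0 and slope b1, the sum of squared errors is

\[
\text{SSE}(b_0, b_1) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_i)\bigr)^2
\]

The least squares line is the choice of b0 and b1 that makes this sum as small as possible.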

Formulas for the Slope and Intercept of the Least Squares Line

b1 is the slope and b0 is the intercept.

\[
b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
\]

• $x_i$ represents the x measurement for the ith observation.
• $y_i$ represents the y measurement for the ith observation.
• $\bar{x}$ represents the mean of the x measurements.
• $\bar{y}$ represents the mean of the y measurements.

Example 2 Revisited. Errors for the Highway Sign Data

The least squares regression line for Example 2 is ŷ = 577 - 3x, where y = maximum sign legibility distance and x = driver age. For this equation, the calculation of the errors and the squared errors for the first three data points shown in Table 5.2 is:

x     y     ŷ = 577 - 3x          error = y - ŷ       squared error
18    510   577 - 3(18) = 523     510 - 523 = -13     169
20    590   577 - 3(20) = 517     590 - 517 = 73      5329
22    560   577 - 3(22) = 511     560 - 511 = 49      2401

This process can be carried out for all 30 data points. The sum of the squared errors is smaller for the line ŷ = 577 - 3x than it would be for any other line.
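
The slope and intercept formulas, together with the sum of squared errors, can be sketched in Python. This is a minimal illustration, not the routine any particular statistics package uses, and the function names are assumptions. The data are only the 12 rows shown in Table 5.1, so the fitted line will not match the equation reported for all 167 students.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) using the slope and intercept formulas."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared differences between observed y and predicted y."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Table 5.1: first 12 of the 167 height (in) and hand-span (cm) observations.
heights = [71, 69, 66, 64, 71, 72, 67, 65, 76, 67, 70, 62]
spans = [23.5, 22.0, 18.5, 20.5, 21.0, 24.0, 19.5, 20.5, 24.5, 20.0, 23.0, 17.0]

b0, b1 = least_squares_line(heights, spans)
best = sum_squared_errors(heights, spans, b0, b1)

# Nudge the intercept or the slope slightly; each nudged line should fit worse.
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.01), (0.0, -0.01)]
nudged = [sum_squared_errors(heights, spans, b0 + d0, b1 + d1) for d0, d1 in nudges]

print(round(b0, 2), round(b1, 3), round(best, 2))
print(all(best < other for other in nudged))
```

Comparing against nudged lines is only a spot check of the least squares criterion. The formulas guarantee the minimum over all possible lines, not just the few tried here.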

5.3 Measuring Strength and Direction with Correlation

The linear pattern is so common that a statistic was created to characterize this type of relationship. The statistical correlation between two quantitative variables is a number that indicates the strength and the direction of a straight-line relationship.

• The strength of the relationship is determined by the closeness of the points to a straight line.
• The direction is determined by whether one variable generally increases or generally decreases when the other variable increases.

As used in statistics, the meaning of the word correlation is much more specific than it is in everyday life. A statistical correlation only describes linear relationships. Whenever a correlation is calculated, a straight line is used as the frame of reference for evaluating the relationship. When the pattern is nonlinear, as it was for the music preference data shown in Figure 5.3, a correlation is not an appropriate way to measure the strength of the relationship.

Correlation is represented by the letter r. Sometimes this measure is called the "Pearson product moment correlation" or the "correlation coefficient." The formula for correlation is complicated. Fortunately, all statistical software programs and many calculators provide a way to easily calculate this statistic.

A Formula for Correlation

\[
r = \frac{1}{n-1} \sum_i \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)
\]

• $n$ is the sample size.
• $x_i$ is the x measurement for the ith observation.
• $\bar{x}$ is the mean of the x measurements.
• $s_x$ is the standard deviation of the x measurements.
• $y_i$ is the y measurement for the ith observation.
• $\bar{y}$ is the mean of the y measurements.
• $s_y$ is the standard deviation of the y measurements.

Interpreting the Correlation Coefficient

Some specific features of the correlation coefficient are:

• Correlation coefficients are always between -1 and +1.
• The magnitude of the correlation indicates the strength of the relationship, which is the overall closeness of the points to a straight line. The sign of the correlation does not tell us about the strength of the linear relationship.
• A correlation of either +1 or -1 indicates that there is a perfect linear relationship and all data points fall on the same straight line.
• The sign of the correlation indicates the direction of the relationship. A positive correlation indicates that the two variables tend to increase together (a positive association). A negative correlation indicates that when one variable increases the other is likely to decrease (a negative association).
• A correlation of 0 indicates that the best straight line through the data is exactly horizontal, so that knowing the value of x does not change the predicted value of y.
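
The correlation formula can be sketched directly in Python. The function names are assumptions made for illustration, and the standard deviations are assumed to use the n - 1 divisor, which is what makes the formula produce values between -1 and +1. The data are only the 12 rows shown in Table 5.1, so the result will differ from the correlation for all 167 students.

```python
import math


def sample_std(values):
    """Sample standard deviation, using the n - 1 divisor (an assumption here)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def correlation(xs, ys):
    """Average product of standardized x and y values, with divisor n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = sample_std(xs)
    s_y = sample_std(ys)
    total = sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


# Table 5.1: first 12 of the 167 height (in) and hand-span (cm) observations.
heights = [71, 69, 66, 64, 71, 72, 67, 65, 76, 67, 70, 62]
spans = [23.5, 22.0, 18.5, 20.5, 21.0, 24.0, 19.5, 20.5, 24.5, 20.0, 23.0, 17.0]

r = correlation(heights, spans)
print(round(r, 3))

# Points that fall exactly on an increasing straight line give r = +1.
on_a_line = [2 * h + 1 for h in heights]
print(round(correlation(heights, on_a_line), 6))
```

Because each measurement is standardized before the products are formed, the result does not depend on the units: recording height in centimeters instead of inches would give the same r.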

Example 1 Revisited. The Correlation between Hand-Span and Height

The relationship between hand-span and height appears to be linear, so a correlation is useful for characterizing the strength of the relationship. For these data, the correlation is r = +0.74, a value that indicates a somewhat strong positive relationship. A look back at Figure 5.1 shows us that average hand-span definitely increases when height increases, but within any specific height there is some natural variation among individual hand-spans.

Example 2 Revisited. The Correlation between Age and Sign Legibility Distance

The correlation for the data shown in Figure 5.2 is r = -0.8, a value that indicates a somewhat strong negative association between the variables.

The Interpretation of r², the Squared Correlation

The squared value of the correlation is frequently used to describe the strength of a relationship. A squared correlation, r², always has a value between 0 and 1, although some computer programs will express its value as a percentage between 0 and 100%. By squaring the correlation, we retain information about the strength of the relationship, but we lose information about the direction.

Researchers typically use the phrase "proportion of variation explained by x" in conjunction with the squared correlation, r². For example, if r² = 0.60 (or 60%), the researcher may write that the explanatory variable explains 60% of the variation in the response variable. If r² = 0.10 (or 10%), the explanatory variable only explains 10% of the variation in the response variable.

This interpretation stems from the use of the least squares line as a prediction tool. Let's consider Example 1 again. For that example, the regression equation is hand-span = -3 + 0.35 height. The correlation is 0.74 and r² = (0.74)² = 0.55 (or 55%). We can say that height explains 55% of the variation in hand-span, but what does it mean to say this?

Suppose that we ignore the height information when we make predictions of hand-span. In other words, suppose we don't use the least squares line to predict hand-span. For the 167 students in the data set, ȳ, the average hand-span, is about 20.8 centimeters. If we ignore the least squares line, we could use this value to predict the hand-span for any individual, regardless of his or her height. Our prediction equation is simply hand-span = 20.8.

For both this equation and the least squares equation involving height, we can compute the sum of squared differences between the actual hand-span values and the predicted values. The sum of squared differences between observed y values and the sample mean ȳ is called the total variation in y or sum of squares total and is denoted as SSTO. The sum of squared differences between observed y values and the predicted values based on the least squares line is called the sum of squared errors and is denoted by SSE. Remember that errors are sometimes called residuals, and a synonym for sum of squared errors is residual sum of squares.

Whenever the correlation is not 0, the least squares line will produce generally better predictions than the sample mean, so SSE will be smaller than SSTO. The squared correlation expresses the reduction in squared prediction error as a fraction of the total variation. This leads to the formula

\[
r^2 = \frac{\text{SSTO} - \text{SSE}}{\text{SSTO}}
\]

It can be shown (using algebra) that this quantity is exactly equal to the square of the correlation.

Let's consider r² for Examples 5 and 7. In Example 5, the correlation between left and right hand-spans is 0.95, so r² is 0.90, or about 90%. This indicates that the span of one hand is very predictable if we know the span of the other hand (see Figure 5.8). In Example 7, the correlation between television viewing hours and age is only r = 0.12. The squared correlation is about 0.014. As we can see from the scatter plot in Figure 5.10, knowing a person's age doesn't help us predict how much television he or she watches per day.
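
The comparison between predicting with the sample mean and predicting with the least squares line can be sketched in Python. The function name fit_line is an assumption for illustration. Only the 12 rows shown in Table 5.1 are used, so the resulting r² will differ from the 0.55 reported for all 167 students.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for a single explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b1 * mean_x, b1


# Table 5.1: first 12 of the 167 height (in) and hand-span (cm) observations.
heights = [71, 69, 66, 64, 71, 72, 67, 65, 76, 67, 70, 62]
spans = [23.5, 22.0, 18.5, 20.5, 21.0, 24.0, 19.5, 20.5, 24.5, 20.0, 23.0, 17.0]

# SSTO: squared prediction errors when every prediction is the sample mean.
mean_span = sum(spans) / len(spans)
ssto = sum((y - mean_span) ** 2 for y in spans)

# SSE: squared prediction errors when predictions come from the least squares line.
b0, b1 = fit_line(heights, spans)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(heights, spans))

r_squared = (ssto - sse) / ssto
print(round(ssto, 2), round(sse, 2), round(r_squared, 3))
```

The printed r² is the fraction by which the squared prediction error shrinks when height is used, which is the sense in which height "explains" variation in hand-span.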