Chapter 8 Linear Regression


1. Cereals.
Substituting Fiber = 9 into the regression equation gives a predicted potassium content of 28 mg. According to the model, we expect a cereal with 9 grams of fiber to have about 28 milligrams of potassium.

2. Horsepower.
Substituting HP = 2 into the regression equation gives a predicted fuel economy of 3.7 mpg. According to the model, we expect a car with 2 horsepower to get about 3.7 miles per gallon.

3. More cereal.
A negative residual means that the potassium content is actually lower than the model predicts for a cereal with that much fiber.

4. Horsepower, again.
A positive residual means that the car gets better gas mileage than the model predicts for a car with that much horsepower.

5. Another bowl.
The model predicts that cereals will have approximately 27 more milligrams of potassium for each additional gram of fiber.

6. More horsepower.
The model predicts that cars lose an average of .84 miles per gallon for each additional horsepower.

7. Cereal again.
R² = (.93)². About 8.5% of the variability in potassium content is accounted for by the model.

8. Another car.
R² = (0.869)² ≈ 0.755. About 75.5% of the variability in fuel economy is accounted for by the model.

9. Last bowl!
True potassium contents of cereals vary from the predicted values with a standard deviation of 3.77 milligrams.

10. Last tank!
True fuel economy varies from the predicted amount with a standard deviation given by the residual standard deviation, measured in miles per gallon.

11. Residuals.
a) The scattered residuals plot indicates an appropriate linear model.
b) The curved pattern in the residuals plot indicates that the linear model is not appropriate. The relationship is not linear.
c) The fanned pattern indicates that the linear model is not appropriate. The model's predicting power decreases as the values of the explanatory variable increase.
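Many of the exercises above reduce to plugging a value into a fitted line, y-hat = b0 + b1(x), and comparing the prediction with an observed value. For readers who want to check that arithmetic by computer, here is a minimal Python sketch; the intercept and slope below are made-up placeholders, not the coefficients from any of these exercises.

# Prediction and residual from a fitted least-squares line (placeholder coefficients).
def predict(x, intercept, slope):
    # Fitted value y-hat = intercept + slope * x
    return intercept + slope * x

def residual(observed_y, x, intercept, slope):
    # Residual = observed - predicted; positive means the model under-predicted.
    return observed_y - predict(x, intercept, slope)

b0, b1 = 38.0, 27.0                      # hypothetical intercept and slope
print(predict(9, b0, b1))                # prediction at x = 9
print(residual(300, 9, b0, b1))          # positive residual: actual above prediction

The sign convention is the whole point of exercises 3 and 4: a residual is always actual minus predicted.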

12. Residuals.
a) The curved pattern in the residuals plot indicates that the linear model is not appropriate. The relationship is not linear.
b) The fanned pattern indicates heteroscedastic data. The model's predicting power increases as the value of the explanatory variable increases.
c) The scattered residuals plot indicates an appropriate linear model.

13. What slope?
The only slope that makes sense is 3 pounds per foot. 3 pounds per foot is too small. For example, a Honda Civic is about 4 feet long, and a Cadillac DeVille is about 7 feet long. If the slope of the regression line were 3 pounds per foot, the Cadillac would be predicted to outweigh the Civic by only 9 pounds! (The real difference is about 5 pounds.) Similarly, 3 pounds per foot is too small; a slope of 3 pounds per foot would predict a weight difference of 9 pounds (4.5 tons) between Civic and DeVille. The only answer that is even reasonable is 3 pounds per foot, which predicts a difference of 9 pounds. This isn't very close to the actual difference of 5 pounds, but at least it is in the right ballpark.

14. What slope?
The only slope that makes sense is 1 foot in height per inch in circumference. A much smaller slope is too small: a trunk would have to increase in circumference by many inches for every foot in height, and if that were true, pine trees would be all trunk! A much larger slope is too large: if pine trees reach a maximum height of 6 feet, for instance, then the variation in circumference of the trunk would only be 6 inches, and pine tree trunks certainly come in more sizes than that. The only slope that is reasonable is 1 foot in height per inch in circumference.

15. Real estate.
a) The explanatory variable (x) is size, measured in square feet, and the response variable (y) is price, measured in thousands of dollars.
b) The units of the slope are thousands of dollars per square foot.
c) The slope of the regression line predicting price from size should be positive. Bigger homes are expected to cost more.

16. Roller coaster.
a) The explanatory variable (x) is initial drop, measured in feet, and the response variable (y) is duration, measured in seconds.
b) The units of the slope are seconds per foot.
c) The slope of the regression line predicting duration from initial drop should be positive. Coasters with higher initial drops probably provide longer rides.
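Exercises 11 and 12 judge a model entirely from the shape of its residual plot. As a rough illustration of how such a plot is produced, here is a minimal Python sketch on synthetic data, assuming numpy and matplotlib are available; it is not the textbook's data or output.

# Fit a line and inspect the residual plot (synthetic data).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3 + 2 * x + rng.normal(0, 1, size=x.size)    # roughly linear data

slope, intercept = np.polyfit(x, y, 1)           # least-squares coefficients
fitted = intercept + slope * x
resid = y - fitted

plt.scatter(fitted, resid)                       # residuals vs. predicted values
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.show()

A scattered, patternless cloud supports the linear model; a curve or a fan of changing spread, as described in 11(b), 11(c), and 12(b), argues against it.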

17. Real estate again.
7.4% of the variability in price can be accounted for by variability in size. (In other words, 7.4% of the variability in price can be accounted for by the linear model.)

18. Coasters again.
2.4% of the variability in duration can be accounted for by variability in initial drop. (In other words, 2.4% of the variability in duration can be accounted for by the linear model.)

19. Real estate redux.
a) The correlation between size and price is r = √R² ≈ 0.845. The positive value of the square root is used, since the relationship is believed to be positive.
b) The price of a home that is one standard deviation above the mean size would be predicted to be 0.845 standard deviations (in other words, r standard deviations) above the mean price.
c) The price of a home that is two standard deviations below the mean size would be predicted to be 1.69 (or 2(0.845)) standard deviations below the mean price.

20. Another ride.
a) The correlation between drop and duration is r = √R² ≈ 0.352. The positive value of the square root is used, since the relationship is believed to be positive.
b) The duration of a coaster whose initial drop is one standard deviation below the mean drop would be predicted to be about 0.352 standard deviations (in other words, r standard deviations) below the mean duration.
c) The duration of a coaster whose initial drop is three standard deviations above the mean drop would be predicted to be about 1.056 (or 3(0.352)) standard deviations above the mean duration.

21. More real estate.
a) According to the linear model, the price of a home is expected to increase by $6 (.6 thousand dollars) for each additional square foot in size.
b) Substituting the size into the regression equation Price-hat = b0 + b1(Sqrft): according to the linear model, a 3 square-foot home is expected to have a price of $23,82.
c) According to the linear model, a 2 square-foot home is expected to have a price of $2,2. The asking price is $5,2, which is $6 below the prediction; -$6 is the (negative) residual.

22. Last ride.
a) According to the linear model, the duration of a coaster ride is expected to increase by about .242 seconds for each additional foot of initial drop.
b) Substituting Drop = 2 into the regression equation Duration-hat = b0 + b1(Drop) gives the duration the model predicts for a coaster with a 2-foot initial drop.
c) Substituting Drop = 5 gives the predicted duration for a coaster with a 5-foot initial drop. The advertised duration is shorter, at 2 seconds, so the residual (advertised minus predicted) is negative.

23. Misinterpretations.
a) R² is an indication of the strength of the model, not the appropriateness of the model. A scattered residuals plot is the indicator of an appropriate model.
b) Regression models give predictions, not actual values. The student should have said, "The model predicts that a bird of the given height is expected to have a wingspan of 7 inches."

24. More misinterpretations.
a) R² measures the amount of variation accounted for by the model. Literacy rate does not "determine" life expectancy; rather, 64% of the variability in life expectancy is accounted for by variability in literacy rate.
b) Regression models give predictions, not actual values. The student should have said, "The slope of the line shows that an increase of 5% in literacy rate is associated with an expected 2-year improvement in life expectancy."

25. ESP.
a) First, since no one has ESP, you must have scored 2 standard deviations above the mean by chance. On your next attempt, you are unlikely to duplicate the extraordinary event of scoring 2 standard deviations above the mean. You will likely regress toward the mean on your second try, getting a lower score. If you want to impress your friend, don't take the test again. Let your friend think you can read his mind!
b) Your friend doesn't have ESP, either. No one does. Your friend will likely regress toward the mean score on his second attempt as well, meaning his score will probably go up. If the goal is to get a higher score, your friend should try again.

26. SI jinx.
Athletes, especially rookies, usually end up on the cover of Sports Illustrated for extraordinary performances. If these performances represent the upper end of the distribution of performance for an athlete, future performance is likely to regress toward that athlete's average performance. An athlete's average performance usually isn't notable enough to land the cover of SI. Of course, there are always exceptions, like Michael Jordan, Tiger Woods, Serena Williams, and others.

27. Cigarettes.
a) A linear model is probably appropriate. The residuals plot shows some initially low points, but there is no clear curvature.
b) 92.4% of the variability in nicotine level is accounted for by variability in tar content. (In other words, 92.4% of the variability in nicotine level is accounted for by the linear model.)

28. Attendance 2006.
a) The linear model is appropriate. Although the relationship is not strong, it is reasonably straight, and the residuals plot shows no pattern.
b) 48.5% of the variability in attendance is accounted for by variability in the number of wins. (In other words, 48.5% of the variability is accounted for by the model.)
c) The residuals spread out. There is more variation in attendance as the number of wins increases.
d) The Yankees' attendance was about 3, fans more than we might expect given the number of wins. This is a positive residual.

29. Another cigarette.
a) The correlation between tar and nicotine is r = √R² ≈ 0.96. The positive value of the square root is used, since the relationship is believed to be positive. Evidence of the positive relationship is the positive coefficient of tar in the regression output.
b) The average nicotine content of cigarettes that are two standard deviations below the mean in tar content would be expected to be about 1.922 (that is, 2r) standard deviations below the mean nicotine content.
c) Cigarettes that are one standard deviation above average in nicotine content are expected to be about 0.96 standard deviations (in other words, r standard deviations) above the mean tar content.

30. Second inning.
a) The correlation between attendance and number of wins is r = √R² ≈ 0.697. The positive value of the square root is used, since the relationship is positive.
b) A team that is two standard deviations above the mean in number of wins would be expected to have attendance that is 1.394 (or 2(0.697)) standard deviations above the mean attendance.
c) A team that is one standard deviation below the mean in attendance would be expected to have a number of wins that is 0.697 standard deviations (in other words, r standard deviations) below the mean number of wins. The correlation between two variables is the same regardless of the direction in which predictions are made. Be careful, though, since the same is NOT true for predictions made using the slope of the regression equation. Slopes are valid only for predictions in the direction for which they were intended.
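Exercises 19, 20, 29, and 30 all use the same two facts: r is the square root of R² (with the sign of the association), and a regression expressed in z-scores predicts z-hat for y as r times the z-score of x. A small Python sketch of that arithmetic, with a made-up R² value rather than one from these exercises:

# r from R-squared, and prediction in standard-deviation (z-score) units.
import math

def r_from_r_squared(r_squared, positive_association=True):
    # The correlation is the square root of R^2; the direction supplies the sign.
    r = math.sqrt(r_squared)
    return r if positive_association else -r

def predicted_z(z_x, r):
    # Regression on z-scores: the predicted z for y is r times the z for x.
    return r * z_x

r = r_from_r_squared(0.485)      # hypothetical R^2 value
print(predicted_z(2.0, r))       # x two SDs above its mean -> y predicted 2r SDs above
print(predicted_z(-1.0, r))      # x one SD below its mean -> y predicted r SDs below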

31. Last cigarette.
a) Nicotine-hat = b0 + b1(Tar) is the equation of the regression line that predicts nicotine content from tar content of cigarettes, with the intercept b0 and slope b1 taken from the regression output.
b) Substituting Tar = 4 into the regression equation: the model predicts that a cigarette with 4 mg of tar will have about .44 mg of nicotine.
c) For each additional mg of tar, the model predicts an increase of .65 mg of nicotine.
d) The model predicts that a cigarette with no tar would have .54 mg of nicotine.
e) Substituting Tar = 7: the model predicts that a cigarette with 7 mg of tar will have .694 mg of nicotine. If the residual is .5, the cigarette actually had the predicted value plus .5 mg of nicotine.

32. Last inning 2006.
a) Attendance-hat = b0 + b1(Wins) is the equation of the regression line that predicts attendance from the number of games won by American League baseball teams.
b) Substituting Wins = 5 into the regression equation: the model predicts that a team with 5 wins will have attendance of 2,58 people.
c) For each additional win, the model predicts an increase in attendance equal to the slope, measured in people per win.
d) A negative residual means that the team's actual attendance is lower than the attendance the model predicts for a team with that many wins.
e) Substituting Wins = 83 gives the predicted attendance for the Cardinals. The actual attendance of 42,588 exceeds this prediction, a positive residual: the Cardinals drew far more people on average than the model predicted.

33. Income and housing revisited.
a) Yes. Both housing cost index and median family income are quantitative. The scatterplot is Straight Enough, although there may be a few outliers. The spread increases a bit for states with large median incomes, but we can still fit a regression line.
b) Using the summary statistics given in the problem, calculate the slope and intercept: b1 = r(s_HCI/s_MFI) with r = 0.65, and b0 = ȳ - b1(x̄). The regression equation that predicts HCI from MFI is HCI-hat = b0 + b1(MFI). (If you went back to Chapter 7 and found the regression equation from the original data instead, the result is not a huge difference!)
c) Substituting MFI = 44,993: the model predicts that a state with a median family income of $44,993 will have an average housing cost index given by HCI-hat = b0 + b1(44,993). (Using the regression equation calculated from the actual data would give approximately the same average housing cost index.)
d) The prediction is too low. Washington has a positive residual. (223.5 from the equation generated from the original data.)
e) The correlation is the slope of the regression line that relates z-scores, so the regression equation would be z-hat_HCI = 0.65 z_MFI.
f) The correlation is the slope of the regression line that relates z-scores, so the regression equation would be z-hat_MFI = 0.65 z_HCI.

34. Interest rates and mortgages.
a) Yes. Both interest rate and total mortgages are quantitative, and the scatterplot is Straight Enough. The spread is fairly constant, and there are no outliers.
b) Using the summary statistics given in the problem, calculate the slope and intercept: b1 = r(s_MortAmt/s_IntRate) and b0 = ȳ - b1(x̄). The regression equation that predicts total mortgage amount from interest rate is MortAmt-hat = b0 + b1(IntRate). (You could also go back to Chapter 7 and find the regression equation from the original data.)

c) Substituting IntRate = 2 into the regression equation: if interest rates were 2%, we would expect there to be $65.52 million in total mortgages. ($65.39 million if you worked with the actual data.)
d) We should be very cautious in making a prediction about an interest rate of 2%. It is well outside the range of our original x-variable, and care should always be taken when extrapolating. This prediction may not be appropriate.
e) The correlation is the slope of the regression line that relates z-scores, so the regression equation would be z-hat_MortAmt = 0.84 z_IntRate.
f) The correlation is the slope of the regression line that relates z-scores, so the regression equation would be z-hat_IntRate = 0.84 z_MortAmt.

35. Online clothes.
a) Using the summary statistics given in the problem, calculate the slope and intercept: b1 = r(s_Total/s_Age) and b0 = ȳ - b1(x̄). The regression equation that predicts total online clothing purchase amount from age is Total-hat = b0 + b1(Age).
b) Yes. Both total purchases and age are quantitative variables, and the scatterplot is Straight Enough, even though it is quite flat. There are no outliers, and the spread does not change throughout the plot.
c) Substituting Age = 8 into the regression equation gives the model's prediction for an 8-year-old's total yearly online clothing purchases; substituting Age = 5 gives the prediction for a 5-year-old.
d) R² = (.37)², expressed as a percentage.
e) This model would not be useful to the company. The scatterplot is nearly flat. The model accounts for almost none of the variability in total yearly purchases.
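Exercises 33 through 36 build the regression line from summary statistics alone, using b1 = r(s_y/s_x) and b0 = ȳ - b1(x̄). A minimal Python sketch of that computation follows; the numbers passed in are placeholders, not the values from these exercises.

# Least-squares slope and intercept from summary statistics alone.
def line_from_summaries(r, s_x, s_y, xbar, ybar):
    slope = r * s_y / s_x             # b1 = r * (s_y / s_x)
    intercept = ybar - slope * xbar   # b0 = ybar - b1 * xbar
    return intercept, slope

# Hypothetical summary statistics:
b0, b1 = line_from_summaries(r=0.65, s_x=7.0, s_y=117.0, xbar=46.0, ybar=338.0)
print(f"y-hat = {b0:.1f} + {b1:.2f} x")

Because the intercept is ȳ minus the slope times a large x̄, small rounding changes in the slope can move the intercept noticeably, which is the point made in 36(a).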

36. Online clothes II.
a) Using the summary statistics given in the problem, calculate the slope and intercept: b1 = r(s_Total/s_Income) with r = 0.722, and b0 = ȳ - b1(x̄). (Since the mean income is a relatively large number, the value of the intercept will vary based on the rounding of the slope. Notice that it is very close to zero in the context of yearly income.) The regression equation that predicts total online clothing purchase amount from income is Total-hat = b0 + b1(Income).
b) The assumptions for regression are met. Both variables are quantitative and the plot is Straight Enough. There are several possible outliers, but none of these points are extreme, and there are 5 data points to establish a pattern. The spread of the plot does not change throughout the range of income.
c) Substituting the two incomes into the regression equation: the model predicts that a person with $2, yearly income will make $28.4 in online purchases, and that a person with $8, yearly income will make $928.4 in online purchases. (Predictions may vary, based on rounding of the model.)
d) R² = (0.722)² ≈ 52.1%.
e) The model accounts for 52.1% of the variation in total yearly purchases, so the model would probably be useful to the company. Additionally, the difference between the predicted purchases of a person with $2, yearly income and $8, yearly income is of practical significance.

37. SAT scores.
a) The association between SAT Math scores and SAT Verbal scores was linear, moderate in strength, and positive. Students with high SAT Math scores typically had high SAT Verbal scores.
b) One student got a 5 Verbal and 8 Math. That set of scores doesn't seem to fit the pattern.

c) r = 0.685 indicates a moderate, positive association between SAT Math and SAT Verbal, but only because the scatterplot shows a linear relationship. Students who scored one standard deviation above the mean in SAT Math were expected to score 0.685 standard deviations above the mean in SAT Verbal. Additionally, R² = (0.685)² ≈ 0.469, so 46.9% of the variability in math score was accounted for by variability in verbal score.
d) The scatterplot of verbal and math scores shows a relationship that is straight enough, so a linear model is appropriate. Using the summary statistics, b1 = r(s_Math/s_Verbal) with r = 0.685, and b0 = ȳ - b1(x̄). The equation of the least squares regression line for predicting SAT Math score from SAT Verbal score is Math-hat = b0 + b1(Verbal).
e) For each additional point in verbal score, the model predicts an increase of .662 points in math score. A more meaningful interpretation might be scaled up: for each additional 10 points in verbal score, the model predicts an increase of 6.62 points in math score.
f) Substituting Verbal = 5 into the regression equation gives the math score the model expects for a student with a verbal score of 5.
g) Substituting Verbal = 8 gives the math score the model expects for a student with a verbal score of 8. She actually scored 8 on math, so her residual is her actual score minus this predicted score.

38. Success in college.
a) A scatterplot showed the relationship between combined SAT score and GPA to be reasonably linear, so a linear model is appropriate. Using the summary statistics, b1 = r(s_GPA/s_SAT) with r = 0.47, and b0 = ȳ - b1(x̄). The regression equation predicting GPA from SAT score is GPA-hat = b0 + b1(SAT).
b) The model predicts that a student with an SAT score of 0 would have a GPA equal to the y-intercept, .262. The y-intercept is not meaningful in this context, since both scores are impossible.

c) The model predicts that students with higher SAT scores tended to have higher GPAs; the stated increase in SAT score corresponds to a GPA that is .24 higher.
d) Substituting SAT = 2 into the regression equation: according to the model, a student with an SAT score of 2 is expected to have a GPA of about 3.23.
e) According to the model, SAT score is not a very good predictor of college GPA. R² = (.47)², which means that only 22.9% of the variability in GPA can be accounted for by the model. The rest of the variability is determined by other factors.
f) A student would prefer to have a positive residual. A positive residual means that the student's actual GPA is higher than the model predicts for someone with the same SAT score.

39. SAT, take 2.
a) r = 0.685. The correlation between SAT Math and SAT Verbal is a unitless measure of the degree of linear association between the two variables. It doesn't depend on the order in which you are making predictions.
b) The scatterplot of verbal and math scores shows a relationship that is straight enough, so a linear model is appropriate. Using the summary statistics, b1 = r(s_Verbal/s_Math) with r = 0.685, and b0 = ȳ - b1(x̄). The equation of the least squares regression line for predicting SAT Verbal score from SAT Math score is Verbal-hat = b0 + b1(Math).
c) A positive residual means that the student's actual verbal score was higher than the score the model predicted for someone with the same math score.
d) Substituting Math = 5 into the regression equation gives the verbal score the model expects for a person with a math score of 5.
e) Substituting that predicted verbal score into the equation from the previous exercise gives the math score the model expects for a person with that verbal score.

f) The prediction in part e) does not cycle back to the original 5 points because the regression equation used to predict math from verbal is a different equation than the regression equation used to predict verbal from math. One was generated by minimizing squared residuals in the verbal direction; the other was generated by minimizing squared residuals in the math direction. If a math score is one standard deviation above the mean, its predicted verbal score regresses toward the mean. The same is true for a verbal score used to predict a math score.

40. Success, part 2.
Using the summary statistics, b1 = r(s_SAT/s_GPA) with r = 0.47, and b0 = ȳ - b1(x̄). The regression equation to predict SAT score from GPA is SAT-hat = b0 + b1(GPA). Substituting GPA = 3.: the model predicts that a student with a GPA of 3. is expected to have an SAT score of about 868.

41. Used cars 2007.
a) We are attempting to predict the price in dollars of used Toyota Corollas from their age in years. (Scatterplot: Age and Price of Used Toyota Corollas; Price ($) vs. Age (years).)
b) There is a strong, negative, linear association between price and age of used Toyota Corollas.
c) The scatterplot provides evidence that the relationship is Straight Enough. A linear model will likely be an appropriate model.
d) Since R² = .944, simply take the square root to find r: √.944 ≈ .972. Since the association between age and price is negative, r = -0.972.
e) 94.4% of the variability in price of a used Toyota Corolla can be accounted for by variability in the age of the car.
f) The relationship is not perfect. Other factors, such as options, condition, and mileage, explain the rest of the variability in price.

42. Drug abuse.
a) The scatterplot shows a positive, strong, linear relationship. It is straight enough to make the linear model the appropriate model.
b) 87.3% of the variability in percentage of other drug usage can be accounted for by percentage of marijuana use.
c) R² = .873, so r = √.873 ≈ .934 (since the relationship is positive). Using b1 = r(s_Other/s_Marijuana) and b0 = ȳ - b1(x̄), the regression equation used to predict the percentage of teens who use other drugs from the percentage who have used marijuana is Other-hat = b0 + b1(Marijuana). (You could also find the regression equation from the actual data set from Chapter 7.)
d) According to the model, each additional percent of teens using marijuana is expected to add .6 percent to the percentage of teens using other drugs.
e) The results do not confirm marijuana as a gateway drug. They do indicate an association between marijuana and other drug usage, but association does not imply causation.
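The point made in 39(f) above, that predicting y from x and predicting x from y use two different least-squares lines, is easy to see numerically. Here is a rough Python sketch on synthetic data (not the SAT data from the exercise):

# The two regression directions give different lines (synthetic data).
import numpy as np

rng = np.random.default_rng(1)
verbal = rng.normal(600, 100, 200)
math_scores = 0.7 * verbal + rng.normal(0, 60, 200)   # correlated, not perfectly

slope_mv = np.polyfit(verbal, math_scores, 1)[0]      # predict math from verbal
slope_vm = np.polyfit(math_scores, verbal, 1)[0]      # predict verbal from math
r = np.corrcoef(verbal, math_scores)[0, 1]

print(slope_mv, 1 / slope_vm)      # not equal: the second line is not the first inverted
print(slope_mv * slope_vm, r**2)   # the product of the two slopes equals r squared

Each slope is r times a ratio of standard deviations, so each prediction regresses toward the mean of its own response variable.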

43. More used cars 2007.
a) The scatterplot from the previous exercise shows that the relationship is straight, so the linear model is appropriate. The regression equation to predict the price of a used Toyota Corolla from its age is Price-hat = b0 + b1(Years), with the coefficients taken from the computer regression output (dependent variable Price ($); R-squared = 94.4%, s = 86.2).
b) According to the model, for each additional year in age, the car is expected to drop $959 in price.
c) The model predicts that a new Toyota Corolla (0 years old) will cost $4,285.
d) Substituting Years = 7: according to the model, an appropriate price for a 7-year-old Toyota Corolla is $7573.
e) Buy the car with the negative residual. Its actual price is lower than predicted.
f) According to the model, a Corolla of the given age is expected to cost $4696. The car has an actual price of $35, so its residual is negative: the car costs $96 less than predicted.
g) The model would not be useful for predicting the price of a 2-year-old Corolla. The oldest car in the list is 3 years old. Predicting a price after 2 years would be an extrapolation.
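Exercise 43 reads the slope, intercept, and R² off computer regression output and then predicts and finds residuals. For readers who want to reproduce that kind of output themselves, here is a minimal sketch using scipy.stats.linregress on made-up car data, not the exercise's data set:

# Regression summaries and a prediction, via scipy (synthetic data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
age = rng.uniform(1, 13, 17)                           # ages in years (made up)
price = 14000 - 950 * age + rng.normal(0, 800, 17)     # made-up prices

fit = stats.linregress(age, price)
print(fit.slope, fit.intercept, fit.rvalue**2)         # slope, intercept, R-squared

predicted = fit.intercept + fit.slope * 7              # prediction for a 7-year-old car
print(predicted)
# A prediction at age 20 would extrapolate far beyond the observed ages, as in part (g).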

44. Birth rates 2005.
a) A scatterplot of the live birth rates in the US over time (births per 1000 women vs. year) shows an association that is negative, strong, and appears to be curved, with one low outlier, the rate of 4.8 live births per 1000 women age 15-44 in 1975. Generally, as time passes, the birth rate is getting lower.
b) Although the association is slightly curved, it is straight enough to try a linear model. From the computer regression output (dependent variable Birth Rate; R-squared = 67.4%, R-squared (adjusted) = 62.8%, s = .22), the linear regression model for predicting birth rate from year is Birthrate-hat = b0 + b1(Year).
c) The residuals plot (residuals vs. predicted births per 1000 women) shows a slight curve. Additionally, the scatterplot shows a low outlier for the year 1975. We may want to investigate further. At the very least, be cautious when using this model.
d) The model predicts that each passing year is associated with a decline in the birth rate, by an amount equal to the slope, in births per 1000 women.
e) Substituting Year = 1978: the model predicts about 6.74 births per 1000 women in 1978.
f) If the actual birth rate in 1978 was 5. births per 1000 women, the model predicted .74 births per 1000 women higher than the actual rate.
g) Substituting Year = 2010: according to the model, the birth rate in 2010 is predicted to be 3.6 births per 1000 women. This prediction seems a bit low. It is an extrapolation outside the range of the data, and furthermore, the model only explains 67% of the variability in birth rate. Don't place too much faith in this prediction.

h) Substituting Year = 2025: according to the model, the birth rate in 2025 is predicted to be .55 births per 1000 women. This prediction is an extreme extrapolation outside the range of the data, which is dangerous. No faith should be placed in this prediction.

45. Burgers.
a) The scatterplot of calories vs. fat content in fast food hamburgers (Fat and Calories of Fast Food Burgers; # of Calories vs. Fat (grams)) appears linear, so a linear model is appropriate.
b) From the computer regression output, R² = 92.3%. 92.3% of the variability in the number of calories can be explained by the variability in the number of grams of fat in a fast food burger.
c) From the computer regression output, the regression equation that predicts the number of calories in a fast food burger from its fat content is Calories-hat = b0 + b1(Fat).
d) The residuals plot shows no pattern. The linear model appears to be appropriate.
e) The model predicts that a fat-free burger would have a number of calories equal to the y-intercept. Since there are no data values close to 0 grams of fat, this is an extrapolation outside the data and isn't of much use.
f) For each additional gram of fat in a burger, the model predicts an increase of .56 calories.
g) Substituting Fat = 28 into the regression equation gives the predicted number of calories for a burger with 28 grams of fat. If the residual is +33, the actual number of calories is the predicted value plus 33.

46. Chicken.
a) The scatterplot is fairly straight, so the linear model is appropriate.
b) The correlation of .947 indicates a strong, linear, positive relationship between fat and calories for chicken sandwiches.

c) Using the summary statistics, b1 = r(s_Cal/s_Fat) with r = 0.947, and b0 = ȳ - b1(x̄). The linear model for predicting calories from fat in chicken sandwiches is Calories-hat = b0 + b1(Fat).
d) For each additional gram of fat, the model predicts an increase in calories equal to the slope.
e) According to the model, a fat-free chicken sandwich would have a number of calories equal to the y-intercept. This is probably an extrapolation, although without the actual data, we can't be sure.
f) In this context, a negative residual means that a chicken sandwich has fewer calories than the model predicts.

47. A second helping of burgers.
a) The model from the previous exercise was for predicting number of calories from number of grams of fat. In order to predict grams of fat from the number of calories, a new linear model needs to be generated.
b) The scatterplot (Calories and Fat in Fast Food Burgers; Fat (grams) vs. # of Calories) shows the relationship between number of fat grams and number of calories in a set of fast food burgers. The association is strong, positive, and linear. Burgers with higher numbers of calories typically have higher fat contents. The relationship is straight enough to apply a linear model. From the computer regression output (dependent variable Fat; R-squared = 92.3%), the linear model for predicting fat from calories is Fat-hat = b0 + b1(Calories). According to the slope, the fat content is expected to increase by about 8.3 grams for the corresponding increase in calories.

The residuals plot shows no pattern, so the model is appropriate. R² = 92.3%, so 92.3% of the variability in fat content can be accounted for by the model. Substituting Calories = 6 into the regression equation: according to the model, a burger with 6 calories is expected to have about 35 grams of fat.

48. A second helping of chicken.
a) The model from the previous exercise was for predicting number of calories from number of grams of fat. In order to predict grams of fat from the number of calories, a new linear model needs to be generated.
b) The scatterplot is fairly straight, so the linear model is appropriate. The correlation of .947 indicates a strong, linear, positive relationship between fat and calories for chicken sandwiches. Using the summary statistics, b1 = r(s_Fat/s_Cal) and b0 = ȳ - b1(x̄). The linear model for predicting fat from calories in chicken sandwiches is Fat-hat = b0 + b1(Calories). According to the linear model, a chicken sandwich with 4 calories is expected to have approximately 5.9 grams of fat.

49. Body fat.
a) The scatterplot of % body fat and weight of 2 male subjects (Weight and Body Fat; Body fat (%) vs. Weight (lb)) shows a strong, positive, linear association. Generally, as a subject's weight increases, so does % body fat. The association is straight enough to justify the use of the linear model. The linear model that predicts % body fat from weight is %Fat-hat = b0 + b1(Weight).
b) The residuals plot shows no apparent pattern. The linear model is appropriate.
c) According to the model, for each additional pound of weight, body fat is expected to increase by about .25%.
d) Only 48.5% of the variability in % body fat can be accounted for by the model. The model is not expected to make accurate predictions.

e) Substituting Weight = 9 into the regression equation gives the predicted body fat percentage for a 9-pound man; the residual is his actual percent body fat minus this prediction.

50. Body fat, again.
The scatterplot of % body fat and waist size shows an association that is strong, linear, and positive. As waist size increases, % body fat has a tendency to increase as well. The scatterplot is straight enough to justify the use of the linear model. The linear model for predicting % body fat from waist size is %Fat-hat = b0 + b1(Waist). For each additional inch in waist size, the model predicts an increase of 2.222% body fat. 78.7% of the variability in % body fat can be accounted for by waist size. The residuals plot (residuals vs. predicted body fat (%)) shows no apparent pattern. The residuals plot and the relatively high value of R² indicate an appropriate model with more predicting power than the model based on weight.

51. Heptathlon 2004.
a) Both high jump height and 800-meter time are quantitative variables, and the association is straight enough to use linear regression. (Computer regression output: dependent variable High Jump; R-squared = 16.4%, s = .67, 24 degrees of freedom; predictors Constant and 800m. Scatterplot: Heptathlon Results; high jump heights vs. 800-meter times (sec).)

The regression equation to predict high jump from 800m results is Highjump-hat = b0 + b1(800m). According to the model, the predicted high jump decreases by an average of .67 meters for each additional second in 800-meter time.
b) R² = 16.4%. This means that 16.4% of the variability in high jump height is accounted for by the variability in 800-meter time.
c) Yes, good high jumpers tend to be fast runners. The slope of the association is negative. Faster runners tend to jump higher, as well.
d) The residuals plot is fairly patternless. The scatterplot shows a slight tendency for less variation in high jump height among the slower runners than the faster runners. Overall, the linear model is appropriate.
e) The linear model is not particularly useful for predicting high jump performance. First of all, only 16.4% of the variability in high jump height is accounted for by the variability in 800-meter time, leaving 83.6% of the variability accounted for by other variables. Secondly, the residual standard deviation is .62 meters, which is not much smaller than the standard deviation of all high jumps, .66 meters. Predictions are not likely to be accurate.

52. Heptathlon 2004 again.
a) Both high jump height and long jump distance are quantitative variables, the association is straight enough, and there are no outliers. It is appropriate to use linear regression. (Computer regression output: dependent variable Long Jump; R-squared = 12.6%, s = .96, 24 degrees of freedom; predictors Constant and High Jump. Scatterplot: Heptathlon Results; long jump distance vs. high jump heights (meters).) The regression equation to predict long jump from high jump results is Longjump-hat = b0 + b1(Highjump). According to the model, the predicted long jump increases by an average of .54 meters for each additional meter in high jump height.
b) R² = 12.6%. This means that only 12.6% of the variability in long jump distance is accounted for by the variability in high jump height.
c) Yes, good high jumpers tend to be good long jumpers. The slope of the association is positive. Better high jumpers tend to be better long jumpers, as well.
d) The residuals plot is fairly patternless. The linear model is appropriate.
e) The linear model is not particularly useful for predicting long jump performance. First of all, only 12.6% of the variability in long jump distance is accounted for by the variability in high jump height, leaving 87.4% of the variability accounted for by other variables. Secondly, the residual standard deviation is .96 meters, which is about the same as the standard deviation of all long jumps, .26 meters. Predictions are not likely to be accurate.
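Parts (e) of exercises 51 and 52 judge a model by comparing the residual standard deviation to the standard deviation of the response: if the residuals are nearly as spread out as the response itself, the model adds little. A rough Python sketch of that comparison on synthetic data, not the heptathlon results:

# Compare the residual SD to the SD of y to see how much a weak model helps.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 100)
y = 0.4 * x + rng.normal(0, 1, 100)          # deliberately weak relationship

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

s_y = np.std(y, ddof=1)                      # SD of the response alone
s_e = np.std(resid, ddof=2)                  # residual SD (two estimated parameters)
print(s_y, s_e)                              # s_e only slightly smaller than s_y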

53. Least squares.
If the 4 x-values are plugged into the given equation for ŷ, the 4 predicted values are 8, 29, 5, and 62, respectively. The 4 residuals are 8, 2, 3, and 8. The squared residuals are 64, 44, 96, and 324, respectively. The sum of the squared residuals is 79. Least squares means that no other line has a sum lower than 79. In other words, it's the best fit.

54. Least squares.
If the 4 x-values are plugged into the given equation for ŷ, the 4 predicted values are 885, 795, 75, and 65, respectively. The 4 residuals are 65, -45, 95, and -5. The squared residuals are 4225, 225, 925, and 225, respectively. The sum of the squared residuals is 34,5. Least squares means that no other line has a sum lower than 34,5. In other words, it's the best fit.
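Exercises 53 and 54 compute each residual, square it, and add them up; "least squares" means the fitted line makes that sum as small as possible. A short Python sketch comparing a guessed line with the least-squares line on made-up data, not the values from the exercises:

# Sum of squared residuals for a candidate line vs. the least-squares line.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])           # made-up data
y = np.array([10.0, 18.0, 31.0, 40.0])

def sse(intercept, slope):
    # Sum of squared residuals for the line y-hat = intercept + slope * x
    resid = y - (intercept + slope * x)
    return float(np.sum(resid ** 2))

guessed = sse(3.0, 7.0)                      # any line you like
ls_slope, ls_intercept = np.polyfit(x, y, 1) # least-squares coefficients
best = sse(ls_intercept, ls_slope)
print(guessed, best)                         # best <= guessed, for any guessed line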

Homework 8 Solutions

Homework 8 Solutions Math 17, Section 2 Spring 2011 Homework 8 Solutions Assignment Chapter 7: 7.36, 7.40 Chapter 8: 8.14, 8.16, 8.28, 8.36 (a-d), 8.38, 8.62 Chapter 9: 9.4, 9.14 Chapter 7 7.36] a) A scatterplot is given below.

More information

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression Opening Example CHAPTER 13 SIMPLE LINEAR REGREION SIMPLE LINEAR REGREION! Simple Regression! Linear Regression Simple Regression Definition A regression model is a mathematical equation that descries the

More information

Chapter 7: Simple linear regression Learning Objectives

Chapter 7: Simple linear regression Learning Objectives Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -

More information

Chapter 7 Scatterplots, Association, and Correlation

Chapter 7 Scatterplots, Association, and Correlation 78 Part II Exploring Relationships Between Variables Chapter 7 Scatterplots, Association, and Correlation 1. Association. a) Either weight in grams or weight in ounces could be the explanatory or response

More information

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. Review MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. 1) All but one of these statements contain a mistake. Which could be true? A) There is a correlation

More information

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r), Chapter 0 Key Ideas Correlation, Correlation Coefficient (r), Section 0-: Overview We have already explored the basics of describing single variable data sets. However, when two quantitative variables

More information

2. Here is a small part of a data set that describes the fuel economy (in miles per gallon) of 2006 model motor vehicles.

2. Here is a small part of a data set that describes the fuel economy (in miles per gallon) of 2006 model motor vehicles. Math 1530-017 Exam 1 February 19, 2009 Name Student Number E There are five possible responses to each of the following multiple choice questions. There is only on BEST answer. Be sure to read all possible

More information

Answer: C. The strength of a correlation does not change if units change by a linear transformation such as: Fahrenheit = 32 + (5/9) * Centigrade

Answer: C. The strength of a correlation does not change if units change by a linear transformation such as: Fahrenheit = 32 + (5/9) * Centigrade Statistics Quiz Correlation and Regression -- ANSWERS 1. Temperature and air pollution are known to be correlated. We collect data from two laboratories, in Boston and Montreal. Boston makes their measurements

More information

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96 1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years

More information

STAT 350 Practice Final Exam Solution (Spring 2015)

STAT 350 Practice Final Exam Solution (Spring 2015) PART 1: Multiple Choice Questions: 1) A study was conducted to compare five different training programs for improving endurance. Forty subjects were randomly divided into five groups of eight subjects

More information

c. Construct a boxplot for the data. Write a one sentence interpretation of your graph.

c. Construct a boxplot for the data. Write a one sentence interpretation of your graph. MBA/MIB 5315 Sample Test Problems Page 1 of 1 1. An English survey of 3000 medical records showed that smokers are more inclined to get depressed than non-smokers. Does this imply that smoking causes depression?

More information

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Section 14 Simple Linear Regression: Introduction to Least Squares Regression Slide 1 Section 14 Simple Linear Regression: Introduction to Least Squares Regression There are several different measures of statistical association used for understanding the quantitative relationship

More information

ch12 practice test SHORT ANSWER. Write the word or phrase that best completes each statement or answers the question.

ch12 practice test SHORT ANSWER. Write the word or phrase that best completes each statement or answers the question. ch12 practice test 1) The null hypothesis that x and y are is H0: = 0. 1) 2) When a two-sided significance test about a population slope has a P-value below 0.05, the 95% confidence interval for A) does

More information

Simple linear regression

Simple linear regression Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

More information

Course Objective This course is designed to give you a basic understanding of how to run regressions in SPSS.

Course Objective This course is designed to give you a basic understanding of how to run regressions in SPSS. SPSS Regressions Social Science Research Lab American University, Washington, D.C. Web. www.american.edu/provost/ctrl/pclabs.cfm Tel. x3862 Email. SSRL@American.edu Course Objective This course is designed

More information

Section 3 Part 1. Relationships between two numerical variables

Section 3 Part 1. Relationships between two numerical variables Section 3 Part 1 Relationships between two numerical variables 1 Relationship between two variables The summary statistics covered in the previous lessons are appropriate for describing a single variable.

More information

Copyright 2007 by Laura Schultz. All rights reserved. Page 1 of 5

Copyright 2007 by Laura Schultz. All rights reserved. Page 1 of 5 Using Your TI-83/84 Calculator: Linear Correlation and Regression Elementary Statistics Dr. Laura Schultz This handout describes how to use your calculator for various linear correlation and regression

More information

Chapter 13 Introduction to Linear Regression and Correlation Analysis

Chapter 13 Introduction to Linear Regression and Correlation Analysis Chapter 3 Student Lecture Notes 3- Chapter 3 Introduction to Linear Regression and Correlation Analsis Fall 2006 Fundamentals of Business Statistics Chapter Goals To understand the methods for displaing

More information

Relationships Between Two Variables: Scatterplots and Correlation

Relationships Between Two Variables: Scatterplots and Correlation Relationships Between Two Variables: Scatterplots and Correlation Example: Consider the population of cars manufactured in the U.S. What is the relationship (1) between engine size and horsepower? (2)

More information

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number 1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x - x) B. x 3 x C. 3x - x D. x - 3x 2) Write the following as an algebraic expression

More information

. 58 58 60 62 64 66 68 70 72 74 76 78 Father s height (inches)

. 58 58 60 62 64 66 68 70 72 74 76 78 Father s height (inches) PEARSON S FATHER-SON DATA The following scatter diagram shows the heights of 1,0 fathers and their full-grown sons, in England, circa 1900 There is one dot for each father-son pair Heights of fathers and

More information

Name: Date: Use the following to answer questions 2-3:

Name: Date: Use the following to answer questions 2-3: Name: Date: 1. A study is conducted on students taking a statistics class. Several variables are recorded in the survey. Identify each variable as categorical or quantitative. A) Type of car the student

More information

Premaster Statistics Tutorial 4 Full solutions

Premaster Statistics Tutorial 4 Full solutions Premaster Statistics Tutorial 4 Full solutions Regression analysis Q1 (based on Doane & Seward, 4/E, 12.7) a. Interpret the slope of the fitted regression = 125,000 + 150. b. What is the prediction for

More information

Correlation Coefficient The correlation coefficient is a summary statistic that describes the linear relationship between two numerical variables 2

Correlation Coefficient The correlation coefficient is a summary statistic that describes the linear relationship between two numerical variables 2 Lesson 4 Part 1 Relationships between two numerical variables 1 Correlation Coefficient The correlation coefficient is a summary statistic that describes the linear relationship between two numerical variables

More information

Univariate Regression

Univariate Regression Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is

More information

Shape of Data Distributions

Shape of Data Distributions Lesson 13 Main Idea Describe a data distribution by its center, spread, and overall shape. Relate the choice of center and spread to the shape of the distribution. New Vocabulary distribution symmetric

More information

Linear Regression. Chapter 5. Prediction via Regression Line Number of new birds and Percent returning. Least Squares

Linear Regression. Chapter 5. Prediction via Regression Line Number of new birds and Percent returning. Least Squares Linear Regression Chapter 5 Regression Objective: To quantify the linear relationship between an explanatory variable (x) and response variable (y). We can then predict the average response for all subjects

More information

table to see that the probability is 0.8413. (b) What is the probability that x is between 16 and 60? The z-scores for 16 and 60 are: 60 38 = 1.

table to see that the probability is 0.8413. (b) What is the probability that x is between 16 and 60? The z-scores for 16 and 60 are: 60 38 = 1. Review Problems for Exam 3 Math 1040 1 1. Find the probability that a standard normal random variable is less than 2.37. Looking up 2.37 on the normal table, we see that the probability is 0.9911. 2. Find

More information

Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation

Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation Display and Summarize Correlation for Direction and Strength Properties of Correlation Regression Line Cengage

More information

Exercise 1.12 (Pg. 22-23)

Exercise 1.12 (Pg. 22-23) Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

More information

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a

More information

Non-Linear Regression 2006-2008 Samuel L. Baker

Non-Linear Regression 2006-2008 Samuel L. Baker NON-LINEAR REGRESSION 1 Non-Linear Regression 2006-2008 Samuel L. Baker The linear least squares method that you have een using fits a straight line or a flat plane to a unch of data points. Sometimes

More information

Chapter 9 Descriptive Statistics for Bivariate Data

Chapter 9 Descriptive Statistics for Bivariate Data 9.1 Introduction 215 Chapter 9 Descriptive Statistics for Bivariate Data 9.1 Introduction We discussed univariate data description (methods used to eplore the distribution of the values of a single variable)

More information

" Y. Notation and Equations for Regression Lecture 11/4. Notation:

 Y. Notation and Equations for Regression Lecture 11/4. Notation: Notation: Notation and Equations for Regression Lecture 11/4 m: The number of predictor variables in a regression Xi: One of multiple predictor variables. The subscript i represents any number from 1 through

More information

Algebra I Vocabulary Cards

Algebra I Vocabulary Cards Algebra I Vocabulary Cards Table of Contents Expressions and Operations Natural Numbers Whole Numbers Integers Rational Numbers Irrational Numbers Real Numbers Absolute Value Order of Operations Expression

More information

The correlation coefficient

The correlation coefficient The correlation coefficient Clinical Biostatistics The correlation coefficient Martin Bland Correlation coefficients are used to measure the of the relationship or association between two quantitative

More information

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ STA 3024 Practice Problems Exam 2 NOTE: These are just Practice Problems. This is NOT meant to look just like the test, and it is NOT the only thing that you should study. Make sure you know all the material

More information

Final Exam Practice Problem Answers

Final Exam Practice Problem Answers Final Exam Practice Problem Answers The following data set consists of data gathered from 77 popular breakfast cereals. The variables in the data set are as follows: Brand: The brand name of the cereal

More information

International Statistical Institute, 56th Session, 2007: Phil Everson

International Statistical Institute, 56th Session, 2007: Phil Everson Teaching Regression using American Football Scores Everson, Phil Swarthmore College Department of Mathematics and Statistics 5 College Avenue Swarthmore, PA198, USA E-mail: peverso1@swarthmore.edu 1. Introduction

More information

WHAT IS A BETTER PREDICTOR OF ACADEMIC SUCCESS IN AN MBA PROGRAM: WORK EXPERIENCE OR THE GMAT?

WHAT IS A BETTER PREDICTOR OF ACADEMIC SUCCESS IN AN MBA PROGRAM: WORK EXPERIENCE OR THE GMAT? WHAT IS A BETTER PREDICTOR OF ACADEMIC SUCCESS IN AN MBA PROGRAM: WORK EXPERIENCE OR THE GMAT? Michael H. Deis, School of Business, Clayton State University, Morrow, Georgia 3060, (678)466-4541, MichaelDeis@clayton.edu

More information

17. SIMPLE LINEAR REGRESSION II

17. SIMPLE LINEAR REGRESSION II 17. SIMPLE LINEAR REGRESSION II The Model In linear regression analysis, we assume that the relationship between X and Y is linear. This does not mean, however, that Y can be perfectly predicted from X.

More information

Module 3: Correlation and Covariance

Module 3: Correlation and Covariance Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

More information

Worksheet A5: Slope Intercept Form

Worksheet A5: Slope Intercept Form Name Date Worksheet A5: Slope Intercept Form Find the Slope of each line below 1 3 Y - - - - - - - - - - Graph the lines containing the point below, then find their slopes from counting on the graph!.

More information

Example: Boats and Manatees

Example: Boats and Manatees Figure 9-6 Example: Boats and Manatees Slide 1 Given the sample data in Table 9-1, find the value of the linear correlation coefficient r, then refer to Table A-6 to determine whether there is a significant

More information

Correlation key concepts:

Correlation key concepts: CORRELATION Correlation key concepts: Types of correlation Methods of studying correlation a) Scatter diagram b) Karl pearson s coefficient of correlation c) Spearman s Rank correlation coefficient d)

More information

Statistics 2014 Scoring Guidelines

Statistics 2014 Scoring Guidelines AP Statistics 2014 Scoring Guidelines College Board, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks of the College Board. AP Central is the official online home

More information

Chapter 23. Inferences for Regression

Chapter 23. Inferences for Regression Chapter 23. Inferences for Regression Topics covered in this chapter: Simple Linear Regression Simple Linear Regression Example 23.1: Crying and IQ The Problem: Infants who cry easily may be more easily

More information

STT 200 LECTURE 1, SECTION 2,4 RECITATION 7 (10/16/2012)

STT 200 LECTURE 1, SECTION 2,4 RECITATION 7 (10/16/2012) STT 200 LECTURE 1, SECTION 2,4 RECITATION 7 (10/16/2012) TA: Zhen (Alan) Zhang zhangz19@stt.msu.edu Office hour: (C500 WH) 1:45 2:45PM Tuesday (office tel.: 432-3342) Help-room: (A102 WH) 11:20AM-12:30PM,

More information

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1) CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.

More information

CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

More information

Linear Regression. use http://www.stat.columbia.edu/~martin/w1111/data/body_fat. 30 35 40 45 waist

Linear Regression. use http://www.stat.columbia.edu/~martin/w1111/data/body_fat. 30 35 40 45 waist Linear Regression In this tutorial we will explore fitting linear regression models using STATA. We will also cover ways of re-expressing variables in a data set if the conditions for linear regression

More information

WIN AT ANY COST? How should sports teams spend their m oney to win more games?

WIN AT ANY COST? How should sports teams spend their m oney to win more games? Mathalicious 2014 lesson guide WIN AT ANY COST? How should sports teams spend their m oney to win more games? Professional sports teams drop serious cash to try and secure the very best talent, and the

More information

The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces

The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces Or: How I Learned to Stop Worrying and Love the Ball Comment [DP1]: Titles, headings, and figure/table captions

More information

2. Simple Linear Regression

2. Simple Linear Regression Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

More information

MEASURES OF VARIATION

MEASURES OF VARIATION NORMAL DISTRIBTIONS MEASURES OF VARIATION In statistics, it is important to measure the spread of data. A simple way to measure spread is to find the range. But statisticians want to know if the data are

More information

Correlation and Regression

Correlation and Regression Correlation and Regression Scatterplots Correlation Explanatory and response variables Simple linear regression General Principles of Data Analysis First plot the data, then add numerical summaries Look

More information

The importance of graphing the data: Anscombe s regression examples

The importance of graphing the data: Anscombe s regression examples The importance of graphing the data: Anscombe s regression examples Bruce Weaver Northern Health Research Conference Nipissing University, North Bay May 30-31, 2008 B. Weaver, NHRC 2008 1 The Objective

More information

Statistics. Measurement. Scales of Measurement 7/18/2012

Statistics. Measurement. Scales of Measurement 7/18/2012 Statistics Measurement Measurement is defined as a set of rules for assigning numbers to represent objects, traits, attributes, or behaviors A variableis something that varies (eye color), a constant does

More information

Interpreting Data in Normal Distributions

Interpreting Data in Normal Distributions Interpreting Data in Normal Distributions This curve is kind of a big deal. It shows the distribution of a set of test scores, the results of rolling a die a million times, the heights of people on Earth,

More information

Lecture 13/Chapter 10 Relationships between Measurement (Quantitative) Variables

Lecture 13/Chapter 10 Relationships between Measurement (Quantitative) Variables Lecture 13/Chapter 10 Relationships between Measurement (Quantitative) Variables Scatterplot; Roles of Variables 3 Features of Relationship Correlation Regression Definition Scatterplot displays relationship

More information

First Midterm Exam (MATH1070 Spring 2012)

First Midterm Exam (MATH1070 Spring 2012) First Midterm Exam (MATH1070 Spring 2012) Instructions: This is a one hour exam. You can use a notecard. Calculators are allowed, but other electronics are prohibited. 1. [40pts] Multiple Choice Problems

More information

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs Types of Variables Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs Quantitative (numerical)variables: take numerical values for which arithmetic operations make sense (addition/averaging)

More information

Solution Let us regress percentage of games versus total payroll.

Solution Let us regress percentage of games versus total payroll. Assignment 3, MATH 2560, Due November 16th Question 1: all graphs and calculations have to be done using the computer The following table gives the 1999 payroll (rounded to the nearest million dolars)

More information

Simple Linear Regression, Scatterplots, and Bivariate Correlation

Simple Linear Regression, Scatterplots, and Bivariate Correlation 1 Simple Linear Regression, Scatterplots, and Bivariate Correlation This section covers procedures for testing the association between two continuous variables using the SPSS Regression and Correlate analyses.

More information

Logs Transformation in a Regression Equation

Logs Transformation in a Regression Equation Fall, 2001 1 Logs as the Predictor Logs Transformation in a Regression Equation The interpretation of the slope and intercept in a regression change when the predictor (X) is put on a log scale. In this

More information

Statistics 151 Practice Midterm 1 Mike Kowalski

Statistics 151 Practice Midterm 1 Mike Kowalski Statistics 151 Practice Midterm 1 Mike Kowalski Statistics 151 Practice Midterm 1 Multiple Choice (50 minutes) Instructions: 1. This is a closed book exam. 2. You may use the STAT 151 formula sheets and

More information

5 Double Integrals over Rectangular Regions

5 Double Integrals over Rectangular Regions Chapter 7 Section 5 Doule Integrals over Rectangular Regions 569 5 Doule Integrals over Rectangular Regions In Prolems 5 through 53, use the method of Lagrange multipliers to find the indicated maximum

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize

More information

Regression Analysis: A Complete Example

Regression Analysis: A Complete Example Regression Analysis: A Complete Example This section works out an example that includes all the topics we have discussed so far in this chapter. A complete example of regression analysis. PhotoDisc, Inc./Getty

More information

Chapter 1: Exploring Data

Chapter 1: Exploring Data Chapter 1: Exploring Data Chapter 1 Review 1. As part of survey of college students a researcher is interested in the variable class standing. She records a 1 if the student is a freshman, a 2 if the student

More information

Module 5: Multiple Regression Analysis

Module 5: Multiple Regression Analysis Using Statistical Data Using to Make Statistical Decisions: Data Multiple to Make Regression Decisions Analysis Page 1 Module 5: Multiple Regression Analysis Tom Ilvento, University of Delaware, College

More information

2013 MBA Jump Start Program. Statistics Module Part 3

2013 MBA Jump Start Program. Statistics Module Part 3 2013 MBA Jump Start Program Module 1: Statistics Thomas Gilbert Part 3 Statistics Module Part 3 Hypothesis Testing (Inference) Regressions 2 1 Making an Investment Decision A researcher in your firm just

More information

Descriptive statistics; Correlation and regression

Descriptive statistics; Correlation and regression Descriptive statistics; and regression Patrick Breheny September 16 Patrick Breheny STA 580: Biostatistics I 1/59 Tables and figures Descriptive statistics Histograms Numerical summaries Percentiles Human

More information

Predictor Coef StDev T P Constant 970667056 616256122 1.58 0.154 X 0.00293 0.06163 0.05 0.963. S = 0.5597 R-Sq = 0.0% R-Sq(adj) = 0.

Predictor Coef StDev T P Constant 970667056 616256122 1.58 0.154 X 0.00293 0.06163 0.05 0.963. S = 0.5597 R-Sq = 0.0% R-Sq(adj) = 0. Statistical analysis using Microsoft Excel Microsoft Excel spreadsheets have become somewhat of a standard for data storage, at least for smaller data sets. This, along with the program often being packaged

More information

THE LEAST SQUARES LINE (other names Best-Fit Line or Regression Line )

THE LEAST SQUARES LINE (other names Best-Fit Line or Regression Line ) Sales THE LEAST SQUARES LINE (other names Best-Fit Line or Regression Line ) 1 Problem: A sales manager noticed that the annual sales of his employees increase with years of experience. To estimate the

More information

Simple Predictive Analytics Curtis Seare

Simple Predictive Analytics Curtis Seare Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

More information

Lesson 3.2.1 Using Lines to Make Predictions

Lesson 3.2.1 Using Lines to Make Predictions STATWAY INSTRUCTOR NOTES i INSTRUCTOR SPECIFIC MATERIAL IS INDENTED AND APPEARS IN GREY ESTIMATED TIME 50 minutes MATERIALS REQUIRED Overhead or electronic display of scatterplots in lesson BRIEF DESCRIPTION

More information

DesCartes (Combined) Subject: Mathematics Goal: Statistics and Probability

DesCartes (Combined) Subject: Mathematics Goal: Statistics and Probability DesCartes (Combined) Subject: Mathematics Goal: Statistics and Probability RIT Score Range: Below 171 Below 171 Data Analysis and Statistics Solves simple problems based on data from tables* Compares

More information

We extended the additive model in two variables to the interaction model by adding a third term to the equation.

We extended the additive model in two variables to the interaction model by adding a third term to the equation. Quadratic Models We extended the additive model in two variables to the interaction model by adding a third term to the equation. Similarly, we can extend the linear model in one variable to the quadratic

More information

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median CONDENSED LESSON 2.1 Box Plots In this lesson you will create and interpret box plots for sets of data use the interquartile range (IQR) to identify potential outliers and graph them on a modified box

More information

Simple Linear Regression

Simple Linear Regression STAT 101 Dr. Kari Lock Morgan Simple Linear Regression SECTIONS 9.3 Confidence and prediction intervals (9.3) Conditions for inference (9.1) Want More Stats??? If you have enjoyed learning how to analyze

More information

Correlation and Simple Linear Regression

Correlation and Simple Linear Regression Correlation and Simple Linear Regression We are often interested in studying the relationship among variables to determine whether they are associated with one another. When we think that changes in a

More information

Introduction to Linear Regression

Introduction to Linear Regression 14. Regression A. Introduction to Simple Linear Regression B. Partitioning Sums of Squares C. Standard Error of the Estimate D. Inferential Statistics for b and r E. Influential Observations F. Regression

More information

STA-201-TE. 5. Measures of relationship: correlation (5%) Correlation coefficient; Pearson r; correlation and causation; proportion of common variance

STA-201-TE. 5. Measures of relationship: correlation (5%) Correlation coefficient; Pearson r; correlation and causation; proportion of common variance Principles of Statistics STA-201-TE This TECEP is an introduction to descriptive and inferential statistics. Topics include: measures of central tendency, variability, correlation, regression, hypothesis

More information

Transforming Bivariate Data

Transforming Bivariate Data Math Objectives Students will recognize that bivariate data can be transformed to reduce the curvature in the graph of a relationship between two variables. Students will use scatterplots, residual plots,

More information

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE STATISTICS The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE VS. INFERENTIAL STATISTICS Descriptive To organize,

More information

4. Multiple Regression in Practice

4. Multiple Regression in Practice 30 Multiple Regression in Practice 4. Multiple Regression in Practice The preceding chapters have helped define the broad principles on which regression analysis is based. What features one should look

More information

An analysis appropriate for a quantitative outcome and a single quantitative explanatory. 9.1 The model behind linear regression

An analysis appropriate for a quantitative outcome and a single quantitative explanatory. 9.1 The model behind linear regression Chapter 9 Simple Linear Regression An analysis appropriate for a quantitative outcome and a single quantitative explanatory variable. 9.1 The model behind linear regression When we are examining the relationship

More information

Means, standard deviations and. and standard errors

Means, standard deviations and. and standard errors CHAPTER 4 Means, standard deviations and standard errors 4.1 Introduction Change of units 4.2 Mean, median and mode Coefficient of variation 4.3 Measures of variation 4.4 Calculating the mean and standard

More information

Regression III: Advanced Methods

Regression III: Advanced Methods Lecture 16: Generalized Additive Models Regression III: Advanced Methods Bill Jacoby Michigan State University http://polisci.msu.edu/jacoby/icpsr/regress3 Goals of the Lecture Introduce Additive Models

More information

5. Multiple regression

5. Multiple regression 5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful

More information

Chapter 7: Modeling Relationships of Multiple Variables with Linear Regression

Chapter 7: Modeling Relationships of Multiple Variables with Linear Regression Chapter 7: Modeling Relationships of Multiple Variables with Linear Regression Overview Chapters 5 and 6 examined methods to test relationships between two variables. Many research projects, however, require

More information

AP Statistics Solutions to Packet 2

AP Statistics Solutions to Packet 2 AP Statistics Solutions to Packet 2 The Normal Distributions Density Curves and the Normal Distribution Standard Normal Calculations HW #9 1, 2, 4, 6-8 2.1 DENSITY CURVES (a) Sketch a density curve that

More information

Correlation. What Is Correlation? Perfect Correlation. Perfect Correlation. Greg C Elvers

Correlation. What Is Correlation? Perfect Correlation. Perfect Correlation. Greg C Elvers Correlation Greg C Elvers What Is Correlation? Correlation is a descriptive statistic that tells you if two variables are related to each other E.g. Is your related to how much you study? When two variables

More information

ALGEBRA I (Common Core) Thursday, January 28, 2016 1:15 to 4:15 p.m., only

ALGEBRA I (Common Core) Thursday, January 28, 2016 1:15 to 4:15 p.m., only ALGEBRA I (COMMON CORE) The University of the State of New York REGENTS HIGH SCHOOL EXAMINATION ALGEBRA I (Common Core) Thursday, January 28, 2016 1:15 to 4:15 p.m., only Student Name: School Name: The

More information

We are often interested in the relationship between two variables. Do people with more years of full-time education earn higher salaries?

We are often interested in the relationship between two variables. Do people with more years of full-time education earn higher salaries? Statistics: Correlation Richard Buxton. 2008. 1 Introduction We are often interested in the relationship between two variables. Do people with more years of full-time education earn higher salaries? Do

More information

Getting Correct Results from PROC REG

Getting Correct Results from PROC REG Getting Correct Results from PROC REG Nathaniel Derby, Statis Pro Data Analytics, Seattle, WA ABSTRACT PROC REG, SAS s implementation of linear regression, is often used to fit a line without checking

More information

The Importance of Statistics Education

The Importance of Statistics Education The Importance of Statistics Education Professor Jessica Utts Department of Statistics University of California, Irvine http://www.ics.uci.edu/~jutts jutts@uci.edu Outline of Talk What is Statistics? Four

More information

Interaction between quantitative predictors

Interaction between quantitative predictors Interaction between quantitative predictors In a first-order model like the ones we have discussed, the association between E(y) and a predictor x j does not depend on the value of the other predictors

More information

Correlational Research. Correlational Research. Stephen E. Brock, Ph.D., NCSP EDS 250. Descriptive Research 1. Correlational Research: Scatter Plots

Correlational Research. Correlational Research. Stephen E. Brock, Ph.D., NCSP EDS 250. Descriptive Research 1. Correlational Research: Scatter Plots Correlational Research Stephen E. Brock, Ph.D., NCSP California State University, Sacramento 1 Correlational Research A quantitative methodology used to determine whether, and to what degree, a relationship

More information