AP Statistics Topic 9 ~ Measures of Spread

Activity 9-1: Baseball Lineups

The table to the right contains data on the ages of the starting players for the two teams involved in game 1 of the 2010 National League Division Series (NLDS). Is there a relationship between the ages of the players on the teams and the outcome of the NLDS? The Reds lost in three games.

a. Identify the observational units. Also identify the explanatory and the response variables. Classify each variable as categorical or quantitative.
Observational units: starting players in game 1 of the 2010 NLDS
Explanatory variable: age. Type: quantitative
Response variable: NLDS winner. Type: categorical

b. Create comparative dotplots, using the axes shown here, for comparing the ages between the two teams. (Be sure to label which dotplot represents which team.) Comment on how the age distributions compare.
[Comparative dotplots of player ages for the Reds and the Phillies]
Comment: The starting lineup for the Phillies tended to be older than the starting lineup for the Reds. The Phillies' ages are also more consistent than the Reds' ages.

c. Calculate the mean and median age of each team's lineup.
Reds: mean 29.67, median 29
Phillies: mean 32.22, median 32
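The means and medians in part c can be checked with a short Python sketch. The age lists are an assumption in the sense that they are copied from the sorted lists in parts g and h, since the lineup table itself is not reproduced here; the variable names are illustrative only.

```python
from statistics import mean, median

# Ages of the nine starters on each team, copied from the sorted
# lists in parts g and h of this activity.
reds = [23, 26, 27, 27, 29, 30, 34, 35, 36]
phillies = [30, 31, 31, 31, 32, 32, 32, 33, 38]

for team, ages in (("Reds", reds), ("Phillies", phillies)):
    print(f"{team}: mean = {mean(ages):.2f}, median = {median(ages)}")
```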
d. Which team's lineup appears to have more variability in its ages?
The Reds' lineup appears to have more variability in its ages.

In the previous topic, you learned that the mean and median are two ways to measure the center of a distribution; you will now learn several ways to measure the spread, or variability, of a distribution.

e. What is the age of the oldest player in the Phillies' lineup? The youngest? What is the difference in age between the oldest and youngest player?
Oldest: 38. Youngest: 30. Difference: 38 - 30 = 8

f. Repeat part e for the Reds' lineup.
Oldest: 36. Youngest: 23. Difference: 36 - 23 = 13

A very simple, but not particularly useful, measure of variability is the range, calculated as the difference between the maximum and minimum values in a data set.

Another measure of variability is the interquartile range (IQR), which is the difference between the upper quartile and the lower quartile of a distribution. The lower quartile (or the 25th percentile, abbreviated Q1) is the value such that 25% of the data values are less than that value and 75% are greater than it, while the upper quartile (or the 75th percentile, abbreviated Q3) is the value such that 75% of the values in the data set are less than that value and 25% are greater than it. Thus, the IQR is the range of the middle 50% of the data.

g. Determine the lower and upper quartiles of the ages for the Phillies. Then find the IQR of the Phillies' ages.
Sorted ages: 30 31 31 31 32 32 32 33 38
Q1 = 31, Q3 = 32.5
IQR = Q3 - Q1 = 32.5 - 31 = 1.5

h. Determine the lower and upper quartiles of the ages for the Reds. Then find the IQR of the Reds' ages.
Sorted ages: 23 26 27 27 29 30 34 35 36
Q1 = 26.5, Q3 = 34.5
IQR = Q3 - Q1 = 34.5 - 26.5 = 8

i. Which team has the greater age range? Which has the greater IQR? Are these values consistent with your answer to question d?
The Reds have a greater range of ages, 13 versus 8 years, and a greater interquartile range of ages, 8 versus 1.5 years. Both values are consistent with the answer to question d.

j. Based on this analysis, summarize how the age distributions differ between the 2010 Reds and Phillies (shape, center, spread).
The distribution of ages for the Reds has no distinct shape, but there is a roughly symmetric cluster of ages between 23 and 30 and an evenly dispersed set of ages between 34 and 36. The median age is 29 and the IQR is 8 years. The distribution of ages for the Phillies is mound-shaped and symmetric, with a potential outlier in the 38-year-old left fielder Raul Ibanez. The median age is 32 and the IQR is 1.5 years. The Cincinnati team tended to be younger than the Philadelphia team, though the ages of the Reds' players vary quite a bit more.
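The range and IQR calculations in parts e through i can be reproduced with a minimal Python sketch. It assumes the quartile convention used in this activity (Q1 and Q3 are the medians of the lower and upper halves, with the overall median left out of both halves when the count is odd); the function names `quartiles` and `spread_summary` are hypothetical helpers, not part of any library.

```python
from statistics import median

reds = [23, 26, 27, 27, 29, 30, 34, 35, 36]
phillies = [30, 31, 31, 31, 32, 32, 32, 33, 38]

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    With an odd count, the overall median belongs to neither half.
    """
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[-half:])

def spread_summary(values):
    """Collect the range, quartiles, and IQR for one data set."""
    q1, q3 = quartiles(values)
    return {"range": max(values) - min(values), "Q1": q1, "Q3": q3, "IQR": q3 - q1}

for team, ages in (("Reds", reds), ("Phillies", phillies)):
    print(team, spread_summary(ages))
```

Other software may use a different quartile convention, so quartiles from a calculator or statistics package can differ slightly from these hand-computed values.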
Activity 9-2: Baseball Lineups

Other measures of variability examine how far the data values fall, or deviate, from the mean of the distribution.

a. The mean age for Cincinnati's starting lineup in game 1 of the 2010 NLDS was approximately 29.67. Complete the missing entries for Votto and Rolen in the "Deviation from Mean" column of the following table by calculating the differences between their ages and the mean age.
Votto: 27 - 29.67 = -2.67
Rolen: 35 - 29.67 = 5.33

b. Add the values in the "Deviation from Mean" column. Then calculate the average deviation from the mean.
Sum: -0.03. Average: -0.03/9 = -0.0033
The sum is not exactly zero only because the deviations were rounded. The unrounded values from the table appear in the table at right. Fathom calculates the sum of the unrounded deviations to be zero, as in the table below.

c. The sum of the deviations from the mean is always equal to zero. Verify this fact for the data set {1, 5, 12}.
Mean: (1 + 5 + 12)/3 = 6
Deviations: 1 - 6 = -5, 5 - 6 = -1, 12 - 6 = 6
Sum: -5 + (-1) + 6 = 0

d. Given the fact that the sum of the deviations from the mean is always zero, what does that imply about using the average deviation from the mean as a measure of spread (variation) for a data set?
The average deviation is a useless measure of spread since it is always going to be zero.

Because a measure of spread is concerned with distances from the mean rather than direction from the mean, you could work with the absolute values of these deviations.

e. Complete the missing entries in the "Absolute Deviation" column of the table below. Then calculate the average absolute deviation. Report the units of measurement for this calculation.

Player   Age   Deviation   Absolute Deviation   Squared Deviation
Votto    27    -2.67       2.67                 7.13
Rolen    35     5.33       5.33                 28.41
Total          -0.03       32.67                160.0

32.67/9 = 3.63 years

The measure of spread you have just calculated is the mean absolute deviation (MAD). It is certainly a reasonable measure of the amount of variation relative to the mean in a data set, but there is yet another measure of spread that has properties desirable to statisticians, as you soon shall see.

f. Complete the missing entries in the "Squared Deviation" column of the table above. Then calculate the average squared deviation. Report the units of measurement for this calculation. This value is called the variance (V).
160.0/9 = 17.78 years²

g. To convert back to the original units of the data set (years of age), take the square root of the average squared deviation.
4.2 years

The measure of spread you have just calculated is the standard deviation (SD). The standard deviation is the most widely used measure of variation in statistical calculations.
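The chain of calculations in parts a through g (deviations, mean absolute deviation, variance, standard deviation) can be traced in a few lines of Python. This is a sketch that assumes the Reds' ages listed in Activity 9-1 and uses the unrounded mean, so the deviations sum to zero up to floating-point error rather than to -0.03.

```python
from math import sqrt

reds = [23, 26, 27, 27, 29, 30, 34, 35, 36]
n = len(reds)
mean_age = sum(reds) / n  # unrounded mean, about 29.67

# Signed deviations from the mean always sum to zero.
deviations = [age - mean_age for age in reds]

# Mean absolute deviation (MAD): average distance from the mean, in years.
mad = sum(abs(d) for d in deviations) / n

# Variance: average squared deviation, in years squared.
variance = sum(d ** 2 for d in deviations) / n

# Standard deviation: square root of the variance, back in years.
sd = sqrt(variance)

print(f"sum of deviations = {sum(deviations):.10f}")
print(f"MAD = {mad:.2f} years")
print(f"variance = {variance:.2f} years^2")
print(f"SD = {sd:.1f} years")
```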
The standard deviation ("baby" sigma, σ) is a widely used measure of variability. To compute the standard deviation, you calculate the difference between the mean and each data value and then square the difference: (data value - mean)². Add these squared terms, and divide by the number of observational units n. The standard deviation is the square root of the result:

$$\sigma = \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_n - \mu)^2}{n}}$$

or, more simply,

$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{n}}$$

The standard deviation can loosely be interpreted as the typical distance that a data value in the distribution deviates from the mean.

The variance σ² is calculated by the formula

$$\sigma^2 = \frac{\sum (x - \mu)^2}{n}$$

The variance is literally the average squared deviation from the mean.

σ is the Greek lowercase "sigma" and is used to represent the standard deviation (of a population).
Σ is the Greek uppercase "sigma" and is the symbol used to imply summation.
μ is the Greek lowercase "myoo" and is the symbol used to represent the mean.
n is the number of observational units.
x is used to represent the value of a variable for a particular observational unit.

Here's how to do it on the TI-83/84.
[TI-83/84 calculator screen: standard deviation]

h. Calculate, with technology, the standard deviation of the ages for the Phillies' starting lineup in game 1 of the 2010 NLDS.
2.2 years

i. Now, remove 38-year-old Raul Ibanez from the Phillies' lineup and calculate the standard deviation of the ages for the Phillies' starting lineup.
0.87 years

j. Calculate the range, interquartile range (IQR), and standard deviation of the ages for the Phillies' starting lineup in game 1 of the 2010 NLDS with and without Raul Ibanez. Complete the table below.

                  Range   IQR   Standard Deviation
With Ibanez       8       1.5   2.2 years
Without Ibanez    3       1     0.87 years

k. Which measures of spread are resistant to outliers and which are not? Explain.
The IQR is least affected by the presence of Raul Ibanez since it changed the least when he is included in the Phillies' lineup, so the IQR is resistant to outliers. The range and the standard deviation both change substantially, so they are not resistant.
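The with-and-without comparison for Raul Ibanez can be sketched in Python as follows. The sketch assumes the Phillies' ages from Activity 9-1, uses `statistics.pstdev` because the formula above divides by n (the population form), and uses a hypothetical helper `iqr` that follows the median-of-halves quartile convention from this topic.

```python
from statistics import median, pstdev

phillies = [30, 31, 31, 31, 32, 32, 32, 33, 38]
without_ibanez = [age for age in phillies if age != 38]  # drop the 38-year-old

def iqr(values):
    """IQR using medians of the lower and upper halves (median excluded if n is odd)."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[-half:]) - median(data[:half])

def spread(values):
    """Return (range, IQR, population standard deviation rounded to 2 places)."""
    return max(values) - min(values), iqr(values), round(pstdev(values), 2)

print("With Ibanez:   ", spread(phillies))
print("Without Ibanez:", spread(without_ibanez))
```

Removing one player cuts the range from 8 to 3 and the standard deviation by more than half, while the IQR moves only from 1.5 to 1, which is the sense in which the IQR is resistant.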
Activity 9-4: Placement Exam Scores

[Histogram: count versus placement score, scores 1 to 19]

a. The distribution of placement scores appears to be roughly symmetric and mound-shaped.

b. μ ≈ 10.221 and σ ≈ 3.859
μ - σ = 10.221 - 3.859 = 6.362
μ + σ = 10.221 + 3.859 = 14.08

c. 146 of the 213 scores fall within one standard deviation of the mean, i.e., between 7 and 14, inclusive. This accounts for 146/213 ≈ 0.685, or about 69% of the scores. This is quite consistent with the 68% advertised by the Empirical Rule.

[Histogram: count versus placement score, with the one- and two-standard-deviation boundaries 2.503, 6.362, 14.08, and 17.939 marked]

d. μ - 2σ = 10.221 - 2(3.859) = 2.503
μ + 2σ = 10.221 + 2(3.859) = 17.939
202 of the 213 scores fall within two standard deviations of the mean, i.e., between 3 and 17, inclusive. This accounts for 202/213 ≈ 0.948, or about 95% of the scores. This is exactly what the Empirical Rule advertises.

e. 213 of the 213 scores fall within three standard deviations of the mean, i.e., between 1 and 19. This accounts for 100% of the placement scores. This is quite in line with the Empirical Rule.
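The Empirical Rule check in parts b through e can be laid out as a small Python sketch. It assumes the summary values reported in this activity (μ ≈ 10.221, σ ≈ 3.859, 213 scores, and the counts 146, 202, and 213); the raw scores are not reproduced here, so the counts are taken as given rather than recomputed.

```python
mu, sigma, n = 10.221, 3.859, 213

# Number of scores within k standard deviations of the mean (from parts c, d, e).
counts_within = {1: 146, 2: 202, 3: 213}

# Approximate proportions stated by the Empirical Rule.
empirical_rule = {1: 0.68, 2: 0.95, 3: 0.997}

results = {}
for k, count in counts_within.items():
    low, high = mu - k * sigma, mu + k * sigma
    results[k] = (low, high, count / n)
    print(f"within {k} SD: ({low:.3f}, {high:.3f}) "
          f"observed {count / n:.3f}, rule {empirical_rule[k]}")
```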
Activity 9-5: SATs and ACTs

a. 1740 is 240 points above the mean SAT score.

b. 30 is 9 points above the mean ACT score.

c. No. You cannot compare these point differences because the SAT and ACT scores are not measured on the same numeric scale.

d. Bobby's SAT score is 240/240 = 1 standard deviation above the mean SAT score.

e. Kathy's ACT score is 9/6 = 1.5 standard deviations above the mean ACT score.

f. Kathy's ACT z-score is $\frac{30 - 21}{6} = 1.5$, which is greater than Bobby's SAT z-score of $\frac{1740 - 1500}{240} = 1$.

g. Since Kathy's score of 30 on the ACT is 1.5 standard deviations above the mean score in the approximately Normal distribution of ACT scores, while Bobby's score of 1740 on the SAT is only one standard deviation above the mean score in the approximately Normal distribution of SAT scores, Kathy performed better relative to her peers, that is, relative to the other scores in the distribution of all ACT scores.

h. $z_{\text{Peter}} = \frac{1380 - 1500}{240} = -0.5$
$z_{\text{Kelly}} = \frac{15 - 21}{6} = -1$

i. Peter has the higher z-score since -1 < -0.5.

j. A z-score turns out to be negative when calculated for any score less than the mean score for the associated distribution.
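The z-score comparisons in parts d through i follow one formula, z = (score - mean)/SD, which a short Python sketch makes explicit. The means and standard deviations (SAT: 1500 and 240; ACT: 21 and 6) are the values implied by the arithmetic in this activity; `z_score` is a hypothetical helper name.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

SAT_MEAN, SAT_SD = 1500, 240
ACT_MEAN, ACT_SD = 21, 6

z = {
    "Bobby": z_score(1740, SAT_MEAN, SAT_SD),
    "Kathy": z_score(30, ACT_MEAN, ACT_SD),
    "Peter": z_score(1380, SAT_MEAN, SAT_SD),
    "Kelly": z_score(15, ACT_MEAN, ACT_SD),
}

for name, value in z.items():
    print(f"{name}: z = {value}")
```

Because z-scores are unit-free, Kathy's 1.5 can be compared directly with Bobby's 1, and Peter's -0.5 with Kelly's -1, even though the SAT and ACT use different scales.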
Activity 9-6: Marriage Ages

a. Husbands tend to be older than their wives: the mean age is higher by 1.875 years and the median age by 1.5 years.

b. The IQR for the distribution of husbands' ages is 44.5 - 25 = 19.5 years. The IQR for the distribution of wives' ages is 41.5 - 24 = 17.5 years. The standard deviation of husbands' ages is 14.26 years, while the standard deviation of wives' ages is 13.27 years. There is more variability in the distribution of husbands' ages than in the distribution of wives' ages.

c. The distributions of husbands' ages and wives' ages are both skewed right. The median age of the husbands is 30.5, while the median age of the wives is only 29, indicating that the husbands tended to be older than the wives. With an interquartile range of 19.5 years, two more than that of the wives, there is slightly more variation in the ages of the husbands than in the ages of the wives.

[Dotplots of husbands' ages and wives' ages on a common axis from 15 to 75]

d. The mean difference (husband age minus wife age) is equal to the difference of the means, mean husband age minus mean wife age. The median difference is NOT equal to the difference of the medians.

e. Neither the difference in the IQRs nor the difference in the standard deviations is equal to the IQR of the differences or the standard deviation of the differences.
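The claim in part d that the mean of the differences equals the difference of the means holds for any paired data set, and a one-line derivation shows why. The symbols below are assumptions introduced for illustration: $h_i$ and $w_i$ are the husband's and wife's ages for couple $i$, and $\mu_h$, $\mu_w$, and $\mu_d$ are the means of the husbands' ages, the wives' ages, and the differences.

```latex
\mu_d
  = \frac{1}{n}\sum_{i=1}^{n} (h_i - w_i)
  = \frac{1}{n}\sum_{i=1}^{n} h_i \;-\; \frac{1}{n}\sum_{i=1}^{n} w_i
  = \mu_h - \mu_w
```

The step in the middle works because a sum can be split term by term. The median, the quartiles, and the standard deviation are not computed by simple sums of the data values, so no similar identity holds for them, which is consistent with parts d and e.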