Stats Review Chapters 3-4

Created by Teri Johnson, Math Coordinator, Mary Stangler Center for Academic Success
Examples are taken from Statistics 4E by Michael Sullivan, III and the corresponding Test Generator from Pearson
Revised 12/13

Note: This review is composed of questions from the textbook and the test generator. It is meant to highlight basic concepts from the course and does not cover all concepts presented by your instructor. Refer back to your notes, unit objectives, handouts, etc. to further prepare for your exam. A copy of this review can be found at www.sctcc.edu/cas. The final answers are displayed in red and the chapter/section number is in the corner.

Find the Mean, Median, Mode
Data Set: 71, 74, 67, 64, 72, 71, 65, 66, 69, 70
Mean: Add up all the numbers and divide by how many numbers there are.
  (71 + 74 + 67 + 64 + 72 + 71 + 65 + 66 + 69 + 70) / 10 = 68.9
Median: Arrange the numbers from smallest to largest, then find the middle number.
  64, 65, 66, 67, 69, 70, 71, 71, 72, 74
  The middle falls between 69 and 70. The mean of these two numbers gives the median, which is 69.5.
Mode: The most repeated number (there can be none, one, or more than one).
  The mode is 71.
Section 3.1
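As a quick check of the hand calculations above, here is a minimal sketch using Python's standard `statistics` module. The variable names are illustrative only and are not part of the review.

```python
from statistics import mean, median, multimode

data = [71, 74, 67, 64, 72, 71, 65, 66, 69, 70]

data_mean = mean(data)        # sum of the values divided by the count
data_median = median(data)    # averages the two middle values when the count is even
data_modes = multimode(data)  # list of every most-frequent value

print(data_mean, data_median, data_modes)
```

`multimode` returns a list, which fits the note that a data set can have more than one mode.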

Find the Range and Sample Standard Deviation
Data Set: 71, 74, 67, 64, 72, 71, 65, 66, 69, 70
Range: The largest number minus the smallest number.
  74 - 64 = 10
Sample Standard Deviation: Use Excel or a calculator.
  You get 3.28.
Section 3.2
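The slide suggests Excel or a calculator. As one possible alternative, the sketch below uses Python's `statistics.stdev`, which computes the sample standard deviation (dividing by n - 1). The variable names are illustrative.

```python
from statistics import stdev

data = [71, 74, 67, 64, 72, 71, 65, 66, 69, 70]

data_range = max(data) - min(data)  # largest value minus smallest value
sample_sd = stdev(data)             # sample standard deviation (n - 1 in the denominator)

print(data_range, round(sample_sd, 2))
```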

Find the Five-Number Summary: Part 1 of 2
Data Set: 64, 65, 66, 67, 69, 70, 71, 71, 72, 74
Min: the smallest number, 64
Max: the largest number, 74
Median: the middle number, 69.5
Q1: Find the median of the numbers from the minimum up to the median (do not include the median).
  These numbers are 64, 65, 66, 67, 69. The median of these numbers is 66, so Q1 is 66.
Section 3.4

Find the Five-Number Summary: Part 2 of 2
Q3: Find the median of the numbers from the median (not included) to the maximum.
  These numbers are 70, 71, 71, 72, 74. The median of these numbers is 71, so Q3 is 71.
The five-number summary is 64, 66, 69.5, 71, 74.
Section 3.4
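The quartile method used here (median of each half, excluding the overall median) can be sketched as follows. The function name `five_number_summary` is hypothetical, and statistical software may use a different quartile convention and return slightly different values.

```python
from statistics import median

def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # values below the median position
    upper = ordered[half + n % 2:]  # values above it (skips the middle value when n is odd)
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

summary = five_number_summary([71, 74, 67, 64, 72, 71, 65, 66, 69, 70])
print(summary)
```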

Find the Upper and Lower Fences
Data Set: 71, 74, 67, 64, 72, 71, 65, 66, 69, 70
1) Find the IQR: IQR = Q3 - Q1 = 71 - 66 = 5
2) Multiply the IQR by 1.5: 5 * 1.5 = 7.5
3) Upper Fence: Add 7.5 to Q3: 71 + 7.5 = 78.5
4) Lower Fence: Subtract 7.5 from Q1: 66 - 7.5 = 58.5
Any numbers outside the range 58.5 to 78.5 would be considered outliers.
Section 3.4
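The four steps translate directly into a few lines of code. This is a minimal sketch that takes Q1 = 66 and Q3 = 71 from the five-number summary as given; the variable names are illustrative.

```python
data = [71, 74, 67, 64, 72, 71, 65, 66, 69, 70]
q1, q3 = 66, 71  # quartiles from the five-number summary

iqr = q3 - q1                  # step 1
step = 1.5 * iqr               # step 2
upper_fence = q3 + step        # step 3
lower_fence = q1 - step        # step 4

# Values beyond either fence are flagged as outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)
```

For this data set the list of outliers is empty, since every value lies between the fences.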

Construct a Boxplot
Construct a boxplot given the five-number summary 64, 66, 69.5, 71, 74.
[Figure: boxplot with the values 64, 66, 69.5, 71, and 74 marked]
Section 3.5

Relationship between Mean, Median and Shape
Match the shape to the relationship between mean and median.
Relationships: Mean > Median, Mean = Median, Mean < Median
  Skewed left: Mean < Median
  Symmetric: Mean = Median
  Skewed right: Mean > Median
If data are skewed, the median is more representative of the typical observation.
Section 3.1

Empirical Rule to Find the Probability: Part 1 of 2
At a tennis tournament, a statistician keeps track of every serve. She reported that the mean serve speed of a particular player was 104 mph and the standard deviation of the serve speeds was 8 mph. Assume that the statistician also told us that the distribution of the serve speeds was bell shaped. Using the Empirical Rule, what proportion of the player's serves are expected to be between 112 mph and 120 mph?
Section 3.2

Empirical Rule to Find the Probability: Part 2 of 2
Start by drawing the bell curve. Here 112 mph is one standard deviation above the mean (104 + 8) and 120 mph is two standard deviations above the mean (104 + 2 * 8).
Compare this to the Empirical Rule curve on page 149, or find the percents as described on page 149.
We can see that the probability is 13.5%, or 0.135.
*You could also find the z-scores, then find the probability.
Section 3.2
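The arithmetic behind the 13.5% can be sketched as below. It assumes the usual Empirical Rule figures (about 68% of values within one standard deviation and about 95% within two), and it also shows the z-score route mentioned in the footnote by using `statistics.NormalDist`, which assumes an exactly normal distribution. Variable names are illustrative.

```python
from statistics import NormalDist

mean_speed, sd_speed = 104, 8

# z-scores for the two endpoints
z_low = (112 - mean_speed) / sd_speed
z_high = (120 - mean_speed) / sd_speed

# Empirical Rule: the band between 1 and 2 SDs above the mean is
# half of the difference between the 95% and 68% regions.
empirical = (0.95 - 0.68) / 2

# Alternative: area under an exactly normal curve between the endpoints
serve = NormalDist(mean_speed, sd_speed)
normal_area = serve.cdf(120) - serve.cdf(112)

print(z_low, z_high, round(empirical, 3), round(normal_area, 4))
```

The normal-curve area differs slightly from 0.135 because the Empirical Rule percentages are rounded.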

Find the Mean: Part 1 of 2
  Days    Frequency
  1-2     2
  3-4     21
  5-6     20
  7-8     10
  9-10    30
Step 1: Find the midpoint of each class (days).
  Days    Class midpoint, x_i    Frequency, f_i
  1-2     (1 + 2)/2 = 1.5        2
  3-4     3.5                    21
  5-6     5.5                    20
  7-8     7.5                    10
  9-10    9.5                    30
Section 3.3

Find the Mean: Part 2 of 2
Step 2: Multiply each class midpoint by its frequency.
  Days    Class midpoint, x_i    Frequency, f_i    x_i f_i
  1-2     (1 + 2)/2 = 1.5        2                 1.5 * 2 = 3
  3-4     3.5                    21                73.5
  5-6     5.5                    20                110
  7-8     7.5                    10                75
  9-10    9.5                    30                285
Step 3: Find the sum of the x_i f_i column: 3 + 73.5 + 110 + 75 + 285 = 546.5
Step 4: Find the sum of the frequency column: 83
Step 5: Divide the number from Step 3 by the number from Step 4: 546.5/83 = 6.6
The mean is 6.6.
Section 3.3
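Steps 1 through 5 can be sketched in code as follows. The tuple layout `(low, high, frequency)` is just one convenient way to hold the table and is an assumption of this example.

```python
# Each class: (lower limit, upper limit, frequency)
classes = [(1, 2, 2), (3, 4, 21), (5, 6, 20), (7, 8, 10), (9, 10, 30)]

midpoints = [(low + high) / 2 for low, high, _ in classes]  # step 1
freqs = [f for _, _, f in classes]

sum_xf = sum(x * f for x, f in zip(midpoints, freqs))       # steps 2 and 3
n = sum(freqs)                                              # step 4
grouped_mean = sum_xf / n                                   # step 5

print(sum_xf, n, round(grouped_mean, 1))
```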

Find the Sample Standard Deviation: Part 1 of 2 (from previous problem)
Step 1: Find the mean (see previous 2 slides). Mean = 6.6
Step 2: Complete the following table (the first three columns were done in the previous 2 slides and column 4 is the mean).
  Days    x_i                f_i    x̄      x_i - x̄               (x_i - x̄)^2 f_i
  1-2     (1 + 2)/2 = 1.5    2      6.6    1.5 - 6.6 = -5.1      (-5.1)^2 * 2 = 52.02
  3-4     3.5                21     6.6    -3.1                  201.81
  5-6     5.5                20     6.6    -1.1                  24.2
  7-8     7.5                10     6.6    0.9                   8.1
  9-10    9.5                30     6.6    2.9                   252.3
Section 3.3

Find the Sample Standard Deviation: Part 2 of 2 (from previous problem)
  Days    x_i                f_i    x̄      x_i - x̄               (x_i - x̄)^2 f_i
  1-2     (1 + 2)/2 = 1.5    2      6.6    1.5 - 6.6 = -5.1      (-5.1)^2 * 2 = 52.02
  3-4     3.5                21     6.6    -3.1                  201.81
  5-6     5.5                20     6.6    -1.1                  24.2
  7-8     7.5                10     6.6    0.9                   8.1
  9-10    9.5                30     6.6    2.9                   252.3
Step 3: Find the total of the (x_i - x̄)^2 f_i column. Total = 538.43
Step 4: Find the total of the frequency column. Total = 83. Take this number and subtract 1: 83 - 1 = 82
Step 5: Take the number from Step 3 and divide by the number from Step 4: 538.43/82 = 6.56622
Step 6: Take the square root of the number from Step 5: sqrt(6.56622) ≈ 2.6
Section 3.3
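The table and Steps 3 through 6 can be reproduced with the sketch below. It deliberately uses the rounded mean 6.6, as the worked example does; carrying the unrounded mean would change the column total slightly but gives the same answer to one decimal place. Variable names are illustrative.

```python
import math

midpoints = [1.5, 3.5, 5.5, 7.5, 9.5]
freqs = [2, 21, 20, 10, 30]
mean_rounded = 6.6  # rounded mean, as used in the worked example

# Last table column: squared deviation times frequency, one entry per class
weighted_sq = [(x - mean_rounded) ** 2 * f for x, f in zip(midpoints, freqs)]

total_sq = sum(weighted_sq)         # step 3
df = sum(freqs) - 1                 # step 4
variance = total_sq / df            # step 5
sample_sd = math.sqrt(variance)     # step 6

print(round(total_sq, 2), df, round(variance, 5), round(sample_sd, 1))
```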

Correlation Coefficient
Match the correlation with the graph.
Choices: r = -0.6, r = 0, r = 0.99, should not use correlation
[Figure: four scatterplots, labeled in the answer key as r = -0.6, r = 0, r = 0.999, and "should not use correlation"]
Section 4.1

Least-Squares Regression Line: Part 1 of 3
The data are the average one-way commute times (in minutes) for selected students and the number of absences for those students during the term.
a) Find the equation of the regression line for the given data. Round the regression line values to the nearest hundredth.
b) What would be the predicted number of absences if the commute time was 40 minutes? Is this a reasonable question?
c) Interpret the slope.
  Commute time (x)    Number of absences (y)
  72                  3
  85                  7
  91                  10
  90                  10
  88                  8
  98                  15
  75                  4
  100                 15
  80                  5
Section 4.2

Least-Squares Regression Line: Part 2 of 3
a) Find the regression line.
With Calculator: ŷ = 0.45x - 30.3
By Hand: First find the mean of x and y (labeled x̄ and ȳ), the sample standard deviation of x and y (labeled s_x and s_y), and the correlation (r).
  x̄ = 86.556, ȳ = 8.556, s_x = 9.593, s_y = 4.39, r = 0.98
  Slope: b_1 = r * (s_y / s_x) = 0.98 * (4.39 / 9.593) = 0.45
  Y-intercept: b_0 = ȳ - b_1 x̄ = 8.556 - 0.4485(86.556) ≈ -30.3 (carrying the unrounded slope 0.4485 through the calculation)
The regression line is ŷ = b_1 x + b_0, so ours is ŷ = 0.45x - 30.3.
Section 4.2
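The by-hand route can be checked with a short sketch. It computes r from its definition rather than relying on a library regression routine, and all variable names are illustrative.

```python
from statistics import mean, stdev

x = [72, 85, 91, 90, 88, 98, 75, 100, 80]  # commute time (minutes)
y = [3, 7, 10, 10, 8, 15, 4, 15, 5]        # number of absences
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations

# Correlation: sum of cross-deviations divided by (n - 1) * s_x * s_y
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b1 = r * s_y / s_x       # slope
b0 = y_bar - b1 * x_bar  # y-intercept

print(round(r, 2), round(b1, 2), round(b0, 2))
```

With full precision carried throughout, the intercept comes out near -30.27, which agrees with the -30.3 used in this review to one decimal place.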

Least-Squares Regression Line: Part 3 of 3
b) What would be the predicted number of absences if the commute time was 40 minutes? Is this a reasonable question?
  Time is 40 minutes, so x = 40. Put this value into our least-squares regression line:
  ŷ = 0.45(40) - 30.3 = -12.3
  This means that when the commute time is 40 minutes, the predicted number of absences is -12.3. This is not a reasonable question since 40 is outside the scope of the model (i.e., 40 is not within the given range of x values, 72 to 100), and a negative number of absences is impossible.
c) Interpret the slope.
  The slope is 0.45. This means that for every additional minute of commute time, the predicted number of absences increases by 0.45, on average.
Section 4.2
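A minimal sketch of part (b), pairing the prediction with a check that the input lies inside the observed range of x values. The function name `predict_absences` is hypothetical.

```python
x = [72, 85, 91, 90, 88, 98, 75, 100, 80]  # observed commute times (minutes)

def predict_absences(minutes):
    """Predicted absences from the fitted line y-hat = 0.45x - 30.3."""
    return 0.45 * minutes - 30.3

prediction = predict_absences(40)
in_scope = min(x) <= 40 <= max(x)  # is 40 within the observed x values?

print(round(prediction, 1), in_scope)
```

A `False` value for `in_scope` signals extrapolation, which is why the negative prediction should not be trusted.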

Sum Of Residuals: Part 1 of 2
Find the sum of squared residuals.
Step 1: Find the least-squares regression line (see previous slides).
Step 2: Find the predicted y values (ŷ) for each x.
  Commute time (x)    Number of absences (y)    Predicted ŷ
  72                  3                         0.45(72) - 30.3 = 2.1
  85                  7                         7.95
  91                  10                        10.65
  90                  10                        10.2
  88                  8                         9.3
  98                  15                        13.8
  75                  4                         3.45
  100                 15                        14.7
  80                  5                         5.7
Section 4.3

Sum Of Residuals: Part 2 of 2
Step 3: Calculate the residuals: observed - predicted, or y - ŷ
Step 4: Calculate the residuals squared: (observed - predicted)^2, or (y - ŷ)^2
Step 5: Find the sum of the numbers in the column from Step 4. This is the sum of squared residuals, which equals 6.1875.
  Commute time (x)    Number of absences (y)    Predicted ŷ    Step 3: y - ŷ    Step 4: (y - ŷ)^2
  72                  3                         2.1            0.9              0.81
  85                  7                         7.95           -0.95            0.9025
  91                  10                        10.65          -0.65            0.4225
  90                  10                        10.2           -0.2             0.04
  88                  8                         9.3            -1.3             1.69
  98                  15                        13.8           1.2              1.44
  75                  4                         3.45           0.55             0.3025
  100                 15                        14.7           0.3              0.09
  80                  5                         5.7            -0.7             0.49
Section 4.3
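The residual table can be reproduced with the sketch below, which uses the rounded line ŷ = 0.45x - 30.3 exactly as the slides do. Variable names are illustrative.

```python
x = [72, 85, 91, 90, 88, 98, 75, 100, 80]
y = [3, 7, 10, 10, 8, 15, 4, 15, 5]

predicted = [0.45 * xi - 30.3 for xi in x]                  # step 2
residuals = [yi - pi for yi, pi in zip(y, predicted)]       # step 3: observed - predicted
squared = [e ** 2 for e in residuals]                       # step 4
sum_squared_residuals = sum(squared)                        # step 5

print([round(e, 2) for e in residuals])
print(round(sum_squared_residuals, 4))
```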

Coefficient of Determination (R^2)
If the coefficient of determination (R^2) is 86.44% and the data show a negative association, what is the linear correlation coefficient (r)?
  |r| = sqrt(R^2) = sqrt(0.8644) = 0.9297
  Since the association is negative, r = -0.9297.
Interpret R^2 = 86.44%.
  86.44% of the variability in y (the response variable) is explained by the least-squares regression line.
Section 4.3
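The square-root step and the sign choice can be sketched as below. The flag `negative_association` is a name introduced only for this example.

```python
import math

r_squared = 0.8644
negative_association = True  # given in the problem statement

magnitude = math.sqrt(r_squared)  # size of r, without its sign
r = -magnitude if negative_association else magnitude

print(round(r, 4))
```

The square root alone only gives the size of r; the direction of the association supplies the sign.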

Residual Plots
What does the residual plot to the right suggest?
  There is an outlier.
Removing the outlier, what does the residual plot suggest?
  No pattern; a linear model is appropriate.
[Figure: residual plot, horizontal axis from 0 to 10, vertical axis from -14 to 6]
Section 4.3

Contingency Tables: Part 1 of 3
Is there an association between party affiliation and gender? The following table shows the gender and party affiliation of registered voters, based on a random sample of 802 adults.
  Party          Female    Male
  Republican     105       115
  Democratic     150       103
  Independent    150       179
a) Construct a frequency marginal distribution.
b) Construct a relative frequency marginal distribution.
c) Construct a conditional distribution of party affiliation by gender.
d) Is gender associated with party affiliation? If so, how?
Section 4.4

Contingency Tables: Part 2 of 3
a) Construct a frequency marginal distribution.
To do: Find the total for each row and column.
  Party                               Female                   Male    Frequency marginal distribution
  Republican                          105                      115     105 + 115 = 220
  Democratic                          150                      103     253
  Independent                         150                      179     329
  Frequency marginal distribution     105 + 150 + 150 = 405    397     802
b) Construct a relative frequency marginal distribution.
To do: Divide each row/column total by the sample size.
  Party                                        Female               Male     Relative frequency marginal distribution
  Republican                                   105                  115      0.274
  Democratic                                   150                  103      0.315
  Independent                                  150                  179      0.41
  Relative frequency marginal distribution     405/802 = 0.505      0.495    1
Section 4.4

Contingency Tables: Part 3 of 3
c) Construct a conditional distribution of party affiliation by gender.
To do: Divide each cell by its column total.
  Party          Female               Male
  Republican     105/405 = 0.259      115/397 = 0.290
  Democratic     0.370                0.259
  Independent    0.370                0.451
  Total          1                    1
d) Is gender associated with party affiliation? If so, how?
  Yes; males are more likely to be Independents and less likely to be Democrats.
Section 4.4
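Parts (a) through (c) can be sketched with plain dictionaries. The layout `party -> (female, male)` and the variable names are assumptions of this example, not something prescribed by the review.

```python
# party -> (female count, male count)
counts = {
    "Republican": (105, 115),
    "Democratic": (150, 103),
    "Independent": (150, 179),
}

# a) frequency marginal distributions: row and column totals
row_totals = {party: f + m for party, (f, m) in counts.items()}
female_total = sum(f for f, _ in counts.values())
male_total = sum(m for _, m in counts.values())
grand_total = female_total + male_total

# b) relative frequency marginal distribution: totals divided by sample size
relative_rows = {party: round(t / grand_total, 3) for party, t in row_totals.items()}
relative_cols = (round(female_total / grand_total, 3), round(male_total / grand_total, 3))

# c) conditional distribution by gender: each cell divided by its column total
conditional = {
    party: (round(f / female_total, 3), round(m / male_total, 3))
    for party, (f, m) in counts.items()
}

print(row_totals, relative_rows, relative_cols)
print(conditional)
```

Comparing the two numbers within each row of `conditional` is what supports the answer to part (d).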