Review for FINAL EXAM

Transcription

1 Review for FINAL EXAM 1 Population the complete collection of elements (scores, people, measurements, etc.) to be studied Sample Chapter 1 a sub-collection of elements drawn from a population 2 The Nature of Data Quantitative data Definitions numbers representing counts or measurements Qualitative (attribute) data nonnumeric data that can be separated into different categories (categorical data) 3

2 Definitions Discrete - Countable Continuous - Measurements with no gaps 4 Levels of Measurement Nominal - names only Ordinal - names with some order Interval - differences but no zero Ratio - differences and a zero 5 Methods of Sampling Random Systematic Convenience Stratified Cluster 6

3 Chapters 2,3 7 Determine the Definition Values for this Frequency Table Quiz Scores Frequency Classes Lower Class Limits Upper Class Limits Class Boundaries Class Midpoints Class Width 8 Regular Freq. Table Axial Load Frequency Tables Frequency Relative Freq. Table Axial Load Relative Frequency Cumulative Freq. Table Axial Load Cumulative Frequency Less than 210 Less than 220 Less than 230 Less than 240 Less than 250 Less than 260 Less than 270 Less than 280 Less than 290 Less than

4 Histogram of Axial Load Data Frequency Axial Load (pounds) 10 Important Distributions Normal Uniform Skewed Right Skewed Left 11 Stem-Leaf Plots Stem Leaves

5 Mean Median Mode Measures of Center Midrange 13 Calculator Basics for Statistical Data 1. Put calculator into statistical mode 2. Clear previous data 3. Enter data (and frequency) 4. Select key(s) that calculate x 14 Mean for a Frequency Table Quiz Scores Midpoints Frequency x = 14.4 ( rounded to one more decimal place than data )

6 Measure of Variation Range highest score lowest score 16 Measure of Variation Standard Deviation a measure of variation of the scores about the mean (average deviation from the mean) 17 Measure of Variation Variance standard deviation squared 18

7 Same Means (x = 4) Different Standard Deviations Frequency s = 0 s = 0.8 s = 1.0 s = Standard Deviation 19 Estimation of Standard Deviation Range Rule of Thumb x - 2s x x + 2s (minimum usual value) Range» 4s (maximum usual value) Range s» = 4 highest value - lowest value 4 20 Rough Estimates of Usual Sample Values minimum usual value» (mean) - 2 (standard deviation) minimum» x - 2(s) maximum usual value» (mean) + 2 (standard deviation) maximum» x + 2(s) 21

8 FIGURE 2-13 The Empirical Rule (applies to bell-shaped distributions) 99.7% of data are within 3 standard deviations of the mean 95% within 2 standard deviations 68% within 1 standard deviation 34% 34% 2.4% 2.4% 0.1% 0.1% 13.5% 13.5% x - 3s x - 2s x - 1s x x + 1s x + 2s x + 3s 22 Measures of Position z score Sample z = x - x s Population z = x - µ s Round to 2 decimal places 23 Interpreting Z Scores Unusual Values Ordinary Values Unusual Values Z 24

9 Other Measures of Position Quartiles and Percentiles 25 Sort the data. (Arrange the data in order of lowest to highest.) Start Compute L = ( k ) n where 100 n = number of values k = percentile in question Is L a whole number? Change L by rounding it up to the next No larger whole number. Finding the Value of the kth Percentile Find the 75th percentile. Yes (75 ) 11 = 8.75 = L 100 L = 9 The value of the kth percentile is midway between the Lth value and the next value in the sorted set of data. Find P k by adding the L th value and the next value and dividing the total by 2. Figure 3-6 The value of P k is the The 75th percentile is the 9th score, or 215. Lth value, counting from the Final lowest Review. Triola, Essentials of Statistics, Third Edition. Copyright Pearson Education, Inc. 26 Quartiles Q 1 = P 25 Q 2 = P 50 Q 3 = P 75 27

10 Boxplot pulse rates (beats per minute) of smokers number summary Minimum - 52 first quartile Q1-60 Median third quartile Q3-78 Maximum Boxplot Box-and-Whisker Diagram Boxplot of Pulse Rates (Beats per minute) of Smokers 29 Chapters 4 and 5 30

11 Fundamentals of Probability 31 Basic Rules for Computing Probability Rule 1: Relative Frequency Approximation Conduct (or observe) an experiment a large number of times, and count the number of times event A actually occurs, then an estimate of P(A) is P(A) number of times A occurred number of times trial was repeated 32 Rule 2: Classical approach (requires equally likely outcomes) If a procedure has n different simple events, each with an equal chance of occurring, and event A can occur in s of these ways, then P(A) = Basic Rules for Computing Probability s n = number of ways A can occur number of different simple events 33

12 Rule 1 Relative frequency approach Throwing a die 100 times and getting 15 threes P(3) = Rule 2 Classical approach P(3 on a die) = 1/6 = Probability Limits The probability of an impossible event is 0. The probability of an event that is certain to occur is 1. 0 P(A) 1 Impossible Certain to occur to occur A probability value must be a number between 0 and Complementary Events The complement of event A, denoted by A, consists of all outcomes in which event A does not occur. P(A) P(A) (read not A ) 36

13 Rounding Off Probabilities give the exact fraction or decimal or round the final result to three significant digits P(struck by lightning last year)» Compound Event Any event combining 2 or more events Notation Definitions P(A or B) = P (event A occurs or event B occurs or they both occur) 38 Disjoint Events A = Green ball B = Blue ball } disjoint events P(A or B) = P(A) + P(B) = + =

14 Not Disjoint Events A = Even number B = Number greater than 5 } Overlapping events; some counted twice P(A or B) = P(A) + P(B) - P(A and B) = = & 8 Final Review. Triola, Essentials of Statistics, Third Edition. Copyright counted Pearson twice Education, Inc. 40 Contingency Table Homicide Robbery Assault Totals Stranger Acqu. or Rel Unknown Totals Find the probability of randomly selecting one person from this group and getting someone who was robbed or was a stranger. P(robbed or a stranger) = = 1244 = * * NOT Disjoint Events ** 41 Complementary Events P(A) and P(A) are disjoint events All simple events are either in A or A. P(A) + P(A) = 1 42

15 Finding the Probability of Two or More Selections Multiple selections Multiplication Rule 43 Definitions Independent Events Two events A and B are independent if the occurrence of one does not affect the probability of the occurrence of the other. Dependent Events If A and B are not independent, they are said to be dependent. 44 Find the probability of drawing two cards from a shuffled deck of cards such that the first is an Ace and the second is a King. (The cards are drawn without replacement.) P(Ace on first card) = P(King Ace) = P(drawing Ace, then a King) = 4 4 = DEPENDENT EVENTS = 45

16 Independent Events G G D G G Two selections With replacement P (both good) = P (good and good) = = = Example: On a TV program it was reported that there is a 60% success rate for those who try to stop smoking through hypnosis. Find the probability that for 8 randomly selected smokers who undergo hypnosis, they all successfully quit smoking. P(all 8 quit smoking) = P(quit) P(quit) P(quit) P(quit) P(quit) P(quit) P(quit) P(quit) = (0.60) (0.60) (0.60) (0.60) (0.60) (0.60) (0.60) (0.60) = or 47 Small Samples from Large Populations If small sample is drawn from large population (if n 5% of N), you can treat the events as independent. 48

17 Chapter 4 49 Probability Distribution x (# of correct) P(x) P(x) # of correct answers Probability Histogram Requirements for Probability Distribution S P(x) = 1 where x assumes all possible values 0 P(x) 1 for every value of x 51

18 Mean, Variance and Standard Deviation of a Probability Distribution Mean s 2 µ = S x P(x) Variance = S[x 2 P(x) ] - µ 2 Standard Deviation s = S[x 2 P(x) ] - µ 2 52 Mean, Standard Deviation and Variance of Probability Distribution x P(x) µ = 2.7 s = s = Binomial Experiment Definition 1. The procedure must have a fixed number of trials. 2. The trials must be independent.. (The outcome of any individual trial doesn t affect the probabilities in the other trials.) 3. Each trial must have all outcomes classified into two categories. 4. The probabilities must remain constant for each trial. 54

19 Binomial Probability Formula P(x) = n! (n - x )! x! p x q n-x 55 For n = 15 and p = 0.10 Table A-1 Binomial Probability Distribution n x P(x) x P(x) Example: US Air has 20% of all domestic flights and one year had 4 of 7 consecutive major air crashes in the United States. Assuming that airline crashes are independent and random events, find the probability that when seven airliners crash, at least four of them are from US Air. According to the definition, this is a binomial experiment. n = 7 p = 0.20 q = 0.80 x = 4, 5, 6, 7 Table A-1 can be used. + + P(4,5,6,7) = =

20 Binomial Probability Formula P(x) = n C r p x q n-x Number of outcomes with exactly x successes among n trials Probability of x successes among n trials for any one particular order 58 Example: Find the probability of getting exactly 3 left-handed students in a class of 20 if 10% of us are left-handed. This is a binomial experiment where: n = 20 x = 3 p =.10 q =.90 Table A-1 cannot be used; therefore, we must use the binomial formula P(3) = 20 C = For a Binomial Distribution: Mean µ = n p Variance s 2 = n p q Standard s = n Deviation = n p q 60

21 Example: US Air has 20% of all domestic flights. What is considered the unusual number of US Air crashes out of seven randomly selected crashes? We previously found for this binomial distribution, µ = 1.4 crashes s = 1.1 crashes µ - 2 s = 1.4-2(1.1) = (or 0) µ + 2 s = (1.1) = 3.6 The usual number of US Air crashes out of seven randomly selected crashes should be between -0.8 (or 0) and 3.6. Four crashes would be unusual! 61 Chapter 6 Normal Probability Distributions The Standard Normal Distribution 63

22 Because the total area under the density curve is equal to 1, there is a correspondence between area and probability. 64 Definition Standard Normal Distribution a normal probability distribution that has a mean of 0 and a standard deviation of 1, and the total area under its density curve is equal to NEGATIVE Z Scores Table A-2 66

23 Table A-2 Designed only for standard normal distribution Is on two pages: negative z-scores and positive z-scores Body of table is a cumulative area from the left up to a vertical boundary Avoid confusion between z-scores and areas Z-score hundredths is across the top row 67 Table A-2 Standard Normal Distribution Negative z-scores: cumulative from left x z 0 68 Table A-2 Standard Normal Distribution Positive z-scores: cumulative from left X z 69

24 Table A-2 Standard Normal Distribution µ = 0 s = 1 z = x X z 70 Table A-2 Standard Normal Distribution Area = Probability µ = 0 s = 1 z = x X z 71 Example: If thermometers have an average (mean) reading of 0 degrees and a standard deviation of 1 degree for freezing water and if one thermometer is randomly selected, find the probability that it reads freezing water is less than 1.58 degrees. m = 0 s = 1 P(z < 1.58) = % of the thermometers will read freezing water less than 1.58 degrees. 72

25 Example: If we are using the same thermometers, and if one thermometer is randomly selected, find the probability that it reads (at the freezing point of water) above 1.23 degrees. P (z > 1.23) = The percentage of thermometers with a reading above degrees is 89.07%. 73 Example: A thermometer is randomly selected. Find the probability that it reads (at the freezing point of water) between 2.00 and 1.50 degrees. P (z < 2.00) = P (z < 1.50) = P ( 2.00 < z < 1.50) = = The probability that the chosen thermometer has a reading between 2.00 and 1.50 degrees is The Empirical Rule Standard Normal Distribution: µ = 0 and s = % of data are within 3 standard deviations of the mean 95% within 2 standard deviations 68% within 1 standard deviation 34% 34% 2.4% 2.4% 0.1% 0.1% 13.5% 13.5% x - 3s x - 2s x - 1s x x + 1s x + 2s x + 3s 75

26 Notation P(a < z < b) betweena and b P(z > a) greater than, at least, more than, not less than P (z < a) less than, at most, no more than, not greater than Applications of Normal Distributions 77 Converting to Standard Normal Distribution z = x m s Figure

27 Probability of Sitting Heights Less Than 38.8 Inches m = 36.0 σ= 1.4 z = = Probability of Sitting Heights Less Than 38.8 Inches m = 36.0 P ( x < 38.8 in.) = P(z < 2) σ= 1.4 = Finding Values of Normal Distributions 81

28 Procedure for Finding Values Using Table A-2 and Formula Sketch a normal distribution curve, enter the given probability or percentage in the appropriate region of the graph, and identify the x value(s) being sought. 2. Use Table A-2 to find the z score corresponding to the cumulative left area bounded by x. Refer to the BODYof Table A-2 to find the closest area, then identify the corresponding z score. 3. Using Formula 6-2, enter the values for µ, s,, and thez score found in step 2, then solve for x. x = µ + (z s) (another form of Formula 6-2) (If z is located to the left of the mean, be sure that it is a negative number.) 4. Refer to the sketch of the curve to verify that the solution makes sense in the context of the graph and the context of the problem. 82 Find P 98 for Hip Breadths of Men x = m + (z? s) x = ( ) x = Table A-2: Positive Z- scores 84

29 Find P 98 for Hip Breadths of Men The hip breadth of 16.5 in. separates the lowest 98% from the highest 2% The Central Limit Theorem 86 Central Limit Theorem Conclusions: 1. The distribution of sample means x will, as the sample size increases, approach a normal distribution. 2. The mean of the sample means will be the population mean µ. 3. The standard deviation of the sample means will approach will approach s/ n. 87

30 Practical Rules Commonly Used: 1. For samples of size n larger than 30, the distribution of the sample means can be approximated reasonably well by a normal distribution. The approximation gets better as the sample size n becomes larger. 2. If the original population is itself normally distributed, then the sample means will be normally distributed for any sample size n (not just the values of n larger than 30). 88 Notation the mean of the sample means µ x = µ the standard deviation of sample means s x = s (often called standard error of the mean) n 89 Example: Given the population of men has normally distributed weights with a mean of 172 lb. and a standard deviation of 29 lb, b.) if 12 different men are randomly selected, find the probability that their mean weight is greater than 167 lb. z = =

31 Example: Given the population of men has normally distributed weights with a mean of 172 lb. and a standard deviation of 29 lb, b.) if 12 different men are randomly selected, find the probability that their mean weight is greater than 167 lb. The probability that the mean weight of 12 randomly selected men is greater than 167 lb. is Chapter 7 Estimates and Sample Sizes 92 Definition Confidence Interval (or Interval Estimate) a range (or an interval) of values used to estimate the true value of the population parameter Lower # < population parameter < Upper # As an example < p <

32 Confidence Interval for Population Proportion pˆ - E < p E = where z a / 2 < p ˆ + E ˆ p q ˆ n 94 Notation for Proportions p = population proportion p ˆ = x n sample proportion (pronounced p-hat ) of x successes in a sample of size n q ˆ = 1 - p ˆ = sample proportion of x failures in a sample size of n 95 Round-Off Rule for Confidence Interval Estimates of p Round the confidence interval limits to three significant digits 96

33 Procedure for Constructing a Confidence Interval for p 1. Verify that the required assumptions are satisfied. (The sample is a simple random sample, the conditions for the binomial distribution are satisfied, and the normal distribution can be used to approximate the distribution of sample proportions because np 5, and nq 5 are both satisfied). 2. Refer to Table A-2 and find the critical value z a/2 that corresponds to the desired confidence level. 3. Evaluate the margin of error E = p q ˆˆ n 97 Procedure for Constructing a Confidence Interval for p 4. Using the calculated margin of error, E and the value of the sample proportion, p, ˆ, find the values of p ˆ E and p ˆ + E.. Substitute those values in the general format for the confidence interval: p ˆ E < p < p ˆ + E 5. Round the resulting confidence interval limits to three significant digits. 98 Example: In the Chapter Problem, we noted that 829 adult Minnesotans were surveyed, and 51% of them are opposed to the use of the photo-cop for issuing traffic tickets. Use these survey results. Find the 95% confidence interval estimate of the population proportion p. 99

34 Example: In the Chapter Problem, we noted that 829 adult Minnesotans were surveyed, and 51% of them are opposed to the use of the photo-cop for issuing traffic tickets. Use these survey results. First, we check for assumptions. We note that np = , and nq = ˆ ˆ Next, we calculate the margin of error. We have found that p = 0.51, q = = 0.49, z a/ 2 = 1.96, and n = 829. ˆ E = 1.96 E = ˆ (0.51)(0.49) Example: In the Chapter Problem, we noted that 829 adult Minnesotans were surveyed, and 51% of them are opposed to the use of the photo-cop for issuing traffic tickets. Use these survey results. Find the 95% confidence interval for the population proportion p. We substitute our values from Part a to obtain: < p < , < p < Example: In a given example, we noted that 829 adult Minnesotans were surveyed, and 51% of them are opposed to the use of the photo-cop for issuing traffic tickets. Use these survey results. Based on the results, can we safely conclude that the majority of adult Minnesotans oppose use of the photo-cop? Based on the survey results, we are 95% confident that the limits of 47.6% and 54.4% contain the true percentage of adult Minnesotans opposed to the photo-cop. The percentage of opposed adult Minnesotans is likely to be any value between 47.6% and 54.4%. However, a majority requires a percentage greater than 50%, so we cannot safely conclude that the majority is opposed (because the entire confidence interval is not greater than 50%). 102

35 Estimating a Population Mean: s Not Known 103 Confidence Interval for the Estimate of m Based on an Unknown s and a Small Simple Random Sample from a Normally Distributed Population x - E < µ < x + E where E = t s a/2 n t a/2 found in Table A Table A-3 t Distribution Degrees of freedom.005 (one tail).01 (two tails).01 (one tail).02 (two tails).025 (one tail).05 (two tails).05 (one tail).10 (two tails).10 (one tail).20 (two tails).25 (one tail).50 (two tails) Large (z)

36 Example: A study of 12 Dodge Vipers involved in collisions resulted in repairs averaging $26,227 and a standard deviation of $15,873. Find the 95% interval estimate of m, the mean repair cost for all Dodge Vipers involved in collisions. (The 12 cars distribution appears to be bell-shaped.) x = 26,227 s = 15,873 a = 0.05 a/2 = t a /2 = E = t a / 2 s = (2.201)(15,873) = 10,085.3 n 12 x - E < µ < x + E 26,227-10,085.3 < µ < 26, ,085.3 $16,141.7 < µ < $36,312.3 We are 95% confident that this interval contains the average cost of repairing a Dodge Viper. 106 End of 7-2 and 7-3 Determining Sample Size Required to Estimate p and m 107 Sample Size for Estimating Proportion p When an estimate of p is known: n = ˆ z a/2 E 2 ˆ ˆ ( ) 2 p q ˆ Formula 7-2 When no estimate of p is known: ( z ) Formula 7-3 n = a/2 E 2 108

37 Example: Example:We want to determine, with a margin of error of four percentage points, the current percentage of U.S. households using . Assuming that we want 90% confidence in our results, how many households must we survey? A 1997 study indicates 16.9% of U.S. households used . n = [z a /2 ] 2 p q E 2 ˆˆ = [1.645] 2 (0.169)(0.831) = = 238 households To be 90% confident that our sample percentage is within four percentage points of the true percentage for all households, we should randomly select and survey 238 households. 109 Example: Example:We want to determine, with a margin of error of four percentage points, the current percentage of U.S. households using . Assuming that we want 90% confidence in our results, how many households must we survey? There is no prior information suggesting a possible value for the sample percentage. n = [z a /2 ] 2 (0.25) E = (1.645) 2 (0.25) = = 423 households With no prior information, we need a larger sample to achieve the same results with 90% confidence and an error of no more than 4%. 110 Sample Size for Estimating Mean m E = z s a/2 n (solve for n by algebra) n = z a/2 E s 2 Formula 7-5 z a /2 /2 = critical z score based on the desired degree of confidence E = desired margin of error s = population standard deviation 111

38 Example: If we want to estimate the mean weight of plastic discarded by households in one week, how many households must be randomly selected to be 99% confident that the sample mean is within 0.25 lb of the true population mean? (A previous study indicates the standard deviation is lb.) a = 0.01 z a/2 = E = 0.25 s = n = z a/2 s = (2.575)(1.065) E 0.25 = = 121 households We would need to randomly select 121 households and obtain the average weight of plastic discarded in one week. We would be 99% confident that this mean is within 1/4 lb of the population mean. 112 Chapter 8 Hypothesis Testing 113 Claim: Using math symbols H 0 : Must contain equality H 1 : Will contain,, <, > 114

39 Test Statistic The test statistic is a value computed from the sample data, and it is used in making the decision about the rejection of the null hypothesis. /\ z = p - p pq n Test statistic for proportions 115 Test Statistic The test statistic is a value computed from the sample data, and it is used in making the decision about the rejection of the null hypothesis. t = x - µ x s n Test statistic for mean 116 Test Statistic The test statistic is a value computed from the sample data, and it is used in making the decision about the rejection of the null hypothesis. c 2 = (n 1)s2 s 2 Test statistic for standard deviation 117

40 Critical Region Set of all values of the test statistic that would cause a rejection of the null hypothesis Critical Regions 118 Critical Value Any value that separates the critical region (where we reject the null hypothesis) from the values of the test statistic that do not lead to a rejection of the null hypothesis Reject H 0 Fail to reject H 0 Critical Value ( z score ) 119 Two-tailed, Right-tailed, tailed, Left-tailed tailed Tests The tails in a distribution are the extreme regions bounded by critical values. 120

41 Decision Criterion Traditional method: Reject H 0 if the test statistic falls within the critical region. Fail to reject H 0 if the test statistic does not fall within the critical region. 121 Wording of Final Conclusion Figure Comprehensive Hypothesis Test 123

42 Example: It was found that 821 crashes of midsize cars equipped with air bags, 46 of the crashes resulted in hospitalization of the drivers. Using the 0.01 significance level, test the claim that the air bag hospitalization is lower than the 7.8% rate for cars with automatic tic safety belts. Claim: p < p = 46 / 821 = reject H 0 H 0 : p = H 1 : p < a = 0.01 p = z = p - p pq n (0.078 )(0.922) 821 z = =» z = p = There is sufficient evidence to support claim that the air bag hospitalization rate is lower than the 7.8% rate for automatic safety belts Testing a Claim about a Mean: s Not Known 125 Example: Seven axial load scores are listed below. At the 0.01 level of significance, test the claim that this sample comes from a population with a mean that is greater than 165 lbs n = 7 df = 6 x = lb s = 27.6 lb Claim: Claim: µ > 165 lb H 0 : µ = 165 lb H 1 : µ > 165 lb (right tailed test) 126

43 a = t = t = x - µ t = x s = = n Reject H o 127 Example: Seven axial load scores are listed below. At the 0.01 level of significance, test the claim that this sample comes from a population with a mean that is greater than 165 lbs Final conclusion: There is sufficient evidence to support the claim that the sample comes from a population with a mean greater than 165 lbs. Reject Claim: µ > 165 lb H 0 : µ = 165 lb H 1 : µ > 165 lb (right tailed test) Testing a Claim about a Standard Deviation or Variance 129

44 Chi-Square Distribution Test Statistic X 2 = (n - 1) s 2 s 2 n s 2 s 2 = sample size = sample variance = population variance (given in null hypothesis) 130 Critical Values and P-values for Chi-Square Distribution Found in Table A-4 Degrees of freedom = n -1 Based on cumulative areas from the RIGHT 131 Table A-4: Critical values are found by determining the area to the RIGHT of the critical value df = 80 a = 0.05 a/2 =

45 Example: Aircraft altimeters have measuring errors with a standard deviation of 43.7 ft. With new production equipment, 81 altimeters ers measure errors with a standard deviation of 52.3 ft. Use the significance level to test the claim that the new altimeters have a standard deviation different from the old value of 43.7 ft. Claim: s 43.7 H 0 : s = 43.7 H 1 : s a = a/2 = n = 81 df = 80 Table A x 2 = (n -1)s 2 (81-1) (52.3) = 2» s Reject H x 2 = Example: Aircraft altimeters have measuring errors with a standard deviation of 43.7 ft. With new production equipment, 81 altimeters ers measure errors with a standard deviation of 52.3 ft. Use the significance level to test the claim that the new altimeters have a standard deviation different from the old value of 43.7 ft. SUPPORT REJECT Claim: s 43.7 H 0 : s = 43.7 H 1 : s 43.7 The new production method appears to be worse than the old method. The data supports that there is more variation in the error readings than before. 135

46 Table 8-3 Hypothesis Tests Parameter Conditions Distribution and Test Statistic Critical and P-values Proportion np =5 and nq =5 Normal: z p = ˆ p p q Table A-2 n Mean Standard Deviation or Variance σ not known and normally distributed or n =30 Population normally distributed Student t: t = X s n µ Chi-Square: x 2 ( n 1) = 2 σ X 2 s Table A-3 Table A Chapter 10 Correlation and Regression 137 Overview Paired Data is there a relationship if so, what is the equation use the equation for prediction 138

47 Definition Correlation exists between two variables when one of them is related to the other in some way 139 Definition Scatterplot (or scatter diagram) is a graph in which the paired (x,y)) sample data are plotted with a horizontal x axis and a vertical y axis. Each individual (x,y)) pair is plotted as a single point. 140 Scatter Diagram of Paired Data Lengths and Weights of Male Bears 500 Weight (lb.) (72,416) (68.5,360) (67.5,344) (72,348) (73,332) (73.5,262) (37,34) (53,80) Length (in.) 141

48 Positive Linear Correlation y y y (a) Positive x (b) Strong positive x (c) Perfect positive x Figure 10-2 Scatter Plots 142 Negative Linear Correlation y y y (d) Negative x (e) Strong negative x (f) Perfect negative x Figure 10-2 Scatter Plots 143 No Linear Correlation y y (g) No Correlation x (h) Nonlinear Correlation x Figure 10-2 Scatter Plots 144

49 Definition Linear Correlation Coefficient r measures strength of the linear relationship between paired x- and y-quantitative values in a sample 145 Definition Linear Correlation Coefficient r r = nsxy - (Sx)(Sy) n(sx 2 ) - (Sx) 2 n(sy 2 ) - (Sy) 2 Formula 10-1 Calculators can compute r r (rho) is the linear correlation coefficient for all paired data in the population. 146 Rounding the Linear Correlation Coefficient r Round to three decimal places so that it can be compared to critical values in Table A-5 Use calculator or computer if possible 147

50 Interpreting the Linear Correlation Coefficient If the absolute value of r exceeds the value in Table A - 5, conclude that there is a significant linear correlation. Otherwise, there is not sufficient evidence to support the conclusion of significant linear correlation. 148 TABLE A-5 Critical Values of the Pearson Correlation Coefficient r n a =.05 a = Properties of the Linear Correlation Coefficient r r 1 2. Value of r does not change if all values of either variable are converted to a different scale. 3. The value of r is not affected by the choice of x and y.. Interchange x and y and the value of r will not change. 4. r measures strength of a linearrelationship. relationship. 150

51 Formal Hypothesis Test To determine whether there is a significant linear correlation between two variables Two methods Both methods let H 0 : r = 0 (no significant linear correlation) H 1 : r 0 (significant linear correlation) 151 Method 2: Test Statistic is r (uses fewer calculations) Test statistic: r Critical values: Refer to Table A-5 (no degrees of freedom) Reject r = 0 Fail to reject r = 0 Reject r = 0-1 r = r = Sample data: r = Is there a significant linear correlation? Data from the Garbage Project x Plastic (lb) y Household n = 8 a = 0.05 H 0 : r = 0 H 1 :r 0 Test statistic is r =

52 Is there a significant linear correlation? n = 8 a = 0.05 H 0 : r = 0 H 1 :r 0 Test statistic is r = Critical values are r = and (Table A-5 with n = 8 and a = 0.05) TABLE A -5 Critical Values of the Pearson Correlation Coefficient r n a =.05 a = Is there a significant linear correlation? > The test statistic does fall within the critical region. Therefore, we REJECT H 0 : r = 0 (no correlation) and conclude there is a significant linear correlation between the weights of discarded plastic and household size. Reject r = 0 Fail to reject r = 0 Reject r = 0-1 r = r = Sample data: r = Regression 156

53 Definition Regression Regression Equation Given a collection of paired data, the regression equation y ^ = b 0 + b 1 x algebraically describes the relationship between the two variables Regression Line (line of best fit or least-squares squares line) the graph of the regression equation 157 The Regression Equation x is the independent variable (predictor variable) ^y is the dependent variable (response variable) y ^= b 0 +b 1 x y = mx +b b 0 = y - intercept b 1 = slope 158 Regression Line Plotted on Scatter Plot 159

54 Formula for b 1 and b 0 Formula 10-2 b 1 = n(σxy) (Σx) (Σy) n(σx 2 ) (Σx) 2 (slope) Formula 10-3 b 0 = y b 1 x (y-intercept) calculators or computers can compute these values 160 Rounding the y-intercept b 0 and the slope b 1 Round to three significant digits If you use the formulas 10-2, 10-3, try not to round intermediate values or carry to at least six significant digits. 161 Example: Lengths and Weights of Male Bears x Length (in.) y Weight (lb) b 0 = (rounded) b 1 = 9.66 (rounded) y ^ = x 162

55 Scatter Diagram of Paired Data Lengths and Weights of Male Bears 500 Weight (lb.) Length (in.) 163 Predictions In predicting a value of y based on some given value of x If there is not a significant linear correlation, the best predicted y-value is y. 2. If there is a significant linear correlation, the best predicted y-value is found by substituting the x-value into the regression equation. 164 Guidelines for Using The Regression Equation 1. If there is no significant linear correlation, don t use the regression equation to make predictions. 2. When using the regression equation for predictions, stay within the scope of the available sample data. 3. A regression equation based on old data is not necessarily valid now. 4. Don t make predictions about a population that is different from the population from which the sample data was drawn. 165

56 Example: Lengths and Weights of Male Bears x Length (in.) y Weight (lb.) y ^ = x What is the weight of a bear that is 60 inches long? Since the data does have a significant positive linear correlation, we can use the regression equation for prediction. 166 Example: Lengths and Weights of Male Bears x Length (in.) y Weight (lb.) y ^ = (60) ^ y = pounds 167 Example: Lengths and Weights of Male Bears x Length (in.) y Weight (lb.) A bear that is 60 inches long will weigh approximately pounds. 168

57 Example: Lengths and Weights of Male Bears x Length (in.) y Weight (lb.) If there were no significant linear correlation, to predict a weight for any length: use the average of the weights (y-values) y = 272 lbs 169 Chapter 11 Multinomial Experiments And Contingency Tables Multinomial Experiments 171

58 Definition Goodness-of-fit fit test used to test the hypothesis that an observed frequency distribution fits (or conforms to) some claimed distribution 172 Goodness-of-Fit Test Notation 0 represents the observed frequency of an outcome E represents the expected frequency of an outcome k n represents the number of different categories or outcomes represents the total number of trials 173 Expected Frequencies If all expected frequencies are equal: E = n k the sum of all observed frequencies divided by the number of categories 174

59 Expected Frequencies If all expected frequencies are not all equal: E = n p each expected frequency is found by multiplying the sum of all observed frequencies by the probability for the category 175 Key Question We need to measure the discrepancy between O and E; the test statistic will involve their difference: O - E 176 Test Statistic X 2 = S (O - E)2 E Critical Values 1. Found in Table A-4 using k -1 degrees of freedom where k = number of categories 2. Goodness-of-fit fit hypothesis tests are always right-tailed. tailed. 177

60 Multinomial Experiment: Goodness-of-Fit Test H 0 : No difference between observed and expected probabilities H 1 : at least one of the probabilities is different from the others 178 Categories with Equal Frequencies H 0 : p 1 = p 2 = p =... = p 3 k H 1 (Probabilities) : at least one of the probabilities is different from the others 179 Example: A study was made of 147 industrial accidents that required medical attention. Test the claim that the accidents occur with equal proportions on the 5 workdays. Frequency of Accidents Day Mon Tues Wed Thurs Fri Observed accidents Claim: Accidents occur with the same proportion (frequency); that is, p 1 = p 2 = p 3 = p 4 = p 5 H 0 : p 1 = p 2 = p 3 = p 4 = p 5 H 1 : At least 1 of the 5 proportions is different from others 180

61 Example: A study was made of 147 industrial accidents that required medical attention. Test the claim that the accidents occur with equal proportions on the 5 workdays. Frequency of Accidents Day Mon Tues Wed Thurs Fri Observed accidents O: E: E = n/k = 147/5 = 29.4 Observed and Expected Frequencies Day Mon Tues Wed Thurs Fri Observed accidents Expected accidents Observed and Expected Frequencies of Industrial Accidents Day Mon Tues Wed Thurs Fri Observed accidents Expected accidents (O -E) 2 /E (rounded) Test Statistic: X 2 (O -E) = S 2 E = = Critical Value:X 2 = Table A-4 with k-1 1 = 5-1 = 4 and a = Fail to Reject p 1 = p 2 = p 3 = p 4 = p 5 Reject p 1 = p 2 = p 3 = p 4 = p 5 a = X 2 = Sample data: X 2 = Test Statistic falls within the critical region: REJECT the null hypothesis Claim: Accidents occur with the same proportion (frequency); that is, p 1 = p 2 = p 3 = p 4 = p 5 H 0 : p 1 = p 2 = p 3 = p 4 = p 5 H 1 : At least 1 of the 5 proportions is different from others 183

62 Fail to Reject p 1 = p 2 = p 3 = p 4 = p 5 Reject p 1 = p 2 = p 3 = p 4 = p 5 a = X 2 = Sample data: X 2 = Test Statistic falls within the critical region: REJECT the null hypothesis We reject claim that the accidents occur with equal proportions (frequency) on the 5 workdays. (Although it appears Wednesday has a lower accident rate, arriving at such a conclusion would require other methods of analysis.) 184 Categories with Unequal Frequencies (Probabilities) H 0 : p 1, p 2, p,..., p 3 k are as claimed H 1 : at least one of the above proportions is different from the claimed value 185 Example: Mars, Inc. claims its M&M candies are distributed with the color percentages of 30% brown, 20% yellow, 20% red, 10% orange, 10% green, and 10% blue. At the 0.05 significance level, test the claim that the color distribution is as claimed by Mars, Inc. Claim: p = 0.30, p 1 2 = 0.20, p 3 = 0.20, p 4 p = 0.10, p = 0.10, 5 6 = 0.10 H : p 0 1 = 0.30, p 2 = 0.20, p 3 = 0.20, p 4 = 0.10, p = 0.10, p 5 6 = 0.10 H 1 : At least one of the proportions is different from the claimed value. 186

63 Example: Mars, Inc. claims its M&M candies are distributed with the color percentages of 30% brown, 20% yellow, 20% red, 10% orange, 10% green, and 10% blue. At the 0.05 significance level, test the claim that the color distribution is as claimed by Mars, Inc. Frequencies of M&Ms Brown Yellow Red Orange Green Blue Observed frequency n = 100 Brown E= np = (100)(0.30) = 30 Yellow E= np = (100)(0.20) = 20 Red E= np = (100)(0.20) = 20 Orange E= np = (100)(0.10) = 10 Green E= np = (100)(0.10) = 10 Blue E= np = (100)(0.10) = Frequencies of M&Ms Brown Yellow Red Orange Green Blue Observed frequency Expected frequency (O -E) 2 /E Test Statistic (O - E) 2 X 2 = S E = 5.95 Critical Value X 2 = (with k-1 1 = 5 and a = 0.05) 188 Fail to Reject Reject a = X 2 = Sample data: X 2 = 5.95 Test Statistic does not fall within critical region; Fail to reject H 0 : percentages are as claimed There is not sufficient evidence to warrant rejection of the claim that the colors are distributed with the given percentages. 189

64 11-3 Contingency Tables 190 Definition Contingency Table (or two -way frequency table) a table in which frequencies correspond to two variables. (One variable is used to categorize rows, and a second variable is used to categorize columns.) Contingency tables have at least two rows and at least two columns. 191 Definition Test of Independence tests the null hypothesis that there is no association between the row variable and the column variable. (The null hypothesis is the statement that the row and column variables are independent.) 192

65 H 0 H 1 Tests of Independence : The row variable is independent of the column variable : The row variable is dependent (related to) the column variable This procedure cannot be used to establish a direct cause-and-effect link between variables in question. Dependence means only there is a relationship between the two variables. 193 Test of Independence Test Statistic X 2 = S (O - E)2 E Critical Values 1. Found in Table A-4 using degrees of freedom = (r - 1)(c - 1) r is the number of rows and c is the number of columns 2. Tests of Independence are always right-tailed. tailed. 194 E = (row total) (column total) (grand total) Total number of all observed frequencies in the table 195

66 Is the type of crime independent of whether the criminal is a stranger? Stranger Homicide Robbery Assault Row Total 1118 Acquaintance or Relative Column Total H 0 : Type of crime is independent of knowing the criminal H 1 : Type of crime is dependent with knowing the criminal 196 Is the type of crime independent of whether the criminal is a stranger? Stranger Acquaintance or Relative Homicide Robbery Assault (29.93) 39 (284.64) 106 (803.43) 642 (21.07) (200.36) (565.57) Row Total Column Total E = (1118)(51) 1905 (row total) (column total) E = (grand total) = E = (1118)(485) 1905 etc. = Is the type of crime independent of whether the criminal is a stranger? X 2 = S (O - E )2 E Stranger Acquaintance or Relative Homicide Robbery Forgery 12 (29.93) [10.741] 39 (21.07) [15.258] 379 (284.64) [31.281] 106 (200.36) [44.439] 727 (803.43) [7.271] 642 (565.57) [10.329] (E) (O - E ) 2 E (O -E ) 2 ( ) 2 Upper left cell: = = E

67 Is the type of crime independent of whether the criminal is a stranger? X 2 = S (O - E )2 E Stranger Acquaintance or Relative Homicide Robbery Forgery 12 (29.93) [10.741] 39 (21.07) [15.258] 379 (284.64) [31.281] 106 (200.36) [44.439] 727 (803.43) [7.271] 642 (565.57) [10.329] (E) (O - E ) 2 E Test Statistic X 2 = = Test Statistic: X 2 = with a = 0.05 and (r -1) (c -1) = (2-1) (3-1) = 2 Critical Value:X 2 = (from Table A-4) 1) = 2degrees of freedom Fail to Reject Independence Reject Independence 0 X 2 = a = 0.05 Reject independence Sample data: X 2 = H o H 1 : The type of crime and knowing the criminal are independent : The type of crime and knowing the criminal are dependent 200 Test Statistic: X 2 = with a = 0.05 and (r -1) (c -1) = (2-1) (3-1) = 2 Critical Value:X 2 = (from Table A-4) 1) = 2degrees of freedom Fail to Reject Independence Reject Independence 0 X 2 = a = 0.05 Reject independence Sample data: X 2 = It appears that the type of crime and knowing the criminal are related. 201

68 Definition Test of Homogeneity tests the claim that different populations have the same proportions of some characteristics 202 Example - Test of Homogeneity Seat Belt Use in Taxi Cabs New York Chicago Pittsburgh Taxi has Yes usable No seat belt? Claim: The 3 cities have the same proportion of taxis with usab le seat belts H 0 : The 3 cities have the same proportion of taxis with usable seat belts H 1 : The proportion of taxis with usable seat belts is not the same in all 3 cities Fail to Reject homogeneity 0 a = 0.05 X 2 = Reject homogeneity There is sufficient evidence to warrant rejection of the claim that the 3 cities have the same proportion of usable seat belts in taxis; appears from Table Chicago has a much higher proportion. Sample data: X Final Review. Triola, Essentials of Statistics, Third 2 = Edition. Copyright Pearson Education, Inc