Tests for Goodness of Fit: General Notion
We often wish to know whether a particular distribution fits a general definition.
Example: To use t tests, we must suppose that the population is normally distributed.
If a sample is drawn from, say, a normal distribution, the sample values should reflect the population distribution. This allows us to state the number in the sample that should be in a particular range.
Example: 68% of a normal distribution is within +/- 1 standard deviation of the mean, so about 68% of the values in a sample from a normal distribution should be within +/- 1 standard deviation of the mean.
Comparison of actual and expected numbers is the province of the chi-square ($\chi^2$) distribution.
Let $O_j$ be the number observed in the sample in range $j$.
Let $E_j$ be the number that would be expected if the population had a given distribution, such as uniform, Poisson, normal, etc.
Then
$$\chi^2 = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j}$$
where $k$ is the number of categories.
Degrees of freedom $= k - 1 - m$, where $m$ is the number of parameter estimates used in the calculation.
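The statistic and its degrees of freedom translate directly into a few lines of code. The following is a minimal sketch in Python; the function names and the three-category counts are hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
def chi_square_statistic(observed, expected):
    """Return the sum over categories of (O_j - E_j)^2 / E_j."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def degrees_of_freedom(k, m=0):
    """k categories, m parameters estimated from the sample."""
    return k - 1 - m


# Hypothetical counts (not from the lecture examples).
observed = [10, 20, 30]
expected = [20, 20, 20]

statistic = chi_square_statistic(observed, expected)  # (100 + 0 + 100) / 20
df = degrees_of_freedom(len(observed))                # 3 - 1 - 0
print(statistic, df)
```

A large statistic means the observed counts sit far from the expected ones, which is evidence against the hypothesized distribution.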
Example: Are the answers to Dr. Dinwiddie's multiple-choice tests random?
If so, the answers should conform to a uniform distribution and P(A) = P(B) = P(C) = P(D) = ¼. (For the uniform distribution P(E) = 1/n, where n is the number of possible values.)
On a recent exam there were sixty questions with correct answers: A-20, B-5, C-17, and D-18.
$H_0$: the distribution of answers is uniform
$H_1$: the distribution is not uniform
Correct Answer   Observed   Expected   Squared Difference
A                20         15         25
B                5          15         100
C                17         15         4
D                18         15         9

$$\chi^2 = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j} = \frac{25 + 100 + 4 + 9}{15} = 9.2$$
No parameters were estimated, so degrees of freedom = 4 - 1 = 3.
Excel and the chi-square distribution
CHIDIST(x value, df) returns the area in the right-hand tail of the chi-square distribution. Goodness of fit tests are all upper one-tail tests, so CHIDIST gives the p-value of the test.
CHIINV(probability, df) gives the chi-square value that cuts off the entered probability in the upper tail. Use it to find the critical value for a chi-square test.
For the Dinwiddie problem: CHIDIST(9.2, 3) gives the p-value of the test.
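For readers working outside Excel, the two functions can be approximated with the standard library alone. This is a sketch, not Excel's implementation: `chi2_upper_tail` and `chi2_critical_value` are hypothetical names, the tail area comes from a power series for the incomplete gamma function, and the critical value is found by simple bisection. The inputs 9.2 and 3 assume the Dinwiddie counts 20, 5, 17, 18 against an expected 15 each.

```python
import math


def chi2_upper_tail(x, df):
    """Upper-tail area of the chi-square distribution (the role CHIDIST plays).

    Computes the regularized lower incomplete gamma function P(a, z) by its
    power series, with a = df / 2 and z = x / 2, and returns 1 - P(a, z).
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1.0 - lower


def chi2_critical_value(alpha, df, upper=200.0):
    """Value whose upper-tail area is alpha (the role CHIINV plays)."""
    lo, hi = 0.0, upper
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_upper_tail(mid, df) > alpha:
            lo = mid   # tail still too large, move right
        else:
            hi = mid
    return (lo + hi) / 2.0


p_value = chi2_upper_tail(9.2, 3)
critical = chi2_critical_value(0.05, 3)
print(f"p-value = {p_value:.4f}, 5% critical value = {critical:.3f}")
```

The two views agree: the null hypothesis is rejected at the 5% level either when the p-value is below 0.05 or when the statistic exceeds the critical value.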
EXAMPLE: Hamish suspects that the dice at Black Bart's are not fair, so he spirits one out of the casino one night. After rolling the stolen die 120 times, he has the following result:

No. of Dots   No. of Times
1             27
2             24
3             18
4             11
5             27
6             13

What are the null and alternative hypotheses? Is Hamish right to be suspicious of Black Bart?
$$\chi^2 = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j}$$
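One way to check the dice example is to compute each face's contribution to the statistic. The sketch below assumes the counts 27, 24, 18, 11, 27, 13 (which sum to 120 rolls) and compares the result with 11.07, the tabled 5% critical value for 5 degrees of freedom; the variable names are illustrative only.

```python
# Counts assumed from the example: faces 1..6 seen 27, 24, 18, 11, 27, 13 times.
observed = [27, 24, 18, 11, 27, 13]
rolls = sum(observed)                 # total number of rolls
expected = rolls / len(observed)      # same expected count per face if the die is fair

contributions = [(o - expected) ** 2 / expected for o in observed]
statistic = sum(contributions)
df = len(observed) - 1                # no parameters estimated

# 11.07 is the tabled 5% upper-tail critical value for 5 degrees of freedom.
CRITICAL_5_PERCENT_DF5 = 11.07
reject_fair_die = statistic > CRITICAL_5_PERCENT_DF5

for face, (o, c) in enumerate(zip(observed, contributions), start=1):
    print(f"face {face}: observed {o:2d}, contribution {c:.2f}")
print(f"chi-square = {statistic:.2f}, df = {df}, reject at 5%: {reject_fair_die}")
```

Listing the contributions shows which faces drive the result, which a single summary number hides.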
Testing for normality
Suppose that nationally auto insurance has a mean price of $700 with standard deviation $135. We have a sample of 80 NC drivers, and we'd like to know whether their insurance bills are normally distributed with the national parameters.
How many would we expect in the range 700 to 835?
HINT: How many standard deviations is that? What proportion of a normal distribution is within that range of standard deviations?
Answer: on a normal distribution, 0.34 of the values are between the mean and +1 standard deviation, so we'd expect to find 0.34 * 80 = 27.2 in that range.
Setting up a spreadsheet: use NORMSDIST.
NORMSDIST(-2) gives the proportion more than two standard deviations below the mean.
NORMSDIST(-1) - NORMSDIST(-2) gives the proportion between 1 and 2 standard deviations below the mean.
Continuing in that fashion, we'd have the following:

St Devs     Range      Prop.     Expected freq
< -2        < 430      0.02275    1.82
-2 to -1    430-565    0.1359    10.87
-1 to 0     565-700    0.3413    27.31
0 to 1      700-835    0.3413    27.31
1 to 2      835-970    0.1359    10.87
> 2         > 970      0.02275    1.82
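The same table can be reproduced outside a spreadsheet. The sketch below uses Python's `statistics.NormalDist` in place of NORMSDIST, with the national mean 700, standard deviation 135, and sample size 80 from the example; the variable names are illustrative.

```python
from statistics import NormalDist

mean, sd, n = 700, 135, 80
dist = NormalDist(mu=mean, sigma=sd)

# Bin edges at -2, -1, 0, 1, 2 standard deviations: 430, 565, 700, 835, 970.
edges = [mean + z * sd for z in (-2, -1, 0, 1, 2)]

# Cumulative proportion at each edge, padded with 0 and 1 for the open-ended bins.
cumulative = [0.0] + [dist.cdf(e) for e in edges] + [1.0]
proportions = [hi - lo for lo, hi in zip(cumulative, cumulative[1:])]
expected = [n * p for p in proportions]

labels = ["< 430", "430-565", "565-700", "700-835", "835-970", "> 970"]
for label, p, e in zip(labels, proportions, expected):
    print(f"{label:>8}  {p:.4f}  {e:6.2f}")
```

Note that the two outer bins have expected frequencies well below 5, which matters for the validity of the chi-square approximation.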
To find the observed values in the sample, use the HISTOGRAM tool.
An elaborated solution appears under Study Aids on my web site. Click on the link to normaltest.xls.
Issue: how many degrees of freedom does the statistic have?
df = k - 1 - m = 6 - 1 - 0 = 5
Alternate technique: determine whether the sample was drawn from a normal population, without assuming the national parameters.
First, calculate the sample mean and standard deviation and use those numbers in the calculation.
Issue: how many degrees of freedom does the statistic have? Two parameters are now estimated from the sample, so m = 2.
df = k - 1 - m = 6 - 1 - 2 = 3
A problem and an alternate solution
Each cell should have an expected frequency of at least 5; otherwise the chi-square value is not reliable.
One solution: choose ranges with equal expected frequencies.
- Divide the data into, say, 10 ranges, each expected to contain 8 observations, so each range contains 1/10 of the total.
- Remember that NORMINV(probability, mean, standard deviation) returns the value below which the given probability lies, for the specified mean and standard deviation.
- Example: NORMINV(.1, 300, 20) = 274.37, so 10% of this distribution lies below 274.37.
- NORMINV(1/10, x̄, s) finds the boundary of the lowest 10% of the distribution, NORMINV(4/10, x̄, s) finds the boundary of the lowest 40%, and so on.
Look carefully at sheet 2 of the workbook normaltest.xls as posted.
The boundaries thus found are the bin range. Each bin will have an expected number equal to n/c, where n is the number of observations and c is the number of categories.
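The equal-frequency approach can be sketched with `NormalDist.inv_cdf`, which plays the role of NORMINV here. The function names are hypothetical, and the mean 700 and standard deviation 135 stand in for the sample values x̄ and s, which the example does not give.

```python
from bisect import bisect_left
from statistics import NormalDist


def equal_probability_edges(mean, sd, bins):
    """Interior boundaries that split a normal distribution into equal-area bins."""
    dist = NormalDist(mean, sd)
    return [dist.inv_cdf(i / bins) for i in range(1, bins)]


def observed_counts(data, edges):
    """Count values per bin; a value equal to an edge falls in the lower bin."""
    counts = [0] * (len(edges) + 1)
    for x in data:
        counts[bisect_left(edges, x)] += 1
    return counts


# Check against the NORMINV(.1, 300, 20) example.
example_boundary = NormalDist(300, 20).inv_cdf(0.1)

# Placeholder parameters: 700 and 135 stand in for the sample mean and sd.
n, bins = 80, 10
edges = equal_probability_edges(700, 135, bins)
expected_per_bin = n / bins   # n / c

print(f"{example_boundary:.2f}")
print([round(e, 1) for e in edges], expected_per_bin)
```

With the sample values in a list, `observed_counts(sample, edges)` would give the observed column to compare against an expected 8 per bin.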
Testing for conformity to an observed distribution
The national distribution of pets is as follows:

Number of Pets   Percentage of Households
0                55
1                25
2                10
3                5
4                3
5 or more        2

A marketing company wants to know whether Boone conforms to the national pattern. In a sample of 300 Boone households, they found the following:
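No. of Pets   No. of Households   Expected No.   Squares
0             128
1             75
2             50
3             20
4             18
5 or more     9

The Expected No. and Squares columns are left to be filled in, using
$$\chi^2 = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j}$$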
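The expected counts come from applying the national percentages to the sample size. The sketch below assumes national shares of 55, 25, 10, 5, 3, 2 percent and Boone counts of 128, 75, 50, 20, 18, 9 (which sum to 300); the variable names are illustrative.

```python
# Assumed national percentages for 0, 1, 2, 3, 4, and 5-or-more pets.
national_pct = [55, 25, 10, 5, 3, 2]
# Assumed Boone sample counts for the same categories.
observed = [128, 75, 50, 20, 18, 9]

n = sum(observed)
expected = [n * pct / 100 for pct in national_pct]

squares = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
statistic = sum(squares)
df = len(observed) - 1   # the percentages are given, so no parameters are estimated

labels = ["0", "1", "2", "3", "4", "5 or more"]
for label, o, e, s in zip(labels, observed, expected, squares):
    print(f"{label:>9}  {o:4d}  {e:6.1f}  {s:6.2f}")
print(f"chi-square = {statistic:.2f}, df = {df}")
```

Every expected count here is at least 5, so the cell-size condition is met without regrouping categories.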