GOODNESS OF FIT INTRODUCTION GOODNESS OF FIT TESTS

Dec 2014

INTRODUCTION

Goodness-of-fit tests are used to determine how well the shape of a sample of data obtained from an experiment matches a conjectured or hypothesized distribution shape for the population from which the data were collected. The idea behind a goodness-of-fit test is to see whether the sample comes from a population with the claimed distribution. Put another way, the test asks whether the frequency distribution fits a specific pattern or, more to the point, how the observed frequencies in each class interval of a histogram compare with the frequencies that would theoretically be expected if the data exactly followed the hypothesized probability distribution. This is relevant to cost risk analysis because we often want to apply a distribution to an element of cost based on observed sample data.

A goodness-of-fit test is a statistical hypothesis test: set up the null and alternative hypotheses; determine alpha; calculate a test statistic; look up a critical value; draw a conclusion.

In this course, we will discuss three tests that are commonly used to perform goodness-of-fit analyses: the Chi-Square ($\chi^2$) test, the Kolmogorov-Smirnov One Sample Test, and the Anderson-Darling test. The Kolmogorov-Smirnov and Anderson-Darling tests are restricted to continuous distributions, while the $\chi^2$ test can be applied to both discrete and continuous distributions.

GOODNESS OF FIT TESTS

CHI SQUARE TEST

The Chi-Square test is used to test whether a sample of data came from a population with a specific distribution. An attractive feature of the chi-square goodness-of-fit test is that it can be applied to any univariate distribution for which you can calculate the cumulative distribution function. The chi-square goodness-of-fit test is applied to binned data (i.e., data put into classes). This is actually not a restriction, since for non-binned data you can simply calculate a histogram or

frequency table before generating the chi-square test. However, the value of the chi-square test statistic depends on how the data are binned. The chi-square test also requires a sufficient sample size for the chi-square approximation to be valid.

The chi-square statistic measures how well the expected frequencies of the fitted distribution compare with the frequencies of a histogram of the observed data. It compares the histogram of the data to the shape of the candidate density (continuous data) or mass (discrete data) function.

Definition

The chi-square test is defined for the hypothesis:
H0: The data follow a specified distribution.
H1: The data do not follow the specified distribution.

Test Statistic

For the chi-square goodness-of-fit computation, the data are divided into k bins and the test statistic is defined as:

$\chi^2 = \sum_{i=1}^{k} (O_i - E_i)^2 / E_i$

where $O_i$ is the observed frequency for bin i and $E_i$ is the expected frequency for bin i. Computation of the expected frequency ($E_i$) will be shown by example. For the chi-square approximation to be valid, the expected frequency in each bin should be at least 5. The test is less sensitive when the sample size is small, and if some of the theoretical bin counts are less than five, you may need to combine bins so that there are at least 5 theoretical observations in each bin.

Significance Level: α

Critical Region: The test statistic follows, approximately, a chi-square distribution with (k − 1 − number of population parameters estimated) degrees of freedom, where k is the number of nonempty bins. If specific sample statistics need to be computed in order to develop the binning, then the degrees of freedom are reduced by the number of statistics that were computed. The hypothesis that the data are from a population with the specified distribution is rejected if the computed $\chi^2$ is greater than the critical value.
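
As a minimal sketch of the test statistic and degrees of freedom defined above, the following Python function sums $(O_i - E_i)^2 / E_i$ over the bins. The function name and the bin counts are hypothetical illustrations, not values from the course data.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O_i - E_i)^2 / E_i over all bins."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of bins")
    if any(e < 5 for e in expected):
        # Rule from the text: each expected frequency should be at least 5.
        raise ValueError("combine bins so every expected frequency is at least 5")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for k = 4 bins (illustration only).
observed = [8, 12, 14, 6]
expected = [10.0, 10.0, 10.0, 10.0]

chi_sq = chi_square_statistic(observed, expected)
parameters_estimated = 0  # assumption: the distribution was fully specified in advance
degrees_of_freedom = len(observed) - 1 - parameters_estimated

print(chi_sq, degrees_of_freedom)
```

The guard on expected frequencies simply enforces the "at least 5 per bin" rule; a caller would combine adjacent bins before calling the function.
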
Note that the information needed to determine critical values from the $\chi^2$ distribution is the level of significance (α) and the degrees of freedom (df). If the sum of the squared deviations, $\chi^2 = \sum_{i=1}^{k} (O_i - E_i)^2 / E_i$, is small, the observed frequencies are close to the expected frequencies and there would be no reason to reject

the claim that it came from that distribution. Only when the sum is large is there a reason to question the distribution. Therefore, the chi-square goodness-of-fit test is always a right-tail test.

KOLMOGOROV-SMIRNOV TEST

The Kolmogorov-Smirnov One Sample Test, also referred to as the KS test, is an alternative to the $\chi^2$ test. It is called a distribution-free test because it does not require any assumptions about the underlying distribution of the test statistic. The KS test compares the cumulative relative frequency distribution derived from sample data with the theoretical cumulative relative frequency distribution that is described by the Null Hypothesis. In essence, the KS test is based on the maximum distance between these two cumulative relative frequency curves.

The test statistic, D, is the maximum absolute deviation between the observed cumulative relative frequencies and the expected (theoretical) cumulative relative frequencies. Depending on the probability that such a deviation would occur if the sample data really came from the distribution specified in the Null Hypothesis, the Null Hypothesis is either rejected or not rejected. Note that the KS test works with relative frequencies, which are percentages rather than actual frequencies. The KS test is restricted to continuous distributions only.

Definition

The Kolmogorov-Smirnov test is defined as:
H0: The data follow a specified distribution.
H1: The data do not follow the specified distribution.

Test Statistic: The Kolmogorov-Smirnov test statistic is defined as:

$D = \max |F_o - F_e|$

where:
$F_o$ = observed cumulative relative frequency
$F_e$ = theoretical cumulative relative frequency
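
The definition of D can be sketched in a few lines of Python. This is an illustration under stated assumptions: it follows the binned, cumulative-relative-frequency form described above, and the function name and the frequencies are hypothetical rather than taken from the course tables.

```python
from itertools import accumulate


def ks_statistic(observed_freq, expected_freq):
    """Return D = max |F_o - F_e| from per-bin frequencies."""
    n_obs = sum(observed_freq)
    n_exp = sum(expected_freq)
    # Cumulative relative frequencies for the sample and for the hypothesized distribution.
    f_o = [c / n_obs for c in accumulate(observed_freq)]
    f_e = [c / n_exp for c in accumulate(expected_freq)]
    return max(abs(o - e) for o, e in zip(f_o, f_e))


# Hypothetical frequencies for five bins and a sample of 25 (illustration only).
observed_freq = [2, 5, 10, 6, 2]
expected_freq = [2.5, 5.0, 10.0, 5.0, 2.5]

d = ks_statistic(observed_freq, expected_freq)
print(round(d, 4))
```

The computed D is then compared with a tabled critical value for the chosen significance level and sample size.
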

Significance Level: α

Critical Values: The hypothesis regarding the distributional form is rejected if the test statistic, D, is greater than the critical value obtained from a table. There are several variations of these tables in the literature that use somewhat different scalings for the KS test statistic and critical regions. These alternative formulations should be equivalent, but it is necessary to ensure that the test statistic is calculated in a way that is consistent with how the critical values were tabulated.

ANDERSON-DARLING TEST

The Anderson-Darling test is used to test whether a sample of data came from a population with a specific distribution. It is a modification of the Kolmogorov-Smirnov (KS) test and gives more weight to the tails than the KS test does. The KS test is distribution free in the sense that the critical values do not depend on the specific distribution being tested. The Anderson-Darling test makes use of the specific distribution in calculating critical values. This has the advantage of allowing a more sensitive test and the disadvantage that critical values must be calculated for each distribution.

Definition

The Anderson-Darling test is defined as:
H0: The data follow a specified distribution.
H1: The data do not follow the specified distribution.

Test Statistic: The Anderson-Darling test statistic is defined as:

$A^2 = -\mathrm{Sum}/n - n$

where Sum is the sum of the $(2i-1)[\ln P_i + \ln(1 - P_{n+1-i})]$ column and n is the sample size.

The adjusted test statistic, designated A*, is computed as follows:

$A^* = A^2 (1 + 0.75/n + 2.25/n^2)$

This is the value that is compared against the critical value.

Significance Level: α

Critical Region: The critical values for the Anderson-Darling test are dependent on the specific distribution that is being tested. Tabulated values and formulas have been published (Stephens, 1974, 1976, 1977, 1979) for a few specific distributions (normal, lognormal, exponential, Weibull, logistic, extreme

value type 1). The test is a one-sided test, and the hypothesis that the distribution is of a specific form is rejected if the test statistic is greater than the critical value.

Note that for a given distribution, the Anderson-Darling statistic may be multiplied by a constant (which usually depends on the sample size, n). These constants are given in the various papers by Stephens. The result is the adjusted Anderson-Darling statistic, A*, and it is what should be compared against the critical values. Also, be aware that different constants (and therefore critical values) have been published. You need to know which constant was used for a given set of critical values (the needed constant is typically given with the critical values).

EXAMPLES

CHI-SQUARE TEST EXAMPLE

You have been presented with a set of 25 data points that represent the weights in pounds of missile warheads that have been installed on a number of different kinds of aircraft. The government is interested in determining whether these weights can be considered to be normally distributed with a mean of 100 lbs. and a standard deviation of 5 lbs. Table 1 provides the raw data with the values ranked from low to high.

Table 1: Sample Data (WEIGHTS, lbs.)

In order to perform the Chi-Square test, the data must be tabulated into bins to form the histogram. The question is: how many bins should be used? There is no optimal choice for the bin width (since the optimal bin width depends on the distribution). Most reasonable choices should produce similar, but not identical, results. Sturges' Rule is commonly used to determine a reasonable number of bins for a given sample size. The formula is:

$k = 1 + 3.3 \log_{10}(n)$

where k is the number of bins and n is the sample size. Once k is determined, the range of the data (the maximum value minus the minimum value) can be divided by k to get an approximate bin width.

For this problem, Sturges' Rule yields the following:

$k = 1 + 3.3 \log_{10}(25) = 5.61$, which rounds up to 6 bins or cells.

The range for this data set is computed to be:

R = Max value − Min value = 34.4

Dividing R by 6 yields a cell width of approximately 6 lbs. Table 2 shows the data in tabular form. Figure 1 provides the histogram or bar chart.

Table 2: Tabular or Binned Data (columns: LOWER BOUND, UPPER BOUND, FREQ (f); last row: TOTAL)

Figure 1: Data Histogram

Your job is to perform a statistical hypothesis test on this data to determine whether it fits the stipulated distribution. You are directed to use the Chi-Square Goodness of Fit test.
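
The bin-count arithmetic above can be checked with a short Python sketch. It assumes Sturges' Rule in the form $k = 1 + 3.3 \log_{10}(n)$, which reproduces the 5.61 quoted above; the function name is a hypothetical illustration.

```python
import math


def sturges_bins(n):
    """Return (raw k, k rounded up) for k = 1 + 3.3 * log10(n)."""
    k_raw = 1 + 3.3 * math.log10(n)
    return k_raw, math.ceil(k_raw)


k_raw, k = sturges_bins(25)   # sample size from the example
data_range = 34.4             # range R quoted in the example
bin_width = data_range / k    # approximate cell width before rounding

print(round(k_raw, 2), k, round(bin_width, 2))
```

Rounding the resulting width up to a convenient whole number gives the cell width of about 6 lbs used in the example.
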

1. Establish the Null Hypothesis and the Alternative Hypothesis (what you are trying to prove or disprove).

H0: N(100, 5)

This is a Normal distribution with a mean of 100 lbs and a standard deviation of 5 lbs.

The Alternative Hypothesis, designated H1, is the position you adopt if H0 is rejected.

H1: Not N(100, 5)

2. Set the level of significance. For this test we will set α = 0.05.

3. Perform the calculations. For this test we will use the Chi-Square distribution. The test statistic is given by:

$\chi^2 = \sum_{i=1}^{k} (O_i - E_i)^2 / E_i$

where:
$O_i$ = observed frequency
$E_i$ = theoretical expected frequency

So, as you can see, it will be necessary to compute the $E_i$. Let's use the spreadsheet (Table 3) to walk through the steps.

Table 3: Chi Square Example Calculation Table (columns: LOWER BOUND, UPPER BOUND, FREQ $O_i$, LL, UL, Z, AREA, CELL AREA, E, TO GET TO 5 IN CELL, (O-E)^2/E)

The columns labeled LOWER BOUND and UPPER BOUND are taken straight from the binned data presented to you. The column labeled FREQ contains the observed frequencies that were also given to you. These numbers represent the $O_i$. The columns labeled LL and UL contain the values in the LOWER BOUND and UPPER BOUND columns. Note that until the bin or cell that contains the hypothesized mean (100) is reached, only LL values are entered. Entries after the mean cell is reached are entered in the UL column.

Z represents the Standard Normal deviate and is computed as follows:

Z = (LL − 100)/5 for the rows that contain LL values, and
Z = (UL − 100)/5 for the rows that contain UL values.

For example, for the first LL value of 77 the computation is:

Z = (77 − 100)/5 = −4.6

as shown. Other values are computed likewise.

The column labeled AREA represents the area under the Standard Normal distribution curve between the center of the distribution and the point at which Z is plotted. Figure 2 depicts the AREA for one Z value.

Figure 2: Normal Distribution Curve

The column labeled CELL AREA represents the area in each cell of the distribution. For example, the CELL AREA associated with Z = −2.2 is the difference between the area from the center of the distribution to Z = −2.2 (0.4861) and the area from the center of the distribution to Z = −1.0 (0.3413). This is depicted in Figure 3.
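
The Z, AREA and CELL AREA columns described above can be reproduced with Python's standard library. This is a sketch under stated assumptions: the cell bounds of 89 and 95 lbs are inferred from the first lower bound of 77 and a 6-lb cell width, and the helper name is hypothetical.

```python
from statistics import NormalDist

MEAN, SD = 100.0, 5.0          # hypothesized N(100, 5)
std_normal = NormalDist()      # standard normal: mean 0, standard deviation 1


def area_from_center(bound):
    """Return (Z, area between the center of the curve and Z)."""
    z = (bound - MEAN) / SD
    return z, abs(std_normal.cdf(z) - 0.5)


z_first, _ = area_from_center(77.0)      # first LL value in the example
z_lo, area_lo = area_from_center(89.0)   # assumed lower bound of the cell
z_hi, area_hi = area_from_center(95.0)   # assumed upper bound of the cell

# Both bounds lie below the mean, so the cell area is the difference of the two areas.
cell_area = area_lo - area_hi
expected_frequency = cell_area * 25      # E_i = CELL AREA x sample size

print(z_first, round(area_lo, 4), round(area_hi, 4), round(cell_area, 4))
```

For a cell that straddles the mean, the two areas would be added rather than subtracted, since they lie on opposite sides of the center.
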

Figure 3: Distribution of Areas

The other values are computed in the same manner.

The values in the column labeled EXPECTED FREQUENCY are derived by multiplying each CELL AREA value by the total sample size of 25. All the values are computed in the same manner. These are the $E_i$ values.

The column labeled $(O_i - E_i)^2 / E_i$ contains values computed exactly as this formula states. However, recall that each $E_i$ must be at least 5 for the Chi-Square test. Since the first three cells contain $E_i$ values that are less than 5, and their sum is also less than 5, the first four cells must be combined. Likewise, the last two cells in the Table contain $E_i$ values that are less than 5, and their sum is less than 5, so the last three cells in the Table must also be combined.

Once the requirement of 5 in each cell is satisfied by combining adjacent cells, the $(O_i - E_i)^2 / E_i$ can be computed. The value for the first combined cell is computed from the combined totals as $(O - E)^2 / E$.

The value for the second combined cell is computed in the same way. Note that in both of these computations, the $O_i$ corresponding to the combined $E_i$ cells also needed to be combined. The estimated Chi-Square statistic is the sum of the two values: $\chi^2 = \sum (O_i - E_i)^2 / E_i$.

4. Evaluate the results. Based on a comparison of the computed result with the critical value, draw a conclusion about the test: either reject H0 and accept H1, or fail to reject H0.

As previously mentioned, the Chi-Square distribution has an associated parameter called the degrees of freedom, denoted df. For this kind of Goodness of Fit test, the df are computed by counting the number of cells, subtracting one from that total, and then subtracting the number of population parameters that needed to be estimated from the data. For this problem, df = 1 because although the data were put into seven cells, only two cells remained after the cells were combined to satisfy the rule of at least 5 in each cell. No population parameters were estimated.

The df and α are the two values that you need to look up the critical value for a Chi-Square Goodness of Fit test. For this problem we chose α = 0.05 and we have 1 df, so the critical value is 3.841, which we looked up in a table of critical values for the Chi-Square distribution. These tables are contained in most standard statistical textbooks.

5. Make a decision. Since our computed Chi-Square value is less than the critical value of 3.841, there is not enough evidence from the sample data to refute the assertion (hypothesis) that the data came from a N(100, 5). Therefore, we fail to reject H0. Figure 4 depicts this result.
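
The critical-value lookup and decision in steps 4 and 5 can be sketched in Python without a printed table. The sketch relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so it applies only to df = 1; the function names and the test-statistic values passed to `decide` are hypothetical.

```python
from statistics import NormalDist


def chi_square_critical_df1(alpha):
    """Right-tail critical value of the chi-square distribution with 1 df."""
    # P(Z^2 > c) = alpha  <=>  c = (z at cumulative probability 1 - alpha/2)^2
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z ** 2


def decide(test_statistic, critical_value):
    """Right-tail decision rule for a goodness-of-fit test."""
    if test_statistic > critical_value:
        return "reject H0"
    return "fail to reject H0"


critical = chi_square_critical_df1(0.05)
print(round(critical, 3))
print(decide(1.0, critical))   # hypothetical statistic below the critical value
print(decide(5.0, critical))   # hypothetical statistic above the critical value
```

For other degrees of freedom a table or a statistics library would be needed; the shortcut here is a convenience for the two-cell case in this example.
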

Figure 4: Critical Area for Chi-Square Test

KOLMOGOROV-SMIRNOV TEST EXAMPLE

Table 4 contains the same data that was used for the Chi-Square example. Note that this table does not include any Chi-Square calculations, but does include the computation of relative frequencies and their differences.

Table 4: K-S Test Computational Spreadsheet (columns: LOWER BOUND, UPPER BOUND, FREQ ($O_i$), RELATIVE FREQUENCY ($O_i$), EXPECTED FREQUENCY ($E_i$), RELATIVE FREQUENCY ($E_i$), $F_O$, $F_E$, D)

Note that the highlighted number in the D column is the maximum absolute difference between the observed cumulative relative frequencies and the expected cumulative relative frequencies.

Staying consistent with the Chi-Square example, we will use a significance level of 0.05 for this test also. Critical values for D are found in most statistics texts; the relevant entry is the one for a sample size of 25 at significance level 0.05. Since the maximum D from the table above is less than that critical value, we fail to reject the hypothesis that this data came from a N(100, 5) distribution. This conclusion is consistent with the results under the Chi-Square test.

ANDERSON-DARLING TEST EXAMPLE

Table 5 contains the same data that was used for the Chi-Square and the K-S examples. Unlike the Chi-Square test, the Anderson-Darling Test does not need the data to be binned, so the example which follows shows how to do the Anderson-Darling Test on raw data. The Table summarizes the computational results. The calculations associated with each column are presented in subsequent paragraphs. Recall that the Null Hypothesis is N(100, 5).

Table 5: Anderson-Darling Test Computational Spreadsheet (columns: i, WEIGHTS (lbs.), NORMAL, P, 1-P, LN P, LN(1-P), $(2i-1)[\ln P_i + \ln(1 - P_{n+1-i})]$; last row: SUM)

The column labeled i simply contains a count of each data point.

The column labeled WEIGHTS (lbs.) contains the raw data.

In order to perform the Anderson-Darling test, it is necessary to compute the Average and the Standard Deviation of the raw data. For this data set the Standard Deviation is 8.247.

The column labeled NORMAL is the Standard Normal deviate (Z), computed as follows:

Z = (Data Point − Average)/Standard Deviation

So the value in the first row results from:

Z = (first data point − Average)/8.247

All the other NORMAL values are computed in the same manner.

The column labeled P is the cumulative area in the Standard Normal Distribution that is associated with the Z value.

The column labeled 1-P is self-explanatory.

The columns labeled LN P and LN(1-P) contain the natural logarithms of the P and the 1-P values.

The column labeled $(2i-1)[\ln P_i + \ln(1 - P_{n+1-i})]$ contains the computation as indicated. For example, with n = 25 the value in the first row is

$(2(1)-1)[\ln P_1 + \ln(1 - P_{25})]$

Likewise, the value in the second row is

$(2(2)-1)[\ln P_2 + \ln(1 - P_{24})]$

All subsequent values are derived in the same manner.

Once all of the $(2i-1)[\ln P_i + \ln(1 - P_{n+1-i})]$ values are computed, that column is summed, resulting in the SUM shown in the Table.

The next step is to compute a value designated as $A^2$ as follows:

$A^2 = -\mathrm{Sum}/n - n$

where Sum is the sum of the $(2i-1)[\ln P_i + \ln(1 - P_{n+1-i})]$ column and n is the sample size. For this example, $A^2$ is computed as follows:

$A^2 = -\mathrm{Sum}/25 - 25$

Finally, the adjusted test statistic, designated A*, is computed as follows:

$A^* = A^2 (1 + 0.75/n + 2.25/n^2)$

For this example:

$A^* = A^2 (1 + 0.75/25 + 2.25/625)$

The critical value for the Anderson-Darling Test depends on the distribution being tested under the Null Hypothesis. For testing the Normal Distribution, the critical values for a range of significance levels are shown in Table 6.
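
The column-by-column calculation above can be collected into one short Python sketch. It follows the worked example in estimating the average and standard deviation from the raw data; the use of the sample (n − 1) standard deviation is an assumption, and the function name and the weights are hypothetical, not the course data.

```python
import math
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(data):
    """Return (A^2, A*) for a normality test with estimated mean and SD."""
    x = sorted(data)                          # the test uses the ranked data
    n = len(x)
    fitted = NormalDist(mean(x), stdev(x))    # assumption: sample (n - 1) standard deviation
    p = [fitted.cdf(v) for v in x]            # column P: cumulative area for each Z
    # Column (2i-1)[ln P_i + ln(1 - P_{n+1-i})], summed over i = 1..n.
    total = sum(
        (2 * i - 1) * (math.log(p[i - 1]) + math.log(1 - p[n - i]))
        for i in range(1, n + 1)
    )
    a2 = -total / n - n
    a_star = a2 * (1 + 0.75 / n + 2.25 / n ** 2)
    return a2, a_star


# Hypothetical weights in lbs (illustration only).
weights = [88.1, 92.4, 95.0, 97.3, 99.8, 101.2, 103.5, 106.9, 110.4, 115.7]
a2, a_star = anderson_darling_normal(weights)
print(round(a2, 4), round(a_star, 4))
```

The adjusted value A* is the one compared against the tabled critical value for the chosen significance level; the hypothesis is rejected only if A* exceeds it.
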

Table 6: Critical Values for the A-D Test When Testing for a Normal Distribution (columns: Significance Level, Critical Value)

Since the computed A* is less than the tabled critical value of 0.752 (the value for significance level 0.05), there is not enough evidence to reject the Null Hypothesis of N(100, 5). This conclusion is consistent with the findings under the Chi-Square Test and the K-S Test.

SUMMARY

Goodness of Fit (GOF) tests provide guidance for evaluating the suitability of a potential model. There is no single correct distribution choice from GOF testing; don't be locked into the numbers of the test results.

GOF tests do not provide a true probability measure for the data actually coming from the fitted distribution. They provide a probability that random data from the fitted distribution would have produced a GOF statistic value as low as that calculated for the observed data. The most intuitive measure is a visual comparison of the probability distributions.

Chi-Square:
o Used for continuous and discrete data.
o The test is sensitive to the choice of bins.
o No optimal choice for bin width (distribution-dependent).
o Not valid for small sample sizes (one rule of thumb states that N > 50).
o For bin counts < 5 (expected frequency), may need to combine bins.
o It is sensitive to large errors (it uses a sum of squared errors).
o Most commonly used GOF test.

Kolmogorov-Smirnov (K-S):
o Used with continuous distributions.
o Tends to be more sensitive near the center of the distribution than at the tails.
o Avoids the problem of determining bins; some believe this makes it more useful than Chi-Square.

o Value determined by the largest distance between the observed and fitted distributions, so it does not take into account lack of fit across the rest of the distribution.

Anderson-Darling (A-D):
o A sophisticated version of the K-S test.
o Used with continuous distributions.
o Only available for a few specific distributions (critical values computed).
o Gives more weight to the tails than the K-S.
o Vertical distances are integrated over all values of X to make maximum use of the observed data.
o Generally more useful than K-S, especially when equal emphasis on body and tails is desired.