DOEACC Society Ministry of C&IT, Govt. of India. PSIT, Kanpur, U.P., India

Transcription

1 Necessity of Goodness of Fit Tests in Research and Development 1 Sanjeev Kumar Jha, 2 Dr. A.K.D.Dwivedi, 3 Dr. Amod Tiwari 1,2 DOEACC Society Ministry of C&IT, Govt. of India 3 PSIT, Kanpur, U.P., India Abstract There are various life data distributions which can be used for reliability and other statistical analysis. The main problem in these types of analysis is identification of distributions to be fitted for collected data. Thus it is essential to identify appropriate distribution to be fitted for given data. Goodness of fit Test is a technique which is used for identification of appropriate distribution to be fitted for given data. In this paper Goodness of fit Test is discussed in detail. Initially theoretical background of this test is explained and then entire test is applied on live data collected from online bug repositories for Linux kernel 2.6 series. As there are different sub versions of Linux Kernel, thus these records are further grouped on the basis of sub versions and different samples are constructed. On each of these samples all three important methods for goodness of fit test, and Chi-squared Tests are used and result is analyzed. On the basis of these results necessity of goodness of fit test is explained. Keywords Goodness of Test, Open Source, Reliability, Linux Kernel, Distribution. I. Introduction The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question. Such measures can be used in statistical hypothesis testing, e.g. to test for normality of residuals, to test whether two samples are drawn from identical distributions or whether outcome frequencies follow a specified distribution. There are various methods used for goodness of fit test. Most important among them are as given below: A. Test [1] This test is used to decide if a sample comes from a hypothesized continuous distribution. It is based on the empirical cumulative distribution function (ECDF). Assume that we have a random sample x 1,.,x n from some distribution with CDF F(x). The empirical CDF is denoted by 1. Definition The statistic (D) is based on the largest vertical difference between the theoretical and the empirical cumulative distribution function: 2. Hypothesis Testing [4] The null and the alternative hypotheses are: (1) (2) H0: the data follow the specified distribution. HA: the data do not follow the specified distribution. The hypothesis regarding the distributional form is rejected at the chosen significance level (α) if the test statistic, D, is greater than the critical value obtained from a table. The fixed values of α (0.01, 0.05 etc.) are generally used to evaluate the null hypothesis (H0) at various significance levels. A value of 0.05 is typically used for most applications, however, in some critical industries; a lower (α) value may be applied. The standard tables of critical values used for this test are only valid when testing whether a data set is from a completely specified distribution. If one or more distribution parameters are estimated, the results will be conservative: the actual significance level will be smaller than that given by the standard tables and the probability that the fit will be rejected in error will be lower. 3. The P-value, in contrast to fixed α values, is calculated based on the test statistic, and denotes the threshold value of the significance level in the sense that the null hypothesis (H0) will be accepted for all values of α less than the P-value. For example, if P=0.025, the null hypothesis will be accepted at all significance levels less than P (i.e and 0.02), and rejected at higher levels, including 0.05 and 0.1. The P-value can be useful in particular, when the null hypothesis is rejected at all predefined significance levels, and you need to know at which level it could be accepted. B. Test [3] The procedure is a general test to compare the fit of an observed cumulative distribution function to an expected cumulative distribution function. This test gives more weight to the tails than the test. 1. Definition The statistic (A2) is defined as 2. Hypothesis Testing The null and the alternative hypotheses are: H0: the data follow the specified distribution. H A : the data do not follow the specified distribution. The hypothesis regarding the distributional form is rejected at the chosen significance level (α) if the test statistic, A 2, is greater than the critical value obtained from a table. The fixed values of α (0.01, 0.05 etc.) are generally used to evaluate the null hypothesis (H0) at various significance levels. A value of 0.05 is typically used for most applications however; in some critical industries a lower α value may be applied. In general, critical values of the Anderson- Darling test statistic depend on the specific distribution being tested. However, tables of critical values for many distributions (except several the most widely used ones) are not easy to find. The test implemented in Easy Fit uses the same critical values for all distributions. These values are calculated (3) International Journal of Computer Science and Technology 135

2 using the approximation formula, and depend on the sample size only. This kind of test (compared to the original A-D test) is less likely to reject the good fit, and can be successfully used to compare the goodness of fit of several fitted distributions. C. Test [2] The test is used to determine if a sample comes from a population with a specific distribution. This test is applied to binned data, so the value of the test statistic depends on how the data is binned. This test is used for continuous sample data only. Although there is no optimal choice for the number of bins (k), there are several formulas which can be used to calculate this number based on the sample size (N). For example, Easy Fit employs the following empirical formula: k=1+log 2 N. The data can be grouped into intervals of equal probability or equal width. Each bin should contain at least 5 or more data points, so certain adjacent bins sometimes need to be joined together for this condition to be satisfied. 1. Definition The statistic is defined as: (4) where O i is the observed frequency for bin i, and E i is the expected frequency for bin i calculated by E i = F( x2 )-F( x1 ),where F is the CDF[6] of the probability distribution being tested, and x 1, x 2 are the limits for bin i. 2. Hypothesis Testing The null and the alternative hypotheses are: H 0 : the data follow the specified distribution. H A : the data do not follow the specified distribution. The hypothesis regarding the distributional form is rejected at the chosen significance level (α) if the test statistic is greater than the critical value defined as meaning the inverse CDF with k-1 degrees of freedom and a significance level of α. Though the number of degrees of freedom can be calculated as k-c-1 (where c is the number of estimated parameters), Easy Fit calculates it as k-1 since this kind of test is least likely to reject the fit in error. The fixed values of α (0.01, 0.05 etc.) are generally used to evaluate the null hypothesis (H 0 ) at various significance levels. A value of 0.05 is typically used for most applications, however, in some critical industries; a lower α value may be applied. 3. The P-value [6], in contrast to fixed α values, is calculated based on the test statistic, and denotes the threshold value of the significance level in the sense that the null hypothesis (H 0 ) will be accepted for all values of α less than the P-value. For example, if P=0.025, the null hypothesis will be accepted at all significance levels less than P (i.e and 0.02), and rejected at higher levels, including 0.05 and 0.1. The P-value can be useful; in particular, when the null hypothesis is rejected at all predefined significance levels, and you need to know at which level it could be accepted. Easy Fit displays the P-values based on the test statistics (χ 2 ) calculated for each fitted distribution. In this research goodness of fit tests are applied on failure data of Linux Kernel 2.6 series collected from various bug repositories. For statistical calculations Easy Fit which can be easily incorporated with Microsoft Excel 2007 is used. 136 International Journal of Computer Science and Technology II. Related Work The concept of goodness of fit is used for identification of statistical model for different analysis. In research area this concept is used in almost all the fields directly or indirectly. Here we have studied several papers and discussed the situations where this test is used. In [3] authors have used test to identify whether Weibull Distribution is suitable for collected sample or not. In [4] Test is used to identify whether the selected distribution is suitable for sample or not. In [7] Bayesian approach for software reliability is discussed. In this paper authors have used Test to identify the best distributions. In [8] user oriented reliability model is derived and used. In [17] authors have used Weibull distribution for construction of reliability model. A. Problem Identification After a detail study of research papers, articles and books related to reliability and other statistical analysis it has been found that in maximum of researches either a single goodness of fit test is applied or a particular distribution is selected randomly. Very little importance is given to Goodness of fit test due to which sometimes researchers got unexpected results. The main problems in all these researches are: If a distribution is accepted by using a particular goodness of fit test will it be accepted by all other tests? Can we select a particular distribution randomly for fitting on a given sample and hence for statistical analysis? Here in this research to answer above questions Goodness of fit test is applied on collected samples and result is analyzed. III. Research Methodology used To answer above questions we have collected failure data of Linux kernel 2.6 series from online bug repositories Bugzilla [18]. Total of 2821 records were collected. Initially these data were in raw format. During initial analysis phase all irrelevant records were deleted and finally we have selected 1600 records. The preprocessed data is stored in Mysql Table. The structure of table is as given below: Field Data Type Purpose Bug_Id Opendate Bug_ Severity Int Date Varchar Identification Number (Primary Key) Date of submission of the bug How severe is this bug Product Varchar Module of the kernel Component Varchar Sub module of the kernel cf_kernel_ version ISSN : (Print) ISSN : (Online) Varchar Sub Version of the kernel It is clear from above that there are different sub versions of Linux kernel 2.6 series with different release date. Hence minimum value of Open date for each of the sub versions will be different. Thus on the basis of kernel sub versions, the entire sample is subdivided into different sub samples and then goodness of fit test is applied on each of these sub samples. Separate samples are constructed for each of the kernel subversions from to Here in this paper to discuss the necessity of goodness of fit only versions is considered and discussed in detail. Time to Failure Data

3 for each of these two samples is calculated and goodness of fit test is applied on these data. For mathematical calculation Easy fit 5.5 software (which is very light weight software) with Microsoft Excel 2007 is used. Goodness of fit Test for both the samples are discussed here. A. Sub Version [Release Date: 24th January 2008] This sub version of the Linux Kernel was released on 24th January 2008 [19]. For Linux kernel series we have total of preprocessed bugs. Time to Failure Data in terms of week is as given below: Table 1 : TTFDATA Goodness of Fit Test is applied for collected sample data mentioned in Table 1. For goodness of fit test all important life data distributions are tested by using all three tests: Kolmogorov Smirnov Test, Anderson Darling Test and Chi Square Test. All above tests are applied on 1%, 2%, 5%, 10% and 20% level of significance. The detail of the test is elaborated in Table2. Table 2: Goodness of Fit Detail Kernel Fatigue Life Distribution Critical Value Deg. of freedom Fatigue Life (3P) Frechet Deg. of freedom E-5 Critical Value Frechet (3P) Deg. of freedom International Journal of Computer Science and Technology 137

4 Gamma E-5 Gamma (3P) Gen. Gamma E-4 Gen. Gamma (4P) Reject? Yes Yes Yes No No Gumbel Max Deg. of freedom E-4 Critical Value Gumbel Min ISSN : (Print) ISSN : (Online) E International Journal of Computer Science and Technology

5 Deg. of freedom E-6 Critical Value Kumaraswamy E Log-Logistic Log-Logistic (3P) Logistic E Lognormal Lognormal (3P) International Journal of Computer Science and Technology 139

6 Normal E Deg. of freedom Critical Value Reject? Yes Yes Yes Yes No Weibull Reject? Yes Yes Yes No No Weibull (3P) 140 International Journal of Computer Science and Technology Rank ISSN : (Print) ISSN : (Online) IV Conclusions After analyzing the results of Table 2 it has been found that there are various distributions available which are accepted by using one test while rejected by other test. Let us take the case of Gen. Gamma Distribution with 4 parameters, this distribution is rejected by Test at all level of significance but accepted by having at 1% and 2% level of significance as well as by Chi square test having P value at all level of significance. Thus here it is clear that merely acceptance by a single test does not guarantee its acceptance by all other tests. Similarly Weibull distribution with 3 parameters which is one of widely used life data distribution is rejected by Anderson darling but it is accepted by Chi Square Test. Thus without proper goodness of fit test one should not decide about the distribution to be fitted. We should consider those distributions for fitting which is accepted by maximum number of tests at maximum level of significances. Here for this version of the kernel Frechet distribution with 3 parameters is considered as best fit. Thus it indicates necessity of goodness of fit test in every research. If it is not considered properly normally we get unexpected results. Thus we must use goodness of fit test properly for decision about distribution to be fitted. This paper will be useful for researchers from all the fields. References [1] " Test", [Online] Available : en.wikipedia.org/wiki/kolmogorovsmirnov_test. [2] "General Concept of Goodness of Fit". [Online] Available : htm. [3], "A Goodness of Fit Test for Small Sample", [Online] Available : pdf/a_dtest.pdf [4] J.L., RAC START, " GOF Test Romeu: A Publication of the Reliability Centre", Volume 10 Number 6. [5] On line Manual: Easy Fit 5.5 [6] Rohatagi, V.K., "An Introduction to Probability Theory and Mathematical s", Willey. [7] Littlewood, Verrall, A Bayesian Reliability Growth Model for Computer Software.

7 [8] R.C. Cheung, A user-oriented software reliability model, IEEE Transactions on Software Engineering, vol. 6, no. 2 [9] J.De.S Coutinho, Software reliability growth. IEEE Symposium on Computer Software Reliability, [10] A.L. Goel, K. Okumoto, A time-dependent error-detection rate model for software reliability and other performance measure, IEEE Transactions on Reliability, vol. R-28, [11] S.S. Gokhale, M.R. Lyu, K.S. Trivedi, Reliability simulation of component-based software systems, Proceedings of 9th Int l Symposium on Software Reliability Engineering, [12] IEEE Reliability Society, IEEE recommended practice on software reliability, IEEE Std , June [13] Littlewood, J.L. Verrall, A bayesian reliability model with a stochastically monotone failure rate, IEEE Transactions on Reliability, vol. R-23, June [14] A. Mockus, T.R. Fielding, J.D. Herbsleb, Two case studies of open source software development: Apache and Mozilla, ACM Transactions on Software Engineering and Methodology, vol. 11, no. 3, July [15] T.R. Moss, The Reliability Data Handbook, ASME Press, [16] J.D. Musa, K. Okumoto, A logarithmic poisson execution time model for software reliability measurement, 7th Int l Conference on Software Engineering (ICSE), [17] Cobra Rahmani, Harvey Siy, Azad Azadmanesh An experimental Analysis of open Source Software Reliability. [18] "On line bug repositories Bugzilla", [Online] Available : [19] Linux Kernel, [Online] Available : org. Sanjeev Kumar Jha, Senior Systems Analyst, DOEACC Society Chandigarh B.O: Lucknow, Ministry of CO&IT Government of India. Bachelors and Masters Degree in s [Honors] from Patna University and presently doing his PhD in Computer Science from Singhania University. He has attended 2 international Conferences. Area of research interest is Reliability and Open Source Software Dr. A.K.D. Dwivedi, Director, DOEACC Society Chandigarh, Ministry of C&IT Government of India. M.Tech from Allahabad University and PhD from Gorakhpur University. Attended many international and national Conferences. Published many papers in international and national journals. Executed many government projects successfully. Area of research interest is Signal Processing, Reliability and Open Source Software. Dr. Amod Tiwari, Associate Professor, PSIT Kanpur. PhD from IIT Kanpur. Attended many international and national Conferences. Published many papers in international and national journals. Area of research interest is Image Processing, Reliability and Open Source Software. International Journal of Computer Science and Technology 141