Exploratory Failure Analysis of Open Source Software

Cobra Rahmani, Satish M. Srinivasan, Azad Azadmanesh
College of Information Science & Technology
University of Nebraska-Omaha, Omaha, U.S.
{crahmani, smsrinivasan,

Abstract- Reliability growth modeling of software systems plays an important role in measuring and controlling software quality during software development. One main approach to reliability growth modeling is based on the statistical correlation of observed failure intensities versus estimated ones by the use of statistical models. Although there are a number of statistical models in the literature, this research concentrates on the following seven models: Weibull, Gamma, S-curve, Exponential, Lognormal, Cubic, and Schneidewind. The failure data are collected from five popular open source software (OSS) products. The objective is to determine which of the seven models best fits the failure data of the selected OSS products, and which best predicts the future failure pattern based on partial failure history. The outcome reveals that the best model fitting the failure data is not necessarily the best predictor model.

Keywords- empirical software engineering; goodness-of-fit; open source software; software reliability growth modeling

I. INTRODUCTION

Over the past years, open source software (OSS) has drawn increasing attention from both the business and academic worlds. The leading concept of open source presented by Raymond [17] differentiates the collaborative open source approach from traditional in-house and proprietary software development. The success of OSS can be attributed to collaboration with volunteers across organizations and geographical boundaries, faster development due to a surge in the number of developers, and platform independence due to its development environment [6].
In this research, five OSS products, namely Eclipse V.2, Apache HTTP Server 2, Firefox, MPlayer OS X, and ClamWin Free Antivirus, are considered for estimating their reliability and predicting their failure process. These projects are chosen for the popularity of the products in terms of number of users and downloads, the time span over which the products have been in operation, and the availability of bug reports. The failure data collection concentrated on two popular archive sources, Bugzilla [5] and Sourceforge.net [19]. Failure intensities are collected for each product in biweekly intervals. The failure information for MPlayer OS X and ClamWin Free Antivirus is obtained from Sourceforge.net. For Firefox, Eclipse V.2, and Apache 2, the failure information is gathered from Bugzilla. Table I gives more information about the OSS products used in this study. The table lists the official release year and the duration of the collected failure data for each product.

Table I. INFORMATION OF COLLECTED FAILURE FOR FIVE OSS
Product, official release year, start date and end date of collected failures
Firefox 999 3/999 2 /26
Eclipse V.2 2 /2 3 2/27
Apache /22 2/28
ClamWin Free Antivirus 24 3/24 8/28
MPlayer 22 9/22 6/26

This research is funded in part by Department of Defense (DoD)/Air Force Office of Scientific Research (AFOSR), NSF Award Number FA , under the title High Assurance Software.

The study compares seven distribution models in order to determine whether any of them performs consistently with respect to both goodness-of-fit and reliability prediction for the selected OSS products. The study attempts to shed light on the probable reasons if this consistency cannot be observed. The distribution models are Cubic, Exponential, Gamma, Lognormal, Schneidewind, S-curve, and Weibull [3,8,9,3,4,8,2, 22].
These models are chosen for a combination of reasons, such as their capability to provide various distribution shapes or their potential to create distributions that follow the failure patterns of software systems.

The rest of the paper is organized as follows. Section II provides some definitions and background information about the aforementioned distribution models. Section III focuses on failure data analysis and the reliability modeling process. Section IV concludes the paper with a summary.

II. BACKGROUND

Software reliability growth models (SRGM) have been in existence for approximately 40 years, with the intent of creating models that can accurately quantify software quality. Other distribution models, developed over the years for purposes other than software reliability, are at times also used in the software quality field. The hope is that, once an appropriate model is chosen, its parameters can reflect the software behavior at one or more software phases such as development, testing, and operation [8,,4]. In [8], a number of models are discussed with information that can help users and practitioners decide on a model and assess the reliability of software.

2 The failure data collected prior to the official release date of Firefox are obtained from Mozilla bug reports.
3 This date is prior to the official release date.

Lakey [10] provides a number of SRGMs and a flowchart approach for deciding on a model. However, because of the many interacting factors, no single model can be trusted to perform well universally in estimating reliability or predicting the expected number of remaining defects [,4,,2]. To deal with the inaccurate predictions made by SRGMs, some authors have proposed recalibrating the models. That is, the errors in earlier predictions are used to transform the model into a more accurate prediction model [,4].

In [7], the authors have used non-linear regression to analyze defect data obtained from testing three releases of a commercial system. They have compared four traditional reliability models in order to select the one that best fits the customer defect data as testing progresses. Their study uses the method presented in [2], but the case study is limited to just one commercial software product. In [23], the authors have compared several SRGMs on one set of OSS failure data and concluded that the logarithmic Poisson execution time model fits the actual data set better than the other SRGMs. Their work concentrates mainly on goodness-of-fit, without any assessment of the prediction capability of these models. In [24], the author compares six different SRGMs on four data sets taken from previous research. Eighty percent of the failure data is used to estimate the goodness-of-fit of those models, and the other twenty percent is used to validate the prediction capability of the models. The outcomes have shown differences between the best fitting and the best predicting models. A similar approach is used in [], in that some observations from the end of the failure data are removed for the purpose of comparing prediction performance between two SRGMs.
In this study, the Probability Density Functions (PDF) of the chosen distribution models are used to model the failure patterns of the selected products. The PDF, denoted as f(t), shows the relative concentration of failures at different points of time t. The following gives a brief introduction to the seven distribution models considered in this study.

Weibull
The PDF of the Weibull function is:

$$f(t) = \frac{\beta}{\alpha}\left(\frac{t}{\alpha}\right)^{\beta-1} e^{-(t/\alpha)^{\beta}}$$

where α and β represent the scale and shape of the distribution model. The shape value determines the shape of the graph, and the effect of the scale parameter is to squeeze or stretch the graph.

S-curve
There are different s-shaped distribution models. The one adopted in this study is used by SPSS [20] and has the following PDF:

$$f(t) = e^{b_0 + b_1/t}$$

where $b_0$ is a constant and $b_1$ is the regression coefficient. The sign of $b_1$ determines the direction of the slope: since $f'(t) = -(b_1/t^2)\,f(t)$, the graph rises with t when $b_1$ is negative and falls when $b_1$ is positive.

Lognormal
The lognormal model assumes that the natural logarithm of time to failure is normally distributed. The parameters µ and σ are the mean and standard deviation of the natural logarithm of time to failure, respectively. The PDF of the lognormal distribution is given by:

$$f(t) = \frac{1}{t\,\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln t - \mu)^2}{2\sigma^2}\right)$$

Furthermore, σ and µ determine the shape and scale of the distribution model.

Schneidewind
This model assumes that the cumulative number of failures is a Non-Homogeneous Poisson Process (NHPP) [2,4], which was originally studied in hardware reliability. NHPP models assume that the failure process varies with time and that the cumulative number of failures up to time t is Poisson distributed with the parameter m(t), the mean value of failures. Specifically,

$$P(M(t) = n) = \frac{[m(t)]^{n}}{n!}\, e^{-m(t)}$$

where M(t) and m(t) are the total and the expected number of failures in the interval [0, t], and n is an integer. The mean value of the distribution model is:

$$m(t) = \frac{\alpha}{\beta}\left(1 - e^{-\beta t}\right)$$

where α and β are the initial failure rate and the negative derivative of the failure rate, respectively. Therefore, the expected number of failures during a period $(t_1, t_2]$ is $m(t_2) - m(t_1)$.
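The NHPP relations above can be made concrete with a small numeric sketch. It assumes the commonly cited Schneidewind mean value function m(t) = (α/β)(1 − e^{−βt}); the function names and the parameter values are hypothetical and chosen only for illustration, not taken from the study.

```python
import math


def mean_failures(t, alpha, beta):
    """Assumed mean value function m(t) = (alpha / beta) * (1 - exp(-beta * t))."""
    return (alpha / beta) * (1.0 - math.exp(-beta * t))


def prob_n_failures(n, t, alpha, beta):
    """Poisson probability P(M(t) = n) with mean m(t)."""
    m = mean_failures(t, alpha, beta)
    return (m ** n) / math.factorial(n) * math.exp(-m)


def expected_failures_between(t1, t2, alpha, beta):
    """Expected number of failures in the period (t1, t2], i.e. m(t2) - m(t1)."""
    return mean_failures(t2, alpha, beta) - mean_failures(t1, alpha, beta)


# Hypothetical parameters: initial failure rate 2 per period, decay rate 0.1.
alpha, beta = 2.0, 0.1
print("m(10)          =", mean_failures(10, alpha, beta))
print("P(M(10) = 12)  =", prob_n_failures(12, 10, alpha, beta))
print("E[5 < t <= 10] =", expected_failures_between(5, 10, alpha, beta))
```

Under this form, m(t) levels off at α/β as t grows, which is one way to read the notion of reliability growth: the expected number of new failures per period shrinks over time.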
Schneidewind's model is built on the belief that the failure frequency changes over time and that recent failures are more useful than past failures in predicting the future behavior of the system [8].

Gamma
This model has properties similar to those of the Weibull distribution, with the scale and shape parameters α and β, respectively. The PDF of the Gamma distribution is given by:

$$f(t) = \frac{t^{\beta-1}\, e^{-t/\alpha}}{\alpha^{\beta}\,\Gamma(\beta)}$$

where Γ is the gamma function:

$$\Gamma(x) = \int_{0}^{\infty} u^{x-1} e^{-u}\, du$$

It is known [22] that for positive integer values of x, $\Gamma(x) = (x-1)!$.
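As a quick check on the density functions introduced so far, the following sketch evaluates them with the Python standard library only. It assumes the standard textbook parameterizations (Weibull and Gamma with scale α and shape β, lognormal with µ and σ) and the SPSS-style S-curve e^{b0 + b1/t}; the helper `integrate` and all parameter values are hypothetical.

```python
import math


def weibull_pdf(t, alpha, beta):
    """Weibull density with scale alpha and shape beta, for t > 0."""
    return (beta / alpha) * (t / alpha) ** (beta - 1) * math.exp(-((t / alpha) ** beta))


def gamma_pdf(t, alpha, beta):
    """Gamma density with scale alpha and shape beta, for t > 0."""
    return t ** (beta - 1) * math.exp(-t / alpha) / (alpha ** beta * math.gamma(beta))


def lognormal_pdf(t, mu, sigma):
    """Lognormal density; mu and sigma describe ln(t), for t > 0."""
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))


def s_curve(t, b0, b1):
    """S-curve exp(b0 + b1 / t); a regression curve, not a normalized density."""
    return math.exp(b0 + b1 / t)


def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule integral of f over (lo, hi); never evaluates f at lo itself."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


# A proper density should integrate to (approximately) one.
print(integrate(lambda t: weibull_pdf(t, 10.0, 2.0), 0.0, 100.0))
print(integrate(lambda t: gamma_pdf(t, 3.0, 2.0), 0.0, 200.0))
print(integrate(lambda t: lognormal_pdf(t, 1.0, 0.5), 0.0, 200.0))
```

With a shape parameter of 1, `weibull_pdf` and `gamma_pdf` return the same values, and with a negative `b1` the S-curve rises toward `exp(b0)` as t grows.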

Exponential
The Exponential distribution is a special case of the Gamma and Weibull distributions with shape parameter β = 1. Its PDF is given by:

$$f(t) = \frac{1}{\alpha}\, e^{-t/\alpha}$$

Cubic
The PDF of the Cubic model is given as:

$$f(t) = b_0 + b_1 t + b_2 t^2 + b_3 t^3$$

where $b_0$ is a constant and $b_1$, $b_2$, $b_3$ are the regression coefficient values.

III. EXPERIMENTAL ANALYSIS

Prior to analyzing the performance estimates of the reliability growth models, the failure data for the five selected OSS products must first be collected and filtered. Therefore, the reliability estimation process is partitioned into three steps: bug-gathering, bug-filtering, and bug-analysis.

For the bug-gathering step, a Java program has been developed to extract the raw failure data from the bug repository systems for each product. Although the breadth and depth of the bug reports vary from one repository system to another, each bug report normally contains a unique identification value for the report, the actual time/date the bug is reported, some information about the user reporting the bug, the product name, and the status of the bug report as filled in by the organization in charge of the product development, such as whether the bug is fixed, valid, or deleted. The quality of reliability estimation depends heavily on a sufficient number of error reports and on the accuracy of the reports provided by the users.

During the second step, i.e. bug-filtering, the reports extracted in the first step are filtered in order to remove unwanted reports such as duplicates. The reason for filtering is that some reports may not represent a real defect, or the information provided may not be complete. Among the bug-reports for MPlayer and ClamWin, which are gathered from Sourceforge.net, those reports with a status other than Deleted (not a valid bug-report) are collected.
For the other three software products, the bug reports are gathered from Bugzilla. Bug-reports with the following status values are accepted and the rest are discarded: FIXED (bug is fixed), WONTFIX (bug will not be fixed), LATER (bug won't be fixed in the current product version), and REMIND (bug probably won't be fixed in the current product version).

Figure 1. Filtered failure intensities for the selected OSS products (x-axis: biweekly time; y-axis: failures; panels: Firefox, Eclipse, Eclipse V.2, ClamWin, MPlayer, Apache 2)

Table II. $R^2$ VALUES FOR THE SEVEN DISTRIBUTION FUNCTIONS
Distribution function (columns): Cubic, Exponential, Gamma, Lognormal, Schneidewind, S-curve, Weibull
Product (rows): Apache, ClamWin Free Antivirus, Eclipse V, Firefox, MPlayer
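The filtering rules described above, together with the biweekly grouping used throughout the study, can be sketched as follows. This is an illustration only, not the authors' Java program: the record fields (`id`, `source`, `status`, `reported`) are hypothetical, and duplicates are detected simply by a repeated `id`, which is a simplification of real duplicate-report handling.

```python
from collections import Counter
from datetime import date

# Bugzilla status values accepted in the text; everything else is discarded.
ACCEPTED_BUGZILLA = {"FIXED", "WONTFIX", "LATER", "REMIND"}


def keep_report(report):
    """Apply the per-repository acceptance rule to one report."""
    status = report["status"].upper()
    if report["source"] == "bugzilla":
        return status in ACCEPTED_BUGZILLA
    # Sourceforge.net reports: keep anything that is not marked Deleted.
    return status != "DELETED"


def filter_reports(reports):
    """Drop repeated ids (assumed duplicate marker) and rejected statuses."""
    seen, kept = set(), []
    for report in reports:
        if report["id"] in seen:
            continue
        seen.add(report["id"])
        if keep_report(report):
            kept.append(report)
    return kept


def biweekly_intensities(reports, start):
    """Count reports per 14-day period, with period 0 starting at `start`."""
    counts = Counter(
        (r["reported"] - start).days // 14 for r in reports if r["reported"] >= start
    )
    periods = max(counts) + 1 if counts else 0
    return [counts.get(i, 0) for i in range(periods)]


# Hypothetical sample records.
sample = [
    {"id": 1, "source": "bugzilla", "status": "FIXED", "reported": date(2002, 1, 1)},
    {"id": 2, "source": "bugzilla", "status": "INVALID", "reported": date(2002, 1, 2)},
    {"id": 3, "source": "bugzilla", "status": "WONTFIX", "reported": date(2002, 1, 20)},
    {"id": 3, "source": "bugzilla", "status": "WONTFIX", "reported": date(2002, 1, 20)},
    {"id": 4, "source": "sourceforge", "status": "Deleted", "reported": date(2002, 1, 3)},
    {"id": 5, "source": "sourceforge", "status": "Open", "reported": date(2002, 2, 1)},
]
kept = filter_reports(sample)
print([r["id"] for r in kept])
print(biweekly_intensities(kept, date(2002, 1, 1)))
```

The resulting list of counts per biweekly period is the kind of failure intensity series that the distribution models are fitted to.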

Finally, in the last step, i.e. bug-analysis, the dates of the filtered bug-reports are used to organize the reports into biweekly intervals for further analysis. Figure 1 (see footnote 4) exhibits the failure intensities for the five OSS products. The x-axis and y-axis represent each biweekly period and its corresponding failure intensity, respectively. Also, each graph in the figure shows the interval for which the failure reports are collected.

At a quick glance at the figure, Eclipse does not seem to follow a pattern similar to those of the other software products. Further investigation reveals that the bug reports include failures of multiple Eclipse versions. When the reports for each version are separated, the failure intensities of each version generally follow the same pattern as the other products. Therefore, rather than dealing with multiple versions of Eclipse with similar patterns, a single version, Eclipse V.2, is analyzed for reliability estimation. The last graph in Figure 1 shows the failure intensities for Eclipse V.2. This version is selected for reliability analysis because of its high volume of bug reports in comparison to other versions.

4 The intensities of bug reports are connected to form smoother plots. The purpose is to better visualize the pattern of failure reports.

A. Goodness-of-fit Performance

In this study, SPSS is used for conducting the statistical tests of goodness-of-fit. Specifically, Non-Linear Regression (NLR) is employed to measure the goodness-of-fit of the seven distribution models with respect to the selected OSS products. NLR is used because the failure intensities of the selected OSS products follow a curved pattern instead of a linear trend, which is evident from Figure 1. Table II shows the calculated $R^2$ values resulting from NLR for the seven distribution functions. $R^2$ is a measure of how well the regression estimate fits the failure data [2]. Its value is between 0 and 1, inclusive. The closer $R^2$ is to 1, the stronger the match between the estimated regression and the observed failure data. In Table II, the highest $R^2$ value among the distribution models for each product is bold-faced.

Looking at the $R^2$ values, the Cubic model exhibits the overall best fit to the observed failure data. This is followed by the Weibull distribution. Furthermore, not much discrepancy in $R^2$ values is noticed between the Cubic and Weibull distributions. One may also observe that the performance of the Gamma distribution is close to that of Weibull. Recall that the Gamma distribution has properties similar to those of Weibull. Among the seven models, S-curve shows the overall worst performance. Table III provides the best fitting models for each of the five OSS products.

Table III. BEST MODELS FOR FITTING THE FAILURE INTENSITY OF THE SELECTED OSS PRODUCTS
Product | Best fitting model
Apache 2 | Cubic, Lognormal
ClamWin Free Antivirus | Cubic, Weibull
Eclipse V.2 | Weibull
Firefox | Cubic
MPlayer | Exponential

The next objective is to determine whether the model showing the best goodness-of-fit is also the best predictor of future failures. To investigate this, the time interval of the collected failure data for each product is halved. The failure data in the first half is used to estimate the parameters of each distribution model. Then, the same parameter estimates are used to forecast the failures during the second half.

B. Prediction Performance

As indicated, the time interval of the collected failure data is divided in half: the first half is used to estimate the model parameters, and the second half is used to evaluate the predicted failures. Since the failure data for the software products under study is gathered for at least four years, there appears to be sufficient data in the first half to obtain a decent estimate of the future failures. Except for Firefox, the failure data for the products seem to be in a stabilized phase, so there should be a decent fit for the first half interval.
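The two procedures described above, fitting a curve and scoring it with $R^2$, then refitting on the first half and extending the curve over the second half, can be sketched in plain Python. Several assumptions are made here and none of them comes from the study: the fitted curve is a Weibull density multiplied by a scale factor `k` (so it can match failure counts), a coarse grid search over the shape and scale stands in for SPSS's nonlinear regression, and the data are synthetic and noise-free.

```python
import math


def weibull_pdf(t, alpha, beta):
    """Weibull density with scale alpha and shape beta, for t > 0."""
    return (beta / alpha) * (t / alpha) ** (beta - 1) * math.exp(-((t / alpha) ** beta))


def r_squared(observed, fitted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_tot = sum((y - mean) ** 2 for y in observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return 1.0 - ss_res / ss_tot


def fit_scaled_weibull(ts, ys, alphas, betas):
    """Grid search over (alpha, beta); the best scale k has a closed form."""
    best = None
    for a in alphas:
        for b in betas:
            g = [weibull_pdf(t, a, b) for t in ts]
            gg = sum(v * v for v in g)
            if gg == 0.0:
                continue
            k = sum(y * v for y, v in zip(ys, g)) / gg  # least-squares scale
            sse = sum((y - k * v) ** 2 for y, v in zip(ys, g))
            if best is None or sse < best[0]:
                best = (sse, k, a, b)
    if best is None:
        raise ValueError("no usable grid point")
    return best[1], best[2], best[3]


def predict(ts, k, alpha, beta):
    """Evaluate the scaled Weibull curve at the given biweekly periods."""
    return [k * weibull_pdf(t, alpha, beta) for t in ts]


# Synthetic failure intensities for 60 biweekly periods (hypothetical values).
periods = list(range(1, 61))
observed = predict(periods, 500.0, 20.0, 1.5)
alphas = [5.0 * i for i in range(1, 9)]
betas = [1.0, 1.5, 2.0, 2.5]

# Goodness-of-fit on the entire interval.
k, a, b = fit_scaled_weibull(periods, observed, alphas, betas)
fit_r2 = r_squared(observed, predict(periods, k, a, b))

# Prediction: estimate on the first half, forecast the second half.
half = len(periods) // 2
k1, a1, b1 = fit_scaled_weibull(periods[:half], observed[:half], alphas, betas)
forecast = predict(periods[half:], k1, a1, b1)
print("full fit:", (k, a, b), "R^2 =", fit_r2)
print("first-half fit:", (k1, a1, b1))
```

On real, noisy data the parameters estimated from the first half generally differ from those estimated on the entire interval, which is why a good fit does not guarantee a good forecast.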
Indeed, the estimates obtained for the first half show a decent fit for these products. For Firefox, even though the failure data is collected for over six years, the failure detection and removal do not seem to be in a stable state. This study attempts the prediction process for Firefox as well, to obtain better insight into situations where sufficient failure data is not available or the reliability growth of a product may not be stable.

As shown in different studies [3,4,5,6], no single reliability model is always superior to the others. However, the failure pattern can be used as a simple way to select models believed to provide a decent prediction. The prediction performance of the chosen distribution models is compared by determining the least average difference between the observed and predicted number of failures in the second half interval. This is measured by the Average Predicted Error (APE) given below:

$$\mathrm{APE} = \frac{1}{n}\sum_{i=1}^{n}\left|\mathrm{actual}_i - \mathrm{predicted}_i\right|$$

where n is the number of biweekly periods in the second half interval of a product, and $\mathrm{actual}_i$ and $\mathrm{predicted}_i$ are the observed and predicted failure intensities in the i-th of these periods.

After estimating the parameters on the first half interval and extending the fitted curves over the second half, Figure 2 gives a graphical view of the predicted ClamWin failures for the seven distribution models. Due to lack of space, the prediction graphs for the other products are not shown. However, Table IV shows the APE values for all selected products. As APE is the average difference between the actual failure intensities and the predicted ones, a smaller APE value represents a better prediction. As shown in the table, Gamma and Lognormal are good predictors. In contrast, the Cubic model, identified earlier as a good fitter, has the worst prediction performance. Comparing the table with Figure 2, the APE values support the visual patterns in the figure. Table V provides the best predictor models for each product.

Figure 2. ClamWin actual failure intensity (ClamWin FI) and prediction by the seven distribution functions (Exponential, Gamma, Lognormal, S-curve, Weibull, Schneidewind, Cubic)

Table IV. APE VALUES FOR THE SEVEN DISTRIBUTION FUNCTIONS
Distribution function (columns): Cubic, Exponential, Gamma, Lognormal, Schneidewind, S-curve, Weibull
Product (rows): Apache, ClamWin Free Antivirus, Eclipse V, Firefox, MPlayer

Table V. BEST MODELS FOR PREDICTING THE FUTURE FAILURE INTENSITY OF THE SELECTED OSS PRODUCTS
Product | Best prediction model
Apache 2 | Lognormal
ClamWin Free Antivirus | Gamma
Eclipse V.2 | S-curve
Firefox | Lognormal
MPlayer | Gamma

Based on these observations, it is concluded that the model with the best goodness-of-fit may not necessarily be a good predictor model. Comparing Tables III and V, the best models for goodness-of-fit and for prediction disagree for the majority of the products.

To better understand why the models do not show the same consistency in terms of goodness-of-fit and future prediction of failures, the Firefox product is further scrutinized. The observations are shown in Figure 3. The graph titled filtered bug pattern is the same as the Firefox failure intensity shown in Figure 1. Fitted FI is the graph fitted by Weibull based on the entire failure data. The other two graphs in the figure show the predictions when the first one year and the first two years of failure data are used for estimating the parameters of Weibull. The early portions of the two graphs are thus the fitted estimates based on one and two years of failure data, and the latter parts of the graphs are the prediction estimates. As anticipated, the prediction based on one year of failure data is very poor. With less failure data to depend on, the prediction interval becomes longer, and it becomes more difficult to predict future failures.
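The APE comparison used to rank the models can be illustrated with a short sketch. It assumes that APE is the mean absolute difference between actual and predicted failure intensities over the second half interval; the function names and the numbers are hypothetical and are not values from Table IV.

```python
def average_predicted_error(actual, predicted):
    """Mean absolute difference between actual and predicted intensities."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("series must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def best_predictor(actual, predictions):
    """Return the model name with the smallest APE, plus all APE scores."""
    scores = {
        name: average_predicted_error(actual, series)
        for name, series in predictions.items()
    }
    return min(scores, key=scores.get), scores


# Hypothetical second-half intensities and two hypothetical model forecasts.
actual = [10, 12, 8, 6]
predictions = {
    "model_a": [9, 12, 9, 6],
    "model_b": [14, 12, 8, 2],
}
winner, scores = best_predictor(actual, predictions)
print(winner, scores)
```

Because APE is computed only on periods that were withheld from parameter estimation, a model with a very high $R^2$ on the fitted interval can still receive a large APE.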
For the same reason, the prediction using two years of failure data shows better accuracy. This observation may explain why the authors in [,24] chose to predict only a small percentage of failures relative to the total failure data. Additionally, the graph of the filtered bug pattern in Figure 3 shows a dip at about the 25th biweekly period, which causes Weibull to adjust its estimate accordingly. This forces the one-year fitted graph to continue the decreasing trend of failures as time increases. This dip can also be wrongly interpreted as a sign that the cumulative failure data is becoming stable. A similar dip takes place around the 5th biweekly period, although it is not as severe as the dip in the one-year failure data used for prediction.

In general, many factors affect the accuracy of prediction. One obvious factor is the model type used. A survey done in the late 1990s by the American Society for Quality reported that only 4% of the responders could apply a SRGM [4]. Additionally, applying a SRGM correctly requires a good understanding of the product profile at different stages. For example, whether the failures are independent of each other, whether the defect removals are imperfect, and whether there has been any shift in the operation profile of the product can all affect the prediction estimate. In our own experience, the Eclipse product in Figure 1 shows a failure pattern that can be modeled by multiple distributions such as Weibull, but the fitted graph would most likely not provide a good estimate of the actual graph. Investigation revealed that the operation profile of Eclipse changes with each release of the product, which happens around January of each year.

Figure 3. Firefox actual failure intensity and prediction by Weibull based on 1-year, 2-years, and the entire failure data (series: Filtered Bug Pattern, Fitted FI, Fitted 1-Year FI, Fitted 2-Year FI)

IV. CONCLUSION

This study has attempted to compare seven reliability models with respect to estimates of failure intensities and failure forecasts against the actual failure data. The bug reports of five different OSS products are collected and used as input to the seven models. The study has used nonlinear regression analysis to measure the goodness-of-fit. As the second metric, APE is used to determine which model is the best predictor. For the selected products, Weibull and Cubic are promising models for goodness-of-fit, but the Cubic model is shown to be the worst predictor. In general, the Gamma and Lognormal models provided the best predictions of future failures, followed by the S-curve model. Therefore, the results show that a model able to provide a good fit may not be a good predictor of future failures, because of the many interacting factors. It is reasonable to believe that some failure intensities, called outliers [6], out of a larger sample may have a tangible effect on the parameters of the regression estimates.
Therefore, as an avenue of future research, it is worth investigating whether the forecast of failures improves when these outliers are removed from the estimation process based on the available failure data. Another avenue is to determine the effect on prediction of recalibrating the models used in this study.

REFERENCES

[1] A.A. Abdel-Ghaly, P.Y. Chan, B. Littlewood, Evaluation of computing software reliability, IEEE Transactions on Software Engineering, vol. SE-2, no. 9, pp , 1986.
[2] A.D. Aczel, J. Sounderpandian, Complete Business Statistics, 6th Ed., McGraw-Hill, 2005.
[3] P. Asthana, Jumping the technology S-curve, IEEE Spectrum, vol. 32, no. 6, pp , 1995.
[4] S. Brocklehurst, B. Littlewood, New ways to get accurate reliability measures, IEEE Software, pp , July 1992.
[5] Bugzilla,
[6] W.J. Conover, Practical Nonparametric Statistics, 3rd Ed., John Wiley, 1999.
[7] R. Hewett et al., On Effective Use of Reliability Models and Defect Data in Software Development,
[8] IEEE Reliability Society, IEEE recommended practice on software reliability, IEEE Std , June 2008.
[9] S.H. Kan, Metrics and Models in Software Quality Engineering, 2nd Ed., Addison-Wesley, 2003.
[10] P. Lakey, A. Neufelder, System and software reliability assurance notebook, Rome Laboratory, 1997.
[11] J.S. Lawson, C.W. Wesselman, D.T. Scott, Simple plots improve software reliability prediction models, Quality Engineering, vol. 5, no. 3, pp. 4-47, 2003.
[12] M.R. Lyu, Handbook of Software Reliability Engineering, McGraw-Hill, 1996.
[13] R. Mullen, S.S. Gokhale, The Lognormal distribution of software failure rates: Applications to software reliability growth modeling, 9th Int'l Symposium on Software Reliability Engineering, pp , 1998.
[14] H. Pham, System Software Reliability, Springer, 2006.
[15] H. Pham, L. Nordmann, A generalized NHPP software reliability model, 3rd Int'l Conference on Reliability and Quality in Design, 1997.
[16] C. Rahmani, H. Siy, A. Azadmanesh, An experimental analysis of open source software reliability, 28th IEEE Symposium on Reliable Distributed Systems, Sep 2009.
[17] E.S. Raymond, The cathedral and the bazaar: musings on Linux and open source by an accidental revolutionary, 2nd Ed., O'Reilly, 2001.
[18] N.F. Schneidewind, Analysis of error processes in computer software, SIGPLAN Notices, vol., no. 6, pp , 1975.
[19] SourceForge,
[20] SPSS,
[21] C. Stringfellow, A.A. Andrews, An empirical method for selecting software reliability growth models, Empirical Software Engineering, vol. 7, no. 4, pp , Dec 2002.
[22] K.S. Trivedi, Probability and Statistics with Reliability and Computer Science Applications, 2nd Ed., John Wiley, 2002.
[23] Y. Tamura, S. Yamada, Comparison of software reliability assessment methods for open source software and reliability assessment tool, Journal of Computer Science, vol. 2, no. 6, pp , 2006.
[24] D.R.P. Williams, Prediction capability analysis of two and three parameters software reliability growth models, Information Technology Journal, vol. 5, no. 6, pp , 2006.