A replicated Assessment and Comparison of Common Software Cost Modeling Techniques




Lionel Briand, Carleton University, Systems and Computer Engineering Department, Ottawa, ON K1S 5B6, Canada, +1 613 520 2600 2471, briand@sce.carleton.ca
Tristen Langley, CAESAR, University of New South Wales, Kensington NSW 2052, Sydney, Australia, langley@netspace.net.au
Isabella Wieczorek, Fraunhofer Institute for Experimental Software Engineering, Sauerwiesen 6, 67661 Kaiserslautern, Germany, +49 6301 707 255, wieczo@iese.fhg.de

International Software Engineering Network Technical Report ISERN-99-15

ABSTRACT
Delivering a software product on time, within budget, and to an agreed level of quality is a critical concern for many software organizations. Underestimating software costs can have detrimental effects on the quality of the delivered software and thus on a company's business reputation and competitiveness. On the other hand, overestimating software cost can result in missed opportunities to fund other projects. In response to industry demand, a myriad of estimation techniques has been proposed during the last three decades. In order to assess the suitability of a technique from a diverse selection, its performance and relative merits must be compared. The current study replicates a comprehensive comparison of common estimation techniques within different organizational contexts, using data from the European Space Agency. Our study is motivated by the challenge of assessing the feasibility of using multi-organization data to build cost models and the benefits gained from company-specific data collection. Using the European Space Agency data set, we investigated a yet unexplored application domain, including military and space projects. The results showed that traditional techniques, namely ordinary least-squares regression and analysis of variance, outperformed analogy-based estimation and regression trees. Consistent with the results of the replicated study, no significant difference was found in accuracy between estimates derived from company-specific data and estimates derived from multi-organizational data.

Keywords
Cost estimation, classification and regression trees, analogy, analysis of variance, least-squares regression, replication

1 INTRODUCTION
Accurate estimates of the cost of software projects are crucial for better project planning, monitoring, and control. Project managers usually stress the importance of improving estimation accuracy and of techniques that support better estimates. Over the last three decades, a variety of estimation techniques have been developed and investigated to provide improved estimates. Many studies have performed comparative evaluations of the prediction accuracy of different techniques. But despite intense research, only few tangible conclusions can be drawn from existing results. Two of the main reasons for this are that (1) many studies rely on small and medium-size data sets, for which results tend to be inconclusive (lack of statistical power), and (2) studies tend to be incomplete, i.e., they consider small subsets of techniques for their evaluation. Moreover, evaluation studies are rarely replicated and are usually not reported in a form that allows the comparison of results. Hence, one can only have limited confidence in the conclusions drawn from such one-time studies. As in any other experimental or empirical field, replication is key to establishing the validity and generalizability of results.
Regarding software cost estimation, we want to determine the stability of the trends observed across data sets and their generalizability across application domains and organizations. Our paper's contribution is twofold: (1) it replicates and expands on a study we previously reported in the management information systems (MIS) domain [8], and (2) it provides a procedure to perform and present cost model evaluations and comparisons. The data set used in the current study is the European Space Agency (ESA) multi-organization software project database [17][24].

The projects come from a variety of European organizations, with applications from the aerospace, military, industrial, and business domains. We will carefully compare the results obtained here with the ones reported in [8] so that common trends may be identified and differences investigated. The purpose of both our studies is to address two fundamental issues in cost estimation. First, as mentioned above, we compare the prediction accuracy of the most commonly used cost estimation techniques using a large project cost data set. This is important as the decision to select an estimation technique highly depends on its ability to provide accurate results. As a by-product, such studies help us identify the most influential cost factors and thus lead to practical benefits in terms of identifying important data to collect. Second, we assess the feasibility of using multi-organizational data sets to build software cost estimation models as compared to company-specific data. Company-specific data are believed to provide a better basis for accurate estimates, but company-specific data collection proceeds at a slow pace, as most organizations complete only a limited number of projects every year. As statistically based models require a substantial amount of data, the time required to accumulate enough data for cost modeling purposes may be prohibitive. In addition, by the time enough data are available, older project technologies may be obsolete in the organization, thus yielding models that are not fully representative of current practices. Multi-organizational databases, where organizations share project data collected in a consistent manner, can address these issues but have problems of their own. Collecting consistent data may turn out to be difficult and, because of differences in processes and practices, trends may differ significantly across organizations. In this paper we focus on whether multi-organization project data, within consistent application domains (e.g., projects coming from ESA subcontractors), can yield cost models of comparable accuracy to models based on data coming from one company.

The paper starts with a discussion of related work in Section 2, then presents our research method in Section 3 in order to enable the replication of our investigations on other data sets and in a comparable form. The project cost database on which the paper is based is described in Section 4, and we continue by describing the results of our research in Section 5. Section 6 stresses some further important points that can be concluded from our analysis results. Section 7 then summarizes our results and concludes the paper.

2 RELATED WORK
Given the diversity of estimation methods coming from statistics, machine learning, and knowledge acquisition, one is faced with the difficult exercise of determining which technique would be best in given circumstances. In order to assess a technique's appropriateness, its underlying assumptions, strengths, and weaknesses have to be known and its performance must be assessed. In response to these challenges, many different studies comparing software cost estimation techniques have been performed in the last thirty years. Investigations aimed to (1) determine which technique has the greatest effort prediction accuracy [8][14][21][31], or (2) propose new or combined techniques that could provide better estimates [5][11][25][26].
Most of the previous comparisons have been restricted to small and medium-sized data sets and have found that there is no superior technique for all data contexts. Few studies were replicated, and techniques are rarely investigated with data sets beyond those with which they were originally proposed. Even when the same data set is used in different studies, the results are not always comparable because of different experimental designs. For a more detailed description of previous studies we refer the reader to [8]. In general, assessments of effort prediction accuracy revealed not only the strengths and weaknesses of estimation methods, but also the advantages and limitations of the data sets used. However, there is still little evidence to suggest which data set characteristics are best suited to which estimation techniques.

The current study addresses the main drawbacks of many previous studies. It contributes by evaluating and comparing many of the common cost modeling techniques, using a relatively large and representative database¹, focusing on both company-specific and multi-organizational data contexts, which allows the relative utility of multi-organizational databases to be evaluated. This is done by applying a procedure consistent with a previous study that was based on data from business applications.

¹ We will see below that the data set used in this study comes from a variety of organizations and application domains, and contains typical information about cost drivers.

3 RESEARCH METHOD
Selecting Modeling Techniques
Our study considered a large subset of the many different techniques that have been proposed in the literature. These modeling techniques were selected according to the following criteria [8]:
Applied in Software Engineering: The rationale is that the technique should have an initial demonstrated utility in software engineering.
Automatable: Due to practical considerations, only techniques that can be substantially automated are considered.

Interpretable: The results of the estimation technique should provide cost engineers and managers with insights into the structures present in the data set.
Suitable Input Requirements: The input variables required for the application of a technique should be compatible with the data set used for the comparative study. The project attributes collected should be adequate and suitable for use as inputs.

Modeling techniques that fulfilled the criteria above are: Ordinary Least-Squares Regression (OLS) [23], stepwise Analysis of Variance for unbalanced data sets (stepwise ANOVA) [22], Regression Trees (CART) [4], and Analogy-based estimation [25]. We also considered combinations of CART with OLS regression and Analogy. These techniques were also applied in our previous study on the Laturi database [8]. In addition, we applied further variants of the above-mentioned techniques. They are described in more detail in the following paragraphs.

Ordinary Least-Squares Regression
We applied multivariate OLS regression analysis, fitting the data to a specified model. Using the J-test, the exponential functional form was formally shown to be the most plausible model specification [7][13]. Moreover, linear models revealed heteroscedasticity. A mixed stepwise process was followed to select variables having a significant impact on effort (α=0.05). Dummy variables [23] were created to deal with categorical, nominal-scaled variables. Ordinal-scaled variables were treated as if they were measured using an interval scale. This simplifies the analysis and has been shown to be a reasonable practice for the kind of scales used in the ESA questionnaire (see Spector [27] and Bohrnstedt et al. [3]).

Stepwise ANOVA
In line with our previous study, we applied a stepwise analysis of variance procedure that can analyze the variance of unbalanced data. We automated the procedure described by Kitchenham [22] using the Stata tool [29]. This procedure applies ANOVA to categorical variables and OLS regression to continuous variables in a stepwise manner. It alleviates problems with unbalanced data by focusing on residual analysis. The steps can be summarized as follows: For each of the independent variables (in our case cost factors), ANOVA or OLS regression is applied depending on the measurement level of the variable. The most significant factor is identified. Its effect is removed by taking the residuals as the new dependent variable values. ANOVA and OLS regression are then applied in turn to the remaining independent variables using the new dependent variable. These steps are repeated until all significant factors are removed or there are insufficient degrees of freedom. The result of the procedure is an equation including the most significant factors found during each step. Confounding factors are also identified and removed when applying the procedure. These are factors found significant in one step but insignificant in a following iteration. We transformed the continuous variables by applying their natural logarithms. This was done to ensure a normally distributed dependent variable, which is an assumption of this procedure, and to account for heteroscedasticity. Replicating our previous study, we used effort as the dependent variable (referred to as ANOVA_e). In addition, we used productivity as a dependent variable, consistent with Kitchenham's study [22] (referred to as ANOVA_p).
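The stepwise residual-removal idea can be sketched as follows. This is a minimal illustration in Python, not the Stata implementation used in the study; the pandas/statsmodels calls, the column names, and the assumption that continuous columns are already log-transformed are ours, and the sketch omits the confounding-factor check.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def stepwise_anova(df, dependent, categorical, continuous, alpha=0.05):
    """Repeatedly fit one-factor models (ANOVA for categorical factors, OLS for
    continuous ones), keep the most significant factor, and replace the
    dependent variable with the residuals of that fit."""
    y = df[dependent].copy()
    selected, remaining = [], list(categorical) + list(continuous)
    while remaining:
        best = None
        for var in remaining:
            data = pd.DataFrame({"y": y, var: df[var]})
            formula = f"y ~ C({var})" if var in categorical else f"y ~ {var}"
            fit = ols(formula, data=data).fit()
            p = sm.stats.anova_lm(fit, typ=2)["PR(>F)"].iloc[0]
            if best is None or p < best[1]:
                best = (var, p, fit)
        var, p, fit = best
        if p >= alpha:           # no remaining factor is significant
            break
        selected.append(var)
        remaining.remove(var)
        y = fit.resid            # residuals become the new dependent variable
    return selected
```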
Analogy
Analogy-based estimation involves the comparison of a new (target) project with completed (source) projects. The most similar projects are selected from a database as a basis to build an estimate. Major issues are to select an appropriate similarity function, to select relevant project attributes (in our case cost drivers), and to decide upon the number of similar projects to consider for estimation (analogues). We used the case-based reasoning tool CBR-Works, 4.0 beta [10], and implemented a similarity function identical to that proposed by Shepperd and Schofield [25] and used in the ANGEL tool [19]. This function is based on the normalized unweighted Euclidean distance. For each variable $k$ out of the $n$ variables included in the similarity measure, the distance $\delta(P_{ik}, P_{jk})$ between the target project variable value $P_{ik}$ and the source project variable value $P_{jk}$ is defined as:

$$\delta(P_{ik}, P_{jk}) = \begin{cases} \left(\dfrac{P_{ik} - P_{jk}}{\max_k - \min_k}\right)^2 & \text{if } k \text{ is continuous} \\ 0 & \text{if } k \text{ is categorical and } P_{ik} = P_{jk} \\ 1 & \text{if } k \text{ is categorical and } P_{ik} \neq P_{jk} \end{cases}$$

The overall distance between the two projects is then defined as:

$$\mathrm{distance}(P_i, P_j) = \sqrt{\frac{\sum_{k=1}^{n} \delta(P_{ik}, P_{jk})}{n}}$$

To select the $n$ variables used in the similarity measure, ANGEL determines the optimal combination of variables through an exhaustive search. This, however, is inefficient, as reported in [8][25]. We instead employed a procedure proposed by Finnie et al. [14]. For all categorical variables we applied a two-tailed t-test to determine the variables that show a significant influence on productivity.
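As an aside, the similarity measure defined above and the productivity-based prediction step described in the following paragraph can be sketched as follows. This is a minimal illustration in plain Python with dictionary-based project records; the data representation is ours and this is not the CBR-Works configuration actually used in the study.

```python
import math

def delta(p_i, p_j, categorical, value_range=None):
    """Per-variable distance: squared normalized difference for continuous
    variables, 0/1 mismatch for categorical ones."""
    if categorical:
        return 0.0 if p_i == p_j else 1.0
    lo, hi = value_range
    return ((p_i - p_j) / (hi - lo)) ** 2

def distance(target, source, variables):
    """Normalized unweighted Euclidean distance over the n selected variables."""
    total = sum(delta(target[k], source[k], m["categorical"], m.get("range"))
                for k, m in variables.items())
    return math.sqrt(total / len(variables))

def predict_effort(target, cases, variables, planned_size):
    """Retrieve the most similar completed project(s); if several are equally
    similar, average their productivity, then derive effort from planned size."""
    dists = [(distance(target, c, variables), c) for c in cases]
    best = min(d for d, _ in dists)
    analogues = [c for d, c in dists if d == best]
    productivity = sum(c["size"] / c["effort"] for c in analogues) / len(analogues)
    return planned_size / productivity
```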

We generated two levels for each variable by merging the variable's original 4 or 5 levels. To do so, we used a heuristic based on two criteria for merging the levels: (a) the proportions of observations in each level had to be balanced, and (b) marginal differences in the level means resulted in merging the corresponding levels. Predictions were based on the most similar project selected. If more than one similar project (analogue) was retrieved due to equal similarity values, the average of these projects was used. This choice is justified since various studies [8][25] report no significant differences in accuracy when using different numbers of analogues. To estimate effort, we used productivity as the dependent variable and then calculated the predicted effort by adjusting for system size [8]. The predicted effort is calculated by dividing the actual size (which has to be estimated during project planning) by the predicted productivity. Alternatively, we also tried the approaches proposed by Walkerden and Jeffery [31] and by Shepperd and Schofield [25], which additionally use system size in the similarity measure, and which use effort as a dependent variable without any size adjustment, respectively. As the results of the three approaches were almost the same, we only provide below the results from applying our approach. Detailed results can be found in the Appendix of [9].

CART
We generated regression trees based on the CART algorithm [4] using the CART tool [30]. CART examines the data and determines data splitting criteria that will most successfully partition the dependent variable. To build a regression tree, the data set is recursively split until a stopping criterion is satisfied. All but the terminal nodes in a tree specify a condition based on one of the variables that have an influence on project effort or any other dependent variable. Once a tree is generated, it may be used for effort prediction of a project. To predict the effort for a project, a path is followed through the tree according to the project's specific variable values until a terminal node is reached. The mean or median value of the terminal node may then be used as the predicted value. Replicating our previous study [8], we used productivity as a dependent variable (referred to as CART_p). The stopping criterion was set to a minimum of five observations in each terminal node. Predictions were based on the median productivity in a terminal node. When using productivity, predicted effort is calculated by dividing the actual system size by the predicted productivity within the terminal node. In addition, we generated separate CART trees using effort as a dependent variable (referred to as CART_e) and system size as an additional independent variable, consistent with the studies by Kitchenham [22] and Srinivasan and Fisher [28].

Combinations of Techniques
Consistent with our previous study, we applied two combinations of techniques to our data set. We combined CART trees with OLS regression using the tree structures produced by CART_p. For each terminal node in a tree, the observations belonging to that node were used to build a simple OLS regression model. We fitted an exponential relationship between effort and system size. At each terminal node, instead of the median productivity, an OLS equation was applied to predict effort. A project's effort is determined by following the tree to a terminal node and then applying the effort equation corresponding to that node.
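A minimal sketch of this CART_p + OLS combination is shown below. It assumes NumPy arrays X (numerically encoded cost drivers), size, and effort, and uses scikit-learn's regression trees rather than the commercial CART tool applied in the study, so the split criterion differs in detail.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Grow a productivity tree with at least five projects per terminal node,
# then fit a log-log effort-size regression within each terminal node.
productivity = size / effort
tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X, productivity)

leaf_models = {}
for leaf in np.unique(tree.apply(X)):
    idx = tree.apply(X) == leaf
    slope, intercept = np.polyfit(np.log(size[idx]), np.log(effort[idx]), 1)
    leaf_models[leaf] = (intercept, slope)      # effort ~ exp(intercept) * size**slope

def predict_effort(x_new, planned_size):
    """Route the new project to its terminal node and apply that node's equation."""
    a, b = leaf_models[tree.apply(x_new.reshape(1, -1))[0]]
    return np.exp(a) * planned_size ** b
```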
We also combined CART and our Analogy-based approach using the tree structures produced by CART_p. For each terminal node in a tree, the observations belonging to that node were used as the basis for similarity-based retrieval. A project's effort is determined by following the tree down to a terminal node and selecting the most similar project within the subset of projects that corresponds to that node.

Evaluation Criteria
A common evaluation criterion for cost estimation models was proposed by Conte et al. [11]. It is the Magnitude of Relative Error (MRE), defined as:

$$MRE_i = \frac{\left| \text{Actual Effort}_i - \text{Predicted Effort}_i \right|}{\text{Actual Effort}_i}$$

The MRE value is calculated for each observation $i$ whose effort is predicted. The aggregation of MRE over multiple observations, say $N$, can be achieved through the Mean MRE (MMRE). However, the MMRE is sensitive to individual predictions with excessively large MREs. Therefore, an aggregate measure less sensitive to extreme values should also be considered, namely the median of the MRE values for the $N$ observations (MdMRE). Another criterion that is commonly used is the prediction at level $l$:

$$PRED(l) = \frac{k}{N}$$

where $k$ is the number of observations whose MRE is less than or equal to $l$. The criteria we used to assess and compare cost estimation models are the relative values of MMRE, MdMRE, and PRED(0.25) for the different techniques.
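These accuracy criteria translate directly into code; a minimal sketch is given below, assuming NumPy arrays of actual and predicted effort (the array and function names are illustrative).

```python
import numpy as np

def mre(actual, predicted):
    """Magnitude of relative error for each predicted observation."""
    return np.abs(actual - predicted) / actual

def mmre(actual, predicted):
    """Mean MRE over all predicted observations."""
    return mre(actual, predicted).mean()

def mdmre(actual, predicted):
    """Median MRE, less sensitive to extreme individual errors."""
    return np.median(mre(actual, predicted))

def pred(actual, predicted, level=0.25):
    """Proportion of observations whose MRE is at or below `level`."""
    errors = mre(actual, predicted)
    return (errors <= level).sum() / len(errors)
```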

Cross Validation
If one constructs a cost estimation model using a particular data set and then computes the accuracy of the model using the same data set, the accuracy evaluation will be optimistic [32]. The cross-validation approach we use involves dividing the whole data set into multiple train and test sets, calculating the accuracy for each test set, and then aggregating the accuracy across all the test sets. In both our previous and the current study, we investigate the performance of estimation techniques when models are built using company-specific data (local models) and when they are built using multi-organizational data (generic models). To determine the accuracy of generic cost models, we partitioned the data set by organization. We limited this to organizations for which there are 8 or more projects in our data set (60 projects in total). These 60 projects were provided by 4 companies, which we will refer to as company 1 to company 4. The remaining 100 projects came from 65 different companies, and we considered this chunk as one additional partition. A training set is the whole data set minus a certain organization's projects. This resulted in 4 different training-test set combinations. Calculating accuracy in this manner indicates the accuracy of using an external, multi-organizational data set for building a cost estimation model and then testing it on an organization's projects. To determine the accuracy of local cost estimation models, we used the 29 projects coming from company 1. Here, we used a three-fold cross-validation approach: we randomly generated 3 test sets, and for each test set we used the remaining projects as the training set. The overall accuracy is aggregated across the three test sets. Calculating accuracy in this manner indicates the accuracy to be expected if an organization builds a model using its own data set and then uses that model to predict the cost of new projects.

4 DATABASE DESCRIPTION
The database used in this study is the European Space Agency (ESA) multi-organization software project database [17]. Since 1988, the ESA has continuously collected historical project data on cost and productivity from different application domains. The data comes from European organizations, with applications from the aerospace, military, industrial, and business environments. Each data supplier is contacted on a regular basis to determine if projects are nearing completion. Once a project questionnaire is filled out, each data supplier is contacted to ensure the validity and comparability of the responses. Each data supplier regularly receives data analysis reports of the data set. At the time of our analysis, the database consisted of 166 projects. The breakdown of projects by environment was: 34% space, 31% military, 21% industry, 9% business, and 5% other environments. The projects originated from 69 different organizations in 10 different European countries. Four companies provided data for at least 8 projects: company 1 provided 29 projects, company 2 provided 15 projects, and companies 3 and 4 provided 8 projects each. We can see from this description that we have a varied set of projects, which we think are representative of European industry practice. This is important if we want the results of our study to be generalizable to a large population of the software industry. System size is measured in KLOC. Because the data set contains multiple programming languages, we adjusted each project's KLOC value to obtain a comparable size measure. To calculate the adjusted KLOC for each project we used the language levels proposed by Jones [20]. It was decided to exclude 6 projects from the analysis, because these projects were enhancements rather than new projects. Missing data is a common attribute of software engineering data [5][16]. In our case, only 90 projects had values for all the variables used in the current analysis.
We will explain how missing data were handled when discussing our results. The variables taken into account in our analysis are listed in Table 1. These are variables that may potentially have an impact on project cost and include system size plus 7 of the COCOMO cost drivers [1].

Variable | Description | Scale | Values / Range / Unit
ENV | Domain the project was developed for | nominal | Space, Military, Business, Industry
Adj_KLOC | Adjusted new developed KLOC | ratio | 1 KLOC = 1000 LOC
EFFORT | Effort for SW project | ratio | Person hours, where 144 person hours = 1 person month
TEAM | Maximal team size at any time during a project | ratio | 1-120
VIRT | Virtual machine volatility | ordinal | 2-5 (low - very high)
RELY | Required reliability | ordinal | 1-5 (very low - very high)
TIME | Execution time constraints | ordinal | 2-6 (low - extra high)
STOR | Main storage constraint | ordinal | 2-6 (low - extra high)
MODP | Use of modern programming practices | ordinal | 1-5 (very low - very high)
TOOL | Use of software tools | ordinal | 2-5 (very low - very high)
LEXP | Programming language experience | ordinal | 1-5 (very low - very high)
Table 1: Variables from the ESA Database

Descriptive Statistics
Table 2 summarizes the descriptive statistics for system size (in adjusted KLOC: Adj_KLOC) and project effort (in person-months: PM). The table shows results for the whole database and for the company that provided 29 projects. All but one of these projects are from the military environment. On average, projects from this company have higher effort than projects in the remainder of the database, whereas system size is lower. Company 1 therefore shows low productivity compared with the remainder of the data set. This may be due to the fact that 28 out of 29 projects are military applications with high reliability requirements and time constraints.

 | Company 1: Size (Adj_KLOC) | Company 1: Effort (PM) | Whole DB: Size (Adj_KLOC) | Whole DB: Effort (PM)
min | 10.5 | 11.1 | 5.5 | 3
mean | 133.97 | 558.97 | 264.73 | 231.35
max | 732 | 4361 | 2948 | 4361
st.dev | 174.15 | 1063.81 | 449.46 | 535.86
obs. | 29 | 29 | 160 | 160
Table 2: Descriptive Statistics for system size and effort

Table 3 summarizes the descriptive statistics for team size. On average, the maximal team size at any stage in a project is higher within company 1 than for the remainder of the database.

 | Company 1: Teamsize (TEAM) | Whole DB: Teamsize (TEAM)
min | 2 | 1
mean | 20.11 | 11.22
max | 120 | 120
st.dev | 29.64 | 15.75
obs. | 28 | 131
Table 3: Descriptive Statistics for team size

5 DATA ANALYSIS RESULTS
This section reports the results concerning the comparisons of all the modeling techniques briefly described in Section 3. Table 4, Table 5, and Table 6 report, respectively, average results for the whole database, for the company 1 data set, and for the company 1 data set when using the remainder of the data set as a learning sample for model building. Before delving into the details of the result tables, we need to discuss a few technical issues regarding our comparison procedure. Some estimation techniques, such as Analogy, are able to provide effort predictions even when there is missing data for a project. Other techniques, such as OLS regression, cannot cope with missing data related to the parameters used in their specifications. To ensure a valid comparison of all the applied techniques, we selected a subset of projects from our holdout samples for which there was no missing data. For a set of 39 projects (a subset of the holdout samples from companies 1, 2, 3, and 4), all techniques were able to provide predictions within the multi-organizational context. Within the company-specific context (company 1), all techniques could predict effort for 25 projects. In this paper, we only report the median MRE values for the sake of brevity. It is important to note, however, that the main trends in the results are supported when looking at other goodness-of-fit measures such as the mean MRE or pred(0.25) (see Appendix of [9]). Having refined our tool suite, we were able to assess a few additional modeling variants as compared to our previous study. Since no results from the previous study are available for these additional variants, a * is placed in the third column of Table 4, Table 5, and Table 6. Significant differences in accuracy across the techniques are indicated through differently shaded cells in the tables.
Dark gray rows indicate the group of techniques that performed poorly and within which no significant difference can be identified. Next are the light gray rows, which represent a second group of techniques performing significantly better than the first group but worse than the third group (white rows), which contains the best-performing techniques. Statistical significance is tested using the matched-pair Wilcoxon signed rank test [15], a nonparametric analogue to the t-test.
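For illustration, such a paired comparison could be run as follows. This is a minimal sketch only; the array names are ours, and the per-project MREs of the two techniques are assumed to come from the same holdout projects (paired observations).

```python
from scipy.stats import wilcoxon

# Per-project MREs of two techniques on the same holdout projects (paired data).
statistic, p_value = wilcoxon(mre_ols, mre_analogy)
significant = p_value < 0.05   # reject "no difference in accuracy" at alpha = 0.05
```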

Comparison of Modeling Techniques
Table 4 presents the average results obtained when using cross validation, for all models, across the whole ESA project database. Table 5 presents comparable results for company 1. We also systematically compare the results with the ones from our former study on the Laturi data set (third column in Tables 4 and 5) [8]. Considering the ESA database, OLS regression and ANOVA_e (using effort as a dependent variable) perform significantly better than the other techniques. This can be observed for the multi-organizational (Table 4, second column) as well as for the company-specific context (Table 5, second column). The technique proposed by Kitchenham [22] seems to perform as well as simple OLS regression. Considering that its automation is significantly more complex, it is doubtful that any benefit can be gained by using stepwise ANOVA on a data set similar to the one we use here. So, despite the numerous claims in the literature, it seems that simply using traditional OLS regression provides the best results overall, though there is still room for improvement.

Within the multi-organizational context, ANOVA_p (light gray cell, second column in Table 4) was significantly different from and slightly better than CART, Analogy, and all combinations of techniques (dark gray cells). The main reason for the different accuracy levels may be the different assumptions made by the techniques about the size-effort relationship. The two most accurate techniques (i.e., ANOVA_e and OLS) modeled effort as opposed to productivity. Both are based on log-transformations of some cost-driver variables and therefore best accommodated the exponential effort-size relationship that seems to underlie our data set, like most large software project cost data sets. This relationship was poorly modeled by ANOVA_p, Analogy, CART_p, and the CART_p combinations. These techniques assume effort is proportional to size (other cost drivers being constant) when using productivity as a dependent variable. CART_e considered system size as an independent variable but did not adjust effort to account for any relationship between effort and size. Therefore, in line with the widely supported view that system size is the most important cost driver [1][2], estimation techniques that could account for non-linear system size-effort relationships provided more accurate predictions.

Confirming the results from our previous study [8], Analogy does not seem to bring any significant advantage over other modeling techniques. This contradicts some of the most recent work in this domain [25][31][14]. We observed extreme outlier predictions with Analogy-based estimation. Having a closer look at those observations, we found that the retrieved analogue matched perfectly with respect to the COCOMO factors but differed greatly in terms of system size. Consequently, we think the poor performance of Analogy in our study can be attributed to the equal weighting of variables in the similarity measure. This illustrates the importance of defining appropriate similarity measures. Another result is that combining techniques did not lead to any significantly improved predictions. Considering that these results are consistent across two distinct, large data sets, the software cost engineering community should start questioning whether looking for yet more modeling technique variants is likely to yield any practically significant benefit in terms of prediction accuracy. This has, however, been a major activity in recent years, consuming most of the research effort in software cost engineering. Another result, which has been observed numerous times in previous studies, is that even the best models are not very accurate (32% median MRE), although better than most reported results.

Techniques | MdMRE (ESA) | MdMRE (Laturi)
OLS | 0.32 | 0.44
ANOVA_e | 0.40 | 0.45
ANOVA_p | 0.62 | *
CART_p + OLS | 0.68 | 0.42
CART_e | 0.70 | *
Analogy | 0.76 | 0.61
CART_p | 0.80 | 0.39
CART_p + Analogy | 0.84 | 0.57
Number of predicted projects | 39 | 119
Table 4: Average results over all holdout samples using the whole database

Table 5 reports results within one company, i.e., using more homogeneous but smaller data sets, for both the ESA and the Laturi study. Three accuracy levels were identified using the ESA data. Again, OLS regression and ANOVA_e appeared to perform significantly better than all the other techniques, and CART_e (light gray cell, second column within Table 5) shows an intermediate prediction accuracy.
It can be observed that the techniques with relatively high and medium prediction accuracy used effort as a dependent variable. All the techniques that considered productivity as a dependent variable performed significantly worse (dark gray cells, second column within Table 5). This is likely because using productivity as a dependent variable, instead of using size as an independent variable and modeling its impact on effort, has implications for the model's assumptions: it implies that effort and size are proportional.

Techniques | MdMRE (ESA) | MdMRE (Laturi)
ANOVA_e | 0.26 | 0.42
OLS | 0.34 | 0.41
CART_e | 0.41 | *
CART_p + OLS | 0.68 | 0.47
Analogy | 0.73 | 0.48
CART_p | 0.73 | 0.46
ANOVA_p | 0.73 | *
CART_p + Analogy | 0.74 | 0.43
Number of predicted projects | 25 | 63
Table 5: Average results over all holdout samples using company-specific data

The results from the Laturi study showed no significant differences among the techniques applied within the one-company context. Such differences are to be expected across data sets, especially small ones like most one-company data sets, as the accuracy of modeling techniques is sensitive to the underlying structure of the data, e.g., linearity of relationships and outliers.

Comparison of Local versus Generic Models
The ESA results in Table 6 (second column) are obtained using the whole database minus the company 1 projects for model building, and applying the models to company 1 only. This corresponds to the situation where company 1 has only external data available to build a cost estimation model. We compare these results with Table 5 (second column), which corresponds to the situation where a company has its own local data available to build cost estimation models. We note that the median MRE values have not significantly improved by going to a one-company data set, contrary to what could have been expected. The largest difference in median MRE values (28%) can be observed for CART_e (0.41 vs. 0.69). This is again consistent with results from our previous study on the Laturi data set (Table 5, third column vs. Table 6, third column). In general, it appears that there is no obvious advantage to building local cost models based on company-specific data when generic cost drivers are collected. It is possible that the advantage of using a more homogeneous data set is offset by the significantly smaller project sample available to build the models.

Techniques | MdMRE (ESA) | MdMRE (Laturi)
ANOVA_e | 0.30 | 0.49
OLS | 0.32 | 0.47
CART_e | 0.69 | *
CART_p + OLS | 0.82 | 0.46
Analogy | 0.76 | 0.61
CART_p | 0.80 | 0.46
ANOVA_p | 0.69 | *
CART_p + Analogy | 0.85 | 0.56
Number of predicted projects | 25 | 63
Table 6: Results on the one-company data using the remainder of the whole database

Comparison with our Previous Study
Consistent with the findings of the previous study, the current study found no significant advantage in using local, company-specific data to build estimates over using external, multi-organizational databases. From a general perspective, the results from both studies may have serious implications for the way we collect cost data and build data-driven cost models. To really benefit from collecting company-specific cost data, one should not just automatically collect generic, COCOMO-like factors, but investigate the important factors to be considered in the organization and design a tailored, specific measurement program [6]. One difference between the two studies is that CART_p appears to perform significantly better in the Laturi study, where it appeared comparably accurate to OLS regression and ANOVA_e. However, in the current study CART_p, whether combined with other techniques or not, performed relatively poorly. Depending on the structures present in the data, we have to expect variations in performance across data sets. For example, the trees generated with the Laturi data were more balanced than those generated with the ESA database, i.e., more comparable proportions of the data set could be found along the different branches. Therefore, with the ESA data set, CART predictions were sometimes based on relatively small samples in the terminal nodes. From the results tables above, we can observe from the median MRE values that the best models are, in most cases, more accurate on the ESA database. This is particularly true for OLS regression and ANOVA_e (using effort as a dependent variable). This may be explained in part by the fact that most projects come from ESA contractors, who have to follow a consistent high-level development process. More process consistency may lead to more accurate models. These results are confirmed when looking at the mean MRE or pred(0.25) (see Appendix of [9]).
On the other hand, the differences in accuracy among the techniques are larger for the ESA data set than for the Laturi data. In the multi-organizational context (Table 4), for the Laturi and ESA data sets, we observe two and three different accuracy levels, respectively. Within one company (Table 5), there were no significant differences among the techniques in the Laturi study, but three accuracy levels in the current study. Another important result is that none of the strategies for combining modeling techniques have been very successful. Modeling techniques like CART and OLS are inherently complementary in the sense that they can potentially capture different types of structures in the data, e.g., local structures and interactions for CART versus global, linear or log-linear structures for OLS regression. Research into procedures for combining modeling techniques might therefore bring greater benefits.

Selected Model Variables
We summarize the independent variables selected for the different model types in order to gain some insight into the importance of those variables for cost prediction.

ANOVA_e | TEAM, Adj_KLOC, STOR
OLS | Adj_KLOC, TEAM, STOR, VIRT
ANOVA_p | TEAM, STOR, LEXP, TOOL
CART_e | TEAM, Adj_KLOC
CART_p | TEAM, TOOL, ENV, STOR, LEXP, RELY, MODP, VIRT, TIME
Analogy | TEAM, RELY, TIME, STOR, MODP, TOOL, LEXP, ENV ²
CART_p + OLS | Adj_KLOC, TEAM, TOOL, ENV, STOR, LEXP, RELY, MODP, VIRT, TIME

² For all Analogy variants and all combinations of Analogy with CART_p, all variables have been selected the same number of times during cross validation.

Table 7: Selected model variables

Table 7 provides, for each technique, the variables in decreasing order of importance when using the whole database for model building. This ordering is based on the frequency of variable selection during the cross-validation process (except for Analogy). It is clear from the results that the two main, overriding trends are the effects of system size and maximum team size on effort. STOR (storage constraints) also seems to play an important role.

6 FURTHER CONSIDERATIONS
We have seen that ANOVA_e and OLS regression are better, across data sets, in terms of their predictive capability using cross validation. Now, it would also be interesting to investigate whether any other criterion might justify using less accurate modeling techniques. What speaks in favor of OLS regression is the level of maturity of the statistical packages that implement it, and the abundance of literature and experience on the subject. Stepwise ANOVA is not as well documented, and automating it requires implementing it using a statistical package's programming capabilities. Both stepwise ANOVA and OLS regression require a significant statistical background to be used and interpreted. This is not the case for CART, which is easier to interpret and use for model building and which is supported by a relatively mature commercial tool. Therefore, despite the results shown above, in a situation where a balanced, thorough CART tree is generated, and in a context where interacting with domain experts is crucial for the interpretation of the cost model, CART may be a better alternative than OLS regression. Analogy is a priori simple to use as it does not require any data modeling. However, this is misleading, as the complexity of Analogy lies in the definition of the similarity and adaptation functions with which analogues are selected and predictions are adjusted. There is no straightforward, well-understood way of defining similarity. Moreover, little is published about how to appropriately define adaptation rules. Therefore, only a context where not enough data is available for OLS regression or CART model building, and where experts can help define similarity, would seem to warrant the use of Analogy.

7 CONCLUSIONS
The contribution of this paper lies mainly in assessing and comparing the predictive accuracy of a comprehensive set of cost modeling techniques on representative cost data sets, following a rigorous and repeatable procedure. In addition, by comparing our results to a previous study with a similar objective, we could better understand which results seem to generalize across data sets. The main conclusions from our studies are of high significance to the field of software cost modeling. A great deal of effort has been spent over the last 20 years on devising and comparing ways of modeling software development data to achieve higher accuracy. However, our results, which are based on two relatively large (by software engineering standards), representative data sets, show that ordinary least-squares regression is probably sufficient to get the most out of your data and help predict development effort. If other modeling techniques among the ones we investigate here are to be used, this should probably not be motivated by an attempt to obtain better accuracy. As discussed above, other reasons may justify it. As far as we know, our research initiative is the first to undertake a wide-scale comparison of cost models on relatively large data sets.
Our results have practical implications as they suggest that a great deal of research in the area of cost modeling has not produced practically significant improvements. Research in software cost engineering should perhaps put less emphasis on investigating ways of achieving better predictive accuracy through new data modeling procedures, and re-direct that effort to other questions that call for more investigation, such as subjective effort estimation [18], modeling based on expert knowledge elicitation [6], and techniques combining expert opinion and project data [12].

ACKNOWLEDGEMENTS
We would like to thank the European Space Agency and INSEAD for giving us access to the ESA data. In particular, we are grateful to Benjamin Schreiber from the ESA/ESTEC center in the Netherlands for making this analysis possible. The ESA database [17] is available to any organization that contributes project data to the database.

REFERENCES
1. Boehm, B. Software Engineering Economics. Englewood Cliffs, NJ: Prentice Hall (1981).
2. Boehm, B., Clark, B., Horowitz, E., Westland, C. Cost models for future software life cycle processes: COCOMO 2.0. Annals of Software Engineering, 1 (1995), 57-94.
3. Bohrnstedt, G., Carter, T. Robustness in Regression Analysis. In: Costner, H. (ed.), Sociological Methodology, Chapter 5. Jossey-Bass (1971).
4. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books & Software (1984).
5. Briand, L.C., Basili, V.R., Thomas, W.M. A pattern recognition approach for software engineering data analysis. IEEE Transactions on Software Engineering, 18, 11 (1992), 931-942.
6. Briand, L.C., El Emam, K., Bomarius, F. A Hybrid Method for Software Cost Estimation, Benchmarking, and Risk Assessment. Proceedings of the 20th International Conference on Software Engineering, ICSE-20 (April 1998), 390-399.

7. Briand, L.C., El Emam, K., Wieczorek, I. Explaining the Cost of European Space and Military Projects. In: Proceedings of the 21st International Conference on Software Engineering, ICSE 99 (Los Angeles, USA, 1999), 303-312.
8. Briand, L.C., El Emam, K., Maxwell, K., Surmann, D., Wieczorek, I. An Assessment and Comparison of Common Software Cost Estimation Models. In: Proceedings of the 21st International Conference on Software Engineering, ICSE 99 (Los Angeles, USA, 1999), 313-322.
9. Briand, L.C., Langley, T., Wieczorek, I. Using the European Space Agency Database: A Replicated Assessment of Common Software Cost Estimation Techniques. Technical Report ISERN TR-99-15, International Software Engineering Research Network (1999).
10. CBR-Works, 4.0 beta. Research Group Artificial Intelligence - Knowledge-Based Systems, University of Kaiserslautern. <http://wwwagr.informatik.uni-kl.de/~lsa/cbratukl.html>
11. Conte, S.D., Dunsmore, H.E., Shen, V.Y. Software engineering metrics and models. The Benjamin/Cummings Publishing Company, Inc. (1986).
12. Chulani, S., Boehm, B., Steece, B. Bayesian Analysis of Empirical Software Engineering Cost Models. IEEE Transactions on Software Engineering, 25, 4 (1999).
13. Davidson, R., MacKinnon, J. Several Tests for Model Specification in the Presence of Alternative Hypotheses. Econometrica, 49, 3 (1981), 781-793.
14. Finnie, G.R., Wittig, G.E. A comparison of software effort estimation techniques: using function points with neural networks, case-based reasoning and regression models. J. Systems Software, 39 (1997), 281-289.
15. Gibbons, J.D. Nonparametric Statistics. Series: Quantitative Applications in the Social Sciences 90, SAGE University Paper (1993).
16. Gray, A., MacDonell, S. A comparison of techniques for developing predictive models of software metrics. Information and Software Technology, 39 (1997), 425-437.
17. Greves, D., Schreiber, B. The ESA Initiative for Software Productivity Benchmarking and Effort Estimation. ESA Bulletin, 87 (August 1996). <http://esapub.esrin.esa.it/bulletin/bulett87/greves87.htm>
18. Höst, M., Wohlin, C. An Experimental Study of Individual Subjective Effort Estimation and Combinations of the Estimates. In: Proceedings of the 20th International Conference on Software Engineering, ICSE 98 (Japan, 1998), 332-339.
19. ANGEL tool. <http://dec.bournemouth.ac.uk/dec_ind/decind22/web/Angel.html>
20. Jones, C. Applied Software Measurement: Assuring Productivity and Quality, 2nd Ed., McGraw-Hill, NY, USA (1996).
21. Jørgensen, M. Experience with the accuracy of Software Maintenance Task Effort Prediction Models. IEEE Transactions on Software Engineering, 21, 8 (August 1995), 674-681.
22. Kitchenham, B.A. A Procedure for Analyzing Unbalanced Data Sets. IEEE Transactions on Software Engineering, 24, 4 (April 1998), 278-301.
23. Schroeder, L., Sjoquist, D., Stephan, P. Understanding Regression Analysis: An Introductory Guide. No. 57 in Series: Quantitative Applications in the Social Sciences, Sage Publications, Newbury Park, CA, USA (1986).
24. Maxwell, K., Van Wassenhove, L., Dutta, S. Software Development Productivity of European Space, Military and Industrial Applications. IEEE Transactions on Software Engineering, 22, 10 (1996).
25. Shepperd, M., Schofield, C. Estimating software project effort using analogies. IEEE Transactions on Software Engineering, 23, 12 (November 1997), 736-743.
26. Stensrud, E., Myrtveit, I. Human Performance Estimation with Analogy and Regression Models. In: Proceedings of the METRICS 98 Symposium (1998), 205-213.
27. Spector, P. Ratings of Equal and Unequal Response Choice Intervals. The Journal of Social Psychology, 112 (1980), 115-119.
28. Srinivasan, K., Fisher, D. Machine learning approaches to estimating software development effort. IEEE Transactions on Software Engineering, 21, 2 (February 1995), 126-137.
29. StataCorp. Stata Statistical Software: Release 5.0. Stata Corporation, College Station, Texas (1997). <http://www.stata.com>
30. Steinberg, D., Colla, P. CART, Classification and Regression Trees, Tree Structured Non-parametric Data Analysis, Interface Documentation. Salford Systems (1995). <http://www.salfordsystems.com/index.html>
31. Walkerden, F., Jeffery, R. An Empirical Study of Analogy-based Software Effort Estimation. Empirical Software Engineering, 4, 2 (June 1999), 135-158.
32. Weiss, S., Kulikowski, C. Computer Systems that Learn. Morgan Kaufmann Publishers, Inc., San Francisco, CA (1991).

APPENDIX
The following tables summarize the mean MRE (MMRE), median MRE (MdMRE), and Pred(0.25) values. Table A-1 gives the results based on the entire database using a 4-fold cross validation. Table A-2 summarizes the results based on the company 1 data using a 3-fold cross validation. The shaded cells for company 1 in Table A-1 correspond to Table 6, second column. The shaded cells in row Avg. of Table A-1 correspond to Table 4, second column. The shaded cells in Table A-2 correspond to Table 5, second column. Two additional Analogy variants are presented: Analogy2, proposed by Walkerden and Jeffery (1998) [31], uses productivity as a dependent variable, includes size in the similarity measure, and adjusts for size during effort prediction. Analogy3, proposed by Shepperd et al. (1997) [25], uses effort as a dependent variable, includes size in the similarity measure, but does not adjust for size during effort prediction.

By Company | Measure | OLS Regression | ANOVA_p | ANOVA_e | Analogy | Analogy2 | Analogy3 | CART_p | CART_e | CART_p + OLS | CART_p + Analogy
1 | MMRE | 0.3640 | 0.5904 | 0.3948 | 0.7398 | 1.1603 | 3.3679 | 0.6721 | 0.9270 | 0.6433 | 0.7819
1 | MdMRE | 0.3207 | 0.6920 | 0.3002 | 0.7567 | 0.8440 | 0.9289 | 0.7951 | 0.6948 | 0.8165 | 0.8537
1 | Pred(0.25) | 48% | 20% | 32% | 4% | 8% | 4% | 16% | 16% | 20% | 8%
2 | MMRE | 0.6082 | 0.6410 | 0.8604 | 0.9356 | 0.9110 | 2.2962 | 1.3252 | 2.0179 | 1.3651 | 1.3736
2 | MdMRE | 0.4247 | 0.6742 | 0.5128 | 0.9177 | 0.7097 | 1.6230 | 0.8298 | 0.6978 | 0.8819 | 0.7117
2 | Pred(0.25) | 42.9% | 0% | 14.3% | 0% | 0% | 14.3% | 0% | 14.3% | 0% | 0%
3 | MMRE | 0.2150 | 2.5522 | 0.3410 | 15.3099 | 0.7447 | 1.4640 | 1.8305 | 0.5221 | 0.2616 | 1.5179
3 | MdMRE | 0.2502 | 2.5506 | 0.4085 | 0.9177 | 0.5627 | 0.9320 | 1.4977 | 0.6855 | 0.3739 | 1.4405
3 | Pred(0.25) | 33.3% | 0% | 33.3% | 33.3% | 33.3% | 0% | 0% | 33.3% | 33.3% | 33.3%
4 | MMRE | 0.3530 | 0.1421 | 0.2912 | 0.5768 | 0.4982 | 0.4541 | 1.2688 | 1.3665 | 3.1994 | 0.9919
4 | MdMRE | 0.3653 | 0.1550 | 0.2634 | 0.5756 | 0.5330 | 0.4717 | 0.4595 | 1.1672 | 1.3925 | 0.5069
4 | Pred(0.25) | 25% | 100% | 50% | 0% | 0% | 0% | 25% | 25% | 25% | 0%
Avg. | MMRE | 0.3952 | 0.7044 | 0.4636 | 1.8790 | 1.0157 | 2.7303 | 0.9396 | 1.1367 | 1.006 | 0.9663
Avg. | MdMRE | 0.3207 | 0.6218 | 0.3962 | 0.7570 | 0.7097 | 0.8600 | 0.7951 | 0.6978 | 0.6887 | 0.8440
Avg. | Pred(0.25) | 43.6% | 23.1% | 30.8% | 5.1% | 7.7% | 5.1% | 12.8% | 18.0% | 18.0% | 7.7%
Table A-1: Results using the whole database

By Random split | Measure | OLS Regression | ANOVA1 | ANOVA2 | Analogy1 | Analogy2 | Analogy3 | CART1 | CART2 | CART1 + OLS Reg. | CART1 + Analogy1
1 | MMRE | 0.4883 | 2.3377 | 0.4951 | 1.3617 | 1.1904 | 0.7875 | 0.9466 | 0.3549 | 0.7624 | 1.3761
1 | MdMRE | 0.5036 | 0.7450 | 0.4792 | 0.7639 | 0.7639 | 0.5486 | 0.7294 | 0.3686 | 0.6031 | 0.7639
1 | Pred(0.25) | 33.3% | 22.2% | 33.3% | 0% | 0% | 11.1% | 0% | 33.3% | 33% | 0%
2 | MMRE | 0.4898 | 0.6393 | 0.5380 | 1.0439 | 1.0135 | 1.5856 | 1.0124 | 1.3096 | 0.9519 | 1.0302
2 | MdMRE | 0.3351 | 0.6324 | 0.2646 | 0.7292 | 0.6397 | 0.6189 | 0.7342 | 0.6189 | 0.7160 | 0.7292
2 | Pred(0.25) | 33.3% | 22.2% | 44.4% | 11.1% | 11.1% | 33.3% | 11.1% | 22.2% | 0% | 11.1%
3 | MMRE | 0.2859 | 0.8842 | 0.3085 | 0.9906 | 0.6304 | 0.6391 | 1.6379 | 0.4167 | 1.5846 | 1.0226
3 | MdMRE | 0.1739 | 0.6206 | 0.1840 | 0.5681 | 0.4584 | 0.7445 | 0.5050 | 0.3987 | 0.6043 | 0.5681
3 | Pred(0.25) | 71.4% | 14.3% | 71.4% | 0% | 28.6% | 0% | 14.3% | 28.6% | 0% | 0%
Avg. | MMRE | 0.4322 | 1.3193 | 0.4583 | 1.1434 | 0.9699 | 1.0332 | 1.1639 | 0.7156 | 1.0609 | 1.1526
Avg. | MdMRE | 0.3351 | 0.7319 | 0.2646 | 0.7292 | 0.6786 | 0.6189 | 0.7294 | 0.4047 | 0.6873 | 0.7444
Avg. | Pred(0.25) | 44% | 20% | 48% | 4% | 12% | 16% | 8% | 28% | 12% | 4%
Table A-2: Results using the company 1 data