A replicated Assessment and Comparison of Common Software Cost Modeling Techniques


Lionel C. Briand, Carleton University, Systems and Computer Engineering Department, Ottawa, ON K1S 5B6, Canada
Tristen Langley, CAESAR, University of New South Wales, Kensington NSW 2052, Sydney, Australia
Isabella Wieczorek, Fraunhofer Institute for Experimental Software Engineering, Sauerwiesen 6, Kaiserslautern, Germany

ABSTRACT
Delivering a software product on time, within budget, and to an agreed level of quality is a critical concern for many software organizations. Underestimating software costs can have detrimental effects on the quality of the delivered software and thus on a company's business reputation and competitiveness. On the other hand, overestimating software cost can result in missed opportunities to invest funds in other projects. In response to industry demand, a myriad of estimation techniques has been proposed during the last three decades. In order to assess the suitability of a technique from a diverse selection, its performance and relative merits must be compared. The current study replicates a comprehensive comparison of common estimation techniques within different organizational contexts, using data from the European Space Agency. Our study is motivated by the challenge of assessing the feasibility of using multi-organization data to build cost models and the benefits gained from company-specific data collection. Using the European Space Agency data set, we investigated a previously unexplored application domain, including military and space projects. The results showed that traditional techniques, namely ordinary least-squares regression and analysis of variance, outperformed Analogy-based estimation and regression trees. Consistent with the results of the replicated study, no significant difference was found in accuracy between estimates derived from company-specific data and estimates derived from multi-organizational data.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. ICSE 2000, Limerick, Ireland ACM /00/06 $5.00

Keywords
Cost estimation, Classification and Regression Trees, Analogy, Analysis of Variance, Ordinary Least-Squares Regression, replication

1 INTRODUCTION
Accurate estimates of the cost of software projects are crucial for better project planning, monitoring, and control. Project managers usually stress the importance of improving estimation accuracy and of techniques that support better estimates. Over the last three decades, a variety of estimation techniques have been developed and investigated to provide improved estimates. Many studies have performed comparative evaluations of the prediction accuracy of different techniques (see Section 2). Despite intense research, however, only a few tangible conclusions can be drawn from existing results. Two of the main reasons for this are that (1) many studies rely on small and medium-size data sets, for which results tend to be inconclusive (lack of statistical power), and (2) studies tend to be incomplete, i.e., they consider small subsets of techniques for their evaluation. Moreover, evaluation studies are rarely replicated and are usually not reported in a form that allows the comparison of results. Hence, one can only have limited confidence in the conclusions drawn from such one-time studies. As in any other experimental or empirical field, replication is key to establishing the validity and generalizability of results.
With regard to software cost estimation, we want to determine the stability of the trends observed across data sets and the generalizability of results across application domains and organizations. Our paper's contribution is twofold: (1) it replicates and expands on a study we previously reported in the management information systems (MIS) domain [8], and (2) it provides a procedure to perform and present cost model evaluations and comparisons. The data set used in the current study is the European Space Agency (ESA) multi-organization software project database [17][24]. The projects come from a variety of European organizations, with applications from the aerospace, military, industrial, and business domains. We will carefully compare the results obtained here with the ones reported in [8] so that common trends are identified and differences can be investigated.

The purpose of both our studies is to address two fundamental issues in cost estimation. First, as mentioned above, we compare the prediction accuracy of the most commonly used cost estimation techniques using a large project cost data set. This is important because the decision to select an estimation technique depends heavily on its ability to provide accurate results. As a by-product, such studies help us identify the most influential cost factors and also lead to practical benefits in terms of identifying important data to collect. Second, we assess the feasibility of using multi-organizational data sets to build software cost estimation models as compared to company-specific data. Company-specific data are believed to provide a better basis for accurate estimates, but company-specific data collection proceeds at a slow pace, since most organizations complete only a limited number of projects every year. As statistically based models require a substantial amount of data, the time required to accumulate enough data for cost modeling purposes may be prohibitive. In addition, by the time enough data are available, older project technologies may be obsolete in the organization, resulting in models that are not fully representative of current practices. Multi-organizational databases, where organizations share project data collected in a consistent manner, can address these issues but have problems of their own. Collecting consistent data may turn out to be difficult and, because of differences in processes and practices, trends may differ significantly across organizations.
In this paper we focus on the question of whether multi-organizational project data, within consistent application domains (e.g., projects coming from ESA subcontractors), can yield cost models of comparable accuracy to models based on data coming from one company. The paper starts with a discussion of related work in Section 2, then presents our research method in Section 3 in order to enable the replication of our investigations on other data sets in a comparable form. The project cost database on which the paper is based is then described in Section 4, and we continue by describing the results of our research in Section 5. Section 6 stresses some further important points that can be concluded from our analysis results. Section 7 then summarizes our results and concludes the paper.

2 RELATED WORK
Cost estimation techniques have drawn upon a variety of fields, including statistics, machine learning, and knowledge acquisition. Given the diversity of estimation techniques, one is faced with the difficult exercise of determining which technique would be the best in given circumstances. In order to assess a technique's appropriateness, its underlying assumptions, strengths, and weaknesses have to be known and its performance must be assessed. In response to these challenges, many different studies comparing software cost estimation techniques have been performed in the last thirty years. Investigations aimed to (1) determine which technique has the greatest effort prediction accuracy [8][14][21][31], and (2) propose new or combined techniques that could provide better estimates [5][11][25][26]. Most of the previous comparisons have been restricted to small and medium-sized data sets and found that there is no superior technique for all data contexts. Few studies were replicated, and techniques were rarely investigated with data sets beyond those with which they were originally proposed.
Even when the same data set was used in different studies, the results were not always comparable because of different experimental designs. For a more detailed description of previous studies we refer the reader to [8]. In general, assessments of effort prediction accuracy revealed not only the strengths and weaknesses of estimation methods, but also the advantages and limitations of the data sets used. However, there is still little evidence to suggest which data set characteristics are best suited to which estimation techniques. The current study addresses the main drawbacks of many previous studies. It contributes by evaluating and comparing many of the common cost modeling techniques, using a relatively large and representative database (as shown below, the data set comes from a variety of organizations and application domains, and contains typical information about cost drivers), and by focusing on both company-specific and multi-organizational data contexts, which allows the relative utility of multi-organizational databases to be evaluated. This is done by applying a procedure consistent with a previous study that was based on data from business applications.

3 RESEARCH METHOD
Selecting Modeling Techniques
Our study considered a large subset of the many different techniques that have been proposed in the literature. These modeling techniques were selected according to the following criteria [8]:
Applied in Software Engineering: The rationale is that the technique should have an initial demonstrated utility in software engineering.
Automatable: Due to practical considerations, only techniques that can be substantially automated are considered.
Interpretable: The results of the estimation technique should provide cost engineers and managers with insights into the structures present in the data set.
Suitable Input Requirements: The input variables required for the application of a technique should be compatible with the data set used for the comparative study. The project attributes collected should be adequate and suitable for use as inputs.

Modeling techniques that fulfilled the criteria above are: Ordinary Least-Squares Regression (OLS) [23], stepwise Analysis of Variance for unbalanced data sets (stepwise ANOVA) [22], Regression Trees (CART) [4], and Analogy-based estimation [25]. We also considered combinations of CART with OLS regression and Analogy. These techniques were also applied in our previous study on the Laturi database [8]. In addition, we applied further variants of the above-mentioned techniques. They are described in more detail in the following paragraphs.

Ordinary Least-Squares Regression
We applied multivariate OLS regression analysis, fitting the data to a specified model. Using the J-test, the exponential functional form was formally shown to be the most plausible model specification [7][13]. Moreover, linear models revealed heteroscedasticity. A mixed stepwise process was followed to select variables having a significant impact on effort (α=0.05). Dummy variables [23] were created to deal with categorical, nominal-scaled variables. Ordinal-scaled variables were treated as if they were measured using an interval scale. This simplifies the analysis and has been shown to be a reasonable practice for the kind of scales used in the ESA questionnaire (see Spector [27] and Bohrnstedt et al. [3]).

Stepwise ANOVA
In line with our previous study, we applied a stepwise analysis of variance procedure that can analyze the variance of unbalanced data. We automated the procedure described by Kitchenham [22] using the Stata tool [29].
This procedure applies ANOVA to categorical variables and OLS regression to continuous variables in a stepwise manner. It alleviates problems with unbalanced data by focusing on residual analysis. The steps can be summarized as follows: (1) For each of the independent variables (in our case cost factors), ANOVA or OLS regression is applied, depending on the measurement level of the variable. (2) The most significant factor is identified. (3) Its effect is removed by taking the residuals as the new dependent variable values. (4) ANOVA or OLS regression is applied in turn to the remaining independent variables using the new dependent variable. (5) These steps are repeated until all significant factors are removed or there are insufficient degrees of freedom. The result of the procedure is an equation including the most significant factors found during each step. Confounding factors are also identified and removed when applying the procedure. These are factors found significant in one step but insignificant in the following iteration. We transformed the continuous variables by taking their natural logarithms. This was done to ensure a normally distributed dependent variable, which is an assumption of this procedure, and to account for heteroscedasticity. Replicating our previous study, we used effort as the dependent variable (referred to as ANOVA_e). In addition, we used productivity as a dependent variable, consistent with Kitchenham's study [22] (referred to as ANOVA_p).

Analogy
Analogy-based estimation involves the comparison of a new (target) project with completed (source) projects. The most similar projects are selected from a database as a basis to build an estimate. The major issues are selecting an appropriate similarity function, selecting relevant project attributes (in our case cost drivers), and deciding upon the number of similar projects to consider for estimation (analogues).
We used the case-based reasoning tool CBR-Works 4.0 beta [10] and implemented a similarity function identical to that proposed by Shepperd and Schofield [25] and used in the ANGEL tool [19]. This function is based on the normalized unweighted Euclidean distance. For each variable k out of the n variables included in the similarity measure, the distance δ(P_ik, P_jk) between the target project variable value P_ik and the source project variable value P_jk is defined as:

\[
\delta(P_{ik}, P_{jk}) =
\begin{cases}
\left( \dfrac{P_{ik} - P_{jk}}{\max_k - \min_k} \right)^{2}, & \text{if } k \text{ is continuous} \\
0, & \text{if } k \text{ is categorical and } P_{ik} = P_{jk} \\
1, & \text{if } k \text{ is categorical and } P_{ik} \neq P_{jk}
\end{cases}
\]

where max_k and min_k are the maximum and minimum values of variable k. The overall distance between the two projects is then defined as:

\[
\mathrm{distance}(P_i, P_j) = \sqrt{\frac{\sum_{k=1}^{n} \delta(P_{ik}, P_{jk})}{n}}
\]

To select the n variables used in the similarity measure, ANGEL determines the optimal combination of variables by implementing a comprehensive search. This, however, is inefficient, as reported in [8][25]. We therefore employed a procedure proposed by Finnie et al. [14]. For all categorical variables we applied a two-tailed t-test to determine variables that show a significant influence on productivity. We generated two levels for each variable by merging the variable's original 4 or 5 levels. To do so, we used a heuristic based on two criteria: (a) the proportions of observations in each level had to be balanced, and (b) levels whose means differed only marginally were merged. Predictions were based on the single most similar project. If more than one similar project (analogue) was retrieved due to equal similarity values, the average of these projects was used. This choice was justified since various studies [8][25] report no significant differences in accuracy when using different numbers of analogues. To estimate effort, we used productivity as the dependent variable and then derived the predicted effort from the predicted productivity and the system size [8]: the predicted effort is calculated by dividing the actual size (which has to be estimated during project planning) by the predicted productivity. We also tried two alternatives: the approach of Walkerden and Jeffery [31], which additionally uses system size in the similarity measure, and the approach of Shepperd and Schofield [25], which uses effort as the dependent variable without any size adjustment. As the results of the three approaches were almost the same, we only provide below the results from applying our approach. Detailed results can be found in the Appendix of [9].

CART
We generated regression trees based on the CART algorithm [4] using the CART tool [30]. CART examines the data and determines the data splitting criteria that will most successfully partition the dependent variable. To build a regression tree, the data set is recursively split until a stopping criterion is satisfied. All but the terminal nodes in a tree specify a condition based on one of the variables that have an influence on project effort or any other dependent variable. After a tree is generated, it may be used to predict the effort of a project.
To predict the effort for a project, a path is followed through the tree according to the project's specific variable values until a terminal node is reached. The mean or median value of the terminal node may then be used as the predicted value. Replicating our previous study [8], we used productivity as a dependent variable (referred to as CART_p). The stopping criterion was set to a minimum of five observations in each terminal node. Predictions were based on the median productivity in a terminal node. When using productivity, predicted effort is calculated by dividing the actual system size by the predicted productivity within the terminal node. In addition, we generated separate CART trees using effort as a dependent variable (referred to as CART_e) and system size as an additional independent variable, consistent with the studies by Kitchenham [22] and Srinivasan and Fisher [28].

Combinations of Techniques
As in our previous study, we applied two combinations of techniques to our data set. We combined CART trees with OLS regression using the tree structures produced from CART_p. For each terminal node in a tree, the observations belonging to that node were used to build a simple OLS regression model. We fitted an exponential relationship between effort and system size. At each terminal node, instead of the median productivity, an OLS equation was applied to predict effort. A project's effort is determined by following the tree to a terminal node and then applying the effort equation corresponding to this node. We also combined CART and our Analogy-based approach using the tree structures produced from CART_p. For each terminal node in a tree, the observations belonging to that node were used as the basis for similarity-based retrieval. A project's effort is determined by following the tree down to a terminal node and selecting the most similar project within the subset of projects that corresponds to this node.
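The similarity-based retrieval used by the Analogy approach, on its own or within a CART terminal node, can be illustrated with a minimal Python sketch. It assumes the normalized unweighted Euclidean distance described in the Analogy subsection, the averaging of tied analogues, and the derivation of effort as size divided by predicted productivity. The function names, the dictionary-based project representation, and the sample values are hypothetical and are not taken from CBR-Works or ANGEL.

```python
from math import sqrt


def variable_distance(a, b, value_range=None):
    """Per-variable distance. A (min, max) range marks a continuous variable;
    without a range the variable is treated as categorical."""
    if value_range is not None:
        lo, hi = value_range
        return ((a - b) / (hi - lo)) ** 2
    return 0.0 if a == b else 1.0


def project_distance(target, source, ranges):
    """Normalized unweighted Euclidean distance over the target's variables."""
    total = sum(
        variable_distance(target[k], source[k], ranges.get(k)) for k in target
    )
    return sqrt(total / len(target))


def predict_effort(target, size, sources, ranges):
    """Retrieve the closest source project(s) and return size / productivity.

    `sources` is a list of (attributes, productivity) pairs, where
    productivity is size per unit of effort. Ties are averaged.
    """
    scored = [(project_distance(target, attrs, ranges), prod)
              for attrs, prod in sources]
    best = min(d for d, _ in scored)
    analogues = [prod for d, prod in scored if d == best]
    predicted_productivity = sum(analogues) / len(analogues)
    return size / predicted_productivity


if __name__ == "__main__":
    # Illustrative values only; they are not from the ESA database.
    ranges = {"TEAM": (1, 21)}
    sources = [
        ({"ENV": "Space", "TEAM": 5}, 0.5),
        ({"ENV": "Military", "TEAM": 9}, 0.25),
    ]
    target = {"ENV": "Military", "TEAM": 10}
    print(predict_effort(target, 10.0, sources, ranges))
```

Because every variable carries the same weight, a source project that matches on all categorical factors is retrieved even when it differs strongly on a variable left out of the similarity measure, such as system size.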
Evaluation Criteria
A common evaluation criterion for cost estimation models was proposed by Conte et al. [11]. It is the Magnitude of Relative Error (MRE), defined as:

\[
\mathrm{MRE}_i = \frac{\left| \text{Actual Effort}_i - \text{Predicted Effort}_i \right|}{\text{Actual Effort}_i}
\]

The MRE value is calculated for each observation i whose effort is predicted. The aggregation of MRE over multiple observations, say N, can be achieved through the Mean MRE (MMRE). However, the MMRE is sensitive to individual predictions with excessively large MREs. Therefore, an aggregate measure less sensitive to extreme values should also be considered, namely the median of the MRE values for the N observations (MdMRE). Another criterion that is commonly used is the prediction at level l:

\[
\mathrm{PRED}(l) = \frac{k}{N}
\]

where k is the number of observations for which MRE is less than or equal to l. The criteria we used to assess and compare cost estimation models are the relative values of MMRE, MdMRE, and PRED(0.25) for the different techniques.

Cross Validation
If one constructs a cost estimation model using a particular data set, and then computes the accuracy of the model using the same data set, the accuracy evaluation will be optimistic [32]. The cross-validation approach we use involves dividing the whole data set into multiple train and test sets, calculating the accuracy for each test set, and then aggregating the accuracy across all the test sets.
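As an illustration of how these criteria are applied to a held-out test set, the following Python sketch fits a size-only log-linear OLS model on a training set and scores its predictions with the mean MRE, the median MRE (called MdMRE in the code), and PRED(0.25). The size-only model, the function names, and the sample numbers are assumptions made for this example; the models in the study include further cost drivers and were built with different tools.

```python
from math import exp, log
from statistics import mean, median


def fit_loglog_ols(sizes, efforts):
    """Least-squares fit of ln(effort) = ln(a) + b * ln(size); returns (a, b)."""
    xs = [log(s) for s in sizes]
    ys = [log(e) for e in efforts]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return exp(y_bar - slope * x_bar), slope


def predict_from_size(a, b, sizes):
    """Back-transformed prediction: effort = a * size ** b."""
    return [a * s ** b for s in sizes]


def accuracy(actual, predicted, level=0.25):
    """MMRE, MdMRE, and PRED(level) over paired actual/predicted efforts."""
    mres = [abs(act - pred) / act for act, pred in zip(actual, predicted)]
    within = sum(1 for m in mres if m <= level)
    return {"MMRE": mean(mres), "MdMRE": median(mres), "PRED": within / len(mres)}


if __name__ == "__main__":
    # Illustrative numbers only; they are not from the ESA database.
    train_size, train_effort = [10.0, 25.0, 40.0, 90.0], [30.0, 80.0, 150.0, 400.0]
    test_size, test_effort = [15.0, 60.0], [50.0, 210.0]
    a, b = fit_loglog_ols(train_size, train_effort)
    print(accuracy(test_effort, predict_from_size(a, b, test_size)))
```

In a cross-validation setting, `accuracy` would be computed once per test set and the per-set results then aggregated.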

In our previous study and in the current one, we investigate the performance of estimation techniques when models are built using company-specific data (local models) and when they are built using multi-organizational data (generic models). To determine the accuracy of generic cost models, we partitioned the data set by organization. We limited this to organizations for which there are 8 or more projects in our data set (60 projects in total). These 60 projects were provided by 4 companies. We refer to them as company 1 to company 4 (see also [9]). The remaining 100 projects came from 65 different companies, and we considered them as one additional partition. The projects from each of the four companies were used in turn as holdout samples. Accordingly, for each company, we used the whole data set minus the projects of the selected holdout company as a training set. This resulted in 4 different training-test set combinations. Calculating accuracy in this manner indicates the accuracy of using an external multi-organizational data set for building a cost estimation model and then testing it on an organization's projects. To determine the accuracy of deriving local cost estimation models, we used the 29 projects coming from company 1. Here, we used a three-fold cross-validation approach. We randomly generated 3 test sets, and for each test set we used the remaining projects as the training set. The overall accuracy is aggregated across the three test sets. Calculating accuracy in this manner indicates the accuracy to be expected if an organization builds a model using its own data set and then uses that model to predict the cost of new projects.

4 DATABASE DESCRIPTION
The database used in this study is the European Space Agency (ESA) multi-organization software project database [17]. Since 1988, the ESA has continuously collected historical project data on cost and productivity from different application domains.
The data come from European organizations, with applications from the aerospace, military, industrial, and business environments. Each data supplier is contacted on a regular basis to determine if projects are nearing completion. Once a project questionnaire is filled out, each data supplier is contacted to ensure the validity and comparability of the responses. Each data supplier regularly receives data analysis reports of the data set. At the time of our analysis, the database consisted of 166 projects. The breakdown of projects by environment was: 34% space, 31% military, 21% industry, 9% business, 5% other environments. The projects originated from 69 different organizations in 10 different European countries. Four companies provided data for at least 8 projects. Company 1 provided 29 projects, company 2 provided 15 projects, and companies 3 and 4 provided 8 projects each. We can see from this description that we have a varied set of projects, which we think is representative of European industry practice. This is important if we want the results of our study to be generalizable to a large population of the software industry. System size is measured in KLOC. Because the data set contains multiple programming languages, we adjusted each project's KLOC value to obtain a comparable size measure. To calculate the adjusted KLOC for each project we used the language levels proposed by Jones [20]. Six projects were excluded from the analysis because these projects were enhancements rather than new projects. Missing data is a common attribute of software engineering data [5][16]. In our case, only 90 projects had values for all the variables used in the current analysis. We will explain how missing data were handled when discussing our results.

Variable | Description | Scale | Values / Range / Unit
ENV | The domain the project was developed for | nominal | Space, Military, Business, Industry
Adj_KLOC | Adjusted new developed KLOC | ratio | 1 KLOC = 1000 LOC
EFFORT | Effort for SW project | ratio | Person hours, where 144 person hours = 1 person month
TEAM | Maximal team size at any time during a project | ratio |
VIRT | Virtual machine volatility | ordinal | 2-5 (low-very high)
RELY | Required reliability | ordinal | 1-5 (very low-very high)
TIME | Execution time constraints | ordinal | 2-6 (low-extra high)
STOR | Main storage constraint | ordinal | 2-6 (low-extra high)
MODP | Use of modern programming practices | ordinal | 1-5 (very low-very high)
TOOL | Use of software tools | ordinal | 2-5 (low-very high)
LEXP | Programming language experience | ordinal | 1-5 (very low-very high)
Table 1: Variables from the ESA Database

The variables that are taken into account in our analysis are listed in Table 1. These are variables that may potentially have an impact on project cost and include system size plus 7 of the COCOMO cost drivers [1].

Descriptive Statistics
Table 2 summarizes the descriptive statistics for system size (in adjusted KLOC: Adj_KLOC) and project effort (in person-months: PM). The table shows results for the whole database and for the company that provided 29 projects. On average, projects from this company have a higher effort than projects in the remainder of the database. On the other hand, system size is lower in this company than in the remainder of the database. Company 1 therefore shows a lower productivity than the remainder of the data set. This may be due to the fact that 28 out of 29 projects are military applications that have high reliability requirements and time constraints.

Table 2: Descriptive Statistics for system size and effort
(Columns: Company 1 Size (Adj_KLOC), Company 1 Effort (PM), Whole DB Size (Adj_KLOC), Whole DB Effort (PM). Rows: min, mean, max, st.dev, obs. The cell values are missing in the source.)

Table 3 summarizes descriptive statistics for team size. On average, the maximal team size at any stage in a project is higher within company 1 than for the remainder of the database.

Table 3: Descriptive Statistics for team size
(Columns: Company 1 Teamsize (TEAM), Whole DB Teamsize (TEAM). Rows: min, mean, max, st.dev, obs. Only the min row is present in the source: 2 for Company 1 and 1 for the whole database.)

5 DATA ANALYSIS RESULTS
This section reports the results of comparing all the modeling techniques briefly described in Section 3. Table 4, Table 5, and Table 6 report, respectively, average results for the whole database, the company 1 data set, and the company 1 data set when using the remainder of the data set as a training sample for model building. Before delving into the details of the result tables, we need to discuss a few technical issues regarding our comparison procedure. Some estimation techniques, such as Analogy, are able to provide effort predictions even when there is missing data for a project.
Other techniques, such as OLS regression, cannot cope with missing data related to the parameters used in their specifications. To ensure a valid comparison of all the applied techniques, we selected a subset of projects from our holdout samples for which there was no missing data. For a set of 39 projects (a subset of the holdout samples: companies 1, 2, 3, and 4), all techniques were able to provide predictions within the multi-organizational context. Within the company-specific context (company 1), all techniques could predict effort for 25 projects. In this paper, we only provide the results reporting the median MRE values for the sake of brevity. It is important to note, however, that the main trends in the results are supported when looking at other goodness-of-fit measures such as MMRE or PRED(0.25) (see Appendix of [9]). Having refined our tool suite, we were able to assess a few additional modeling variants as compared to our previous study. Since no Laturi results exist for these additional techniques, they are marked with a * in the third column of Table 4, Table 5, and Table 6. Significant differences in accuracy across the techniques are indicated through differently shaded cells in the tables. Dark gray rows indicate the group of techniques that performed poorly and within which no significant difference can be identified. Next are the light gray rows, which represent a second group of techniques, performing significantly better than the first group but worse than the third group (white rows), which contains all the techniques performing best. The statistical significance is tested using the matched-pair Wilcoxon signed rank test [15], a nonparametric analogue to the t-test.

Comparison of Modeling Techniques
Table 4 presents the average results obtained when using cross validation, for all models, across the whole ESA project database (for detailed results from the application on each of the four companies see [9]).
Table 5 presents comparable results for company 1 (for detailed results from the 3-fold cross validation see [9]). We also systematically compare the results with the ones from our former study on the Laturi data set (third column in Tables 4 and 5) [8]. Considering the ESA database, OLS regression and ANOVA_e (using effort as a dependent variable) perform significantly better than the other techniques. This can be observed for the multi-organizational context (Table 4, second column) as well as for the company-specific context (Table 5, second column). The technique proposed by Kitchenham [22] seems to perform as well as simple OLS regression. Considering that its automation is significantly more complex, it is doubtful that any benefit can be gained by using stepwise ANOVA on a data set similar to the one we use here. So, despite the numerous claims in the literature, it seems that simply using traditional OLS regression provides the best results overall, though there is still room for improvement. Within the multi-organizational context, ANOVA_p (light gray cell, second column in Table 4) was significantly different from and slightly better than CART, Analogy, and all combinations of techniques (dark gray cells). The main reason for the different accuracy levels could be the different assumptions made by the techniques about the size-effort relationship. The two most accurate techniques (i.e., ANOVA_e and OLS) modeled effort as opposed to productivity. They are both based on log-transformations of some cost-driver variables and therefore best accommodated the exponential effort-size relationship that seems to underlie our data set, as it does most large software project cost data sets. This relationship was poorly modeled by ANOVA_p, Analogy, CART_p, and the CART_p combinations. These techniques assume effort is proportional to size (other cost drivers being constant) when using productivity as a dependent variable. CART_e considered system size as an independent variable but did not adjust effort to account for any relationship between effort and size. Therefore, in line with the widely supported view that system size is the most important cost driver [1][2], estimation techniques that could account for non-linear size-effort relationships provided more accurate predictions. Confirming the results from our previous study [8], Analogy does not seem to bring any significant advantage over other modeling techniques. This contradicts some of the most recent work in this domain [25][31][14]. We observed extreme outlier predictions with Analogy-based estimation. Having a closer look at those observations, we found that the retrieved analogue matched perfectly with respect to the COCOMO factors, but differed greatly in terms of system size.
Consequently, we think the poor performance of Analogy in our study can be attributed to the equal weighting of variables in the similarity measure. This illustrates the importance of defining appropriate similarity measures. Another result is that combining techniques did not lead to any significantly improved predictions. Considering that these results are consistent across two distinct, large data sets, the software cost engineering community should start questioning whether looking for yet other modeling technique variants is likely to yield any practically significant benefit in terms of prediction accuracy. This has, however, been a major activity in recent years, consuming most of the research effort in software cost estimation. Another result, which has been observed numerous times in previous studies, is that even the best models are not very accurate (32% median MRE), although better than most reported results.

Techniques | Median MRE (ESA) | Median MRE (Laturi)
OLS | |
ANOVA_e | |
ANOVA_p | 0.62 | *
CART_p + OLS | |
CART_e | 0.70 | *
Analogy | |
CART_p | |
CART_p + Analogy | |
Number of predicted projects | |
Table 4: Average results over all hold out samples using the whole database
(Cells left empty are missing in the source.)

Table 5 reports results within one company, i.e., using more homogeneous but smaller data sets, for both the ESA and Laturi study. Three accuracy levels were identified using the ESA data. Again, OLS regression and ANOVA_e appeared to perform significantly better than all the other techniques, and CART_e (light gray cell, second column within Table 5) shows an intermediate prediction accuracy. It can be observed that the techniques with relatively high and medium prediction accuracy used effort as a dependent variable. All the techniques that considered productivity as a dependent variable performed significantly worse (dark gray cells, second column within Table 5).
This is likely because using productivity as a dependent variable, instead of using size as an independent variable and modeling its impact on effort, has implications in terms of the model's assumptions: it implies that effort and size are proportional.

Techniques                     (ESA)     (Laturi)
ANOVA_e
OLS
CART_e                         0.41 *
CART_p + OLS
Analogy
CART_p
ANOVA_p                        0.73 *
CART_p + Analogy
Number of predicted projects

Table 5: Average results over all hold out samples using company-specific data

The results from the Laturi study showed no significant differences among all the techniques applied within the one-company context. Such differences are to be expected across data sets, especially small ones like most one-company data sets, as the accuracy of modeling techniques is sensitive to the underlying structure in the data, e.g., linearity of relationships, outliers.

Comparison of local versus generic Models
The ESA results from Table 6 (second column) are obtained using the whole database minus company 1 projects for model building and applying the models to company 1 only. This corresponds to the situation where company 1 has only external data available to build a cost estimation model. We compare these results with Table 5 (second column), which corresponds to the situation where a company has its own local data available to build cost estimation models. We note that the median MRE values have not significantly improved by going to a one-company data set, contrary to what could have been expected. The largest difference in median MRE values (28 percentage points) can be observed for CART_e (0.41 vs. 0.69). This is again consistent with results from our previous study on the Laturi data set (Table 5 third column vs. Table 6 third column). In general, it appears that there is no obvious advantage to building local cost models based on company-specific data when generic cost drivers are collected. It is possible that the advantage of using a more homogeneous data set is offset by a significantly smaller project sample available to build the models.

Techniques                     (ESA)     (Laturi)
ANOVA_e
OLS
CART_e                         0.69 *
CART_p + OLS
Analogy
CART_p
ANOVA_p                        0.69 *
CART_p + Analogy
Number of predicted projects

Table 6: Results on the one company data using the remainder of the whole database

Comparison with our Previous Study
Consistent with the findings of the previous study, the current study found no significant advantage in using local, company-specific data to build estimates over using external, multi-organizational databases. From a general perspective, the results from both studies may have serious implications for the way we collect cost data and build data-driven cost models. To really benefit from collecting company-specific cost data, one should not just automatically collect generic, COCOMO-like factors, but investigate the important factors in the organization to be considered and design a tailored, specific measurement program [6].
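The local-versus-generic comparison above is expressed in terms of median MRE. As a minimal sketch, assuming the usual definitions MRE = |actual - predicted| / actual and pred(l) = fraction of projects with MRE <= l, the accuracy summaries could be computed as follows. The effort figures and the names `mre` and `summarize` are hypothetical and serve only to illustrate the calculation; they are not results from either study.

```python
from statistics import mean, median


def mre(actual, predicted):
    """Magnitude of relative error for a single project."""
    return abs(actual - predicted) / actual


def summarize(actuals, predictions, level=0.25):
    """Median MRE, mean MRE and pred(level) over a set of predicted projects."""
    mres = [mre(a, p) for a, p in zip(actuals, predictions)]
    within = sum(1 for m in mres if m <= level)
    return {
        "median_mre": median(mres),
        "mean_mre": mean(mres),
        "pred": within / len(mres),
    }


# Hypothetical effort values for four projects of one company.
actual = [100.0, 200.0, 400.0, 800.0]
local_pred = [90.0, 260.0, 300.0, 1200.0]     # model built on company data only
generic_pred = [120.0, 150.0, 520.0, 700.0]   # model built on the other companies

local = summarize(actual, local_pred)
generic = summarize(actual, generic_pred)
print("local:  ", local)
print("generic:", generic)
```

The median is less sensitive than the mean to a few extreme relative errors, which matters for the small hold-out samples typical of one-company data sets.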
One difference between the two studies is that CART_p performed significantly better in the Laturi study, where it appeared comparably accurate to OLS regression and ANOVA_e. In the current study, however, CART_p performed relatively poorly, whether or not it was combined with other techniques. Depending on the structures present in the data, we have to expect variations in performance across data sets. For example, the trees generated with the Laturi data were more balanced than with the ESA database, i.e., more comparable proportions of the data set could be found along the different branches. Therefore, with the ESA data set, CART predictions were sometimes based on relatively small samples in the terminal nodes.

From the results tables above, we can observe from the median MRE values that the best models are, in most cases, more accurate on the ESA database. This is particularly true for OLS regression and ANOVA_e (using effort as a dependent variable). This may be explained in part by the fact that most projects come from ESA contractors, who have to follow a consistent high-level development process. More process consistency may lead to more accurate models. These results are confirmed when looking at the mean MRE or pred(0.25) (see Appendix of [9]).

On the other hand, the differences in accuracy among the techniques are higher for the ESA data set than for the Laturi data. In the multi-organizational context (Table 4), for the Laturi and ESA data sets, we observe two and three different accuracy levels, respectively. Within one company (Table 5), there were no significant differences among the techniques in the Laturi study, but three accuracy levels in the current study.

Another important result is that none of the strategies for combining modeling techniques has been very successful.
Modeling techniques like CART and OLS are inherently complementary in the sense that they can potentially capture different types of structures in the data, e.g., local structures and interactions for CART versus global, linear or log-linear structures for OLS regression. Research into procedures for combining modeling techniques might bring greater benefits.

Selected Model Variables
We summarize the independent variables selected for the different model types in order to gain some insight into the importance of those variables regarding cost prediction.

Technique       Selected variables
OLS             Adj_KLOC, TEAM, STOR, VIRT
ANOVA_e         TEAM, Adj_KLOC, STOR
ANOVA_p         TEAM, STOR, LEXP, TOOL
CART_e          TEAM, Adj_KLOC
CART_p          TEAM, TOOL, ENV, STOR, LEXP, RELY, MODP, VIRT, TIME
Analogy ²       TEAM, RELY, TIME, STOR, MODP, TOOL, LEXP, ENV
CART_p + OLS    Adj_KLOC, TEAM, TOOL, ENV, STOR, LEXP, RELY, MODP, VIRT, TIME

Table 7: Selected model variables

² For all Analogy variants and all combinations of Analogy with CART_p, all variables have been selected the same number of times during cross validation.

Table 7 provides, for each technique, the variables in decreasing order of importance when using the whole database for model building. This ordering is based on the frequency of variable selection during the cross validation process (except for Analogy). It is clear from the results that the two main, overriding trends are the effect of system size and maximum team size on effort. STOR (storage constraints) also seems to play an important role.

6 FURTHER CONSIDERATIONS
We have seen that ANOVA_e and OLS regression are better, across data sets, in terms of their predictive capability using cross validation. It would also be interesting to investigate whether any other criterion might justify using less accurate modeling techniques.

What comes in favor of OLS regression is the level of maturity of the statistical packages that implement it, and the abundance of literature and experience on the subject. Stepwise ANOVA is not as well documented and automated; the procedure has to be implemented using the programming capabilities of a statistical package. Both stepwise ANOVA and OLS regression require a significant statistical background to be used and interpreted. This is not the case for CART, which is easier to interpret and use for model building, and which is supported by a relatively mature commercial tool. Therefore, despite the results shown above, in a situation where a balanced, thorough CART tree is generated, and in a context where interacting with domain experts is crucial for the interpretation of the cost model, CART may be a better alternative than OLS regression.

Analogy, a priori, is simple to use because it does not require any data modeling. However, this is misleading, as the complexity of Analogy lies in the definition of the similarity and adaptation functions with which analogues are selected and predictions are adjusted. There is no straightforward, well-understood way of defining similarity. Moreover, little has been published about how to appropriately define adaptation rules. Therefore, the use of Analogy would be warranted in a context where not enough data is available for OLS regression or CART model building, and experts can help define similarity.
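The sensitivity of Analogy to the similarity definition can be illustrated with a small sketch. It uses a weighted Euclidean distance as an illustrative stand-in for a similarity measure, takes the effort of the closest case as the prediction (no adaptation), and assumes all features are already scaled to [0, 1]. The two cases, the target project, the weights, and the function names are hypothetical and do not reproduce the similarity measure or tool used in the study.

```python
import math

# Hypothetical past projects; feature values are assumed to be scaled to [0, 1].
CASES = [
    {"name": "A", "size": 0.10, "rely": 0.5, "time": 0.5, "stor": 0.5, "effort": 20.0},
    {"name": "B", "size": 0.85, "rely": 1.0, "time": 0.0, "stor": 1.0, "effort": 400.0},
]

# Hypothetical project to be estimated: large, with average cost-factor ratings.
TARGET = {"size": 0.90, "rely": 0.5, "time": 0.5, "stor": 0.5}


def distance(a, b, weights):
    """Weighted Euclidean distance over the features named in `weights`."""
    return math.sqrt(sum(w * (a[f] - b[f]) ** 2 for f, w in weights.items()))


def closest_analogue(target, cases, weights):
    """Return the stored case with the smallest distance to the target."""
    return min(cases, key=lambda case: distance(target, case, weights))


# Equal weighting: every feature counts the same.
equal = {"size": 1.0, "rely": 1.0, "time": 1.0, "stor": 1.0}
# Size-heavy weighting: an assumed choice that emphasises system size.
size_heavy = {"size": 5.0, "rely": 1.0, "time": 1.0, "stor": 1.0}

pick_equal = closest_analogue(TARGET, CASES, equal)
pick_weighted = closest_analogue(TARGET, CASES, size_heavy)

print("equal weights ->", pick_equal["name"], pick_equal["effort"])
print("size-heavy    ->", pick_weighted["name"], pick_weighted["effort"])
```

With equal weights the sketch retrieves the case that matches on every cost factor but is far smaller than the target; giving size more weight retrieves the case of comparable size instead. Choosing such weights is exactly the part that requires expert input.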
7 CONCLUSIONS
The contribution of this paper is mainly in assessing and comparing the predictive accuracy of a comprehensive set of cost modeling techniques on representative cost data sets, following a rigorous and repeatable procedure. In addition, by comparing our results to a previous study whose objective was similar, we could better understand what results seem to generalize across data sets.

The main conclusions from our studies are of high significance to the field of software cost modeling. A great deal of effort has been spent over the last 20 years on devising and comparing ways of modeling software development data to achieve higher accuracy. However, our results, which are based on two relatively large (by software engineering standards), representative data sets, show that ordinary least-squares regression is probably sufficient to get the most out of your data and help predict development effort. If other modeling techniques among the ones we investigate here are to be used, this should probably not be motivated by an attempt to obtain better accuracy. As discussed above, other reasons may justify it.

As far as we know, our research initiative is the first to undertake a wide-scale comparison of cost models on relatively large data sets. Our results have practical implications, as they suggest that a great deal of research in the area of cost modeling has not produced practically significant improvements. Research in software cost engineering should perhaps put less emphasis on investigating ways of achieving better predictive accuracy through new data modeling procedures, and instead redirect that effort to other questions that call for more investigation, such as subjective effort estimation [18], modeling based on expert knowledge elicitation [6], and techniques combining expert opinion and project data [12].

ACKNOWLEDGEMENTS
We would like to thank the European Space Agency and INSEAD for giving us access to the ESA data.
In particular, we are grateful to Benjamin Schreiber from the ESA/ESTEC center in the Netherlands for making this analysis possible. The ESA database [17] is available to any organization that contributes project data to the database.

REFERENCES
1. Boehm, B. Software Engineering Economics. Englewood Cliffs, NJ: Prentice Hall (1981).
2. Boehm, B., Clark, B., Horowitz, E., Westland, C. Cost models for future software life cycle processes: COCOMO 2.0. Annals of Software Engineering, 1 (1995).
3. Bohrnstedt, G., Carter, T. Robustness in Regression Analysis. In: Costner, H. (ed). Chapter 5, Sociological Methodology. Jossey-Bass (1971).
4. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books & Software (1984).
5. Briand, L.C., Basili, V.R., Thomas, W.M. A pattern recognition approach for software engineering data analysis. IEEE Transactions on Software Engineering, 18, 11 (1992).
6. Briand, L.C., El Emam, K., Bomarius, F. A Hybrid Method for Software Cost Estimation, Benchmarking, and Risk Assessment. Proceedings of the 20th International Conference on Software Engineering, ICSE-20 (April 1998).
7. Briand, L.C., El Emam, K., Wieczorek, I. Explaining the Cost of European Space and Military Projects. In: Proceedings of the 21st International Conference on Software Engineering, ICSE 99, (Los Angeles, USA 1999).
8. Briand, L.C., El Emam, K., Maxwell, K., Surmann, D., Wieczorek, I. An Assessment and Comparison of Common Software Cost Estimation Models. In: Proceedings of the 21st International Conference on Software Engineering, ICSE 99, (Los Angeles, USA 1998).
9. Briand, L.C., Langley, T., Wieczorek, I. Using the European Space Agency Database: A replicated Assessment of Common Software Cost Estimation Techniques. Technical Report, ISERN TR-99-15, International Software Engineering Research Network (1999).
10. CBR-Works, 4.0 beta. Research group Artificial Intelligence - Knowledge-Based Systems, University of Kaiserslautern. <http://wwwagr.informatik.unikl.de/~lsa/cbratukl.html>
11. Conte, S.D., Dunsmore, H.E., Shen, V.Y. Software engineering metrics and models. The Benjamin/Cummings Publishing Company, Inc. (1986).
12. Chulani, S., Boehm, B., Steece, B. Bayesian Analysis of Empirical Software Engineering Cost Models. IEEE Transactions on Software Engineering, 25, 4 (1999).
13. Davidson, R., McKinnon, J. Several Tests for Model Specification in the Presence of Alternative Hypotheses. Econometrica, 49, 3 (1981).
14. Finnie, G.R., Wittig, G.E. A comparison of software effort estimation techniques: using function points with neural networks, case based reasoning and regression models. J. Systems Software, 39 (1997).
15. Gibbons, J.D. S. Nonparametric Statistics. Series: Quantitative Application in the Social Sciences 90, SAGE University Paper (1993).
16. Gray, A., MacDonell, D. A comparison of techniques for developing predictive models of software metrics. Information and Software Technology, 39 (1997).
17. Greves, D., Schreiber, B. The ESA Initiative for Software Productivity Benchmarking and Effort Estimation. ESA Bulletin, 87 (August 1996). <http://esapub.esrin.esa.it/bulletin/bulett87/greves87.htm>
18. Höst, M., Wohlin, C. An Experimental Study of Individual Subjective Effort Estimation and Combinations of the Estimates. In: Proceedings of the 20th International Conference on Software Engineering, ICSE 98, (Japan, 1998).
19. <http://dec.bournemouth.ac.uk/dec_ind/decind22/web/Angel.html>.
20. Jones, C. Applied Software Measurement: Assuring Productivity and Quality, (2nd Ed.), McGraw-Hill, NY, USA, (1996).
21. Jørgensen, M. Experience with the accuracy of Software Maintenance Task Effort Prediction Models. IEEE Transactions on Software Engineering, 21, 8 (August, 1995).
22. Kitchenham, B. A. Procedure for Analyzing Unbalanced Data Sets, IEEE Transactions on Software Engineering, 24, 4 (April 1998).
23. Schroeder, L., Sjoquist, D., Stephan, P. Understanding Regression Analysis: An Introductory Guide. No. 57 in Series: Quantitative Applications in the Social Sciences, Sage Publications, Newbury Park CA, USA, (1986).
24. Maxwell, K., Van Wassenhove, L., Dutta, S. Software Development Productivity of European Space, Military and Industrial Applications. IEEE Transactions on Software Engineering, 22, 10 (1996).
25. Shepperd, M., Schofield, C. Estimating software project effort using analogies. IEEE Transactions on Software Engineering, 23, 12 (November 1997).
26. Stensrud, E., Myrtveit, I. Human Performance Estimation with Analogy and Regression Models. In: Proceedings of the METRICS 98 Symposium, (1998).
27. Spector, P. Ratings of Equal and unequal Response Choice Intervals. The Journal of Social Psychology, 112 (1980).
28. Srinivasan, K., Fisher, D. Machine learning approaches to estimating software development effort. IEEE Transactions on Software Engineering, 21, 2 (February 1995).
29. StataCorp, Stata Statistical Software: Release 5.0. Stata Corporation, College Station, (Texas 1997). <http://www.stata.com>
30. Steinberg, D., Colla, P. CART, Classification and Regression Trees, Tree Structured Non-parametric Data Analysis, Interface Documentation. Salford Systems (1995). <http://www.salfordsystems.com/index.html>
31. Walkerden, F., Jeffery, R. An Empirical Study of Analogy-based Software Effort Estimation. Empirical Software Engineering, 4, 2, (June 1999).
32. Weiss, S., Kulikowski, C. Computer Systems that Learn. Morgan Kaufmann Publishers, Inc. San Francisco, CA, (1991).

A replicated Assessment and Comparison of Common Software Cost Modeling Techniques

A replicated Assessment and Comparison of Common Software Cost Modeling Techniques A replicated Assessment and Comparison of Common Software Cost Modeling Techniques Lionel Briand Tristen Langley Isabella Wieczorek Carleton University CAESAR, University of Fraunhofer Institute for Systems

More information

An Assessment and Comparison of Common Software Cost Estimation Modeling Techniques

An Assessment and Comparison of Common Software Cost Estimation Modeling Techniques An Assessment and Comparison of Common Software Cost Estimation Modeling Techniques Lionel C. Briand, Khaled El Emam Dagmar Surmann, Isabella Wieczorek Fraunhofer Institute for Experimental Solhare Engineering

More information

Multinomial Logistic Regression Applied on Software Productivity Prediction

Multinomial Logistic Regression Applied on Software Productivity Prediction Multinomial Logistic Regression Applied on Software Productivity Prediction Panagiotis Sentas, Lefteris Angelis, Ioannis Stamelos Department of Informatics, Aristotle University 54124 Thessaloniki, Greece

More information

Resource Estimation in Software Engineering

Resource Estimation in Software Engineering Resource Estimation in Software Engineering Lionel C. Briand Carleton University Systems and Computer Engineering Dept. Ottawa, ON K1S 5B6 Canada briand@sce.carleton.ca Isabella Wieczorek Fraunhofer Institute

More information

Resource Estimation in Software Engineering 1

Resource Estimation in Software Engineering 1 Resource Estimation in Software Engineering 1 Lionel C. Briand and Isabella Wieczorek 1 Introduction This paper presents a comprehensive overview of the state of the art in software resource estimation.

More information

Cost Estimation for Web Applications

Cost Estimation for Web Applications Melanie Ruhe 1 Siemens AG, Corporate Technology, Software Engineering 3 80730 Munich, Germany melanie.ruhe@siemens.com Cost Estimation for Web Applications Ross Jeffery University of New South Wales School

More information

Building Software Cost Estimation Models using Homogenous Data

Building Software Cost Estimation Models using Homogenous Data First International Symposium on Empirical Software Engineering and Measurement Building Software Cost Estimation Models using Homogenous Data Rahul Premraj Thomas Zimmermann Saarland University Saarbrücken,

More information

Software Development Cost and Time Forecasting Using a High Performance Artificial Neural Network Model

Software Development Cost and Time Forecasting Using a High Performance Artificial Neural Network Model Software Development Cost and Time Forecasting Using a High Performance Artificial Neural Network Model Iman Attarzadeh and Siew Hock Ow Department of Software Engineering Faculty of Computer Science &

More information

Efficiency in Software Development Projects

Efficiency in Software Development Projects Efficiency in Software Development Projects Aneesh Chinubhai Dharmsinh Desai University aneeshchinubhai@gmail.com Abstract A number of different factors are thought to influence the efficiency of the software

More information

Effort Estimation: How Valuable is it for a Web Company to Use a Cross-company Data Set, Compared to Using Its Own Single-company Data Set?

Effort Estimation: How Valuable is it for a Web Company to Use a Cross-company Data Set, Compared to Using Its Own Single-company Data Set? Effort Estimation: How Valuable is it for a Web Company to Use a Cross-company Data Set, Compared to Using Its Own Single-company Data Set? Emilia Mendes The University of Auckland Private Bag 92019 Auckland,

More information

A Comparison of Calibrated Equations for Software Development Effort Estimation

A Comparison of Calibrated Equations for Software Development Effort Estimation A Comparison of Calibrated Equations for Software Development Effort Estimation Cuauhtemoc Lopez Martin Edgardo Felipe Riveron Agustin Gutierrez Tornes 3,, 3 Center for Computing Research, National Polytechnic

More information

Measurement Information Model

Measurement Information Model mcgarry02.qxd 9/7/01 1:27 PM Page 13 2 Information Model This chapter describes one of the fundamental measurement concepts of Practical Software, the Information Model. The Information Model provides

More information

A HYBRID INTELLIGENT MODEL FOR SOFTWARE COST ESTIMATION

A HYBRID INTELLIGENT MODEL FOR SOFTWARE COST ESTIMATION Journal of Computer Science, 9(11):1506-1513, 2013, doi:10.3844/ajbb.2013.1506-1513 A HYBRID INTELLIGENT MODEL FOR SOFTWARE COST ESTIMATION Wei Lin Du 1, Luiz Fernando Capretz 2, Ali Bou Nassif 2, Danny

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

PREDICTING THE COST ESTIMATION OF SOFTWARE PROJECTS USING CASE-BASED REASONING

PREDICTING THE COST ESTIMATION OF SOFTWARE PROJECTS USING CASE-BASED REASONING PREDICTING THE COST ESTIMATION OF SOFTWARE PROJECTS USING CASE-BASED REASONING Hassan Y. A. Abu Tair Department of Computer Science College of Computer and Information Sciences King Saud University habutair@gmail.com

More information

A Fuzzy Decision Tree to Estimate Development Effort for Web Applications

A Fuzzy Decision Tree to Estimate Development Effort for Web Applications A Fuzzy Decision Tree to Estimate Development Effort for Web Applications Ali Idri Department of Software Engineering ENSIAS, Mohammed Vth Souissi University BP. 713, Madinat Al Irfane, Rabat, Morocco

More information

Deducing software process improvement areas from a COCOMO II-based productivity measurement

Deducing software process improvement areas from a COCOMO II-based productivity measurement Deducing software process improvement areas from a COCOMO II-based productivity measurement Lotte De Rore, Monique Snoeck, Geert Poels, Guido Dedene Abstract At the SMEF2006 conference, we presented our

More information

Investigating effort prediction of web-based applications using CBR on the ISBSG dataset

Investigating effort prediction of web-based applications using CBR on the ISBSG dataset Investigating prediction of web-based applications using CBR on the ISBSG dataset Sukumar Letchmunan Marc Roper Murray Wood Dept. Computer and Information Sciences University of Strathclyde Glasgow, U.K.

More information

Factors Influencing Software Development Productivity - State of the Art and Industrial Experiences

Factors Influencing Software Development Productivity - State of the Art and Industrial Experiences Factors Influencing Software Development Productivity - State of the Art and Industrial Experiences Adam Trendowicz, Jürgen Münch Fraunhofer Institute for Experimental Software Engineering Fraunhofer-Platz

More information

METHODS OF EFFORT ESTIMATION IN SOFTWARE ENGINEERING

METHODS OF EFFORT ESTIMATION IN SOFTWARE ENGINEERING I International Symposium Engineering Management And Competitiveness 2011 (EMC2011) June 24-25, 2011, Zrenjanin, Serbia METHODS OF EFFORT ESTIMATION IN SOFTWARE ENGINEERING Jovan Živadinović, Ph.D * High

More information

Domain Analysis for the Reuse of Software Development Experiences 1

Domain Analysis for the Reuse of Software Development Experiences 1 Domain Analysis for the Reuse of Software Development Experiences 1 V. R. Basili*, L. C. Briand**, W. M. Thomas* * Department of Computer Science University of Maryland College Park, MD, 20742 USA ** CRIM

More information

Software Cost Estimation with Incomplete Data

Software Cost Estimation with Incomplete Data 890 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 27, NO. 10, OCTOBER 2001 Software Cost Estimation with Incomplete Data Kevin Strike, Khaled El Emam, and Nazim Madhavji AbstractÐThe construction of

More information

James E. Bartlett, II is Assistant Professor, Department of Business Education and Office Administration, Ball State University, Muncie, Indiana.

James E. Bartlett, II is Assistant Professor, Department of Business Education and Office Administration, Ball State University, Muncie, Indiana. Organizational Research: Determining Appropriate Sample Size in Survey Research James E. Bartlett, II Joe W. Kotrlik Chadwick C. Higgins The determination of sample size is a common task for many organizational

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

C. Wohlin, "Is Prior Knowledge of a Programming Language Important for Software Quality?", Proceedings 1st International Symposium on Empirical

C. Wohlin, Is Prior Knowledge of a Programming Language Important for Software Quality?, Proceedings 1st International Symposium on Empirical C. Wohlin, "Is Prior Knowledge of a Programming Language Important for Software Quality?", Proceedings 1st International Symposium on Empirical Software Engineering, pp. 27-36, Nara, Japan, October 2002.

More information

Project Planning Objectives. Project Estimation. Resources. Software Project Estimation

Project Planning Objectives. Project Estimation. Resources. Software Project Estimation Project Planning Objectives Project Estimation Providing a framework that allows managers to make responsible estimates of the resources and time required to build a software product. Determining the scope

More information

Estimating the Impact of the Programming Language on the Development Time of a Software Project

Estimating the Impact of the Programming Language on the Development Time of a Software Project Estimating the Impact of the Programming Language on the Development Time of a Software Project Frank Padberg Fakultät für Informatik Universität Karlsruhe, Germany padberg@ira.uka.de Abstract An empirical

More information

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether

More information

Prediction of Stock Performance Using Analytical Techniques

Prediction of Stock Performance Using Analytical Techniques 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

More information

A Concise Neural Network Model for Estimating Software Effort

A Concise Neural Network Model for Estimating Software Effort A Concise Neural Network Model for Estimating Software Effort Ch. Satyananda Reddy, KVSVN Raju DENSE Research Group Department of Computer Science and Systems Engineering, College of Engineering, Andhra

More information

A Fresh Look at Cost Estimation, Process Models and Risk Analysis

A Fresh Look at Cost Estimation, Process Models and Risk Analysis A Fresh Look at Cost Estimation, Process Models and Risk Analysis Frank Padberg Fakultät für Informatik Universität Karlsruhe, Germany padberg@ira.uka.de Abstract Reliable cost estimation is indispensable

More information

Module 9: Nonparametric Tests. The Applied Research Center

Module 9: Nonparametric Tests. The Applied Research Center Module 9: Nonparametric Tests The Applied Research Center Module 9 Overview } Nonparametric Tests } Parametric vs. Nonparametric Tests } Restrictions of Nonparametric Tests } One-Sample Chi-Square Test

More information

Bayesian Belief Networks as a Software Productivity Estimation Tool

Bayesian Belief Networks as a Software Productivity Estimation Tool Bayesian Belief Networks as a Software Productivity Estimation Tool S. Bibi, I. Stamelos, L. Angelis Department of Informatics, Aristotle University 54124 Thessaloniki, Greece Emails: {sbibi,stamelos,lef}@csd.auth.gr

More information

Scott Knott Test Based Effective Software Effort Estimation through a Multiple Comparison Algorithms

Scott Knott Test Based Effective Software Effort Estimation through a Multiple Comparison Algorithms Scott Knott Test Based Effective Software Effort Estimation through a Multiple Comparison Algorithms N.Padma priya 1, D.Vidyabharathi 2 PG scholar, Department of CSE, SONA College of Technology, Salem,

More information

Decision Tree Learning on Very Large Data Sets

Decision Tree Learning on Very Large Data Sets Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa

More information

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

More information

A Case Retrieval Method for Knowledge-Based Software Process Tailoring Using Structural Similarity

A Case Retrieval Method for Knowledge-Based Software Process Tailoring Using Structural Similarity A Case Retrieval Method for Knowledge-Based Software Process Tailoring Using Structural Similarity Dongwon Kang 1, In-Gwon Song 1, Seunghun Park 1, Doo-Hwan Bae 1, Hoon-Kyu Kim 2, and Nobok Lee 2 1 Department

More information

On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

More information

Web Cost Estimation:

Web Cost Estimation: 182 Mendes and Mosley Chapter VIII Web Cost Estimation: An Introduction Emilia Mendes, University of Auckland, New Zealand Nile Mosley, MetriQ (NZ) Limited, New Zealand Abstract Despite the diversity of

More information
