A Comparison of Calibrated Equations for Software Development Effort Estimation

Cuauhtemoc Lopez Martin, Edgardo Felipe Riveron, Agustin Gutierrez Tornes

Center for Computing Research, National Polytechnic Institute, Mexico
Av. Juan de Dios Batiz s/n esquina Miguel Othon de Mendizabal, Unidad Profesional "Adolfo Lopez Mateos", Edificio CIC, Colonia Nueva Industrial Vallejo, Delegación Gustavo A. Madero, P.O. 07738, Mexico D.F.
cuauhtemoc@sagitario.cic.ipn.mx; edgardo@cic.ipn.mx; atornes@cic.ipn.mx

Abstract. In this paper, equations for software development effort estimation are calibrated for a local environment using actual data from four projects. Lines of code and function points are used as independent variables in linear and non-linear regression equations. The Mean Magnitude of Relative Error (MMRE) is used as the evaluation criterion to compare these calibrated equations with equations obtained by other researchers. Results show that the calibrated linear regression estimation model has better accuracy for the local environment of this case study.

Keywords: Software effort estimation; Lines of code; Function points; Correlation; Linear and Non-Linear regression.

1. Introduction

Three main problems are related to a project: delivery time, effort, and quality. It is difficult to know when the software will be finished and how much it will cost. Software estimation has been identified as one of the three great challenges for half-century-old computer science [1]. No estimation method or model should be preferred over all others. The key is to use a variety of methods and tools and then to investigate why the estimates differ significantly from one another [2]. In this paper, the effort estimation equations are calibrated for a local environment using actual data from four projects developed at the University of Guadalajara.
According to Heemstra and Kusters [3], expert judgment and analogy are in practice the most frequently applied estimation methods, while algorithmic (or parametric) estimation methods seem to be rarely used. This paper encourages the use of algorithmic estimation methods. In algorithmic models, the development effort is estimated as a function of variables representing the most important cost drivers in the project. Usually, the variables are identified by correlation analysis of data on completed projects [4]. To measure the accuracy of software estimations, several studies have evaluated estimation models using the Mean Magnitude of Relative Error (MMRE), defined as

$$\mathrm{MMRE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| \mathrm{estimate}_i - \mathrm{actual}_i \right|}{\mathrm{actual}_i}$$

where estimate_i is the estimated effort from the model, actual_i is the actual effort, and n is the number of projects. For models that have not been calibrated, the MMRE has ranged from 57% to 800%, whereas for models that have been calibrated the MMRE has reflected % [5].

1.1. Correlation (r) and Coefficient of Determination (r²)

The correlation is the degree to which two sets of data (e.g. lines of code and effort) are related [6]. The correlation value r varies from -1.0 to +1.0. To be useful for estimating, the value of r² (named the coefficient of determination) should be greater than 0.5. The correlation coefficient can be calculated as follows:

$$r = \frac{n \sum (\mathrm{LOC} \cdot E) - \left(\sum \mathrm{LOC}\right)\left(\sum E\right)}{\sqrt{\left[ n \sum \mathrm{LOC}^2 - \left(\sum \mathrm{LOC}\right)^2 \right] \left[ n \sum E^2 - \left(\sum E\right)^2 \right]}} \tag{1}$$

where n is the number of observation pairs, LOC is the number of lines of code, and E is the development effort.

1.2. Linear Regression

When two sets of data are strongly related, it is possible to use a linear regression procedure to model this relationship. Regression analysis is a technique for expressing the relationship between two variables and for estimating the dependent variable (here, effort) based on the independent variable (here, LOC). It is used to develop the equation of the line, which then serves to make predictions. The linear regression equation using least squares is the following [7]:

$$E = a + b\,(\mathrm{LOC}) \tag{2}$$

where

$$b = \frac{n \sum (\mathrm{LOC} \cdot E) - \left(\sum \mathrm{LOC}\right)\left(\sum E\right)}{n \sum \mathrm{LOC}^2 - \left(\sum \mathrm{LOC}\right)^2} \tag{3}$$

$$a = \frac{\sum E}{n} - b\,\frac{\sum \mathrm{LOC}}{n} \tag{4}$$

1.3. Non-Linear Regression

If the number of projects is less than ten, then the constant a of the COCOMO equation can be calibrated using Equation 5 [4].
The COCOMO equation is $E = a\,(\mathrm{KLOC})^{b} \times \mathrm{EAF}$, where E is the effort in man-months (a man-month is equivalent to 152 hours); EAF is the effort adjustment factor; KLOC is the number of lines of code (in thousands); and a and b are constants that depend on the mode: Organic: a = 2.4, b = 1.05; Semi-detached: a = 3.0, b = 1.12; Embedded: a = 3.6, b = 1.20. The EAF is used to tailor the estimation to the conditions of the development environment.
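As a minimal sketch of the COCOMO equation just described, the following Python function evaluates E = a (KLOC)^b × EAF. The mode constants are the standard basic COCOMO values published by Boehm [4]; the names `COCOMO_MODES` and `cocomo_effort` and the 10 KLOC input are illustrative assumptions and are not taken from the case study.

```python
# Basic COCOMO constants (a, b) per development mode, as published by Boehm.
COCOMO_MODES = {
    "organic": (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}


def cocomo_effort(kloc: float, mode: str = "organic", eaf: float = 1.0) -> float:
    """Return the estimated effort in man-months: a * KLOC**b * EAF.

    An EAF of 1.0 corresponds to the case where no adjustment is applied.
    """
    if kloc <= 0:
        raise ValueError("kloc must be positive")
    a, b = COCOMO_MODES[mode]
    return a * kloc ** b * eaf


if __name__ == "__main__":
    # Hypothetical 10 KLOC project, evaluated under each mode.
    for mode in COCOMO_MODES:
        print(f"{mode:14s} {cocomo_effort(10.0, mode):6.1f} man-months")
```

Keeping the constants in a lookup table makes it straightforward to substitute a locally calibrated value of a for one of the published ones.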
For the basic COCOMO model the EAF is not used and is simply set to 1. For the intermediate COCOMO model there are 15 different cost drivers whose values are multiplied together to obtain the EAF [4].

$$a = \frac{\sum_{i=1}^{n} AE_i \, Q_i}{\sum_{i=1}^{n} Q_i^{2}} \tag{5}$$

where n is the number of developed projects, AE_i is the actual effort of project i, and i indexes the individual projects. Each COCOMO mode has its own equation for Q; for the organic model, $Q_i = (\mathrm{KLOC}_i)^{1.05} \times \mathrm{EAF}_i$ must be used.

If the number of projects is more than nine, both the constant a and the exponent b of the COCOMO equation can be calibrated using the following equations [4]:

$$\log a = \frac{a_2 d_0 - a_1 d_1}{a_0 a_2 - a_1^{2}} \qquad b = \frac{a_0 d_1 - a_1 d_0}{a_0 a_2 - a_1^{2}}$$

where:

$$a_0 = \text{number of projects}$$
$$a_1 = \sum \log(\mathrm{KLOC}_{Real})$$
$$a_2 = \sum \left[\log(\mathrm{KLOC}_{Real})\right]^{2}$$
$$d_0 = \sum \log(\mathrm{Effort}_{Real} / \mathrm{EAF})$$
$$d_1 = \sum \log(\mathrm{Effort}_{Real} / \mathrm{EAF}) \, \log(\mathrm{KLOC}_{Real})$$

1.4. Evaluation Criterion

A common criterion for the evaluation of cost estimation models is the Magnitude of Relative Error (MRE) [8]. The MRE value is calculated for each observation i whose effort is predicted. The MRE can be aggregated over multiple observations (N) through the Mean Magnitude of Relative Error (MMRE). MRE and MMRE are defined as follows:

$$\mathrm{MRE}_i = \frac{\left| \text{Actual Effort}_i - \text{Predicted Effort}_i \right|}{\text{Actual Effort}_i} \qquad \mathrm{MMRE} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{MRE}_i$$

In general, the lower the MMRE, the more accurate the estimation technique.

2. State of the art

For most algorithmic models, calibration to a specific software environment can be performed to improve the estimation. The equations are based upon research and historical data, and use inputs such as source Lines of Code (LOC) (either physical or logical [9], based on a coding standard [6]) or Function Points. Several equations have been proposed by previous researchers; some of them are the following [10]:

Effort Equation                  Author(s)
E = 5.2 (KLOC)^0.91              Walston-Felix
E = 5.5 + 0.73 (KLOC)^1.16       Bailey-Basili
E = 0.7 (KLOC)^1.50              Halstead
E = 4.86 (KLOC)^0.976            RADC
E = 5.288 (KLOC)^1.047           Doty
E = 2.43 (KLOC)^0.962            JPL
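To make the equations in the table above concrete, the sketch below evaluates each of them for a single project size. The coefficients are the commonly quoted values for these models and are an assumption wherever the table is hard to read; the names `PUBLISHED_MODELS` and `estimate_all` and the 5 KLOC example size are hypothetical.

```python
# Published effort equations, E in man-months as a function of size in KLOC.
# Coefficients are the commonly quoted values for each model (assumed here).
PUBLISHED_MODELS = {
    "Walston-Felix": lambda kloc: 5.2 * kloc ** 0.91,
    "Bailey-Basili": lambda kloc: 5.5 + 0.73 * kloc ** 1.16,
    "Halstead": lambda kloc: 0.7 * kloc ** 1.50,
    "RADC": lambda kloc: 4.86 * kloc ** 0.976,
    "Doty": lambda kloc: 5.288 * kloc ** 1.047,
    "JPL": lambda kloc: 2.43 * kloc ** 0.962,
}


def estimate_all(kloc: float) -> dict:
    """Return {model name: estimated effort} for one project size."""
    if kloc <= 0:
        raise ValueError("kloc must be positive")
    return {name: model(kloc) for name, model in PUBLISHED_MODELS.items()}


if __name__ == "__main__":
    # Hypothetical 5 KLOC project; the spread illustrates why calibration matters.
    for name, effort in sorted(estimate_all(5.0).items(), key=lambda kv: kv[1]):
        print(f"{name:14s} {effort:6.2f} man-months")
```

Even for one size the uncalibrated models can disagree by a large factor, which is the motivation for calibrating to a local environment.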
3. Methodology used

1. The number of physical lines of code (LOC) of each project was counted, and the development effort was then calculated using linear regression, based on both the correlation and the coefficient of determination.
2. The COCOMO non-linear effort equation was calibrated and applied, based on both the correlation and the coefficient of determination.
3. The number of Unadjusted Function Points (UFP) of each project was calculated, and the development effort was then calculated using linear regression (considering the correlation as well as the coefficient of determination).
4. The non-linear equations of the algorithmic models proposed by Boehm (COCOMO), Walston-Felix, Halstead, Bailey-Basili, RADC, Doty, and JPL were applied. The results of these equations were compared with the results generated in points 1, 2 and 3 of this section. MMRE was used as the evaluation criterion.

4. Experimental Results

4.1. Data Gathering

According to the Mexican National Program for Software Industry Development, 98% of the software from Mexican enterprises is developed without formal processes to record, track and control measurable issues during the development process [11]. This makes it difficult to obtain actual data. Data were collected from four projects of the Information Systems Department of the University of Guadalajara: (1) Emission and Tracking of Students Pay Orders; (2) Extensions and Demands System; (3) Regional System for Fruit and Vegetable Planning; and (4) Virtual Payment. Their metrics are shown in Table 1 and serve to calibrate the regression equations. A detailed description of the COCOMO EAF and of Unadjusted Function Points (UFP) can be found in [12].

Project   LOC    Effort   Unadjusted Function Points (UFP)   COCOMO EAF
1         3944   8        3                                  .08
2         3006            0                                  .80
3         500    .5       74                                 0.800
4         600    35       409                                .80

Table 1. Projects Actual Data

4.2. Calibrating Linear and Non-Linear Regression Equations

Once the number of LOC has been counted, it is possible to generate effort equations.
The first step is to calculate the coefficients of correlation and determination. According to Equation 1, the results obtained are r = 0.9869 and r² = 0.9740. Both values are high. The values of a and b are calculated with Equations 3 and 4. The final effort equation using linear regression, according to Equation 2, is the following:

$$E = 7.6996 + 9.94\,(\mathrm{KLOC}) \tag{6}$$

According to Equation 5, the value of the constant a for the non-linear equation is calculated as follows:

Project   KLOC    EAF     Effort   Q       (Effort)(Q)   Q²
1         3.944   .08     8        4.34    34.739        8.856
2         3.006   .80              4.065   8.3           6.57
3         .8      0.800   .5       .358    5.896         5.56
4         6.      .80     35       8.05    80.58         64.37
Sum                                        39.84         05.8

a = 3.3

Then, in accordance with the COCOMO equation, the non-linear equation for estimating the effort (organic model) is the following:

$$E = 3.3\,(\mathrm{KLOC})^{1.05} \times \mathrm{EAF} \tag{7}$$

With Function Points as the independent variable, the results are r = 0.9566 and r² = 0.9151. Both values are high. The values of a and b are again calculated with Equations 3 and 4. The final effort equation using linear regression, according to Equation 2, is the following (a paper on LOC-FP equivalence can be found in [13]):

$$E = 0.0354 + 0.074\,(\mathrm{FP}) \tag{8}$$

4.3. Comparing MRE_i and MMRE Results with both Calibrated and Original Equations

The unit of measure of effort is the man-month.

Project   Eq. 6   Eq. 7   Eq. 8   COCOMO   Walston-Felix   Halstead   Bailey-Basili   RADC   Doty   JPL
1         0.43    0.70    0.60    0.30     .7              0.3        0.4             .3     .78    0.4
2         0.07    5.36    0.      3.88     6.08            0.8        3.06            6.     7.36   .50
3         0.96    .95     .46     .6       4.3             0.3        .6              4.3    5.     .6
4         0.03    0.8     0.03    0.45     0.              0.69       0.67            0.8    0.0    0.60
Sum       .49     8.30    3.      5.90     .87             .4         6.03            .9     4.36   4.86
MMRE      0.37    .07     0.80    .47      .97             0.54       .5              .98    3.59   .

The table above shows that the MMRE values vary from 0.37 to 3.59. The calibrated linear regression equation using LOC has the best accuracy, with MMRE = 0.37, while the calibrated linear regression equation using Function Points is in third place with 0.80.
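The calibration and comparison steps of Sections 4.2 and 4.3 can be sketched as follows. The project data in `PROJECTS` are hypothetical placeholders, not the figures of the case study, and the function names are likewise assumptions. The sketch fits the linear equation by least squares (Equations 3 and 4), calibrates the COCOMO constant a for the organic model (Equation 5), and compares both by MMRE.

```python
# Hypothetical (KLOC, EAF, actual effort in man-months) triples.
# These are illustrative values only, NOT the data of the case study.
PROJECTS = [(2.0, 1.0, 5.0), (3.0, 0.9, 7.0), (4.0, 1.1, 12.0), (6.0, 1.2, 20.0)]


def fit_linear(xs, ys):
    """Least-squares constants (a, b) of E = a + b*x (Equations 3 and 4)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = sy / n - b * sx / n
    return a, b


def calibrate_cocomo_a(projects, exponent=1.05):
    """Constant a = sum(AE_i * Q_i) / sum(Q_i**2), with Q_i = KLOC_i**exponent * EAF_i."""
    qs = [kloc ** exponent * eaf for kloc, eaf, _ in projects]
    numerator = sum(actual * q for (_, _, actual), q in zip(projects, qs))
    denominator = sum(q * q for q in qs)
    return numerator / denominator


def mmre(actuals, estimates):
    """Mean of |actual - estimate| / actual over all projects."""
    return sum(abs(a - e) / a for a, e in zip(actuals, estimates)) / len(actuals)


if __name__ == "__main__":
    klocs = [p[0] for p in PROJECTS]
    efforts = [p[2] for p in PROJECTS]

    a_lin, b_lin = fit_linear(klocs, efforts)
    a_coc = calibrate_cocomo_a(PROJECTS)

    linear_est = [a_lin + b_lin * k for k in klocs]
    cocomo_est = [a_coc * k ** 1.05 * eaf for k, eaf, _ in PROJECTS]

    print(f"linear:  E = {a_lin:.4f} + {b_lin:.4f} (KLOC), MMRE = {mmre(efforts, linear_est):.2f}")
    print(f"COCOMO:  a = {a_coc:.4f}, MMRE = {mmre(efforts, cocomo_est):.2f}")
```

Note that this sketch evaluates each equation on the same projects used to calibrate it, as the comparison in this section does; with more data, holding out some projects would give a less optimistic estimate of accuracy.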
5. Conclusions and Directions for Future Research

In this paper, linear and non-linear regression equations for software development effort estimation were calibrated for a local environment using actual data from four projects. These calibrated equations were compared with others obtained by other researchers. The comparison was based on the Mean Magnitude of Relative Error (MMRE). Results showed that the calibrated linear estimation model had better accuracy for this local environment.

Since 98% of the software from Mexican enterprises is developed without formal processes to record, track and control measurable issues during the development process, the effectiveness of any software estimation technique is reduced, because all techniques require historical data. This situation is reflected in the small sample used in this paper and could represent its weakness. However, the calibration activities described here can be reused when more data become available. Future research will involve the application of other estimation alternatives such as Fuzzy Logic and Neural Networks.

References

[1] Brooks Frederick P. Jr., Three Great Challenges for Half-Century-Old Computer Science. Journal of the ACM, Vol. 50, No. 1, pp. 25-26, January 2003.
[2] Boehm B., Abts Ch., Chulani S. Software Development Cost Estimation Approaches: A Survey. Chulani Ph.D. Report, 1998.
[3] Heemstra F., Kusters R., Software cost estimation in the Netherlands: 10 years later, Proceedings of the European Software Control and Metrics Symposium (ESCOM-SCOPE), 1999, pp. 3 3.
[4] Boehm B., Software Engineering Economics, Englewood Cliffs, 1981.
[5] Hareton Leung, Zhang Fan, Software Cost Estimation, The Hong Kong Polytechnic University, Hong Kong, 2000.
[6] Humphrey W. A Discipline for Software Engineering, Addison Wesley, 00.
[7] Richard A. Johnson. Probabilidad y Estadística para Ingenieros. Prentice Hall, 1997.
[8] Lionel C. Briand, Khaled El Emam, Dagmar Surmann, Isabella Wieczorek. An Assessment and Comparison of Common Software Cost Estimation Modeling Techniques. ISERN-98-27.
[9] Park R. E. Software Size Measurement: A Framework for Counting Source Statements. SEI, Carnegie Mellon University, September 1992.
[10] Pressman R., Software Engineering, A Practitioner's Approach, McGraw Hill, 00.
[11] Secretaría de Economía, Programa para el Desarrollo de la Industria del Software, June 00. Available: http://www.economia.gob.mx/?p=8
[12] Lopez-Martin Cuauhtemoc, Gutierrez-Tornes Agustin, Software Effort Estimation: A Designed Process for Structured and Object Oriented Software Engineering Approaches, Proceedings of the th International Congress on Computer Science Research, CIICC 04, September 9-30, October, 2004, Tlalnepantla, México.
[13] Lopez-Martin Cuauhtémoc, Lines of Code as a Source for Function Point Estimation Using Linear Regression and Correlation, XVI Congreso Nacional y II Internacional de Informática y Computación 2003, October 2003.