EVALUATING SOFTWARE METRICS AND SOFTWARE MEASUREMENT PRACTICES

Version 4.0
March 14, 2014

Capers Jones, VP and CTO; Namcook Analytics LLC
Web: www.namcook.com
Blog: http://namcookanalytics.com
Email: Capers.Jones3@Gmail.com

Abstract

Software productivity and quality are topics of significant economic importance in the modern world. Both productivity and quality should be measured with accuracy using effective metrics and proven measurement practices. But unlike older and more mature scientific disciplines, software engineering has used inaccurate metrics and ineffective measurement practices for more than 50 years. This paper analyzes and evaluates a sample of current software metrics and measurement practices. Some common problems include the fact that the lines of code (LOC) metric penalizes high-level languages and makes requirements and design invisible. The common cost per defect metric penalizes quality and makes buggy software look less expensive than high-quality software. The urban legend that cost per defect goes up by more than 100 times after release is not true and is due to poor measurement practices that ignore fixed costs. The purpose of this paper is to show how both productivity and quality can be measured with high precision and with adherence to standard economic principles. Overall activity-based costs using function point metrics are the best choice for productivity and economic analysis. Function points combined with defect removal efficiency (DRE) are the best choice for software quality analysis. More than 300 metric and measurement topics are included.

Copyright 2014 by Capers Jones. All rights reserved.
INTRODUCTION

Over the past 50 years the software industry has grown to become one of the major industries of the 21st century. On a global basis software applications are the main operating tools of corporations, government agencies, and military forces. Every major industry employs thousands of software professionals. The total employment of software personnel on a global basis probably exceeds 20,000,000 workers.

Because of the importance of software, and because of the high costs of software development and maintenance combined with less than optimal quality, it is important to measure both software productivity and software quality with high precision. But this seldom happens. For more than 50 years the software industry has used a number of metrics that violate standard economic concepts and produce inaccurate and distorted results. Two of these are the lines of code (LOC) metric and the cost per defect metric. LOC metrics penalize high-level languages and make requirements and design invisible. Cost per defect penalizes quality and ignores the true value of quality, which derives from shorter schedules and lower development and maintenance costs. Both LOC and cost per defect metrics can be classed as professional malpractice for overall economic analysis. However, both have limited use for more specialized purposes.

One of the reasons IBM invested more than a million dollars in the development of function point metrics was to provide a metric that could be used to measure both productivity and quality with high precision and with adherence to standard economic principles.

For more than 200 years a basic law of manufacturing has been used by all major industries except software: if a manufacturing cycle includes a high component of fixed costs and there is a decline in the number of units manufactured, the cost per unit will go up. The problems with both LOC metrics and cost per defect are due to ignoring this basic law.
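This manufacturing law can be illustrated with a minimal sketch; the fixed and variable cost figures below are hypothetical, chosen only to show the effect:

```python
# Minimal sketch of the fixed-cost law (hypothetical numbers): when
# fixed costs are present, cost per unit rises as units decline.

def cost_per_unit(fixed_cost, variable_cost_per_unit, units):
    """Total manufacturing cost divided by the number of units produced."""
    return (fixed_cost + variable_cost_per_unit * units) / units

# $10,000 of fixed cost, $5 of variable cost per unit.
for units in (1000, 100, 10):
    print(f"{units:5d} units -> ${cost_per_unit(10_000.0, 5.0, units):,.2f} per unit")
# 1000 units -> $15.00; 100 units -> $105.00; 10 units -> $1,005.00
```

Note that the variable cost per unit never changes; only the amortization of the fixed cost over fewer units drives the per-unit increase.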
For modern software projects requirements and design are often more expensive than coding. Further, requirements and design are inelastic and stay more or less constant regardless of coding time. When there is a switch from a low-level language such as assembly to a higher-level language such as Java, the quantity of code and the effort for coding are reduced, but requirements and design act like fixed costs, so the cost per line of code will go up.

Table 1 illustrates the paradoxical reversal of productivity rates using LOC metrics in a sample of 10 versions of a PBX switching application coded in 10 languages, all of the same size of 1,500 function points:
Table 1: Productivity Rates for 10 Versions of the Same Software Project
(A PBX switching system of 1,500 function points in size)

Language      Effort     Funct. Pt.   Work Hrs.    LOC per     LOC per
              (Months)   per Staff    per Funct.   Staff       Staff
                         Month        Pt.          Month       Hour
Assembly       781.91      1.92        68.81        480         3.38
C              460.69      3.26        40.54        414         3.13
CHILL          392.69      3.82        34.56        401         3.04
PASCAL         357.53      4.20        31.46        382         2.89
PL/I           329.91      4.55        29.03        364         2.76
Ada83          304.13      4.93        26.76        350         2.65
C++            293.91      5.10        25.86        281         2.13
Ada95          269.81      5.56        23.74        272         2.06
Objective C    216.12      6.94        19.02        201         1.52
Smalltalk      194.64      7.71        17.13        162         1.23
Average        360.13      4.17        31.69        366         2.77

As can be seen, the Assembly version had the largest amount of effort but also the highest apparent productivity measured with LOC per month, and the lowest measured with function points per month. Function points match standard economic assumptions while LOC metrics reverse standard economics. In this table using LOC metrics for productivity comparison would be professional malpractice.

When testing software, the time needed to write test cases and run them is comparatively inelastic and stays more or less constant regardless of how many bugs are found. When few bugs are found, test case preparation and execution act like fixed costs, so the cost per defect will go up. Actual defect repair costs are comparatively flat, although there are ranges. But the ranges are found in every form of defect removal and do not rise in a consistent pattern.

Table 2 shows the mathematics of the cost per defect metric. Every column uses fixed costs that are exactly the same. Labor costs are set at $75.75 per hour for every row and column of the table. The defect repair column assumes a constant value of 5 hours per defect for every form of test:

Table 2: Cost per Defect for Six Forms of Testing
(Assumes $75.75 per staff hour for costs)

                  Writing     Running     Repairing    TOTAL       Number of    $ per
                  Test Cases  Test Cases  Defects      COSTS       Defects      Defect
Unit test         $1,250.00   $750.00     $18,937.50   $20,937.50     50        $418.75
Function test     $1,250.00   $750.00     $7,575.00    $9,575.00      20        $478.75
Regression test   $1,250.00   $750.00     $3,787.50    $5,787.50      10        $578.75
Performance test  $1,250.00   $750.00     $1,893.75    $3,893.75       5        $778.75
System test       $1,250.00   $750.00     $1,136.25    $3,136.25       3      $1,045.42
Acceptance test   $1,250.00   $750.00     $378.75      $2,378.75       1      $2,378.75

As can be seen, the fixed costs of writing test cases and running test cases cause cost per defect to go up as defect volumes come down. This of course is due to the basic rule of manufacturing economics that in the presence of fixed costs a decline in units will increase cost per unit. However, actual defect repairs were a constant 5 hours for every form of testing in the table.

By contrast, looking at the same project and the same testing sequence using the metric defect removal cost per function point, the true economic situation becomes clear:

Table 3: Cost per Function Point for Six Forms of Testing
(Assumes $75.75 per staff hour for costs)
(Assumes 100 function points in the application)

                  Writing     Running     Repairing   TOTAL $    Number of
                  Test Cases  Test Cases  Defects     PER F.P.   Defects
Unit test         $12.50      $7.50       $189.38     $209.38       50
Function test     $12.50      $7.50       $75.75      $95.75        20
Regression test   $12.50      $7.50       $37.88      $57.88        10
Performance test  $12.50      $7.50       $18.94      $38.94         5
System test       $12.50      $7.50       $11.36      $31.36         3
Acceptance test   $12.50      $7.50       $3.79       $23.79         1
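The arithmetic behind tables 2 and 3 can be reproduced with a short sketch using the paper's own constants: $75.75 per staff hour, 5 repair hours per defect, the fixed test-case costs, and a 100 function point application:

```python
# Sketch of the arithmetic behind tables 2 and 3. All constants are
# taken from the tables: labor at $75.75/hour, 5 hours per defect
# repair, fixed costs for writing and running test cases, 100 FP.

HOURLY_RATE = 75.75
REPAIR_HOURS_PER_DEFECT = 5
FUNCTION_POINTS = 100
WRITING_COST = 1250.00   # fixed cost of writing test cases
RUNNING_COST = 750.00    # fixed cost of running test cases

stages = [("Unit test", 50), ("Function test", 20), ("Regression test", 10),
          ("Performance test", 5), ("System test", 3), ("Acceptance test", 1)]

for stage, defects in stages:
    repair = defects * REPAIR_HOURS_PER_DEFECT * HOURLY_RATE
    total = WRITING_COST + RUNNING_COST + repair
    # Cost per defect rises as defects decline; cost per FP falls.
    print(f"{stage:17s} ${total / defects:8.2f} per defect"
          f"  ${total / FUNCTION_POINTS:7.2f} per FP")
```

Running the loop reproduces both columns: $418.75 per defect and $209.38 per function point for unit test, through $2,378.75 per defect and $23.79 per function point for acceptance test, with the repair cost per defect held constant throughout.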
It is important to understand that tables 2 and 3 both show the results for the same project and also use identical constant values for writing test cases, running them, and fixing bugs. However, defect removal costs per function point decline when total defects decline, while cost per defect grows more and more expensive as defects decline. Both of these problems will be discussed again later in this report. But the basic point is that manufacturing economics and fixed costs need to be included in software manufacturing and production studies. Thus far much of the software literature has ignored fixed costs.

What is software productivity?

The standard economic definition of productivity for more than 100 years has been goods or services produced per unit of labor or expense. For software projects the critical topic is what exactly constitutes a unit of measure for software's goods or services.

The oldest definition of software goods was a line of code. In the 1950s, when only machine language and assembly language existed, more than 90% of the total effort for software was involved with coding, and this was not a bad choice. Today in 2014 there are over 126 occupations associated with software engineering, and for major systems coding is less than 30% of the total effort. There are also more than 3,000 programming languages of various levels and capabilities. LOC is no longer an effective metric of economic productivity, and indeed reverses true economic productivity, as will be discussed later.

The best choice for software goods in 2014 is the function point metric. The function point metric can be used for requirements, design, coding, testing, documentation, management, and all other software activities and occupations. The LOC metric only applies to coding and has no relevance for any of the other kinds of work associated with modern software.
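The reversal shown in table 1 can be verified in a few lines. The effort and LOC-per-month figures below are taken from the table for the lowest- and highest-level languages in the sample:

```python
# Sketch of the productivity reversal in table 1: the same 1,500
# function point PBX application written in a low-level and a
# high-level language. Effort (staff months) and LOC per staff
# month are the figures from the table.

FUNCTION_POINTS = 1500

# language: (effort in staff months, LOC per staff month)
versions = {"Assembly": (781.91, 480), "Smalltalk": (194.64, 162)}

for lang, (months, loc_per_month) in versions.items():
    fp_per_month = FUNCTION_POINTS / months
    print(f"{lang:10s} {fp_per_month:5.2f} FP per month, "
          f"{loc_per_month} LOC per month")

# Assembly takes about four times the effort of Smalltalk, yet LOC
# per month makes it look nearly three times as productive; function
# points per month rank the two versions correctly.
```

The computed function point rates (about 1.92 for Assembly and 7.71 for Smalltalk) match the table, while the LOC rates run in the opposite direction.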
Furthermore, function point metrics are methodology neutral and work equally well with agile projects, iterative projects, extreme programming, the Rational Unified Process (RUP), the Team Software Process (TSP), Merise, Prince2, and any of the other software development methods now in common use. By contrast, story point metrics are limited to agile projects with user stories. Use-case point metrics are limited to projects that utilize use cases, and have no relevance for other design methods such as state transition diagrams or even flow charts.
What is software quality?

Quality is something of a subjective topic in all fields. However, for software a workable definition of quality is the absence of defects that would cause a software application to either fail completely or produce incorrect results. This definition has been used by the author for more than 40 years and has been applied to embedded software, systems software, commercial software, military software, outsource software, web software, and other forms as well.

An older and common definition is that "quality means conformance to requirements." However, this is not a valid definition for software quality. Requirements themselves are filled with defects, and also with toxic requirements that should not be included in software applications at all. To define quality as conformance to something that has been measured to cause more than 20% of total software defects is neither safe nor satisfactory. Software quality needs to encompass requirements defects and not assume that all user requirements are perfect and error free.

There are many other definitions of quality, including a host of words ending in "-ility," such as reliability and maintainability. However, the absence of defects is an attribute that can be measured with precision, while terms such as maintainability and reliability are ambiguous. Further, empirical data supports the hypothesis that low defect counts correlate well with high levels of user satisfaction. Studies within IBM found that low defects and high levels of user satisfaction were consistently related.

This brings up two powerful metrics for understanding software quality: 1) software defect potentials; 2) defect removal efficiency (DRE).

The phrase "software defect potentials" was first used in IBM circa 1970. It is based on measured data from IBM software applications but analyzed and pointed toward future applications.
The term "defect potential" means the sum total of bugs or defects that are likely to be found in all software deliverables. In other words, defect potentials include bugs that might originate in requirements, architecture, design, code, user documents, bugs in test cases themselves, and also bad fixes, or bugs in attempts to fix earlier bugs. In today's world of 2014 the best metric for expressing software defect potentials is defects per function point, because this allows defects from all sources to be summed; i.e.:

Table 4: Software Defect Potentials Circa 2014

1. Requirements defects  = 1.00 per function point
2. Architecture defects  = 0.30 per function point
3. Design defects        = 1.25 per function point
4. Code defects          = 1.50 per function point
5. Document defects      = 0.60 per function point
6. Test case defects     = 0.75 per function point
7. Bad fix defects       = 0.35 per function point

   TOTAL DEFECTS         = 5.75 per function point
No other metric allows defects from all origins to be compared and summed in order to show overall defect potentials. This is a capability that only function point metrics provide.

The second powerful metric for software quality is defect removal efficiency (DRE). This metric refers to the percentage of defects that are found and removed prior to the release of software to clients. Normally DRE is measured at a fixed time point, which IBM set at 90 days after release of the software to clients.

A simple example illustrates the basic mathematics of DRE. Assume that the development team found 90 bugs in a small software application. In the first three months of use, the clients reported another 10 bugs. The total number of bugs is 100, so the defect removal efficiency level is 90%.

The metrics of defect potential and defect removal efficiency are synergistic and show how effective various methods can be, including formal inspections, static analysis, pair programming, and all forms of testing. The combination of defect potential and defect removal efficiency also shows that some kinds of defects, such as requirements defects, are harder to eliminate than coding defects. Table 5 shows that when defect potentials are combined with DRE, the results are of considerable importance to the software industry:

Table 5: Software Defect Potentials and Defect Removal Efficiency (DRE)

Defect Origins          Defect     Defect     Defects     % of
                        Potential  Removal    Delivered   Total
Requirements defects      1.00      75.00%      0.25      31.15%
Design defects            1.25      85.00%      0.19      23.36%
Test case defects         0.75      85.00%      0.11      14.02%
Bad fix defects           0.35      75.00%      0.09      10.90%
Code defects              1.50      95.00%      0.08       9.35%
User document defects     0.60      90.00%      0.06       7.48%
Architecture defects      0.30      90.00%      0.03       3.74%
TOTAL                     5.75      85.00%      0.80     100.00%

Note: table 5 is only an example. Defect potentials and DRE vary widely from the results shown here, in both directions.
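The DRE example above, and the delivered-defect arithmetic behind table 5, can be sketched as follows (the three defect origins shown are a subset of the table, used only as a worked example):

```python
# Sketch of DRE arithmetic: defects found before release vs. defects
# reported by clients in the first 90 days, plus delivered defects
# per function point derived from the potentials in table 5.

def dre(found_before_release, found_after_release):
    """Fraction of total defects removed prior to release."""
    total = found_before_release + found_after_release
    return found_before_release / total

# The worked example from the text: 90 bugs found internally,
# 10 reported by clients in the first three months of use.
print(f"DRE = {dre(90, 10):.0%}")  # DRE = 90%

# Delivered defects per FP = potential * (1 - removal efficiency).
origins = {"Requirements": (1.00, 0.75),
           "Design":       (1.25, 0.85),
           "Code":         (1.50, 0.95)}
for origin, (potential, removal) in origins.items():
    delivered = potential * (1 - removal)
    print(f"{origin:12s} {delivered:.2f} delivered defects per FP")
```

The computed values (0.25, 0.19, and 0.08 delivered defects per function point) match the corresponding rows of table 5, and show why a high potential with high removal efficiency (code) can deliver fewer defects than a lower potential with weak removal (requirements).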
The combination of defect potentials and DRE measures shows that requirements defects, design defects, test case defects, and bad fix defects are harder to remove than code defects. This is proven by the percentages of delivered defects attributable to each kind of defect origin. It also shows that the older definition of quality as conformance to requirements has serious flaws when requirements are a major contributor to overall delivered defect volumes. The bottom line is that requirements, architecture, and design defects are resistant to testing, and therefore pre-test inspections of requirements and design documents should be used for all major
software projects. Testing is efficient against coding defects, of course, but testing is not efficient in finding requirements, architecture, and design defects, so additional methods need to be used prior to release.

Because this paper discusses a great many metrics and measurement topics, it is useful to summarize overall results by selecting the 10 best metrics for software economic analysis and then showing the 10 worst metrics for economic topics:

Table 6: The 10 Best Metrics for Software Economic Analysis
(alphabetical order)

1. Activity-based costs
2. Cost drivers
3. Cost of quality (COQ)
4. Defect potentials
5. Defect removal efficiency (DRE)
6. Function points
7. Occupation groups
8. Total cost of ownership (TCO)
9. User costs
10. Work hours per function point

The combination of metrics in table 6 will show software productivity without violating standard economic assumptions, and will also show software quality, including the economic value of high quality.

We turn now to the set of the 10 worst metrics for software economic studies, some of which have been in use without analysis of their flaws for more than 50 years.

Table 7: The 10 Worst Metrics for Software Economic Analysis
(alphabetical order)

1. Backfiring
2. Cost per defect
3. DCUT
4. Leaky historical data
5. Lines of code (logical)
6. Lines of code (physical)
7. Phase-based costs
8. Story points
9. Technical debt
10. Total project costs with no details

The combination of metrics in table 7 will show reversed productivity that violates the rules of standard economics and will not show quality at all. In fact the true value of quality will be
concealed, and poor quality will look better than high quality. Technical debt is included because it only covers about 17% of the total costs of poor quality. Story points are included because they are not standardized and vary by more than 400% from company to company. Also, they are specific to agile projects with user stories and can't be used for projects that don't utilize user stories. Phase-based metrics are included because they cannot be validated. DCUT and leaky historical data are included because they make productivity and quality look better than they really are. Backfiring is included because it varies by over 50% in both directions from average values.

Six Urgent Needs For Software Engineering

Software is a major industry, but not yet a full profession with consistent excellence in results. Indeed, quality lags far behind what is needed. Software engineering has an urgent need for six significant accomplishments:

1. Stop measuring with unreliable metrics such as LOC and cost per defect and begin to move towards activity-based costs, function point metrics, and defect removal efficiency (DRE) metrics.

2. Start every project with formal early sizing that includes requirements creep, formal risk analysis, formal cost and quality predictions using parametric estimation tools, and requirements methods that will minimize toxic requirements and excessive requirements creep later on.

3. Raise defect removal efficiency from below 90% to more than 99.5% across the board. This will also shorten development schedules and lower costs. It can't be done by testing alone but needs a synergistic combination of pre-test inspections, static analysis, and formal testing using mathematically designed test cases and certified test personnel.

4. Lower defect potentials from above 4.00 per function point to below 2.00 per function point for the sum of bugs in requirements, design, code, documents, and bad-fix injections.
This can only be done by increasing the volume of reusable materials, combined with much better quality measures than today.

5. Increase the volume of reusable materials from below 15% to more than 85% as rapidly as possible. Custom designs and hand coding are intrinsically expensive and error-prone. Only the use of certified reusable materials that approach zero defects can lead to industrial-strength software applications that can operate without excessive failure and without causing high levels of consequential damages to clients.

6. Increase the immunity of software to cyber attacks. This must go beyond normal firewalls and anti-virus packages and include permanent changes in software permissions, and probably in the von Neumann architecture as well. There are proven methods that can do this, but they are not yet deployed. Cyber attacks are a growing
threat to all governments, businesses, and also to individual citizens whose bank accounts and other valuable stored data are at increasing risk.

The remainder of this paper discusses a variety of software metrics and measurement methods in alphabetical order.

ALPHABETICAL DISCUSSION OF SOFTWARE METRICS AND MEASURES

This paper is a work in progress and additional metrics will be added from time to time.

15 Common Software Risks (alphabetical order)

Note the common risks associated with software measurement and metrics issues, as indicated by triple asterisks (***):

1. Cancelled projects ***
2. Consequential damages to clients
3. Cost overruns ***
4. Cyber attacks
5. Estimate errors or rejection of accurate estimates ***
6. Impossible demands by clients or management ***
7. Litigation for breach of contract ***
8. Litigation for patent violation
9. Poor change control
10. Poor measurement after completion ***
11. Poor quality control ***
12. Poor tracking during development ***
13. Requirements creep
14. Toxic requirements and requirements errors
15. Schedule slips by > 25% ***

Nine out of 15 common risks involve numeric information, with errors in estimates or measures as contributing factors. Many software project risks could be minimized or avoided by formal parametric estimates, formal risk analysis prior to starting, accurate status tracking, and accurate benchmarks from similar projects.

20 Criteria for Software Metrics Selection

Software metrics are often created by ad hoc methods, sometimes by amateurs, and broadcast to the world with little or no validation or empirical results. This set of 20 criteria shows the features that effective software metrics should have as attributes:

1. Be validated before release to the world
2. Be standardized, preferably by ISO
3. Be unambiguous
4. Be consistent from project to project
5. Be cost effective and have automated support
6. Be useful for both predictions and measurements
7. Have formal training for new practitioners
8. Have a formal user association
9. Have ample and accurate published data
10. Have conversion rules for other metrics
11. Support both development and maintenance
12. Support all activities (requirements, design, code, test, etc.)
13. Support all software deliverables (documents, code, tests, etc.)
14. Support all sizes of software from small changes through major systems
15. Support all classes and types of software (embedded, systems, web, etc.)
16. Support both quality and productivity measures and estimates
17. Support requirements creep over time
18. Support consumption and usage of software as well as construction
19. Support new projects, enhancement projects, and maintenance projects
20. Support new technologies as they appear (languages, cloud, methodologies, etc.)

Currently IFPUG function point metrics meet 19 of these 20 criteria. Function points are somewhat slow and costly to count, so criterion 5 is not fully met. Other function point variations such as COSMIC, NESMA, FISMA, unadjusted, engineering function points, feature points, etc. vary in how many criteria they meet, but most meet more than 15 of the 20 criteria.

The older lines of code metric meets only criterion 5 and none of the others. LOC metrics are fast and cheap but otherwise fail to meet the other 19 criteria. The LOC metric makes requirements and design invisible and penalizes high-level languages. The cost per defect metric does not actually meet any of the 20 criteria and also does not address the value of high quality in achieving shorter schedules and lower costs. The technical debt metric does not currently meet any of the 20 criteria, although it is such a new metric that it will probably be able to meet some of the criteria in the future.
Technical debt has a large and growing literature but does not actually meet criterion 9, because the literature resembles the blind men and the elephant, with various authors using different definitions of technical debt. Technical debt comes close to meeting criteria 14 and 15.

The story point metric for agile projects seems to meet five criteria (numbers 6, 14, 16, 17, and 18) but varies so widely and is so inconsistent that it cannot be used across companies, and it certainly can't be used without user stories. The use-case metric seems to meet criteria 5, 6, 9, 11, 14, and 15 but can't be used to compare data from projects that don't utilize use cases.
This set of metric criteria is a useful guide for selecting metrics that are likely to produce results that match standard economics and do not distort reality, as do so many current software metrics.

Abeyant defects

The term "abeyant defect" originated in IBM in the late 1960s. It refers to an unusual kind of bug that is unique to a single client and a single configuration and does not occur anywhere else. In fact the change team tasked with fixing the bug may not be able to reproduce it. Abeyant defects are both rare and extremely troublesome when they occur. It is usually necessary to send a quality expert to the client site to find out what unique combination of hardware and software led to the abeyant defect occurring. Some abeyant defects have taken more than two weeks to identify and repair. In today's world of software with millions of users and spotty technical support, some abeyant defects may never be fixed.

Activity-based costs

The term "activity" is defined as the sum total of the work required to produce a major deliverable such as requirements documents or source code. The number of activities associated with software projects ranges from a low of three (design, code, test) to more than 40. Several parametric estimation tools such as Software Risk Master (SRM) predict activity costs. A typical pattern of software activities for a mid-sized software project of 1,000 function points in size might include these seven: 1) requirements; 2) design; 3) coding; 4) testing; 5) quality assurance; 6) user documentation; 7) project management.

One of the virtues of function point metrics is that they can show productivity rates for every known activity, as illustrated by table 8, which is an example for a generic project of 1,000 function points in size:

Table 8: Example of Activity-Based Effort

Activities           Work Hours per FP
Requirements               1.39
Design                     1.96
Coding                     5.94
Testing                    4.36
Documentation              0.43
Quality Assurance          0.54
Management                 1.89
Totals                    16.50
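Table 8's work hours per function point can be turned into activity-based cost estimates with a small sketch. The $75.75 hourly rate is the same illustrative rate used earlier in the cost per defect tables, not a figure from this section, and the activity sum comes to 16.51 rather than the table's rounded total of 16.50:

```python
# Sketch of activity-based cost estimation using the work hours per
# function point from table 8. The burdened labor rate is the same
# illustrative $75.75/hour used in tables 2 and 3 (an assumption here).

HOURLY_RATE = 75.75
FUNCTION_POINTS = 1000  # the generic project size from table 8

hours_per_fp = {"Requirements": 1.39, "Design": 1.96, "Coding": 5.94,
                "Testing": 4.36, "Documentation": 0.43,
                "Quality Assurance": 0.54, "Management": 1.89}

total_hours = sum(hours_per_fp.values()) * FUNCTION_POINTS
for activity, hrs in hours_per_fp.items():
    cost = hrs * FUNCTION_POINTS * HOURLY_RATE
    print(f"{activity:18s} {hrs * FUNCTION_POINTS:8,.0f} hours  ${cost:12,.2f}")
print(f"{'Total':18s} {total_hours:8,.0f} hours  ${total_hours * HOURLY_RATE:12,.2f}")
```

This is the essence of activity-based costing: each activity's effort is visible on its own line, so productivity can be analyzed per activity rather than being buried in a single project total.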
The ability to show productivity for each activity is a virtue of function point metrics and is not possible with many older metrics such as lines of code and cost per defect. To conserve space, table 8 only shows seven activities, but this same form of representation can be extended to more than 40 activities and more than 250 tasks.

Function points are the only available metric in 2014 that allows both activity-level and task-level analysis of software projects. LOC metrics cannot show non-code work at all. Story points might show activities, but only for agile projects and not for other forms of software. Use-case points can only be used on projects that utilize use cases. Only function point metrics are methodology neutral and applicable to all known software activities and tasks.

Accuracy

The topic of accuracy is often applied to questions such as the accuracy of estimates compared to historical data. However, it should also be applied to the question of how accurate the historical data itself is. As discussed in the section on historical data leakage, what is called historical data is often less than 50% complete and omits major tasks and activities such as unpaid overtime, project management, and the work of part-time specialists such as technical writers.

There is little empirical data available on the accuracy of a host of important software topics, including (in alphabetical order) costs, customer satisfaction, defects, development effort, maintainability, maintenance effort, reliability, schedules, size, staffing, and usability.

Adherents of the various function point metrics (COSMIC, FISMA, NESMA, etc.) frequently assert that their specific counting method is more accurate than rival function point methods such as those of the International Function Point Users Group (IFPUG). These are unproven assertions, and also irrelevant in an industry where historical data includes only about 37% of the true costs of software development.
As a general rule, better accuracy is needed for every software metric without exception.

Agile metrics

The agile development approach has created an interesting and unique set of metrics that are used primarily by the agile community. Other metrics such as function points and defect removal efficiency (DRE) work with agile projects too, and are needed if agile is to be compared to other methods such as RUP and TSP, because the agile metrics themselves are not very useful for cross-method comparisons. The agile approach of dividing larger applications into small discrete sprints adds challenge to overall data collection.

Some common agile metrics include burn down, burn up, story points, and velocity. This is a complex topic and one still in evolution, so a Google search on "agile metrics" will turn up many alternatives.

The method used by the author for comparison between agile and other methods is to convert story points and other agile metrics into function points, and to convert the effort from the various sprints into a standard chart of accounts showing requirements, design, coding, testing, etc. for all sprints in aggregate form. This allows side-by-side comparisons between agile projects and other methods such as the Rational Unified Process (RUP), the Team Software Process (TSP), waterfall, iterative, and many others.

Analysis of Variance (ANOVA)
Analysis of variance is a collection of statistical methods for analyzing the ranges of outcomes from groups of related factors. ANOVA might be applied to the schedules of a sample of 100 software projects of the same size and type, or to the delivered defect volumes of the same sample. There are textbooks and statistical tools available that explain and support analysis of variance. ANOVA is related to design of experiments, and particularly to the design of well-formed experiments. Variance and variations are major elements of both software estimating and software measures.

Annual Reports

As all readers know, public companies are required to produce annual reports for shareholders. These reports discuss costs, profits, business expansion or contraction, and other vital topics. Some sophisticated corporations also produce annual software reports on the same schedule as corporate annual reports; i.e., in the first quarter of a fiscal year, showing results for the prior fiscal year. The author has produced such reports, and they are valuable in explaining to senior management at the CFO and CEO level what kind of progress in software occurred in the past fiscal year.

Some of the topics included in these annual reports are software demographics, such as numbers of software personnel by job and occupation group; numbers of customers supported by the software organizations; productivity for the prior year and current year targets; quality for the prior year and current year targets; and customer satisfaction, reliability levels, and other relevant topics such as the mix of COTS packages, open source packages, and internal development. Also included would be modern issues such as cyber attacks and any software-related litigation. Really sophisticated companies might also include topics such as the number of software patents filed in the prior year.
Application size ranges Because function point metrics circa 2014 are somewhat expensive to count manually, they have not been used on really large systems above 10,000 function points in size. As a result the software literature is biased towards small applications and has little data on the world s largest systems. Among the set of really large systems can be found the world-wide military command and control system (WWMCCS) at about 300,000 function points; major ERP packages such as SAP and Oracle at about 250,000 function points; large operating systems from IBM and Microsoft at about 150,000 function points; large information systems such as airline reservation at about 100,000 function points; and dozens of heavy-duty systems software applications such as central-office switching systems at about 25,000 function points in size. These sizes were derived from backfiring, which is discussed later in this report. The approximate global distribution of applications by size approximates the following: 100,000 function points and above 1% 10,000 to 100,000 function points 5% 1,000 to 10,000 function points 15% 100 to 1,000 function points 45% 14
Below 100 function points: 34%

The size ranges for various types of large systems are shown by the accompanying chart. The size ranges for smaller applications are shown using a reduced scale to match their smaller dimensions.

Small projects are far more numerous than large systems. Large systems are far more expensive and more troublesome than small projects. Incidentally, agile development is a good choice below 1,000 function points; TSP and RUP are good choices above 1,000 function points. So far agile has not scaled up well to really large systems above 10,000 function points, but TSP and RUP do well in this zone.

Size is not constant either before release or afterwards. So long as there are active users, applications grow continuously. During development the measured rate is 1% to 2% per calendar month; after release the measured rate is 8% to 15% per calendar year. A typical post-release growth pattern might resemble the following. Over a 10-year period a typical mission-critical departmental system starting at 15,000 function points might have:

Two major system releases of 2,000 function points each: 4,000
Two minor system releases of 500 function points each: 1,000
Four major enhancements of 250 function points each: 1,000
Ten minor enhancements of 50 function points each: 500
Total growth for 10 years: 6,500 function points
System size after 10 years: 21,500 function points
Ten-year total growth percent: 43%

As can be seen, software applications are never static if they have active users. This continuous growth is important to predict before starting and to measure at the end of each calendar or fiscal year. The cumulative information on original development, maintenance, and enhancement is called total cost of ownership or TCO. Predicting TCO is a standard estimation feature of Software Risk Master, which also predicts growth rates before and after release.

Appraisal metrics

Many major corporations have annual appraisals of both technical and managerial personnel.
Normally the appraisals are given by an employee's immediate manager, but often include comments from other managers. Appraisal data is highly confidential and in theory not used for
any purpose other than compensation adjustments or occasionally for terminations for cause. One interesting sociological issue has been noted from a review of appraisal results in a Fortune 500 company: technical personnel with the highest appraisal scores tend to leave jobs more frequently than those with lower scores. The most common reason for leaving is "I don't like working for bad management." Indirect observation supports the hypothesis that teams with high appraisal scores outperform teams with low appraisal scores. Some companies such as Microsoft try to force-fit appraisal scores into patterns; i.e., only a certain low percentage of employees can be ranked as excellent. While the idea is to prevent appraisal score creep (assessing many more people as excellent than is truly the case), the force-fit method tends to lower morale and lead to voluntary turnover by employees who feel wrongly appraised. In some countries, and in companies whose software personnel are union members, it may be illegal to have appraisals. The topic of appraisal scores and their impact on quality and productivity needs additional study, but of necessity studies involving appraisal scores would need to be highly confidential and covered by non-disclosure agreements. The bottom line is that appraisals are a good source of data on experience and knowledge, and it would be useful to the industry to have better empirical data on these important topics.
Assessment

The term assessment in a software context has come to mean a formal evaluation of key practice areas covering topics such as requirements, quality, measures, etc. In the defense sector the assessment method developed by Watts Humphrey, Bill Curtis, and colleagues at the Software Engineering Institute (SEI) is the most common. One byproduct of SEI assessments is placing organizations on a five-point scale called the capability maturity model integrated or CMMI. However the SEI is neither unique nor the oldest organization performing software assessments. The author's former company, Software Productivity Research (SPR), was doing combined assessment and benchmark studies in 1984, a year before the SEI was first incorporated. There is also a popular assessment method in Europe called TickIT. Several former officers of SPR now have companies that provide both assessment and benchmark data collection. These include, in alphabetical order, the Davids Consulting Group, Namcook Analytics LLC, and the Quality/Productivity Measurement group. SPR itself continues to provide assessments and benchmarks as well. Assessments are generally useful because most companies need impartial outside analysis by trained experts to find out their software strengths and weaknesses.

Assignment scope

The term assignment scope refers to the amount of a specific deliverable that is normally assigned to one person. The metrics used for assignment scope can be either natural metrics such as pages of a manual or synthetic metrics such as function points. Common examples of assignment scopes include code volumes, test case construction, documentation pages, customers supported by one phone agent, and the amount of source code assigned to maintenance personnel. Assignment scope metrics and production rate metrics are used in software estimation tools. Assignment scopes are discussed in several of the author's books, including Applied Software Measurement and Estimating Software Costs.
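The staffing arithmetic implied by assignment scope can be sketched in a few lines of Python. This is only an illustration: the function name and the value of 1,500 function points per maintenance programmer are hypothetical, not calibrated benchmark data.

```python
def estimated_staff(deliverable_size, assignment_scope):
    """Staff needed = size of the deliverable divided by the amount
    normally assigned to one person (the assignment scope)."""
    return deliverable_size / assignment_scope

# Illustrative only: assume one maintenance programmer is assigned
# about 1,500 function points of installed software.
print(estimated_staff(15000, 1500))  # 10.0
```

The inverse calculation, multiplying staff by assignment scope, gives the largest deliverable a fixed team can reasonably own.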
Attrition measures

As we all know, personnel change jobs frequently. During the high-growth period of software engineering in the 1970s most software engineers had as many as five jobs with five companies. In today's weak economy job hopping is less common. In any case most corporations measure annual personnel attrition rates by job title. Examination of exit interviews shows that top personnel leave more often than average personnel, and do so because they don't like working for bad management. For software engineers, technical challenge and capable colleagues tend to be larger factors in attrition than compensation.

Automatic function point counting

The Object Management Group (OMG) has published a standard for automatic function point counting. This standard is supported by an automated tool by CAST Software. A similar tool
has been demonstrated by Relativity Technologies. Both tools use mathematical approaches and can generate size from source code or, in the CAST tool, from UML diagrams. Neither tool has published data on the speed of counting or on the accuracy of the counts compared to normal manual function point analysis. The author of this paper has filed a U.S. utility patent application on a method of high-speed and early sizing that can produce function point sizes for applications between 10 and 300,000 function points in about 1.8 minutes, regardless of the actual application size. The tool operates via pattern matching. Using a formal taxonomy of application nature, scope, class, type, and complexity, the tool derives function point size from historical data of projects that share the same taxonomy pattern. The author's sizing tool is included in the Software Risk Master (SRM) tool and is also available for demonstration purposes on the Namcook Analytics LLC web site, www.namcook.com. The author's tool produces size in a total of 23 metrics including function points, story points, use-case points, physical and logical source code size, and others.

Backfiring

In the early 1970s IBM became aware that lines of code metrics had serious flaws as a productivity metric, since they penalized modern languages and made non-coding work invisible. Alan Albrecht and colleagues at IBM White Plains began development of function points. They had available hundreds of IBM applications with accurate counts of logical code statements. As the function point metric was being tested it was noted that various languages had characteristic levels, or numbers of code statements per function point. The COBOL language, for example, averaged about 106.7 statements per function point in the procedure and data divisions. Basic assembly language averaged about 320 statements per function point.
These observations led to a concept called backfiring, or mathematical conversion between older lines of code data and newer function points. However, due to variances in programming styles there were ranges of over two to one in both directions. COBOL varied from about 50 statements per function point to more than 175 statements per function point, even though the average value was 106.7. Backfiring was not accurate, but it was easy to do and soon became a common sizing method for legacy applications where code already existed. Today in 2014 several companies such as Gartner Group, QSM, and Software Productivity Research (SPR) sell commercial tables of conversion rates for more than 1,000 programming languages. Interestingly, the values among these tables are not always the same for specific languages. Backfiring remains popular in spite of its low accuracy for specific applications and languages.

Bad fix injections

Some years ago IBM discovered that about 7% of attempts to fix software bugs contained new bugs in the repairs themselves. These were termed bad fixes. In extreme cases, such as very high cyclomatic complexity levels, bad fix injections can top 25%. This brings up the point that repairs to software are themselves sources of error. Therefore static analysis, inspections, and regression testing are needed for all significant defect repairs. Bad fix injections were first identified in the 1970s. They are discussed in the author's book The Economics of Software Quality.
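Because each round of repairs injects new defects at roughly the bad-fix rate, total repair effort follows a geometric series. A minimal sketch, assuming a flat 7% injection rate (the helper function and defect counts are illustrative):

```python
def total_repairs(initial_defects, bad_fix_rate=0.07, rounds=10):
    """Sum the repairs needed when each round of fixes injects
    bad_fix_rate new defects per defect repaired; the total
    approaches initial_defects / (1 - bad_fix_rate)."""
    total = 0.0
    remaining = float(initial_defects)
    for _ in range(rounds):
        total += remaining
        remaining *= bad_fix_rate
    return total

# Fixing 1,000 defects at a 7% bad-fix rate requires roughly
# 1,075 repairs in total before the defect stream dies out.
print(round(total_repairs(1000)))  # 1075
```

At a 25% injection rate the same 1,000 defects would require about 1,333 repairs, which is one reason very high cyclomatic complexity is so expensive.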
Bad test cases

A study of regression test libraries by IBM in the 1970s found that about 15% of test cases had errors in them. (The same study also found about 20% duplicate test cases that tested the same topics without adding any value.) This topic is severely under-reported in the quality and test literature. Test cases that themselves contain errors add to testing costs but do not add to testing thoroughness.

Balanced scorecard

Art Schneiderman, Robert Kaplan, and David Norton (formerly of Nolan and Norton) originated the balanced scorecard concept as known today, although there were precursors. The book The Balanced Scorecard by Kaplan and Norton made it popular. It is now widely used for both software and non-software purposes. A balanced scorecard comprises four views and related metrics that combine a financial perspective, a learning and growth perspective, a customer or stakeholder perspective, and an internal business process perspective. The balanced scorecard is not just a retroactive set of measures but also includes proactive forward planning and strategy approaches. Although balanced scorecards might be used by software organizations, they are most commonly used at higher corporate levels where software, hardware, and other business factors need integration.

Baselines

For software process improvement, a baseline is a measurement of quality and productivity at the current moment, before the improvement program begins. As the improvement program moves through time, additional productivity and quality data collections will show rates of progress over time. Baselines may also have contract implications if an outsource vendor tenders an offer to provide development or maintenance services cheaper and faster than the current rates. In general the same kinds of data are collected for both baselines and benchmarks, which are discussed later in this paper.

Bayesian analysis

Bayesian analysis is named after the English mathematician Thomas Bayes from the 18th century.
Its purpose in general is to use historical data and observations to derive the odds of occurrences or events. In 1999 a doctoral student at the University of Southern California, Sunita Devnani-Chulani, applied Bayesian analysis to software cost estimating methods such as Checkpoint (designed by the author of this paper), COCOMO, SEER, SLIM, and some others. This was an interesting study. In any case Bayesian analysis is useful in combining prior data points with hypotheses about future outcomes.
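A one-function sketch of the Bayesian update just described, with illustrative probabilities (the 32% prior echoes the cancellation rate cited elsewhere in this paper; the evidence likelihoods are assumed for the example):

```python
def bayes_update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Posterior P(H | E) from Bayes' theorem."""
    numerator = p_evidence_given_h * prior
    total_evidence = numerator + p_evidence_given_not_h * (1 - prior)
    return numerator / total_evidence

# Hypothesis: the project will be cancelled (prior 32%).
# Evidence: a major schedule slip, assumed to occur in 80% of
# cancelled projects but only 30% of completed ones.
print(round(bayes_update(0.32, 0.80, 0.30), 2))  # 0.56
```

Observing the schedule slip raises the estimated cancellation risk from 32% to about 56%, which is exactly the kind of prior-plus-evidence reasoning Bayesian cost models apply.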
Benchmarks

The term benchmark is much older than software and originally applied to chiseled marks in stones used by surveyors for leveling rods. Since then the term has become generalized, and as of 2014 it is used for well over 500 different forms of benchmarks in almost every industry. Major corporations have been observed to use more than 60 benchmarks, including attrition rates, compensation by occupation, customer satisfaction, market shares, quality, productivity, and many more. Total costs for benchmarks can top $5,000,000 per year, but are scattered among many operating units, so benchmark costs are seldom consolidated. In this paper a narrower form of benchmark is relevant, which deals specifically with software development productivity and sometimes with software quality. As this paper is written in 2014 there are more than 25 organizations that provide software benchmark services. Among these can be found the International Software Benchmark Standards Group (ISBSG), Namcook Analytics (the author's company), the Quality and Productivity Management Group, Quantimetrics, Reifer Associates, Software Productivity Research (SPR), and many more. The data provided by these various benchmark organizations varies, of course, but tends to concentrate on software development results. Function point metrics are most widely used for software benchmarks, but other metrics such as lines of code also occur. Benchmark data can either be self-reported by clients of benchmark groups or collected via on-site or remote meetings with clients. The on-site or remote collection of benchmark data by commercial benchmark groups allows known errors, such as failure to record unpaid overtime, to be corrected; this may not occur with self-reported benchmark data.

Breach of contract litigation

The author has worked as an expert witness in more than a dozen software breach of contract cases.
These are concerned with either projects that were terminated without being delivered, or projects that were delivered but failed to work, or at least failed to work well. The main kinds of data collected during breach of contract cases center on quality and on requirements creep, both of which are common issues in breach of contract litigation. Common problems noted during these cases that are relevant to software metrics issues include: 1) poor estimates prior to starting; 2) poor quality control during development; 3) poor change control during development; 4) very poor and sometimes misleading status tracking during development. About 5% of outsource contracts seem to end up in court. Litigation is expensive, and the costs can easily top $5,000,000 for both the plaintiff and the defendant. It is an interesting phenomenon that all but one of the cases in which the author was an expert witness involved major systems larger than 10,000 function points in size. It is unfortunate that neither the costs of cancelled projects nor the costs of breach of contract litigation are currently included in the metric of technical debt, which is discussed later in this report.
Bug

One of the legends of software engineering is that the term bug first referred to an actual insect that had jammed a relay in an electromechanical computer. The term bug has since come to mean any form of defect in either code or other deliverables. Bug reports during development and after release are standard software measures. See also defect later in this paper. There is a pedantic discussion among academics that involves differences between failures, faults, defects, and bugs, but common definitions are more widely used than academic nuances.

Burden rates

Software cost structures are divided into two main categories: the costs of salaries and the costs of overhead, commonly called the burden rate or simply overhead. Salary costs are obvious and include the hourly or monthly salaries of software personnel. Burden rates are not at all obvious and vary from industry to industry, from company to company, and from country to country. In the United States some of the normal components of burden rates include insurance, office space, computers and equipment, telephone service, taxes, unemployment, and a variety of other fees and local taxes. Burden rates can vary from a low of about 25% of monthly salary costs to a high of over 100% of salary costs. Some industries such as banking and finance have very high burden rates; other industries such as manufacturing and agriculture have lower burden rates. But the specifics of burden rates need to be examined for each company in the specific locations where the company does business.

Burn down

Although this metric can be used with any method, it is most popular with agile projects. The burn down rate is normally expressed graphically by showing the amount of work remaining to be performed compared to the amount of time available to complete the work. Burn down is somewhat similar in concept to earned value. A variety of commercial and open-source tools can produce burn down charts. See also the next topic of burn up.
The work can be expressed in terms of user stories or natural deliverables such as pages of documentation or source code.

Burn up

This form of chart can also be used with any method but is most popular with agile projects. Burn up charts show the amount of work already completed compared to the backlog of uncompleted work. Burn down charts, just discussed, show uncompleted work and time remaining. Here too a variety of commercial and open-source tools can produce the charts. The work completed can be expressed in stories or natural metrics.
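The values a burn down chart plots can be computed directly from iteration data. A minimal sketch with hypothetical sprint numbers in story points:

```python
def burn_down(total_work, completed_per_iteration):
    """Remaining work after each iteration; these are the points
    a burn down chart plots against time."""
    remaining = [total_work]
    for done in completed_per_iteration:
        remaining.append(max(remaining[-1] - done, 0))
    return remaining

# Hypothetical 120-story-point backlog over five sprints.
print(burn_down(120, [30, 25, 20, 25, 20]))
# [120, 90, 65, 45, 20, 0]
```

A burn up chart uses the complementary series: cumulative completed work plotted against the (possibly growing) total backlog.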
Business value

The term business value is somewhat subjective and ambiguous. Business value can include tangible financial value, intangible value, and also intellectual property such as patents. Tangible value can include revenues, profits, and services such as education and consulting. Intangible value can include customer satisfaction, employee morale, and benefits to human life or safety, as might be found with medical software. Business value tends to vary from industry to industry and from company to company. It can also vary from project to project.

Cancelled project metrics

The author's data indicates that about 32% of large systems in the 10,000 function point size range are cancelled. The Standish report, discussed later in this paper, also reports significant project cancellations. It would benefit the industry to perform post-mortems and collect standard benchmarks for all cancelled projects. The data elements would include: 1) nature and type of project; 2) size of application at point of cancellation; 3) methodologies used on the project, such as waterfall, agile, etc.; 4) programming languages; 5) time to cancellation in calendar months; 6) costs accrued to the point of cancellation; 7) team size and occupation groups; 8) business reason for cancellation, such as negative ROI, poor quality, excessive schedule delays, or some other reason. There is some difficulty in collecting data on cancelled projects because most companies are embarrassed by their failures and prefer to keep them secret. The best and most complete data on cancelled projects does not come from ordinary benchmark data, but from the depositions and discovery documents produced during litigation. However some of this data may be covered by non-disclosure agreements.

Certification

As this paper is written in 2014 there are more than 50 U.S. organizations that provide some form of certification for software workers.
Among the kinds of certification currently available are certification for counting function points, certification for testers, certification for quality assurance personnel, certification for project managers, and certification offered by specific companies such as Microsoft and Apple for working on their products and software packages. However there is little empirical data demonstrating that certification actually improves performance compared to uncertified personnel doing the same work. There have been studies showing that certified function point counters are fairly congruent when counting the same application. However there is a shortage of data as to the performance of certified test personnel and certified project management personnel. There is no reason to doubt that certification does improve performance; what is missing is solid benchmark data that proves this to be the case and quantifies the magnitude of the benefits.
Certification by FDA, FAA, etc.

Federal regulatory agencies in the United States such as the Food and Drug Administration and the Federal Aviation Administration require certification of both hardware and software, such as medical devices and avionics packages. There are similar agencies in every major country. This kind of certification for software is expensive and requires the production of a variety of special reports and metrics. These add between 5% and 8% to the costs of software applications undergoing certification. They also add time to schedules: certified software packages usually require several months more than the same size and type of software that is not certified. The agile methodology needs to be considerably modified to support certification, because the documents are mandated and not optional. The TSP and RUP methods are easier to use for government-certified software packages.

Certification of reusable materials

As software reuse becomes a mainstream development approach, there is a growing need for formal libraries of certified reusable materials that approach zero-defect status. These libraries would contain much more than source code, and would also include architectural patterns, design patterns, reusable test cases, and reusable user documents. For that matter, reusable plans and estimates could also be included. The certification process would include formal inspections, static analysis, and testing by certified test personnel using mathematical test case design methods such as those based on design of experiments. Custom designs and manual coding are intrinsically expensive and error prone. Construction of software from standard reusable components is the only method that can make permanent improvements to quality and productivity at the same time.
Of course development of the reusable components themselves will be slower and more expensive than custom development, but it would soon return a positive ROI as reuse went above about five applications using one standard component. The catalog process for the reusable materials would be based on a formal taxonomy of software features, which is a topic needing more study and research.

CHAOS report

See the Standish report discussed later in this paper. This is an annual report on IT project failures published by the consulting company the Standish Group. Due to the extensive literature surrounding this report, a Google search is recommended.

Chaos theory

Chaos theory is an important subfield of mathematics and physics that deals with the evolution of systems that are strongly influenced by initial or starting conditions. The sensitivity to initial conditions has become popular in science fiction and is known as the butterfly effect, based on a 1972 paper by Edward Lorenz that included a statement that a
butterfly flapping its wings in Brazil might cause a tornado in Texas. Chaos theory seems to be a factor in the termination of software applications prior to delivery. It may also play a part in software breach of contract litigation. Chaos theory deals with abrupt departures from trend lines. By contrast Rayleigh curves, discussed later in this paper, assume smooth and continuous trend lines. Since about 32% of large systems over 10,000 function points in size are cancelled prior to completion, it seems obvious that both Rayleigh curves and chaos theory need to be examined in a software context. A deeper implication of chaos theory is that the outcomes of software sequences or systems are not predictable even if every step is determined by the prior step. From working as an expert witness in a number of lawsuits, it does seem probable that chaos theory is relevant to breach of contract lawsuits. Failing projects and successful projects sometimes have similar initial conditions, but soon diverge onto separate paths. Chaos theory needs additional study for its relevance to a number of software phenomena.

Cloud measures and metrics

As we all know, cloud computing is the wave of the future. Some standard metrics such as function points work well both for development of cloud applications and for consumption and usage of cloud services. Financial measures such as ROI also work, but need to be adjusted for fixed and variable costs. Consumption of cloud services will be a critical factor, and here function points are among the best metrics. For example, a cloud-based software estimation tool will be about 3,000 function points in size. If this cloud tool is used 10 hours per month by 1,000 cloud subscribers, that is a monthly consumption rate of 30,000,000 function points. Function points are already used for consumption studies, and this should add value to cloud economic analysis.
For consumption studies individual function points may be too small, so a metric of kilofunction points, similar to kilowatts, may be needed.

CMMI levels

One of the interesting byproducts of the capability maturity model integrated (CMMI) is the placement of software organizations (not specific projects) on a plateau of five levels indicating increasing sophistication: Level 1 = initial; Level 2 = managed; Level 3 = defined; Level 4 = quantitatively managed; Level 5 = optimizing. From a study by the author that was commissioned by the U.S. Air Force, ascending the CMMI ladder tends to improve both quality and productivity, with the caveat that the best Level 1 groups are actually better than the worst Level 3 groups. The CMMI approach is widely used in the defense community but not used much by the commercial sector. An older ranking scale developed by the author a year before the SEI was incorporated is also relevant and is widely used in the commercial sector. The author's rankings move in the opposite direction from the SEI rankings. It is too bad that the SEI did not check the literature before bringing out their own metric. The author's metric also uses a 5-point scale: Level 1 = expert; Level 2 = above average; Level 3 = average; Level 4 = below average; Level 5 = inexperienced. The author's scale supports two decimal places of precision, such as 2.25, rather than the integer values used by the SEI. The author's scale can be converted to the
equivalent SEI scale. Both scales are used in the Software Risk Master (SRM) tool in both predictive and measurement modes.

Cognitive Dissonance

The phrase cognitive dissonance refers to both a theory and a set of experiments by the psychologist Dr. Leon Festinger on opinion formation and entrenched beliefs. Dr. Festinger found that once a belief is strongly held, the human mind rejects evidence that opposes the belief. When the evidence becomes overwhelming, there is then an abrupt change of opinion. Cognitive dissonance is common in scientific fields and explains why theories such as sterile surgical procedures, continental drift, and Darwin's theory of evolution were rejected by many professionals when the theories were first published. Cognitive dissonance is also part of military history and explains the initial rejection of major innovations such as replacing muskets with rifles, the rejection of screw propellers for naval ships, the rejection of naval cannon mounted on adjustable mounts, the rejection of iron-clad ships, and the initial rejection of Samuel Colt's revolver. Cognitive dissonance is also part of business and caused the initial rejection of air conditioning, and the initial rejection of variable-speed windshield wipers (later the wiper idea was accepted, which led to patent litigation by the inventor). Cognitive dissonance is also part of software and explains why various metrics are still used even though they have been proven to be inaccurate. Thus cognitive dissonance plays a part in the continued use of lines of code metrics, cost per defect, and other metrics with mathematically proven flaws. Apparently the weight of evidence is not yet strong enough to cause an abrupt switch to function point metrics.

Cohesion

The cohesion metric is one of several developed by Larry Constantine. See also coupling later in this paper. The cohesion metric deals with how closely related all parts of a module are.
High cohesion implies that all parts of a module are closely related to whatever functionality the module provides, and hence the module is probably easy to read and understand.

Complexity

The scientific literature encompasses no fewer than 25 discrete forms of complexity. Software engineering has managed to ignore most of these and tends to use primarily cyclomatic and essential complexity, Halstead complexity, and the subjective complexity associated with function point metrics. However many other forms of complexity, such as fan complexity, flow complexity, syntactic complexity, semantic complexity, mnemonic complexity, and organizational complexity, also have tangible impacts on software projects and software applications. The full suite of 25 different forms of complexity is discussed in the author's book Estimating Software Costs. The topic of complexity needs additional study in a software context because major forms of complexity are not included in either software cost estimates or software benchmarks as of 2014.
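Of the complexity measures named above, cyclomatic complexity is the easiest to illustrate. For a single-entry, single-exit routine it equals the number of binary decision points plus one; the sketch below uses that counting form rather than the full control-flow-graph formula:

```python
def cyclomatic_complexity(decision_points):
    """McCabe's cyclomatic complexity for a single-entry,
    single-exit routine: binary decisions + 1."""
    return decision_points + 1

# A routine with three if-statements and two loop conditions
# has five decision points, hence complexity 6.
print(cyclomatic_complexity(5))  # 6
```

Routines above a complexity of roughly 10 are commonly flagged for restructuring, and as noted under bad fix injections, very high complexity sharply raises bad-fix rates.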
Consequential damages

Consequential damages is a legal term that refers to harm experienced by a customer as the result of a product malfunctioning or failing. As it happens, software is extremely likely to malfunction or fail, and hence probably causes more consequential damages than any other manufactured product in the 21st century. Examples of consequential damages include having to restate prior-year financial results based on bugs in accounting software; deaths or injuries due to malfunctions of medical software; huge financial losses in stock markets due to malfunctions of stock trading software; errors in taxes and withholding due to errors in software used by tax collection agencies; and many more. Consequential damages are not included in either cost of quality (COQ) metrics or technical debt metrics. One reason for this is that software developers may not know of consequential damages unless they are actually sued by a disgruntled client. Even then the consequential damages will be for only a single client unless the suit is a class action. In the modern world of 2014 software bugs probably cost more than a trillion dollars per year in consequential damages, without any good way of measuring the actual harm or total costs to many industries.

Contracts fixed cost

The concept of a fixed cost or fixed price contract is that the work will be performed for an agreed-to amount even though there may be changes in scope. Fixed cost contracts have a tendency toward litigation in several situations. In one case where the author was an expert witness, the client added 82 major changes totaling over 3,000 function points. The client did not want to pay since it was a fixed price contract, even though there was a clause for out-of-scope changes. The court decided in favor of the vendor, who did get paid. In another case, an arbitration, the client agreed to pay for changes in scope, but only the amount agreed to in the contract.
Some of the late changes cost quite a bit more due to the need for extensive changes to the architecture of the application, and then regression testing. Fixed cost contracts need to include clauses for out-of-scope changes and also a sliding scale of costs for changes made late in the development cycle. Fixed cost contracts also need constant monitoring by clients and by vendor executives.

Contracts time and materials

The concept of a time and materials contract is that the vendor will charge for actual hours expended and also charge for any tools or materials acquired, such as certified reusable components. Time and materials contracts tend to keep good records of hours expended and hence are useful for historical productivity studies. A caveat is that some vendors have a tendency either to work slowly or to put in additional team members that may not be needed. Therefore it is useful to have reliable benchmark data from similar projects. Also useful would be formal estimates, prior to starting, that are agreed to by both the client and the vendor. Tools such as Software Risk Master (SRM) can predict size, costs, and schedules before projects start. Time and materials contracts need careful planning prior to starting to ensure that they are not
extended artificially by vendors, which has been observed with some government time and materials contracts.

Contracts using function points

The government of Brazil requires function points for all software contracts. The governments of Italy and South Korea are considering the same requirement. Function points are very good contract metrics because of the large volume of benchmark data available. Further, function points lend themselves to using a sliding scale of costs for handling requirements creep and even for removing features from software. A number of civilian outsource companies are also using function points for software contracts, and this trend is expanding in 2014. Function points are also valuable for activity-based cost analysis and can be applied to earned value measurements, although the U.S. government and the Department of Defense are behind the civilian sectors in function point usage. Function points are already playing a major role in software litigation for breach of contract, poor quality, and other endemic problems that end up in court.

Cost Center

The phrase cost center is an accounting term that refers to a corporate organization that does not add to bottom-line profit but does expend costs. A profit center is an organization that does produce revenue. The majority of internal software groups that build software for their companies' own use operate under a cost-center model; i.e., they develop the software without charging the users. Because there are no charges, software measurement practices for cost centers tend to leak and omit major software cost elements such as unpaid overtime and management. Among the author's clients the average completeness of software cost data under the cost-center model is only about 37% of true costs.

Cost drivers

Quite a few software researchers such as Barry Boehm, Ian Sommerville, and the author use the concept of cost drivers. Normal project accounting keeps track of costs by activity.
However cost drivers aggregate costs across all activities. For example one of the cost drivers used by both Boehm and Jones is software documentation, which spans every phase and almost every activity. In total more than 100 documents can be created for large systems, and these often cost more than the code itself. The four major cost drivers cited by the author of this paper for specific projects are: 1) finding and fixing bugs; 2) document creation; 3) meetings and communications; 4) requirements creep. When looking at larger national results across thousands of projects, additional cost drivers include: 1) cancelled projects; 2) cyber attacks; 3) cyber attack recovery and reparations; 4) litigation for breach of contract, intellectual property, and other causes. Cost drivers are useful for software economic analysis because they highlight major areas that need study and improvement.
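As a minimal sketch of the distinction between activity accounting and cost drivers, activity-level costs can be rolled up into driver totals. The activity names and dollar amounts below are hypothetical, chosen only for illustration:

```python
# Sketch: aggregating activity-level costs into cross-cutting cost drivers.
# Activities and dollar amounts are hypothetical illustrations only.
from collections import defaultdict

# (activity, cost driver it belongs to, cost in dollars)
activity_costs = [
    ("requirements documents", "document creation",           40_000),
    ("design documents",       "document creation",           60_000),
    ("code inspections",       "finding and fixing bugs",     30_000),
    ("testing",                "finding and fixing bugs",     90_000),
    ("status meetings",        "meetings and communications", 25_000),
    ("scope changes",          "requirements creep",          35_000),
]

def costs_by_driver(rows):
    """Sum activity costs into their cost-driver buckets."""
    totals = defaultdict(int)
    for _activity, driver, cost in rows:
        totals[driver] += cost
    return dict(totals)
```

Normal project accounting reports the left-hand column; a cost-driver view reports the totals per bucket, which is what highlights where improvement efforts should go.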
Cost of quality (COQ)

The cost of quality metric is much older than software and was first made popular by the 1951 book Juran's QC Handbook by the well-known manufacturing quality guru Joseph Juran. Phil Crosby's later book, Quality is Free, also added to the literature. Cost of quality is not well named because it really focuses on the cost of poor quality rather than the cost of high quality. In its general form cost of quality includes prevention, appraisal, failure costs, and total costs. When used for software the author of this paper modifies these terms for a software context: defect prevention, pre-test defect removal, test defect removal, and post-release defect removal. The author also includes several topics that are not part of standard cost of quality analysis: the cost of projects canceled due to poor quality; the cost of consequential damages or harm to customers from poor quality; and the cost of litigation and damage awards due to poor quality.

Cost per defect

As discussed earlier in this paper, the fixed costs of defect removal, such as writing test cases and having maintenance programmers ready and waiting, cause cost per defect to rise steadily throughout the development cycle. Fixed costs also cause cost per defect to be cheapest for the buggiest software, which clouds and confuses the economic study of software quality. This metric is not suitable for economic analysis. The alternate metric, defect removal cost per function point, is a much better indicator of the actual benefits of high quality. See tables 2 and 3 earlier in this report for examples of both cost per defect and cost per function point for defect removal tasks. Cost per defect also ignores the main value points of high quality, i.e., shorter schedules, lower costs, and more satisfied customers.

Cost per function point

If used carefully, cost per function point is the top-ranked metric for software economic analysis.
However there are some caveats and cautions that need to be understood: 1) Costs vary by size: large systems cost more than small programs. 2) Costs vary by type of software: systems and embedded software cost more than web and IT applications. 3) Costs vary by geographic area: rural locations such as Nebraska cost less than urban areas such as New York or San Francisco. 4) Costs vary by industry: some industries such as banking have much higher costs than others such as manufacturing. 5) Costs vary by country: some countries such as Switzerland cost a lot more than others such as Pakistan. 6) Costs vary by time, both while projects are in progress and also after release when new features are added. Continuous growth of software over time requires that cost data be renormalized from time to time, such as once per year after release. (If a project is 1,000 function points at initial release but grows by 100 function points per year for 10 years in a row, then cost per function point needs annual adjustments to include the current year's changes.)
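The renormalization in the parenthetical example can be sketched directly. The dollar figures below are hypothetical, chosen only to show how the unit cost shifts as the application grows:

```python
# Hypothetical sketch of annual renormalization of cost per function point.
# An application is 1,000 FP at release and grows 100 FP per year; both the
# cumulative cost and the size must be updated before recomputing the ratio.

def cumulative_cost_per_fp(initial_fp, initial_cost, growth_fp_per_year,
                           enhancement_cost_per_year, years_after_release):
    """Cumulative cost per function point, renormalized for growth."""
    total_fp = initial_fp + growth_fp_per_year * years_after_release
    total_cost = initial_cost + enhancement_cost_per_year * years_after_release
    return total_cost / total_fp

# At release: $2,000,000 for 1,000 FP -> $2,000 per FP (costs hypothetical).
year0 = cumulative_cost_per_fp(1_000, 2_000_000, 100, 150_000, 0)
# After 10 years of growth: $3,500,000 for 2,000 FP -> $1,750 per FP.
year10 = cumulative_cost_per_fp(1_000, 2_000_000, 100, 150_000, 10)
```

Without the annual adjustment, benchmarks for the grown application would silently divide current costs by the obsolete release-day size.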
Cost per LOC and KLOC

Both cost per LOC and cost per KLOC (with K standing for 1,000 LOC) have been in use for more than 50 years but suffer from severe errors when comparing projects coded in different programming languages or combinations of languages. The table below shows cost per LOC side by side with cost per function point to illustrate the errors:

Table 9: Cost of Development for 10 Versions of the Same Software Project
(A PBX Switching System of 1,500 Function Points in Size)

Language       Effort     Burdened    Burdened      Burdened     Burdened
               (Months)   Monthly     Costs         Cost per     Cost per
                          Salary                    Funct. Pt.   LOC
Assembly        781.91    $10,000     $7,819,088    $5,212.73    $20.85
C               460.69    $10,000     $4,606,875    $3,071.25    $24.18
CHILL           392.69    $10,000     $3,926,866    $2,617.91    $24.93
PASCAL          357.53    $10,000     $3,575,310    $2,383.54    $26.19
PL/I            329.91    $10,000     $3,299,088    $2,199.39    $27.49
Ada83           304.13    $10,000     $3,041,251    $2,027.50    $28.56
C++             293.91    $10,000     $2,939,106    $1,959.40    $35.63
Ada95           269.81    $10,000     $2,698,121    $1,798.75    $36.71
Objective C     216.12    $10,000     $2,161,195    $1,440.80    $49.68
Smalltalk       194.64    $10,000     $1,946,425    $1,297.62    $61.79
Average         360.13    $10,000     $3,601,332    $2,400.89    $27.34

As can be seen, cost per LOC reverses true economic productivity and makes the most expensive version, coded in assembly language, look cheaper than the least expensive version, coded in Smalltalk. The errors of LOC metrics are clearly visible in cases such as table 9, where identical applications coded in different languages are shown to highlight LOC metric errors.
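The reversal in table 9 can be checked with two of its rows. The LOC counts below are back-computed from the table's cost-per-LOC column (roughly 375,000 LOC for assembly and 31,500 for Smalltalk) and should be treated as approximations:

```python
# Both versions implement the same 1,500 function point PBX application.
FUNCTION_POINTS = 1_500

def unit_costs(total_cost, lines_of_code):
    """Return (cost per function point, cost per LOC)."""
    return total_cost / FUNCTION_POINTS, total_cost / lines_of_code

asm_per_fp, asm_per_loc = unit_costs(7_819_088, 375_000)  # assembly version
st_per_fp, st_per_loc = unit_costs(1_946_425, 31_500)     # Smalltalk version

# True economics: assembly is roughly 4x more expensive per function point...
assert asm_per_fp > st_per_fp
# ...yet cost per LOC makes assembly look about 3x cheaper.
assert asm_per_loc < st_per_loc
```

The numerators are identical in both ratios; only the denominator changes, which is why a denominator that shrinks with higher-level languages (LOC) inverts the comparison.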
Cost per story point

The cost per story point metric is useful for projects utilizing user stories as a requirements and design method. However this metric cannot be used for large-scale economic studies involving projects that use requirements methods other than story points. As of 2014 story points have no ISO standards and no certification examinations, and have been observed to vary by as much as 400% from company to company. There is very little benchmark data available using story points, and what data is available needs to be used with caution due to the variability of this metric.

Coupling

This is another interesting metric developed by Larry Constantine (see also cohesion). The coupling metric refers to how modules exchange or share information. Coupling can range from low coupling to high coupling. Low coupling tends to be associated with well-structured software that is easy to read and comprehend. There are many forms of coupling, ranging from no coupling at all through content coupling, in which a module depends upon the inner workings of another module. Some forms of coupling include data coupling, temporal coupling, stamp coupling, message coupling, control coupling, and others as well. Coupling and cohesion are often used as a set of related metrics.

Currency exchange rates

For international projects software personnel will probably be paid in local currencies. Since currency exchange rates vary every day as well as over longer time periods, this is a significant issue for accurate estimates for international projects. Currency exchange rates are an economic topic that affects all industries that work globally, not just software. The most common method of dealing with currency exchange rates in software estimates is to use current values and then make adjustments later if significant changes occur. Currency exchange rates also play a part in global outsource contracts and are somewhat related to inflation rates.
Both inflation and currency exchange rates can make long-range projects unpredictable.

Customer satisfaction metrics

Most large software companies devote considerable time and energy to finding out whether customers are happy with their software. Usually questionnaires or interviews are created by human factors specialists or even by psychologists. Studies of customer satisfaction at IBM noted a strong correlation between delivered defect rates and overall satisfaction. In fact studies of many kinds of consumer products such as televisions, stereos, etc. found that quality was the number one determinant of high satisfaction. Other topics include speed of defect repairs, ease of reaching support teams, and aesthetic factors.
Customer support metrics

Two endemic problems of the software industry are that software projects have too many bugs after release and that software support personnel are difficult to reach because there may not be enough of them. Because live support personnel are costly, many companies in expensive countries such as Japan, Switzerland, and the United States outsource customer support to countries with lower labor costs such as India. The number of customer support personnel needed to allow clients to reach a live person in less than 5 minutes by phone or email can be estimated, and some tools such as Software Risk Master (SRM) include these predictions. Some sophisticated companies such as Apple have calculated customer support needs and do a pretty good job. Others, such as Verizon, have ignored their customers and have inadequate support where it is next to impossible to reach a live support person. Customer support staffing is based in part on expected numbers of post-release defects, in part on expected numbers of clients using the software, and in part on whether support will be available 24 hours per day or only during one or two shifts.

Cyber attack metrics

Cyber attacks are increasing in variety and frequency. A Google search on cyber attacks and cyber attack metrics will show current data, which changes almost daily. Some of the important metrics for cyber attack frequency and origins are kept by government agencies such as the FBI, Homeland Security, and the CIA. The Congressional Cyber Security Caucus, started by Representatives Jim Langevin (Democrat) of Rhode Island and Mike McCaul (Republican) of Texas, publishes excellent weekly summaries on cyber attacks and is highly recommended since it is both free and contains valuable data. (This is a rare instance of cooperation between Democrats and Republicans.) Cyber attack data includes numbers of attacks by type, such as denial of service, viruses, worms, etc.
It should also include the value of any stolen materials, the number of citizens whose data is compromised, and the eventual costs of recovering from cyber attacks. Many companies have been lax and even incompetent in reporting cyber attacks, and a few have lagged in notifying customers of possibly stolen data. These problems are endemic in 2014 and seem to be growing worse. Cyber attacks have moved from individual hackers to organized crime and also to hostile national governments, all of whom have active cyber warfare units that seek out weaknesses in other countries including the United States. This is a huge problem and it will be getting worse.

Cyclomatic complexity

This metric is one of the most widely used indicators of software structure, along with essential complexity. The cyclomatic complexity metric was developed in 1976 by Tom McCabe. It is based on graph theory and is an expression of the control flow graph of an application. Cyclomatic complexity for software with no branches is 1. As the number of branches increases, cyclomatic complexity increases. Once cyclomatic complexity rises above 20 it is hard to follow the flow and hence some branches may be wrong. The formula for cyclomatic complexity is graph edges minus nodes plus 2. Cyclomatic complexity plays a part in estimating test cases and
estimating maintenance effort. High cyclomatic complexity levels are also cited in litigation for poor quality. An interesting theoretical question is whether or not code with low cyclomatic complexity is possible for very complex problems. See also essential complexity and Halstead complexity.

Dashboard

The term dashboard is much older than software and has been applied to the control panels of various devices such as automobiles, where instruments provide useful information to the operator. In a software context the term dashboard refers to a continuous display of information about a project's status, including but not limited to completed tasks versus unfinished tasks, completed test cases versus unfinished test cases, and completed documents versus unfinished documents. A number of commercial and some open-source tools provide automated or semi-automated dashboards for software projects. Some of these support a number of projects at the same time and are useful for portfolio analysis and also for data center analysis when many applications are executing simultaneously.

Data point metrics

Major corporations own more data than they own software. Data is expensive to create and maintain and is known to contain many errors. But as of 2014 there is no effective size metric for databases and repositories. It is theoretically possible to construct a data point metric that would resemble the structure of function point metrics but would size data volumes. A data point metric would be useful in studies of data ownership and data quality. Some of the atomic elements of a data point metric might be logical files, entities, relationships, attributes, inquiries, and interfaces. The fundamental idea is to have function points and data points relatively congruent. If this were the case, then an application such as a web site might be sized at 10,000 function points and 15,000 data points.
The idea is to allow better estimates and better benchmarks for data-rich applications such as medical records, tax records, retail chain web sites, and many others.

Defect (definition)

There is a somewhat pedantic academic discussion of the differences between a failure, a fault, a defect, a bug, an error, an incident, an anomaly, etc. The term defect is a good general-purpose term that can encompass all of these. A defect is an accidental mistake by a human that causes either total stoppage of software, unacceptably slow performance, or the creation of incorrect data and results. Defects can originate from multiple sources. A requirements defect would be something like the Y2K problem. A design defect would be something like understating a performance goal. An architectural defect would be something like using client-server when a distributed network would be better. A code defect is something like branching to the wrong address. A document defect is something like omitting a step in the install procedure for software. A bad-fix defect is a bug in an attempt to fix a prior bug.
Defect Detection Efficiency (DDE)

This is one of two quality metrics developed by IBM circa 1970. Defect detection refers to the percentage of bugs identified prior to release. See also the next metric, defect removal efficiency (DRE). In earlier eras DDE and DRE were almost identical, and only a few bugs found on the actual day of release were not fixed by the time of release. In today's world of 2014, with bigger systems and greater schedule urgency, DDE averages over 10% higher than DRE. That is, at least 10% of the known bugs in a software application are not fixed before the software is released. Of course these bugs cause problems and create consequential damages, not that anybody cares about clients anymore.

Defect Removal Efficiency (DRE)

The phrase defect removal efficiency refers to the percentage of bugs found and fixed prior to release, when compared to customer-reported bugs in the first 90 days after release. If a development team finds and fixes 990 bugs and clients find and report only 10 bugs in the first three months, then DRE would be 99%, an excellent result. However as of 2014 average DRE hovers around 90%, and for agile around 92%. The best DRE values come from synergistic combinations of pre-test inspections and static analysis combined with formal testing using mathematical test case design and at least nine test stages: 1) subroutine test of code segments; 2) unit test of modules; 3) function test; 4) regression test; 5) component test; 6) performance test; 7) security test; 8) system test; 9) acceptance or Beta test. This combination usually tops 99% in DRE. Projects that omit pre-test defect removal and use only three or four test stages are often below 85% in DRE. DRE is probably the most useful and effective quality metric. It is easy to measure, and high levels of DRE correlate with high productivity, high levels of customer satisfaction, and high levels of team morale.
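The DRE arithmetic is simple enough to express directly; a minimal sketch using the definition above:

```python
def defect_removal_efficiency(defects_removed_before_release,
                              defects_reported_first_90_days):
    """DRE as a percentage: defects found and fixed before release divided
    by that count plus customer-reported defects in the first 90 days."""
    total = defects_removed_before_release + defects_reported_first_90_days
    return 100.0 * defects_removed_before_release / total

# The example from the text: 990 bugs found and fixed before release,
# 10 reported by clients in the first three months -> 99% DRE.
dre = defect_removal_efficiency(990, 10)
```

Note that the divisor is only a proxy for total defects: bugs surfacing after the 90-day window are not counted, which is part of why DRE is easy to measure consistently.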
Defect Density

For many years defect density has informally been defined as defects per KLOC. This of course omits requirements and design defects, which often outnumber code defects. Worse, this definition penalizes high-level languages. Assume you have 5,000 lines (5 KLOC) of assembly code with 50 bugs. Now assume the same algorithms are coded in 1,000 lines (1 KLOC) of Java with 10 bugs. Both have exactly 10 bugs per KLOC as apparent defect densities, even though assembly has five times as many code bugs. Assume both versions were 20 function points in size. With this assumption assembly has 2.5 bugs per function point while Java has only 0.5 bugs per function point. As can be seen, defects per function point correctly reflects the reduced bug counts, while KLOC metrics don't show any value for reduced defect volumes. For that matter defects per function point can also include bugs in requirements, design, architecture, user documents, and all other categories.
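The worked example can be restated in code to show the two densities side by side:

```python
# Same algorithms: 5 KLOC of assembly with 50 bugs vs. 1 KLOC of Java
# with 10 bugs; both versions are 20 function points in size.

def defects_per_kloc(defects, kloc):
    return defects / kloc

def defects_per_function_point(defects, function_points):
    return defects / function_points

# Per KLOC the two look identical, hiding assembly's fivefold defect count.
assert defects_per_kloc(50, 5) == defects_per_kloc(10, 1) == 10.0
# Per function point the difference is visible: 2.5 vs. 0.5.
assert defects_per_function_point(50, 20) == 2.5
assert defects_per_function_point(10, 20) == 0.5
```

As with cost per LOC, the distortion comes entirely from the denominator: KLOC shrinks along with the defect count when a higher-level language is used, so the ratio stays flat.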
Defect discovery factors

When software applications are released they still contain defects. These are known as latent defects until they are discovered. This brings up the key topic of what factors lead to the discovery of latent defects. As it turns out, numbers of users, amount of usage, and variety of usage are the three critical factors. One user using software for one hour a month and doing one task probably won't find many latent defects. One million users using software 24 hours a day for several thousand different kinds of tasks will probably flush out the majority of latent defects in a month or two. These factors explain the measured differences in defect discovery rates for various kinds of software. Embedded and systems software usually have the fastest defect discovery rates; web projects with thousands of visitors also have fast defect discovery rates. Interestingly agile projects, which are often done to support fewer than 100 users, have fairly slow defect discovery rates, which may lead to a premature assertion that agile quality is better than it really might be.

Defect origins

The phrase defect origins was first defined in IBM circa 1968. It identifies the specific place where a software defect was created. There are six common software defect origins: 1) software requirements; 2) architecture; 3) design; 4) source code; 5) user documents; 6) bad fixes, or secondary bugs in defect repairs. There are other sources of defects, such as data errors and bugs in test cases, but these are not normally included in software defect measurements. When bugs were reported, IBM quality engineers noted the point in time where the bug was found (inspection, testing, deployment, etc.) and also noted the place where the bug was created. They then assigned an origin code to each bug. This allowed IBM to explore quality in a fairly sophisticated way, and led to many important findings such as the fact that requirements and design errors often outnumber code errors.
Make no mistake, a requirements defect such as Y2K will eventually end up in source code, but that is not where the Y2K bug started. It started as an explicit user requirement to conserve space by using only two digits for date fields. Every company that builds software should explore its own defect origins, as indeed many do.

Defect resolution time

This topic refers to the number of hours or days between the point in time when a bug is first reported and the point in time when users get a new version of the software with the bug fixed. Early on, during requirements and design when software is still easily changeable, defect resolution time is normally less than 8 hours. As software development proceeds, more and more artifacts may require correction. For example a requirements bug found during testing may require changes to the original requirements specification, design documents, probably the source code, and perhaps even test cases. The logistics of defect repairs become more convoluted with time. Once defects are fixed and tested they may not be immediately released to customers. For low-severity bugs defect repairs are usually aggregated into the next release. However for high-severity bugs, and especially severity 1 bugs when the software does not work at all, patches and emergency repairs are sent out as needed.
Defect severity levels

All bugs are not equally serious. Back in the early 1960s, when software was first becoming a business tool, IBM recognized that software bugs needed to be classified. The original IBM classification is still working after more than 50 years. Under this classification severity 1 is the highest and indicates that the software does not work at all. Severity 2 is second and indicates that a major feature is disabled. Severity 3 is next and indicates either a minor issue or one with an available workaround. Severity 4 means a cosmetic problem such as a spelling error that does not affect software operation at all. There are some other categories besides severity: invalid defect reports and duplicate defect reports. From analysis of hundreds of projects, the normal technical distribution by severity level after release would be: severity 1 = 1%; severity 2 = 15%; severity 3 = 35%; severity 4 = 49%. However because most companies fix high-severity bugs more quickly than low-severity bugs, clients tend to try to push low-severity bugs up into severity level 2 in order to get a quicker repair. Some applications have more than 50% of bugs reported as severity 2 by clients, even for trivial issues such as the placement of text on a screen or the color of a display. Also, some defect reports turn out to be suggested improvements that are reported as defects. The actual determination of defect severity is usually assigned to a quality assurance or maintenance team that sometimes has to negotiate with clients.

Deferred features

For many software projects either clients, executives, or business pressures such as government laws and mandates dictate schedules that are shorter than technically possible. In the case of impossible schedules something has to give, and it is often features that are desirable but not mandatory. Below 100 function points software is usually delivered close to 100% complete.
Above 10,000 function points it is not uncommon for the first release to omit more than 35% of planned features in order to meet a delivery date shorter than the full feature set would allow. Deferred features are an endemic problem of large software projects. An interesting law by Chris Winter of IBM is that "80% of features delivered on time are more valuable than 100% of features delivered late."

Delphi methods

The term Delphi method is based on the line of famous Greek oracles who lived in the temple at Delphi and were sought out by various leaders to predict future events. In the modern Delphi method, panels of experts answer questions in a formal, structured way but anonymously. After the first round of questions a second round is prepared using summaries from the first round. There may be additional rounds until a concurrence of opinions is reached. The concept is based on the hypothesis that groups of experts can pool their knowledge and do a better job of prediction than a single expert. Delphi is used more for corporate decisions than for software decisions, but is sometimes used for major applications with high risks. Since Delphi depends on expertise it is important to select participants who have actual knowledge of the issues.
Delivered defects

Because software defect removal efficiency is almost always below 100% and often below 90%, the great majority of software applications are delivered with latent defects. By using historical data, major companies such as IBM are able to predict delivered defects in future projects, and also use effective methods to keep delivered defects at very low levels. Some tools such as Software Risk Master (SRM) predict delivered defects as a standard feature. In fact SRM predicts not only total defects but also delivered defects by origin; i.e., defects caused by requirements, design, code, bad testing, etc. Delivered defects are predicted by using defect potentials and defect removal efficiency (DRE). They are of course measured as they occur. Note that not all defects are found until several years after release. Indeed, delivering software with excessive defects slows down defect discovery because clients don't trust the software and avoid using it if possible. Annual reports by clients of delivered defects cover only around 30% of actual latent defects for IT applications, but more for systems and embedded software. The ranges of delivered defects per function point in the author's data are wide, and also expand as application size increases. Top projects by top companies release few defects at any size. Laggards and even average companies deliver far too many defects at all sizes. See also the discussion of technical debt.

Design, Code, and Unit Test (DCUT)

The DCUT method has been in use for perhaps 50 years or more. In the 1960s DCUT comprised about 85% of the total work of software and was a reasonably useful approach for both estimates and benchmarks. Today in 2014, with more than 125 occupation groups involved with large systems, DCUT comprises less than 30% of total development effort.
For example DCUT excludes quality assurance, technical writers, integration and configuration control, project offices, and project managers. DCUT is not an effective method today and should be replaced by activity-based cost analysis that includes the entire suite of software development activities.

Dilution

The term dilution refers to the loss of equity that entrepreneurs may experience if they receive venture capital, and especially if they receive more than one round of venture capital. In order to get funding for a software company or major project from a venture funding source, probably 20% of the ownership will be turned over to the venture capitalists. If the project runs through the initial investment, which many do, and second- or third-round financing is needed, the entrepreneurs occasionally end up with less than 15% ownership. Quite a few venture-funded companies fail completely and go bankrupt. As a service to software entrepreneurs, Software Risk Master (SRM) includes a venture funding routine that will predict both the number of rounds of funding and the probable dilution of ownership. It can also predict the odds of failure
or bankruptcy. Since these predictions can be done early, before any money is committed at all, hopefully both the entrepreneurs and the venture capitalists will have a good preview of probable results before committing serious money.

Documentation costs

Documentation costs are fairly sparse for small projects and especially for agile projects. However for large systems above 10,000 function points, and especially for government and military software projects, more than 100 kinds of documents might be created and the total costs of these documents are often greater than the costs of the source code itself. Software Risk Master (SRM) has a standard feature for predicting document numbers, pages, words, and costs. For a small project of 10 function points total document pages will be around 50. For projects of 100 function points total document pages will be around 400. For projects of 1,000 function points total document pages will be around 3,500. For projects of 10,000 function points total document pages will be around 32,500. Studies by the author have noted that pages per function point tend to decline for larger applications, since full documentation might exceed the lifetime reading speed of a single individual. Document costs need additional research in the software engineering field. For large civilian systems document costs are the #2 cost driver, and for large defense systems sometimes the #1 cost driver, even going past finding and fixing bugs. Function points are the best metrics for studying document costs. To highlight the huge volume of documents for major systems, following are the numbers and sizes of documents for a systems software application of 25,000 function points, such as a central office switching system:
Document Sizes

Document             Pages        Words     Percent Complete
Requirements         4,936    1,974,490     61.16%
Architecture           748      299,110     70.32%
Initial design       6,183    2,473,272     55.19%
Detail design       12,418    4,967,182     65.18%
Test plans           2,762    1,104,937     55.37%
Development plans    1,375      550,000     68.32%
Cost estimates         748      299,110     71.32%
User manuals         4,942    1,976,783     80.37%
HELP text            4,965    1,986,151     81.37%
Courses              3,625    1,450,000     79.85%
Status reports       3,553    1,421,011     70.32%
Change requests      5,336    2,134,284     66.16%
Bug reports         29,807   11,922,934     76.22%

Note that document completeness is also a problem for large systems. Document completeness is inversely proportional to application size measured in function points. For example, complete requirements and design documents are only possible for small applications below about 500 function points in size. Above that, applications grow during development, and the larger they are the more they grow. Documentation is the #2 cost driver for applications larger than 10,000 function points. For military and defense projects, which produce about three times the volume of paper of civilian projects, documentation is the #1 cost driver. Agile projects have reduced documentation costs so that they are only the #4 cost driver, below coding. However for agile projects meetings and communication costs may be the #2 cost driver, replacing paperwork costs.

Duplicate defect reports

For commercial and open-source software with hundreds or thousands of users it often happens that the same bugs are reported by more than one customer. These are called duplicate defects. Actual quality is based on valid unique defects, which exclude duplicates. However duplicate defects, if there are many of them, can add considerable expense to maintenance costs.
Duplicate defects still need to be logged and examined. It may also be necessary to notify clients of the receipt of their defect reports. Each duplicate defect can take from 5 to 15 minutes of work. Individually this is not very significant, but if an application receives 10,000 duplicate defect reports the costs can be significant.

Earned-value measurements (EVM)

Earned value is a formal method, accompanied by charts and procedures, for combining scope, progress, costs, and remaining work. Earned value analysis (EVA) is frequently used on government and defense software contracts, but not as often in the civilian sector. EVM originated for Federal government projects in the 1960s. The major components of EVM include a development plan, a valuation of planned work, and pre-defined earning rules for completed work, often linked to payments to contractors. Assume a project is scheduled for 1 year with a budget of $1,000,000. At six months, supposedly 50% of the work should be done. If it is noted that only 30% of the work is done but 50% of the budget is gone, there is a problem that needs to be addressed. EVM is a complex system with many formal definitions and calculations. One issue is that EVM does not include software quality, which can cause trouble for software projects, since schedule slippage is most severe during testing due to having more bugs than anticipated.

Enhancement metrics

As of 2014 there are more enhancement projects for legacy applications than there are new development projects. Enhancements are more difficult to estimate and measure than new software development, because the size, structure, and understanding of the legacy software interact with the enhancement itself. Assume that a new small project of 100 function points is developed. This might require a total of 12 work hours per function point.
Now assume that a 100 function point enhancement is being made to a well-structured, well-documented legacy application of 1,000 function points. Since the architecture and design issues were solved by the legacy application, the enhancement might only require 11 work hours per function point. Now assume that a 100 function point enhancement is to be made to a large system of 10,000 function points with high cyclomatic complexity and missing documentation. In this case, digging into the legacy code and the need to carry out major regression testing might raise the effort to 14 work hours per function point. As can be seen, estimating and measurement need to include both the new enhancement and the legacy application. For measurement, both the specific enhancement and the cumulative total cost of ownership (TCO) for the updated legacy application need to be tracked. In other words, two sets of measures are needed for enhancements. Several commercial parametric estimation tools, such as the author's Software Risk Master (SRM), can predict enhancement costs, but they need input data about the size and decay of the legacy application. See also entropy.
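The three enhancement scenarios above reduce to simple arithmetic. A minimal sketch (the hours-per-function-point rates are taken from the text; the function and label names are invented for illustration):

```python
# Hedged sketch of the three 100-function-point scenarios described above.
# The rates (12, 11, 14 work hours per FP) come from the text; the names
# below are illustrative, not taken from SRM or any other tool.

RATES = {
    "new development (100 FP standalone)": 12.0,
    "enhancement to small, well-structured legacy app": 11.0,
    "enhancement to large, complex, undocumented system": 14.0,
}

def enhancement_effort(size_fp, hours_per_fp):
    """Total work hours = size in function points * work hours per FP."""
    return size_fp * hours_per_fp

for scenario, rate in RATES.items():
    print(f"{scenario}: {enhancement_effort(100, rate):,.0f} work hours")
```

The spread (1,100 to 1,400 work hours for the same nominal 100 function points) is why two sets of measures are needed: one for the enhancement itself and one for the legacy application it lands in.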
Entropy

The concept of entropy is not unique to software but a basic fact of physics: all systems and natural objects tend toward increasing disorder over time, which is called entropy. This is why we age and why stars become supernovae. For software, entropy is observed as a gradual increase in cyclomatic and essential complexity over time, due to the structural damage caused by hundreds of small changes over long periods. It is possible to reverse entropy by restructuring or refactoring software, but this is expensive and unreliable if done manually for large systems. Automated restructuring tools exist, but they support only a few languages and are of uncertain effectiveness. Entropy needs much more study and more direct measurement. Because entropy is associated with all human artifacts and also with all natural systems, it is a fundamental fact of nature.

Error-prone modules (EPM)

In the early 1970s IBM undertook an interesting study of the distribution of bug reports in a number of major software projects including operating systems, compilers, database products, and others. One of the most important findings was that bugs were not randomly distributed through all modules of large systems but tended to clump in a few modules, which were termed error-prone modules. For example, 57% of customer-reported bugs in the IMS database product were found in 32 modules out of a total of 425; more than 300 IMS modules had zero defect reports. Other companies replicated these findings, and error-prone modules are an established fact of large systems. Two common causes for EPM have been noted: 1) high levels of cyclomatic complexity; 2) bypassing or skimping on inspections, static analysis, and formal testing. In theory EPM can be avoided by proper quality control, but even now in 2014 they remain far too common in far too many large applications.
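The clustering effect described above can be illustrated with a small sketch. This is not IBM's actual method, just one hypothetical way to flag the smallest set of modules that accounts for a majority of defect reports:

```python
# Hypothetical EPM filter: rank modules by defect count and flag the
# buggiest modules until they cover `share` of all reported defects.
# Function and data names are invented for illustration.

def error_prone_modules(defects_by_module, share=0.5):
    """Return the modules that together account for `share` of all defects,
    taking the buggiest modules first."""
    total = sum(defects_by_module.values())
    ranked = sorted(defects_by_module.items(), key=lambda kv: kv[1], reverse=True)
    flagged, running = [], 0
    for module, count in ranked:
        if running >= share * total:
            break
        flagged.append(module)
        running += count
    return flagged

# Toy data echoing the IMS pattern: a few modules hold most of the defects.
counts = {"mod_a": 40, "mod_b": 25, "mod_c": 20,
          "mod_d": 5, "mod_e": 5, "mod_f": 5}
print(error_prone_modules(counts))
```

Applied to real defect logs, a filter like this reproduces the IMS pattern described above, where a small minority of modules carried the majority of customer-reported bugs.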
ERP metrics

Enterprise resource planning (ERP) refers to a class of major software applications, such as SAP and Oracle, that attempt to provide an integrated solution to corporate data needs by replacing older legacy applications with a suite of tools for accounting, marketing, manufacturing, customer resource planning, and other common business activities. ERP packages are large, and some top 250,000 function points in size. ERP installation and deployment tend to be troublesome, routinely taking longer and costing more than planned. Some of the ERP topics that need to be measured include training of personnel, installation costs, the cost of migrating data from older software to the ERP package, new applications, and enhancements to existing applications. The ERP packages themselves also require extensive customization. The SAP and Oracle ERP packages use a performance metric based on RICE objects, where RICE stands for reports, interfaces, conversions, and enhancements. A RICE object is a kind of project that needs cost and schedule estimation. Function points can also be used for ERP planning.
Essential complexity

This metric was also developed by Tom McCabe in 1976 and is a variation on his more famous cyclomatic complexity metric. (Note that Fred Brooks also uses the term in a different context, for the minimum set of factors in large, complex problems.) The McCabe form of essential complexity is derived from cyclomatic complexity by replacing well-structured control sequences with a single statement. If a code section has a cyclomatic complexity of 10 but consists of well-structured sequences, its essential complexity might be only 3 or 4.

Experience

The author's benchmark collection method and the Software Risk Master (SRM) tool record experience for a number of occupations, including client experience, software engineer experience, tester experience, software quality assurance experience, project management experience, customer support experience, and several others. Experience is ranked on a subjective scale of 1 to 5, with 1 = expert; 2 = above average; 3 = average; 4 = below average; 5 = inexperienced. Decimal values are accepted. The impact of experience is shown below: As can be seen, the results are slightly asymmetrical. Top teams are about 30% more productive than average, but novice teams are only about 15% lower than average. The reason for this is that normal corporate training and appraisal programs tend to weed out the really unskilled, so that they seldom become actual team members. The same appraisal programs reward the skilled, which explains why the best results have a longer tail. Software is a team activity. The range in performance for specific individuals can top 100%, but there are not very many of these superstars. Only about 5% to 10% of general software populations are at the really high end of the performance spectrum. Individual practitioners can vary in performance by more than 10 to 1, but software is normally a team event.
In any case, top performers are rare and bottom performers are usually terminated, so average performance is the norm, with a weight on the high side. Also, bad management tends to slow down and degrade the performance of top technical personnel, some of whom quit their jobs as a result.

Expert estimation

The term expert estimation refers to manual software estimates by human beings, as opposed to estimates produced by a parametric estimation tool such as COCOMO II, CostXpert, KnowledgePlan, SEER, Software Risk Master (SRM), SLIM, or TrueCost. A comparison by the author of 50 manual estimates and 50 parametric estimates found that below 250 function points manual estimates and parametric estimates were almost identical. As application size increased, manual estimates
became progressively optimistic, predicting shorter schedules and lower costs than actually occurred. Above 5,000 function points, manual estimates even by experts tended to be hazardous and excessively optimistic by more than 35%. This is not surprising, because the validity of historical data is also poor for large systems above 5,000 function points, due to leakage of major cost elements such as unpaid overtime, management, specialists, etc. See also parametric estimation later in this report.

Failing project (definition)

Both the author's books and the Standish report, discussed later in this paper, deal with failing projects. What does failure mean in a software context? The definition used by the author for project failure is: software that is terminated without delivery due to errors, delays, or cost overruns, or software whose development company is sued for breach of contract after delivery for excessive errors. See also the definition of successful software later in this report. In between success and failure are thousands of projects that finally get released but are late and over budget, and probably have too many bugs after delivery. That is the modus operandi for software circa 2014. Another cut at a definition of failing projects would be projects in the lowest 15% in terms of quality and productivity rates from the benchmark collections of companies such as Namcook Analytics, Q/P Management Group, Software Productivity Research, and others.

Failure Modes and Effect Analysis (FMEA)

This methodology was developed in the 1950s for examining hardware failures but has also been applied to software. It is not as common for software as root cause analysis, which is discussed later in this report. FMEA is an inductive approach that works backwards from specific failures and identifies the earlier conditions that led to them. FMEA can work all the way back to development and even design mistakes. It is a common approach for hardware devices; less common for software.
FMEA also includes criticality analysis (CA). FMEA can be used in two directions: 1) a predictive mode for analyzing the risks of future failures; 2) an analytical mode for examining actual failures that have occurred. Due to the complexity of the method, a Google search is recommended to bring up the relevant literature.

Failure rate

The term failure is defined by the author as a software project that is terminated prior to completion due to poor quality, negative return on investment, or some other cause that was self-inflicted by the development team. Projects that are terminated for business reasons, such as buying a commercial software application rather than finishing an internal application, are not failures. The topic of software failures has received a lot of publicity, due in part to the large number of failures included in the Standish Report, produced by the Standish consulting group. However, that report only covers information technology and does not include systems software or commercial software, both of which have lower failure rates. Also, the Standish report does not
show failures by application size, which is a serious omission. The author's data on project failure rates by size is as follows: the probability of a software project failing and not being completed rises with roughly the cube root of the size of the application in IFPUG function points, with the results expressed as a percentage. For 1,000 function points the odds are about 8%; for 10,000 function points about 16%; for 100,000 function points about 32%. These rules are not perfect, but they are based on observations taken from about 20,000 software projects of all sizes and types, including web applications, smart phones, systems software, medical devices, military projects, etc.

False positive

The term false positive refers to misidentifying a code sequence as incorrect when in fact it is correct. False positives can occur with testing and inspections, but the term is most widely used with static analysis tools, some of which may report more than 10% false positives. False positives are annoying, but it is probably safer to have a few false positives than to miss real bugs. Every form of defect removal is below 100% in removal efficiency and produces at least a few false positives.

Feature bloat

This term is not ordinarily quantified; it is a subjective statement that many software packages have features in excess of the ones truly needed by the vast majority of users. For example, the author of this paper has written 16 books and hundreds of articles with Microsoft Word but has probably used less than 15% of the total feature set available in Microsoft Word. This is not to say that the features in Word are useless, but there are so many of them that few authors ever use the majority of available features in either Word or Excel. In theory, feature bloat could be measured with function points and the newer SNAP metric for non-functional size. However, there is a logical inconsistency.
Function points are defined in terms of user benefits, while feature bloat presumably delivers little or no user benefit. This might be resolved by creating a "bloat point" metric, counted like function points but assuming zero user benefit and perhaps zero business value as well. Feature bloat is basically a subjective opinion and not a truly measurable attribute.

Fixed costs

In a manufacturing process, the term fixed cost refers to a cost that stays constant no matter how many products are built per month. A prime example of a fixed cost would be the rent paid for a software office building. For software applications many costs are not fixed in the classic sense of being constant, but they are inelastic and stay more or less the same. For example, requirements and design costs are likely to stay more or less the same regardless of which coding language is used. After release of software, companies will have maintenance personnel standing by to fix bugs regardless of how many bugs are reported by users. Assume a company has a full-time maintenance programmer standing by at a cost of $10,000 per month. Now assume that project A has 10 bug reports in the first month of use. The cost per defect for project A will be $1,000. Now assume that next month the same maintenance programmer is standing by for
project B, which has only 1 bug. Now the cost per defect will be $10,000 for project B. Fixed and variable costs need to be analyzed for software, and especially for quality work. See also variable costs later in this report, and see burden rate earlier in this report for another look at fixed costs.

Function points

In the late 1960s and early 1970s the number of programming languages used inside IBM expanded from assembly to include COBOL, FORTRAN, PL/I, APL, and others. It was found that lines of code metrics penalized high-level languages and did not encompass requirements and design work. IBM commissioned Al Albrecht and his colleagues at IBM White Plains to develop a metric that could include all software activities and was not based on source code. The result was the function point metric, developed circa 1975. In 1978, at a joint conference of Share, Guide, and IBM, Albrecht presented function points to the outside world. Function points then began to be used by IBM customers, and in 1984 the International Function Point Users Group (IFPUG) was formed in Montreal; it later moved to the United States. Function point metrics are the weighted combination of inputs, outputs, inquiries, logical files, and interfaces, adjusted for complexity. In today's world of 2014, function points are supported by an ISO standard and by certification exams. Function points are probably the most widely used software metric in the world, and have more benchmark data than all other metrics put together. There are a number of alternative methods of counting function points, discussed later in this report under the topic of function point variations.

Function point churn

This term refers to changes made after requirements that cause zero change to function point totals. An example would be shifting an input from the top of a screen to the bottom, with no change in content or data. See also function point creep, which does add to function point totals. Churn changes are hard to measure.
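Returning to the fixed-cost maintenance example above: the arithmetic behind the cost-per-defect distortion is trivial but worth making explicit. A minimal sketch (the $10,000 monthly figure and bug counts come from the text; names are illustrative):

```python
# Sketch of the fixed-cost effect described under "Fixed costs": a standby
# maintenance programmer costs $10,000 per month (figure from the text)
# whether one bug or ten bugs are reported, so apparent cost per defect
# rises as defect volume falls.

FIXED_MONTHLY_COST = 10_000  # standby maintenance programmer

def cost_per_defect(bugs_reported):
    """Fixed monthly cost spread over the month's valid defect reports."""
    return FIXED_MONTHLY_COST / bugs_reported

print(cost_per_defect(10))  # project A: 10 bugs -> $1,000 per defect
print(cost_per_defect(1))   # project B:  1 bug  -> $10,000 per defect
```

The higher-quality project B looks ten times worse on a cost-per-defect basis even though the actual monthly expense is identical, which is the paper's core objection to the metric.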
See also zero-size changes.

Function point creep

This term refers to post-requirements changes that add to the function point total of an application. Examples might be adding a new report or including a new input question on a questionnaire. Many of these changes are small. See also function point churn and micro function points. The measured rates of function point creep are determined by counting function points at the end of requirements and then again at delivery. The rates are usually between 1% and 2% per calendar month. Agile projects are usually between 5% and 10% per calendar month due to incremental development and delivery.

Function points per month
The metric function points per month is one of two common productivity metrics based on function points. The second common metric is work hours per function point. The two are mathematically equivalent, but they can produce very different results on a global basis, because the number of effective work hours per month varies from country to country. In the United States the average number of work hours per month is 132; in China, 186; in Sweden, 126; in Iceland, 120; in India, 190; and so forth. This means that a software project requiring 132 hours of work will take one calendar month in the United States but only about three weeks in India and about five weeks in Iceland. Estimating tools such as Software Risk Master (SRM) that support global estimates and produce both function points per month and work hours per function point need to be sensitive to these global work patterns.

Function point variations

Not long after function points were released by IBM and taken over by the International Function Point Users Group (IFPUG), several researchers claimed that IFPUG function points were not accurate for systems software or for other kinds of software outside traditional information technology. The result was a series of alternate counting rules for function points, which produce different results than IFPUG counts. The first of these variations was the Mark II function point in the United Kingdom. It is an interesting sociological phenomenon that all of the variants produce larger counts than IFPUG; none produce smaller counts. The reason is that the inventors of the variants thought that some kinds of software were harder and more complex than information systems. While this may be true, the difficulty could have been handled by the fact that these non-IT applications required more work hours. It was not necessary to puff up the function point counts, but this is what happened. As a result, in 2014 there are many counting rules for function points in addition to the IFPUG counting rules.
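The country work-pattern arithmetic from the function points per month discussion above can be sketched as follows (the hours-per-month figures are those quoted in the text; the function name is invented):

```python
# Sketch of converting a fixed amount of work into calendar months using
# the effective work hours per month quoted in the text for each country.

HOURS_PER_MONTH = {
    "United States": 132,
    "China": 186,
    "Sweden": 126,
    "Iceland": 120,
    "India": 190,
}

def schedule_months(total_work_hours, country):
    """Calendar months for a fixed amount of work at a country's work pace."""
    return total_work_hours / HOURS_PER_MONTH[country]

# A 132-hour task: one month in the U.S., roughly three weeks in India,
# roughly five weeks in Iceland.
for country in ("United States", "India", "Iceland"):
    print(country, round(schedule_months(132, country), 2))
```

This is why function points per month and work hours per function point, though mathematically equivalent, diverge across national boundaries: the hours-to-months conversion factor differs.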
Among the variations are COSMIC function points, engineering function points, feature points, FISMA function points, function points light, Mark II function points, NESMA function points, and unadjusted function points. In fact, new variations appear quite often, so the list of function point variations keeps growing. The author's Software Risk Master (SRM) tool produces size data in a total of 23 of these alternate metrics. However, for expressing productivity and quality data SRM uses IFPUG function points as the standard metric. As of 2014 IFPUG function points have more benchmark data than all of the variants put together. It is mildly surprising that so much energy goes into function point variations when the underlying historical data leaks and is often less than 50% complete. The author's opinion is that these function point variations only muddy the waters and make function points less useful and less consistent than they should be. For unknown reasons human beings seem to like creating metric variations, so we have statute and nautical miles, Fahrenheit and Celsius, British Imperial gallons and U.S. gallons, three methods of calculating gasoline octane ratings, and many other examples of multiple metrics for the same topics.

Gantt charts

The phrase Gantt chart refers to a graphical method for showing overlapped project schedules, first developed in 1910 by Henry Gantt. These charts are of course much older than software
and are used by many industries. However, they are also widely used for software projects because they show that true waterfalls are uncommon. Instead, software projects normally start a new activity before the prior activity is finished. Thus design starts before requirements are finished; coding starts before design is finished, and so forth. A Gantt chart consists of horizontal bars showing time lines for activities, as shown below using a simple example:

Requirements        **********
Design                   **********
Coding                        **********
Testing                            **********
Documentation                 **********
Quality assurance             **********
Management          *********************

A variety of software project management tools and software parametric estimation tools produce Gantt charts as standard outputs, and some also support PERT diagrams. Gantt charts are a useful visual aid for understanding schedule duration and schedule overlaps among adjacent activities.

Generalists versus specialists

In the early days of software, programmers or software engineers handled requirements, design, coding, and testing. As applications grew larger, specialists appeared; today in 2014 there are a total of 126 software occupation groups. Some organizations prefer the generalist approach with few specialists; others prefer the specialist approach with business analysts, programmers, test specialists, quality assurance specialists, and so forth. Empirical data indicates that the generalist approach tops out below 1,000 function points and becomes hazardous above that size. For example, certified test specialists are about 5% more efficient in finding bugs in each of the test stages of function test, regression test, performance test, and system test. To use a non-software analogy, generalists can build small boats and canoes, but if you need to build an 80,000-ton cruise ship you will need dozens of special skills.

Goal-question metrics (GQM)

The phrase goal-question metric refers to a fairly new and general approach to measurement developed by Dr.
Victor Basili of the University of Maryland, with contributions by Dr. David Weiss and Albert Endres of IBM. The GQM approach is a general-purpose measurement method, not limited to software. It includes a six-step process: 1) set a goal; 2) generate questions based on the goal; 3) specify metrics; 4) develop a data collection method; 5) collect and validate data; 6) do a post mortem on the results. It is possible to use the GQM approach with standard metrics such as function points and defect removal efficiency (DRE). Indeed, two useful goals for the software industry are: 1) raise average productivity rates to 15 function points per staff month; 2) raise average defect removal efficiency levels to 99%.
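Of the two goals above, DRE is the easier to compute: defects removed before release divided by total defects, where total defects are pre-release removals plus defects reported by users in a fixed measurement window. A minimal sketch (function name invented for illustration):

```python
# Sketch of defect removal efficiency (DRE): the fraction of total defects
# removed before release. "Total" here means pre-release removals plus
# user-reported defects in the measurement window.

def defect_removal_efficiency(pre_release_defects, post_release_defects):
    total = pre_release_defects + post_release_defects
    return pre_release_defects / total if total else 1.0

# 990 bugs removed before release, 10 reported by users -> 99% DRE,
# the industry goal named above.
print(f"{defect_removal_efficiency(990, 10):.0%}")
```

Note that the result is sensitive to the measurement window chosen for user-reported defects: a longer window surfaces more post-release bugs and lowers the computed DRE.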
Good-enough quality fallacy

Because software managers often have a poor understanding of software economics, it has become commonplace to think that software with significant bugs can be released as long as it works and performs most tasks. More sophisticated software managers know that released bugs lower customer satisfaction and raise support and warranty costs. Further, if software is developed properly, using a combination of defect prevention, pre-test defect removal such as inspections and static analysis, and formal testing by certified test personnel using mathematical test case design, it can achieve more than 99% DRE and still be delivered more quickly than sloppily developed software that achieves only 90% DRE or less. The good-enough fallacy is symptomatic of inept management in need of better training in software economics and software quality control. Make no mistake: the shortest software schedules correlate with the highest DRE levels and the lowest defect potentials. Software schedules slip because there are too many bugs in the software when testing starts. See also technical debt, discussed later in this paper.

Governance

The financial collapse of Enron and other major financial problems partly blamed on software led to the passage of the Draconian Sarbanes-Oxley law in the United States. This law is aimed at corporate executives and can bring criminal charges against them for poor governance or lack of due diligence. The term governance means constant oversight and due diligence by executives over software and operations that might have financial consequences if mistakes are made. A number of the measures discussed in this report are relevant to governance, including but not limited to: cyclomatic complexity, defect origins, defect severity, defect potentials, defect discovery efficiency (DDE), defect removal efficiency (DRE), delivered defects, function point size metrics, and reliability.
Halstead complexity

The metrics discussed in this topic were developed by Dr. Maurice Halstead in 1977 and deal primarily with code complexity, although they have more general uses. Halstead defined a suite of metrics based on operators (verbs or commands) and operands (nouns or data). By enumerating distinct and total operators and operands, various metrics such as program length, volume, and difficulty are produced. Halstead metrics and cyclomatic complexity metrics are different but somewhat congruent. Today in 2014, Halstead complexity is less widely used than cyclomatic complexity.

Historical data leakage

Leakage from historical data is an endemic problem of the software industry. Leakage has the effect of making both quality and productivity look better than they really are. Leakage was first noted by the author in the 1970s. The most common omissions from historical productivity data
include unpaid overtime, project management, user costs, and the work of part-time specialists such as quality assurance, technical writers, business analysts, agile coaches, project office staff, and many more. Leakage is worse for projects created via cost centers than via profit centers. Quality data leakage is also severe, and includes omitting bugs in requirements and design, omitting bugs found by unit test, omitting bugs found by static analysis, and omitting bugs found by the developers themselves. At IBM there were volunteers who reported unit test and self-discovered bugs in order to provide some statistical knowledge of these topics. Among the author's clients, overall cost data for cost-center projects averages about 37% complete; quality data averages only about 24% complete. Data from projects developed under time-and-materials contracts is more accurate than data from fixed-price contracts, and data from profit-center projects is more accurate than data from cost-center projects.

Industry comparisons

Software is produced by essentially every industry in the world, yet there is little published data that compares software quality and productivity across industry lines. From the author's data collection of about 20,000 projects, the high-technology industries that manufacture complex physical equipment (medical devices, avionics, embedded applications) have the best quality, while banks and insurance companies have the best productivity. One of the virtues of function point metrics is the ability to make direct comparisons across all industries. Following are preliminary data points from the United States only:
                                        Software      Defect       Defect      Delivered
                                        Productivity  Potentials   Removal     Defects
   Industry                             2013          2013         Efficiency  2013
                                                                   2013

 1 Government - intelligence             7.20   5.95   99.50%   0.03
 2 Manufacturing - medical devices       7.75   5.20   98.50%   0.08
 3 Manufacturing - aircraft              7.25   5.75   98.00%   0.12
 4 Telecommunications operations         9.75   5.00   97.50%   0.13
 5 Manufacturing - electronics           8.25   5.25   97.00%   0.16
 6 Manufacturing - telecommunications    9.75   5.50   96.50%   0.19
 7 Manufacturing - defense               6.85   6.00   96.25%   0.23
 8 Government - military                 6.75   6.40   96.00%   0.26
 9 Entertainment - films                13.00   4.00   96.00%   0.16
10 Manufacturing - pharmaceuticals       8.90   4.55   95.50%   0.20
11 Smartphone/tablet applications       15.25   3.30   95.00%   0.17
12 Transportation - airlines             8.75   5.00   94.50%   0.28
13 Software (commercial)                15.00   3.50   94.00%   0.21
14 Manufacturing - automotive            7.75   4.90   94.00%   0.29
15 Transportation - bus                  8.00   5.10   94.00%   0.31
16 Manufacturing - chemicals             8.00   4.80   94.00%   0.29
17 Banks - investment                   11.50   4.60   93.75%   0.29
18 Open source development              13.75   4.40   93.50%   0.29
19 Banks - commercial                   11.50   4.50   93.50%   0.29
20 Credit unions                        11.20   4.50   93.50%   0.29
21 Professional support - medicine       8.55   4.80   93.50%   0.31
22 Government - police                   8.50   5.20   93.50%   0.34
23 Entertainment - television           12.25   4.60   93.00%   0.32
24 Manufacturing - appliances            7.60   4.30   93.00%   0.30
25 Software (outsourcing)               14.00   4.65   92.75%   0.34
26 Manufacturing - nautical              8.00   4.60   92.50%   0.35
27 Process control                       9.00   4.90   92.50%   0.37
28 Stock/commodity brokerage            10.00   5.15   92.50%   0.39
29 Professional support - law            8.50   4.75   92.00%   0.38
30 Games - computer                     15.75   3.00   91.00%   0.27
31 Social networks                      14.90   4.90   91.00%   0.44
32 Insurance - life                     10.00   5.00   91.00%   0.45
33 Insurance - medical                  10.50   5.25   91.00%   0.47
34 Public utilities - electricity        7.00   4.80   90.50%   0.46
35 Education - university                8.60   4.50   90.00%   0.45
36 Automotive sales                      8.00   4.75   90.00%   0.48
37 Hospitals                             8.00   4.80   90.00%   0.48
38 Insurance - property and casualty     9.80   5.00   90.00%   0.50
39 Oil extraction                        8.75   5.00   90.00%   0.50
40 Consulting                           12.70   4.00   89.00%   0.44
41 Public utilities - water              7.25   4.40   89.00%   0.48
42 Publishing (books/journals)           8.60   4.50   89.00%   0.50
43 Transportation - ship                 8.00   4.90   88.00%   0.59
44 Natural gas generation                6.75   5.00   87.50%   0.63
45 Education - secondary                 7.60   4.35   87.00%   0.57
46 Construction                          7.10   4.70   87.00%   0.61
47 Real estate - commercial              7.25   5.00   87.00%   0.65
48 Agriculture                           7.75   5.50   87.00%   0.72
49 Entertainment - music                11.00   4.00   86.50%   0.54
50 Education - primary                   7.50   4.30   86.50%   0.58
51 Transportation - truck                8.00   5.00   86.50%   0.68
52 Government - state                    6.50   5.65   86.50%   0.76
53 Manufacturing - apparel               7.00   3.00   86.00%   0.42
54 Games - traditional                   7.50   4.00   86.00%   0.56
55 Manufacturing - general               8.25   5.20   86.00%   0.73
56 Retail                                8.00   5.40   85.50%   0.78
57 Hotels                                8.75   4.40   85.00%   0.66
58 Real estate - residential             7.25   4.80   85.00%   0.72
59 Mining - metals                       7.00   4.90   85.00%   0.74
60 Automotive repairs                    7.50   5.00   85.00%   0.75
61 Wholesale                             8.25   5.20   85.00%   0.78
62 Government - federal civilian         6.50   6.00   84.75%   0.92
63 Waste management                      7.00   4.60   84.50%   0.71
64 Transportation - trains               8.00   4.70   84.50%   0.73
65 Food - restaurants                    7.00   4.80   84.50%   0.74
66 Mining - coal                         7.00   5.00   84.50%   0.78
67 Government - county                   6.50   5.55   84.50%   0.86
68 Government - municipal                7.00   5.50   84.00%   0.88

   TOTAL/AVERAGES                        8.95   4.82   90.39%   0.46

The U.S. Department of Commerce and the Census Bureau have developed an encoding method that is used to identify industries for statistical purposes, called the North American Industry Classification (NAIC). Refer to the NAIC code discussion later in this document for a description.

Inflation metrics
Over long periods of time, wages, taxes, and other costs tend to increase steadily. This is called inflation and is normally measured as a percentage increase. For software, inflation rates play a part in large systems that take many years to develop. They also play a part in long-range legacy application maintenance. Inflation also plays a part in the selection of outsource countries. For example, in 2014 the inflation rates in China and India are higher than in the United States, which will eventually erode the current cost advantages of these two countries for outsource contracts.

International comparisons

Software is developed in every known country in the world. This raises the question of what methods are effective for comparing productivity and quality across national boundaries. Some of the factors that have international impacts include: 1) average compensation levels for software personnel by country; 2) national inflation rates; 3) work hours per month by country; 4) vacation and public holidays by country; 5) unionization of software personnel and local union regulations; 6) probabilities of strikes or civil unrest; 7) stability of electric power supplies by country; 8) logistics such as air travel; 9) time zones, which make communication difficult between countries with more than a 4-hour time difference; 10) knowledge of spoken and written English, which is the dominant language of software; 11) intellectual property laws and protection of patents and source code.
Function point metrics allow interesting global comparisons of quality and productivity that are not possible using other metrics. The columns below show, for each country, approximate software productivity (function points per staff month) in 2013, approximate defect potentials (defects per function point), approximate defect removal efficiency, and approximate delivered defects (defects per function point) in 2013:

1 Japan 9.15 4.50 93.50% 0.29
2 India 11.30 4.90 93.00% 0.34
3 Denmark 9.45 4.80 92.00% 0.38
4 Canada 8.85 4.75 91.75% 0.39
5 South Korea 8.75 4.90 92.00% 0.39
6 Switzerland 9.35 5.00 92.00% 0.40
7 United Kingdom 8.85 4.75 91.50% 0.40
8 Israel 9.10 5.10 92.00% 0.41
9 Sweden 9.25 4.75 91.00% 0.43
10 Norway 9.15 4.75 91.00% 0.43
11 Netherlands 9.30 4.80 91.00% 0.43
12 Hungary 9.00 4.60 90.50% 0.44
13 Ireland 9.20 4.85 90.50% 0.46
14 United States 8.95 4.82 90.15% 0.47
15 Brazil 9.40 4.75 90.00% 0.48
16 France 8.60 4.85 90.00% 0.49
17 Australia 8.88 4.85 90.00% 0.49
18 Austria 8.95 4.75 89.50% 0.50
19 Belgium 9.10 4.70 89.15% 0.51
20 Finland 9.00 4.70 89.00% 0.52
21 Hong Kong 9.50 4.75 89.00% 0.52
22 Mexico 8.65 4.85 88.00% 0.58
23 Germany 8.85 4.95 88.00% 0.59
24 Philippines 10.75 5.00 88.00% 0.60
25 New Zealand 9.05 4.85 87.50% 0.61
26 Taiwan 9.00 4.90 87.50% 0.61
27 Italy 8.60 4.95 87.50% 0.62
28 Jordan 7.85 5.00 87.50% 0.63
29 Malaysia 8.40 4.65 86.25% 0.64
30 Thailand 7.90 4.95 87.00% 0.64
31 Spain 8.50 4.90 86.50% 0.66
32 Portugal 8.45 4.85 86.20% 0.67
33 Singapore 9.40 4.80 86.00% 0.67
34 Russia 8.65 5.15 86.50% 0.70
35 Argentina 8.30 4.80 85.50% 0.70
36 China 9.15 5.20 86.50% 0.70
37 South Africa 8.35 4.90 85.50% 0.71
38 Iceland 8.70 4.75 85.00% 0.71
39 Poland 8.45 4.80 85.00% 0.72
40 Costa Rica 8.00 4.70 84.50% 0.73
41 Bahrain 7.85 4.75 84.50% 0.74
42 Ukraine 9.10 4.95 85.00% 0.74
43 Turkey 8.60 4.90 84.50% 0.76
44 Viet Nam 8.65 4.90 84.50% 0.76
45 Kuwait 8.80 4.80 84.00% 0.77
46 Colombia 8.00 4.75 83.50% 0.78
47 Peru 8.75 4.90 84.00% 0.78
48 Greece 7.85 4.80 83.50% 0.79
49 Syria 7.60 4.95 84.00% 0.79
50 Tunisia 8.20 4.75 83.00% 0.81
51 Saudi Arabia 8.85 5.05 84.00% 0.81
52 Cuba 7.85 4.75 82.50% 0.83
53 Panama 7.95 4.75 82.50% 0.83
54 Egypt 8.55 4.90 82.75% 0.85
55 Libya 7.80 4.85 82.50% 0.85
56 Lebanon 7.75 4.75 82.00% 0.86
57 Iran 7.25 5.25 83.50% 0.87
58 Venezuela 7.50 4.70 81.50% 0.87
59 Iraq 7.95 5.05 82.50% 0.88
60 Pakistan 7.40 5.05 82.00% 0.91
61 Algeria 8.10 4.85 81.00% 0.92
62 Indonesia 8.90 4.90 80.50% 0.96
63 North Korea 7.65 5.10 81.00% 0.97
64 Nigeria 7.00 4.75 78.00% 1.05
65 Bangladesh 7.50 4.75 77.00% 1.09
66 Burma 7.40 4.80 77.00% 1.10
AVERAGE/TOTAL 8.59 4.85 86.27% 0.67
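The delivered-defect column in these tables is derived from the other two quality columns: delivered defects per function point equal the defect potential multiplied by (1 minus the defect removal efficiency). A minimal check against the Japan and United States rows:

```python
def delivered_defects(potential_per_fp: float, dre: float) -> float:
    """Delivered defects per function point = defect potential * (1 - DRE)."""
    return potential_per_fp * (1.0 - dre)

# Japan: 4.50 defects/FP potential, 93.50% removal -> roughly 0.29 delivered
japan = delivered_defects(4.50, 0.9350)
# United States: 4.82 potential, 90.15% removal -> roughly 0.47 delivered
usa = delivered_defects(4.82, 0.9015)
```

This relationship is why improving defect removal efficiency by even a few percentage points cuts delivered defects substantially when defect potentials are similar.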
This data is based on small samples for most countries and has a high margin of error. Hopefully these international comparisons will lead to additional studies; every country should have a national benchmark repository. The author has collected data in 24 countries, but additional data is taken from secondary sources and converted into function point formats. These tables are published in the hope of increasing the volume of quantified data from every country. When comparing data across a number of countries, the author's method of identifying each country for statistical purposes is to use the country's telephone dialing code; i.e. the United States is 1, Brazil is 55, China is 86, Germany is 49, and so forth. A Google search on "international telephone dialing codes" will lead to lists for all countries and major cities as well.

Inspection metrics

One of the virtues of formal inspections of requirements, design, code, and other deliverables is the suite of standard metrics that are part of the inspection process. Inspection data routinely includes preparation effort; inspection session team size and effort; defects detected before and during inspections; defect repair effort after inspections; and calendar time for the inspections for specific projects. This data is useful in comparing the effectiveness of inspections against other methods of defect removal such as pair programming, static analysis, and various forms of testing. To date inspections have the highest levels of defect removal efficiency (> 85%) of any known form of software defect removal.

Invalid defects

The term invalid defect refers to a bug report against a software application that, upon examination, turns out not to be a true defect. Some of the common reasons for invalid defects include: user errors, hardware errors, and operating system errors mistaken for application errors.
As an example of an invalid defect, a bug report against a competitive estimation tool was once sent to the author's company by mistake. Even though it was not our bug, it took about an hour to forward the bug to the actual company and to notify the client of the error. Invalid defects are not true defects but they do accumulate costs. Overall about 15% of reported bugs against many software applications are invalid defects.

ISO/IEC standards

This phrase is an amalgamation of the International Organization for Standardization, commonly abbreviated to ISO, and the International Electrotechnical Commission, commonly abbreviated to IEC. These groups have hundreds of standards covering essentially every industry. Some of the standards that are relevant to software include ISO/IEC 20926:2009 for function points; the ISO/IEC 9126 quality standard; and the newer ISO 31000:2009 risk standard. An issue for all ISO/IEC standards is the lack of empirical data that proves the benefits of the standards. There is no reason to doubt that international standards are beneficial, but it would be useful to have
empirical data that shows specific benefits. For example, do the ISO quality and risk standards actually improve quality or reduce risks? As of 2014 nobody knows. The standards community should probably take lessons from the medical community and include proof of efficacy and avoidance of harm as part of the standards creation process. As medicine has learned from the many harmful side-effects of prescription drugs, releasing a medicine without thorough testing can cause immense harm to patients, including death. Releasing standards only after proof of efficacy and avoidance of harmful side-effects should itself be a standard practice.

Kanban

Kanban is a Japanese method of streamlining manufacturing first developed by Toyota. It has become famous under the phrase "just in time." The Kanban approach uses interesting methods for marking progress and showing when a deliverable is ready for the next step in production. Kanban is used with software, but not consistently. The agile approach has adopted some Kanban ideas, as have other methodologies. Quite a number of methods for quality control were first used in Japan, whose national interest in quality is thousands of years old. Other Japanese methods include quality circles, Kaizen, and Poka-Yoke. Empirical data gathered from Japanese companies indicates very high software quality levels, so the combinations of Japanese methods have proven to be useful and successful in a software context.

Kelvin's law of 1883

"If you cannot measure it you cannot improve it." William Thomson was to become the first Baron Kelvin, and is commonly known as Lord Kelvin. He was a mathematician and physicist with many accomplishments, including measuring absolute zero temperature. His famous quotation is widely cited in the software literature and is a primary incentive for striving for effective software metrics.

Key Performance Indicators (KPI)

This term is applied to dozens of industries and technical fields, including software.
The general meaning is progress toward a specific goal. This definition is congruent with goal-question metrics and with the rates of improvement discussed later in this report. KPI can include both quantitative and qualitative information, and can be used in both predictive and measurement modes. Due to the large scope of topics and the large literature available, a Google search is recommended to bring up recent documents on KPI. Software Engineering Institute assessments also include KPI.

KLOC

This term uses K to express 1,000 and LOC for lines of code. This is a metric that dates back to the 1960s as a way of measuring software size and also software costs and defect
densities. However both KLOC and LOC metrics share common problems in that they penalize high-level languages and make requirements and design effort and defects invisible.

Language levels

In the late 1960s and early 1970s programming languages began a rapid increase in both their numbers and their power. By the mid-1970s over 50 languages were in use. The phrases "low level" and "high level" were subjective and had no mathematical rigor. IBM wanted to be able to evaluate the power of various languages and so developed a mathematical form for quantifying levels. This method made basic assembly language the primary unit and assigned it level 1. Other languages were evaluated based on how many statements in basic assembly language it would take to duplicate one statement in the higher-level language. Using this method both COBOL and FORTRAN were level 3 languages because it took an average of three assembly statements to provide the features of one statement in COBOL or FORTRAN. Later, when function points were invented in 1975, the level concept was extended to support function points and was used for "backfiring," or mathematical conversion between code volumes and function points. Here too basic assembly was the starting point, and it took about 320 assembly statements to be equivalent to one function point. Today in 2014 tables of language levels are commercially available and include about 1,000 different languages. For example Java is level 6; Objective-C is level 12; PL/I is level 4; C is level 2.5; and so forth. This topic is popular and widely used, but needs additional study and more empirical data to prove the validity of the assigned levels for each language. Combinations of languages can also be assigned levels, such as Java and HTML or COBOL and SQL.

Lean development

The term lean is a relative term that implies less body fat and a lower weight than average.
When applied to software the term loosely means building software with a smaller staff than normal, while hopefully not slowing down development or causing harmful side effects. Lean manufacturing originated at Toyota in Japan, but the concepts spread to software and especially to the agile approach. Some of the lean concepts include "eliminate waste," "amplify learning," and "build as fast as possible." A lean method called value stream mapping includes useful metrics. As with many other software concepts, lean suffers from a lack of solid empirical data that demonstrates effectiveness and a lack of harmful side-effects. The author's clients that use lean methods have done so on small projects below 1,000 function points, and their productivity and quality levels have been good but not outstanding. As of 2014 it is uncertain how lean concepts will scale up to large systems in the 10,000 function point size range. However TSP and RUP have proof of success for large systems, so lean should be compared against them.

Learning curves

The concept of learning curves is that when human beings need to master a new skill, their initial performance will be suboptimal until the skill is truly mastered. This means that when
companies adopt a new methodology, such as agile, the first project may lag in terms of productivity or quality or both. Learning curves have empirical data from hundreds of technical fields in dozens of industries. However, for software, learning curves are often ignored when estimating initial projects based on agile, TSP, RUP, or any other new method. In general, expect suboptimal performance for a period of three to six months followed by rapid improvement after the learning period. Assume for example that the average productivity rate using waterfall development is 6.00 function points per staff month and a company wants to adopt lean and agile techniques. What might occur at three-month intervals could be: first quarter = 4.00 function points per staff month; second quarter = 5.00 function points per staff month; third quarter = 6.00 function points per staff month; fourth quarter = 8.00 function points per staff month; next calendar year = 10.00 function points per staff month. In other words performance is low for the first six months due to the learning curve; after that it improves and hits new highs after 12 months.

Lines of code (LOC) metrics

Lines of code are probably the oldest metric for software and date back to the 1950s. LOC metrics come in two distinct flavors: 1) physical lines; 2) logical code statements. Of these two, physical lines are the easiest to count but the least accurate in terms of how developers think about programs. Physical lines can include blanks between paragraphs and also comments, neither of which has any bearing on the code itself. Logical statements deal with executable commands and data definitions, which are the things programmers consider when writing code. However both physical and logical code counts still penalize high-level languages and make requirements and design invisible. A study by the author of software journals such as IEEE Software, the IBM Systems Journal, Crosstalk, Cutter, etc.
found that about one third of published articles used physical LOC; one third used logical code statements; and the remaining third just used LOC without specifying either physical or logical. There can be as much as a 500% difference between counts of physical and logical code. The inconsistent use of logical and physical LOC in the software literature is symptomatic of the sloppy measurement practices of the software community.

Maintenance metrics

The term maintenance is highly ambiguous. No fewer than 23 different kinds of work are subsumed under the single term maintenance. Some of these forms of maintenance include defect repairs, refactoring, restructuring, reverse engineering, reengineering of legacy applications, and even enhancements or adding new features. For legal reasons IBM made a rigorous distinction between maintenance in the sense of defect repairs and enhancements or adding new features. A court order required IBM to provide maintenance information to competitors, but the order did not define what the word maintenance meant. A very useful metric for maintenance is the maintenance assignment scope: the quantity of software, measured in function points, that one maintenance programmer can keep up and running for one year. The current average is about 1,500 function points. For very well structured software the maintenance assignment scope can top 5,000 function points. For very bad software with high complexity and convoluted paths, the maintenance assignment scope can drop below 500 function points. Other metrics in the
maintenance field include the number of clients one telephone support person can handle during a typical day (about 10) and the number of bugs that a maintenance programmer can fix per month (from 8 to 12). In spite of the complexity of maintenance, the tasks of maintenance, customer support, and enhancement can be measured and predicted fairly well. This is important because in 2014 the world population of software maintenance personnel is larger than the world population of software development personnel.

Measurement speed and cost

A topic of some importance is how easy or difficult it is to use specific metrics and measure useful facts about software. Manual methods are known to be slow and costly. For example manual function point counts only proceed at a rate of about 500 function points per day. At a consulting cost of $3,000 per day that means it costs $6.00 for every function point counted. (The author has filed a U.S. patent application on a high-speed early sizing method that can predict function points and other metrics in an average time of 1.8 minutes per project regardless of the size of the application. This is a standard feature of the author's Software Risk Master tool.) Collecting manual benchmark data by interviewing a development team takes about three hours per project. Assuming four development personnel and a manager are interviewed, the effort would be 15 staff hours for the development group and 3 consulting hours: 18 hours in total. Assuming average costs, benchmark data collection would cost about $2,250 per project. By contrast self-reported data can be gathered for about half that. Automated tools for high-speed function point analysis, for cyclomatic complexity, and for code counting are all available, but to date none have published speed and cost data. The topics of measurement speed and measurement cost are underreported in the software literature and need more work.
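The arithmetic behind these cost figures is straightforward. A short sketch follows; note that the $125 blended hourly rate is an inference from the quoted $2,250 per project, not a number stated in the text:

```python
def cost_per_fp_counted(consulting_cost_per_day: float,
                        fp_counted_per_day: float) -> float:
    """Cost of manual function point counting, per function point."""
    return consulting_cost_per_day / fp_counted_per_day

# $3,000 per day at 500 FP per day -> $6.00 per function point counted
manual_counting = cost_per_fp_counted(3000.0, 500.0)

# Benchmark interview: 4 developers + 1 manager for a 3-hour session,
# plus 3 consulting hours, gives 18 hours in total.
interview_hours = 5 * 3 + 3
# The quoted $2,250 per project implies a blended rate of $125 per hour
# (an inference, not a figure stated in the text).
implied_hourly_rate = 2250.0 / interview_hours
```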
Meetings and communications

One of the major cost drivers for software projects is meetings and communications. Between 12% and about 20% of software development costs are in the form of meetings with customers, team meetings, or meetings between managers and other managers. If travel is included for international projects, the percentages can be even higher. Agile projects have cut down on document costs compared to ordinary projects, but increased meeting and communication costs. Unless both documents and meetings and communications are measured, which is usually not the case, it is hard to see which approach is best. A typical pattern of meetings for a software project of 2,500 function points is shown below using the Software Risk Master (SRM) tool:
SRM Estimates for Meetings and Communications

Meeting Events              Number of   Attendees   Total Meeting   Costs       $ per
                            Meetings                Hours                       Funct. Pt.
Conference calls                25           7           178        $13,509      $5.40
Client meetings                  6           8           186        $14,066      $5.63
Arch./design meetings            5           7           158        $11,949      $4.78
Team technical meetings         59           8         1,976       $149,665     $59.87
Team status meetings            78          14         2,750       $208,308     $83.32
Executive status meetings        6           7           191        $14,461      $5.78
Problem analysis meetings        8          10           530        $40,120     $16.05
Phase reviews                    3          15           362        $27,435     $10.97

Meeting FP per staff month: 52.17
Meeting work hours per function point: 2.53
Percent of development costs: 12.33%

It is easy to see why meetings and communications are an important software cost driver. However they are seldom measured or included in benchmark reports even though they can rank high in total costs.
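The right-hand column of the table is simply each meeting cost divided by the 2,500 function point application size; a quick sketch reproduces two of the rows:

```python
APPLICATION_SIZE_FP = 2500  # size of the example project in function points

def meeting_cost_per_fp(total_meeting_cost: float) -> float:
    """Normalize a total meeting cost to dollars per function point."""
    return total_meeting_cost / APPLICATION_SIZE_FP

# Conference calls: $13,509 total -> $5.40 per function point
conference_calls = round(meeting_cost_per_fp(13_509), 2)
# Team status meetings: $208,308 total -> $83.32 per function point
team_status = round(meeting_cost_per_fp(208_308), 2)
```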
Methodology Comparison Metrics

A basic purpose of software metrics should be to compare the results of various methodologies such as agile, extreme programming, pair programming, waterfall, RUP, TSP, Prince2, Merise, iterative, and in fact all 35 named methodologies. The only current metric that is useful for side-by-side comparisons of methodologies is the function point metric. LOC does not measure requirements and design and penalizes high-level languages. Story points don't measure projects without user stories. Use-case points don't measure projects without use cases. Function points measure everything. For development productivity, a study on the author's blog (http://namcookanalytics.com) shows the following results for 10 similar projects of 1,000 function points in size, built with 10 different methodologies and all compared via function points per staff month:

1) CMMI 5 with spiral development = 12.05
2) Extreme programming = 11.89
3) Agile/scrum = 11.85
4) TSP = 11.64
5) RUP = 9.58
6) CMMI 3 with iterative development = 9.37
7) Object-oriented development = 9.31
8) Pair programming = 9.21
9) CMMI 1 with waterfall = 6.51
10) Correctness proofs with waterfall = 6.21

These 10 samples illustrate the versatility of the function point metric in being able to measure any known methodology coded in any known programming language or combination of languages.

Methodology validation before release

In medicine and some engineering fields, before a new therapy can be released to the public it must undergo a series of tests and validation exercises to ensure that it works as advertised and does not have serious problems or cause harm. For software methodologies it would be useful to include a validation phase before releasing the methods to the world. IBM did validate function point metrics and formal inspections and also the Rational Unified Process (RUP).
Methods that seem to have been released without much in the way of formal validation include agile development and pair programming. Now that both have been in use for a number of years, agile seems to be effective below 1,000 function points for projects with limited numbers of users, some of whom can participate directly. Agile is not yet effective for large systems above 10,000 function points or for projects with millions of users. Agile also has problems with software that needs FDA or FAA certification, due in part to the huge volumes of paper documents required by Federal certification. Methodologies, like prescription medicines, should come with warning labels that describe proper use and include cautions about possible harmful consequences if the methodology is used outside its proven range of effectiveness. Current
methods that need validation, proof of success, and a demonstrated lack of harmful side-effects include pair programming, which is intrinsically expensive, and lean development, which is useful for hardware but still not validated for software.

Metrics conversion

With two different forms of lines of code metrics, more than a dozen variations in function point metrics, plus story points, use-case points, and RICE objects, one might think that metrics conversion between various metrics would be sophisticated and supported by both commercial and open-source tools, but this is not the case. In the author's view it is the responsibility of a metric inventor to provide conversion rules between a new metric and older metrics. For example it is NOT the responsibility of the International Function Point Users Group (IFPUG) to waste resources deriving conversion rules for every minor variation or new flavor of function point. As a courtesy, the author's Software Risk Master (SRM) tool does provide conversions between 23 metrics, and this seems to be the largest number of conversions as of 2014. There are also narrower published conversions between COSMIC and IFPUG function points. However metrics conversion remains a very weak link in the chain of software measurement techniques. Examples of metrics conversion are shown below for an application of a nominal 1,000 IFPUG function points.
These are standard outputs from the author's Software Risk Master tool:

Alternate Metrics                      Size      % of IFPUG
1  IFPUG 4.3                          1,000     100.00%
2  Automated code based               1,070     107.00%
3  Automated UML-based                1,030     103.00%
4  Backfired function points          1,000     100.00%
5  COSMIC function points             1,143     114.29%
6  Fast function points                 970      97.00%
7  Feature points                     1,000     100.00%
8  FISMA function points              1,020     102.00%
9  Full function points               1,170     117.00%
10 Function points light                965      96.50%
11 IntegraNova models                 1,090     109.00%
12 Mark II function points            1,060     106.00%
13 NESMA function points              1,040     104.00%
14 RICE objects                       4,714     471.43%
15 SCCQI function points              3,029     302.86%
16 Simple function points               975      97.50%
17 SNAP non-functional metrics          182      18.18%
18 SRM pattern matching               1,000     100.00%
19 Story points                         556      55.56%
20 Unadjusted function points           890      89.00%
21 Use case points                      333      33.33%

This table is included in Software Risk Master as a standard output. Additional metrics will be added from time to time as they occur and have sufficient data available.

Metrics education: academic

Academic training in software metrics is embarrassingly bad. So far as can be determined from limited samples, not a single academic course mentions that LOC metrics penalize high-level languages or that cost per defect metrics penalize quality. The majority of academics probably don't even know these basic facts of software metrics. What universities should teach about software metrics includes: manufacturing economics and the difference between fixed and variable software costs; activity-based cost analysis; defect potentials and defect removal efficiency; function point analysis; metrics conversion; comparing unlike software methods; comparing international software projects; and software growth patterns during development and after release. They should also teach the hazards of metrics with proven mathematical and economic flaws such as lines of code and cost per defect, both of which violate standard economic assumptions.

Metrics education: professional societies and metrics companies

Metrics training from professional societies and from companies that use metrics, such as benchmark and estimation companies, is generally focused on teaching specific skills such as function point analysis. The estimating companies teach the specifics of using their tools, and also provide some more general training on estimation and measurement topics. Academic institutions are so weak in metrics training that the societies and metrics companies probably provide more hours of training than all universities put together, and do a better job overall.
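Ratio-based conversion of the kind shown in the conversion table above can be sketched as a simple lookup of multipliers. The ratios below are taken from that table for a few of the metrics; treat them as rough approximations rather than exact conversion rules:

```python
# Approximate size ratios relative to IFPUG 4.3, taken from the
# conversion table above; real conversions vary by application.
RATIO_TO_IFPUG = {
    "IFPUG 4.3": 1.0000,
    "COSMIC function points": 1.1429,
    "Mark II function points": 1.0600,
    "Story points": 0.5556,
    "Use case points": 0.3333,
}

def convert_from_ifpug(ifpug_size: float, target_metric: str) -> float:
    """Convert an IFPUG 4.3 size to another metric via a fixed ratio."""
    return ifpug_size * RATIO_TO_IFPUG[target_metric]

# A nominal 1,000 IFPUG FP application is roughly 1,143 COSMIC FP.
cosmic = convert_from_ifpug(1000, "COSMIC function points")
```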
Metrics natural

The phrase natural metric refers to a metric that measures something visible and tangible that can be seen and counted without ambiguity. Examples of natural metrics for software would include pages of documents, test cases created, test cases executed, and physical lines of code. By contrast synthetic metrics are not visible and not tangible.
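Counting physical lines, a natural metric, needs no interpretation. Even excluding blanks and comments is only a crude step toward the synthetic logical-statement count, which would require a parser. A small illustrative sketch:

```python
def physical_lines(source: str) -> int:
    """Natural metric: every physical line, blanks and comments included."""
    return len(source.splitlines())

def non_blank_non_comment_lines(source: str) -> int:
    """Crude approximation toward logical statements: drops blank lines
    and '#' comment lines, but does not parse multi-line statements."""
    return sum(1 for line in source.splitlines()
               if line.strip() and not line.strip().startswith("#"))

sample = "# a comment\n\ntotal = 0\ntotal += 1\n"
# physical_lines(sample) counts 4 lines; only 2 are code.
```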
Metrics synthetic

The phrase synthetic metric refers to things that are abstract and based on mathematics rather than on actual physical phenomena. Examples of synthetic metrics for software include function point metrics, cyclomatic complexity metrics, logical code statements, test coverage, and defect density. Both synthetic and natural metrics are important, but synthetic metrics are more difficult to count. However synthetic metrics tend to be very useful for normalization of economic and quality data, which is difficult to do with natural metrics.

Micro function points

Normal IFPUG function points have adjustment factors that hit lower limits at application sizes of about 10 function points. Quite a lot of software work takes place below that size, and a surprising amount of work in the realm of small enhancements and bug repairs may even be below one function point in size. There is a need for a micro function point that can handle very small sizes, including fractional function points. The Software Risk Master (SRM) tool can do this via pattern matching; backfiring is also a possibility. Individually, very small micro changes involve little effort and low costs. However Fortune 500 companies can have over 25,000 such small changes per year, and the cumulative costs can be significant. In any case, knowledge of the total volume and expenses associated with very small changes is useful economic information.

Metrics validation

Before a metric is released to the outside world and everyday users, it should be validated under controlled conditions and proven to be effective and without harmful consequences. Metrics such as function points and SNAP did undergo extensive validation. Other metrics were simply developed and published without any validation. Older metrics such as lines of code and cost per defect have been in use for more than 50 years without yet being formally studied or validated for ranges of effectiveness and for harmful consequences.
Monte Carlo method

This phrase denotes a predictive method named after the famous gaming Mecca of Monte Carlo. Applied to business and technology, the Monte Carlo method uses numerous samples to derive probabilities and more general rules. For example, collecting data on software projects from a sample of 50 commercial banks might provide useful information on ranges of banking software performance. Doing a similar study for 50 manufacturing companies would provide similar data, and comparing the two sets would also be insightful. For predictive modeling, ranges of inputs would be defined and then dozens or scores of runs would be made to check the distributions over the ranges. John von Neumann programmed the ENIAC computer to provide Monte Carlo
simulations, so this method is as old as the computer industry. Monte Carlo simulation is also part of some software estimation tools.

Morale metrics

A topic needing more study, though data is difficult to gather, is the impact of morale on team performance. Many companies such as IBM and Apple perform morale studies, but these are usually kept internal and not published outside. Sometimes interesting correlations do get published. For example, when IBM opened the new Santa Teresa programming center, designed specifically for software, morale studies found that morale was much higher at Santa Teresa than at the nearby San Jose lab where the programmers had worked before. Productivity and quality were also high. Of course these findings do not prove that the new facility caused the improvements, but they are interesting. In general high morale correlates with high quality and high productivity, and low morale with the opposite. But more study is needed on this topic because it is an important one for software engineering. Among the factors known to cause poor morale and even voluntary termination among software engineers are: 1) poor project management; 2) forced use of pair programming without the consent of the software personnel; 3) impossible demands for short schedules by clients or executives; 4) more than 6 hours of unpaid overtime per week for long periods; 5) arbitrary curve fitting for appraisals that limits the number of top personnel to a fixed statistical value.

NAIC codes (replacements for SIC codes)

There are thousands of industries, and there is a need to do cross-industry comparisons for topics such as revenues, employment, quality, etc. The U.S. Census Bureau and the U.S. Department of Commerce have long recognized the need for cross-industry comparisons. Some years ago they published a large table of codes for industries called standard industry classification or SIC codes.
More recently, in 1997, the SIC codes were replaced and updated by a new encoding method called the North American Industry Classification or NAIC codes. The government of Mexico also participated in creating the NAIC codes. The author and his colleagues use NAIC codes when collecting benchmark data. A Google search on "NAIC code" will bring up useful tables and a look-up engine for finding the NAIC codes of thousands of industries. The full NAIC code is six digits, but for many benchmarks the two-digit and three-digit versions are useful since they are more general. Some relevant two-digit NAIC codes for software include manufacturing 31-33; retail 44-45; information 51; finance 52; professional services 54; and education 61. For benchmarks and also for software cost estimation, NAIC codes are useful to ensure apples-to-apples comparisons. NAIC codes are free, as are a number of tools for looking up the codes for specific industries.

National averages

Given the size and economic importance of software, one might think that every industrialized nation would have accurate data on software productivity, quality, and demographics. This does
not seem to exist. There seem to be no effective national averages for any software topic, and software demographics are suspect too. While counts of basic software personnel are known fairly well, the Bureau of Labor Statistics data does not show most of the 126 software occupations. For example there is no good data on business analysts, software quality assurance personnel, database analysts, and scores of other ancillary personnel associated with software development and maintenance. Creating a national repository of quantified software data would benefit the United States. It would probably have to be done either by a major university or by a major non-profit association such as the ACM, IEEE, PMI, SIM, or perhaps all of these together. Funding might be provided by major software companies such as Apple, Microsoft, IBM, Oracle and the like, all of whom have quite a bit of money and also large research organizations. Currently the best data on software productivity and quality tends to come from companies that build commercial estimation tools and companies that provide commercial benchmark services. All of these are fairly small companies. If you combine the data from all 2014 software benchmark groups such as Galorath, ISBSG, Namcook Analytics, Price Systems, Q/P Management Group, Quantimetrics, QSM, Reifer Associates, and Software Productivity Research, the total number of projects is about 80,000. However all of these are competitive companies, and with a few exceptions, such as the recent joint study by ISBSG, Namcook, and Reifer, the data is not shared or compared. It is not always consistent either. One would think that a major consulting company such as Gartner, Accenture, or KPMG would assemble national data from these smaller sources, but this does not seem to happen. While it is possible in 2014 to get rough employment and salary data for a small set of software occupation groups, there is no true national average that encompasses all industries.
Non-disclosure agreements (NDA)

When the author and his colleagues from Namcook Analytics LLC collect benchmark data from clients, the data is provided under a non-disclosure agreement, commonly abbreviated NDA. These agreements prevent the benchmark organization from identifying the client or the specific projects from which data are collected. Of course, if the data is merely added to a collection of hundreds of other projects for statistical analysis, that does not violate the NDA, because it is not possible to identify where the data came from.

Academics and many readers of benchmark reports that conceal the sources of data due to NDAs complain that the sources should be identified, and some even assume the data is invalid unless the sources are named. NDAs are a normal part of professional benchmark data collection and serve to protect proprietary client information that should not be shared with competitors or the outside world. In a sense, benchmark NDAs are similar to the confidentiality between lawyers and clients and the confidentiality of medical information between physicians and patients. NDAs are a common method for protecting information and need to be honored by all benchmark collection personnel.

Non-functional requirements

Software requirements come in two flavors: functional requirements, which describe what the customer wants the software to do, and non-functional requirements, which are needed to make the software
work on various platforms or are required by government mandate. Consider home construction before considering software. A home built overlooking the ocean will have windows with a view; this is a functional requirement set by the owners. But due to zoning and insurance demands, homes near the ocean in many states will need hurricane-proof windows. This is a non-functional requirement. See the discussion of the new SNAP metric later in this report. Typical non-functional requirements are changes to software to allow it to operate on multiple hardware platforms or under multiple operating systems.

Normalization

In software the term normalization has different meanings in different contexts, such as database normalization and software project result normalization. In this paper the form of normalization of interest is converting raw data to a fixed metric so that comparisons of different projects are easy to understand. The function point metric is a good choice for normalization. Both work hours per function point and defects per function point can show the effects of differences in application size, differences in methodology, differences in CMMI levels, and other topics of interest.

However, there is a problem that is not well covered in the literature, and for that matter not well covered by the function point associations: application size is not constant. During development, software applications grow due to creeping requirements at more than 1% per calendar month. After release, applications continue to grow for as long as they are being used at more than 8% per calendar year. This means that both productivity and quality data need to be renormalized from time to time to match the current size. The author recommends normalization at requirements end and again at delivery for new software. For software in the field and being used, the author recommends renormalization once a year, probably at the start of each fiscal or calendar year.
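The growth rates just cited imply that a size figure used for normalization goes stale quickly. A minimal sketch, assuming the cited rates of 1% per calendar month during development and 8% per calendar year after release are applied as exact compound rates (a simplification of the "more than" wording in the text):

```python
def renormalized_size(size_fp, months_in_development=0, years_in_service=0):
    """Grow an application's function point size using the rough rates
    cited in the text: 1% per calendar month during development and
    8% per calendar year after release, compounded."""
    size = size_fp * (1.01 ** months_in_development)
    size *= 1.08 ** years_in_service
    return size

# 1,000 FP at requirements end, after 12 months of development:
print(round(renormalized_size(1000, months_in_development=12)))  # 1127
```

Work hours per function point and defects per function point would then be recomputed against the renormalized size rather than the original requirements-end size.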
Object-oriented metrics

Object-oriented languages and methods have become mainstream development approaches. For example, software at Apple uses the Objective-C programming language. The terminology and concepts of object-oriented development are somewhat unique and not the same as those of procedural languages. However, some standard metrics such as function points and defect removal efficiency (DRE) work well with object-oriented development.

In addition, the OO community has developed metrics suites that are tailored to the OO approach. These cover methods, classes, inheritance, encapsulation, and other OO constructs. Coupling and cohesion are also used with OO development. This is too complex a topic for a short discussion, so a Google search on "object-oriented metrics" will bring up interesting topics such as weighted methods per class and depth of inheritance tree.

Occupation groups

A study of software demographics in large companies was funded by AT&T and carried out by the author and his colleagues. Some of the participants in the study included IBM, the U.S. Navy,
Texas Instruments, Ford, and other major organizations. The study found 126 occupations in total, but no company employed more than 50 of them. Among the occupations were agile coaches, architects, business analysts, configuration control specialists, designers, estimating specialists, function point specialists, human factors specialists, programmers or software engineers, project office specialists, quality assurance specialists, technical writers, and test specialists. The number of occupation groups increased with both application size and company size. Traditional programming can be less than 30% of the team and less than 30% of the effort for large applications.

The study also found that no human resources group actually knew how many software occupations were employed, or even how many software personnel were employed; it was necessary to interview local managers. The study also found that some software personnel refused to be identified with software due to low status; these were aeronautical or automotive engineers building embedded software. Very likely, government statistics on software employment are wrong: if corporate HR organizations don't know how many software people are employed, they can't tell the government either. There is a need for continuing study of this topic. Also needed are comparisons of productivity and quality between projects staffed with generalists and similar projects staffed by specialists.

In a typical distribution of software occupation groups for a generic application of 1,000 function points, programmers and testers dominate, yet neither occupation group reaches even 30% of overall staffing. Needless to say, there are wide variations. Also, with a total of 126 known occupation groups, really large systems will have much greater diversity in occupations than a 1,000 function point application.
Parametric estimation

The term parametric estimation refers to software cost and quality estimates produced by one or more commercial software estimation tools such as COCOMO II, CostXpert, KnowledgePlan, SEER, SLIM, Software Risk Master, or TruePrice. Parametric estimates are derived from the study and analysis of historical data from past projects. As a result, the commercial estimation companies tend to also provide benchmark services. Some of the parametric estimation companies, such as the author's Namcook Analytics, have data on more than 20,000 projects.

A comparison by the author of 50 parametric estimates and 50 manual estimates by experienced project managers found that manual and parametric estimates were close for small projects below 250 function points. But as application size increased, manual estimates became progressively optimistic, while parametric estimates stayed within 10% well past 100,000 function points. For small projects both manual and parametric estimates should be accurate enough to be useful, but for major systems parametric estimates are a better choice.

Some companies utilize two or more parametric estimation tools and run them all when dealing with large mission-critical software applications. Convergence of the estimates produced by separate parametric estimation tools adds confidence to major projects.
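A simple way to operationalize the convergence check described above is to compare the relative spread of the independent estimates against a tolerance. The sketch below uses the coefficient of variation with a 10% default tolerance; the tolerance value mirrors the accuracy band quoted in the text but is an assumption, not a standard, and the tool outputs shown are hypothetical:

```python
from statistics import mean, pstdev

def estimates_converge(effort_estimates, tolerance=0.10):
    """True when independent parametric estimates agree within a
    relative spread (coefficient of variation) of `tolerance`."""
    m = mean(effort_estimates)
    return pstdev(effort_estimates) / m <= tolerance

# Three hypothetical tool outputs, in staff months, for one large system:
print(estimates_converge([410.0, 425.0, 440.0]))  # True: tight agreement
print(estimates_converge([410.0, 600.0, 900.0]))  # False: investigate first
```

When the check fails, the divergent assumptions behind each tool's inputs are worth reconciling before committing to a budget.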
Pair programming

Pair programming is an example of a methodology that should have been validated before it started being used, but was not. The concept of pair programming is that two programmers take turns coding and navigating using the same computer. Clearly, if personnel salaries are $100,000 per year and the burden rate is $50,000 per year, then a pair is going to cost twice as much as one programmer; i.e., $300,000 per year instead of $150,000 per year. A set of ten pairs will cost $3,000,000 per year and return fairly low value.

The literature on pair programming is trivial and only compares unaided pairs against unaided individual programmers, without any reference to static analysis, inspections, or other proven methods of quality control. Although pair enthusiasts claim knowledge transfer as a virtue, there are better methods of knowledge transfer, including inspections and mentoring. While some programmers enjoy pair programming, many do not, and several reports discuss programmers who quit companies specifically to get away from pair programming.

This method should have been evaluated prior to release using a sample of at least 25 pairs compared to 25 individuals, and the experiments should also have compared pairs and individuals with and without static analysis. The experiments should also have compared pairs against individuals who used formal inspections. The author's data indicates that pairs always cost more, are usually slower, and are not as effective for quality control as individual programmers who use inspections and static analysis. An unanswered question of the pair programming literature is this: if pairing programmers is good, why not pair testers, quality assurance personnel, project managers, business analysts, and the other occupations associated with software?

Pareto analysis

The famous Pareto principle states that 80% of various issues will be caused by 20% of the possible causes.
The name was created by Joseph Juran in honor of the Italian economist Vilfredo Pareto, who noted in 1906 that 20% of the peapods in his garden produced 80% of the peas. Pareto analysis is much more than the 80/20 rule and includes sophisticated methods for analyzing complex problems with many variables. Pareto distributions are frequently noted in software, such as the discovery of error-prone modules and a Microsoft study finding that fixing 20% of bugs would eliminate 80% of system crashes. Some of the areas where Pareto effects seem to show up include: 1) a minority of personnel seem to produce the majority of effective work; 2) in any industry, a minority of companies are ranked best to work for by annual surveys.

Pattern matching

Patterns have become an important topic in software engineering and will become even more important as reuse enters the mainstream. Today in 2014, design patterns and code patterns are both fairly well known and widely used. Patterns are also useful in measurement and estimation. For example, the author's patent-pending early sizing method is based on patterns of historical projects that match the taxonomy of the new application being sized. Patterns need to be
organized using standard taxonomies of application nature, scope, class, and type. Patterns are also used by hundreds of other industries. For example, the Zillow database of real estate and the Kelley Blue Book of used cars are both based on pattern matching.

Performance metrics

For the most part this paper deals with metrics for software development and maintenance. But software operating speed is also important, as is hardware operating speed. There are dozens of performance metrics and performance evaluation methods; a Google search on the phrase "software performance metrics" is recommended. Among these metrics are load, stress, data throughput, capacity, and many others.

PERT (Program Evaluation and Review Technique)

The famous PERT method was developed by the U.S. Navy in the 1950s for handling the logistics of naval ship construction. It is closely aligned to the critical path approach. In practice, PERT diagrams show a network of activities and timelines, with pessimistic, optimistic, and expected durations. Part of the PERT analysis is to identify the critical path, where time cannot easily be compressed. PERT graphs are often used in conjunction with Gantt charts, discussed earlier. PERT is a large and complex topic, so a Google search on "PERT diagrams" or "PERT methodology" will bring up extensive sets of papers and reports. In today's world there are commercial and open-source tools that can facilitate PERT analysis and create PERT diagrams and Gantt charts for software projects.

Phase metrics

The term phase refers to a discrete set of tasks and activities that center on producing a major deliverable such as requirements. For software projects there is some ambiguity in phase terms and concepts, but a typical pattern of software phases would include: 1) requirements; 2) design; 3) coding or construction; 4) testing; 5) deployment. Several commercial estimation tools predict software costs and schedules by phase.
However, there are major weaknesses with the phase concept. Among these weaknesses is the fact that many activities, such as technical documentation, quality assurance, and project management, span multiple phases. Another weakness is the implicit assumption of a waterfall development method, so phases are not a good choice for agile projects. Activity-based cost analysis is a better and more accurate alternative to phases for planning and estimating software.

Portfolio metrics

The term portfolio in a software context refers to the total collection of software owned and operated by a corporation or a government unit. The portfolio would include custom-developed software, commercial software packages, and open-source software packages. In today's world
of 2014 it will also include cloud applications that companies use but do not have installed on their own computers, such as Google documents. Function point metrics are a good choice for portfolios. LOC metrics might be used, but with thousands of applications coded in hundreds of languages, LOC is not an optimal choice. In today's world a Fortune 500 company can easily own more than 5,000 software applications with an aggregate size approaching 10,000,000 function points.

Productivity

The standard economic definition of productivity is goods or services produced per unit of labor or expense. The software industry has not yet settled on a standard unit for the "goods or services" part of this definition. Among the units in use are function points, lines of code, story points, RICE objects, and use case points. Of these, only function points can be applied to every activity and every kind of software developed by all known methodologies. As of 2014, function point metrics are the best choice for quantifying software goods and services and therefore for measuring economic productivity. However, the software literature includes over a dozen other units, such as several flavors of LOC metrics, story points, use-case points, velocity, etc. So far as can be determined, no other industry besides software has such a plethora of bad choices for measuring economic productivity.

Production rate

This metric is often paired with assignment scope to create software cost and schedule estimates for specific activities. The production rate is the amount of work a person can complete in a fixed time period such as an hour, a week, or a month. Using the simple natural metric of pages in a user's guide assigned to a technical writer, the assignment scope might be 50 pages and the production rate might be 25 pages per month. This combination would lead to an estimate of one writer and two calendar months.
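The assignment-scope and production-rate arithmetic in the worked example above generalizes directly to any deliverable. A minimal sketch:

```python
import math

def staffing_estimate(deliverable_size, assignment_scope, production_rate):
    """Classic assignment-scope / production-rate estimate:
    staff = size / scope (rounded up),
    schedule = size / (staff * production rate)."""
    staff = math.ceil(deliverable_size / assignment_scope)
    schedule_months = deliverable_size / (staff * production_rate)
    return staff, schedule_months

# The user's-guide example from the text:
# 50 pages, assignment scope of 50 pages, production rate of 25 pages/month.
staff, months = staffing_estimate(50, assignment_scope=50, production_rate=25)
print(staff, months)  # 1 writer, 2.0 calendar months
```

The same function works with source code statements, function points, or story points as the deliverable unit, provided the assignment scope and production rate are expressed in the same unit.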
Production rates can be calculated using any metric for a deliverable item, such as pages, source code, function points, story points, etc.

Professional malpractice

Because software is not a licensed profession, it cannot actually have professional malpractice in 2014. Yet several metrics in this report are cited as being professional malpractice in specific contexts. The definition of professional malpractice is an instance of incompetence or negligence on the part of a professional. A corollary to this definition is that academic training in the profession should have provided all professionals with sufficient information to avoid most malpractice situations. As of 2014, software academic training is inadequate to warn software engineers and software managers of the hazards of bad metrics.

The metric lines of code is viewed as professional malpractice in the specific contexts of: 1) economic analysis across multiple programming languages; 2) economic analysis that includes requirements, design, and other non-code work. LOC metrics would not be malpractice for studying pure coding speed or for studying code defects in specific languages. The metric cost per
defect is viewed as professional malpractice in the contexts of: 1) exploring the economic value of quality; 2) comparing a sequence of defect removal operations for the same project. Cost per defect would not be malpractice if fixed costs were backed out, or for comparing identical defect removal activities such as unit test across several projects. LOC metrics make requirements and design invisible and penalize modern high-level languages. Cost per defect makes the buggiest software look cheapest and ignores the true value of quality in shortening schedules and lowering costs.

Profit center

A profit center is a corporate group or organization whose work contributes to the income and profits of the company. The opposite case is a cost center, where money is consumed but the work does not bring in revenues. For internal software that companies build for their own use, some companies use the cost center approach and some use the profit center approach. Cost center software is provided to internal clients for free, and funding comes from some kind of corporate account. Profit center software charges internal users for the labor and materials needed to construct custom software. In general, measures and metrics are better under the profit center model, because without good data there is no way to bill the clients. As a general rule for 2014, about 60% of internal software groups are run using the cost center model and 40% using the profit center model.

Commercial software development is clearly a profit center model. For embedded software in medical devices or automotive engines, the software is part of a hardware product and usually not sold separately; however, it still might be developed under a profit center model, though not always. Overall, profit centers tend to be somewhat more efficient and cost effective than cost centers. This topic should be included in standard benchmark reports, but is actually somewhat difficult to find in the software literature.
Progress improvements: measured rates of quality and productivity gains

For an industry notorious for poor quality and low productivity, it is obvious that metrics and measurements should be able to show improvements over time. While this is technically possible and not even very difficult, it seldom happens. The reason is that when companies collect data for benchmarks, they tend to regard them as one-shot measurement exercises rather than as a continuing activity with metrics collected once or twice a year for long periods. Some leading companies such as IBM do measure rates of progress, and so do some consulting groups such as the author's Namcook Analytics LLC.

Long-range measures over a 10-year period show that quality can be improved at annual rates of 25% or more for at least five years in a row. Productivity gains are harder to accomplish, and annual improvements are usually below 10%. Quality is measured using defect potentials and defect removal efficiency (DRE). Productivity is measured using work hours per function point for applications of nominally the same size and type. Switching to agile from waterfall is beneficial, but the agile learning curve is steep enough that initial results will be disappointing.
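Annual improvement rates compound, which is why multi-year measurement matters. A minimal sketch of the arithmetic, treating the 25% annual quality gain cited above as an exact compound rate and using a hypothetical starting defect potential of 5.0 per function point:

```python
def compounded_improvement(initial_value, annual_gain, years):
    """Apply a constant annual improvement rate; for quality, the value
    tracked is the defect potential, so improving means shrinking."""
    return initial_value * (1.0 - annual_gain) ** years

# 25% annual quality gains sustained for five years:
potential = compounded_improvement(5.0, 0.25, 5)  # defects per function point
print(round(potential, 2))  # 1.19 -- compounding cuts the potential by ~76%
```

The same function with a 10% rate shows why productivity curves look much flatter than quality curves over the same five-year window.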
Project end date

The start and end dates of software projects are surprisingly ambiguous. The definition of project end date used by the author is the date the software is delivered to its intended users. This assumes that the end date is for development; clearly, maintenance and enhancement work could continue for years. An alternate end date would be the freeze point for a software project, after which no further changes can be made to the current release. This is normally several weeks prior to delivery to clients. There are no fixed rules for end dates.

Project-level metrics

Probably the most common form of benchmark in the world is an overall result for a software project without any granularity or internal information about activities and tasks. For example, a typical project-level benchmark for an application of 1,000 function points might be that it required 15 work hours per function point, had a schedule of 15 calendar months, and cost $1,200 per function point. The problem with this high-level view is that there is no way to validate it. Did the project include project management? Did the project include unpaid overtime? Did the project include part-time workers such as quality assurance and technical writers? There is no way of being sure what really happened with project-level metrics. See the discussion of activity-based costs earlier in this report.

Project office or project management office (PMO)

For large companies that build large systems above 10,000 function points in size, it is very common to have a dedicated team of planning and estimating specialists who work together in an organization called either a project office or a project management office (PMO). These organizations are found in most major corporations, such as IBM, AT&T, Motorola, and hundreds of others. PMO staffing runs from a low of two up to more than a dozen for massive software projects in the 100,000 function point size range.
Because ordinary project managers are not trained in either software estimation or measurement, the PMO groups employ specialists who are. Further, PMO offices are usually well stocked with a variety of project management tools, including parametric estimation tools (SEER, KnowledgePlan, Software Risk Master, etc.), project planning tools (Microsoft Project, Timeline, etc.), and more recently newer tools such as the Automated Project Office (APO) by Computer Aid Inc. As a general rule, large software projects supported by formal PMO groups have better track records for on-time delivery and cost accuracy than projects of the same size that do not have PMO organizations.

Project start date

The start date of a software project is one of the most uncertain and ambiguous topics in the entire metrics literature. Long before requirements begin, someone had to decide that a specific software application was needed. This need had to be expressed to higher managers who would
be asked to approve funds. The need would have to be explained to software development management and some technical personnel. Only then would formal requirements gathering and analysis occur. What is the actual start date? For practical purposes, the pre-requirements discussions and funding discussions are seldom tracked, and even if they were tracked there would be no easy way to assign them to a project until it is defined. About the only date that is crisp is the day requirements gathering starts. However, for projects created by inventors for their own purposes, there are no formal requirements other than concepts in the mind of the inventor. When collecting benchmark data, the author asks the local project manager for the start date and also asks what work took place on that date. Not everybody answers these questions the same way, and there are no agreed-to rules or standards for defining a software project's start date.

Quality

There are many competing definitions of software quality, including some, like "conformance to requirements," that clearly don't work well. Others, such as maintainability and reliability, are somewhat ambiguous and only partial definitions. The definition used by the author is the absence of defects that would cause a software application either to stop completely or to produce incorrect results. This definition has the virtue of being usable with requirements and design defects as well as code defects. Since requirements are often buggy and filled with errors, these defects need to be included in a working definition of software quality. Defects also correlate with customer satisfaction, in that as bugs go up, satisfaction comes down.

Quality Function Deployment (QFD)

QFD was originally developed in Japan for hardware products by Dr. Yoji Akao in 1966. More recently it has been applied to software.
QFD is included in a glossary of software metrics and measurement because of the interesting fish-bone diagrams that are part of the QFD process. These are also called "house of quality" diagrams because the top of the diagram resembles a peaked roof. QFD is a complex subject, and a Google search will bring up the literature on it. QFD is effective in improving the delivered quality of a number of kinds of products, including software. The kinds of software using QFD tend to be engineering applications and medical devices, where there are significant liabilities and very high operational reliability is needed.

Ranges of software development productivity

Considering that software is more than 60 years old in 2014, one might think that both average productivity rates and ranges of productivity would be well known and widely published. This is not the case. There are books such as the author's Applied Software Measurement that have ranges and averages, and there are benchmark sources such as the International Software Benchmarking Standards Group (ISBSG) that publish ranges and averages for subsets, but there is no source of national data that is continuously updated to show U.S. national averages for software productivity or the ranges of productivity. This would be somewhat equivalent to published data on U.S. life expectancy levels. Among the author's clients, the range of software
productivity is from a low of just over 1 function point per staff month for large defense applications to a high of just under 100 function points per staff month for small civilian projects with more than 75% reusable materials. From the author's collection of about 20,000 projects, the ranges, expressed in function points per staff month, are:

By size:

  1 function point           33.84
  10 function points         21.56
  100 function points        16.31
  1,000 function points      12.67
  10,000 function points      3.75
  100,000 function points     2.62
  Average (not weighted)     13.27

By type:

  Web projects               12.32
  Domestic outsource         11.07
  IT projects                10.04
  Commercial                  9.12
  Systems/embedded            7.11
  Civilian government         6.21
  Military/defense            5.12
  Average (not weighted)      8.72

Note that there are large variations by application size and also large variations by application type. There are also large variations by country, although international data is not shown here; Japan and India, for example, would be better than the U.S. Note also that other benchmark providers might have data with different results from the data shown here. This could be because benchmark companies normally have unique sets of clients, so the samples are almost always different. Also, there is little coordination or cooperation among the various benchmark groups, although the author, ISBSG, and Don Reifer did produce a joint report on project size with data from all three organizations.

Ranges of software development quality

Because poor quality and excessive volumes of delivered defects are endemic problems for the software industry, it would be useful to have a national repository of software quality data. This does not exist. In fact, quality data is much harder to collect than productivity data due to "leakage," which leaves out defects found in requirements and design, defects found by static analysis, and defects found by desk checking and unit test.
Even delivered defect data leaks, because if too many bugs are released, usage will drop, and hence latent bugs will remain latent and undiscovered. From the author's collection of about 20,000 projects, the following are approximate average values for software quality. Here too, other benchmark sources will vary.
By size (defects per function point):

  Size (FP)   Defect Potential   Removal Efficiency   Defects Delivered
  1                1.50               96.93%                0.05
  10               2.50               97.50%                0.06
  100              3.00               96.65%                0.10
  1,000            4.30               91.00%                0.39
  10,000           5.25               87.00%                0.68
  100,000          6.75               85.70%                0.97
  Average          3.88               92.46%                0.37

By type (defects per function point):

  Type                 Defect Potential   Removal Efficiency   Defects Delivered
  Domestic outsource        4.32               94.50%                0.24
  IT projects               4.62               92.25%                0.36
  Web projects              4.64               91.30%                0.40
  Systems/embedded          4.79               98.30%                0.08
  Commercial                4.95               93.50%                0.32
  Government                5.21               88.70%                0.59
  Military                  5.45               98.65%                0.07
  Average                   4.94               93.78%                0.30

As can be seen, there are variations by application size and also by application type. For national average purposes, the values shown by type are more meaningful than those by size, since there are very few applications larger than 10,000 function points, and these large sizes distort the averages. In other words, viewed cross-industry circa 2014, defect potentials average about 4.94 per function point, defect removal efficiency averages about 93.78%, and delivered defects average about 0.30 per function point. Overall, defect potentials range from about 1.25 to about 7.50 per function point. Defect removal efficiency ranges from a high of 99.65% to a low of below 77.00%.

Ranges of software schedules

Probably the best way to show ranges of software schedules is a graph of best, average, and worst case schedules across a range of application sizes. As might be expected, the differences between the worst and best cases expand for large systems.
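The three quality columns above are linked by one identity: delivered defects equal the defect potential times the share of defects not removed. A minimal sketch checking the cross-industry averages (note that the table's column averages are computed independently, so the identity reproduces the 0.30 figure only approximately):

```python
def delivered_defects(potential_per_fp, dre):
    """Delivered defect density: potential * (1 - DRE)."""
    return potential_per_fp * (1.0 - dre)

# Cross-industry averages from the table by type:
# potential 4.94 per FP, DRE 93.78%.
print(round(delivered_defects(4.94, 0.9378), 2))  # 0.31, close to the 0.30 shown
```

The same identity explains why systems/embedded and military software deliver so few defects despite above-average defect potentials: their DRE levels are above 98%.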
Rayleigh curve

Lord Rayleigh was an English physicist who won a Nobel Prize in 1904 for the discovery of argon. He also developed a family of curves that describe the distribution of results for several variables. Larry Putnam and Peter Norden adopted this family of curves as a method of describing software staffing, effort, and schedules. The curves for software are known as Putnam-Norden-Rayleigh (PNR) curves; a Google search for this term will show many different articles. In general, the curves are a good approximation of software staffing over time.

The PNR curves, and other forms of Rayleigh curves, assume smooth progress. For software this is not always the case. There are often severe discontinuities in the real world caused by creeping requirements, canceled projects, deferred features, or other abrupt changes. For example, about 32% of large systems above 10,000 function points are canceled without being completed, which truncates the PNR curves. For smaller projects, with better odds of success, the curves are more accurate. Larry Putnam was the original developer of the SLIM estimation tool, which supports this family of curves, as do other tools as well. See also the discussion of chaos theory earlier in this paper for discontinuities and random events.

Reliability metrics

Software reliability refers, in general, to how long software can operate successfully without encountering a bug or crashing. Reliability is often expressed using mean time to failure (MTTF) and mean time between failures (MTBF). Studies at IBM found that reliability correlated strongly with the number of released defects and with defect removal efficiency (DRE). High reliability, meaning encountering a bug or failure less than once per year, normally demands DRE levels above 99% and delivered defect densities below 0.001 per function point.

Repair and rework costs

This term overlaps the term technical debt, which is discussed later in the report.
The phrase covers the total effort and costs for logging bugs, routing them to repair teams, fixing the bugs, testing the fixes for regressions, integrating the new code, and then releasing the new version of a software application. Since bug repairs are the #1 software cost driver, repair and rework costs often top 25% of total development budgets.

Requirement

There is some ambiguity in exactly what constitutes a requirement. In general, a requirement is a description of a specific feature that clients want software to perform. A requirement for a word-processing software package would be automatic spell checking. Requirements can be expressed in terms of use cases, user stories, text, mathematical formulae, or combinations of methods. No matter how requirements are expressed, they are known to have several attributes that cause problems for software projects. These attributes include: 1) errors in requirements;
2) toxic requirements that are harmful to software (such as Y2K); 3) incompleteness, which leads to continuous requirements growth. There are also deeper and more subtle problems, such as the lack of an effective taxonomy that can put requirements into a well-formed hierarchy. All software needs to accept inputs, perform various calculations, and produce results. But this general statement needs to be expanded into a formal taxonomy that would encompass error checking, user error avoidance, and many other topics. From comparisons of explicit requirements and function points, an average requirement takes about 3.0 function points to implement, with a range between 0.5 and 20.

Requirements creep

Requirements have been known to change during development for more than 50 years. But it was only after function points were released in 1978 that requirements creep could be measured explicitly. A sample of software projects was sized using function points at the end of the requirements phase. Later the same projects were resized at the point of delivery to customers. Since both starting and ending sizes were known, and the calendar schedules were known, researchers at IBM and elsewhere could measure requirements creep exactly.

Assume that an application was measured at the end of requirements at 1,000 function points. Assume that the same application was measured 12 months later at release and was found to be 1,200 function points in size. The additional 200 function points represent total growth of 20%, or about 16.67 function points per month: an average monthly growth rate of 1.67%. Growth does not stop with delivery but continues for as long as the software is in active use. Post-release growth is slower, at about 8% per calendar year.

Requirements metrics

It is a bad assumption to believe that user requirements are error-free.
User requirements contain many errors, and some requirements may be toxic and should not be in the application at all; Y2K is an example of a toxic requirement. The essential metrics for software requirements include but are not limited to: 1) requirements size in pages, words, and diagrams; 2) requirements errors found by inspection; 3) possibly toxic requirements pointed out to users by domain experts; 4) rates of requirements growth or change during development; 5) requirements deferred to future releases in order to achieve arbitrary schedule targets.

Since the author's estimating tool Software Risk Master (SRM) has a patent-pending early sizing feature that allows it to be used prior to requirements, it predicts all five of these essential requirements metrics. An example for an application of 10,000 function points at delivery using the Rational Unified Process (RUP) would be: starting size = 8,023 function points; creep = 1,977 function points; monthly rate of creep = 1.82%; total creep = 19.77%; requirements defects = 1,146; toxic requirements = 27; requirements completeness = 73.68%; explicit requirements = 2,512; function points per requirement = 3.37. All of these predictions can be made before requirements analysis starts by using pattern matching against similar completed projects.

Requirements are volatile and also error-prone. In the future, formal patterns of reusable requirements will no doubt smooth out current problems and provide better overall requirements than are common in 2014.
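The requirements creep arithmetic above (1,000 function points growing to 1,200 over 12 months) can be sketched in a few lines. This is an illustrative sketch only; the function and its output field names are not part of any tool mentioned in this report:

```python
def requirements_creep(start_fp: float, end_fp: float, months: float) -> dict:
    """Summarize requirements creep between two function point sizings.

    Uses the simple (non-compounded) monthly rate from the worked
    example in the text: total growth percent divided by elapsed months.
    """
    growth = end_fp - start_fp
    total_pct = 100.0 * growth / start_fp
    return {
        "growth_fp": growth,                      # absolute creep in FP
        "growth_fp_per_month": growth / months,   # FP added per month
        "total_growth_pct": total_pct,            # creep as % of start size
        "monthly_growth_pct": total_pct / months, # average monthly % growth
    }

# Worked example from the text: 1,000 FP at end of requirements,
# 1,200 FP at delivery 12 months later.
creep = requirements_creep(1000, 1200, 12)
print(creep)  # growth 200 FP, ~16.67 FP/month, 20% total, ~1.67%/month
```

The same function could be re-run at delivery and again during maintenance to track the slower post-release growth rate.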
Return on investment (ROI)

A Google search on the phrase return on investment will bring up hundreds of articles and over a dozen flavors of ROI, including return on assets, financial rate of return, economic rate of return, and a number of others. For software projects ROI is often not calculated at all, and is seldom calculated well.

What is needed for software ROI are accurate predictions of project schedules and costs prior to starting, and accurate measures of schedules and costs after completion. Quality also needs to be predicted and measured, because poor quality will inflate maintenance costs and warranty repair costs to alarming levels, and may also trigger expensive litigation due to consequential damages.

It is technically possible to predict schedules and costs with good accuracy using any of the available parametric estimation tools, with the caveat that most cannot be used until requirements are known. The Software Risk Master (SRM) tool includes a patent-pending early sizing feature that allows it to predict costs and schedules prior to requirements. The SRM tool also includes ROI as a standard output, assuming that the client who commissioned the estimate can provide value data for tangible and intangible value. SRM compares costs to value to calculate ROI, but it does not predict value. That must be user-supplied information, because value can range from almost nothing to creating an entirely new business that will earn billions of dollars, as shown by Microsoft, Facebook, and Twitter.

The essential problems with ROI calculations circa 2014 include: 1) optimistic estimates for costs and schedules; 2) optimistic quality estimates; 3) leaky historical data; 4) failure to include requirements creep in schedule and cost estimates; 5) very poor tracking of progress; 6) poor quality control, which leads to delays and cost overruns; 7) optimistic revenue or value predictions.
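As a minimal illustration of the cost-versus-value comparison (not SRM's actual algorithm, which is proprietary), a generic net-gain-over-cost calculation looks like this; the dollar figures are hypothetical:

```python
def simple_roi(total_cost: float, total_value: float) -> float:
    """Generic return on investment: net gain divided by cost."""
    return (total_value - total_cost) / total_cost

# Hypothetical project: $2.0M total cost, $5.0M combined tangible and
# intangible value supplied by the client who commissioned the estimate.
roi = simple_roi(2_000_000, 5_000_000)
print(f"ROI = {roi:.2f}")  # 1.50, i.e. $1.50 of net gain per $1 invested
```

Note that the seven problems listed above attack both inputs: optimistic estimates shrink `total_cost` and optimistic value predictions inflate `total_value`, so a computed ROI is only as reliable as the measurements behind it.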
Reusable materials

Because custom design and manual coding of software is intrinsically expensive and error-prone, there is a need to move away from custom development and toward construction from certified standard reusable components. However, reuse covers much more than just source code. The full suite of reusable artifacts includes but is not limited to: 1) reusable architecture, 2) reusable design, 3) reusable requirements, 4) reusable plans, 5) reusable estimates, 6) reusable data structures, 7) reusable source code, 8) reusable test plans, 9) reusable test cases, 10) reusable test scripts, and 11) reusable user documentation.

Currently software is built rather like an America's Cup yacht or a Formula 1 race car, using custom designs and extensive manual labor. In the future software might be constructed like regular automobiles such as Fords or Toyotas, using assembly lines of reusable materials and perhaps even robots. The overall impact of software reuse is the largest known variable in all of software engineering: for a generic application of 1,000 function points in size, neither team experience, methodologies, nor programming languages have as much impact on software productivity rates as does reuse of certified components.
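One simple way to see why reuse dominates other productivity factors is to treat certified reused components as work that no longer has to be custom-built. This is an idealized sketch under the assumption that reused components need no rework:

```python
def effective_custom_size(size_fp: float, reuse_fraction: float) -> float:
    """Function points that must still be custom-built, assuming the
    reused components need no rework (an idealized assumption)."""
    return size_fp * (1.0 - reuse_fraction)

# Generic 1,000 FP application at 0%, 50%, and 85% certified reuse.
for reuse in (0.0, 0.5, 0.85):
    remaining = effective_custom_size(1000, reuse)
    print(f"{reuse:.0%} reuse -> {remaining:,.0f} FP of custom work")
```

Even before accounting for the lower defect density of certified components, no change of methodology or language can remove 85% of the work the way high reuse can.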
RICE objects

ERP companies such as SAP and Oracle use the phrase RICE objects as a work metric. The acronym RICE stands for reports, interfaces, conversions, and enhancements. These are some of the activities associated with deploying ERP packages and with building and customizing applications to work with them.
Risk metrics

Software projects have a total of about 210 possible risk factors. Among these are outright cancellation, schedule delays, cost overruns, breach of contract litigation, patent litigation, cyber attacks, and many more. The risk analysis engine of the author's Software Risk Master (SRM) tool predicts 20 of the 210 risks and assigns each risk a probability percent based on historical data derived from similar projects of the same size and type. For example, the risk of breach of contract litigation ranges from 0% for in-house projects to about 15% for large contract waterfall projects with inexperienced personnel. Risk severities are also predicted using a scale from 1 to 10, with the lower numbers being less serious. Risk avoidance probabilities are also calculated based on weighted combinations of CMMI levels, methodologies, and team experience levels. The worst case would be cowboy development at CMMI 1 by a team of novices. The best case would be TSP or RUP at CMMI 5 by a team of experienced personnel. These risk predictions and metrics are standard features of the SRM tool.
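A risk chart of this kind can be summarized with elementary arithmetic. The entries below are a hypothetical subset for illustration, in the same odds-and-severity style; none of the code is part of SRM itself:

```python
# Hypothetical risk entries: (name, odds percent, severity on a 1-10 scale).
risks = [
    ("Optimistic cost estimates", 35.00, 9.50),
    ("Inadequate quality control using only testing", 30.00, 10.00),
    ("Significant requirements creep", 26.30, 8.00),
    ("Cancellation due to poor performance", 14.50, 10.00),
]

# Average odds and average severity across the listed risks.
avg_odds = sum(odds for _, odds, _ in risks) / len(risks)
avg_severity = sum(sev for _, _, sev in risks) / len(risks)
print(f"Average odds: {avg_odds:.2f}%  Average severity: {avg_severity:.2f}")
```

A real risk engine would weight these averages by project size, type, CMMI level, and team experience rather than averaging uniformly.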
The normal way of presenting risks resembles the following chart:

Risk Analysis (from similar projects)                      Odds     Severity
Optimistic cost estimates                                 35.00%      9.50
Inadequate quality control using only testing             30.00%     10.00
Excessive schedule pressure from clients, executives      29.50%      6.50
Technical problems hidden from clients, executives        28.60%     10.00
Executive dissatisfaction with progress                   28.50%      8.50
Client dissatisfaction with progress                      28.50%      9.00
Poor quality and defect measures (omits > 10% of bugs)    28.00%      7.00
Poor status tracking                                      27.80%      7.50
Significant requirements creep (> 10%)                    26.30%      8.00
Poor cost accounting (omits > 10% of actual costs)        24.91%      6.50
Schedule slip (> 10% later than plan)                     22.44%      8.00
Feature bloat and useless features (> 10% not used)       22.00%      5.00
Unhappy customers (> 10% dissatisfied)                    20.00%      9.25
Cost overrun (> 10% of planned budget)                    18.52%      8.50
High warranty and maintenance costs                       15.80%      7.75
Cancellation of project due to poor performance           14.50%     10.00
Low reliability after deployment                          12.50%      7.50
Negative ROI due to poor performance                      11.00%      9.00
Litigation (patents)                                       9.63%      9.50
Security vulnerabilities in software                       9.60%     10.00
Theft of intellectual property                             8.45%      9.50
Litigation (breach of contract)                            7.41%      9.50
Toxic requirements that should be avoided                  5.60%      9.00
Low team morale                                            4.65%      5.50

Average risks for this size and type of project           18.44%      8.27
Financial risk (cancel; cost overrun; negative ROI)       44.02%

Risks vary by size, complexity, experience, CMMI levels, and other factors that are specific to individual projects.

Root-cause analysis

The phrase root cause analysis refers to a variable set of methods and statistical approaches that attempt to find out why specific problems occurred. Root cause analysis is usually aimed at serious problems that can cause harm or large costs if not abated. Root cause analysis (RCA) is not limited to software; it is widely used by many high-technology industries and also by medical and military researchers.

As an example of software root cause analysis, a specific high-severity bug in a software application might have slipped through testing because no test case looked for the symptoms of the bug. A first-level cause might be that project managers arbitrarily shortened test case design periods. Another cause might be that test personnel did not use formal test design methods based on mathematics, such as design of experiments. Further, testing might have been performed by untrained developers rather than by certified test personnel. The idea of root cause analysis is to work backwards from a specific problem and identify as many layers of causes as can be proven to exist. Root cause analysis is expensive, but some tools are available from commercial and open-source vendors. See also failure mode and effects analysis (FMEA) discussed earlier in this report.

Sample sizes

An interesting question is what kinds of sample sizes are needed to judge software productivity and quality levels. Probably the minimum sample would be 20 projects of the same size, class, and type. Since the permutations of size, class, and type total more than 2,000,000 instances, a great deal of data is needed to understand the key variables that impact software project results.
To judge national productivity and quality levels, about 10,000 projects per country would be useful. Since software is a major industry in more than 100 countries, the global sample size for the overall software industry should include about 1,000,000 projects. As of 2014 the sum total of all known software benchmarks is only about 80,000 projects. See the discussion of taxonomies later in this report.

Schedule compression

Software schedules routinely run later than planned. Analysis by the author of over 500 projects found that average schedule demands by clients or senior managers approximated raising application size in function points to the 0.3 power. Actual delivery dates for the same projects had exponents ranging from 0.37 to 0.41. For a generic application of 1,000 function
points, clients wanted the software in 8 calendar months, while it actually took between 12 and 17 months to deliver. This brings up two endemic problems for the software industry: 1) software clients and executives consistently demand schedules shorter than the time in which it is possible to build software; 2) software construction needs to switch from custom development to using larger volumes of standard reusable components in order to shorten schedules by 50% or more.

Normal attempts to compress software project schedules include adding personnel, which usually backfires; truncating quality control and test periods, which always backfires; and increasing overlap between activities. None of these is usually successful, and they may indeed make schedules worse. Another common method of schedule compression is to defer planned features to a later release. Use of formal risk analysis and estimation before starting projects can minimize the odds of irrational schedule demands. Benchmarks from projects of the same size and type can also minimize those odds. Overall, impossible schedule demands are the #1 cause, and poor development and construction methods the #2 cause, of delivering software later than desired.

Schedule overlap

The term schedule overlap refers to the normal practice of starting an activity before a prior activity is completed. See the discussion of Gantt charts for a visual representation of schedule overlap. Normally design starts when requirements are about 75% complete; coding starts when design is about 50% complete; and testing starts when coding is about 25% complete. This means that the net schedule of a software project from beginning to end is shorter than the sum of the activity schedules. Parametric estimation tools, and project management tools that support PERT and Gantt charts, all handle schedule overlap, which is normal for software projects.
Schedule overlap is best handled using activity-based or task-based cost analysis. Agile projects with a dozen or more sprints are a special case for schedule overlap calculations.

Schedule slip

As discussed in the section on schedule compression, users routinely demand delivery dates for software projects that are quicker than technically possible. However, schedule slip is not quite the same. Assume a project is initially scheduled for 18 calendar months. At about month 16 the project manager reports that more time is needed and the schedule will be 20 months. At about month 19 the manager reports that more time is needed and the schedule will be 22 months. At about month 21 the manager reports that more time is needed and the schedule will be 24 months. In other words, schedule slip is the cumulative sequence of small schedule delays, usually reported only a short time before the nominal delivery date. This is an endemic problem for large software projects. The root causes are inept scheduling before the project starts, requirements creep during development, and poor quality control, which stretches out testing schedules. It should be noted that most software projects seem to be on time and even early until testing begins, at which point they are found to have so many bugs that the planned test schedules double or triple.
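The power-law schedule relationship described under schedule compression can be sketched directly; the exponents are the ones cited in the text:

```python
def schedule_months(size_fp: float, exponent: float) -> float:
    """Approximate schedule in calendar months as function points
    raised to a power, the form described under schedule compression."""
    return size_fp ** exponent

size = 1000  # generic application size in function points
demanded = schedule_months(size, 0.30)    # what clients demand
actual_low = schedule_months(size, 0.37)  # observed lower bound
actual_high = schedule_months(size, 0.41) # observed upper bound
print(f"Demanded: {demanded:.1f} months")        # ~7.9 months
print(f"Actual: {actual_low:.1f} to {actual_high:.1f} months")  # ~12.9 to ~17.0
```

The gap between the 0.30 demand exponent and the 0.37 to 0.41 delivery exponents is a compact way of expressing why schedule slip is endemic.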
Scope

The word scope in a software context is synonymous with size and is measured using function points, story points, lines of code, or other common metrics. Scope creep is another common term, synonymous with requirements creep.

Security metrics

In today's world of hacking, denial of service, and cyber attacks, companies are beginning to record attempts at penetrating firewalls and other defenses. Also measured are the strength of various encryption schemes for data and confidential information, and password strength. This is a complex topic that changes rapidly, so a Google search on security metrics is recommended to stay current. After software is released and actually experiences attacks, data should be kept on the specifics of each attack and also on the staffing, costs, and schedules for recovery, as well as financial losses to both companies and individuals.

Six-Sigma for Software

The concept of six-sigma was developed at Motorola circa 1986. It became famous when Jack Welch adopted six-sigma for General Electric. The concept originated in hardware manufacturing. More recently six-sigma has been applied to software, with mixed but generally good results. The term six-sigma is a mathematical way of expressing reliability, or the odds of defects occurring. To achieve six-sigma results, defect removal efficiency (DRE) would need to be 99.99966%. The current U.S. average for DRE is below 90%, and very few projects achieve 99%. The six-sigma approach has an extensive literature and training regimen, as does six-sigma for software. A Google search on the phrase six-sigma for software will bring up hundreds of documents and books.

Size adjustment

Many of the tables and graphs in this report, and others by the same author, show data expressed in even powers of 10; i.e., 100 function points, 1,000 function points, 10,000 function points, and so on. This is not because the projects were all even values.
The author has a proprietary tool that converts application size to even values. For example, if several PBX switches range from a low of 1,250 function points to a high of 1,750 function points, they could all be expressed at a median value of 1,500 function points. The reason for this is to highlight the impact of specific factors such as methodologies, experience levels, CMMI levels, and so forth. Size adjustment is a subtle issue and includes adjusting defect potentials and requirements creep. In other words, size adjustment is not just adding or subtracting function points while keeping all other data at the same ratios as the original. For example, if software size in function points is doubled, defect potentials will go up by more than 100% and defect removal efficiency (DRE) will decline.
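The non-linear behavior of defect potentials can be illustrated with the rule of thumb, used elsewhere in the author's publications, that defect potential approximates function points raised to the 1.25 power. Treat the exponent as an illustrative assumption rather than a measured constant:

```python
def defect_potential(size_fp: float, exponent: float = 1.25) -> float:
    """Defect potential as a power of function points. The 1.25
    exponent is a published rule of thumb, used here as an assumption."""
    return size_fp ** exponent

small = defect_potential(1000)
large = defect_potential(2000)
ratio = large / small  # equals 2 ** 1.25, about 2.38
print(f"Doubling size multiplies defect potential by {ratio:.2f}")
```

Because the ratio exceeds 2.0, doubling size raises defect potential by more than 100%, which is why size-adjusted data cannot simply scale all values proportionally.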
SNAP metrics

Function point metrics were developed to measure the size of software features that benefit users of the software. But there are many features in software that do not benefit users yet are still required due to technical or legal constraints. The new metric for these, distinct from function points, is termed the software non-functional assessment process, or SNAP.

As an example of a non-functional requirement, consider home construction. A home owner with an ocean view will want windows facing the ocean, which is a functional requirement. However, local zoning codes and insurance regulations mandate that windows close to the ocean must be hurricane-proof, which is very expensive. This is a non-functional requirement.

Function points and SNAP metrics are calculated separately. However, among the author's clients who have tried SNAP, the SNAP counts seem to approximate 15% to 20% of the volume of function points. Because SNAP is new and only slowly being deployed, there may be future changes in the counting method and additional data in the future. Some examples of software non-functional requirements might include security features and special features so the software can operate on multiple hardware platforms or multiple operating systems. As this report is being drafted, an announcement in March of 2014 indicates that IFPUG and Galorath Associates are going to perform joint studies on SNAP metrics.

Software employment statistics

Most of us depend on the Bureau of Labor Statistics, the Census Bureau, and the Department of Commerce for statistics about software employment. However, a study of software occupation groups commissioned by AT&T noted some sources of error. The Bureau of Labor Statistics showed about 1,018,000 software programmers in 2012, with an impressive 22% growth rate. Our study showed that not a single human resources group kept good records or even knew how many software personnel were employed.
Further, some software personnel building embedded software at companies such as Ford and several medical and avionics companies refused to be identified as software engineers due to low status. They preferred their original academic job titles of automotive engineer, aeronautical engineer, or anything but software engineer. Many high-tech companies used a generic title of member of the technical staff that included both software and hardware engineers of various kinds, without specifically identifying the software personnel. Very likely government statistics are on the low side. If the HR groups of Fortune 500 companies don't know how many software people work there, probably the government does not know either.

Software Quality Assurance (SQA)

As a young software engineer working for IBM in California, the author worked in one of IBM's SQA organizations. Because SQA groups evaluate the quality status of software projects, they need an independent organization separate from the development organization. This ensures that SQA opinions are objective, and not watered down by threats of reprisals from project managers in case of a negative opinion. The SQA groups in major companies collect quality data and also provide quality training. SQA personnel also participate in formal inspections,
often as moderators. In terms of staffing, SQA organizations are typically about 3% of the development team size, although ranges exist. The IBM SQA organizations also had a true research and development function over and above normal project status reporting. For example, while working in an IBM quality assurance group, the author performed research on the pros and cons of software metrics, and also designed IBM's first parametric software estimation tool in 1973.

Formal SQA organizations are not testing groups, although some companies apply the name to testing. Testing groups usually report to development management, while SQA groups report through a separate organization up to a VP of Quality. One of the more famous VPs of Quality was Phil Crosby of ITT, whose book Quality Is Free remains a best-seller even in 2014. The author also worked for ITT and was the software representative to the ITT corporate quality council.

Software usage and consumption metrics

To complete the economic models of software projects, usage and consumption need to be measured as well as production. As it happens, function point metrics can be used for consumption studies as well as for production and maintenance studies. For example, physicians have access to more than 3,000,000 function points in MRI devices and other diagnostic tools; attorneys have access to more than 325,000 function points of legal tools; project managers have access to about 35,000 function points if they use tools such as parametric estimation, Microsoft Project, cost accounting, and so forth. A full economic model of commercial, systems, and embedded applications, as well as some IT applications, would combine production and usage data. Following are the ranges in project management tool usage for leading, average, and lagging projects:
Numbers and Size Ranges of Software Project Management Tools
(Tool sizes are expressed in terms of IFPUG function points, version 4.2)

Project Management Tools          Lagging   Average   Leading
 1  Project planning                1,000     1,250     3,000
 2  Project cost estimating                             3,000
 3  Statistical analysis                                3,000
 4  Methodology management                      750     3,000
 5  Reusable feature analysis                           2,000
 6  Quality estimation                                  2,000
 7  Assessment support                          500     2,000
 8  Project office support                      500     2,000
 9  Project measurement                                 1,750
10  Portfolio analysis                                  1,500
11  Risk analysis                                       1,500
12  Resource tracking                 300       750     1,500
13  Governance tools                                    1,500
14  Value analysis                              350     1,250
15  Cost variance reporting           500       500     1,000
16  Personnel support                 500       500       750
17  Milestone tracking                          250       750
18  Budget support                              250       750
19  Function point analysis                     250       750
20  Backfiring: LOC to FP                                 300
21  Earned value analysis                       250       300
22  Benchmark data collection                             300
    Subtotal                        1,800     4,600    30,000
    Tools                               4        12        22

Similar data is also known for software development, software quality assurance, software maintenance, and software testing. It is interesting and significant that the largest differences in tool use between laggards and leaders are for project management and quality assurance. Laggards and leaders use similar tool suites for development, but the leaders use more than twice as many tools for management and quality assurance tasks than do the laggards.

Sprint

The term sprint is an interesting agile concept. In some agile projects overall features are divided into sets that can be built and delivered separately, often in a short period of six weeks to two months. These subsets of overall application functionality are called sprints. The term is derived from racing and implies a short distance rather than a marathon. The sprint concept works well for projects below 1,000 function points, but begins to encounter logistical problems at about 5,000 function points.
For really large systems above 10,000 function points there would be hundreds of sprints, and there are no current technologies for decomposing really large applications into small sets of independent features that fit the sprint concept.
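The sprint-count problem can be shown with simple arithmetic. The figure of 50 function points per sprint is an assumption for illustration, not a measured industry value:

```python
import math

def sprint_count(size_fp: float, fp_per_sprint: float = 50.0) -> int:
    """Number of sprints needed if each sprint delivers a fixed amount
    of functionality. 50 FP per sprint is an illustrative assumption."""
    return math.ceil(size_fp / fp_per_sprint)

print(sprint_count(1_000))   # 20 sprints: workable
print(sprint_count(10_000))  # 200 sprints: the "hundreds of sprints" problem
```

The arithmetic is trivial; the hard, unsolved problem is decomposing a 10,000 function point system into hundreds of genuinely independent feature sets.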
Staffing level

In the early days of software the term staffing level meant the number of programmers it might take to build an application, with ranges from 1 to perhaps 5. In today's world of 2014, with a total of 126 occupation groups, this term has become much more complicated. Parametric estimation tools such as Software Risk Master (SRM), and project management tools such as Microsoft Project, can predict the number of people needed to build software. SRM predicts a standard set of 20 occupation groups including business analysts, architects, programmers, test personnel, quality assurance, technical writers, managers, etc. Staffing is not constant for most occupations but rises and shrinks as work is finished. Staffing levels by occupation include average numbers of personnel and peak numbers of personnel. See also the Rayleigh curve discussion earlier in this report. The staffing profile for a major system of 25,000 function points is shown below and is a standard output from the author's Software Risk Master tool:

Occupation Groups and Part-Time Specialists    Normal Staff   Peak Staff
 1  Programmers                                      94           141
 2  Testers                                          83           125
 3  Designers                                        37            61
 4  Business analysts                                37            57
 5  Technical writers                                16            23
 6  Quality assurance                                14            22
 7  1st line managers                                15            20
 8  Data base administration                         11             8
 9  Project office staff                              7            10
10  Administrative support                            8            10
11  Configuration control                             5             6
12  Project librarians                                4             5
13  2nd line managers                                 3             4
14  Estimating specialists                            3             4
15  Architects                                        2             3
16  Security specialists                              1             2
17  Performance specialists                           1             2
18  Function point counters                           1             2
19  Human factors specialists                         1             2
20  3rd line managers                                 1             1

As can be seen, software is a multi-discipline team activity with many different occupation groups and special skills.

Standish Report (CHAOS report)

The consulting company the Standish Group publishes an annual report on IT failures. This is called the CHAOS report but is also cited as the Standish report. The report is widely cited but also widely challenged. Even so, it contains interesting data and information about project failures and failure modes. Note that the Standish report is limited to IT projects and does not deal with systems or embedded software, which have lower failure rates than IT projects. Nor does it deal with government and military projects, which have higher failure rates than IT projects.

Story points

Story points are a somewhat subjective metric based on analysis of designs expressed in terms of user stories. Story points are not standardized and vary by as much as 400% from company to company. They are used primarily with agile projects and can be used to predict velocity. A Google search will bring up an extensive literature, including several papers that challenge the validity of story points.

Successful projects (definition)

The terms software failure and software success are ambiguous in the literature. The author's definition of success attempts to quantify the major issues troubling software: success means < 3.00 defects per function point; > 97% defect removal efficiency; > 97% of valid requirements implemented; < 10% requirements creep; 0 toxic requirements forced into the application by unwise clients; > 95% of requirements defects removed; development schedule achieved within plus or minus 3% of a formal plan; and costs achieved within plus or minus 3% of a formal parametric cost estimate. See also the definition of failing projects earlier in this report.
Another cut at a definition of a successful project would be one that is in the top 15% in terms of
software productivity and quality from all of the projects collected by benchmark organizations such as Namcook Analytics, Q/P Management Group, Software Productivity Research, and others.

Taxonomy of software projects

Taxonomies are the underpinning of science and extremely valuable to all sciences. Software does not yet have an agreed-to taxonomy of software application sizes and types used by all companies and for all types of software. However, as part of the author's benchmark services, we have developed a useful taxonomy that allows apples-to-apples comparisons of any and all kinds of projects. The taxonomy consists of eight primary topics:

1. Application nature (new, enhancement, COTS modification, etc.)
2. Application scope (algorithms, module, program, system, etc.)
3. Application class (internal, commercial, defense, outsource, etc.)
4. Application type (web, IT, embedded, telecom, etc.)
5. Platform complexity (single platform, multiple platforms, etc.)
6. Problem complexity (low, average, high)
7. Code complexity (low, average, high)
8. Data complexity (low, average, high)

In addition to the taxonomy itself, the author's benchmark recording method also captures data on sixteen supplemental topics that are significant to software project results. These include:

1. Development methodology (agile, RUP, TSP, waterfall, spiral, etc.)
2. Quality methodologies (inspections, static analysis, test stages, etc.)
3. Activity-based cost analysis of development steps
4. Defect potentials and defect removal efficiency (DRE)
5. Special attributes (CMMI, SEMAT, FDA or FAA certification, etc.)
6. Programming language (assembly, Java, Objective C, HTML, mixed, etc.)
7. Requirements growth during development and after release
8. Experience levels (clients, developers, testers, managers, etc.)
9. Development country or countries (U.S., India, Japan, multiple, etc.)
10. Development state or region (Florida, New York, California, etc.)
11.
Hardware platforms (smartphone, tablet, embedded, mainframe, etc.)
12. Software platforms (Windows, Linux, IBM, Apple, etc.)
13. Tool suites (for design, coding, testing, project management, etc.)
14. Volumes of reusable materials available (0% to 100%)
15. Work hours, holidays, compensation levels, and burden rates (project cost structures)
16. North American Industry Classification System (NAICS) code for all industries

The eight taxonomy factors and the sixteen supplemental factors make comparisons of projects accurate and easy for clients to understand. The taxonomy and supplemental factors are also used for pattern matching, or converting historical data into useful estimating algorithms. As it happens, applications that have the same taxonomy are also about the same in terms of schedules,
costs, productivity, and quality. That being said, there are millions of permutations of the factors used in the author's taxonomy. However, the vast majority of software applications can be encompassed by fewer than 100 discrete patterns.

Technical debt

The concept of technical debt was put forth by Ward Cunningham. It is a brilliant metaphor, but not a very good metric as currently defined. The idea of technical debt is that shortcuts or poor architecture, design, or code made to shorten development schedules will lead to downstream post-release work. This is certainly true. But the use of the term debt brings up the analogy of financial debt, and here there are problems. Financial debt is normally a two-party transaction between a borrower and a lender; technical debt is self-inflicted by one party.

A subtle issue with technical debt is that it makes a tacit assumption that shortcuts are needed to achieve early delivery. They are not. A combination of defect prevention, pre-test defect removal, and formal testing can deliver software with close to zero technical debt faster and cheaper than the same project with shortcuts, which usually skimp on quality control.

A more serious problem is that too many post-release costs are not included in technical debt. If an outsource contractor is sued for poor performance or poor quality, then litigation costs and damages should be included in technical debt. Consequential damages to users of software caused by bugs or failures should also be included. Further, about 32% of large systems are canceled without being completed. These have huge quality costs but zero technical debt. Overall, technical debt seems to encompass only about 17% of the full costs of poor quality and careless development.

Another omission with technical debt is the lack of a normalization method. Absolute technical debt, like absolute financial debt, is important, but it would also help to know technical debt per function point.
This would allow comparisons across various project sizes and also various development methods. The technical debt metric could be improved over time if there were interest in doing so. Technical debt is currently a hot topic in the software literature, so it will be interesting to see if its structure and topics change over time.

Test metrics

This is a complex topic and also somewhat ambiguous and subjective. Among the suite of common test metrics circa 2014 are: test cases created; work hours per test case; test work hours per function point; reused regression test cases; test cases per function point; test cases executed successfully; test cases executed and failing; test coverage for branches, paths, code statements, and risks; defects detected; test intervals or schedules; test iterations; and test defect removal efficiency levels.

Test coverage

The phrase test coverage is somewhat ambiguous and can be used to describe the percent of code statements executed during testing, the percent of branches or paths exercised, and the percent of
possible risks for which test cases exist. All definitions tend to be inversely related to cyclomatic and essential complexity. For highly complex applications the number of test cases needed to approach 100% coverage can approximate infinity. As of 2014 the only software for which 100% test coverage can be achieved is straight-line software with a cyclomatic complexity of 1.

Test coverage is important but surprisingly ambiguous given how poor software quality is and how important testing is. There should be published data on test coverage by size, cyclomatic complexity, essential complexity, and also by specific test stage such as unit test, function test, regression test, and so forth. Currently the literature on test coverage tends to be vague about what kind of coverage actually occurs.

Total Cost of Ownership (TCO)

This is a very important metric but one that is difficult to collect and study. The term total cost of ownership includes development and at least three years of maintenance, enhancement, and customer support. Some applications have been in continuous use for more than 25 years. To be really inclusive, TCO should also include user costs. Further, it would be helpful if TCO included cost drivers such as finding and fixing bugs, paperwork production, coding, testing, project management, etc.

The author's tool Software Risk Master (SRM) can predict and measure TCO for a minimum of three calendar or fiscal years after release. The mathematical algorithms could be extended past three years, but general business uncertainty lowers the odds of predictions more than three years out. For example, a corporate acquisition or merger could make dramatic changes, as could the sale of a business unit.
A typical pattern of TCO for a moderate size application of 2,500 function points over three years might be as follows:

   3-Year TCO     Staffing   Effort (months)   % of TCO
   Development      7.48        260.95          46.17%
   Enhancement      2.22         79.75          10.58%
   Maintenance      2.36         85.13          10.35%
   Support          0.34         12.29           0.68%
   User costs       4.20        196.69          32.12%
   Total TCO       16.60        634.81         100.00%

In order to collect TCO data it is necessary to have a continuous measurement program that collects effort data, enhancement data, support data, and defect repair data at least twice a year for all major projects.

Unpaid overtime

The majority of U.S. software personnel are termed exempt, which means they are not required to be paid overtime even if they work much more than 40 hours per week. Unpaid overtime is an
important factor for both software costs and software schedules. Unfortunately, unpaid overtime is the most common form of data that "leaks," or does not get reported via normal project tracking. If you compare benchmarks between identical projects and one of them had 10 hours per week of unpaid overtime while the other had none, the project with overtime will no doubt be cheaper and have a shorter schedule. But if the unpaid overtime is invisible and not included in project tracking data, there is no good way to validate the results of the benchmarks. Among the author's clients, unpaid overtime of about 4 hours per week is common; omitting this unpaid overtime from formal cost tracking is also common. The impact of unpaid overtime on costs and schedules is therefore significant: consider, for example, an application of 1,000 function points with compensation at $10,000 per staff month.

Use-case points

Use cases are part of the design approach featured by the unified modeling language (UML) and included in the Rational Unified Process (RUP). Use cases are fairly common among IBM customers, as are use-case points, a metric based on use cases and used for estimation. It was developed in 1993 by Gustav Karner, prior to IBM acquiring the method. This is a fairly complex metric, and a Google search is recommended to bring up definitions and additional literature.

Use-case points and function points can be used for the same software. Unfortunately, use-case points only apply to projects with use-case designs, whereas function points can be used for all software and are therefore much better for benchmarks. IBM should have published conversion rules between use-case points and function points, since both metrics were developed by IBM. In the absence of IBM data, the author's Software Risk Master (SRM) tool predicts and converts data between use-case points and IFPUG function points.
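SRM's actual conversion algorithm is not published; a crude linear sketch based on the roughly 3-to-1 ratio of function points to use-case points that the author reports might look like the following (the ratio is an approximation, since real conversions vary with use-case depth and complexity):

```python
UCP_PER_FP = 333 / 1000  # approximate ratio; real conversions vary with
                         # use-case depth and complexity

def fp_to_use_case_points(function_points: float) -> float:
    """Rough linear conversion from IFPUG function points to use-case points."""
    return function_points * UCP_PER_FP

def use_case_points_to_fp(use_case_points: float) -> float:
    """Rough inverse conversion back to IFPUG function points."""
    return use_case_points / UCP_PER_FP
```

A linear ratio is the simplest possible model; anything more faithful would need to account for the depth and complexity of the individual use cases.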
A total of 1,000 IFPUG function points is roughly equal to 333 use-case points. However, this ratio varies because use cases differ in depth and complexity. It is the responsibility of newer metrics such as use-case points to provide conversion rules to older metrics such as function points; but this responsibility is seldom acknowledged by metrics developers and did not occur for use-case points.

User costs

For internal IT projects, users provide requirements, review documents, participate in phase reviews, and may even do some actual testing. However, user costs are seldom reported. Also, user costs are normally not included in the budgets for software applications. The author's Software Risk Master (SRM) tool predicts user costs for IT projects. Total user costs range from below 50% of software development costs to more than 70% of software development costs. This topic is underreported in the software literature and needs additional research. A sample of typical user costs for a medium IT project of 2,500 function points is shown below:

   User Activities                 Staffing   Schedule   Effort       Costs     $ per FP
   User requirements team            3.85       8.06      44.77     $604,407    $241.76
   User architecture team            2.78       1.93       5.37      $75,529     $29.01
   User planning/estimating team     1.89       1.69       5.13      $69,232     $27.69
   User prototype team               3.13       4.03      12.59     $169,989     $68.00
   User design review team           5.00       3.22      16.12     $217,587     $87.03
   User change control team          3.33      16.12      53.73     $725,288    $290.12
   User governance team              2.56      10.48      26.86     $362,644    $145.06
   User document review team         6.67       2.01      13.43     $181,322     $72.53
   User acceptance test team         5.56       2.42      13.43     $181,322     $72.53
   User installation team            4.35       1.21       5.26      $70,952     $28.38
   Subtotal                          4.20       5.12     196.69   $2,755,733  $1,062.11

User costs are not always measured even though they can top 65% of development costs. They are also difficult to measure because they are not usually included in software project budgets, and are scattered among a variety of different organizations, each of which may have its own budget.

Value (intangible)

The topic of intangible value is ambiguous and varies from application to application. Some of the many forms of intangible value include: medical value; value to human life and safety; military value for improving military operations; customer satisfaction value; team morale value; and corporate prestige value. It would be theoretically possible to create a value point metric, similar to function point metrics, to provide a scale or range of intangible values.

Value (tangible)

Value comes in two flavors, tangible and intangible. Tangible software value also comes in several flavors: 1) direct revenues such as software sales; 2) indirect revenues such as training and maintenance contracts; and 3) operating cost reductions and work efficiency improvements. Tangible value can be expressed in currencies such as dollars and is included in a variety of accounting formulae such as accounting rate of return and internal rate of return.

Variable costs

As the name implies, variable costs are the opposite of fixed costs and tend to be directly proportional to the number of units produced.
An example of a variable cost would be the cost of materials for the units produced each month in a factory; an example of a fixed cost would be the monthly rent for the factory itself. For software, an important variable cost is the amount and cost of code produced for a specific requirement, which varies by language. Another variable cost would be the number and cost of bug repairs on a monthly basis. The software industry tends to blur fixed and variable costs together, and this explains the endemic errors in the lines of code metric and the cost per defect metric.
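The manufacturing law cited throughout this paper can be demonstrated in a few lines of arithmetic; the dollar figures below are hypothetical:

```python
def cost_per_unit(fixed_cost: float, variable_cost_each: float, units: int) -> float:
    """Total cost divided by units produced. The fixed-cost component is
    spread across fewer units when the unit count drops, so the cost per
    unit rises even though total cost falls."""
    if units <= 0:
        raise ValueError("units must be positive")
    return (fixed_cost + variable_cost_each * units) / units

# Hypothetical defect-repair scenario: $50,000 of fixed test preparation
# and execution cost, plus $500 of variable cost to repair each defect.
buggy_project = cost_per_unit(50_000, 500, units=500)   # $600 per defect
quality_project = cost_per_unit(50_000, 500, units=50)  # $1,500 per defect
```

The high-quality project spends far less in total ($75,000 versus $300,000) yet looks worse on cost per defect, which is exactly the distortion that arises when fixed and variable costs are blurred together.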
Velocity

The term velocity is a metric widely used by agile projects. It can be used in both forward predictive modes and historical data collection modes. Velocity can be used with tangible deliverables such as document pages and also with synthetic metrics such as story points. Since velocity is not precisely defined, readers can do a Google search to bring up the additional literature on the velocity metric.

Venn diagram

In 1880 the mathematician John Venn developed a simple graphing technique to teach set theory. Each set is represented by a circle, and the relationship between two sets is shown by the overlap between the circles. Venn diagrams are much older than software and are used by dozens of kinds of engineers and mathematicians due to the simplicity and elegance of the approach. Venn diagrams can be drawn with more than two circles, of course, but become complex and lose visual appeal with more than four circles.

Visual status and progress tracking

It is not uncommon for large software projects to utilize both project offices and war rooms. One of the purposes of both is to have current data on project status and progress, often using visual methods. The most common visual method is to use an entire wall and put up a general process flow chart that indicates completed tasks, work in progress, and future tasks. One company, Shoulders Corporation, has gone beyond wall charts and used a three-dimensional tracking system of colored balls suspended from strings. Project offices are more sedate, but usually include status tracking tools such as Microsoft Project and Computer Aid's Automated Project Office. The industry clearly needs animated 3D graphical status tracking tightly coupled with planning and estimation. Such a tracking tool should also display continuous requirements growth and continuous visualization of defect removal progress.
The tool also needs to continue beyond release and show annual growth plus customer use and customer bug reports for ten years or more.
War room

In a software context a war room is a room set aside for project planning and status tracking. Usually war rooms have tables with planning documents, and often one or more walls are covered with project flow diagrams that indicate current status. War rooms are usually found for large systems in the 10,000 function point size range.

Warranty costs

Most software projects don't have warranties. Look at the fine print on almost any box of commercial software and you will see phrases such as "no warranty expressed or implied." In the rare cases where some form of warranty is provided, it can range from replacement of a disk with a new version to actually fixing problems. There is no general rule, and each application by each company will probably have a unique warranty policy. This is professionally embarrassing for the software industry, which should offer standard warranties for all software. Some outsource contracts include warranties, but here too there are variations from contract to contract.

Work hours

There is a major difference between the nominal number of hours worked and the actual number of hours worked. In the United States the nominal work week is 40 hours, but due to lunch breaks, coffee breaks, and other non-work time the effective work week is around 33 hours, or 132 hours per month. There are major differences in work hours from country to country, and these differences are important for both software measurement and software estimation. The effective work month is about 132 hours for the U.S., 186 hours for China, 126 hours for Sweden, and so forth. These variances mean that a project requiring one calendar month in the U.S. would require only about three weeks in China but more than one month in Sweden. Variations in work hours per month do not translate one-for-one into higher or lower productivity; other topics such as experience and methodologies are also important.
Even so the results are interesting and thought-provoking.

Work hours per function point

The two most common methods for expressing productivity with function point metrics are work hours per function point and function points per staff month. The two are mathematically related, but not identical, due to variations in the number of hours worked per month. Assume a software project of 100 function points can be completed at a rate of 10 hours per function point. In the U.S., at 132 hours per month, this project would take 1,000 hours and 7.58 calendar months, for a productivity rate of 13.2 function points per staff month. In China the project would also take 1,000 hours but only 5.38 calendar months, for a productivity rate of 18.6 function points per staff month.

Zero defects

The ultimate goal of software engineering is to produce software applications with zero defects after release. Since software applications are known to be error-prone, this is a challenging goal indeed. Custom designs and hand coding of applications are both error-prone: as of 2014, average defect removal efficiency is below 90%, and defects average more than 3.0 per function point when requirements defects, design defects, code defects, and bad-fix defects are all enumerated. The best approach to achieving zero defects would be to construct applications from libraries of reusable components that are certified to near zero-defect levels. The reusable components would have to go through a variety of defect prevention, pre-test removal, and test stages. It is clear that cost per defect cannot be used for zero-defect quality costs, but function points work very well.

Zero-size software changes

One of the most difficult activities to predict or measure is changes to software that have zero function points. Two examples might be: 1) shifting an input question from the bottom of a screen to the top; 2) reversing the sequence of printing a software cost estimate, showing outputs before inputs instead of the normal way of showing inputs first. Both examples require work but have zero function points, because the features are already present in the application. One way of measuring these changes would be to backfire from the source code that is changed. Another method would be to record the hours needed for the work, apply local costs, and calculate function points from costs; i.e., if past local projects cost $1,000 per function point and a zero-size change took $1,000 to accomplish, it is probably about 1 function point in size. Zero-size changes are also called function point churn, as opposed to function point creep, which does add to function point totals.

SUMMARY AND CONCLUSIONS

Over the past 50 years the software industry has become one of the largest industries in human history, and software applications have changed every aspect of business, government, and military operations. No other industry has had such a profound effect on human communication and human knowledge transfer. But in spite of the many successes of the software industry, software applications are characterized by poor quality when released, frequent cost and schedule overruns, and many
cancelled projects. Further, software is one of the most labor-intensive industries in history, approaching cotton cultivation in the total work hours needed to deliver a product.

In order to solve the problems of software and convert a manual and expensive craft into a modern engineering profession with a high degree of manufacturing automation, the software industry needs much better metrics and measurement disciplines, and much more in the way of standard reusable components. Better measures and better metrics are the stepping stones to software engineering excellence. It is hoped that this report will both highlight measurement problems and increase the usage of effective metrics such as function points and defect removal efficiency (DRE).
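Defect removal efficiency, recommended just above, has a simple arithmetic definition: defects removed before release divided by the total of defects found before release plus those reported afterward (conventionally during the first 90 days of use). A minimal sketch with hypothetical defect counts:

```python
def defect_removal_efficiency(pre_release: int, post_release: int) -> float:
    """DRE as a percentage: defects removed before release divided by all
    defects found before release plus those reported after release."""
    total = pre_release + post_release
    if total == 0:
        return 100.0  # nothing found anywhere: treat as perfect removal
    return 100.0 * pre_release / total

# Hypothetical: 950 defects removed before release, 50 reported by users.
dre = defect_removal_efficiency(950, 50)  # 95.0%
```

Combined with function points to normalize defect volumes by application size, this single ratio summarizes the overall effectiveness of a project's defect prevention and removal activities.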
References and Readings on Software Metrics and Measurements and Topics Cited in This Paper

Boehm, Barry Dr.; Software Engineering Economics; Prentice Hall, Englewood Cliffs, NJ; 1981; 900 pages.

Brooks, Fred; The Mythical Man-Month; Addison-Wesley, Reading, MA; 1975; rev. 1995.

Beck, Kent; Test-Driven Development; Addison Wesley, Boston, MA; 2002; ISBN-10: 0321146530; 240 pages.

Black, Rex; Managing the Testing Process: Practical Tools and Techniques for Managing Hardware and Software Testing; Wiley; 2009; ISBN-10: 0470404159; 672 pages.

Bundschuh, Manfred and Dekkers, Carol; The IT Metrics Compendium; Springer; 2005.

Charette, Bob; Software Engineering Risk Analysis and Management; McGraw Hill, New York, NY; 1989.

Charette, Bob; Application Strategies for Risk Management; McGraw Hill, New York, NY; 1990.

Cohen, Lou; Quality Function Deployment: How to Make QFD Work for You; Prentice Hall, Upper Saddle River, NJ; 1995; ISBN-10: 0201633302; 368 pages.

Constantine, Larry L.; Beyond Chaos: The Expert Edge in Managing Software Development; ACM Press; 2001.

Crosby, Philip B.; Quality is Free; New American Library, Mentor Books, New York, NY; 1979; 270 pages.

DeMarco, Tom; Peopleware: Productive Projects and Teams; Dorset House, New York, NY; 1999; ISBN-10: 0932633439; 245 pages.

Ebert, Christof; Dumke, Reiner; and Bundschuh, Manfred; Best Practices in Software Measurement; Springer; 2004.

Everett, Gerald D. and McLeod, Raymond; Software Testing; John Wiley & Sons, Hoboken, NJ; 2007; ISBN 978-0-471-79371-7; 261 pages.

Gack, Gary; Managing the Black Hole: The Executive's Guide to Software Project Risk; Business Expert Publishing, Thomson, GA; 2010; ISBN-10: 1-935602-01-9.

Gack, Gary; Applying Six Sigma to Software Implementation Projects; http://software.isixsigma.com/library/content/c040915b.asp.
Garmus, David and Herron, David; Function Point Analysis: Measurement Practices for Successful Software Projects; Addison Wesley Longman, Boston, MA; 2001; ISBN 0-201-69944-3; 363 pages.

Garmus, David; Russac, Janet; and Edwards, Royce; Certified Function Point Counters Examination Guide; CRC Press; 2010.

Gilb, Tom and Graham, Dorothy; Software Inspections; Addison Wesley, Reading, MA; 1993; ISBN-10: 0201631814.

Hallowell, David L.; Six Sigma Software Metrics, Part 1; http://software.isixsigma.com/library/content/03910a.asp.

Harris, Michael; Herron, David; and Iwanicki, Stasia; The Business Value of IT; CRC Press; 2008.

IFPUG (52 authors); The IFPUG Guide to IT and Software Measurement; Auerbach Publishers; 2012.

International Organization for Standardization; ISO 9000 / ISO 14000; http://www.iso.org/iso/en/iso9000-14000/index.html.

Jacobson, Ivar; Ng, Pan-Wei; McMahon, Paul; Spence, Ian; and Lidman, Svante; The Essence of Software Engineering: Applying the SEMAT Kernel; Addison Wesley; 2013.

Jones, Capers; "A Short History of Lines of Code Metrics"; Namcook Analytics LLC, Narragansett, RI; 2014.

Jones, Capers; "A Short History of the Cost per Defect Metric"; Namcook Analytics LLC, Narragansett, RI; 2014.

Jones, Capers; The Technical and Social History of Software Engineering; Addison Wesley Longman, Boston, MA; 2014.

Jones, Capers and Bonsignour, Olivier; The Economics of Software Quality; Addison Wesley, Boston, MA; 2011; ISBN 978-0-13-258220-9; 587 pages.

Jones, Capers; Software Engineering Best Practices; McGraw Hill, New York; 2010; ISBN 978-0-07-162161-8; 660 pages.

Jones, Capers; "Measuring Programming Quality and Productivity"; IBM Systems Journal; Vol. 17, No. 1; 1978; pp. 39-63.

Jones, Capers; Programming Productivity - Issues for the Eighties; IEEE Computer Society Press, Los Alamitos, CA; First edition 1981; Second edition 1986; ISBN 0-8186-0681-9; IEEE Computer Society Catalog 681; 489 pages.
Jones, Capers; "A Ten-Year Retrospective of the ITT Programming Technology Center"; Software Productivity Research, Burlington, MA; 1988.

Jones, Capers; Applied Software Measurement; McGraw Hill; 3rd edition 2008; ISBN 978-0-07-150244-3; 662 pages.

Jones, Capers; Critical Problems in Software Measurement; Information Systems Management Group; 1993; ISBN 1-56909-000-9; 195 pages.

Jones, Capers; Software Productivity and Quality Today -- The Worldwide Perspective; Information Systems Management Group; 1993; ISBN 1-56909-001-7; 200 pages.

Jones, Capers; Assessment and Control of Software Risks; Prentice Hall; 1994; ISBN 0-13-741406-4; 711 pages.

Jones, Capers; New Directions in Software Management; Information Systems Management Group; ISBN 1-56909-009-2; 150 pages.

Jones, Capers; Patterns of Software System Failure and Success; International Thomson Computer Press, Boston, MA; December 1995; ISBN 1-850-32804-8; 292 pages.

Jones, Capers; Software Quality Analysis and Guidelines for Success; International Thomson Computer Press, Boston, MA; 1997; ISBN 1-85032-876-6; 492 pages.

Jones, Capers; Estimating Software Costs; 2nd edition; McGraw Hill, New York; 2007; 700 pages.

Jones, Capers; "The Economics of Object-Oriented Software"; Namcook Analytics, Narragansett, RI; 2014.

Jones, Capers; "Software Project Management Practices: Failure Versus Success"; Crosstalk; October 2004.

Jones, Capers; "Software Estimating Methods for Large Projects"; Crosstalk; April 2005.

Kan, Stephen H.; Metrics and Models in Software Quality Engineering; 2nd edition; Addison Wesley Longman, Boston, MA; 2003; ISBN 0-201-72915-6; 528 pages.

Land, Susan K.; Smith, Douglas B.; and Walz, John Z.; Practical Support for Lean Six Sigma Software Process Definition: Using IEEE Software Engineering Standards; Wiley-Blackwell; 2008; ISBN-10: 0470170808; 312 pages.
Nandyal, Raghav; Making Sense of Software Quality Assurance; Tata McGraw Hill Publishing, New Delhi, India; 2007; ISBN 0-07-063378-9; 350 pages.
Popp, Karl; Advances in Software Economics; Books on Demand; 2011.

Radice, Ronald A.; High Quality Low Cost Software Inspections; Paradoxicon Publishing, Andover, MA; 2002; ISBN 0-9645913-1-6; 479 pages.

Royce, Walker E.; Software Project Management: A Unified Framework; Addison Wesley Longman, Reading, MA; 1998; ISBN 0-201-30958-0.

Strassmann, Paul; The Business Value of Computers: An Executive's Guide; International Thomson Computer Press; 1994.

Wiegers, Karl E.; Peer Reviews in Software: A Practical Guide; Addison Wesley Longman, Boston, MA; 2002; ISBN 0-201-73485-0; 232 pages.