
VALUE-BASED, DEPENDENCY-AWARE INSPECTION AND TEST PRIORITIZATION

by

Qi Li

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2012

Copyright 2012 Qi Li

Dedication

To my parents

Acknowledgements

My Ph.D. dissertation could not have been completed without the support of many hearts and minds. I am deeply indebted to my Ph.D. advisor, Dr. Barry Boehm, for his great and generous support of all my Ph.D. research. I am deeply honored to be one of his students and to receive direct and close advice from him all the time. My sincere thanks are also extended to the other committee members, Dr. Stan Settles, Dr. Nenad Medvidovic, Dr. Richard Selby, Dr. William Halfond, and Dr. Sunita Chulani, for their invaluable guidance on focusing my research and their efforts in reviewing drafts of my dissertation. Special thanks to my ISCAS advisors, Professor Mingshu Li, Professor Qing Wang, and Professor Ye Yang. They led me into the academic world, have continuously encouraged and supported my research, and have promoted the in-depth collaborative research in our joint lab of USC-CSSE & ISCAS. This research effort also owes its realization to the tremendous support from Dr. Jo Ann Lane and Dr. Ricardo Valerdi. In addition, this research could not have been conducted without support from the University of Southern California Center for Systems and Software Engineering courses, corporate, and academic affiliates. Special thanks to Galorath Incorporated and NFS-China for giving me the chance to apply this research to real industrial projects, to the students of the USC-CSSE graduate-level software engineering courses CSCI577ab for their collaborative effort on the value-based inspection and testing experiments, and to all my USC and ISCAS colleagues and friends: life could not have been more colorful without you. Lastly, from the bottom of my heart, I would like to thank my family for their unconditional love and support during my study.

Table of Contents

Dedication
Acknowledgements
Chapter 1: Introduction
  Motivation
  Research Contributions
  Organization of Dissertation
Chapter 2: A Survey of Related Work
  Value-Based Software Engineering
  Software Review Techniques
  Software Testing Techniques
  Software Test Case Prioritization Techniques
  Defect Removal Techniques Comparison
Chapter 3: Framework of Value-Based, Dependency-Aware Inspection and Test Prioritization
  Value-Based Prioritization
    Prioritization Drivers
      Stakeholder Prioritization
      Business/Mission Value
      Defect Criticality
      Defect Proneness
      Testing or Inspection Cost
      Time-to-Market
    Value-Based Prioritization Strategy
  Dependency-Aware Prioritization
    Loose Dependencies
    Tight Dependencies
  The Process of Value-Based, Dependency-Aware Inspection and Testing
  Key Performance Evaluation Measures
    Value and Business Importance
    Risk Reduction Leverage
    Average Percentage of Business Importance Earned (APBIE)
  Hypotheses and Methods to Test
Chapter 4: Case Study I: Prioritize Artifacts to be Reviewed
  Background
  Case Study Design
  Results
Chapter 5: Case Study II: Prioritize Testing Scenarios to be Applied
  Background
  Case Study Design
    Maximize Testing Coverage
    The Step to Determine Business Value
    The Step to Determine Risk Probability
    The Step to Determine Cost
    The Step to Determine Testing Priority
  Results
  Lessons Learned
Chapter 6: Case Study III: Prioritize Software Features to be Functionally Tested
  Background
  Case Study Design
    The Step to Determine Business Value
    The Step to Determine Risk Probability
    The Step to Determine Testing Cost
    The Step to Determine Testing Priority
  Results
Chapter 7: Case Study IV: Prioritize Test Cases to be Executed
  Background
  Case Study Design
    The Step to do Dependency Analysis
    The Step to Determine Business Importance
    The Step to Determine Criticality
    The Step to Determine Failure Probability
    The Step to Determine Test Cost
    The Step for Value-Based Test Case Prioritization
  Results
    One Example Project Results
    All Team Results
    A Tool for Facilitating Test Case Prioritization
    Statistical Results for All Teams via this Tool
  Lessons Learned
Chapter 8: Threats to Validity
Chapter 9: Next Steps
Chapter 10: Conclusions
Bibliography

List of Tables

Table 1. Comparison Results of Value-based Group A and Value-neutral Group B
Table 2. Test Suite and List of Faults Exposed
Table 3. Business Importance Distribution (Two Situations)
Table 4. Comparison for TCP techniques
Table 5. An Example of Quantifying Dependency Ratings
Table 6. Case Studies Overview
Table 7. V&V assignments for Fall 2009/
Table 8. Acronyms
Table 9. Documents and sections to be reviewed
Table 10. Value-neutral Formal V&V process
Table 11. Value-based V&V process
Table 12. An example of value-based artifact prioritization
Table 13. An example of Top 10 Issues
Table 14. Issue Severity & Priority rate mapping
Table 15. Resolution options in Bugzilla
Table 16. Review effectiveness measures
Table 17. Number of Concerns
Table 18. Number of Concerns per reviewing hour
Table 19. Review Effort
Table 20. Review Effectiveness of total Concerns
Table 21. Average of Impact per Concern
Table 22. Cost Effectiveness of Concerns
Table 23. Data Summaries based on all Metrics
Table 24. Statistics Comparative Results between Years
Table 25. Macro-feature coverage
Table 26. FU Ratings
Table 27. Product Importance Ratings
Table 28. RP Ratings
Table 29. Installation Type
Table 30. Average Time for Testing Macro
Table 31. Testing Cost Ratings
Table 32. Testing Priorities for 10 Local Installation Working Environments
Table 33. Testing Priorities for 3 Server Installation Working Environments
Table 34. Value-based Scenario Testing Order and Metrics
Table 35. Testing Results
Table 36. Testing Results (continued)
Table 37. APBIE Comparison
Table 38. Relative Business Importance Calculation
Table 39. Risk Factors Weights Calculation (AHP)
Table 40. Quality Risk Probability Calculation (Before System Testing)
Table 41. Correlation among Initial Risk Factors
Table 42. Relative Testing Cost Estimation
Table 43. Correlation between Business Importance and Testing Cost
Table 44. Value Priority Calculation
Table 45. Guideline for rating BI for test cases
Table 46. Guideline for rating Criticality for test cases
Table 47. Self-check questions used for rating Failure Probability
Table 48. Mapping Test Case BI & Criticality to Defect Severity & Priority
Table 49. Relations between Reported Defects and Test Cases
Table 50. APBIE Comparison (all teams)
Table 51. Delivered Value Comparison when Cost is fixed (all teams)
Table 52. Cost Comparison when Delivered Value is fixed (all teams)
Table 53. APBIE Comparison (11 teams)
Table 54. Delivered Value Comparison when Cost is fixed (11 teams)
Table 55. Cost Comparison when Delivered Value is fixed (11 teams)

List of Figures

Figure 1. Pareto Curves
Figure 2. Value Flow vs. Software Development Lifecycle
Figure 3. The 4+1 Theory of VBSE: overall structure
Figure 4. Software Testing Process-Oriented Expansion of VBSE 4+1 Theory and Key Practices
Figure 5. Value-based Review (VBR) Process
Figure 6. Coverage-based Test Case Prioritization
Figure 7. Comparison under Situation 1
Figure 8. Comparison under Situation 2
Figure 9. Overview of Value-based Software Testing Prioritization Strategy
Figure 10. An Example of Loose Dependencies
Figure 11. An Example of Tight Dependencies
Figure 12. Benefits Chain for Value-based Testing Process Implementation
Figure 13. Software Testing Process-Oriented Expansion of 4+1 VBSE Framework
Figure 14. ICSM framework tailored for CSCI577
Figure 15. Scenarios to be tested
Figure 16. Comparison among 3 Situations
Figure 17. Business Importance Distribution
Figure 18. Testing Cost Estimation Distribution
Figure 19. Comparison between Value-Based and Inverse order
Figure 20. Initial Estimating Testing Cost and Actual Testing Cost Comparison
Figure 21. BI, Cost and ROI between Testing Rounds
Figure 22. Accumulated BI Earned During Testing Rounds
Figure 23. BI Loss (Pressure Rate=1%)
Figure 24. BI Loss (Pressure Rate=4%)
Figure 25. BI Loss (Pressure Rate=16%)
Figure 26. Value Functions for Business Importance and Testing Cost
Figure 27. Dependency Graph with Risk Analysis
Figure 28. Typical production function for software product features
Figure 29. Test Case BI Distribution of Team01 Project
Figure 30. Failure Probability Distribution of Team01 Project
Figure 31. In-Process Value-Based TCP Algorithm
Figure 32. PBIE curve according to Value-Based TCP (APBIE=81.9%)
Figure 33. PBIE Comparison without risk analysis between Value-Based and Value-Neutral TCP (APBIE_value_based=52%, APBIE_value_neutral=46%)
Figure 34. An Example of Customized Test Case in TestLink
Figure 35. A Tool for facilitating Value-based Test Case Prioritization in TestLink
Figure 36. APBIE Comparison
Figure 37. Delivered-Value Comparison when Cost is fixed
Figure 38. Cost Comparison when Delivered Value is fixed

Abbreviations

ICSM Phases:
ICSM: Incremental Commitment Spiral Model
VC: Valuation Commitment
FC: Foundation Commitment
DC: Development Commitment
TRR: Transition Readiness Review
RDC: Rebaselined Development Commitment
IOC: Initial Operational Capability
TS: Transition & Support

Artifacts developed and reviewed for USC CSCI577:
OCD: Operational Concept Description
SSRD: System and Software Requirements Description
SSAD: System and Software Architecture Description
LCP: Life Cycle Plan
FED: Feasibility Evidence Description
SID: Supporting Information Document
QMP: Quality Management Plan
IP: Iteration Plan
IAR: Iteration Assessment Report
TP: Transition Plan
TPC: Test Plan and Cases
TPR: Test Procedures and Results
UM: User Manual
SP: Support Plan
TM: Training Materials

Value-Based, Dependency-Aware inspection and test prioritization related:
RRL: Risk Reduction Level
ROI: Return On Investment
BI: Business Importance
ABI: Accumulated Business Importance
PBIE: Percentage of Business Importance Earned
APBIE: Average Percentage of Business Importance Earned
AC: Accumulated Cost
FU: Frequency of Use
RP: Risk Probability
TC: Testing Cost
TP: Test Priority
PI: Product Importance

Others:
FV&V: Formal Verification & Validation
VbV&V: Value-based Verification & Validation
Eval: Evaluation
ARB: Architecture Review Board

Abstract

As two of the most popular defect removal activities, inspection and testing are among the most labor-intensive activities in the software development life cycle and consume between 30% and 50% of total development costs according to many studies. However, most current defect removal strategies treat all instances of software artifacts as equally important in a value-neutral way; this becomes more risky for high-value software under limited funding and competitive pressures. In order to save software inspection and testing effort, to further improve affordability and timeliness while achieving acceptable software quality, this research introduces a value-based, dependency-aware inspection and test prioritization strategy for improving the lifecycle cost-effectiveness of software defect removal options. This allows various defect removal types, activities, and artifacts to be ranked by how well they reduce risk exposure. Combining this with their relative costs enables them to be prioritized in terms of Return On Investment (ROI) or Risk Reduction Leverage (RRL). Furthermore, this strategy enables organizations to deal with two types of common dependencies among the items to be prioritized. This strategy will help project managers determine "how much software inspection/testing is enough?" under time and budget constraints. In addition, a new metric, Average Percentage of Business Importance Earned (APBIE), is proposed to measure how quickly testing can reduce the quality uncertainty and earn the relative business importance of the System Under Test (SUT). This Value-Based, Dependency-Aware inspection and testing strategy has been empirically studied and successfully applied in a series of case studies at different prioritization granularity levels: (1) prioritizing artifacts to be reviewed in 21 graduate-level, real-client software engineering course projects; (2) prioritizing testing scenarios to be applied in an industrial project at the acceptance testing phase at Galorath, Inc.; (3) prioritizing software features to be functionally tested in an industrial project at the China-NFS company; (4) prioritizing test cases to be executed in 18 course projects. All the comparative statistical analyses from the four case studies show positive results from applying the Value-Based, Dependency-Aware strategy.

Chapter 1: Introduction

1.1. Motivation

Traditional verification & validation and testing methodologies, such as path, branch, instruction, mutation, scenario, or requirements testing, usually treat all aspects of software as equally important [Boehm and Basili, 2001], [Boehm, 2003]. This treats testing as a purely technical issue, leaving the close relationship between testing and business decisions unlinked and the potential value contribution of testing unexploited [Ramler et al., 2005]. However, commercial experience is often that 80% of the business value is covered by 20% of the tests or defects, and that prioritizing by value produces significant payoffs [Bullock, 2000], [Gerrard and Thompson, 2002], [Persson and Yilmazturk, 2004]. Also, current Earned Value systems fundamentally track project progress against the plan, and cannot track changes in the business value of the system being developed. Furthermore, system value-domain problems are the chief sources of software project failures, such as unrealistic expectations, unclear objectives, unrealistic time frames, lack of user input, incomplete requirements, or changing requirements [Johnson, 2006]. All of these, plus the increasing criticality of software within systems, make value-neutral software engineering methods increasingly risky.

Boehm and Basili's "Software Defect Reduction Top 10 List" [Boehm and Basili, 2001] shows that:
- Finding and fixing a software problem after delivery is often 100 times more expensive than finding and fixing it during the requirements and design phase.
- Current software projects spend about 40 to 50 percent of their effort on avoidable rework.
- About 80 percent of avoidable rework comes from 20 percent of the defects.
- About 80 percent of the defects come from 20 percent of the modules, and about half the modules are defect free.
- About 90 percent of the downtime comes from, at most, 10 percent of the defects.
- Peer reviews catch 60 percent of the defects.
- Perspective-based reviews catch 35 percent more defects than non-directed reviews.
- Disciplined personal practices can reduce defect introduction rates by up to 75 percent [Boehm and Basili, 2001].

Figure 1. Pareto Curves [Bullock, 2000]

The upper Pareto curve in Figure 1 comes from an experience report [Bullock, 2000] for which 20% of the features provide 80% of the business value. It shows that, among the 15 customer types, the first one alone accounts for nearly 50% of the billing revenues, and that 80% of the test cases generate only 20% of the business value. So focusing the effort on the high-payoff test cases will generate the highest ROI. The linear curve is representative of most automated test generation tools. They are equally likely to test the high- and low-value types, so in general they show a linear payoff. Value-neutral methods can do even worse than this. For example, many projects focus on reducing the number of

outstanding problem reports as quickly as possible, leading to first fixing the easiest problems, such as typos or grammar mistakes. This generates a value curve much worse than the linear one.

From the perspective of VBSE, the full range of the software development lifecycle (SDLC) is a value flow that begins with value objective assessment and capture through value-based requirements acquisition, business case analysis, and early design and architecting; followed by value implementation through detailed architecting and development; and value realization through testing to ensure the value objectives are satisfied before the system is transitioned and delivered to customers, by means of value-prioritized test cases being executed and passed, as shown in Figure 2. Monitoring and controlling the actual value earned by the project's results in terms of multiple value objectives enables organizations to pro-actively monitor and control not only fast-breaking risks to project success in delivering expected value, but also fast-breaking opportunities to switch to even higher-value emerging capabilities, and to avoid "highly efficient waste" of an organization's scarce resources.

Figure 2. Value Flow vs. Software Development Lifecycle

Each of the system's value objectives corresponds to at least one test item, e.g., an operational scenario, a software feature, or a test case, that is used to measure whether this value objective is achieved in order to earn the relevant value. The whole testing process can thus be seen as a "value earned" process: executing and successfully passing one test case earns one piece of value, and so on. In the Value-Based Software Engineering community, value is not limited to purely financial terms, but is extended to "relative worth, utility or importance" to help address software engineering decisions [Boehm, 2003]. Business Importance, in terms of Return On Investment (ROI), is often used to measure the relative value of functions, components, features, or even systems for business-domain software systems. So the testing process in this business-domain context can accordingly be defined as a "Business Importance Earned" process. To measure how quickly a testing strategy can earn the business importance, especially under time and budget constraints, a new metric, Average Percentage of Business Importance Earned (APBIE), is proposed and will be introduced in detail in Chapter 3.

1.2. Research Contributions

The research is intended to provide the following contributions:
- Investigation and analysis of current software inspection and testing processes;
- Propose a real Earned Value system to track the business value of testing and measure testing efficiency in terms of Average Percentage of Business Importance Earned (APBIE);

- Propose a systematic strategy for Value-Based, Dependency-Aware inspection and testing processes;
- Apply this strategy to a series of empirical studies with different granularities of prioritization;
- Elaborate decision criteria for testing/inspection priorities per project context, which are helpful and insightful for real industry practices;
- Implement an automated tool for facilitating Value-Based, Dependency-Aware prioritization.

1.3. Organization of Dissertation

The organization of this dissertation is as follows: Chapter 2 presents a survey of related work on Value-Based Software Engineering, software inspection techniques, software testing process strategies, software test case prioritization techniques, and defect removal techniques. Chapter 3 introduces the methodology of the Value-Based, Dependency-Aware inspection and testing prioritization strategy and process, proposes key performance evaluation measures and research hypotheses, and describes the methods to test the hypotheses. Chapters 4-7 introduce the detailed steps and practices used to apply the Value-Based, Dependency-Aware prioritization strategy to four typical inspection and testing case studies. For each case study, the project background, case study design, and implementation steps are introduced, comparative analysis is conducted, and both qualitative and quantitative results and lessons learned are summarized:

- Chapter 4 introduces the prioritization of artifacts to be reviewed on USC-CSSE graduate-level, real-client course projects for their formal inspection;
- Chapter 5 conducts the prioritization of operational scenarios to be applied at Galorath, Inc. for its performance testing;
- Chapter 6 illustrates the prioritization of features to be tested at a Chinese software company for its functionality testing;
- Chapter 7 presents the prioritization of test cases to be executed on USC-CSSE graduate-level course projects at the acceptance testing phase.

Chapter 8 explains some threats to validity. Chapters 9 and 10 propose future research work and conclude the contributions of this research dissertation.

Chapter 2: A Survey of Related Work

2.1. Value-Based Software Engineering

Value-Based Software Engineering (VBSE) is a discipline that addresses and integrates economic aspects and value considerations into the full range of existing and emerging software engineering principles and practices, processes, activities and tasks, technology, management, and tools decisions in the software development context [Boehm, 2003]. The engine in the center of its 4+1 structure is the Success-Critical Stakeholder (SCS) Win-Win Theory W [Boehm, 1988], [Boehm et al., 2007], which addresses what values are important and how success is assured for a given software engineering organization. The four supporting theories that it draws upon are utility theory, decision theory, dependency theory, and control theory, respectively dealing with how important the values are, how stakeholders' values determine decisions, how dependencies affect value realization, and how to adapt to change and control value realization. VBSE key practices include: benefits realization analysis; stakeholder Win-Win negotiation; business case analysis; continuous risk and opportunity management; concurrent system and software engineering; value-based monitoring and control; and change as opportunity. This process has been integrated with the spiral model of system and software development and evolution [Boehm et al., 2007] and its next-generation system and software engineering successor, the Incremental Commitment Spiral Model [Boehm and Lane, 2007].

Figure 3. The 4+1 Theory of VBSE: overall structure [Boehm and Jain, 2005]

The Value-Based Software Engineering theory is the fundamental theory underlying the proposed Value-Based Inspection and Test Prioritization strategy. Our strategy is the VBSE theory's application to the software testing and inspection process. Our strategy's mapping to VBSE's 4+1 theory and key practices is shown in Figure 4.

Figure 4. Software Testing Process-Oriented Expansion of VBSE 4+1 Theory and Key Practices

2.2. Software Review Techniques

To date, many focused review or reading methods and techniques have been proposed, practiced, and proved to be superior to unfocused reviews. The most common one in practice is checklist-based reviewing (CBR) [Fagan, 1976]; others include perspective-based reviewing (PBR) [Basili et al., 1996], [Li et al., 2008], defect-based reading (DBR) [Porter et al., 1995], functionality-based reading (FBR) [Abdelrabi et al., 2004], and usage-based reading (UBR) [Conradi and Wang, 2003], [Thelin et al., 2003]. However, most of them are value-neutral (except UBR) and focused on one single aspect: e.g., DBR focuses on defect classification to find defects in artifacts, and a scenario is a key factor in DBR; UBR focuses on prioritizing use cases in order of importance from a user perspective; FBR is proposed to trace framework requirements to produce well-constructed frameworks and review the code.

An initial value-based set of peer review guidelines [Lee and Boehm, 2005] has a process that consists of the following: first, a win-win negotiation among stakeholders defines the priority of each system capability; second, based on the checklists for each artifact, domain experts determine the criticality of issues; third, the system capabilities with high priorities are reviewed first; and finally, at each priority level, the high-criticality sources of risk are reviewed first, as shown in Figure 5. The experiment used Group A, 15 IV&V personnel using VBR procedures and checklists, and Group B, 13 IV&V personnel using previous value-neutral checklists. The initial experiment found a factor-of-2 improvement in value added per hour of peer review time, as shown in Table 1.

Figure 5. Value-based Review (VBR) Process [Lee and Boehm, 2005]

Table 1. Comparison Results of Value-based Group A and Value-neutral Group B [Lee and Boehm, 2005] (for each measure, the table reports a p-value and the percentage by which Group A is higher, by number and by impact, for: average number of concerns, average impact of concerns, average number of problems, average impact of problems, average concerns per hour, average cost effectiveness of concerns, average problems per hour, and average cost effectiveness of problems)

As a new contribution to value-based V&V process development, the Value-Based, Dependency-Aware prioritization strategy was then customized to develop a systematic, multi-criteria process to quantitatively determine the priorities of artifacts to be reviewed. This process adds Quality Risk Probability, Cost, and

Dependency considerations into the prioritization, and it has been successfully applied to USC-CSSE graduate-level, real-client course projects with statistically significant improvements in review cost effectiveness, as will be introduced in Chapter 4.

2.3. Software Testing Techniques

Rudolf Ramler outlines a framework for value-based test management [Ramler et al., 2005]; it is a synthesis of the currently most relevant processes and a high-level guideline without detailed implementation specifications or empirical validation. Ståle Amland introduces a risk-based testing approach [Amland, 1999]. It states that resources should be focused on those areas representing the highest risk exposure. However, this method doesn't consider the testing cost, which is also an essential factor in the testing process. Boehm and Huang propose a quantitative risk analysis [Boehm et al., 2004] that helps determine when to stop testing software and release the product under different organizational contexts and different desired quality levels. However, it is a macroscopic empirical data analysis without detailed process guidance. Other relevant work includes usage-based testing and statistical-based testing [Cobb and Mills, 1990], [Hao and Mendes, 2006], [Kouchakdjian and Fietkiewicz, 2000], [Musa, 1992], [Walton et al., 1995], [Whittaker and Thomason, 1994], [Williams and Paradkar, 1999]. A usage model characterizes the operational use of a software system; random test cases are then generated from the usage model, statistical testing of the software is performed, any observed failures are recorded, and the test results are analyzed using a reliability model to provide a basis for statistical inference about the reliability of the software during operational use. Statistical testing based on a software usage model ensures that the failures that will

occur most frequently in operational use will be found early in the testing cycle. However, it doesn't differentiate failures' impacts or the business importance of operational usages.

2.4. Software Test Case Prioritization Techniques

Most current test case prioritization (TCP) techniques [Elbaum et al., 2000], [Elbaum et al., 2002], [Elbaum et al., 2004], [Rothermel et al., 1999], [Rothermel et al., 2001] are coverage-based and aim to improve a test suite's rate of fault detection, a measure of how quickly faults are detected within the testing process, in order to get earlier feedback on the System Under Test (SUT). The metric Average Percentage of Faults Detected (APFD) is used to measure how quickly the faults are identified for a given test suite. These TCP techniques are all based on coverage of statements or branches in the programs, assuming that all statements or branches are equally important, all faults have equal severity, and all test cases have equal costs. An example of coverage-based test case prioritization is shown in Figure 6.

Figure 6. Coverage-based Test Case Prioritization [Rothermel et al., 1999]

S. Elbaum proposed a new cost-cognizant metric, APFDc, for assessing the rate of fault detection of prioritized test cases that incorporates varying test case and fault costs [Elbaum et al., 2001], [Malishevsky et al., 2006]; it rewards test case orders proportionally to their rate of unit-of-fault-severity-detected-per-unit-test-cost.
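For reference, the standard APFD formula used in this line of work is APFD = 1 - (TF_1 + ... + TF_m) / (n * m) + 1 / (2n), where n is the number of test cases in the order, m is the number of faults, and TF_i is the position of the first test case that exposes fault i. The Python sketch below only illustrates the mechanics; the fault matrix and the helper name apfd are hypothetical, not the data of Table 2.

# Compute APFD for a prioritized test order (hypothetical fault matrix).
def apfd(order, detects):
    n = len(order)
    faults = sorted({f for fs in detects.values() for f in fs})
    m = len(faults)
    # TF_i: 1-based position of the first test in the order that exposes fault i
    tf = [next(pos + 1 for pos, t in enumerate(order) if f in detects[t])
          for f in faults]
    return 1 - sum(tf) / (n * m) + 1 / (2 * n)

detects = {"A": {1, 2}, "B": {2, 3}, "C": {1, 3, 4}, "D": {4, 5}}  # hypothetical
print(apfd(["C", "D", "A", "B"], detects))   # 0.725 for this order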

By incorporating context and lifetime factors, improved cost-benefit models have been provided for use in assessing regression testing methodologies and the effects of time constraints on the costs and benefits of prioritization techniques [Do and Rothermel, 2006], [Do et al., 2008], [Do and Rothermel, 2008]. However, these did not incorporate the failure probability into the prioritization. H. Srikanth presented a requirement-based, system-level test case prioritization called the Prioritization of Requirements for Test (PORT), based on requirements volatility, customer priority, implementation complexity, and fault proneness of the requirements, to improve the rate of detection of severe faults, measured by the Average Severity of Faults Detected (ASFD); however, she didn't consider the cost of testing in the prioritization. More recently, there has been a group of related work on fault-proneness test prioritization based on failure prediction. The most representative one is CRANE [Czerwonka et al., 2011], a failure prediction, change risk analysis, and test prioritization system at Microsoft Corporation that leverages existing research [Bird et al., 2009], [Eaddy et al., 2008], [Nagappan et al., 2006], [Pinzger et al., 2008], [Srivastava and Thiagarajan, 2002], [Zimmermann and Nagappan, 2008] for the development and maintenance of Windows Vista. It prioritizes the selected tests by the ratio of changed blocks covered per unit of test cost [Czerwonka et al., 2011]. Their test prioritization is mainly based on program change analysis in order to estimate the more fault-prone parts; however, program change is only one factor that influences the failure probability, and other factors, e.g., personnel qualification and module complexity, should influence the prediction of failure probability as well. Besides, it didn't consider the business value from customers or the different importance levels of modules and defects.

Some other fault/failure prediction work on identifying the fault-prone components in a system [58-60] is also relevant to our work. Other related work on test case prioritization can be found in some recent systematic reviews [Roongruangsuwan and Daengdej, 2010], [Yoo and Harman, 2011], [Zhang et al., 2009]. In our research, a new metric, Average Percentage of Business Importance Earned (APBIE), is proposed to measure how quickly the SUT's value is realized for a given test suite, or how quickly the business importance can be earned by testing, under the VBSE environment. The definition of APBIE will be introduced in detail in Chapter 3.

Comparison among TCP techniques

Most of the current test case prioritization techniques [Elbaum et al., 2000, 2002, 2004], [Malishevsky et al., 2006], [Do and Rothermel, 2006], [Do and Rothermel, 2008], [Do et al., 2008], [Rothermel et al., 1999], [Rothermel et al., 2001], [Srikanth et al., 2005] are under the prerequisite that which test cases will expose which faults is known, and they aim to improve the rate of fault detection. In order to predict defect proneness to support more practical test case prioritization, current research in this field tends to develop various defect prediction techniques that serve as the basis for test prioritization [Bird et al., 2009], [Czerwonka et al., 2011], [Eaddy et al., 2008], [Emam et al., 2001], [Nagappan et al., 2006], [Ostrand et al., 2005, 2007], [Pinzger et al., 2008], [Srivastava and Thiagarajan, 2002], [Zimmermann and Nagappan, 2008]. In order to call for more attention to value considerations in current test case prioritization techniques, we used a simple example, shown in Table 2, from Rothermel's paper [Rothermel et al., 1999] (which could also be representative of other

similar coverage-based TCP techniques) and constructed two situations, displayed in Table 3, for this example. Although these two situations are constructed, they are representative of most real situations.

Table 2. Test Suite and List of Faults Exposed [Rothermel et al., 1999] (test suites A-E and the seeded faults each exposes: A exposes 2 faults, B exposes 4, C exposes 7, D exposes 1, and E exposes 3)

Rothermel's test case prioritization technique is under the prerequisite that which test cases will expose which faults is known. Based on Rothermel's method, the testing order should be C-E-B-A-D; however, this prioritization doesn't differentiate the business importance of each test suite. Let's make some assumptions to show what this prioritization can result in if the business importance of each test suite is known. Let's assume that each test suite's business importance is independent of the faults seeded as shown in Table 2. The business importance comes from the customer's value perspective on the relevant features that those test suites represent.

Table 3. Business Importance Distribution (Two Situations)

              Situation 1 (Best Case)        Situation 2 (Worst Case)
Test Suite    BI        Accumulated BI       BI        Accumulated BI
C             50%       50%                  5%        5%
E             20%       70%                  10%       15%
B             15%       85%                  15%       30%
A             10%       95%                  20%       50%
D             5%        100%                 50%       100%
APBIE                   80%                            40%

Situation 1: If it is lucky enough (the possibility should be very low in reality) that the business importance percentage distribution of the five test suites is as shown for Situation 1 in Table 3, then C-E-B-A-D is also the testing order if we apply value-based TCP. So the PBIE curves for our method and Rothermel's overlap, as shown in Figure 7. This testing order is optimal for both the rate of business importance earned and the rate of faults detected.

Figure 7. Comparison under Situation 1

Situation 2: If the business importance percentage distribution of the five test suites is as shown for Situation 2 in Table 3, then C-E-B-A-D is Rothermel's TCP order with APBIE = 40%, whereas our value-based method's TCP order is D-A-B-E-C with APBIE = 80%, as shown in Figure 8. So our method improves the testing efficiency by a factor of 2 in terms of APBIE in this situation when compared with Rothermel's method.

Figure 8. Comparison under Situation 2

The comparison shows that it is possible, though extremely unlikely, for Rothermel's testing order to coincide with the value-based order, and most of the time its APBIE is lower than that of our value-based TCP technique. This is because the two techniques optimize different goals: our method aims to improve APBIE, while his method aims to improve the rate of fault detection.

In addition, a comprehensive comparison among the state-of-the-art TCP techniques is shown in Table 4. The prioritization algorithm is the same: all use the greedy algorithm or its variants to first pick the best candidate, making the locally optimal choice at each step in order to approach the global optimum. However, the selection goals are different: for Rothermel's method, the goal is to pick the test case that can expose the most faults, while for our method, the goal is to pick the one that represents the highest testing value. Rothermel's test case prioritization aims to improve the rate of fault detection, measured by the Average Percentage of Faults Detected (APFD), but our method aims to improve the rate of business importance earned, measured by the Average Percentage of Business Importance Earned (APBIE).

Table 4. Comparison for TCP techniques
(columns: Rothermel et al., 1999 | Elbaum et al., 2001 | Srikanth et al., 2005 | Czerwonka et al., 2011 | Our method)
Prioritization algorithm: Greedy | Greedy | Greedy | NA | Greedy
Basis: Coverage-based | Cost-cognizant | Requirement-based | Defect-proneness based | Value-based
Goal: Maximize the rate of faults detected | Maximize the rate of unit-of-fault-severity-detected-per-unit-test-cost | Maximize the rate of severity of faults detected | Maximize the chances of finding defects in the changed code | Maximize the rate of business importance earned
Measure: APFD (Average Percentage of Faults Detected) | APFDc (APFD incorporating testing cost) | ASFD (Average Severity of Faults Detected) | FRP (Fix Regression Proneness) | APBIE (Average Percentage of Business Importance Earned)
Assumption?: which test cases will expose which faults is known, and those faults are seeded deliberately | No | No | No | No
Practical?: Infrequently, because of the assumption above | Yes | Yes | Yes | Yes
Risk Size? (business importance + defect impact): No | Partial: considers defect severity | Partial: considers customer-assigned priority | No | Yes
Risk Probability?: No | No | Partial: considers requirement change, complexity, fault proneness | Partial: mainly considers code change impact via version control systems | Yes
Cost?: No | Yes | No | No | Yes
Dependency?: No | No | No | No | Yes
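To make the APBIE figures in Table 3 concrete, the following Python sketch recomputes them from the business importance distributions given there (all five test suites are selected, so T' = T and IBIE = 0); the helper name apbie is ours.

# Recompute the APBIE values of Table 3 for the two business importance
# distributions and the two testing orders discussed above.
def apbie(order, bi):
    tbi = sum(bi.values())
    earned, pbie = 0.0, []
    for item in order:
        earned += bi[item]            # accumulated business importance
        pbie.append(earned / tbi)     # PBIE after this test suite passes
    return sum(pbie) / len(pbie)

situation1 = {"C": 50, "E": 20, "B": 15, "A": 10, "D": 5}    # best case
situation2 = {"C": 5, "E": 10, "B": 15, "A": 20, "D": 50}    # worst case

print(apbie("CEBAD", situation1))   # 0.80 -> 80%: both orders coincide
print(apbie("CEBAD", situation2))   # 0.40 -> 40%: coverage-based order
print(apbie("DABEC", situation2))   # 0.80 -> 80%: value-based order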

As an additional case of the application of the Value-Based, Dependency-Aware strategy, we recently experimented with a more systematic value-based test case prioritization of a set of test cases to be executed for acceptance and regression testing on the USC-CSSE graduate-level, real-client course projects, with improved testing efficiency and effectiveness; this will be introduced in Chapter 7. Our prioritization is more systematic because we synthetically consider the business importance from the customers' perspective, the failure probability, the execution cost, and the dependencies among test cases in the prioritization.

2.5. Defect Removal Techniques Comparison

The efficiencies of review and testing are compared in the Constructive QUALity Model (COQUALMO) [Boehm et al., 2000]. To determine the Defect Removal Fractions (DRFs) associated with each of the six levels (i.e., Very Low, Low, Nominal, High, Very High, Extra High) of the three profiles (i.e., automated analysis, people reviews, execution testing and tools) for each of the three types of defect artifacts (i.e., requirements defects, design defects, and code defects), a two-round Delphi was conducted. This study found that people reviews are the most efficient at removing requirements and design defects, and testing is the most efficient at removing code defects. Madachy and Boehm extended their previous work on COQUALMO and assessed software quality processes with the Orthogonal Defect Classification COnstructive QUALity MOdel (ODC COQUALMO), which predicts defects introduced and removed, classified by ODC types [Chillarege et al., 1992], [Madachy and Boehm, 2008]. A comprehensive Delphi survey was used to capture more detailed efficiencies of the techniques (automated

analysis, execution testing and tools, and peer reviews) against ODC defect categories as an extension of the previous work [Boehm et al., 2000]. In [Jones, 2008], Capers Jones lists the Defect Removal Efficiency of 16 combinations of 4 defect removal methods: design inspections, code inspections, quality assurance, and testing. These results show, on the one hand, that no single defect removal method is adequate, and on the other hand, they imply that removal efficiency from best to worst would be design inspections, code inspections, testing, and quality assurance. However, all of the above defect removal technique comparison work is based on Delphi surveys and still lacks quantitative data evidence from industry. Based on experience from the manufacturing area that has been brought to the software domain, and on software reliability models that predict future failure behavior, S. Wagner presents a model for the quality economics of defect-detection techniques [Wagner and Seifert, 2005]. This model is proposed to estimate the effects of a combination of techniques and to remove such influences when evaluating a single technique. However, it is a theoretical model without validation against real industry data. More recently, Frank Elberzhager presented an integrated two-stage inspection and testing process at the code level [Elberzhager et al., 2011]. In particular, defect results from an inspection are used in a two-stage manner: first, to prioritize parts of the system that are defect-prone, and then to prioritize defect types that appear often. However, the combined prioritization mainly uses defects detected from inspection to estimate failure probability in order to prioritize testing activities, without considering defect removal technique efficiency comparisons by defect type among inspection, testing, or other defect removal techniques.

We plan to collect real industry project data to compare defect removal techniques' efficiency based on RRL in order to further calibrate ODC COQUALMO, and then to select or combine defect removal techniques by defect type to optimize scarce inspection and testing resources; this will be discussed in Chapter 9 as our next-step work.

Chapter 3: Framework of Value-Based, Dependency-Aware Inspection and Test Prioritization

This chapter introduces the methodology of the Value-Based, Dependency-Aware inspection and testing prioritization strategy and process, proposes key performance evaluation measures and research hypotheses, and describes the methods to test those hypotheses.

3.1. Value-Based Prioritization

The systematic and comprehensive value-based, risk-driven inspection and testing prioritization strategy, proposed to improve their cost-effectiveness, is shown in Figure 9.

Figure 9. Overview of Value-based Software Testing Prioritization Strategy

It illustrates the value-based inspection and testing prioritization methodology, composed of four main consecutive parts: prioritization drivers, which deals with what the project success-critical factors are and how they influence software inspection and testing; prioritization strategy, which deals with how to make optimal trade-offs among those drivers; prioritization case studies, which deals with how to apply the value-based prioritization strategy in practice, especially in industry contexts (this part will be introduced in detail in Chapters 4 through 7); and prioritization evaluation, which deals with how to track the business value of inspection and testing and measure their cost-effectiveness. These four questions, one from each part, will be answered and explained below.

Prioritization Drivers

Most current testing prioritization strategies focus on optimizing one single goal; e.g., coverage-based testing prioritization aims to maximize the testing coverage per unit of testing time, risk-driven testing aims to detect the most fault-prone parts at the earliest time, etc. In addition, little research work incorporates the business or mission value into the prioritization. In order to build a systematic and comprehensive prioritization mechanism, the prioritization should take all project success-critical factors into consideration, i.e., business or mission value, testing cost, defect criticality, and defect-proneness probability; for some business-critical projects, time to market should also be added to the prioritization. The value-based prioritization drivers should include the following:

Stakeholder Prioritization

The first step of value-based inspection and testing is to identify Success-Critical Stakeholders (SCSs) and understand the roles they play during the inspection and testing

process and their respective win conditions. The direct stakeholders of testing are the testing team, especially the testing manager, and the developers and project managers who directly interact with the testing team. In the spirit of value-based software engineering, important parties for testing are key customers, as the source of value objectives, which set the context and scope of testing. Marketing and product managers assist in testing by planning releases, pricing, promotion, and distribution. We will look at the following factors that must be considered when prioritizing the testing order of new features; they represent the SCSs' win conditions:

Business/Mission Value

Business or mission value is captured by business case analysis with the prioritization of success-critical stakeholder value propositions. The Business Importance of having the features gives information as to what extent mutually agreed requirements are satisfied and to what extent the software meets key customers' value propositions. CRACK (Collaborative, Representative, Authorized, Committed and Knowledgeable) [Boehm and Turner, 2003] customer representatives are the source of the features' relative business importance. Only if their most valuable propositions or requirements have been understood clearly, developed correctly, tested thoroughly, and delivered on time can the project be seen as a successful one. Under this situation, CRACK customer representatives are most likely to be collaborative and knowledgeable enough to provide the relative business importance information.

Defect Criticality

Defect criticality is captured by measuring the impact of the absence of an expected feature, the failure to achieve a performance requirement, or the failure of a test case. Combining

with the business or mission value, it serves as the other factor in determining the Size of Loss, as shown in Figure 9.

Defect Proneness

Defect-proneness is captured by expert estimation based on historical data or past experience, design or implementation complexity, qualification of the responsible personnel, code change impact analysis, etc. The quality of the software product is another success-critical factor that needs to be considered in the testing process. The focus of quality risk analysis is on identifying and eliminating risks that are potential value breakers and inhibit value achievement. The quality risk information can help the testing manager with risk management, progress estimation, and quality management. Testing managers are interested in the identification of problems, particularly the problem trends, which helps to estimate and control the testing process. Risk identification and analysis will also provide the development manager with some potential process improvement opportunities to mitigate project risks in the future. So both the testing manager and the development team are willing to collaborate with each other on the quality risk analysis.

Testing or Inspection Cost

Testing or inspection cost is captured by expert estimation based on historical data or past experience, or by some state-of-the-art testing cost estimation techniques or tools. Testing cost is considered an investment in software development and should also be seriously considered during the testing process. This becomes more crucial as time-critical deliverables are required, e.g., when time-to-market greatly influences the market share. If most of the testing effort is put into testing features, test cases, or

scenarios with relatively low business importance, the product will lose more market share, decreasing the customer's profits and even turning them negative in the worst case. Testing managers are interested in making the testing process more efficient by putting more effort into the features with higher business importance.

Time-to-Market

Time-to-market can greatly influence the effort distribution of software development and project planning. Because the testing phase is the phase immediately preceding software product transition and delivery, it is influenced even more by market pressure [Yang et al., 2008]. Sometimes, in an intense market competition situation, sacrificing some software quality to avoid more market share erosion might be a good organizational strategy. Huang and Boehm [Huang and Boehm, 2006] propose a value-based software quality model that helps to answer the question "How much testing is enough?" in three types of organizational contexts: early start-up, commercial, and high finance. For example, an early start-up will have a much higher risk impact due to market share erosion than the other two. Thus, a better strategy for an early start-up is to deliver a lower-quality product rather than invest in quality beyond the threshold of negative returns due to market share erosion. Marketing and product managers help to provide the market information and assist in testing by planning releases, pricing, promotion, and distribution.

Value-Based Prioritization Strategy

The value-based inspection and testing prioritization strategy synthetically considers the business importance from the client's value perspective, combined with the criticality of failure occurrence, as a measure of the size of the loss at risk. For each test item

(e.g., artifact, feature, scenario, or test case), the probability of loss is the probability that the given test item would catch the defect, estimated from an experience base that indicates defect-prone components or performers. Since Size(Loss) * Probability(Loss) = Risk Exposure, this enables the test items to be ranked by how well they reduce risk exposure. Combining their risk exposures with their relative testing costs enables the test items to be prioritized in terms of Return On Investment (ROI) or Risk Reduction Leverage (RRL), where the quantity of Risk Reduction Leverage (RRL) is defined as follows [Selby, 2007]:

RRL = (RE_before - RE_after) / Risk Reduction Cost

where RE_before is the RE before initiating the risk reduction effort and RE_after is the RE afterwards. Thus, RRL serves as the engine for the testing prioritization and is a measure of the relative cost-benefit ratio of performing various candidate risk reduction activities, e.g., testing in this case study.

3.2. Dependency-Aware Prioritization

In our case studies, two types of dependencies are dealt with: Loose Dependencies and Tight Dependencies. Their definitions, typical examples, and our solutions to them are introduced below.

Loose Dependencies

A Loose Dependency is defined as follows: it would be acceptable to continue a task without awareness of its dependencies, but it would be better with awareness. The typical case is the dependencies among artifacts to be reviewed in the inspection process.

For example, Figure 10 illustrates the dependencies among four artifacts to be reviewed for CSCI577ab course projects: the System and Software Requirements Description (SSRD), the System and Software Architecture Description (SSAD), the Acceptance Test Plan and Cases (ATPC), and the Supporting Information Document (SID). Although they are course artifacts, they also represent typical requirements, design, test, and other supporting documents in real industrial projects. As shown in Figure 10, SSRD is the requirements document and usually can be reviewed directly. In order to review the use cases and UML diagrams in SSAD, or the test cases in ATPC, it is better to review the requirements in SSRD first, at least to check whether those use cases, UML diagrams, or test cases cover all the requirements in SSRD; so SSAD and ATPC depend on SSRD, as the arrows in Figure 10 illustrate. SID maintains the traceability matrices among the requirements in SSRD, the use cases in SSAD, and the test cases in ATPC, so it is better to have all the requirements, use cases, and test cases in hand when reviewing the traceability; so SID depends on all three of the other artifacts. However, nothing blocks a reviewer from going ahead to review SSAD or ATPC without reviewing SSRD, or from reviewing SID without referring to all the other artifacts. So we call this type of dependency a loose dependency.

Figure 10. An Example of Loose Dependencies

Basically, the more artifacts a document depends on, the higher its Dependency rating is, and the lower its reviewing priority will be, which can be represented as:

Dependency Rating ∝ number of artifacts the document depends on, and Review Priority ∝ 1 / Dependency Rating

In order to quantify the loose dependency and add it to the review priority calculation, Table 5 displays a simple example. The number of artifacts each document depends on is counted, the qualitative ratings Low, Moderate, and High are mapped to it, and the numeric values (1, 2, 3) are factored into calculating the priority. Other numeric values, e.g., (1, 5, 10) or (1, 2, 4), can also be used if necessary. The case study in Chapter 4 will introduce in more detail how to incorporate this type of loose dependency into the Value-Based prioritization.

Table 5. An Example of Quantifying Dependency Ratings

Artifact      # of artifacts it depends on   Dependency Rating   Numeric Value
SSRD          0                              Low                 1
SSAD, ATPC    1                              Moderate            2
SID           3                              High                3

Tight Dependencies

A Tight Dependency is defined as follows: the successor task has to wait until all its precursor tasks finish, and the failure of a precursor will block the successor. The typical case is the dependencies among the test cases to be executed during the testing process.

Figure 11. An Example of Tight Dependencies

Figure 11 illustrates a simple dependency tree among 7 test cases (T1-T7); each node represents a test case, and the numeric value in each node represents the RRL of the test case. If T1 fails to pass, it will block all the other test cases that depend on it, e.g., T3, T4, T5, T6, and T7; we call this type of dependency a Tight Dependency. A prioritization algorithm is proposed to deal with this type of dependency; it is a variant of the greedy algorithm: it first selects the test case with the highest RRL and checks whether it depends on other test cases; if it has dependencies, it recursively selects, within its dependency set, the one with the highest RRL, until it reaches one with no unresolved dependencies. The detailed algorithm and prioritization logic will be introduced in Chapter 7. For the 7 test cases in Figure 11, according to the algorithm, T2, T5, and T6 have the highest RRL, with a value of 9. However, T6 depends on T3 and T1, and T5 depends on T1, while T2 has no dependencies and can be executed directly. So T2 is the first test case to be executed. Since both T5 and T6 depend on T1, T1 is tested next in order to unlock those high-payoff test cases T5 and T6. After T1 is passed, T5, with the highest RRL, is unblocked and ready for testing. Recursively running the algorithm results in the order T2 -> T1 -> T5 -> T3 -> T6 -> T4 -> T7. More test case prioritizations for real projects will be introduced and illustrated in Chapter 7.
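The Python sketch below illustrates this greedy, dependency-aware selection on the Figure 11 example. Only the facts stated above are taken from the text (T2, T5, and T6 have RRL 9; T6 depends on T3 and T1; T5 depends on T1; T2 is independent); the remaining RRL values and edges are assumptions chosen to be consistent with the narrative, and ties are broken in favor of test cases whose precursors have already passed.

# Dependency-aware greedy prioritization for tight dependencies.
# RRL values for T1, T3, T4, T7 and some edges are assumed for illustration.
rrl = {"T1": 3, "T2": 9, "T3": 5, "T4": 4, "T5": 9, "T6": 9, "T7": 2}
deps = {"T1": [], "T2": [], "T3": ["T1"], "T4": ["T1"],
        "T5": ["T1"], "T6": ["T1", "T3"], "T7": ["T4"]}

def prioritize(rrl, deps):
    executed, order = set(), []
    while len(order) < len(rrl):
        remaining = [t for t in rrl if t not in executed]
        # Highest RRL first; among ties, prefer an unblocked test case.
        target = max(remaining,
                     key=lambda t: (rrl[t], all(d in executed for d in deps[t])))
        # If blocked, recursively pick the highest-RRL unexecuted precursor.
        while any(d not in executed for d in deps[target]):
            target = max((d for d in deps[target] if d not in executed),
                         key=lambda d: rrl[d])
        executed.add(target)
        order.append(target)
    return order

print(prioritize(rrl, deps))   # ['T2', 'T1', 'T5', 'T3', 'T6', 'T4', 'T7']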

3.3. The Process of Value-Based, Dependency-Aware Inspection and Testing

Figure 12 displays the benefits chain for value-based testing process implementation, including all these SCSs' roles and their win conditions, if we consider software testing as an investment during the whole software life cycle.

Figure 12. Benefits Chain for Value-based Testing Process Implementation

Figure 13 illustrates the whole process of this value-based software testing method. This method helps the test manager consider all the win conditions from the SCSs, enact the testing plan, and adjust it during testing execution. The main steps are as follows:

Figure 13. Software Testing Process-Oriented Expansion of 4+1 VBSE Framework

Step 1: Define Utility Functions of Business Importance, Quality Risk Probability, and Cost. After identifying the SCSs and their win conditions, the next step is to understand and create the single utility function for each win condition and how they influence the SCSs' value propositions. With the assistance of the key CRACK customer, the testing manager uses a method first proposed by Karl Wiegers [Wiegers, 1999] to get the relative Business Importance of each feature. The development manager and the test manager, accompanied by some experienced developers, calculate the quality risk probability of each feature. The test manager and the development team estimate the testing cost for each feature. This step brings the stakeholders together to consolidate their value models and to negotiate testing objectives. This step is in line with the Dependency and Utility Theories in VBSE, which help to identify all of the SCSs and understand how the SCSs want to win.
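The text refers to Karl Wiegers' relative prioritization scheme for eliciting relative Business Importance in Step 1. The sketch below is one common reading of that scheme, in which relative benefit and penalty ratings combine into a value percentage that is divided by weighted cost and risk percentages; all feature names, ratings, and weights here are hypothetical, not values from the case studies.

# One reading of Wiegers-style relative prioritization (hypothetical data).
# value = 2*benefit + penalty; priority = value% / (cost% + 0.5 * risk%).
features = {                       # (benefit, penalty, cost, risk), each rated 1-9
    "report export":  (8, 6, 3, 2),
    "single sign-on": (5, 4, 6, 7),
    "audit trail":    (3, 7, 4, 3),
}
total_value = sum(2 * b + p for b, p, _, _ in features.values())
total_cost = sum(c for _, _, c, _ in features.values())
total_risk = sum(r for _, _, _, r in features.values())

priority = {}
for name, (b, p, c, r) in features.items():
    value_pct = 100 * (2 * b + p) / total_value
    cost_pct = 100 * c / total_cost
    risk_pct = 100 * r / total_risk
    priority[name] = value_pct / (cost_pct + 0.5 * risk_pct)

# Higher priority suggests more relative business importance per unit cost and risk.
for name in sorted(priority, key=priority.get, reverse=True):
    print(name, round(priority[name], 2))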

Step 2: Testing Prioritization Decision for the Testing Plan. Business importance, quality risk, and testing cost are then put together to calculate a value priority number in terms of RRL for each item to be prioritized, e.g., artifact, scenario, feature, or test case (a sketch of this calculation is given after Step 4 below). This is like a multi-objective decision and negotiation process, which follows the Decision Theory in VBSE. The features' value priorities help the test manager enact the testing plan, and resources should be focused on those areas representing the most important business value, the lowest testing cost, and the highest quality risk.

Step 3: Control the Testing Process according to Feedback. During the testing process, each item's value priority in terms of RRL is adjusted according to the feedback from quality risk indicators and updated testing cost estimates. This step assists in controlling progress toward SCS win-win realization, in accordance with the Control Theory of VBSE.

Step 4: Determine How Much Testing is Enough under Different Market Patterns. One of the strengths of the 4+1 VBSE Dependency Theory is to uncover factors that are external to the system but can impact the project's outcome. It serves to align the stakeholder values with the organizational context. Market factors influence organizations to different extents in different organizational contexts. A comparative analysis is done in Chapter 6 for different market patterns, and the result shows that the value-based software testing method is especially effective when the market pressure is very high.
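A minimal sketch of the Step 2 calculation follows. It assumes the size of loss is the product of relative business importance and defect criticality, and that a tested-and-fixed item drops its risk exposure to zero (the same condition used for the review cost-effectiveness variant in Section 3.4); the item names and ratings are hypothetical.

# Value priority in terms of RRL for items to be prioritized (hypothetical ratings).
# Assumed instantiation: Size(Loss) = BI * criticality, RE_after = 0, so
# RRL = (BI * criticality * failure probability) / testing cost.
items = {                 # (BI, criticality, failure probability, testing cost)
    "feature F1": (9, 3, 0.6, 2.0),
    "feature F2": (5, 2, 0.3, 1.0),
    "feature F3": (3, 1, 0.8, 3.0),
}

def rrl(bi, criticality, prob, cost):
    re_before = bi * criticality * prob   # risk exposure before testing
    re_after = 0.0                        # assumed removed once the item passes
    return (re_before - re_after) / cost

plan = sorted(items, key=lambda name: rrl(*items[name]), reverse=True)
for name in plan:
    print(name, round(rrl(*items[name]), 2))   # testing order by descending RRL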

3.4. Key Performance Evaluation Measures

Value and Business Importance

Some of the dictionary definitions of value (Webster 2002) are in purely financial terms, such as "the monetary worth of something: marketable price." However, the value-based software engineering community adopts the broader dictionary definition of value as "relative worth, utility or importance" to help address software engineering decisions. In our research, we usually use relative Business Importance to capture the client's business value.

Risk Reduction Leverage

The quantity of Risk Exposure (RE) is defined by

RE = Prob(Loss) x Size(Loss)

where Size(Loss) is the risk impact, i.e. the size of the loss if the outcome is unsatisfactory, and Prob(Loss) is the probability of an unsatisfactory outcome. The quantity of Risk Reduction Leverage (RRL) is defined as

RRL = (RE_before - RE_after) / Risk Reduction Cost

where RE_before is the RE before initiating the risk reduction effort and RE_after is the RE afterwards. Thus, RRL is a measure of the relative cost-benefit ratio of performing various candidate risk reduction or defect removal activities. RRL serves as the engine of the prioritization strategy across its different applications, improving the cost-effectiveness of defect removal activities; how its quantities are acquired can differ per application, project context and scenario.
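A small numeric illustration of these two definitions; all numbers are invented for illustration and do not come from the case studies.

```python
# Risk Exposure and Risk Reduction Leverage exactly as defined above.
# All numbers are invented for illustration.

def risk_exposure(prob_loss, size_loss):
    return prob_loss * size_loss

def rrl(re_before, re_after, risk_reduction_cost):
    return (re_before - re_after) / risk_reduction_cost

re_before = risk_exposure(0.6, 100.0)      # exposure before testing a risky item
re_after  = risk_exposure(0.1, 100.0)      # residual exposure afterwards
print(rrl(re_before, re_after, risk_reduction_cost=5.0))   # (60 - 10) / 5 = 10.0
```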

For example, to quantify the effectiveness of a review, the Review Cost Effectiveness defined below is a variant of RRL under the condition that the defects detected are 100% resolved and removed, which drops Prob(Loss) from 100% to 0%:

Review Cost Effectiveness = (total Impact of the defects detected) / Review Effort

Average Percentage of Business Importance Earned (APBIE)

This metric is defined to measure how quickly the SUT's value is realized by testing. Let T be the whole test case suite for the SUT, containing m test items; let T' be a selected and prioritized subset of the suite, containing the n test items that will be executed; and let i denote the i-th test item in the test order T'. Obviously T' is a subset of T, and n <= m. The Total Business Importance (TBI) of T is

TBI = sum_{i=1..m} BI_i

After the business importance of all m test items has been rated, TBI is a constant. The Initial Business Importance Earned (IBIE) is the sum of the business importance of the test items in the set T - T'; it is 0 when T' = T. The Percentage of Business Importance Earned (PBIE_i) when the i-th test item in the test order T' is passed is

PBIE_i = (IBIE + sum_{j=1..i} BI_j) / TBI
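These definitions can be sketched in code as follows (APBIE, defined next, is simply the average of the PBIE values over the executed order); the business-importance numbers are illustrative.

```python
# TBI, IBIE, PBIE and APBIE as defined above. bi_all maps every item of the full
# suite T to its business importance; order is the executed, prioritized subset T'.
# The numbers are illustrative only.

def pbie_curve(bi_all, order):
    tbi  = sum(bi_all.values())                                   # Total BI of T
    ibie = sum(v for k, v in bi_all.items() if k not in order)    # BI of T - T'
    earned, curve = ibie, []
    for item in order:              # PBIE_i once the i-th executed item passes
        earned += bi_all[item]
        curve.append(earned / tbi)
    return curve

def apbie(bi_all, order):
    curve = pbie_curve(bi_all, order)
    return sum(curve) / len(curve)  # average over the n executed items

bi_all = {"s1": 40, "s2": 25, "s3": 20, "s4": 10, "s5": 5}   # s5 is left out of T'
order  = ["s1", "s2", "s3", "s4"]
print(pbie_curve(bi_all, order))    # [0.45, 0.7, 0.9, 1.0]
print(apbie(bi_all, order))         # 0.7625
```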

The Average Percentage of Business Importance Earned (APBIE) is defined as

APBIE = (sum_{i=1..n} PBIE_i) / n

APBIE is used to measure how quickly the SUT's value is realized: the higher it is, the more efficient the testing is. It serves as another important metric for measuring the cost-effectiveness of testing.

Hypotheses, Methods to Test

A series of hypotheses are defined to be tested. For the value-based review process for prioritizing artifacts, the core hypothesis is:

H-r1: the review cost effectiveness of concerns/problems on the same artifact package does not differ between the value-based group (2010 and 2011 teams) and the value-neutral one (2009 teams).

Other auxiliary hypotheses include:

H-r2: the number of concerns/problems reviewers found does not differ between groups;

H-r3: the Impact of concerns/problems reviewers found does not differ between groups; etc.

Basically, concern/problem data based on the defined metrics are collected from the tailored Bugzilla system and consolidated. Their means and standard deviations are then compared, and the T-test and F-test are used to test whether those hypotheses can be accepted or rejected. For value-based scenario/feature/test case prioritization, the core hypothesis is:

H-t1: the value-based prioritization does not increase APBIE;

Other auxiliary hypotheses include:

H-t2: the value-based prioritization does not lead high-impact defects to be detected earlier in the acceptance testing phase;

H-t3: the value-based prioritization does not increase Delivered Value when Cost is fixed, nor save Cost when Delivered Value is fixed, under time constraints.

To test H-t1 and H-t3, we compare the value-based testing case studies with value-neutral ones. Their means and standard deviations are compared, and the T-test and F-test are used to test whether those hypotheses can be accepted or rejected. To test H-t2, we observe the issues reported in the Bugzilla system to check whether issues with high priority and high severity are reported at the early stage of the acceptance phase. Besides, applying the strategy beyond USC real-client course projects to other real industry projects can further test these hypotheses. Furthermore, qualitative methods, such as surveys or interviews, are also used in our case studies to complement the quantitative results.

The Value-Based, Dependency-Aware prioritization strategy has been empirically studied and applied to defect removal activities at different prioritization granularity levels, as summarized in Table 6:

prioritization of artifacts to be reviewed on USC-CSSE graduate-level real-client course projects for formal inspection;

prioritization of operational scenarios to be applied at Galorath, Inc. for performance testing;

prioritization of features to be tested at a Chinese software company for functionality testing;

prioritization of test cases to be executed on USC-CSSE graduate-level course projects in the acceptance testing phase.

Table 6. Case Studies Overview
Case Study I (USC course projects): Inspection; items to be prioritized: artifacts to be reviewed; granularity: high-level; business value driver: impacts to project; risk probability: rating; testing cost: rating; dependency: yes.
Case Study II (Galorath, Inc.): Performance testing; items to be prioritized: operational scenarios to be applied; granularity: high-level; business value driver: frequency of use; risk probability: rating; testing cost: rating; dependency: no.
Case Study III (ISCAS project): Functionality testing; items to be prioritized: features to be tested; granularity: medium-level; business value driver: benefit + penalty; risk probability: rating; testing cost: rating; dependency: no.
Case Study IV (USC course projects): Acceptance testing; items to be prioritized: test cases to be executed; granularity: low-level; business value driver: feature BI + testing aspect; risk probability: rating; testing cost: assumed equal; dependency: yes.

These four typical case studies cover the most commonly used defect removal activities in the software development life cycle. Although the prioritization strategies for all of them are triggered by RRL, the ways of obtaining the priorities and dependencies of the items to be prioritized differ per defect removal activity type and project context.

56 For example, the business case analysis can be implemented with various methods, considering their ease of use and adaption under experiments environment. For example, in the case study of value-based testing scenario prioritization in Chapter 5, we use frequency of use (FU) combined with product importance as a variant of business importance for operational scenarios; in the case study of value-based feature prioritization for software testing in Chapter 6, Karl Wiegers requirement prioritization approach [Wiegers, 1999] is adopted, which considers both the positive benefit of the presence of a feature and the negative impact of its absence. In the case study of valuebased test case prioritization in Chapter 7, classic S-curve production function with segments of investment, high-payoff, and diminishing returns [Boehm, 1981] are used to train students for their project features business case analysis with the Kano model [Kano] as a reference to complement their analysis for feature business importance ratings. Test cases business importance is then determined by its corresponding functions/components/features importance, and whether testing the core function of this feature or not. As for the case study of determining the priority of artifacts (system capabilities) in Chapter 4, the business importance is tailored to ratings of their influences/impacts to the project s success. The similarity for these different business case analyses is that all using well-defined, context-based relative business importance ratings. These four case studies have practical meanings in real industry and practitioners can have 3 learner outcomes for each case study as below: What are the value-based inspection and testing prioritization drivers and their tradeoffs? 39

What are the detailed practices and steps for the value-based inspection/testing process under different project contexts?

How can the business value of testing be tracked, and testing efficiency measured, using the proposed real-earned-value system, with real industrial evidence?

58 Chapter 4: Case Study I-Prioritize Artifacts to be Reviewed 4.1. Background This case study for prioritizing artifacts to be reviewed was implemented in the real-client projects verification and validation activities at USC graduate-level software engineering course. The increasing growth of software artifact package motivates us to prioritize the artifacts be reviewed with the goal to improve the review cost-effectiveness. At USC, best practices from software engineering industries are introduced to students through a 2-semester graduate software engineering course (Csci577a, b) with real-client projects. From Fall 2008, the Incremental Commitment Spiral Model (ICSM) [Boehm and Lane, 2007], a value-based, risk-driven software life cycle process model was introduced and tailored as a guideline [ICSM-Sw] for this course as shown in Figure 14. It teaches and trains students skills such as understanding and negotiating stakeholder needs, priorities and shared visions; rapid prototyping; evaluating COTS, services options; business and feasibility evidence analysis; and concurrent plans, requirements and solutions development. 41

59 Figure 14. ICSM framework tailored for csci577 [ICSM-Sw] In this course, students work in teams and are required to understand and apply the Incremental Commitment Spiral Model for software engineering to real-world projects. In CSCI 577b, student teams develop Initial Operational Capability (IOC) products based on the best results from CSCI 577a. As the guideline for this course, ICSM covers the full system development life cycle based on Exploration, Valuation, Foundations, Development, and Operations phases as shown in Figure 14. The key to synchronizing and stabilizing all of the concurrent product and process definition activities is a set of risk-driven anchor point milestones: the Exploration Commitment Review (ECR), Valuation Commitment Review (VCR), Foundation Commitment Review (FCR), Development Commitment Review (DCR), Rebaselined Development Commitment Review (RDCR), Core Capability Drivethrough (CCD), Transition Readiness Review (TRR), and Operation Commitment Review (OCR). At these milestones, the business, technical, and operational feasibility of the growing package of specifications and plans is evaluated by independent experts. For the course, clients, 42

professors and teaching assistants perform Architecture Review Board (ARB) activities to evaluate the package of specifications and plans. Most off-campus students come from the IT industry with rich experience. They often take on the roles of Quality Focal Point and Integrated Independent Verification and Validation (IIV&V), reviewing sets of artifacts to find any issues related to completeness, consistency, feasibility, ambiguity, conformance, and risk, in order to minimize the issues found at the ARB. A series of package review assignments is given to them consecutively, after the development teams submit their packages, throughout the whole semester. The instructions for each assignment, together with the artifact templates in the ICSM Electronic Process Guide (EPG) [ICSM-Sw], provide the entry and exit criteria for each package review. Table 7 summarizes the content of the V&V reviews as performed in Fall 2009, 2010 and 2011, and Table 8 gives the definitions of the ICSM phases and all other acronyms used in this case study.

61 Table 7.V&V assignments for Fall2009/2010 V&Ver Assignment Review Package 2009 V&V Method 2010/2011 V&V Method Learn to Use Bugzilla System for Your Project Team Eval of VC Package Eval of Initial Prototype Eval of Core FC Package Eval of Draft FC Package OCD,FED, LCP FV&V FV&V PRO FV&V FV&V OCD,PRO,SSRD**,SSAD,LCP,FED, SID FV&V VbV&V OCD,PRO,SSRD**,SSAD,LCP,FED, SID FV&V VbV&V Eval of FC/DC Package OCD,PRO,SSRD**,SSAD,LCP,FED, SID, QMP, ATPC^, IP^ FV&V VbV&V Eval of Draft DC/TRR Package OCD,PRO,SSRD**,SSAD,LCP,FED, SID, QMP, ATPC^, IP^, TP^ VbV&V VbV&V Eval of DC/TRR Package OCD,PRO,SSRD**,SSAD,LCP,FED, SID, QMP, ATPC, IP, TP, IAR^,UM^,TM^,TPR^ VbV&V VbV&V **: not required by NDI/NCS team; ^: only required by one-semester team; Table 8. Acronyms ICSM phases: VC: Valuation Commitment, FC: Foundation Commitment, DC: Development Commitment, TRR: Transition Readiness Review, RDC: Rebaselined Development Commitment, IOC: Initial Operational Capability, TS: Transition & Support Artifacts developed and reviewed for this course: OCD: Operational Concept Description, SSRD: System and Software Requirements Description, SSAD: System and Software Architecture Description, LCP: Life Cycle Plan, FED: Feasibility Evidence Description, SID: Supporting Information Document, QMP :Quality Management Plan, IP: Iteration Plan, IAR: Iteration Assessment Report, TP: Transition Plan, TPC: Test Plan and Cases, TPR: Test Procedures and Result, UM: User Manual, SP: Support Plan, TM: Training Materials Others: FV&V: Formal Verification & Validation, VbV&V: Value-based Verification & Validation, Eval: Evaluation, ARB: Architecture Review Board 44

4.2. Case Study Design

The comparative analysis is conducted between the 2010 and 2011 teams, which adopted the value-based prioritization strategy, and the 2009 teams, which adopted a value-neutral method without prioritizing before reviewing. All three years' teams reviewed the same content of three artifact packages, as shown in Table 9.

Table 9. Documents and sections to be reviewed Doc/Sec CoreFCP DraftFCP FC/DCP 1&2 sem 1&2 sem 2 sem 1 sem OCD 100% 100% 100% 100% FED AA(Section 1,5) NDI(Section1,3,4.1,4.2.1,4.2.2) Section 1-5 Section % LCP Section 1, % 100% 100% SSRD AA(100%) NDI(N/A) AA(100%) NDI(N/A) AA(100%) NDI(N/A) AA(100%) NDI(N/A) SSAD Section 1, Section 1, 2 Section 1, 2 100% PRO Most critical/important use cases 100% 100% 100% SID 100% 100% 100% 100% QMP N/A N/A Section 1,2 100% ATPC N/A N/A N/A 100% IP N/A N/A N/A 100%

The Year 2009 teams used a value-neutral formal V&V process (FV&V), a variant of the Fagan inspection [Fagan, 1976] practice, to review the three artifact packages. The steps they followed are:

63 Table 10. Value-neutral Formal V&V process Step 1: Create Exit Criteria: From the original team assignment s description and the related ICSM EPG completion criteria, generate a set of exit criteria that identify what needs to be present and the standard for acceptance of each document. Step 2: Review and Report Concerns: Based upon the exit criteria, read (review) the documents and report concerns and issues into the Bugzilla [USC_CSSE_Bugzilla] system. Step 3: Generate Evaluation Report Management Overview - List any features of the solution described in this artifact that are particularly good, of which a non technical client should be aware of. Technical Details - List any features of the solution described in this artifact that you feel are particularly good, and which a technical reviewer should be aware of. Major Errors & Omissions - List top 3 errors or omissions in the solution described in this artifact that a non technical client would care about. The description of an error (or omission) should be understandable to a non technical client, and should explain why the error is worth the client s attention. Critical Concerns - List top 3 concerns with the solution described in this artifact that a non technical client would care about. The description of the concern should be understandable to a non technical client, and should explain why the client should be aware of it. You should also suggest step(s) to take that would reduce or eliminate your concern. Year 2010 and 2011 teams applied the value-based, dependency-aware prioritization strategy to the review process with the guidelines for inspection as summarized as in Table

64 Table 11. Value-based V&V process Step 1: Value-based V&V Artifacts Prioritization Priority Factor Importance 5: most important 3: normal 1: least important Quality Risk 5: highly risky 3: normal 1: least risky Dependency 5: highly dependent 3:normal 1: not dependent Review Cost 5: need intensive effort 3: need moderate effort 1: need little effort Determine Weights Rating Guideline Without this document, the project can t move forward or could even fail; it should be rated with high importance Some documents serve a supporting function. Without them, the project still could move on; this kind of document should be rated with lower importance Based on previous reviews, the documents with intensive defects might be still fault-prone, so this indicates a high quality risk Personnel factors, e.g. the author of this documents is not proficient or motivated enough; this indicates a high quality risk A more complex document might have a high quality risk A new document or an old document with a large portion of newly added sections might have a high quality risk Sometimes some lower-priority artifacts are required to be reviewed at least for reference before reviewing a higher-priority one. For example, in order to review SSAD or TPC, SSRD is required for reference. Basically, the more documents this document depends on, the higher the Dependency rating is, and the lower the reviewing priority will be A new document or an old document with a large portion of newly added sections usually takes more time to review and vice versa A more complex document usually takes more time to review and vice versa Weights for each factor (Importance, Quality Risk, Review Cost, and Dependency) could be set according to the project context. Default values are 1.0 for each factor Priority Calculation E.g: for a document, Importance=5, Quality Risk=3, Review Cost=2, Dependency = 1, default weights are used=> Priority= (5*3)/(2*1)=7.5 A spreadsheet [USC_577a_VBV&VPS, 2010] helps to calculate the priority automatically, 5-level ratings for each factor are VH, H, M, L VL with values from 5 to 1, intermediate values 2, 4 are also allowed. Step 2: Review artifacts based on prioritization and report defects/issues The one with higher priority value should be reviewed first For each document s review, review the core part of the document first. Report issues into the Bugzilla [USC_CSSE_Bugzilla] Step 3: List top 10 defects/ issues List top 10 highest-risk defects or issues based on issues priority and severity 47
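A minimal sketch of the Step 1 priority arithmetic in Table 11: the first call reproduces the worked example Priority = (5*3)/(2*1) = 7.5, and, applied to the ratings of the 2010-team example shown next in Table 12 (VH/H/M/L/VL mapped to 5/4/3/2/1 as in Table 11), it reproduces the stated review order SSRD, OCD, PRO, SSAD, LCP, FED, SID. The per-factor weights supported by the spreadsheet are omitted, since their exact combination rule is not reproduced here.

```python
# Artifact review priority from Table 11:
# Priority = (Importance * Quality Risk) / (Review Cost * Dependency).
# Per-factor weights (default 1.0 in the spreadsheet) are omitted in this sketch.

def review_priority(importance, quality_risk, review_cost, dependency):
    return (importance * quality_risk) / (review_cost * dependency)

print(review_priority(5, 3, 2, 1))   # 7.5, the worked example in Table 11

# Ratings read from the Table 12 example, as tuples of
# (Importance, Quality Risk, Dependency, Review Cost) with VH..VL -> 5..1.
artifacts = {
    "LCP":  (3, 2, 2, 3), "OCD": (4, 5, 3, 4), "FED": (4, 4, 4, 4),
    "SSRD": (5, 5, 2, 5), "SSAD": (5, 5, 4, 5), "SID": (1, 2, 5, 1),
    "PRO":  (4, 2, 3, 2),
}
order = sorted(
    artifacts,
    key=lambda a: review_priority(artifacts[a][0], artifacts[a][1],
                                  artifacts[a][3], artifacts[a][2]),
    reverse=True,
)
print(order)   # ['SSRD', 'OCD', 'PRO', 'SSAD', 'LCP', 'FED', 'SID']
```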

65 A real example of artifacts prioritization in one package review by a 2010-team [USC_577a_VBV&VAPE, 2010] is displayed in Table 12. The default weight of 1.0 for each factor is used. Based on the priority calculated, reviewing order follows SSRD, OCD, PRO, SSAD, LCP, FED, SID. SSRD has the highest reviewing priority with the rationales provided: SSRD contains the requirements of the system, without this document, the project can't move forward or could even fail (Very-High Importance). This is a complex document, and needs to be consistent with win conditions negotiation, which might not be complete at this point, also, a lot of rework was required based on comments from TA (Very-High Quality Risk). SSRD depends on few other artifacts (Low Dependency). This is an old document, but it is complex with a lot of rework (Very-High Review Cost). Table 12. An example of value-based artifact prioritization Weights: Importance Quality Risk Dependency Review Cost Priority M L L M LCP This document describes the life cycle plan of the project. This document serves as supporting function, without this, the project still could move on. With his document, the project could move more smoothly. Based on previous reviews, the author of this document has a strong sense of responsibility. A lot of new sections added, but this document is not very complex H VH M H OCD This document gives the overall operational concept of the system. This document is important, but it is not critical for this success of the system. This is a complex document and a lot of the sections in this document needed to be redone based on the comments received from the TA. SSRD Old document, but a lot of rework done

66 H H H H FED This document should be rated high because it provides feasibility evidence for the project. Without this document, we don't know whether the project is feasible. The author of this document does not have appropriate time to complete this document with quality work. SSRD, SSAD A lot of new section added to this version of the document VH VH L VH SSRD This document contains the requirements of the system. Without this document, the project can't move forward or even fail. This is a complex document. This document needs to be consistent with win conditions negotiation, which might not be complete at this point. Also, a lot of rework was required based on comments from TA. This is an old document, but it is complex with a lot of rework VH VH H VH SSAD This document contains the architecture of the system. Without this document, the project can't move forward or even fail. This is a complex document and it is a new document. The author of this document did not know that this document was due until the morning of the due date. SSRD, OCD This is an old document, but it is complex with a lot of rework done for this version VL L VH VL SID This document serves as supporting function, without this document, the project still could move on, but the project could move on more smoothly with this document. This is an old document. Only additions made to existing sections. OCD, SSRD, FED, LCP, SSAD, PRO This is an old document and this document has no technical contents H L M L PRO Without this document, the project can probably move forward, but the system might not be what the customer is expecting. This document allows the customer to have a glimpse of the system. This is an old document with little new contents. The author has a high sense of responsibility and he fixed bugs from the last review in reasonable time. FED This is an old document with little content added since last version and not much rework required An example of Top 10 issues made by this team for CoreFCP evaluation is displayed in Table 13. These Top 10 issues are communicated in a timely manner with artifact authors to attract enough emphasis. The interesting finding is the relations between 49

67 the artifact priority sequence and the top 10 issues sequence: the issues with higher impact usually exist in the artifacts with high priority, showing that the artifact prioritization enables reviewers to focus on issues with high impact at least in this context. However, it also helps avoid the potential problem of neglecting high-impact issues in lower-priority artifacts, as in Issues 8 and 10. Table 13. An example of Top 10 Issues Summary 1 SSRD Missing important requirements. 2 SSRD Requirement supporting information too generic. 3 SSAD Wrong cardinality in the system context diagram. 4 OCD The client and client advisor stakeholders should be concentrating on the deployment benefits. 5 OCD The system boundary and environment missing support infrastructure. 6 FED Missing use case references in the FED. 7 FED Incorrect mitigation plan. Rationale A lot of important requirements are missing. Without these requirements, the system will not succeed. The output, destination, precondition, and post condition should be defined better. These description will allows the development team and the client better understand the requirements. This is important for system success. The cardinality of this diagram needs to be accurate since this describes the top level of the system context. This is important for system success. It is important for that this benefits chain diagram accurately shows the benefits of the system during deployment in order for the client to show to potential investor to gather fund to support the continuation of system development. It is important for the System boundary and environment diagram to capture all necessary support infrastructure in order for the team to consider all risks and requirements related the system support infrastructure. Capability feasibility table proves the feasibility of all system capabilities to date. Reference to the use case is important for the important stakeholders to understand the capabilities and their feasibility. Mitigation plans for project risks are important to overcome the risks. This is important for system success. 8 LCP Missing skills and roles The LCP did not identify the skill required and roles for next semester. This information is important for the success of the project because the team next semester can use these information and recruit new team members meeting the identified needed skills. 50

68 9 FED CR# in FED doesn't match with CR# in SSRD 10 LCP COCOMO drivers rework The CR numbers need to match in both FED and SSRD for correct requirement references. COCOMO driver values need to be accurate to have a better estimate for the client. The three-year experiment issue data for the evaluation of CoreFCP, DraftFCP and FC/DCP from total 35 teams is collected and extracted from the Bugzilla database. The generic term Issue covers both Concerns and Problems. If the IV&Vers find any issue, they report it as a Concern in Bugzilla and assign it to the relevant artifact author. The author determines whether the concern is a problem or not. As transformed in Table 14, Severity is rated from High (corresponding to ratings of Blocker, Critical, Major in Bugzilla ), Medium (corresponding the rating of Normal in Bugzilla), Low ( the ratings of Minor, Trivial, Enhancement in Bugzilla) with the value from 3 to 1. Priority is rated from High (Resolve Immediately), Medium (Normal Queue), Low (Not Urgent, Low Priority, Resolved Later) with the value from 3 to 1. The Impact of an issue is the product of its Severity and Priority. The impact of an issue with high severity and high priority is 9. Obviously, the impact of an issue is an element in the set {1, 2, 3, 4, 6, and 9}. 51

69 Table 14. Issue Severity & Priority rate mapping Rating for Measurement Rating in Bugzilla Value High Blocker, Critical, Major 3 Severity Medium Normal 2 Low Minor, Trivial, Enhancement 1 High Resolve Immediately 3 Priority Medium Normal Queue 2 Low Not Urgent, Low Priority, Resolved Later 1 The generic term Issue covers both Concerns and Problems. If the IV&Vers find any issue, they report it as a Concern in Bugzilla and assign it to the relevant artifact author. The author determines whether it needs fixing by choosing an option for Resolution as displayed in Table 15. Whether an issue is a problem or not is easy to be determined by querying the Resolution of the issue. Fixed and Won t Fix mean the issue is a problem and the other two options mean that it is not. Table 15. Resolution options in Bugzilla Resolution Options Fixed Won t Fix Invalid WorksForMe Instructions in Bugzilla If the issue is a problem, after you fix the problem in the artifact, then choose Fixed If the issue is a problem, but won t be fixed for this time, then choose Won t Fix and must provide the clear reason in Additional Comments why it can t be fixed for this time If the issue is not a problem then choose Invalid and must provide a clear reason in Additional Comments If the issue really works fine, then choose WorksForMe and let the IVVer review this again 52

70 4.3. Results Various measures in Table 16 are used to compare the performance of 2011, 2010 years value-based and 2009 value-neutral review process. The main goal of the Valuebased review or inspection is to increase the review cost effectiveness as defined in Chapter 3. Table 16. Review effectiveness measures Measures Number of Concerns Number of Problems Number of Concerns per reviewing hour Number of Problems per reviewing hour Review Effort Review Effectiveness of total Concerns Review Effectiveness of total Problems Average of Impact per Concern Average of Impact per Problem Review Cost Effectiveness of Concerns Review Cost Effectiveness of Problems Details The number of concerns found by reviewers The number of problems found by reviewers The number of concerns found by reviewers per reviewing hour The number of problems found by reviewers per reviewing hour Effort spent on all activities in the package review As defined in Chapter 3 but for concerns As defined in Chapter 3 but for problems Review Effectiveness of total Concerns/ Number of Concerns Review Effectiveness of total Problems/ Number of Problems As defined in Chapter 3 but for concerns As defined in Chapter 3 but for problems Table 17 to Table 22 list the three years 35 teams performances on different measures for concerns, and problems data is similar and is not listed here due to page limitation. Mean and Standard Deviation values are calculated at the bottom of each measure. 53
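A minimal sketch of how these measures can be computed from an export of the Bugzilla issues, assuming (consistently with the Chapter 3 definitions) that Review Effectiveness is the summed Impact of the reported concerns; the sample records and the effort figure are illustrative.

```python
# Sketch of the Table 16 measures computed from issues exported from Bugzilla.
# Each issue is a (severity, priority) pair on the 3/2/1 scale of Table 14, and
# Impact = Severity * Priority. Sample records and effort are illustrative.

def review_measures(issues, review_effort_hours):
    impacts = [sev * pri for sev, pri in issues]
    effectiveness = sum(impacts)          # total Impact of the reported concerns
    return {
        "number_of_concerns": len(issues),
        "concerns_per_hour": len(issues) / review_effort_hours,
        "review_effectiveness": effectiveness,
        "avg_impact_per_concern": effectiveness / len(issues),
        "review_cost_effectiveness": effectiveness / review_effort_hours,
    }

issues = [(3, 3), (3, 2), (2, 2), (2, 1), (1, 1)]
print(review_measures(issues, review_effort_hours=4.0))
# {'number_of_concerns': 5, 'concerns_per_hour': 1.25, 'review_effectiveness': 22,
#  'avg_impact_per_concern': 4.4, 'review_cost_effectiveness': 5.5}
```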

71 Table 17. Number of Concerns 2011 Teams 2010 Teams 2009 Teams T T T-1 58 T-3 82 T T-2 45 T T-3 53 T T T-4 33 T-4 87 T-6 38 T-5 60 T-5 32 T-7 78 T T-6 58 T T-7 98 T T T-8 94 T T T T T T T T T T T T Mean Mean Mean Stdev Stdev Stdev

72 Table 18. Number of Concerns per reviewing hour 2011 Teams 2010 Teams 2009 Teams T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T Mean 4.17 Mean 2.08 Mean 1.36 Stdev 2.12 Stdev 0.93 Stdev

73 Table 19. Review Effort 2011 Teams 2010 Teams 2009 Teams T T T T T T T T T T T T-4 61 T T T T T T T T T T T T T T-9 72 T T T T T T T T T Mean Mean Mean Stdev 7.30 Stdev Stdev

74 Table 20. Review Effectiveness of total Concerns 2011 Teams 2010 Teams 2009 Teams T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T Mean Mean Mean 292 Stdev Stdev Stdev

75 Table 21. Average of Impact per Concern 2011 Teams 2010 Teams 2009 Teams T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T Mean 4.36 Mean 4.42 Mean 3.97 Stdev 0.54 Stdev 0.52 Stdev

76 Table 22. Cost Effectiveness of Concerns 2011 Teams 2010 Teams 2009 Teams T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T Mean Mean 9.30 Mean 5.31 Stdev Stdev 4.53 Stdev 1.86 Table 23 compares the Mean and Standard Deviation values for all the measures between the three-year teams. To determine whether the differences between years based on a measure is statistically significant or not, Table 24 compares every two years data using the F-test and T-test. The F-test determines whether two samples have different variances. If the significance (p-value) for F-test is 0.05 or below, the two samples have different variances. This will determine which type of T-test will be used to determine whether the two samples have the same mean. Two types of T-test are: Two-sample equal variance (homoscedastic), and Two-sample unequal variance (heteroscedastic). If the 59

77 significance (p-value) for T-test is 0.05 or below, the two samples have different means. For example, Table 24 shows that 2010 s value-based review teams had a 75.04% higher Review Cost Effectiveness of Concerns than 2009 s value-neutral teams. The p-value for F-test leads to choose Two-sample unequal variance type T-test. The p-value for T-test is strong evidence (well below 0.05) that the 75.04% improvement has statistical significance, the similar for its comparison between 2011 and 2009 (with F-test , and T-test ), which rejects the hypothesis H-r1. Table 23. Data Summaries based on all Metrics 2011 Team 2010 Team 2009 Team Mean Stdev Mean Stdev Mean Stdev Number of Concerns Number of Problems Number of Concerns per reviewing hour Number of Problems per reviewing hour Review Effort Review Effectiveness of total Concerns Review Effectiveness of total Problems Average of Impact per Concern Average of Impact per Problem Review Cost Effectiveness of Concerns Review Cost Effectiveness of Problems
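The variance-check-then-t-test procedure just described can be sketched as follows; scipy is assumed to be available, and the sample arrays are illustrative rather than the actual team data.

```python
# F-test for equal variances, then the matching two-sample t-test, following the
# selection procedure described above. Sample arrays are illustrative.
import numpy as np
from scipy import stats

def compare_groups(a, b, alpha=0.05):
    a, b = np.asarray(a, float), np.asarray(b, float)
    f_stat = np.var(a, ddof=1) / np.var(b, ddof=1)
    dfn, dfd = len(a) - 1, len(b) - 1
    p_f = 2 * min(stats.f.sf(f_stat, dfn, dfd), stats.f.cdf(f_stat, dfn, dfd))
    equal_var = p_f > alpha              # variances differ when p_f <= alpha
    t_stat, p_t = stats.ttest_ind(a, b, equal_var=equal_var)
    return {"F": f_stat, "p_F": p_f, "equal_var": equal_var, "t": t_stat, "p_t": p_t}

value_based   = [9.1, 7.4, 12.3, 8.8, 10.5, 6.9]   # e.g. cost effectiveness of concerns
value_neutral = [5.0, 4.2, 6.1, 5.8, 4.9, 5.6]
print(compare_groups(value_based, value_neutral))
```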

78 Table 24. Statistics Comparative Results between Years 2011 Vs Vs Vs 2010 % 2011 Team higher F-test T-test % 2010 F-test Team (p-value) (p-value) higher (p-value) T-test % 2011 Team higher F-test T-test (pvalue) (pvalue) (pvalue) Number of Concerns 53.96% % % Number of Problems 57.90% % % Number of Concerns per reviewing hour % % % Number of Problems per reviewing hour % % % Review Effort % % % Review Effectiveness of total Concerns 76.24% % % Review Effectiveness of total Problems 80.78% % % Average of Impact per Concern 9.74% % % Average of Impact per Problem 9.46% % % Review Cost Effectiveness of Concerns % % % Review Cost Effectiveness of Problems % % % In Table 24 the shadowed sections represent that those comparisons are statistically significant, we can see that 2010 teams performance improves from 2009 teams on most of the measures, except the number of concerns/problems, and review effort teams performance even improves from 2009 teams on all the measures. Since Year 2010 and 2011 teams all adopted the same value-based inspection process, their differences on the measures between the two years are expected to be insignificant. However, we find that the review effort in 2011 is dramatically decreased, which directly causes significant differences on other measures relevant to review effort between 2010 and 2011, such as review effort, number of concerns/problems per reviewing hour, review cost effectiveness of concerns/problems. The decreased review effort in 2011 is due to 2011 year s team size change: 2011 teams have an average size of 6.5 (6 or 7) developers with 1 reviewer each team, while 2010 teams have an average size of 7.5 (7 or 8) developers with an average of 1.5 (1 or 2) reviewers each team, decreased 61

number of reviewers per team leads to the decreased review effort. This uncontrolled factor might partially contribute to the overall factor-of-2.5 improvement from 2009 to 2011, or the overall 100% improvement from 2010 to 2011, in the review cost effectiveness of concerns/problems, which might be a potential threat to the validity of our positive results. However, we also find that the comparison of all other measures that are unrelated to review effort, such as the average impact per concern/problem and the number of concerns/problems, shows that the 2010 and 2011 performances are similar. The two reviewers in each 2010 team usually both reviewed all documents and tended not to report a duplicate concern if a similar one was already in the concern list, so it makes sense that 2010 and 2011 have nearly the same number of concerns (no statistically significant difference), while the review effort in 2010 was nearly doubled since the number of reviewers was nearly twice as large. This might also hint that one reviewer per team could be enough for 577ab projects. It further indicates that, similar to 2010, reviewers using the value-based inspection process tend to report issues with higher severity and priority, which minimizes the threat that the change of reviewer team size poses to our results. To sum up, these comparative analysis results show that the value-based review method of prioritizing artifacts can improve the cost effectiveness of reviewing activities, enable reviewers to focus more on artifacts with high importance and risk, and capture concerns/problems with high impact.

Besides, to complement the quantitative analysis, a survey was distributed to reviewers after the Value-based prioritization strategy was introduced. In their feedback, almost all of the 14 Year 2009 teams, 8 Year 2010 teams and 13 Year 2011 teams chose the Value-based reviewing process. Various advantages were identified by reviewers, such as:

80 more streamlined, efficient, not a waste of time, more focused on most important documents with high quality risks, more focused on non-trivial defects and issues, an organized and systematic way to review documents in an integrated way, not treating documents independently. Some example responses are as below: The value-based V&V approach holds a great appeal a more intensive and focused V&V process. Since items are prioritized and rated as to importance and likelihood of having errors. This is meant for you to allocate your time according to how likely errors (and how much damage could be done) will occur in an artifact. By choosing to review those areas that have changed or are directly impacted by changes in the other documents I believe I can give spend more quality time in reviewing the changes and give greater emphasis on the changes and impacts. Top 10 issue list gives a centralized location for showing the issues as opposed to spread across several documents. Additionally, by prioritizing the significance of each issue, it gives document authors a better picture of which issues they should spend more time on resolving and let them know which ones are more important to resolve. Previously, they would have just tackled the issues in any particular order, and may not have spent the necessary time or detail to ensure proper resolution. Focusing on a top 10 list helps me to look at the bigger picture instead of worrying about as many minor problems, which will result in documents that will have fewer big problems. For the review of the Draft FC Package, the Value-based IIV&V Process will be used. This review process was selected because of the time constraint of this review. There is only one weekend to review all seven Draft FC Package documents. The Valuebased review will allow me to prioritize the documents based on importance, quality risk, 63

81 dependencies, and reviewing cost. The documents will be reviewed based on its identified priority. This allows documents more critical to the success of the project to be reviewed first and given more time to. These responses and the unanimous choice of using the Value-based process show that the performers considered the Value-based V&V process to be superior to the formal V&V process for achieving their project objectives. The combination of both qualitative and quantitative evidence produced viable conclusions. 64

82 Chapter 5: Case Study II-Prioritize Testing Scenarios to be Applied 5.1. Background This case study to prioritize testing scenarios was implemented at the acceptance testing phase of one project in Galorath, Inc. [Galorath]. The project is designed to develop automated testing macros/scripts for the company s three main products (SEER- SEM, SEER-H, and SEER-MFG) to automate their installation/un-installation/upgrade processes. The three macros below automate the work-flow for installation test, uninstallation test and upgrade test respectively: Macro1: New Install Test integrates the steps of: Install the current product version-> Check correctness of the installed files and generate a report-> Export registry\odbc\shortcut files-> Check correctness of those exported files and a generate report Macro2: Uninstall Test integrates the steps of: Uninstall the current product version-> Check whether all installed files are deleted after un-installation & generate a report-> Export registry\odbc\shortcut files-> Check whether registry\odbc\shortcut files are deleted after un-installation and generate a report 65

83 Macro 3: Upgrade Test integrates the steps of: Install one of previous product versions-> Upgrade to the current version-> Check correctness of installed files & generate a report-> Export registry\odbc\shortcut files-> Check correctness of those exported files & generate a report-> Uninstall the current product version-> Return to the beginning (finish until all previous product versions are all tested) Secondly, these macros are going to be finally released to their testers, consultants, developers for internal testing purpose at the end. They are supposed to run these macros on their own machines or virtual machines on their host machines to do the installation testing (not like a dedicated testing server) and they need to deal with various variables: Different products (SEER-SEM, SEER-H, and SEER-MFG) installing, uninstalling and upgrading processes are different and should be recorded and replayed respectively; The paths of registry files vary due to different OS bit (32 bit or 64 bit); The paths of shortcuts are different due to different operating systems (WinXP, Vista, Win7, Server 2003, and Server 2008) and OS bit; Different installation types (Local, Client, and Server) will result in different installation which will be displayed in registry files; In sum, the automation is supposed to work well for three types of installation type (Local, Client, Server) on different various operating systems (i.e. Win7, Vista, WinXp ) 66

with 32-bit or 64-bit, and on various virtual machines as well. The combination of these variables increases the number of operational scenarios to be tested during the acceptance testing phase before the fixed release date. In our case study, we define one scenario as testing that one product (SEER-MFG, SEER-H or SEER-SEM) can be installed, uninstalled, and upgraded from its previous versions correctly, without any performance issue, on one operating system environment with one installation type. For example, for the Server installation type, three kinds of servers need to be tested, i.e. WinServer 2003 x32, 2008 x64 and 2008 x32; for each of the three SEER products, this results in 3*3=9 scenarios. For the Local or Client installation type, the 10 workable operating systems are listed in Table 32 and Table 33, and combined with each of the three SEER products this results in 10*3=30 scenarios for each type. As shown in Figure 15, the number of leaf nodes is 3*3+10*3+10*3=69, which means there are 69 paths from the root to the leaf nodes, representing 69 scenarios to be tested before final release. The time required to test one scenario is roughly 267 minutes, about 4.4 hours, on average (Table 31). So the time required to run all 69 scenario tests is 69*4.4 = 306 hours = 39 working days, and this does not even count the time for fixing and re-testing. Even if several computers run the tests in parallel, it is still impossible to finish before the fixed release time.

Figure 15. Scenarios to be tested

5.2. Case Study Design

In order to improve the cost-effectiveness of testing under the time constraint, coverage-based and value-based testing strategies are combined, as described below.

Maximize Testing Coverage

As displayed in Table 25, Macro 3 covers all the functionalities and is expected to catch all defects that Macro 1 and Macro 2 would expose. So the coverage-based strategy is: first test Macro 3, following the coverage-based testing principle. If a defect is found in Macro 3, check whether it also exists in the features shared with Macro 1 and Macro 2; if so, adapt the fix to Macro 1 and Macro 2 and test them as well. Under the most optimistic situation, in which Macro 3 passes without any performance issues, the macro-running time is only the time of running Macro 3, which saves the effort of testing Macro 1 and Macro 2 individually.

Table 25. Macro-feature coverage Features Macro 1 Macro 2 Macro 3 Install process X X Uninstall process X X Upgrade process X Export installed files X X Compare files size, date and generate report1 X X Export ODBC registry files X X X Export Registry files X X X Export shortcuts X X X Combine files X X X Compare file's content and generate report2 X X X
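As an illustration of this coverage argument, a greedy feature-cover over the macro-feature matrix selects Macro 3 alone, since it already covers every feature; the feature sets below paraphrase Table 25, whose marks for two of the shared rows are ambiguous in the flattened text, so they should be read as an approximation.

```python
# Greedy feature-cover over a macro-to-feature matrix paraphrased from Table 25.
# Macro 3 already covers every feature, so it alone is selected; Macros 1 and 2
# only need rerunning when a Macro 3 failure touches one of their shared features.

coverage = {
    "Macro 1": {"install", "export installed files", "compare size/date",
                "export odbc", "export registry", "export shortcuts",
                "combine files", "compare content"},
    "Macro 2": {"uninstall", "export odbc", "export registry", "export shortcuts",
                "combine files", "compare content"},
    "Macro 3": {"install", "uninstall", "upgrade", "export installed files",
                "compare size/date", "export odbc", "export registry",
                "export shortcuts", "combine files", "compare content"},
}

def greedy_cover(coverage):
    uncovered = set().union(*coverage.values())
    order = []
    while uncovered:
        best = max(coverage, key=lambda m: len(coverage[m] & uncovered))
        order.append(best)
        uncovered -= coverage[best]
    return order

print(greedy_cover(coverage))   # ['Macro 3']
```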

Besides, the value-based testing prioritization strategy was applied to further improve testing cost-effectiveness by focusing the scarce testing resources on the most valuable and risky parts of those macros. The project manager and the product manager helped provide the business value of the scenarios, based on their frequencies of use (FU) combined with product importance (PI) as a variant of business value. In addition, from previous testing experiences and observations, we know which environments tend to have more performance issues and which parts of the macros tend to be bottlenecks; all of this information helps with the estimation of each scenario's Risk Probability (RP). Through this value-based prioritization, the testing effort is put on the scenarios with higher frequency of use and higher risk probability, avoiding the testing of scenarios that are seldom or never used. The following sections introduce in detail how the testing priorities are determined step by step. Basically, Table 26 to Table 28 display the rating guidelines for FU and RP, Table 30 and Table 31 show the rating guidelines for TC, and Table 32 and Table 33 illustrate all the rating results for these scenarios. In this part, several acronyms are used:

FU: Frequency of Use
RP: Risk Probability
TC: Testing Cost
TP: Test Priority
BI: Business Importance
PI: Product Importance

The step to determine Business Value

In order to quantify the Frequency of Use (FU), a survey with the rating guideline in Table 26 was sent to the project manager and the product manager to rate the relative FU of the various scenarios.

Table 26. FU Ratings
1 (+): Least frequently used; if we have enough time, it is OK to test.
3 (+++): Normally used, so it needs to be tested in a normal queue and made sure to work well.
5 (+++++): Most frequently used, so it must be tested first and thoroughly, and the macros must be made sure to work well.

Based on the ratings they provided, among the host machines, WinXP and Win 7 (x64) have the highest frequency of use at Galorath, Inc. For the server installation test, people at Galorath, Inc. usually use virtual machines of WinServer 2003 (x32) and WinServer 2008 (x64), so these are rated the highest. Win 7 (x32) host machines are not used as much as WinXP and Win 7 (x64), but people frequently use its virtual machine to do the test, so it is also rated the highest. Vista (x64) has seldom been used before, and there is not even a virtual copy of it, so it was rated the lowest, as shown in Table 32 and Table 33. Besides, they also provided the relative product importance ratings shown in Table 27, which are combined to determine the business value of a scenario as well.

Table 27. Product Importance Ratings
SEER-MFG: 2
SEER-H: 2
SEER-SEM:

The step to determine Risk Probability

In order to quantify the probability of a performance issue occurring, Table 28 gives rules of thumb for rating the probability. The subjective ratings are based on past experiences and observations.

Table 28. RP Ratings
0: Has already passed testing
0.3: Low
0.5: Normal
0.7: High
0.9: Very High

From previous random testing experiences on different operating systems, the general performance order from low to high is Vista < WinXP (x32) < Win7 (x64). However, the WinXP (x32) host machine already passed the test when these macros were developed, so its RP rating is 0. Win7 (x64) is supposed to work better than WinXP (x32), but it has never been thoroughly tested before, so we rated its RP as Low. Vista (either x32 or x64) is supposed to have lower performance, so we rated its RP as High.

Win7 (x32) is supposed to work as well as WinXP (x32) but not better than Win7 (x64), so we rated its RP as Normal. Besides, from previous random testing we learned that a virtual machine's performance is usually lower than that of its host machine, and our experience is consistent with many discussions on professional forums and in technical papers, so we rated each virtual machine's RP no lower than its host's. These ratings are also shown in Table 32 and Table 33. Furthermore, during our brainstorming of these macros' quality risks, the project manager provided the information that few defects had been found for the Client installation type before and that no recent modifications had been made for the recent release. So we only needed to test the Local and Server installation types, as shown in Table 29. This information greatly reduced the testing scope and avoided testing the defect-free parts.

Table 29. Installation Type (Need Test?)
Local: 1
Server: 1
Client:

The step to determine Cost

Table 30 shows the roughly estimated average time to run each macro. The total time to run all three macros for one scenario is their sum, 125 minutes.

Table 30. Average Time for Testing Macros 1-3
Macro 1: 25 mins
Macro 2: 25 mins
Macro 3: 75 mins

In fact, the time to run one scenario consists not only of the time to run the macros; the testing preparation time is non-negligible as well. Setting up the testing environment includes configuring all installation prerequisites, setting up the expected results, and installing/configuring the COTS required for macro execution. If the operating system on which the macros will be tested is not available, installing a proper one takes even longer. So we defined the three-level cost ratings shown in Table 31; the relative cost ratings are roughly 1:2:5.

Table 31. Testing Cost Ratings
Run Macros (125 mins) only: Cost Rating 1
Setup Testing Environments (60 mins) + Run Macros: Cost Rating 2
Install OS (3 hours) + Setup Testing Environments + Run Macros: Cost Rating 5

As shown in Table 32 and Table 33, the WinXP and Win7 (x64) host machines, on which we developed the macros, already have the testing environments set up, so their testing cost consists only of the time to run the macros and their cost rating is as low as 1. For Vista (x64) and Win 7 (x32), no one at Galorath, Inc. has a host machine, so installing the OS is additionally required and they are rated as high as 5. For all virtual machines, Galorath, Inc. has movable copies, so we do not need to install an OS but do have to set up the testing environments on them, and they are rated 2.

The step to determine Testing Priority

After a scenario passes testing, its probability of failure is reduced to 0, so the testing priority (TP) triggered by RRL is calculated as

TP = RRL = (FU x RP - FU x 0) / TC = (FU x RP) / TC

Testing Priorities for all scenarios are calculated as FU*RP/TC, as shown in Table 32 and Table 33.

Table 32. Testing Priorities for 10 Local Installation Working Environments Local Installation Host Machine Virtual Machine working on the host on the same row FU RP TC TP (RRL) FU RP TC TP (RRL) WinXP (x32) Vista (x32) Win7 (x64) WinXP(x32) Win7 (x32) Vista(x32) Vista (x32) Vista (x64) Win7 (x32) WinXP (x32)
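A sketch of the TP computation and ordering behind Table 32 and Table 33, including the SEER-SEM-first tie-break by product importance from Table 27; the environment ratings and the SEER-SEM importance value are illustrative placeholders, since the numeric table entries are not reproduced here.

```python
# Scenario testing priority TP = (FU * RP) / TC, with product importance
# (Table 27) as the tie-break inside each environment. Environment ratings and
# the SEER-SEM importance value are illustrative placeholders.

PRODUCT_IMPORTANCE = {"SEER-SEM": 3, "SEER-H": 2, "SEER-MFG": 2}

def tp(fu, rp, tc):
    return fu * rp / tc        # passing drops the failure probability RP to 0

environments = {               # name: (FU, RP, TC), on the Table 26/28/31 scales
    "Win7 x64 host":         (5, 0.3, 1),
    "WinXP x32 VM on Win7":  (5, 0.5, 2),
    "Vista x32 VM on WinXP": (3, 0.7, 2),
    "Vista x64 host":        (1, 0.7, 5),
}

scenarios = sorted(
    ((env, prod) for env in environments for prod in PRODUCT_IMPORTANCE),
    key=lambda s: (tp(*environments[s[0]]), PRODUCT_IMPORTANCE[s[1]]),
    reverse=True,
)
for env, prod in scenarios[:6]:
    print(f"{env:22s} {prod:9s} TP={tp(*environments[env]):.2f}")
```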

Table 33. Testing Priorities for 3 Server Installation Working Environments (virtual machines on the Win 7 (x64) host; columns: FU, RP, TC, TP (RRL)): WinServer 2003x32, WinServer 2008x64, WinServer 2008x32

Combined with the product importance ratings in Table 27, the value-based scenario testing prioritization algorithm is: first test the scenario whose working environment has the highest TP (RRL); within each selected operating system environment, first test SEER-SEM, which has the higher importance, and then test SEER-H or SEER-MFG, which have lower importance.

5.3. Results

Table 34 shows the value-based testing prioritization order and the relevant metrics based on this order. Several acronyms are used below:

RRL: Risk Reduction Leverage
BI: Business Importance
ABI: Accumulated Business Importance
PBIE: Percentage of Business Importance Earned
APBIE: Average Percentage of Business Importance Earned
AC: Accumulated Cost

93 The first row TP (RRL) in Table 34 shows the testing order we followed to do this testing by first testing the scenario with higher RRL. This order enabled us to focus the limited effort on testing more frequently used scenarios with higher risk probability to fail, and supposed to improve the testing efficiency especially when the testing time and resource is limited. The testing results by using the value-based testing prioritization strategy are shown in Table 35 and Table 36. Due to the schedule constraint, and according to the TP order, we didn t do thorough test on WinXP (x32) Virtual Machine working on host of Vista (x32) and Vista (x64) host machine, since they both has the lowest frequency of use, they can be ignorable for testing if the time runs out. For Win7 (x32), although it is never tested, it is supposed to pass since its Virtual Machine copy, which is supposed to have even lower performance, has passed the testing. Besides, if we installed a Win 7 (x32) on a host machine to test, this will cause more time, and we couldn t finish other scenario testing which has higher TP and won t require installing a new OS before testing. Therefore, the testing strategy combines the considerations of all critical factors and makes the testing results optimal under scarce testing resources. Table 34. Value-based Scenario Testing Order and Metrics TP(RRL) Passed FU(BI) PBIE 48.15% 54.32% 58.02% 61.73% 67.90% 74.07% 77.78% 83.95% 90.12% 93.83% 95.06% 98.77% % ABI TC AC APC 3.33% 6.67% 13.33% 20.00% 26.67% 33.33% 40.00% 46.67% 53.33% 60.00% 66.67% 83.33% % ABI/AC

94 Table 35. Testing Results Local Installation Host Machine Virtual Machine working on the host on the same row WinXP pass Vista (x32) pass (x32) WinXP pass (x32) Win7 (x64) pass Win7 (x32) pass Vista (x32) pass WinXP Never test, we are running out of time, FU is the lowest, Vista (x32) pass (x32) no need to test when the testing time is limited Never test, we even don t have VM for this, besides, we are running out of time, Vista (x64) FU is the lowest, no need to test when the testing time is limited Never test, we don t have a host machine for this, but supposed to pass, since its Win7 (x32) VM has passed Table 36. Testing Results (continued) Win 7 (64) Server Installation WinServer 2003x32 WinServer 2008x64 WinServer 2008x32 pass pass pass Figure 16 shows the results of value-based testing prioritization compared with two other situations which might be common in testing planning as well. The three situations for comparison are: Situation 1: value-based testing prioritization strategy: this situation is exactly what we did for the macro testing in Galorath, Inc., using the value-based scenario testing strategy. We followed the Testing Priority (TP) to do the testing. Since our testing time is limited, we had to stop testing when the Accumulated Cost (AC) reached 18 units as shown in Figure 16. At this point, Percentage of Business Importance Earned (PBIE) is as high as 93.83%; Situation 2: Reverse of value-based, risk-driven testing strategy: this situation s testing order is reversed from Situation 1; when the AC reaches 18 units, PBIE is only 77

22.22%; this is the worst case, but it might also be a common value-neutral situation in reality.

Situation 3: The prioritization in Situation 1 takes all variables into account: it prioritizes not only the various operating systems but also the different products and installation types. In Situation 3, by contrast, we do a partial value-based prioritization: we still prioritize products and operating systems, but we assume that all installation types are equally important, so the Client installation type, which has been shown to be defect-free, is also tested. The results show a significant difference: when AC reaches 18 units, PBIE is only 58.02%, because much of the testing effort is wasted on the defect-free installation type. In fact, this partial value-based prioritization is common in practice: testing managers often do prioritize tests, but the way they prioritize is often intuitive and tends to leave some factors out of the prioritization, so this situation can also represent the most common situations in practice. Since it still treats all installation types as equally important, we still consider it a value-neutral situation, to differentiate it from the complete, systematic, comprehensive and integrated value-based prioritization in Situation 1.

Figure 16. Comparison among 3 Situations (PBIE-1, PBIE-2 and PBIE-3 curves, with the point where testing stopped marked)

Table 37 compares the APBIE of the three situations, and it is obvious that value-based testing prioritization is the best in terms of APBIE. The case study at Galorath, Inc. validates that the added value-based prioritization can improve the scenario testing's cost-effectiveness in terms of APBIE.

Table 37. APBIE Comparison
Situation 1 (Value-based): 70.99%
Situation 2 (Inverse Order): 10.08%
Situation 3 (Value-neutral): 32.10%

The PBIE curves of other value-neutral (or partially value-based) situations are expected to lie between those of Situation 1 and Situation 2 in Figure 16, and they are representative of the most common situations in reality. From the comparative analysis, we can reject the

hypothesis H-t1, which means that value-based prioritization can improve the testing cost-effectiveness.

5.4. Lessons Learned

Integrate and leverage the merits of state-of-the-art test prioritization techniques: in this case study, we synthetically incorporated the merits of various test prioritization techniques to maximize the testing cost-effectiveness, i.e. coverage-based and defect-proneness-driven prioritization and, most importantly, the incorporation of business value into the testing prioritization. The value-based testing strategy introduced here is not independent of other prioritization techniques; on the contrary, it is a synthesis of the merits of those techniques, with a focus on bridging the gap between the customers' business or mission value and the testing process.

Think about the trade-offs of automated testing at the same time: from our experience establishing automated testing at Galorath, Inc. in this case study, we can also see that establishing automated testing is a high-risk as well as a high-investment project [Bullock, 2000]. Test automation is itself software development, which can also be expensive and fault-prone and face evolution and maintenance problems. Furthermore, automated testing usually treats every scenario as equally important. However, the combination of value-based test prioritization and automated testing might be a promising strategy and can further improve the testing cost-effectiveness. For example, adopting the value-based test case prioritization strategy might shrink the testing scope by 60%, and the remaining tedious manual testing effort can then be largely replaced by a small initial investment in automated scripts that let the tests run overnight, saving 90% of the human effort; so, by the strategy of

combining value-based test case prioritization and automated testing, the cost is reduced to (1-60%)*(1-90%) = 4%, a factor-of-25 RRL improvement. Still, this is a trade-off question of how much automated testing is enough, based on the savings it brings and the investment needed to establish it. In fact, any testing strategy has its own advantages; what matters most for testing practitioners is a strong sense of how to combine the merits of these testing strategies to continuously improve the testing process.

Teamwork is recommended to determine ratings: the prioritization factor ratings, i.e., the ratings of business importance, risk probability, and testing cost, should not be determined by a single person; this might introduce subjective bias that could make the prioritization misleading. Ratings should be discussed and brainstormed at team meetings with more stakeholders involved, in order to acquire more comprehensive information, resolve disagreements, and negotiate to consensus. For example, if we had not sent out the questionnaire to obtain the frequency of use of each scenario, we would have treated all scenarios as equally important and could not have finished all the testing in the limited time. In the worst case, we would have installed operating system scenarios that were seldom used and tested the macros on them, only to find that there was no need to test them. The same holds for risk probability: if we had not known that the Client installation did not need to be tested because it had seldom failed before and was assumed to be defect-free, a large amount of testing effort would have been spent on this unnecessary testing. So teamwork to discuss and understand the project under test is very important for determining the testing scope and testing order.

Business case analysis is based on project context: from these empirical studies so far, the most difficult, yet most flexible, part is how to determine the business importance for

the testing items via business case analysis. The business case analysis can be implemented with various methods, chosen for their ease of use and their fit to the experimental environment. For example, in this case study of value-based testing scenario prioritization, we use frequency of use (FU) combined with product importance as a variant of business importance for operational scenarios. In the case study of value-based feature prioritization for software testing in Chapter 6, Karl Wiegers' requirement prioritization approach [Wiegers, 1999] is adopted, which considers both the positive benefit of the presence of a feature and the negative impact of its absence. In the case study of value-based test case prioritization in Chapter 7, the classic S-curve production function with segments of investment, high payoff, and diminishing returns [Boehm, 1981] is used to train students in business case analysis of their project features, with the Kano model [Kano] as a reference to complement their analysis for feature business importance ratings. A test case's business importance is then determined by the importance of its corresponding functions, components, or features, and by the test case's usage, i.e., whether or not it tests the core function of the feature. As for the case study of determining the priority of artifacts (system capabilities) in Chapter 4, the business importance is tailored to ratings of their influence/impact on the project's success. What these different business case analyses have in common is that all of them use well-defined, context-based relative business importance ratings.

Additional prioritization effort is a trade-off as well: prioritization can be as light-weight as in this case study or can be more deliberate. Too much effort on prioritization might bring diminishing testing cost-effectiveness. How much is enough depends on the project context and on how easily we can get the information required for prioritization. It should be kept in mind at all times that value-based testing prioritization aims at saving

effort, rather than increasing it. In this case study, the information required for the prioritization comes from expert estimation (project managers, the product manager, and project developers) at little cost, yet it generates high payoffs for the limited testing effort. However, for this method's application to large-scale projects, which might have thousands of test items to be prioritized, there has to be a consensus mechanism to collect all the data. We have started to implement automated support for this method's application to large-scale industrial projects. This automation is designed to support establishing traceability among requirements, code, test cases, and defects, so that business importance ratings for requirements can be reused for test items, and code-change and defect data can be used to predict risk probability. The automation will also support sensitivity analysis for judging the correctness of ratings and for understanding how changes in the ratings can impact the testing order. The automation is intended to generate recommended ratings in order to save effort and to provide reasonable ratings that facilitate value-based testing prioritization.

Chapter 6: Case Study III-Prioritize Software Features to be Functionally Tested

6.1. Background

This case study on prioritizing features for testing was carried out in the system and acceptance testing phase of one of the main releases of an industrial product (named Qone [Qone]) in a Chinese software organization. The release under test added nine features with a total of 32.6 KLOC of Java code. The features are mostly independent amendments or patches to existing modules. The value-based prioritization strategy was applied to prioritize the 9 features to be tested based on their ratings of Business Importance, Quality Risk Probability, and Testing Cost. The features' testing value priorities provide decision support for the testing manager to enact the testing plan and adjust it according to the feedback of quality risk indicators, such as defect counts, defect density, and updated testing cost estimates. Defect data was collected automatically and displayed in real time by this organization's defect reporting and tracking system, providing immediate feedback to adjust the testing priorities for the next testing round.

Case Study Design

The step to determine Business Value

To determine the business importance of each feature, Karl Wiegers' approach [Wiegers, 1999] is applied in this case study. This approach considers both the positive benefit of the presence of a feature and the negative impact of its absence. Each feature is assessed in terms of the benefits it will bring if implemented, as well as the penalty that will be incurred if it is not implemented. The estimates of benefits and penalties are relative, on a scale of 1 to 9. For each feature, the relative benefit and penalty are

summed up and entered in the Total BI (Business Importance) column in Table 38 using the following formula, with the benefit weighted 2 and the penalty weighted 1:

  Total BI_i = 2 x Benefit_i + 1 x Penalty_i

The sum of the Total BI column represents the total BI of delivering all features. To calculate the relative contribution (BI%) of each feature, divide its Total BI by the sum of the Total BI column.

Table 38. Relative Business Importance Calculation (columns: Benefit, Penalty, Total BI, BI%, with weights 2 and 1 for Benefit and Penalty; rows F1-F9 and SUM; the resulting BI% values are plotted in Figure 17)

Figure 17 shows the BI distribution of the 9 features. As we can see, there is an approximate Pareto distribution in which F1 and F2 contribute 22.2% of the features and 59.3% of the total BI.
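To make the weighted calculation above concrete, here is a minimal sketch in Python; the benefit/penalty numbers are placeholders, not the actual Table 38 entries:

    # Wiegers-style relative business importance (illustrative placeholder data)
    W_BENEFIT, W_PENALTY = 2, 1           # weights from Table 38

    features = {                           # hypothetical benefit/penalty ratings (1-9)
        "F1": (9, 7),
        "F2": (8, 8),
        "F3": (3, 2),
    }

    total_bi = {f: W_BENEFIT * b + W_PENALTY * p for f, (b, p) in features.items()}
    grand_total = sum(total_bi.values())

    for f, bi in total_bi.items():
        print(f"{f}: Total BI = {bi}, BI% = {bi / grand_total:.1%}")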

Figure 17. Business Importance Distribution (F1: 30.9%, F2: 28.4%, F3: 6.2%, F4: 6.2%, F5: 3.7%, F6: 6.2%, F7: 9.9%, F8: 4.9%, F9: 3.7%)

The step to determine Risk Probability

The risk analysis was performed before the start of system testing, but was continuously updated during test execution. It aims to calculate the risk probability for each feature. We follow four steps:

Step 1: List all risk factors based on past projects and experience: set up the n risks in the rows and columns of an n*n matrix. In our case study, according to this Chinese organization's risk data from past similar projects, the four top quality risk factors with the highest Risk Exposure are Personnel Proficiency, Size, Complexity, and Design Quality. Defects Proportion and Defects Density are commonly used as hands-on metrics for quality risk identification during the testing process, and together with the top four quality risk factors they serve as the risk factors that determine the feature quality risk in this case study.

Step 2: Determine risk weights according to their degree of impact on software quality: different risk factors have different degrees of influence on software quality under different organizational contexts, and it is more reasonable to assign them different

weights before combining them into a single risk probability number for each feature. The Analytic Hierarchy Process (AHP) [Saaty, 1980], a powerful and flexible multi-criteria decision-making method that has been applied to unstructured problems in a variety of decision-making situations, ranging from simple personal decisions to complex capital-intensive decisions, is used to determine the weight of each risk factor. Based on their understanding of the risk factors and their knowledge and experience of the factors' relative impact on software quality in this organization's context, the testing manager collaborated with the development manager to determine the weights of each quality risk using the AHP method. In this case study, the calculation of the quality risk weights is illustrated in Table 39. The number in each cell represents the pair-wise relative importance: a value of 1, 3, 5, 7, or 9 in row i and column j means that the risk factor in row i is equally, moderately, strongly, very strongly, or extremely strongly more important than the risk factor in column j, respectively. To calculate the weights, each cell is divided by the sum of its column, and the results are then averaged across each row. The final averaged weights are listed in the Weights column of Table 39; the weights sum to 1. If we were able to determine the relative value of all risks precisely, the judgments would be perfectly consistent. For instance, if we determine that Risk1 is much more important than Risk2, Risk2 is somewhat more important than Risk3, and Risk3 is slightly more important than Risk1, an inconsistency has occurred and the accuracy of the result is decreased. The redundancy of the pairwise comparisons makes AHP much less sensitive to judgment errors; it also lets you measure judgment errors by calculating the

consistency index (CI) of the comparison matrix and then calculating the consistency ratio (CR). As a general rule, a CR of 0.10 or less is considered acceptable [Saaty, 1980]. In the case study, we calculated the CR according to the steps in [Saaty, 1980]; the CR is 0.01, which means that our result is acceptable.

Table 39. Risk Factors Weights Calculation (AHP): the pairwise comparison matrix over Personnel Proficiency, Size, Complexity, Design Quality, Defects Proportion, and Defects Density, with the derived weight of each factor in the Weights column.
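The column-normalize-and-average weight calculation and the consistency check described above can be sketched as follows (Python with NumPy); the 3x3 comparison matrix is a made-up example, not the Table 39 matrix:

    import numpy as np

    # Hypothetical 3x3 pairwise comparison matrix (reciprocal by construction)
    A = np.array([[1.0, 3.0, 5.0],
                  [1/3, 1.0, 3.0],
                  [1/5, 1/3, 1.0]])
    n = A.shape[0]

    # Approximate weights: normalize each column, then average across rows
    weights = (A / A.sum(axis=0)).mean(axis=1)

    # Consistency: lambda_max, CI = (lambda_max - n) / (n - 1), CR = CI / RI
    lambda_max = (A @ weights / weights).mean()
    CI = (lambda_max - n) / (n - 1)
    RI = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24}[n]   # Saaty's random index values
    CR = CI / RI

    print(weights, round(CR, 3))   # a CR of 0.10 or less is considered acceptable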

Step 3: Score each risk factor for each feature: the testing manager, in collaboration with the development manager, scores each risk factor for each feature. The estimate is of the degree to which the risk factor is present for that feature: 1 means the factor is not present and 9 means the factor is very strong. A distinction must be made between factor strength and the action to be taken: a 9 indicates factor strength, but does not indicate what should be done about it. Initial Risks are the risk factors used to calculate the risk probability before system testing, and Feedback Risks such as Defects Proportion and Defects Density are risk indicators used during the testing process to monitor and control it.

Risks such as Personnel Proficiency, Complexity, and Design Quality are scored by the development manager based on their understanding of each feature and predefined scoring criteria. The organization has its own defined scoring criteria for each risk rating. For example, for Personnel Proficiency, years of experience with the application, platform, language, and tools serves as a surrogate measure, and the scoring criteria the organization adopts are: 1 - more than 6 years, 3 - more than 3 years, 5 - more than 1 year, 7 - more than 6 months, 9 - less than 2 months; the use of intermediate scores (2, 4, 6, 8) was allowed. More comprehensive measures for Personnel Proficiency could combine the COCOMO II [Boehm et al., 2000] personnel factors, e.g., ACAP (Analyst Capability), PCAP (Programmer Capability), PLEX (Platform Experience), and LTEX (Language and Tool Experience), with other outside factors that might influence Personnel Proficiency, e.g., a reasonable workload and work morale and passion from a psychological point of view. Risks such as Size, Defects Proportion, and Defects Density are scored based on collected data; for example, if a feature's size is 6 KLOC and the largest feature's size is 10 KLOC, the feature's size risk is scored as 9*(6/10), approximately 5.

Step 4: Calculate the risk probability for each feature: for each feature Fi, after each risk factor score is obtained, the following formula is used to combine all the risk factors into the risk probability P_i of Fi:

  P_i = Σ_j ( W_j × R_i,j )

where R_i,j is Fi's score on the jth risk factor and W_j denotes the weight of the jth risk factor.
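A minimal sketch of this Step 4 combination (Python); the weights and scores below are illustrative placeholders, not the Table 39/40 values:

    # Weighted combination of risk-factor scores into one risk number per feature
    weights = {"personnel": 0.10, "size": 0.15, "complexity": 0.15,
               "design": 0.10, "defect_proportion": 0.25, "defect_density": 0.25}

    # Hypothetical scores (1-9 scale); feedback risks are 0 before system testing starts
    scores = {
        "F1": {"personnel": 3, "size": 5, "complexity": 4, "design": 3,
               "defect_proportion": 0, "defect_density": 0},
        "F2": {"personnel": 7, "size": 2, "complexity": 6, "design": 5,
               "defect_proportion": 0, "defect_density": 0},
    }

    def risk_probability(feature_scores, weights):
        """P_i = sum over j of W_j * R_ij, as in Step 4."""
        return sum(weights[j] * feature_scores[j] for j in weights)

    for f, s in scores.items():
        print(f, round(risk_probability(s, weights), 2))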

Table 40 gives the probability contributed by the initial risks for each feature before system testing.

Table 40. Quality Risk Probability Calculation (Before System Testing): for each of F1-F9, the scores on the Initial Risks (Personnel Proficiency, Size, Complexity, Design Quality) and the Feedback Risks (Defects Proportion, Defects Density), the factor weights, and the resulting Probability.

Lessons Learned and Process Implications: From the initial risk data collected, some potential problems were found in this organization:

Potential problem in task breakdown and allocation: Feature F9 has the lowest risk for both Personnel Proficiency and Complexity, which implies that one of the most experienced developers is responsible for the least complex feature. The most complex feature, F4, however, is developed by the least experienced developer. This implies a potential task allocation problem in this organization. Generally, it is highly risky to let

the least experienced staff do the most complex task, and it is also a waste of resources to let the most experienced developer do the least complex task. In the future, the organization should consider a more reasonable and efficient task allocation strategy to mitigate this risk.

Potentially insufficient design capability: in principle, the risk factors should be independent when they are combined to generate a risk probability, which means that the risk factors should not have strong interrelations among them. Based on the data in Table 40, we performed a correlation analysis among the risk factors; no pair of risk factors shows a strong correlation (correlation coefficient > 0.8). It should be noted, however, that the correlation coefficient of 0.76 between Complexity and Design Quality is high, which means that as Complexity becomes an issue, Design Quality also becomes a risky problem. This could imply that the current designers or analysts are inadequate for their work. To mitigate this risk, the project manager should consider recruiting analysts with more requirements, high-level design, and detailed design experience in the future.

Table 41. Correlation among Initial Risk Factors: the pairwise correlation coefficients among Personnel Proficiency, Size, Complexity, and Design Quality (lower-triangular matrix with 1s on the diagonal; the notable value is the 0.76 correlation between Complexity and Design Quality).

From Table 39, we can see that the feedback risk factors, Defects Proportion and Defects Density, received the largest weights when AHP was used to determine the risk factor weights. This is reasonable, because the initial risk factors are mainly used to estimate

the risk probability before system testing starts. Once system testing starts, the testing manager should be more concerned with each feature's actual, evolving quality situation in order to find the features that are the most fault-prone. Defects Proportion and Defects Density can provide this real quality information and feedback during system testing. This is also the reason the probabilities in Table 40 are low: the initial risks are assigned smaller weights, and there are no feedback risk factors before system testing starts.

The step to determine Testing Cost

The test manager estimates the relative cost of testing each feature, again on a scale ranging from a low of 1 to a high of 9. The test manager bases the cost ratings on factors such as the development effort of the feature, the feature's complexity, and its quality risks, as shown in Table 42.

Table 42. Relative Testing Cost Estimation: the Cost rating and Cost% (relative share of the total) for each of F1-F9, with the sum of the Cost column used for normalization; the Cost% values are plotted in Figure 18.

Figure 18. Testing Cost Estimation Distribution (relative testing cost per feature, F1-F9; the individual Cost% values range from 4.8% to 21.4%)

A correlation analysis is done between the 9 features' business importance and their estimated testing cost, as shown in Table 43. The negative correlation indicates that the features that are most costly to test might have less business importance to key customers. Testing the features with more business importance but less cost first will improve the testing efficiency and maximize its ROI in the early stage of the testing phase.

Table 43. Correlation between Business Importance and Testing Cost: the 2x2 correlation matrix between BI and Cost (the off-diagonal coefficient is negative).

The step to determine Testing Priority

Similar to the scenario prioritization, after a feature passes its testing, its probability of failure is reduced to 0, so the testing priority (TP) triggered by RRL is calculated as:

  TP_i = ( BI%_i × Probability_i ) / Cost%_i
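A minimal sketch of this priority calculation and the resulting ordering (Python); the BI%, Probability, and Cost% numbers are placeholders rather than the Table 44 data:

    # Testing priority TP = BI% * Probability / Cost%, sorted in descending order
    features = {
        #        BI%    Probability  Cost%
        "F1": (0.309,  0.30,        0.048),
        "F2": (0.284,  0.25,        0.119),
        "F3": (0.062,  0.40,        0.119),
    }

    priority = {f: bi * p / cost for f, (bi, p, cost) in features.items()}
    testing_order = sorted(priority, key=priority.get, reverse=True)
    print(testing_order, priority)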

The Testing Priorities for the 9 features are shown in Table 44; the resulting testing order is F1, F2, F7, F6, F3, F4, F8, F5, and F9.

Table 44. Value Priority Calculation: the BI%, Probability, Cost%, and resulting Priority for each of F1-F9.

Results

After adopting the value-based prioritization strategy to determine the testing order of the 9 features, the PBIE comparison between the value-based order and its inverse order (the most inefficient one) is shown in Figure 19. The difference in APBIE between the two is 76.9% - 34.1% = 42.8%, which means the value-based testing order improves the cost-effectiveness by 42.8% over the worst case. The PBIE curves of other value-neutral (or partially value-based) situations are expected to lie between these two PBIE curves and are representative of the most common situations in reality; this further rejects hypothesis H-t1.

Figure 19. Comparison between Value-Based and Inverse Order (PBIE versus cumulative testing cost, in percent, for the two orders)

In our case study, the test manager planned to execute 4 rounds of testing. During each round, the test groups focus on the 2-3 features with the highest current priority, and the other features are tested by automated tools. The testing result is as follows: when the first round is over, F1 and F2 satisfy the stop-test criteria; when the second round is over, F3, F6, and F7 satisfy the stop-test criteria; when the third round is over, F4 and F8 satisfy the stop-test criteria; and the last round covers F5 and F9. The comparison between the initially estimated testing cost and the actual testing cost is shown in Figure 20.

Figure 20. Initially Estimated Testing Cost and Actual Testing Cost Comparison (per testing round)

If we regard the testing activity as an investment, its value is realized when features satisfy the stop-test criteria. The accumulated BI earned curve in Figure 22 is like a production function, with higher payoff in the earlier stage but diminishing returns later. From Figure 21 and Figure 22, we can also see that when Round 1 testing was finished, we had earned 59.2% of the BI of all features at a cost of only 19.8% of the whole testing effort, for an ROI of (59.2% - 19.8%) / 19.8%, roughly 2.0. During Round 2, we earned 22.2% BI at a cost of 25.3% of the effort, and the ROI became negative, (22.2% - 25.3%) / 25.3%, roughly -0.12. We can also see that, from Round 1 to Round 4, both the BI earned line and the ROI line are descending. Round 3 and Round 4 together earn only 18.5% BI but cost 54.9% of the effort. This shows that the Round 1 testing is the most cost-effective. Testing the features with higher value priority first is especially useful when the market pressure is very high; in such cases, one could stop testing after Round 1, since the ROI of the subsequent rounds turns negative. However, in some cases, continuing to test may be worthwhile in terms of customer-perceived quality.

Figure 21. BI, Cost and ROI across Testing Rounds (Start, Round 1 - Round 4; series: BI Earned, Cost, Test_ROI)
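A minimal sketch of the per-round ROI bookkeeping above, assuming the usual (benefit - cost) / cost definition of ROI; the BI-earned and cost percentages are the ones quoted in the text:

    # Per-round ROI = (BI earned - cost) / cost, using the percentages quoted above
    rounds = {                  # (BI earned %, cost %)
        "Round 1": (59.2, 19.8),
        "Round 2": (22.2, 25.3),
        "Rounds 3+4": (18.5, 54.9),
    }

    for name, (bi, cost) in rounds.items():
        roi = (bi - cost) / cost
        print(f"{name}: BI={bi}%, cost={cost}%, ROI={roi:.2f}")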

Figure 22. Accumulated BI Earned During Testing Rounds (Start through Round 4)

Consideration of Market Factors

Time to market can strongly influence the effort distribution of software development and project planning. As the testing phase is the phase immediately before software product transition and delivery, it is influenced even more by market pressure [Huang and Boehm, 2006]. Sometimes, under intense market competition, sacrificing some software quality to avoid further market share erosion might be a good organizational strategy. In our case study, we use a simple depreciation function to express the influence of market pressure on Business Importance, with two parameters. Time represents the number of unit time cycles; a unit time cycle might be a year, a month, a week, or even a day. For simplicity, in our case study the unit time cycle is a testing round. Pressure Rate is estimated and provided by market or product managers, with the help of customers. It represents, during a unit time cycle, what percentage

of the initial value of the software will depreciate. The more furious the market competition, the larger the Pressure Rate. The longer the time and the larger the Pressure Rate, the smaller the present BI and the larger the BI loss caused by market erosion. In our case study, because we calculate relative business importance, the initial total BI is 100(%), and when Round n of testing is over, the BI loss caused by market share erosion grows with both the Pressure Rate and the number of elapsed rounds. On the other hand, the earlier the product enters the market, the larger the loss caused by poor quality. Finally, we can find a sweet spot (the minimum) of the combined risk exposure due to both unacceptable software quality and market erosion. We assume three Pressure Rates of 1%, 4%, and 16%, standing for low, medium, and high market pressure respectively, in Figure 23 to Figure 25; these can also be seen as three types of organizational contexts: high finance, commercial, and early start-up [Huang and Boehm, 2006]. When the market pressure is as low as 1%, in Figure 23, the total loss caused by quality and market erosion reaches its lowest point (the sweet spot) at the end of Round 4. When the Pressure Rate is 4%, the lowest point of total loss is at the end of Round 3, in Figure 24, which means we should stop testing and release the product even though F5 and F9 have not reached the stop-test criteria at the end of Round 3; this ensures the minimum loss. When the market pressure rate is as high as 16%, in Figure 25, we should stop testing at the end of Round 1.

Figure 23. BI Loss (Pressure Rate = 1%)

Figure 24. BI Loss (Pressure Rate = 4%)

Figure 25. BI Loss (Pressure Rate = 16%)

Extension of the Testing Priority Value Function: In this case study, we use a multi-objective multiplicative value function to determine the testing priority. An additive value function can also be used to determine the testing priority, as follows:

  V(X_BI, X_C, X_RP) = W_BI × V(X_BI) + W_C × V(X_C) + W_RP × V(X_RP)

V(X_BI), V(X_C), and V(X_RP) are the single-attribute value functions for Business Importance, Cost, and Risk Probability, and W_BI, W_C, and W_RP are their relative weights; V(X_BI, X_C, X_RP) is the multi-objective additive value function for testing priority. The single-attribute value functions of Business Importance and Risk Probability express increasing preference: the larger the Business Importance or Risk Probability, the higher the testing priority, as shown in the left part of Figure 26. The single-attribute value function of Testing Cost expresses decreasing preference: the larger the Cost, the lower the testing priority value, as shown in the right part of Figure 26.

Figure 26. Value Functions for Business Importance and Testing Cost

Extending the multiplicative value function to an additive one also yields similar feature testing priority results [Li, 2009]. Whether the value function is multiplicative or additive, as long as it reasonably reflects the same success-critical stakeholders' (SCSs') win

condition preferences, it is expected to generate similar priority results. In our extension experiment, both dynamic prioritizations made the ROI of the testing investment peak at the early stage of testing, which is especially valuable when the time to market is limited. This extension of the value function is also supported by value-based utility theory.
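A minimal sketch contrasting the two forms (Python); the single-attribute value functions, the weights, and the feature numbers are illustrative assumptions, not the ones used in the case study:

    # Multiplicative vs. additive testing-priority value functions (illustrative)
    features = {          # BI, risk probability, cost, all normalized to (0, 1]
        "F1": (0.31, 0.30, 0.05),
        "F2": (0.28, 0.25, 0.12),
        "F3": (0.06, 0.40, 0.12),
    }
    W_BI, W_RP, W_C = 0.5, 0.3, 0.2   # assumed weights for the additive form

    def multiplicative(bi, rp, cost):
        return bi * rp / cost

    def additive(bi, rp, cost):
        # increasing preference for BI and risk probability, decreasing for cost
        return W_BI * bi + W_RP * rp + W_C * (1 - cost)

    for name, (bi, rp, cost) in features.items():
        print(name, round(multiplicative(bi, rp, cost), 2), round(additive(bi, rp, cost), 2))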

Chapter 7: Case Study IV-Prioritize Test Cases to be Executed

7.1. Background

This case study of prioritizing test cases for execution using the value-based, dependency-aware prioritization strategy was carried out on 18 projects of the USC graduate software engineering course in the 2011 spring and fall semesters. As an extension of the previous work on prioritizing features for testing, this work prioritized test cases at a finer granularity, with added consideration of the dependencies among test cases. In addition, it tailored the Probability of Loss in the Risk Reduction Leverage (RRL) definition to the test case Failure Probability and used it as a trigger to shrink the regression test suite by excluding stable features, given the scarce testing resources. A project named Project Paper Less [USC_577b_Team01, 2011], with 28 test cases, is used as an example to investigate the improvement in testing efficiency. During Fall 2010 CSCI 577a, the Team01 students had already produced good versions of the Operational Concept Description (OCD), System and Software Requirement Description (SSRD), System and Software Architecture Description (SSAD), and Initial Prototype, together with various planning documents such as the Life Cycle Plan (LCP) and Quality Management Plan (QMP). In Spring 2011 CSCI 577b, they developed the Initial Operational Capability while concurrently generating the Test Plan and Cases (TPC); students are trained to write test cases according to the requirements in the SSRD, using Equivalence Partitioning and Boundary Value Testing techniques [Ilene, 2003] to elaborate the test cases. Their test cases in the TPC cover 100% of the requirements in the SSRD, and they had already done some informal unit testing and integration testing before the acceptance testing. They

follow the Value-Based Testing Guideline [USC_577b_VBATG, 2011] to do value-based test case prioritization (TCP), execute their acceptance testing according to the testing order from the prioritization, record their testing results in the Value-Based Testing Procedure and Results (VbTPR), and report the defects discovered to the Bugzilla system [USC_CSSE_Bugzilla], which tracks those defects until closure. In the next section, the value-based TCP steps are introduced within the context of Team01's project.

Case Study Design

The step to do Dependency Analysis

Most features in the SUT are not independent of each other; they typically have precedence or coupling constraints that require some features to be implemented before others, or some to be implemented together [Maurice et al., 2005]. Similarly for test cases, some test cases must be executed and passed before others can be executed, and the failure of some test cases can block others from being executed. Understanding the dependencies among test cases benefits test case prioritization and test planning; the dependencies are also useful information for rating the business importance, failure probability, criticality, and even testing cost that will be introduced in the following sections. Based on the test cases in the TPC [USC_577b_Team01, 2011], testers were asked to generate dependency graphs for their test suites. These could be as simple as Team01's test case dependency tree in Figure 27, or much more complex, for example with a test case node having more than one parent node. In Figure 27, each test case has an associated bracket with two placeholders to be filled in later: one for the Testing Value (= Business Importance * Failure Probability / Testing Cost) and the other for the Criticality. The

following sections introduce in detail how to rate those factors and use them for prioritization.

Figure 27. Dependency Graph with Risk Analysis

The step to determine Business Importance

For testing, the business importance of a test case is mainly determined by the importance or value to the clients of its corresponding functions, components, or features. In addition, because of test case elaboration strategies such as Equivalence Partitioning and Boundary Value Testing, the various test cases for the same feature are designed to test different aspects of the feature and so differ in importance as well. The first step in determining the Business Importance of a test case is to determine the BI of its relevant function/feature. In CSCI 577a, students are educated and trained in how to do business case analysis for a software project and how to rate the relative Business Importance of each function/feature in a software system from the client's point of view, such as the importance of the software, product, component, or feature to his or her organization in terms of its return on investment [Boehm, 1981], as shown in Figure 28. A general mapping between the production function segments and the function/feature BI rating ranges is given in the boxes in Figure 28, and the segments of the production function (investment, high payoff, diminishing returns) are given to students for reference.

Figure 28. Typical production function for software product features [Boehm, 1981] (segment annotations: Investment, BI: VL-N; High-payoff, BI: H-VH; Diminishing returns, BI: VL-N)

Basically, the slope of the curve represents the ROI of the function: the higher the slope, the higher the ROI, and thus the higher the BI of the function. The BI of a function in the Investment segment is usually in the range from Very Low to Normal, since the early Investment segment involves development of infrastructure and architecture that does not directly generate benefits but is necessary for realizing the benefits in the High-payoff and Diminishing-returns segments. For Project Paper Less, the Access Control and User Management features belong to the Investment segment. The main application functions for this project, such as the Case Management and Document Management features, are the core capabilities of the system that the client most wants to have, and they are within the High-payoff segment, so the BI of those functions is in the range from High to Very High. Because of the scope and schedule constraints of the course projects, these projects are usually small-scale; they only require students to develop the core capabilities and seldom have features that belong to the Diminishing-returns segment.

The business importance of a test case is determined on one side by the business importance of its corresponding feature, function, or module, and on the other side by the magnitude of the impact if the test case fails. A guideline for rating a test case's Business Importance that considers both sides is shown in Table 45. The ratings for Business Importance run from VL to VH, with corresponding values from 1 to 5. For example, for the Login function in the Access Control module, the tester used the Equivalence Partitioning test case generation strategy to generate two test cases: one tests whether a valid user can log in, and the other tests whether an invalid user cannot log in. The Access Control feature belongs to the Investment segment, and the tester rated it as of Normal benefit to the client. If the first test case, which tests whether a valid user can log in, fails, the Login function will not run, and this will block other functions, such as Case Management and Document Management, from being tested; so this test case should be rated Normal according to the guideline in Table 45. The other test case, which tests whether an invalid user cannot log in, should be rated Low, because if it fails, the login can still run (a valid user can still log in to test other functionality without blocking it). Its impact magnitude is therefore smaller than that of the first test case, and it deserves the lower rating Low. This is just one example of differentiating the Business Importance of test cases elaborated by Equivalence Partitioning within the same feature; there are various other cases where the relative importance can be differentiated by considering the impact magnitude of a failure as well.

Table 45. Guideline for rating BI for test cases

VH:5 - The test case tests functionality that will bring Very High benefit to the client, and without passing it the functionality won't run.
H:4 - The test case tests functionality that will bring Very High benefit to the client, but the functionality can still run without passing it; or it tests functionality that will bring High benefit to the client, and without passing it the functionality won't run.
N:3 - The test case tests functionality that will bring High benefit to the client, but the functionality can still run without passing it; or it tests functionality that will bring Normal benefit to the client, and without passing it the functionality won't run.
L:2 - The test case tests functionality that will bring Normal benefit to the client, but the functionality can still run without passing it; or it tests functionality that will bring Low benefit to the client, and without passing it the functionality won't run.
VL:1 - The test case tests functionality that will bring Low benefit to the client, but the functionality can still run without passing it; or it tests functionality that will bring Very Low benefit to the client, and without passing it the functionality won't run.

As a result of rating the Business Importance of all 28 test cases for Project Paper Less, the distribution of ratings is shown in Figure 29; High and Very High business importance test cases make up more than half. This makes sense because most of the implemented features are core capabilities, though the project still needs some investment capabilities that are necessary for those core ones.
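The guideline in Table 45 can be expressed compactly as a small lookup; the sketch below (Python) assumes the rating pairs are grouped as laid out above, with benefit ratings VL-VH mapped to 1-5:

    # Test-case BI per the Table 45 guideline: one step below the feature's benefit
    # rating when a failure does not keep the functionality from running.
    BENEFIT = {"VL": 1, "L": 2, "N": 3, "H": 4, "VH": 5}

    def test_case_bi(feature_benefit: str, blocks_functionality: bool) -> int:
        """Return the 1-5 BI value of a test case."""
        base = BENEFIT[feature_benefit]
        return base if blocks_functionality else max(base - 1, 1)

    print(test_case_bi("N", True))    # 3 -> valid-user login test (Normal)
    print(test_case_bi("N", False))   # 2 -> invalid-user login test (Low)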

Figure 29. Test Case BI Distribution of Team01 Project (VH: 4%, H: 50%, N: 14%, L: 21%, VL: 11%)

The step to determine Criticality

Criticality, as mentioned in the step above, represents the impact magnitude of a failure occurrence and the influence it will have on the ongoing test. Combined with the Business Importance from the client's value perspective, it contributes to determining the size of the loss at risk. The empirical guideline for rating it is in Table 46; the ratings run from VL to VH with values from 1 to 5. The underlying rationale is that test cases with high Criticality should be passed as early as possible; otherwise they block other test cases from being executed and might delay the whole testing process if defects are not resolved soon enough. Students are instructed to consult the dependency tree/graph when rating this. For the Project Paper Less test case dependency tree shown in Figure 27, the test cases TC-01-01, TC and TC-04-01 are all rated Very High, because they are on the critical path for executing all other test cases: if they fail, most of the other test cases are blocked from execution, and most of those blocked test cases have high Business Importance.

Most of the other test cases are tree leaves: if they fail, they won't block other test cases from being executed, so their Criticality is rated Very Low.

Table 46. Guideline for rating Criticality for test cases

VH:5 - Blocks most (70%-100%) of the test cases, AND most of those blocked test cases have High Business Importance or above.
H:4 - Blocks most (70%-100%) of the test cases, OR most of those blocked test cases have High Business Importance or above.
N:3 - Blocks some (40%-70%) of the test cases, AND most of those blocked test cases have Normal Business Importance.
L:2 - Blocks a few (0%-40%) of the test cases, OR most of those blocked test cases have Normal Business Importance or below.
VL:1 - Won't block any other test cases.

The step to determine Failure Probability

The primary goal of testing is to reduce the uncertainty about the software product's quality before it is finally delivered to the client. Testing without risk analysis is a waste of resources, and uncertainty and risk analysis are the triggers for selecting the subset of the test suite, in order to focus the testing resources on the riskiest, most fault-prone features. A set of self-check questions about the different aspects or factors that might cause a test case to fail is provided in Table 47 for students' reference when rating a test case's failure probability. Students rated each test case's Failure Probability based on those recommended factors or on others they thought of themselves. The rating levels, with numeric values, are: Never Fail (0), Least Likely to Fail (0.3), Have No Idea (0.5), Most Likely to Fail (0.7), Fail for Sure (1).

Table 47. Self-check questions used for rating Failure Probability

Experience: Did the test case fail before? (People tend to repeat previous mistakes, and so does software; based on previous observations, e.g., unit tests, performance at the CCD, or informal random testing, a test case that failed before tends to fail again.) Is the test case new? (A test case that has not been tested before has a higher probability of failing.)
Change Impact: Does any recent code change (delete/modify/add) have an impact on some features? (If so, the test cases for these features have a higher probability of failing.)
Personnel: Are the people responsible for this feature qualified? (If not, the test cases for this feature tend to fail.)
Complexity: Does the feature involve complex algorithms or I/O functions? (If so, the test cases for this feature have a higher probability of failing.)
Dependencies: Does this test case have many connections (either depending on or depended on by other test cases)? (If so, this test case has a higher probability of failing.)

For Project Paper Less, before the acceptance testing the testers had already done a Core Capability Drive-through (CCD) for the core capabilities developed in the first increment, design and code reviews, unit testing, and informal random testing, so they had already gained information and experience about the health status of the software system they developed. Based on this, they rated the Failure Probability of all 28 test cases. The distribution of the rating levels is shown in Figure 30. Never Fail test cases make up more than half, based on previous experience and observations. Those Never Fail test cases should be deferred to the end of each testing round if resources are still available, or even not executed at all if time and testing resources are limited. In this way, quality risk analysis drives the shrinking of the test case suite so that only the subset of test cases with quality risks is executed.
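A small sketch of this risk-driven suite shrinking (Python); the rating scale is the one above, while the test case names are hypothetical:

    # Shrink the suite: defer or drop test cases rated "Never Fail" (probability 0)
    FAILURE_PROB = {"Never Fail": 0.0, "Least Likely to Fail": 0.3,
                    "Have No Idea": 0.5, "Most Likely to Fail": 0.7, "Fail for Sure": 1.0}

    ratings = {          # hypothetical per-test-case ratings
        "TC-A": "Never Fail",
        "TC-B": "Most Likely to Fail",
        "TC-C": "Have No Idea",
    }

    to_execute = [tc for tc, r in ratings.items() if FAILURE_PROB[r] > 0]
    deferred   = [tc for tc, r in ratings.items() if FAILURE_PROB[r] == 0]
    print("execute:", to_execute, "| defer if time allows:", deferred)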

Figure 30. Failure Probability Distribution of Team01 Project (Never Fail: 15 test cases, 54%; Least Likely to Fail: 6, 21%; Have No Idea: 1, 4%; Most Likely to Fail: 6, 21%; Fail for Sure: 0, 0%)

The step to determine Test Cost

Value-based software engineering considers every activity to be an investment. For test activities, the cost/effort of executing each test case should therefore also be considered in TCP. However, estimating the effort to execute each test case is challenging [Deonandan et al., 2010], [Ferreira et al., 2010]. Some practices simply suggest counting the number of steps needed to execute the test case. To simplify our experiment, students were asked to write test cases at the same level of granularity, so that, as far as possible, every test case has nearly the same number of steps to execute, and the cost of executing each test case is assumed to be the same.

The step for Value-Based Test Case Prioritization

Once testers have rated the factors above for each test case, the Testing Value triggered by RRL is defined as:

  Testing Value = ( Business Importance × Failure Probability ) / Testing Cost

It is obvious from this definition that the Testing Value is proportional to Business Importance and Failure Probability and inversely proportional to Testing Cost. This allows test cases to be prioritized in terms of return on investment (ROI). Students were asked to fill in each test case node with the Testing Value number and the Criticality rating, as shown in Figure 27. Executing the test cases with the highest Testing Value and highest Criticality first is our basic prioritization strategy. However, because of the dependencies among test cases, a common situation is that testers cannot jump directly to the test case with the highest Testing Value without first executing and passing some test cases with lower Testing Value that lie on the critical path to it. For example, in Figure 27, TC has the highest Testing Value (3.5) together with the highest Criticality rating (VH), but testers cannot execute it directly until TC and TC on its critical path have been executed and passed. So the dependency factor should also be incorporated into the value-based TCP algorithm. Some key concepts are introduced below to help in understanding the value-based TCP algorithm.

Passed: all steps in the test case generate the expected outputs, so that the feature works accordingly.

Failed: at least one of the steps in the test case generates an unexpected output, so that the function cannot work, or the failure would certainly block other test cases from being executed (a minor improvement suggestion does not belong to this category).

NA: the test case cannot be executed. Candidate reasons are: the test case depends on another test case that failed; or external factors, such as the testing environment (e.g., the pre-condition could not be satisfied, or the required testing data is not available).

Dependencies Set: a test case's Dependencies Set is the set of test cases that this test case depends on. The Dependencies Set includes all dependent test cases, whether the dependency is direct or indirect.

Ready-to-Test: a status of a test case. A test case is Ready-to-Test only if it has no dependencies or all the test cases in its Dependencies Set have Passed.

Not-Tested-Yet: another status of a test case. A test case is Not-Tested-Yet if it has not been tested so far.

The value-based, dependency-aware Test Case Prioritization algorithm is described below and summarized in Figure 31. It is basically a variant of a greedy algorithm whose goal is to select the Ready-to-Test case with the highest Testing Value and Criticality to test first.

Value First: test the case with the highest Testing Value. If several test cases have the same Testing Value, test the one with the highest Criticality.

Dependency Second: if the test case selected in the first step is not Ready-to-Test, then at least one of the test cases in its Dependencies Set is Not-Tested-Yet. In that situation, prioritize the Not-Tested-Yet test cases in this Dependencies Set according to Value First and test them until all test cases in the Dependencies Set have Passed. The test case with the highest value is then Ready-to-Test.

Update the prioritization: after each round, update the Failure Probability based on the observations from the previous testing rounds.
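A minimal sketch of this greedy, dependency-aware selection (Python); the data structures and the assumption that every executed test case passes are simplifications for illustration, not the course tooling:

    # Value-based, dependency-aware TCP: each test case maps to
    # (testing_value, criticality), plus a map of its direct dependencies.

    def dependencies_set(tc, deps, seen=None):
        """All direct and indirect dependencies of tc (assumes an acyclic graph)."""
        seen = set() if seen is None else seen
        for d in deps[tc]:
            if d not in seen:
                seen.add(d)
                dependencies_set(d, deps, seen)
        return seen

    def prioritize(cases, deps):
        """Greedy Value First / Dependency Second ordering (assumes all tests pass)."""
        passed, order = set(), []
        remaining = set(cases)
        while remaining:
            # Value First: highest Testing Value, ties broken by Criticality
            target = max(remaining, key=lambda t: (cases[t][0], cases[t][1]))
            # Dependency Second: clear the target's not-yet-passed dependencies first
            pending = dependencies_set(target, deps) - passed
            while pending:
                nxt = max(pending, key=lambda t: (cases[t][0], cases[t][1]))
                blockers = dependencies_set(nxt, deps) - passed
                while blockers:            # walk down until a Ready-to-Test case is found
                    nxt = max(blockers, key=lambda t: (cases[t][0], cases[t][1]))
                    blockers = dependencies_set(nxt, deps) - passed
                order.append(nxt)
                passed.add(nxt)
                remaining.discard(nxt)
                pending.discard(nxt)
            order.append(target)
            passed.add(target)
            remaining.discard(target)
        return order

    # Hypothetical example: values are (testing_value, criticality 1-5)
    cases = {"TC-A": (1.0, 5), "TC-B": (3.5, 5), "TC-C": (0.5, 1), "TC-D": (2.0, 1)}
    deps  = {"TC-A": set(), "TC-B": {"TC-A"}, "TC-C": {"TC-A"}, "TC-D": {"TC-B"}}
    print(prioritize(cases, deps))   # ['TC-A', 'TC-B', 'TC-D', 'TC-C']

Because the selection is greedy, it does not guarantee a globally optimal order, but it follows the Value First and Dependency Second rules described above.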

Figure 31. In-Process Value-Based TCP Algorithm (flowchart of the Value First / Dependency Second selection, including the handling of Passed, Failed, and NA test cases and the exclusion of failed test cases and the NA test cases that depend on them)

For Project Paper Less, the 15 Never Fail test cases are excluded from the subset selected for testing, as shaded in the dependency tree in Figure 27. It is not necessary to test those test cases deliberately if testing effort or resources are limited, though it is fine to test them at the end of the round if time is still available. According to the value-based TCP algorithm, the testing order for the remaining test cases is: TC-04-01, TC-04-02, TC-04-03, TC-05-10, TC-18-01, TC-12-01, TC-11-01, TC-13-01, TC-02-01, TC-14-01, TC-03-04, TC-02-02, TC. However, the testers still need to walk through TC and TC to reach TC-04-01; walking through costs much less than deliberate testing, and the effort for it can be neglected.

Results

One Example Project Results

Average Percentage of Business Importance Earned (APBIE) is used to measure how quickly the SUT's value is realized: the higher it is, the more efficient the test.

For the above test case prioritization for Project Paper Less, the BI, FP, and Criticality ratings can be found in [USC_577b_Team01, 2011]. For the whole set T of 28 test cases, we get TBI = 88. At the start of the testing round, 15 test cases were rated Never Fail, with no need to test them in this round; they make up the set T - T'. In total they carry 45 units of business importance, which means IBIE = 45 and PBIE_0 = 45/88 = 51.1%. For the remaining 13 prioritized test cases, executed in order from the set T': PBIE_1 = (45+5)/88 = 56.8% when the first test case is passed, PBIE_2 = (45+5+4)/88 = 61.4% when the second is passed, and so on up to PBIE_13 = 88/88 = 100% when the last test case is passed and all 88 units of business importance have been earned. The business importance is earned quickly at the beginning and more slowly toward the end, as shown in Figure 32. The APBIE = (56.8% + 61.4% + ... + 100%)/13 = 81.9%.

Figure 32. PBIE curve according to Value-Based TCP (APBIE = 81.9%); the successive PBIE values are 56.8%, 61.4%, 65.9%, 70.5%, 75.0%, 79.5%, 84.1%, 88.6%, 92.0%, 95.5%, 96.6%, 98.9%, and 100.0%.
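A minimal sketch of the PBIE/APBIE bookkeeping above (Python); the 13 per-test-case BI increments are read off the PBIE steps reported in Figure 32:

    # PBIE / APBIE bookkeeping for the Project Paper Less example
    TBI = 88                      # total business importance of all 28 test cases
    IBIE = 45                     # BI of the 15 "Never Fail" cases, counted up front
    # BI increments of the 13 executed test cases, read off the PBIE steps in Figure 32
    increments = [5, 4, 4, 4, 4, 4, 4, 4, 3, 3, 1, 2, 1]

    earned, pbie = IBIE, []
    for bi in increments:
        earned += bi
        pbie.append(earned / TBI)

    apbie = sum(pbie) / len(pbie)
    print([f"{p:.1%}" for p in pbie])
    print(f"APBIE = {apbie:.1%}")     # ~81.9%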

As is evident above, risk analysis of the Failure Probability of test cases can help select a subset of the test suite so as to focus effort on the riskiest test cases and save testing cost and effort. However, the risk analysis should be based on previous hands-on experience and observations about the quality of the SUT. If testers have no idea about the health status of the SUT before testing, as happens in practice with, for example, third-party testing or outsourced testing, the Testing Value should depend only on Business Importance before their first test, assuming the test cost is the same for each test case, as in the example dependency tree in Figure 27. In this case all test cases should be prioritized, and according to the value-based TCP algorithm, the test order for the whole test suite without risk analysis is: TC-01-01, TC-03-01, TC-04-01, TC-05-01, TC-04-02, TC-04-03, TC-05-02, TC-05-03, TC-05-05, TC-05-07, TC-05-08, TC-05-10, TC-12-01, TC-18-01, TC-11-01, TC-13-01, TC-19-01, TC-02-01, TC-14-01, TC, TC-02-02, TC-15-01, TC-16-01, TC-16-02, TC-16-03, TC-03-02, TC-03-03, TC. This testing order's PBIE is displayed as the square curve in Figure 33, compared with a commonly used value-neutral test order shown as the diamond curve, which follows the test case ID numbers, i.e., a breadth-first search (BFS) of the dependency tree. It is obvious that value-based TCP earns business importance more quickly than the value-neutral order. The APBIE for value-based TCP is 52%, higher than the value-neutral order's 46%, which rejects hypothesis H-t1. This improvement would be more significant if the business importance numeric values were not on a linear range from 1 to 5 but on an exponential range from 2^1 to 2^5.

Figure 33. PBIE Comparison, without risk analysis, between Value-Based and Value-Neutral TCP (APBIE_value_based = 52%, APBIE_value_neutral = 46%); PBIE is plotted against the test case order for the two curves.

It should also be noted that the 21.9% difference (81.9% - 60%) with and without Failure Probability analysis is contributed by the risk analysis, which selects a sub-suite of test cases to further improve the test efficiency. So value-based TCP can improve testing cost-effectiveness by selecting and prioritizing test cases so that Business Importance is earned as early as possible, which is especially useful when the testing schedule is tight and testing resources are limited. Value-based TCP enables early execution of test cases with high business importance and criticality; the failure of such test cases leads to defects being reported to the responsible developers, and the developers then arrange time to prioritize and fix the defects efficiently according to their severity and priority. In fact, a test case's business importance and criticality determine the severity and priority of a defect when a failure occurs, as shown by the mapping in Table 48. Basically, if a test case with Very High business importance fails, the corresponding feature/function that brings the highest benefit to customers cannot work, causing a large loss of customer benefit,

and for this reason the relevant defect's severity should be rated Critical; if a test case with Very High criticality fails, it blocks most of the other test cases with high business importance from being executed, so the relevant defect should be Resolve Immediately in order not to delay the whole testing process.

Table 48. Mapping Test Case BI & Criticality to Defect Severity & Priority

BI <-> Severity (Value-Based TCP BI rating -> Defect Severity in Bugzilla):
  VH -> Critical; H -> Major; N -> Normal; L -> Minor; VL -> Trivial, Enhancement
Criticality <-> Priority (Value-Based TCP Criticality rating -> Defect Priority in Bugzilla):
  VH -> Resolve Immediately; H -> Resolve Immediately; N -> Normal Queue; L -> Not Urgent; VL -> Low Priority, Resolve Later

So if testers follow value-based TCP to select and prioritize test cases, it will directly lead to early detection of high-severity, high-priority defects, for the reasons above, if such defects exist. For Project Paper Less, after the first round of acceptance testing, 4 defects were reported to Bugzilla; their severity, priority, and corresponding test cases, with business importance and criticality, are shown in Table 49. From the ascending defect ID sequence (an earlier defect report gets a lower defect ID) and the corresponding Test Case IDs, it is evident that the value-based prioritization enabled the testers to detect high-severity defects as early as possible, although there were some mismatches between the test case Criticality ratings and the defect Priority ratings. This is mainly because we did not instruct students to report defects according to the mapping in Table 48, and Bugzilla has

a default Priority of Normal Queue, so students might have felt there was no need to change it, or might have assumed as a matter of common sense that high-severity defects should be Resolved Immediately. Even so, this in turn provides evidence that value-based TCP enables testers to detect high-severity faults early if such faults exist. From the observations of the defect reporting in Bugzilla for this project, defects with higher Priority and Severity were reported earlier and resolved earlier. This rejects hypothesis H-t2.

Table 49. Relations between Reported Defects and Test Cases (columns: Defect ID in Bugzilla, Severity, Priority, Test Case ID, BI, FP, Criticality)
  #4444: Critical, Resolve Immediately, TC, BI = VH, FP = 0.7, Criticality = VH
  #4445: Major, Normal Queue, TC, BI = H, FP = 0.7, Criticality = VL
  #4460: Major, Normal Queue, TC, BI = H, FP = 0.7, Criticality = VL
  #4461: Major, Resolve Immediately, TC, BI = H, FP = 0.7, Criticality = VL

All Team Results: After all teams had executed the acceptance testing, with several follow-on rounds of regression testing, using the value-based TCP technique, a survey with several open questions was sent to and answered by the primary testers. The questions mainly concerned their impressions of and feedback on applying value-based TCP for the acceptance testing, the problems they encountered, and suggestions for improvement. Some representative responses are shown below:

137 I prioritized test cases mainly based on the sequence of the system work flow, which is performing test cases with lower dependencies at first before using value-based testing. I like the value-based process because it can save time by letting me focus on more valuable test cases or risky ones. Therefore, it improves testing efficiency A Tool for Faciliating Test Case Prioritization: In the upper example case study, a semi-automatic spreadsheet was developed to support its application on USC graduate software engineering course projects in 2011 spring semester. In order to further facilitate and automate its prioritization to save effort and minimize human errors, and support its application on large-scale projects which might have thousands of test cases to be prioritized, there indeed has to be a consensus mechanism to collect all the required rating data. We implemented an automated and integrated tool to support this method based on an open source, built on PHP+MySQL+Apache platform, widely-used test case management toolkit TestLink. We customized this system to incorporate the value-based dependency-aware test case prioritization technique and is available at [USC_CSSE_TestLink], and used for USC graduate software engineering course projects. Figure 34 illustrates an example of the test case in the customized TestLink. 120

Figure 34. An Example of a Customized Test Case in TestLink

Basically, the tool supports the following:

Rating Business Importance, Failure Probability, and Test Cost by selecting the ratings from drop-down lists, as shown in Figure 34. It currently supports 5-level ratings for each factor (Very Low, Low, Normal, High, and Very High) with default numeric values from 1 to 5, and the Testing Value in terms of RRL for each test case is calculated automatically.

Managing test case dependencies by entering the other test cases that a test case directly depends on, as shown in the Dependent Test Case text field in Figure 34; the dependencies are stored in the database for later prioritization.

Prioritizing test cases according to the value-based, dependency-aware prioritization algorithm in Chapter 7 to generate a planned value-based testing order, as illustrated in Figure 35, in order to help testers plan their testing more cost-efficiently. A value-neutral testing order, which only handles the dependencies among test cases without considering the RRL of each test case, is also generated for comparison.

Displaying the PBIE curves for both the value-based and the value-neutral testing orders visually, and showing the APBIE for both orders at the bottom of the chart in Figure 35.

Figure 35. A Tool for Facilitating Value-Based Test Case Prioritization in TestLink


More information

Project Management Guidelines

Project Management Guidelines Project Management Guidelines 1. INTRODUCTION. This Appendix (Project Management Guidelines) sets forth the detailed Project Management Guidelines. 2. PROJECT MANAGEMENT PLAN POLICY AND GUIDELINES OVERVIEW.

More information

Web Application Regression Testing: A Session Based Test Case Prioritization Approach

Web Application Regression Testing: A Session Based Test Case Prioritization Approach Web Application Regression Testing: A Session Based Test Case Prioritization Approach Mojtaba Raeisi Nejad Dobuneh 1, Dayang Norhayati Abang Jawawi 2, Mohammad V. Malakooti 3 Faculty and Head of Department

More information

Software Engineering Graduate Project Effort Analysis Report

Software Engineering Graduate Project Effort Analysis Report Software Engineering Graduate Project Effort Analysis Report Zhihao Chen Center for Software Engineering, University of Southern California, Los Angeles 90089 California, USA {zhihaoch}@cse.usc.edu Abstract:

More information

TDWI strives to provide course books that are content-rich and that serve as useful reference documents after a class has ended.

TDWI strives to provide course books that are content-rich and that serve as useful reference documents after a class has ended. Previews of TDWI course books are provided as an opportunity to see the quality of our material and help you to select the courses that best fit your needs. The previews can not be printed. TDWI strives

More information

Software Development Life Cycle (SDLC)

Software Development Life Cycle (SDLC) Software Development Life Cycle (SDLC) Supriyo Bhattacharjee MOF Capability Maturity Model (CMM) A bench-mark for measuring the maturity of an organization s software process CMM defines 5 levels of process

More information

Identification and Analysis of Combined Quality Assurance Approaches

Identification and Analysis of Combined Quality Assurance Approaches Master Thesis Software Engineering Thesis no: MSE-2010:33 November 2010 Identification and Analysis of Combined Quality Assurance Approaches Vi Tran Ngoc Nha School of Computing Blekinge Institute of Technology

More information

Optimal Replacement of Underground Distribution Cables

Optimal Replacement of Underground Distribution Cables 1 Optimal Replacement of Underground Distribution Cables Jeremy A. Bloom, Member, IEEE, Charles Feinstein, and Peter Morris Abstract This paper presents a general decision model that enables utilities

More information

(Refer Slide Time: 01:52)

(Refer Slide Time: 01:52) Software Engineering Prof. N. L. Sarda Computer Science & Engineering Indian Institute of Technology, Bombay Lecture - 2 Introduction to Software Engineering Challenges, Process Models etc (Part 2) This

More information

An Overview of Quality Assurance Practices in Agile Methodologies

An Overview of Quality Assurance Practices in Agile Methodologies T-76.650 SEMINAR IN SOFTWARE ENGINEERING, SPRING 2004 1 An Overview of Quality Assurance Practices in Agile Methodologies Olli P. Timperi Abstract The focus of literature and debates of agile methodologies

More information

STANDARD. Risk Assessment. Supply Chain Risk Management: A Compilation of Best Practices

STANDARD. Risk Assessment. Supply Chain Risk Management: A Compilation of Best Practices A S I S I N T E R N A T I O N A L Supply Chain Risk Management: Risk Assessment A Compilation of Best Practices ANSI/ASIS/RIMS SCRM.1-2014 RA.1-2015 STANDARD The worldwide leader in security standards

More information

SOFTWARE CONFIGURATION MANAGEMENT GUIDEBOOK

SOFTWARE CONFIGURATION MANAGEMENT GUIDEBOOK Office of Safety and Mission Assurance NASA-GB-9503 SOFTWARE CONFIGURATION MANAGEMENT GUIDEBOOK AUGUST 1995 National Aeronautics and Space Administration Washington, D.C. 20546 PREFACE The growth in cost

More information

Space project management

Space project management ECSS-M-ST-80C Space project management Risk management ECSS Secretariat ESA-ESTEC Requirements & Standards Division Noordwijk, The Netherlands Foreword This Standard is one of the series of ECSS Standards

More information

Agile Software Engineering, a proposed extension for in-house software development

Agile Software Engineering, a proposed extension for in-house software development Journal of Information & Communication Technology Vol. 5, No. 2, (Fall 2011) 61-73 Agile Software Engineering, a proposed extension for in-house software development Muhammad Misbahuddin * Institute of

More information

Quality Assurance - Karthik

Quality Assurance - Karthik Prevention is better than cure Quality Assurance - Karthik This maxim perfectly explains the difference between quality assurance and quality control. Quality Assurance is a set of processes that needs

More information

The Role of Controlled Experiments in Software Engineering Research

The Role of Controlled Experiments in Software Engineering Research The Role of Controlled Experiments in Software Engineering Research Victor R. Basili 1 The Experimental Discipline in Software Engineering Empirical studies play an important role in the evolution of the

More information

Video Analytics. Extracting Value from Video Data

Video Analytics. Extracting Value from Video Data Video Analytics Extracting Value from Video Data By Sam Kornstein, Rishi Modha and David Huang Evolving viewer consumption preferences, driven by new devices and services, have led to a shift in content

More information

Fundamentals of Measurements

Fundamentals of Measurements Objective Software Project Measurements Slide 1 Fundamentals of Measurements Educational Objective: To review the fundamentals of software measurement, to illustrate that measurement plays a central role

More information

Implementing Portfolio Management: Integrating Process, People and Tools

Implementing Portfolio Management: Integrating Process, People and Tools AAPG Annual Meeting March 10-13, 2002 Houston, Texas Implementing Portfolio Management: Integrating Process, People and Howell, John III, Portfolio Decisions, Inc., Houston, TX: Warren, Lillian H., Portfolio

More information

The Ambiguity Review Process. Richard Bender Bender RBT Inc. 17 Cardinale Lane Queensbury, NY 12804 518-743-8755 rbender@benderrbt.

The Ambiguity Review Process. Richard Bender Bender RBT Inc. 17 Cardinale Lane Queensbury, NY 12804 518-743-8755 rbender@benderrbt. The Ambiguity Review Process Richard Bender Bender RBT Inc. 17 Cardinale Lane Queensbury, NY 12804 518-743-8755 rbender@benderrbt.com The Ambiguity Review Process Purpose: An Ambiguity Review improves

More information

Software Engineering. Introduction. Software Costs. Software is Expensive [Boehm] ... Columbus set sail for India. He ended up in the Bahamas...

Software Engineering. Introduction. Software Costs. Software is Expensive [Boehm] ... Columbus set sail for India. He ended up in the Bahamas... Software Engineering Introduction... Columbus set sail for India. He ended up in the Bahamas... The economies of ALL developed nations are dependent on software More and more systems are software controlled

More information

PASTA Abstract. Process for Attack S imulation & Threat Assessment Abstract. VerSprite, LLC Copyright 2013

PASTA Abstract. Process for Attack S imulation & Threat Assessment Abstract. VerSprite, LLC Copyright 2013 2013 PASTA Abstract Process for Attack S imulation & Threat Assessment Abstract VerSprite, LLC Copyright 2013 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

More information

Contents. Introduction and System Engineering 1. Introduction 2. Software Process and Methodology 16. System Engineering 53

Contents. Introduction and System Engineering 1. Introduction 2. Software Process and Methodology 16. System Engineering 53 Preface xvi Part I Introduction and System Engineering 1 Chapter 1 Introduction 2 1.1 What Is Software Engineering? 2 1.2 Why Software Engineering? 3 1.3 Software Life-Cycle Activities 4 1.3.1 Software

More information

PMI Risk Management Professional (PMI-RMP ) - Practice Standard and Certification Overview

PMI Risk Management Professional (PMI-RMP ) - Practice Standard and Certification Overview PMI Risk Management Professional (PMI-RMP ) - Practice Standard and Certification Overview Sante Torino PMI-RMP, IPMA Level B Head of Risk Management Major Programmes, Selex ES / Land&Naval Systems Division

More information

System Development Life Cycle Guide

System Development Life Cycle Guide TEXAS DEPARTMENT OF INFORMATION RESOURCES System Development Life Cycle Guide Version 1.1 30 MAY 2008 Version History This and other Framework Extension tools are available on Framework Web site. Release

More information

The Battle for the Right Features or: How to Improve Product Release Decisions? 1

The Battle for the Right Features or: How to Improve Product Release Decisions? 1 The Battle for the Right Features or: How to Improve Product Release Decisions? 1 Guenther Ruhe Expert Decisions Inc. ruhe@expertdecisions.com Abstract: A release is a major (new or upgraded) version of

More information

Dynamic Modeling for Project Management

Dynamic Modeling for Project Management Dynamic Modeling for Project Management Dan Houston The Aerospace Corporation 18 May 2011 The Aerospace Corporation 2011 1 Agenda Defining characteristics of current large product development projects

More information

Benefits of Test Automation for Agile Testing

Benefits of Test Automation for Agile Testing Benefits of Test Automation for Agile Testing Manu GV 1, Namratha M 2, Pradeep 3 1 Technical Lead-Testing Calsoft Labs, Bangalore, India 2 Assistant Professor, BMSCE, Bangalore, India 3 Software Engineer,

More information

Cost Estimation Strategies COST ESTIMATION GUIDELINES

Cost Estimation Strategies COST ESTIMATION GUIDELINES Cost Estimation Strategies Algorithmic models (Rayleigh curve Cost in week t = K a t exp(-a t 2 ) Expert judgment (9 step model presented later) Analogy (Use similar systems) Parkinson (Work expands to

More information

Sound Transit Internal Audit Report - No. 2014-3

Sound Transit Internal Audit Report - No. 2014-3 Sound Transit Internal Audit Report - No. 2014-3 IT Project Management Report Date: Dec. 26, 2014 Table of Contents Page Background 2 Audit Approach and Methodology 2 Summary of Results 4 Findings & Management

More information

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material,

More information

Applying Lean on Agile Scrum Development Methodology

Applying Lean on Agile Scrum Development Methodology ISSN:2320-0790 Applying Lean on Agile Scrum Development Methodology SurendRaj Dharmapal, Dr. K. Thirunadana Sikamani Department of Computer Science, St. Peter University St. Peter s College of Engineering

More information

Software Engineering Introduction & Background. Complaints. General Problems. Department of Computer Science Kent State University

Software Engineering Introduction & Background. Complaints. General Problems. Department of Computer Science Kent State University Software Engineering Introduction & Background Department of Computer Science Kent State University Complaints Software production is often done by amateurs Software development is done by tinkering or

More information

Quality Management. Lecture 12 Software quality management

Quality Management. Lecture 12 Software quality management Quality Management Lecture 12 Software quality management doc.dr.sc. Marko Jurčević prof.dr.sc. Roman Malarić University of Zagreb Faculty of Electrical Engineering and Computing Department of Fundamentals

More information

The Software Process. The Unified Process (Cont.) The Unified Process (Cont.)

The Software Process. The Unified Process (Cont.) The Unified Process (Cont.) The Software Process Xiaojun Qi 1 The Unified Process Until recently, three of the most successful object-oriented methodologies were Booch smethod Jacobson s Objectory Rumbaugh s OMT (Object Modeling

More information

Introduction to Macroscope. Version 5.0. April 2012

Introduction to Macroscope. Version 5.0. April 2012 Version 5.0 April 2012 Macroscope is a registered trademark of Fujitsu Consulting (Canada) Inc. 2012, Fujitsu Consulting (Canada) Inc. OWNERSHIP NOTICE This document is proprietary to Fujitsu Consulting

More information

Measuring ROI of Agile Transformation

Measuring ROI of Agile Transformation Measuring ROI of Agile Transformation Title of the Paper: Measuring Return on Investment (ROI) of Agile Transformation Theme: Strategic & Innovative Practices Portfolio, Programs & Project (PPP) Management

More information

Usability metrics for software components

Usability metrics for software components Usability metrics for software components Manuel F. Bertoa and Antonio Vallecillo Dpto. Lenguajes y Ciencias de la Computación. Universidad de Málaga. {bertoa,av}@lcc.uma.es Abstract. The need to select

More information

Some Critical Success Factors for Industrial/Academic Collaboration in Empirical Software Engineering

Some Critical Success Factors for Industrial/Academic Collaboration in Empirical Software Engineering Some Critical Success Factors for Industrial/Academic Collaboration in Empirical Software Engineering Barry Boehm, USC (in collaboration with Vic Basili) EASE Project Workshop November 7, 2003 11/7/03

More information

ISO, CMMI and PMBOK Risk Management: a Comparative Analysis

ISO, CMMI and PMBOK Risk Management: a Comparative Analysis ISO, CMMI and PMBOK Risk Management: a Comparative Analysis Cristine Martins Gomes de Gusmão Federal University of Pernambuco / Informatics Center Hermano Perrelli de Moura Federal University of Pernambuco

More information

Planning a Successful Visual Basic 6.0 to.net Migration: 8 Proven Tips

Planning a Successful Visual Basic 6.0 to.net Migration: 8 Proven Tips Planning a Successful Visual Basic 6.0 to.net Migration: 8 Proven Tips Jose A. Aguilar January 2009 Introduction Companies currently using Visual Basic 6.0 for application development are faced with the

More information

In the launch of this series, Information Security Management

In the launch of this series, Information Security Management Information Security Management Programs: Operational Assessments Lessons Learned and Best Practices Revealed JUSTIN SOMAINI AND ALAN HAZLETON As the authors explain, a comprehensive assessment process

More information

EST.03. An Introduction to Parametric Estimating

EST.03. An Introduction to Parametric Estimating EST.03 An Introduction to Parametric Estimating Mr. Larry R. Dysert, CCC A ACE International describes cost estimating as the predictive process used to quantify, cost, and price the resources required

More information

D6.1: Service management tools implementation and maturity baseline assessment framework

D6.1: Service management tools implementation and maturity baseline assessment framework D6.1: Service management tools implementation and maturity baseline assessment framework Deliverable Document ID Status Version Author(s) Due FedSM- D6.1 Final 1.1 Tomasz Szepieniec, All M10 (31 June 2013)

More information

ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION Francine Forney, Senior Management Consultant, Fuel Consulting, LLC May 2013

ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION Francine Forney, Senior Management Consultant, Fuel Consulting, LLC May 2013 ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION, Fuel Consulting, LLC May 2013 DATA AND ANALYSIS INTERACTION Understanding the content, accuracy, source, and completeness of data is critical to the

More information

Training for IT project managers and team leads

Training for IT project managers and team leads You will use on Monday what we will teach you on Friday! Training for IT project managers and team leads Innopolis University offers advanced training for top managers and senior executives, technical

More information

Project Management Office Charter

Project Management Office Charter Old Dominion University Office of Computing and Communication Services Project Management Office Charter Version: 1.0 Last Update: February 18, 2010 Created By: Anthony Fox, PMP OCCS Project Management

More information

Introduction to Systems Analysis and Design

Introduction to Systems Analysis and Design Introduction to Systems Analysis and Design What is a System? A system is a set of interrelated components that function together to achieve a common goal. The components of a system are called subsystems.

More information

Establishing and Maintaining Top to Bottom Transparency Using the Meta-Scrum

Establishing and Maintaining Top to Bottom Transparency Using the Meta-Scrum ARTICLE Establishing and Maintaining Top to Bottom Transparency Using the Meta-Scrum by Brent Barton Agile Journal Oct. 6, 2007 Agile processes and practices have gained enough attention that both IT businesses

More information

A Comparison between Five Models of Software Engineering

A Comparison between Five Models of Software Engineering International Journal of Research in Information Technology (IJRIT) www.ijrit.com ISSN 2001-5569 A Comparison between Five Models of Software Engineering Surbhi Gupta, Vikrant Dewan CSE, Dronacharya College

More information

Orthogonal Defect Classification in Agile Development

Orthogonal Defect Classification in Agile Development Orthogonal Defect Classification in Agile Development Monika Jagia, IBM Software Group India, monika.jagia@in.ibm.com Seema Meena, IBM Software Group India, seemeena@in.ibm.com 2008 IBM Corporation Copyright

More information

C. Wohlin, "Managing Software Quality through Incremental Development and Certification", In Building Quality into Software, pp. 187-202, edited by

C. Wohlin, Managing Software Quality through Incremental Development and Certification, In Building Quality into Software, pp. 187-202, edited by C. Wohlin, "Managing Software Quality through Incremental Development and Certification", In Building Quality into Software, pp. 187-202, edited by M. Ross, C. A. Brebbia, G. Staples and J. Stapleton,

More information

A Software Engineering Approach For GIS Developing

A Software Engineering Approach For GIS Developing A Software Engineering Approach For GIS Developing Wu Sheng Wang Jiayao (Surveying and Mapping Institute of PLA,Zhengzhou 450052) Abstract This Paper introduced an object-oriented software engineering

More information

SWEBOK Certification Program. Software Engineering Management

SWEBOK Certification Program. Software Engineering Management SWEBOK Certification Program Software Engineering Management Copyright Statement Copyright 2011. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted

More information

CUSTOMER RELATIONSHIP MANAGEMENT AND ITS INFLUENCE ON CUSTOMER LOYALTY AT LIBERTY LIFE IN SOUTH AFRICA. Leon du Plessis MINOR DISSERTATION

CUSTOMER RELATIONSHIP MANAGEMENT AND ITS INFLUENCE ON CUSTOMER LOYALTY AT LIBERTY LIFE IN SOUTH AFRICA. Leon du Plessis MINOR DISSERTATION CUSTOMER RELATIONSHIP MANAGEMENT AND ITS INFLUENCE ON CUSTOMER LOYALTY AT LIBERTY LIFE IN SOUTH AFRICA by Leon du Plessis MINOR DISSERTATION Submitted in partial fulfilment of the requirements for the

More information

Software Metrics. Lord Kelvin, a physicist. George Miller, a psychologist

Software Metrics. Lord Kelvin, a physicist. George Miller, a psychologist Software Metrics 1. Lord Kelvin, a physicist 2. George Miller, a psychologist Software Metrics Product vs. process Most metrics are indirect: No way to measure property directly or Final product does not

More information

Transitioning Your Software Process To Agile Jeffery Payne Chief Executive Officer Coveros, Inc. jeff.payne@coveros.com www.coveros.

Transitioning Your Software Process To Agile Jeffery Payne Chief Executive Officer Coveros, Inc. jeff.payne@coveros.com www.coveros. Transitioning Your Software Process To Agile Jeffery Payne Chief Executive Officer Coveros, Inc. jeff.payne@coveros.com www.coveros.com 1 About Coveros Coveros helps organizations accelerate the delivery

More information

STSG Methodologies and Support Structure

STSG Methodologies and Support Structure STSG Methodologies and Support Structure STSG Application Life Cycle Management STSG utilizes comprehensive lifecycle tools that are fully integrated and provide capabilities for most of the roles in its

More information

The ROI of Systems Engineering: Some Quantitative Results

The ROI of Systems Engineering: Some Quantitative Results The ROI of Systems Engineering: Some Quantitative Results Barry Boehm Center for Systems and Software Engineering University of Southern California boehm@usc.edu Ricardo Valerdi Lean Aerospace Initiative,

More information

What is a life cycle model?

What is a life cycle model? What is a life cycle model? Framework under which a software product is going to be developed. Defines the phases that the product under development will go through. Identifies activities involved in each

More information

IT Operations Management: A Service Delivery Primer

IT Operations Management: A Service Delivery Primer IT Operations Management: A Service Delivery Primer Agile Service Delivery Creates Business Value Today, IT has to innovate at an ever- increasing pace to meet accelerating business demands. Rapid service

More information

Standard for Software Component Testing

Standard for Software Component Testing Standard for Software Component Testing Working Draft 3.4 Date: 27 April 2001 produced by the British Computer Society Specialist Interest Group in Software Testing (BCS SIGIST) Copyright Notice This document

More information

Design Specification for IEEE Std 1471 Recommended Practice for Architectural Description IEEE Architecture Working Group 0 Motivation

Design Specification for IEEE Std 1471 Recommended Practice for Architectural Description IEEE Architecture Working Group 0 Motivation Design Specification for IEEE Std 1471 Recommended Practice for Architectural Description IEEE Architecture Working Group 0 Motivation Despite significant efforts to improve engineering practices and technologies,

More information

Comparison of Request Admission Based Performance Isolation Approaches in Multi-tenant SaaS Applications

Comparison of Request Admission Based Performance Isolation Approaches in Multi-tenant SaaS Applications Comparison of Request Admission Based Performance Isolation Approaches in Multi-tenant SaaS Applications Rouven Kreb 1 and Manuel Loesch 2 1 SAP AG, Walldorf, Germany 2 FZI Research Center for Information

More information

Creating Business Value with Mature QA Practices

Creating Business Value with Mature QA Practices perspective Creating Business Value with Mature QA Practices Abstract The IT industry across the globe has rapidly evolved in recent times. The evolution has been primarily driven by factors like changing

More information

Improved Software Testing Using McCabe IQ Coverage Analysis

Improved Software Testing Using McCabe IQ Coverage Analysis White Paper Table of Contents Introduction...1 What is Coverage Analysis?...2 The McCabe IQ Approach to Coverage Analysis...3 The Importance of Coverage Analysis...4 Where Coverage Analysis Fits into your

More information

Periodic risk assessment by internal audit

Periodic risk assessment by internal audit Periodic risk assessment by internal audit I Introduction The Good Practice Internal Audit Manual Template, developed by the Internal Audit CoP of Pempal, defines the importance and the impact that an

More information

Risk Analysis and Quantification

Risk Analysis and Quantification Risk Analysis and Quantification 1 What is Risk Analysis? 2. Risk Analysis Methods 3. The Monte Carlo Method 4. Risk Model 5. What steps must be taken for the development of a Risk Model? 1.What is Risk

More information

ESTABLISHING A MEASUREMENT PROGRAM

ESTABLISHING A MEASUREMENT PROGRAM ESTABLISHING A MEASUREMENT PROGRAM The most important rule is to Understand that software measurement is a means to an end, not an end in itself Three key reasons for Software Measurement Understanding

More information

Data quality and metadata

Data quality and metadata Chapter IX. Data quality and metadata This draft is based on the text adopted by the UN Statistical Commission for purposes of international recommendations for industrial and distributive trade statistics.

More information

Software Development Process Selection Approaches

Software Development Process Selection Approaches The Journal of Applied Science Vol. 11 No. Vol. 2:45-50 11 No. 2 [2012] ISSN 1513-7805 Printed in Thailand Review Article Software Development Process Selection Approaches Phongphan Danphitsanuphan Department

More information

Internal Quality Assurance Arrangements

Internal Quality Assurance Arrangements National Commission for Academic Accreditation & Assessment Handbook for Quality Assurance and Accreditation in Saudi Arabia PART 2 Internal Quality Assurance Arrangements Version 2.0 Internal Quality

More information

Fault Slip Through Measurement in Software Development Process

Fault Slip Through Measurement in Software Development Process Fault Slip Through Measurement in Software Development Process Denis Duka, Lovre Hribar Research and Development Center Ericsson Nikola Tesla Split, Croatia denis.duka@ericsson.com; lovre.hribar@ericsson.com

More information

The fact is that 90% of business strategies are not implemented through operations as intended. Overview

The fact is that 90% of business strategies are not implemented through operations as intended. Overview Overview It is important to recognize that a company s network determines its supply chain efficiency and customer satisfaction. Designing an optimal supply chain network means the network must be able

More information

Process-Based Business Transformation. Todd Lohr, Practice Director

Process-Based Business Transformation. Todd Lohr, Practice Director Process-Based Business Transformation Todd Lohr, Practice Director Process-Based Business Transformation Business Process Management Process-Based Business Transformation Service Oriented Architecture

More information

PROJECT RISK MANAGEMENT

PROJECT RISK MANAGEMENT PROJECT RISK MANAGEMENT DEFINITION OF A RISK OR RISK EVENT: A discrete occurrence that may affect the project for good or bad. DEFINITION OF A PROBLEM OR UNCERTAINTY: An uncommon state of nature, characterized

More information

Qlik UKI Consulting Services Catalogue

Qlik UKI Consulting Services Catalogue Qlik UKI Consulting Services Catalogue The key to a successful Qlik project lies in the right people, the right skills, and the right activities in the right order www.qlik.co.uk Table of Contents Introduction

More information

AMBER ROAD, INC. CORPORATE GOVERNANCE GUIDELINES

AMBER ROAD, INC. CORPORATE GOVERNANCE GUIDELINES AMBER ROAD, INC. CORPORATE GOVERNANCE GUIDELINES The following have been adopted by the Board of Directors (the Board ), of Amber Road, Inc. ( Amber Road or the Company ) to promote the effective functioning

More information

Fault Analysis in Software with the Data Interaction of Classes

Fault Analysis in Software with the Data Interaction of Classes , pp.189-196 http://dx.doi.org/10.14257/ijsia.2015.9.9.17 Fault Analysis in Software with the Data Interaction of Classes Yan Xiaobo 1 and Wang Yichen 2 1 Science & Technology on Reliability & Environmental

More information

JOURNAL OF OBJECT TECHNOLOGY

JOURNAL OF OBJECT TECHNOLOGY JOURNAL OF OBJECT TECHNOLOGY Online at www.jot.fm. Published by ETH Zurich, Chair of Software Engineering JOT, 2006 Vol. 5. No. 8, November-December 2006 Requirements Engineering Tasks Donald Firesmith,

More information

Latest Trends in Testing. Ajay K Chhokra

Latest Trends in Testing. Ajay K Chhokra Latest Trends in Testing Ajay K Chhokra Introduction Software Testing is the last phase in software development lifecycle which has high impact on the quality of the final product delivered to the customer.

More information

The Improvement of Test Case Selection for the Process Software Maintenance

The Improvement of Test Case Selection for the Process Software Maintenance The Improvement of Test Case Selection for the Process Software Maintenance Adtha Lawanna* Abstract following topics in software-development life cycle (SDLC) Software maintenance is one of the critical

More information

Data Governance. Unlocking Value and Controlling Risk. Data Governance. www.mindyourprivacy.com

Data Governance. Unlocking Value and Controlling Risk. Data Governance. www.mindyourprivacy.com Data Governance Unlocking Value and Controlling Risk 1 White Paper Data Governance Table of contents Introduction... 3 Data Governance Program Goals in light of Privacy... 4 Data Governance Program Pillars...

More information

Introduction to OpenPPM

Introduction to OpenPPM 0 Introduction to OpenPPM Content Table Introduction to OpenPPM... 1 0.1. OpenPPM: A tool thought by and for Project Managers... 2 0.2. Project, Program and Portfolio Management... 3 0.3. What is not OpenPPM?...

More information

2 Day In House Demand Planning & Forecasting Training Outline

2 Day In House Demand Planning & Forecasting Training Outline 2 Day In House Demand Planning & Forecasting Training Outline On-site Corporate Training at Your Company's Convenience! For further information or to schedule IBF s corporate training at your company,

More information

IMPLEMENTATION NOTE. Validating Risk Rating Systems at IRB Institutions

IMPLEMENTATION NOTE. Validating Risk Rating Systems at IRB Institutions IMPLEMENTATION NOTE Subject: Category: Capital No: A-1 Date: January 2006 I. Introduction The term rating system comprises all of the methods, processes, controls, data collection and IT systems that support

More information

Establishing your Automation Development Lifecycle

Establishing your Automation Development Lifecycle Establishing your Automation Development Lifecycle Frequently I engage clients in assessing and improving their automation efforts. The discussion normally starts from a position of frustration We ve invested

More information

IBM Software Testing and Development Control - How to Measure Risk

IBM Software Testing and Development Control - How to Measure Risk IBM Software Group Practical Approaches to Development Governance 2007 IBM Corporation Program parameters (cost, schedule, effort, quality, ) are random variables Area under curve describes probability

More information