AUTOMATED UNIT TEST GENERATION DURING SOFTWARE DEVELOPMENT
A Controlled Experiment and Think-aloud Observations
ISSTA 2015
José Miguel Rojas (j.rojas@sheffield.ac.uk)
Joint work with Gordon Fraser and Andrea Arcuri
"Testing is a widespread validation approach in industry, but it is still largely ad hoc, expensive, and unpredictably effective."
Software Testing Research: Achievements, Challenges, Dreams, A. Bertolino. Future of Software Engineering. IEEE. 2007.

"Test case generation has a strong impact on the effectiveness and efficiency of testing. [...] one of the most active research topics in software testing for several decades, resulting in many different approaches and tools."
An orchestrated survey of methodologies for automated software test case generation, S. Anand, E. K. Burke, T. Y. Chen, J. Clark, M. B. Cohen, W. Grieskamp, M. Harman, M. J. Harrold, P. McMinn. J. Systems and Software. Elsevier. 2013.
BACK IN ISSTA 2013
Does automated white-box test generation really help software testers?, G. Fraser, M. Staats, P. McMinn, A. Arcuri and F. Padberg
ARE UNIT TEST GENERATION TOOLS HELPFUL TO DEVELOPERS WHILE THEY ARE CODING?
CODE COVERAGE
TIME SPENT ON TESTING
IMPLEMENTATION QUALITY
CONTROLLED EXPERIMENT
[Setup diagram. Labels: Golden Implementation, Class Template, Implementation; Manual; 1 hour; 41; 2; 4.]
RQ 1: DOES USING EVOSUITE DURING SOFTWARE DEVELOPMENT LEAD TO TEST SUITES WITH HIGHER CODE COVERAGE?
CODE COVERAGE
Participants' test suites run on their own implementations
[Bar chart: branch coverage (0% to 100%) per class (FilterIterator, FixedOrderComparator, ListPopulation, PredicatedMap), Assisted vs. Manual. Bar labels: 83%, 63%, 57%, 39%, 41%, 38%, 50%, 26%.]
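Branch coverage, the metric on the chart's vertical axis, is the share of branch outcomes that a test suite exercises. A minimal Python sketch of that ratio follows. The `branch_coverage` helper and the branch identifiers are hypothetical and only illustrate the arithmetic; this is not how EvoSuite or the study's tooling measures coverage.

```python
def branch_coverage(all_branches, covered_branches):
    """Return the fraction of known branch outcomes that were exercised."""
    all_branches = set(all_branches)
    if not all_branches:
        # Convention chosen for this sketch: no branches counts as fully covered.
        return 1.0
    # Ignore anything reported as covered that is not a known branch outcome.
    covered = set(covered_branches) & all_branches
    return len(covered) / len(all_branches)


# Hypothetical class with three conditionals, i.e. six branch outcomes.
ALL = {
    ("if1", True), ("if1", False),
    ("if2", True), ("if2", False),
    ("loop", True), ("loop", False),
}
# Outcomes exercised by a hypothetical test suite.
HIT = {("if1", True), ("if1", False), ("if2", True)}

print(f"{branch_coverage(ALL, HIT):.0%}")
```

Running the same test suite against a different implementation (for example a golden one) changes both sets, which is why coverage on participants' own code and on the golden implementations is reported separately.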
CODE COVERAGE
Times coverage was checked
[Bar chart: number of times coverage was checked (0 to 10) per class (FilterIterator, FixedOrderComparator, ListPopulation, PredicatedMap), Assisted vs. Manual. Bar labels: 9.6, 9, 6, 5.9, 6.4, 5.3, 4, 1.9.]
CODE COVERAGE
Participants' test suites run on their own implementations
[Line chart for ListPopulation: branch coverage (0% to 100%) over time (0 to 60 min). Series: Manual, EvoSuite, Assisted.]
CODE COVERAGE
Participants' test suites run on golden implementations
[Bar chart: branch coverage (0% to 100%) per class (FilterIterator, FixedOrderComparator, ListPopulation, PredicatedMap), Assisted vs. Manual. Bar labels: 41%, 30%, 35%, 21%, 37%, 28%, 42%, 50%.]
CODE COVERAGE
Participants' test suites run on golden implementations, over time
[Four line charts (FilterIterator, FixedOrderComparator, ListPopulation, PredicatedMap): branch coverage (0% to 100%) over time (0 to 60 min). Series: Assisted, Manual, EvoSuite-generated.]
Coverage can be higher when using EvoSuite, depending on how the generated tests are used.
RQ 2: DOES USING EVOSUITE DURING SOFTWARE DEVELOPMENT LEAD TO DEVELOPERS SPENDING MORE OR LESS TIME ON TESTING?
TESTING EFFORT
Number of test runs
[Bar chart: number of test runs (0 to 14) per class (FilterIterator, FixedOrderComparator, ListPopulation, PredicatedMap), Assisted vs. Manual. Bar labels: 13.7, 12.9, 11, 7.8, 7.2, 8.2, 4.4, 5.6.]
TESTING EFFORT
Minutes spent on testing
[Bar chart: minutes spent on testing (0 to 26) per class (FilterIterator, FixedOrderComparator, ListPopulation, PredicatedMap), Assisted vs. Manual. Bar labels: 25, 18.5, 20, 9.3, 15.8, 12.6, 7.7, 14.3.]
Using EvoSuite reduces the time spent on testing.
RQ 3: DOES USING EVOSUITE DURING SOFTWARE DEVELOPMENT LEAD TO SOFTWARE WITH FEWER BUGS?
IMPLEMENTATION QUALITY
Golden test suites run on participants' implementations
[Bar chart: number of failures plus errors (0 to 16) per class (FilterIterator, FixedOrderComparator, ListPopulation, PredicatedMap), Assisted vs. Manual. Bar labels: 15.6, 14.3, 6.1, 6.3, 6.4, 5.3, 4.2, 4.3.]
Using EvoSuite during development did not lead to better implementations.
RQ 4: DOES SPENDING MORE TIME WITH EVOSUITE AND ITS TESTS LEAD TO BETTER IMPLEMENTATIONS?
PRODUCTIVITY
Time spent with EvoSuite
[Bar chart: correlation with number of failures plus errors (-0.50 to 0.40) per class (FilterIterator, FixedOrderComparator, ListPopulation, PredicatedMap). Series: Number of runs, Time spent on tests. Bar labels: 0.35, -0.29, -0.03, -0.22, 0.32, 0.02, 0.35, -0.49.]
Implementation quality improves the more time developers spend with EvoSuite-generated tests.
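A negative correlation in this chart means that more time spent went together with fewer failures plus errors. The slide does not say which correlation coefficient was used, so the sketch below assumes Pearson's r purely for illustration. The data points are made up and are not measurements from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products and squares of the deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up example: minutes spent on generated tests vs. failures plus errors.
minutes_with_tests = [5, 10, 15, 20, 25]
failure_counts = [9, 8, 6, 5, 2]

r = pearson_r(minutes_with_tests, failure_counts)
print(f"r = {r:.2f}")  # negative: more time, fewer failures plus errors
```

Values near zero, such as several of those on the chart, indicate little or no linear relationship between the two quantities.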
Using automated unit test generation does impact developers' productivity, but how to make the most out of unit test generation tools?
THINK ALOUD OBSERVATIONS
[Diagram labels: Observer, Subject.]
K. A. Ericsson and H. A. Simon, Protocol Analysis: Verbal Reports as Data (revised edition). MIT Press, 1993.
J. Hughes and S. Parkes, Trends in the use of verbal protocol analysis in software engineering research, Behaviour and Information Technology, vol. 22, no. 2, pp. 127-140, 2003.
THINK ALOUD OBSERVATIONS
[Setup diagram. Labels: Golden Implementation, Class Template, Implementation; 2 hours; 5; 1; 4.]
RESULTS
FilterIterator, FixedOrderComparator, ListPopulation, PredicatedMap, PredicatedMap
3 6 2 2 5
15 20 23 7 10
94% 92% 97% 72% 94% 95% 100% 94% 100% 83%
Images: http://jozef89.deviantart.com/
LESSONS LEARNED
There are different approaches to testing, and test generation tools should be adaptable to them
Developers' behaviour is often not driven by code coverage
Readability of generated unit tests is paramount
Integration into development environments must be improved
Education/Best practices: Developers do not know how to best use automated test generation tools!
"Coverage is easy to assess because it is a number, while readability is a very nontangible property."
"What is readable to me may not be readable to you. It is readable to me just because I spent the last hour and a half doing this."
Participant 5
www.evosuite.org/study-2014/
j.rojas@sheffield.ac.uk