Confirmation Bias as a Human Aspect in Software Engineering
Gul Calikli, PhD
Data Science Laboratory, Department of Mechanical and Industrial Engineering, Ryerson University
Why Human Aspects in Software Engineering? Enhance decision making under uncertainty, so that managers can make decisions about efficient allocation of resources during any phase of the SDLC:
- Which parts of the software should be prioritized for testing?
- Who should test/develop the most critical parts of the software?
- Who should fix the bugs in the most problematic parts of the software?
- Who should/should not develop/maintain the same source files?
- Whom should we hire as a developer/tester/analyst/designer?
Why Human Aspects in Software Engineering? People's thought processes have a significant impact on software quality, since software is analyzed, designed, developed, tested and managed by people. When solving problems in daily life, people use heuristics; when a heuristic fails to produce a correct judgment, the result is a cognitive bias. Heuristics employed in daily software engineering activities may likewise result in cognitive biases, leading to defects. Some common cognitive bias types:
- confirmation bias
- anchoring and adjustment
- availability
- representativeness
We focus on confirmation bias!
Confirmation Bias in Software Engineering Confirmation bias is defined as the tendency of people to seek evidence that verifies a hypothesis rather than evidence that refutes it.
Confirmation Bias in Software Engineering Due to confirmation bias, developers tend to write unit tests that make their program work rather than tests that try to break their code. At all levels of software testing, we must employ a testing strategy that includes adequate attempts to make the code fail, in order to reduce software defect density. (A minimal illustration follows.)
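To make the distinction concrete, here is a minimal sketch in Python (a hypothetical divide() function, pytest assumed) contrasting a confirmatory "make it work" test with a disconfirmatory test that deliberately tries to break the code:

```python
# Hypothetical function under test; not from the deck.
import pytest

def divide(a: float, b: float) -> float:
    """Divide a by b; raises ZeroDivisionError when b == 0."""
    return a / b

def test_divide_positive():
    # Confirmatory test: checks only that the happy path works.
    assert divide(10, 2) == 5

def test_divide_negative():
    # Disconfirmatory test: actively tries to make the code fail.
    with pytest.raises(ZeroDivisionError):
        divide(10, 0)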
Methodology to Quantify Confirmation Bias Research Question 1: How can we identify measures of confirmation bias in relation to the software development process?
Methodology to Quantify Confirmation Bias Challenge: Quantifying confirmation bias so that empirical analyses can be performed. Proposed Solution: Our methodology is an iterative process that mainly consists of the following steps: 1) Preparation of the confirmation bias test 2) Formation of the confirmation bias metrics set
Confirmation Bias Test The confirmation bias test consists of the following:
- Interactive test based on Wason's Rule Discovery Task
- Written test based on Wason's Selection Task
Written test content:
Abstract questions: 8
Thematic questions: 6
SW development/testing questions: 8
TOTAL: 22
Confirmation Bias Test Wason's Rule Discovery Task Goal: Discover the correct rule. Initially, the subject is given three numbers that conform to a simple rule. Experiment protocol (a small simulation follows):
repeat
  write down three numbers & reasons for the choice;
  receive feedback from the tester;
  if you are sure about the rule: announce the rule; end  % loop ends when the correct rule is announced
  if you want to terminate: break; end  % terminated
end
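A minimal simulation of the task, assuming the classic 2-4-6 setup where the hidden rule is "any strictly ascending triple" (the probe triples and the subject's "constant step" hypothesis are illustrative):

```python
def hidden_rule(triple):
    a, b, c = triple
    return a < b < c  # the actual rule: strictly ascending

def give_feedback(triple):
    return "conforms" if hidden_rule(triple) else "does not conform"

# A confirmatory strategy tests only triples consistent with the subject's
# hypothesis ("numbers increasing by a constant step"), so feedback can never
# distinguish the hypothesis from the true rule.
confirmatory_probes = [(4, 8, 12), (10, 20, 30), (1, 2, 3)]
# Attempts at refutation: (2, 4, 5) conforms to the true rule but
# falsifies the "constant step" hypothesis.
falsifying_probes = [(3, 2, 1), (1, 1, 1), (2, 4, 5)]

for probe in confirmatory_probes + falsifying_probes:
    print(probe, "->", give_feedback(probe))
```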
Confirmation Bias Test Wason's Selection Task Goal: Find out which of the four cards should be turned over to test the validity of the statement: "If there is a D on one side of the card (p), then it has a 3 on its other side (q)." The four visible card faces correspond to p, not-p, q and not-q.
Example: Wason's Rule Discovery Task in Relation to Unit Testing Wason's Rule Discovery Task: Subjects tend to select many triples (i.e., test cases) that are consistent with their hypotheses and few tests that are inconsistent with them.
T: set of triples conforming to the correct rule
H: set of triples conforming to the hypothesis in the subject's mind
Observed similarity with functional (black-box) testing: Program testers may select many test cases consistent with the program specifications (positive tests) and few that are inconsistent with them (negative tests). (A small illustration of the T/H relationship follows.)
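A small illustration (hypothetical sets over a tiny universe of triples) of why positive tests alone cannot separate the subject's hypothesis H from the true rule T when H is a proper subset of T:

```python
from itertools import product

universe = list(product(range(1, 5), repeat=3))   # all triples over {1..4}
T = {t for t in universe if t[0] < t[1] < t[2]}   # true rule: strictly ascending
# Hypothesis: ascending with a constant step (a proper subset of T).
H = {t for t in universe if t[1] - t[0] == t[2] - t[1] and t[0] < t[1]}

positive_probes = H  # tests consistent with the hypothesis
# Every positive probe also conforms to T, so feedback never refutes H:
print(all(t in T for t in positive_probes))       # -> True
# A probe from T \ H is what actually exposes the difference:
print(sorted(T - H))                              # triples refuting "constant step"
```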
Example: Wason's Selection Task in Relation to Unit Testing Example: Suppose you want to make sure that a program avoids dereferencing a null pointer by always checking before dereferencing: "If a pointer is dereferenced, then it is checked for nullity." Someone tells you there are only four sections of code to be tested, and they have determined the following things about those sections:
- Section A checks whether the pointer is null. The pointer may or may not be dereferenced there.
- Section B does not check whether the pointer is null. The pointer may or may not be dereferenced there.
- Section C dereferences the pointer. The pointer may or may not have been checked for nullity.
- Section D does not dereference the pointer. The pointer may or may not have been checked for nullity.
Which sections need to be investigated further? (See the sketch below.)
Stacy, W., & MacMillan, J. (1995). Cognitive bias in software engineering. Communications of the ACM, 38(6), 57-63.
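A sketch of the selection-task logic behind this example (the encoding of the sections is mine, not the paper's): the rule "if dereferenced (p), then checked (q)" can only be violated where "p and not q" is still possible, so only such sections need further inspection.

```python
sections = {
    # name: (dereferences, checked) -- None means "unknown, may or may not"
    "A": (None, True),    # q:     "p and not q" impossible -> skip
    "B": (None, False),   # not-q: violated if it dereferences -> inspect
    "C": (True, None),    # p:     violated if it doesn't check -> inspect
    "D": (False, None),   # not-p: "p and not q" impossible -> skip
}

def needs_investigation(derefs, checked):
    # The rule can still be violated iff "derefs and not checked" is possible.
    return (derefs is None or derefs) and (checked is None or not checked)

print([name for name, facts in sections.items() if needs_investigation(*facts)])
# -> ['B', 'C'], mirroring the p and not-q cards in Wason's selection task
```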
Confirmation Bias Metrics Set
- Interactive test metrics
- Written test metrics
Next step: definition of the metrics suite.
Confirmation Bias Metrics Set: Some Practical Results
- Interactive test outcome: hypothesis testing strategy (test severity; bins of problem-solving steps)
- Written test outcome: Reich and Ruth's falsifier/verifier/matcher classification (falsifier / verifier / matcher / none)
Subject groups:
- Group 1*: developers at a GSM/telecommunications company (29 subjects)
- Group 8*: Computer Engineering PhD candidates with a minimum of 2 years of development experience (36 subjects)
(Result charts omitted.)
Influence of Developers' Confirmation Bias on Software Quality - Part 1 Research Question 2: How do confirmation biases of developers affect software quality?
Influence of Developers' Confirmation Bias on Software Quality - Part 1 Dataset: (details shown in the omitted slide figure.) Steps of the analysis:
1) Formation of developer groups
2) Estimation of developer groups' confirmation bias metric values from individual values
3) Measurement of the defect rate for each developer group
4) Analysis of the Pearson correlation between developer groups' confirmation bias metrics and defect rates (see the sketch below)
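A minimal sketch of the correlation step (scipy assumed; the numbers are illustrative, not the study's data):

```python
from scipy.stats import pearsonr

# Hypothetical values for a handful of developer groups:
bias_metric = [0.82, 0.67, 0.91, 0.55, 0.74]   # group-level confirmation bias metric
defect_rate = [0.31, 0.22, 0.40, 0.15, 0.27]   # defect rate per developer group

r, p_value = pearsonr(bias_metric, defect_rate)
# Effect size judged against Cohen's conventional thresholds.
print(f"r = {r:.2f}, p = {p_value:.3f}")
```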
Influence of Developers' Confirmation Bias on Software Quality - Part 1 Estimation of the correlation between developer groups' confirmation bias metrics (interactive test) and defect rates. Results for Group 1* and Group 8* (tables omitted); effect sizes interpreted using Cohen's conventional thresholds.
Influence of Developers' Confirmation Bias on Software Quality - Part 1 Estimation of the correlation between developer groups' confirmation bias metrics (written test) and defect rates. Results for Group 1* and Group 8* (tables omitted); effect sizes interpreted using Cohen's conventional thresholds.
Influence of Developers' Confirmation Bias on Software Quality - Part 2 Research Question 3: How do measures of confirmation bias perform in predicting defect-prone parts of software?
Defect Prediction Models Software quality is often measured by the number of defects in the software. Testing takes ~50% of the overall time in the Software Development Lifecycle (SDLC). Oracles/predictors can be used to supplement testing activities for effective allocation of testing resources.
(Original slide diagram: NASA metrics data used directly / with equal weighting, and company metrics data with InfoGain/PCA-weighted metrics, fed into decision tree and Naïve Bayes classification.)
Defect Prediction Models: At the intersection of AI and SWE. How can we enhance the performance of defect prediction models?
- Data content:
  - Product/process-related metrics: static code metrics, churn metrics, design metrics, file dependency graphs, CGBR
  - People-related/organizational metrics: number of developers, developer experience, social interaction networks
- Data size: under-sampling outperformed over-sampling; micro-sampling
- Algorithms: k-NN, Naïve Bayes, Bayesian networks, neural networks, SVM, logistic regression, ...
Influence of Developers' Confirmation Bias on Software Quality - Part 2 Construction of the prediction model (also used in the missing data problem; a sketch follows):
- Algorithm: Naive Bayes
- Input data: static code, churn, and confirmation bias metrics (models are constructed for each combination of these metrics)
- Preprocessing: under-sampling
- Validation: 10x10 cross-validation
- Performance measures: (listed in the omitted results figure)
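A minimal sketch of the described setup, under stated assumptions: scikit-learn's GaussianNB stands in for the Naive Bayes learner, imbalanced-learn's RandomUnderSampler for the under-sampling step, and the data is synthetic; only the overall shape (under-sample, then 10x10 cross-validation) follows the slide.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import RepeatedStratifiedKFold
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # e.g., static code + churn + bias metrics
y = (rng.random(200) < 0.2).astype(int)    # imbalanced defect labels

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
recalls = []
for train_idx, test_idx in cv.split(X, y):
    # Under-sample the majority class in the training fold only.
    X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X[train_idx], y[train_idx])
    pred = GaussianNB().fit(X_bal, y_bal).predict(X[test_idx])
    tp = np.sum((pred == 1) & (y[test_idx] == 1))
    recalls.append(tp / max(1, np.sum(y[test_idx] == 1)))  # probability of detection
print(f"mean pd over 10x10 CV: {np.mean(recalls):.2f}")
```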
Influence of Developers' Confirmation Bias on Software Quality - Part 2 Results for the ERP, Telecom1, Telecom2, Telecom3 and Telecom4 datasets (tables omitted).
Influence of Developers' Confirmation Bias on Software Quality - Part 2 Results summary: Confirmation bias is a single human aspect, yet defect prediction models built using only confirmation bias metrics performed comparably to models using static code metrics and churn metrics in predicting defect-prone parts of software. Therefore, we should further investigate other human aspects.
Current Work: The Impact of Confirmation Bias on the Release-based Defect Prediction of Developer Groups. Problem: Predicting the defect rates of developer groups for the next releases of a software product, given confirmation bias metrics and the defect rates of current and past releases. Motivation: towards task assignment: if a group's predicted defect rate is high, avoid that group; if low, the group is OK. Solution: Use Partial Least Squares Regression (PLSR) and Principal Component Regression (PCR); a sketch follows. Methodology: Train the model with releases 1, 2, ..., i-1 and test it on the i-th release. Results for the Telecom1, ERP and Telecom2 datasets (residual plots over developer group indices omitted; e.g., residual range [-0.08, 0.04] for the ERP dataset).
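A minimal sketch of the release-based setup, assuming scikit-learn's PLSRegression and hypothetical per-release matrices (train on releases 1..i-1, predict group defect rates for release i):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
n_groups, n_metrics, n_releases = 12, 6, 5
# X[r]: confirmation bias metrics per developer group at release r; y[r]: defect rates.
X = [rng.normal(size=(n_groups, n_metrics)) for _ in range(n_releases)]
y = [rng.random(n_groups) for _ in range(n_releases)]

i = n_releases - 1                                   # predict the latest release
X_train, y_train = np.vstack(X[:i]), np.concatenate(y[:i])
pls = PLSRegression(n_components=2).fit(X_train, y_train)
residuals = y[i] - pls.predict(X[i]).ravel()         # per-group prediction errors
print("per-group residuals:", np.round(residuals, 2))
```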
Current Work: Dealing with Missing Data. Problem: Collecting data (e.g., confirmation bias metrics) through interviews/tests might be challenging:
- Tight schedules of developers
- Evaluation apprehension
- Lack of motivation
- Staff turnover
All of these result in the missing data problem. Solution: Use the Expectation Maximization (EM) algorithm to impute missing data (a sketch follows this list). Methodology (experimental setup):
1) Form 2^N - 2 different missing data configurations (N: total number of developer groups)
2) Use EM to impute the missing data
3) Build defect prediction models using the imputed data
4) Compare the obtained performance results with the performance of prediction models built using complete data
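A compact sketch of EM-style imputation under a multivariate Gaussian model, a common textbook formulation (the paper's exact EM variant may differ; for brevity this version imputes conditional means and omits the conditional-covariance correction in the M-step):

```python
import numpy as np

def em_impute(X, n_iter=50):
    X = X.copy()
    missing = np.isnan(X)
    # Crude start: fill missing values with column means.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        for i in range(X.shape[0]):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():          # entire row missing: fall back to the mean
                X[i] = mu
                continue
            # E-step: conditional mean of missing entries given observed ones.
            s_mo = sigma[np.ix_(m, o)]
            s_oo = sigma[np.ix_(o, o)]
            X[i, m] = mu[m] + s_mo @ np.linalg.solve(s_oo, X[i, o] - mu[o])
    return X

rng = np.random.default_rng(2)
data = rng.normal(size=(30, 4))
data[rng.random(data.shape) < 0.15] = np.nan   # knock out ~15% of entries
print(np.isnan(em_impute(data)).any())         # -> False: all entries imputed
```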
Current Work: Dealing with Missing Data. The dataset, construction of the model and estimation of the performance criteria are as in our previous work on defect prediction. Experimental results for the ERP, Telecom1, Telecom2 and Telecom3 datasets (figures omitted).
Current Work: Confirmation Bias Metrics: A new metrics suite proposal to measure the thought processes of developers. We initially identified a confirmation bias metrics set. Our current goal: complete the following to-do list:
1) Form the theoretical basis - Done!
2) Refine the existing metrics set - We are here!
3) Empirically demonstrate the feasibility of our metrics
4) Formulate a single derived metric using the refined metrics set
5) Analytically evaluate our metrics suite and the single derived metric against the principles of measurement theory
6) Empirically validate the feasibility of the single derived metric
Current Work: Refine Existing Metrics Set: Criteria for the final metrics suite:
- Metrics should not be highly correlated with each other
- Check for the correlation between defect rates and the values of each metric
- Metrics should be able to differentiate problematic software products from the rest (a test sketch follows)
Example metric: positivecompatible; χ² = 84.9, df = 4 across Telecom1, Telecom2 and ERP (distribution charts omitted). Telecom1 is currently experiencing serious post-release defects. Telecom2 and ERP are mission-critical software, as they include billing and charging modules.
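A minimal sketch (scipy assumed, counts hypothetical) of the kind of chi-square test used to check whether a metric's distribution differs across products such as Telecom1, Telecom2 and ERP:

```python
from scipy.stats import chi2_contingency

# Rows: products; columns: counts of developers per metric-value bin.
counts = [
    [25, 10, 5],    # Telecom1
    [8, 15, 12],    # Telecom2
    [6, 14, 18],    # ERP
]
chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.1f}, df = {dof}, p = {p:.4f}")
```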
Current Work: Refine Existing Metrics Set (cont'd): Criteria for the final metrics suite:
- Metrics should be able to differentiate problematic software products from the rest.
Example metric: Ind ElimEnum (Wason's eliminative/enumerative index); χ² = 100, df = 6 across Telecom1, Telecom2 and ERP (distribution charts omitted). Telecom1 is currently experiencing serious post-release defects. Telecom2 and ERP are mission-critical software, as they include billing and charging modules.
Current Work: Formulate a single derived metric. In order to make the interpretation of the results easier, we formulated a single derived metric to quantify the confirmation bias level. Confirmation bias level: the deviation of the confirmation bias metric values from the corresponding ideal metric values (one possible formalization is sketched below).
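The deck does not give the exact formula; one natural formalization of "deviation from ideal values" (an assumption, here a weighted, scaled Euclidean distance) is:

```latex
% One possible formalization (an assumption; the deck does not fix the norm):
% deviation of a developer's metric vector m from the ideal vector m^{*}.
\[
  CB(m) \;=\; \sqrt{\sum_{j=1}^{k} w_j \left( \frac{m_j - m_j^{*}}{\sigma_j} \right)^{2}}
\]
% where m_j is the j-th confirmation bias metric, m_j^{*} its ideal value,
% \sigma_j a scale factor, and w_j optional weights summing to 1.
```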
Current Work: To analytically evaluate our metrics suite and single derived metric against the principles of measurement theory. According to measurement theory: We begin with a set of objects, each of which has one or more common attributes, each of which in turn can be divided into exclusive and exhaustive equivalence classes. The objects and the relationships between them constitute an Empirical Relational System (ERS). In parallel, we construct a Numerical Relational System (NRS) comprising numbers and the relationships between them. Example: Let M(x) be the value of the variable length for rod x; we assign numbers such that M(x) ≥ M(y) if and only if x ⪰ y, where ⪰ represents "not shorter than". We then establish a homomorphism from the ERS, denoted by [A, ⪰] where A is the set of rods, to the NRS, denoted by [R, ≥] (written out below).
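In the notation above, the measurement mapping can be written compactly (a standard measurement theory formulation, not specific to this deck):

```latex
% The measurement M : A -> R is a homomorphism between the two systems:
\[
  M : (A, \succeq) \longrightarrow (\mathbb{R}, \geq), \qquad
  \forall x, y \in A:\; x \succeq y \iff M(x) \geq M(y)
\]
```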
Current Work: To analytically evaluate our metrics suite and single derived metric against the principles of measurement theory. Question: Which concepts should be inherited from measurement theory so that the following are prevented?
- Lacking desirable measurement properties
- Being insufficiently generalized
Some formulations of measurement fail for disciplines such as psychology (example: the concatenation operation x ∘ y = z); such formulations should be identified, and appropriate ones should be inherited. Goals:
- Avoid the criticism regarding the lack of a theoretical basis in the formation of a metrics set.
- Form a formal methodology to define metrics sets for other cognitive aspects of people.
Related Publications
- G. Calikli and A. Bener, "The Impact of Confirmation Bias on the Release-based Defect Prediction of Developer Groups," 25th International Conference on Software Engineering and Knowledge Engineering (SEKE 2013), Boston, USA, 2013. (submitted)
- G. Calikli and A. Bener, "Influence of Confirmation Biases of Developers on Software Quality: An Empirical Study," Software Quality Journal, 2012.
- G. Calikli, B. Caglayan, A. Tosun and A. Bener, "Modeling Human Aspects to Enhance Software Quality Management," 2012 International Conference on Information Systems (ICIS 2012), Orlando, Florida, USA, December 2012.
- B. Caglayan, A. Tosun, G. Calikli, T. Aytac, A. Bener, and B. Turhan, "Dione: An Integrated Measurement and Defect Prediction Solution," 20th International Symposium on Foundations of Software Engineering, Cary, North Carolina, USA, September 2012.
- G. Calikli and A. Bener, "Empirical Analyses of the Factors Affecting Confirmation Bias and the Effects of Confirmation Bias on Software Developer/Tester Performance," PROMISE 2010, Timişoara, Romania, September 12-13, 2010.
- G. Calikli and A. Bener, "Preliminary Analysis of the Effects of Confirmation Bias on Software Defect Density," ESEM 2010, Bozen, Italy, September 16-17, 2010.
- G. Calikli, B. Arslan and A. Bener, "Confirmation Bias in Software Development and Testing: An Analysis of the Effects of Company Size, Experience and Reasoning Skills," 22nd Annual Psychology of Programming Interest Group Workshop, September 19-21, 2010.
- G. Calikli, A. Bener, and B. Arslan, "An Analysis of the Effects of Company Culture, Education and Experience on Confirmation Bias Levels of Software Developers and Testers," ICSE 2010, May 2-8, Cape Town.
- G. Calikli, A. Tosun, A. Bener, and M. Celik, "The Effect of Granularity Level on Software Defect Prediction," Proceedings of the 24th International Symposium on Computer and Information Sciences (ISCIS 2009), pp. 531-536.
THANK YOU ANY QUESTIONS? Gül Çalıklı: gcalikli@ryerson.ca