Comparing and Combining Evolutionary Couplings from Interactions and Commits


Fasil Bantelay, Motahareh Bahrami Zanjani
Department of Electrical Engineering and Computer Science, Wichita State University, Wichita, Kansas 67260, USA, {ftbantelay,

Huzefa Kagdi
Department of Electrical Engineering and Computer Science, Wichita State University, Wichita, Kansas 67260, USA

Abstract: The paper presents an approach to mine evolutionary couplings from a combination of interaction (e.g., Mylyn) and commit (e.g., CVS) histories. These evolutionary couplings are expressed at the file and method levels of granularity, and are applied to support the tasks of commit and interaction prediction. Although the topic of mining evolutionary couplings has been investigated previously, the empirical comparison and combination of the two types from interaction and commit histories have not been attempted. An empirical study on 3272 interactions and 5093 commits from Mylyn, an open source task management tool, was conducted. These interactions and commits were divided into training and testing sets to evaluate the combined, and individual, models. Precision and recall metrics were used to measure the performance of these models. The results show that combined models offer statistically significant increases in recall over the individual models for change predictions. At the file level, the combined models achieved a maximum recall improvement of 13% for commit prediction with a 2% maximum precision drop.

Index Terms: Mining Software Repositories; Evolutionary Couplings; Mylyn; Interaction History; Commit History

I. INTRODUCTION

Change Impact Analysis (IA) or change prediction in source code has been investigated in the software maintenance community. The main goal of this task is to estimate the complete extent of a proposed change in source code (e.g., due to a new feature or bug report). That is, should a source code entity be changed, what other entities also need to be changed?
Numerous solutions to this task, ranging from traditional static and dynamic techniques to contemporary methods from information retrieval and mining software repositories, have been reported in the literature [1-4]. This shows definite progress in supporting the task; however, much work remains in improving its effectiveness (accuracy). Furthermore, developers may interact with (e.g., navigate, view, and modify) software entities within an Integrated Development Environment (IDE) that may not eventually be committed to the code repository. These interactions could have contributed to locating and/or verifying the entities that were changed due to a change request. In this paper, we investigate complementary ways to improve support for change prediction and associated developer interactions. Evolutionary couplings mined from commits in source code repositories have been used to support the task of change prediction [5-9]. Similarly, interactions recorded with task management tools, such as Mylyn, have shown promise in helping developers [10-12]. We compare the efficacy of evolutionary couplings mined from commits and interactions for the change prediction task. At face value, it could be conjectured that commits (i.e., changed entities) are a subset of interactions (i.e., viewed and changed entities), and that comparing and combining the two should therefore lead to obvious postulates and/or predictable outcomes. Our investigation on the Mylyn dataset found that this subset relationship does not always hold (see Table 1). This fact suggests the potential orthogonality of these two sources and inspires our work. Monitoring and recording developer interactions, however promising, is a relatively recent phenomenon. Its use is arguably not yet prevalent at the scale of source code repositories, i.e., the number of open source projects with interaction histories is far smaller than the number with source code repositories.
Having automatic support for interaction prediction, similar to change/commit prediction, could potentially benefit developers. Our quest is to examine the viability of evolutionary couplings mined from commits in assisting developers with (future) interactions. We also present combination models of commits and interactions to mine evolutionary couplings, with the goal of improving the effectiveness of commit and interaction predictions. Combining the two different, yet somewhat related, histories could lead to redundancy and subsequently create a fallacy of strong (otherwise non-existing) couplings. Thus, we explore and assess different ways of combining the two histories in a systematic and synergetic way. These couplings are demonstrated on commit and interaction prediction tasks at the source code file and method levels of granularity. To the best of our knowledge, these combined approaches were neither attempted nor empirically assessed previously. We conducted an empirical study on 3272 interaction traces and 5093 commits from Mylyn, an open source task management tool. These interactions and commits were divided into training and testing sets to evaluate the combined, and individual, models. Precision and recall metrics were used to measure the performance of these models. The results show that combined models offer statistically significant increases in recall over the individual models for change predictions. The results also show that a model trained from commit histories can predict interactions with a higher precision than models trained from interaction histories.

/13/$31.00 © 2013 IEEE. WCRE 2013, Koblenz, Germany

Figure 1. A snippet of 4 interaction events (labeled 1-4) recorded by Mylyn for a bug issue. In the 1st interaction, the createeditortab method is selected. In the 2nd, 3rd, and 4th interactions, the contextactivated method is indirectly manipulated, then directly selected, and finally edited. A) Method name: createeditortab; B) Class name: ContextEditManager; C) File name: ContextEditorManager.java; D) Parameter types for the createeditortab method: Lorg.eclipse.ui.internal.EditorReference and Ljava.lang.String.

In summary, our paper makes the following contributions:
- A combined approach for mining evolutionary couplings from commit and interaction histories.
- An empirical comparison of the combined evolutionary couplings with the two types of individual couplings for commit and interaction prediction tasks.
- An empirical comparison of the two types of evolutionary couplings mined from commits and interactions for commit and interaction prediction tasks.

The remainder of this paper is organized as follows: Section II presents our approach for mining evolutionary couplings from interactions and commits. Section III describes our empirical evaluation and results on the Mylyn dataset. Section IV presents threats to validity, Section V discusses previous work, and Section VI concludes.

II. COMBINED APPROACH FOR MINING EVOLUTIONARY COUPLINGS

Interaction history has been used to detect both interaction and change couplings [13-15]. Change history from source configuration management systems, such as CVS and SVN, has been used to detect change couplings [5, 9, 16-18]. We combine commits from the SCM with interactions from Mylyn in an attempt to find a better prediction model than interaction or change history alone. Our rationale for combining the two histories includes the following: A combined model could assist developers by recommending potential entities to be interacted with, using training data from commits.
A given project may not possess both interaction and change histories. Combined models may detect additional couplings compared to individual models. Elements that are both interacted with and committed together could signify stronger couplings than those only interacted with or only committed; a combined model therefore has the potential to uncover such couplings.

We use the Mylyn interaction data to mine interaction couplings: programmers who interacted with method X also interacted with method Y. We use commits from the SCM to mine change couplings: programmers who changed method X also changed method Y. We applied these couplings to support interaction and commit prediction tasks. An Interaction Prediction (IP) refers to recommending software entities that may need to be interacted with for a task. A Commit Prediction (CP) refers to recommending software entities that may need to be committed for task completion. To mine Interaction Couplings (IC), we first extract Mylyn interaction files from the bug tracking system, then process them into transactions, and finally employ the association rule mining technique. Sections II.A, II.B, and II.D elaborate on these steps. To mine Change Couplings (CC), we first extract commits from the source code repository, then process them into transactions with additional parsing for the method-level granularity, and finally employ the association rule mining technique. Sections II.C and II.D elaborate on these steps. The commonality and orthogonality of interactions and commits is the guiding force in devising the combined models. The linchpin is the bug number. Two levels of information are considered: 1) the relationship of interactions and commits to a bug # in the issue tracking system and 2) the commonality and uniqueness of entities involved in interactions and commits. Note that a bug report is used to reference any type of issue (e.g., defect, enhancement, or feature) in the bug tracking system.
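To make the notion of mining such couplings concrete, here is a minimal sketch that derives one-to-many rules from transactions by simple co-occurrence counting with a minimum support threshold. The file names and the helper are hypothetical; this is not the authors' tooling, only an illustration of the underlying idea.

```python
from collections import defaultdict

def mine_rules(transactions, min_support):
    """Mine one-to-many rules: for each antecedent entity x, the consequent
    is every entity that co-occurred with x in >= min_support transactions."""
    pair_counts = defaultdict(int)
    for txn in transactions:
        txn = set(txn)
        for x in txn:
            for y in txn - {x}:
                pair_counts[(x, y)] += 1
    rules = defaultdict(set)
    for (x, y), count in pair_counts.items():
        if count >= min_support:
            rules[x].add(y)
    return dict(rules)

# Hypothetical interaction transactions: entities touched in one Mylyn task.
interactions = [
    {"A.java", "B.java"},
    {"A.java", "B.java", "C.java"},
    {"B.java", "D.java"},
]
ic = mine_rules(interactions, min_support=2)
# Only A and B co-occur at least twice, so they are coupled in both directions.
```

The same function applied to commit transactions would yield change couplings; only the input history differs.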
Sections II.E and II.F elaborate on these steps.

A. Interaction Data and Hosting Repositories

Interaction is the activity of programmers in an IDE during a development session, e.g., editing a file, referencing API documentation, or browsing the web from within the IDE. Different tools, such as Mylyn, have been developed to model programmers' actions in IDEs [10, 11, 19-21]. Mylyn monitors programmers' activities inside the Eclipse IDE and uses the data to create an Eclipse user interface focused around a task. The Mylyn interaction data consists of traces of interaction histories. Each historical record encapsulates the set of interaction events needed to complete a task. Once a task is defined and activated, the Mylyn monitor records all the interaction events, the smallest units of interaction within the IDE, for the active task. For each interaction, the monitor captures about eight different types of data attributes [12]. The structure handle attribute contains a unique identifier for the target element affected by the interaction. For example, the identifier of a Java class contains the names of the package, the file to which the class belongs, and the class. Similarly, the identifier of a Java method contains the names of the package, the file, and the class the method belongs to, the method name, and the parameter

type(s) of the method. Figure 1 shows an example of 4 consecutive Mylyn interaction events. For each active task, Mylyn creates an XML trace file called mylyn-context.zip that contains all the interaction events. A trace file contains the interaction history of a task. This file is typically attached to the project's issue tracking system, such as Bugzilla, Trac, or JIRA. The trace files for the Mylyn project are archived in the Eclipse bug tracking system as attachments to bug reports. A bug issue may contain multiple interaction traces; for example, one issue contains 12 trace attachments. Each trace has a unique attachment id and contains the mylyn/context/zip tag to distinguish it from others.

B. Extracting and Processing Interactions into Transactions

We first need to identify bug reports that contain mylyn-context.zip attachment(s), because not all bug issues contain interaction trace(s). To do so, we searched the Eclipse bug tracking system for bugs containing at least one mylyn-context.zip attachment. Another factor to consider is that not all interactions with a system result in committed changes to a source control system. If a bug issue is not fixed with a resolution, it is unlikely that a corresponding commit history exists. Thus, we only searched for bug issues with a Resolved status and a Fixed resolution. We developed a tool to process the search result, which performs the following major tasks:

Downloading trace files: the tool takes the search result from the Eclipse bug tracking site as input and automatically downloads all the trace files to a user-specified directory. The trace files all have the same name, mylyn-context.zip. The tool renames each file using the bug id and attachment id (separated by an underscore), giving each a unique identifier in the directory it resides in. Internally, the tool identifies the trace file id(s) for each bug issue.
If options are specified to output this result, the tool can save the bug ids with the corresponding trace ids in a Java properties file format, the key being the bug id and the values being a comma-separated list of trace ids. It uses the attachment URL pattern to download each trace file in the history by replacing X with the trace id.

Processing trace files: the tool takes the directory that contains the trace files as input and parses each trace file to identify the list of Java files and methods manipulated by each interaction history. We consider each trace file an interaction transaction. For each transaction, the tool outputs the issue number together with a tab-separated list of Java files and methods. We need the issue number to create a link between interaction and commit transactions. The targeted files and methods are identified from the structure handle of the interaction event. Figure 1 (A and C) shows a method name and a file name from events 1 and 3, respectively. Two types of patterns are used to identify file and method interaction targets from the structure handle. The pattern <P>/<S><<K>{<F>.java is used to identify file-level targets, and the pattern <P>/<S><<K>{<F>.java[<C>~<M> is used to identify method-level targets, where P is the name of the project, S is the directory structure containing the target, K is the package name, F is a Java file name with the .java extension, C is a class name, and M is a method name.

Removing noise from transactions: after parsing the trace files, the tool eliminates two types of noise from interaction transactions. Multiple interactions with the same target: Mylyn can create different types of interaction events on the same target in a single interaction history [12]. In Figure 1, the contextactivated method is manipulated by three different kinds of interactions. For the purpose of evolutionary couplings, we only need the first interaction with a software entity in a single interaction history.
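The two structure-handle patterns above can be matched with regular expressions along the following lines. The delimiter syntax and the example handle are reconstructed loosely from the description, so treat both as assumptions rather than Mylyn's exact format.

```python
import re

# Loosely follows <P>/<S><<K>{<F>.java and <P>/<S><<K>{<F>.java[<C>~<M>;
# the real Mylyn handle syntax may differ.
FILE_RE = re.compile(r"^(?P<P>[^/]+)/(?P<S>[^<]*)<(?P<K>[^{]*)\{(?P<F>[^\[]+\.java)")
METHOD_RE = re.compile(
    r"^(?P<P>[^/]+)/(?P<S>[^<]*)<(?P<K>[^{]*)\{(?P<F>[^\[]+\.java)\[(?P<C>[^~]+)~(?P<M>[^\[(]+)"
)

def parse_handle(handle):
    """Return (file, method) targets from a structure handle;
    method is None for file-level interactions."""
    m = METHOD_RE.match(handle)
    if m:
        return m.group("F"), m.group("M")
    f = FILE_RE.match(handle)
    return (f.group("F"), None) if f else (None, None)

# Hypothetical method-level handle built from the names in Figure 1.
h = ("org.eclipse.mylyn.context.ui/src<org.eclipse.mylyn.internal.context.ui"
     "{ContextEditorManager.java[ContextEditorManager~contextActivated")
```

A handle without the `[<C>~<M>` suffix falls through to the file-level pattern, yielding only the file name.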
The tool considers only the first interaction with an element and ignores subsequent events on it. This produces a list of unique files (methods). Unintended interactions: Mylyn does not have a mechanism to avoid unintended interactions with the system. As a result, some processes, such as automated processes and accidental interactions, may lead to unusually large interaction transactions. To avoid detecting interaction couplings from such interactions, we removed large transactions, i.e., those containing more elements than the 3rd quartile of the frequency distribution of the number of elements per interaction.

C. Extracting and Processing Commits into Transactions

Our approach also requires commit data from version archives, such as SVN and CVS. For detecting evolutionary couplings, we need files that have been changed together in a single commit operation. SVN preserves the atomicity of commit operations; however, older versions of CVS did not [22]. For a project hosted in an older CVS repository, we convert the CVS repository into an SVN repository using the CVS2SVN tool, which has been used in popular projects such as gcc. The tool mines file-level commit transactions from the SVN repository. For mining method-level transactions, we used a previously developed tool with some modifications to identify the issue number associated with each commit [16]. For each transaction, we extract the bug id from the commit message. Unless Mylyn is configured to generate an automatic commit message, the bug id may not be found in it. Unlike interaction transactions, commit transactions may not always be associated with a bug id. Similar to interaction transactions, we discarded large commit transactions, which could be due to a branch or merge operation in CVS. Our tool discards commit transactions containing more elements than the 3rd quartile of the frequency distribution of the number of elements per commit.

D.
Detecting Interaction and Change Couplings

Interaction Couplings (IC) identify software entities that were frequently navigated (viewed, changed, or both) together during a single session. Change Couplings (CC) identify software entities that were frequently committed together to a source code repository. In Mylyn, the unit of a session is a task to fix a defect or implement an enhancement request. For detecting both types of evolutionary couplings, we employed the association rule mining technique with different minimum support values, similar to Ying et al.

[9] and Zimmermann et al. [17], specifically the Apriori method [23]. Unlike Zimmermann et al.'s approach, we mined one-to-many association rules, so that the models can start predicting couplings with a single antecedent. Association rule mining is a data mining technique for discovering interesting relationships between different items, in this case software entities, from historical transactions. Let P = {E1, E2, ..., En} be a set of n software entities: fields, methods, classes, or files of a program. Let C = {C1, C2, ..., Cm} be a set of m change transactions and let I = {I1, I2, ..., In} be a set of n interaction transactions. Each element in C and I is a subset of P. An IC is defined as an association rule

X1 → X2 (1)

between two disjoint sets of program elements X1 and X2 in I. Similarly, a CC is defined as an association rule

Y1 → Y2 (2)

between two disjoint sets of program elements Y1 and Y2 in C. X1 and Y1 are called antecedents, X2 and Y2 are called consequents, and s is the minimum support.

E. Combining Commit and Interaction Transactions

To combine the two histories from interactions and commits in a systematic and synergetic way, we used the bug id as a common attribute. As pointed out in Section II.C, some commit transactions may not contain an associated bug id; thus, it is not possible to combine those commit transactions with their corresponding interaction transactions. The second point is that a bug id may be associated with one or more interaction and commit transactions, which results in different kinds of relationships between them. Table 1 highlights the six possible types of relationships that could exist between interaction and commit transactions at a file-level granularity. Additionally, all the entities changed in a commit may not be traceable to the corresponding interaction history.
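The first level of information, linking transactions to bug ids, can be sketched as follows: group hypothetical interaction and commit transactions by bug id and label the relationship of each bug. The bug ids and the helper are invented for illustration and do not come from the Mylyn dataset.

```python
from collections import defaultdict

def relationship_types(interaction_txns, commit_txns):
    """Classify, per bug id, the relationship between interaction
    transactions (ITs) and commit transactions (CTs) as 1-to-1,
    1-to-*, *-to-1, *-to-*, or one-sided. Each input is a list of
    (bug_id, entities) pairs; commits may carry bug_id None."""
    its, cts = defaultdict(int), defaultdict(int)
    for bug, _ in interaction_txns:
        its[bug] += 1
    for bug, _ in commit_txns:
        if bug is not None:          # some commits lack a bug id
            cts[bug] += 1
    kinds = {}
    for bug in set(its) | set(cts):
        ni, nc = its.get(bug, 0), cts.get(bug, 0)
        if ni and nc:
            kinds[bug] = f"{'1' if ni == 1 else '*'}-to-{'1' if nc == 1 else '*'}"
        elif ni:
            kinds[bug] = "interaction-only"
        else:
            kinds[bug] = "commit-only"
    return kinds

its = [(101, {"A.java"}), (102, {"A.java"}), (102, {"B.java"})]
cts = [(101, {"A.java"}), (103, {"C.java"}), (None, {"D.java"})]
kinds = relationship_types(its, cts)
# Bug 101 has one IT and one CT; bug 102 has only ITs; bug 103 only a CT.
```

Only bugs labeled with a two-sided relationship can participate in the redundancy-eliminating combinations described next.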
Considering the types of relationships that could exist between interactions and commits, we devised four different ways of combining interaction and commit histories using the bug id. Let I be the interaction dataset containing Ni interaction transactions and C be the commit dataset containing Nc commit transactions. Both I and C are multisets (mset: multiple-membership set), because duplicate transactions may exist [24]. Let f be a function that returns the multiplicity, the number of occurrences, of a transaction in I or C. The cardinality of an mset, the sum of the multiplicities of its elements, is the number of transactions constituting the dataset. Let B be the set of bug ids. Formally,

I = {((i, Bi), f(i)) : i is an IT and Bi ∈ B} (3)
C = {((c, Bc), f(c)) : c is a CT and (Bc = ∅ or Bc ∈ B)} (4)

where IT and CT stand for the considered interaction and commit transactions, respectively.

TABLE 1. SIX TYPES OF RELATIONSHIPS BETWEEN INTERACTION AND COMMIT TRANSACTIONS (txn stands for a transaction, and file is a Java file; bug ids omitted).

Bug Id | Interaction History | Commit History
1st | 1 txn containing 2 files | 1 txn containing 2 files
2nd | 1 txn containing 1 file | 2 txns each containing 1 unique file
3rd | 3 txns containing 11 unique files | 1 txn containing 2 files
4th | 4 txns containing 10 unique files | 3 txns each containing 2 unique files
5th | 1 txn containing 1 file | No txn
6th | No txn | 1 txn containing 4 files

For the first bug, the two interacted files were finally committed. For the second, the interacted file was finally committed in one of the corresponding commit transactions. For the third, the committed files were among the interacted files in all 3 interaction transactions. For the fourth, 2 of the interaction transactions were the same, the 3rd was a subset of the first 2, and the 4th contained 2 unique Java files; 2 of the files in the first 2 interaction transactions were in the 1st commit transaction; the files in one of the commits were not part of any of the 4 interaction transactions; and the 2 files in the 4th interaction transaction were found in the 3rd commit. Ideally, we would expect an interaction history to exist for each commit transaction, or vice versa; however, this is not always the case for the Mylyn dataset (see Table 1).

In the first combination dataset, denoted by P, we simply concatenate I and C one after the other without any regard to redundant information. The result is an mset with cardinality Ni + Nc; P is defined as the additive union of I and C:

P = I ⊎ C (5)

For the second combination dataset, denoted by Q, we attempted to eliminate redundant elements whenever a one-to-one correspondence is detected between an IT and a CT. A one-to-one relationship between an IT and a CT exists if and only if a bug id is associated with a single IT and a single CT; the first bug in Table 1 satisfies this condition:

Q = { ((i ∪ c, Bi), 1)                    if Bi = Bc and f(i) = f(c) = 1
    { ((i, Bi), f(i)), ((c, Bc), f(c))    otherwise                       (6)

In the third combination dataset, denoted by R, we also attempted to eliminate redundant elements whenever a relationship exists between an IT and a CT. A relationship between an IT and a CT exists if and only if transactions from I and C are associated with the same bug id; the relation could be 1-to-1, 1-to-*, *-to-1, or *-to-*. The bugs in Table 1 that have both interaction and commit transactions satisfy this condition:

R = { ((i ∪ c, Bi), 1)                    if Bi = Bc
    { ((i, Bi), f(i)), ((c, Bc), f(c))    otherwise   (7)

For the fourth combination dataset, denoted by S, we only consider related ITs and CTs, and exclude unrelated transactions from the datasets:

S = { ((i ∪ c, Bi), 1) : Bi = Bc } (8)

Next, we illustrate our point using the dataset presented in Table 1. The msets and the corresponding numbers of transactions constituting the six datasets are (bug ids omitted):

I = {((i1, ·), 1), ((i2, ·), 1), ((i3, ·), 1), ((i4, ·), 1), ((i5, ·), 1), ((i6, ·), 2), ((i7, ·), 1), ((i8, ·), 1), ((i9, ·), 1)}, |I| = 10

C = {((c1, ·), 1), ((c2, ·), 1), ((c3, ·), 1), ((c4, ·), 1), ((c5, ·), 1), ((c6, ·), 1), ((c7, ·), 1), ((c8, ·), 1)}, |C| = 8

P = {((p1, ·), 2), ((p2, ·), 2), ((p3, ·), 1), ((p4, ·), 1), ((p5, ·), 1), ((p6, ·), 1), ((p7, ·), 1), ((p8, ·), 2), ((p9, ·), 2), ((p10, ·), 1), ((p11, ·), 1), ((p12, ·), 1), ((p13, ·), 1), ((p14, ·), 1)}, |P| = 18

Q = {((q1, ·), 1), ((q2, ·), 2), ((q3, ·), 1), ((q4, ·), 1), ((q5, ·), 1), ((q6, ·), 1), ((q7, ·), 1), ((q8, ·), 2), ((q9, ·), 2), ((q10, ·), 1), ((q11, ·), 1), ((q12, ·), 1), ((q13, ·), 1), ((q14, ·), 1)}, |Q| = 17

R = {((r1, ·), 1), ((r2, ·), 1), ((r3, ·), 1), ((r4, ·), 1), ((r5, ·), 1), ((r6, ·), 1)}, |R| = 6

S = {((s1, ·), 1), ((s2, ·), 1), ((s3, ·), 1), ((s4, ·), 1)}, |S| = 4

|I|, |C|, |P|, |Q|, |R|, and |S| are the numbers of transactions constituting each dataset, computed by adding the multiplicities of the elements in the given dataset. The examples above show that P always results in the largest number of transactions and S always results in the smallest number of transactions among the datasets.

TABLE 2. MYLYN PROJECT INTERACTION AND COMMIT HISTORIES FOR THE PERIOD JUNE 18, 2007 TO JULY 01, 2011. Parameters: transactions; max. elements/transaction; min. elements/transaction; avg. elements per transaction; association with a bug id. Reported for interactions (3272 traces; files and methods) and commits (5093 revisions; files and methods). All interaction transactions are associated with a bug id.

F. Prediction Models

We detected evolutionary couplings from the six groups of datasets in Section II.E using association rule mining. The association rules form our coupling-based prediction models. We refer to the two individual models, corresponding to datasets I (1) and C (2), as the Interaction Model (IM) and the Commit Model (CM), respectively. The four combined models are referred to as CpM, CqM, CrM, and CsM, corresponding to datasets P (3), Q (4), R (5), and S (6), respectively (see Section II.E).

III.
EMPIRICAL EVALUATION

In the empirical study, we investigated how well our combined approaches for evolutionary couplings performed in predicting future interactions and commits (IP and CP). We simulate the perspective of a developer who is interacting within an IDE to implement (and commit code related to) a change request. The performance is assessed using two quantitative metrics from information retrieval.

A. Research Questions

We addressed the following research questions (RQ) in our case study:
RQ1. How do interaction- and change-based evolutionary couplings, trained from datasets I and C respectively, perform for IP and CP, and how do they compare with each other?
RQ2. How do the prediction models trained on the combined datasets perform for IP and CP, and how do they compare with the individual models from interactions and commits?
RQ3. How much do the different combined models differ in performance for IP and CP?

B. Subject Software System

The empirical evaluation requires an adequate amount of both interaction and commit history. We focused our evaluation on the Mylyn project, which contains about 4 years of interaction data. It is the Eclipse Foundation project with the largest number of interaction history attachments. It is mandatory for Mylyn project committers to use the Mylyn plug-in, which partly explains why there is more interaction data for the Mylyn project than for other Eclipse Foundation projects. Mylyn does not have interaction data for its entire lifetime: commit history started 2 years prior to that of interaction, and commits to the Mylyn CVS repository terminated on July 01, 2011. To get both interaction and commit histories within the same period, we considered the history between June 18, 2007 (the first day of interaction history attachment) and July 01, 2011 (the last day of commit to the Mylyn CVS repository).

1) Interaction Dataset: The Mylyn project consists of 2275 bug issues containing 3272 interaction trace files.
About 1721 (76%) of the bug issues are associated with only one trace file. After preprocessing the traces and filtering out noise, 2357 file-level and 2174 method-level transactions were identified. Table 2 provides information about the file and method levels of interaction transactions for the Mylyn project. There are more file-level interaction transactions than method-level interaction transactions. This difference may be due to the fact that Mylyn propagates lower-level interaction events into their parents: an event that takes place on a method, for instance, also affects the encompassing class, which in turn affects the encompassing file, package, and so on. The average number of files per transaction is greater than the average number of methods per transaction, which could be the result of interactions with more than one method in a single file.

2) Commit Dataset: The Mylyn project contains 5093 revision histories. Out of the 5093 change sets, 3727 revisions contain a change to at least one Java file and 2058 revisions contain a change to at least one Java method. About 3572 (96%) of the file-level changes and 1947 (95%) of the method-level changes are associated with bug issues. According to Table 2, there are more commit transactions than interaction transactions. This difference could happen if programmers use a single Mylyn task for more than one commit, or if they forget to create Mylyn tasks for every change they make, as required by the Mylyn committers guideline. Interaction transactions are larger in size than commit transactions; that is, programmers typically interact with more entities than they change.

C. Training and Testing Sets

Both the interaction and commit datasets are split into two groups: training and testing sets. The training sets are used to mine association rules, i.e., evolutionary couplings, and the testing sets are used to measure the effectiveness of the rules for IP and CP. We used the first 75% of the transactions for the training set and the remaining 25% for the testing set. We have two individual models, based on interactions and commits alone, and four combined models, based on different ways of combining interactions and commits from the same period of history (see Section II), at two levels of granularity. Therefore, we have a total of 12 models, 6 each at the file and method levels, and correspondingly a total of 12 training sets. Figure 2 shows the number of file- and method-level transactions for the six different groups of training sets. Our models are evaluated on two tasks, interaction and commit prediction (IP and CP), each at two levels of granularity (file and method). Therefore, four testing datasets were produced: two interaction-testing sets (one each for the file and method levels) for IP and two commit-testing sets (one each for the file and method levels) for CP. For IP, the number of file-level transactions was 589 and the number of method-level transactions was 543. For CP, the number of file-level transactions was 932 and the number of method-level transactions was 514. The 6 models trained at the file level were evaluated against the IP and CP file-level testing sets, and the 6 models trained at the method level against the IP and CP method-level testing sets.

D. Performance Metrics

To evaluate the accuracy of the six prediction models over all the transactions in the testing sets, we used two popular measures from information retrieval: precision and recall [25].
Precision is the proportion of predicted files/methods that are correct:

Precision (p) = TP / (TP + FP) (9)

Recall is the proportion of actual files/methods predicted correctly:

Recall (r) = TP / (TP + FN) (10)

where TP (true positives) are predicted files/methods that are relevant, FP (false positives) are predicted files/methods that are not relevant, and FN (false negatives) are relevant files/methods that are not predicted. For each transaction in the testing sets, we determined the first file/method to be interacted with (for IP) or the first file/method to be changed (for CP). Mylyn records the time stamp of each interaction event, so we used this value to determine the first file/method to be interacted with in each transaction of the interaction testing dataset. A commit transaction, however, does not identify the file/method that was changed first. Consequently, we make predictions assuming each file/method has an equal chance of being changed first; in this case, the precision and recall values for a CP become the average over all the predictions obtained by considering each element of a commit transaction as the starting point of the change.

Figure 2. Number of transactions in the training sets of the models.

If a model does not make any predictions, both precision and recall are undefined; in that case, we did not compute them. To account for this scenario, we report the probability of a model making predictions for IP and CP (regardless of whether the predictions are correct). This probability is termed likelihood and is given by [17]:

Likelihood (l) = Total Predictions / No. of Test Cases (11)

Unlike precision and recall, a testing set has only a single value for likelihood.

E. Hypotheses Testing

We derived testable hypotheses to evaluate our research questions.
We list only the null hypotheses because the alternative hypotheses can easily be derived from them.

H0-1: There is no difference among the precision values of the six models for file-level interaction predictions.
H0-2: There is no difference among the recall values of the six models for file-level interaction predictions.
H0-3: There is no difference among the precision values of the six models for method-level interaction predictions.
H0-4: There is no difference among the recall values of the six models for method-level interaction predictions.
H0-5: There is no difference among the precision values of the six models for file-level commit predictions.
H0-6: There is no difference among the recall values of the six models for file-level commit predictions.
H0-7: There is no difference among the precision values of the six models for method-level commit predictions.
H0-8: There is no difference among the recall values of the six models for method-level commit predictions.

Each pair of the above precision and recall hypotheses corresponds to one of the testing sets. For example, the interaction-testing set at the file level is used for hypotheses H0-1 and H0-2, and the one at the method level is used for hypotheses H0-3 and H0-4. Likewise, the commit-testing set at the file level is used for hypotheses H0-5 and H0-6, and the one at the method level is used for hypotheses H0-7 and H0-8. For each hypothesis, we compared the 6 models on one testing set at a

time and did not compare hypotheses and results on different testing sets against each other. Note that the stated null hypotheses refer to statistically significant differences.

To analyze the differences between the values reported by each model, we computed the average values of precision and recall for each support threshold. The precision and recall values are compared using a precision-recall curve. We performed an analysis of variance (ANOVA) test with α = 0.05 to validate whether there is a statistically significant difference between the models.

F. Evaluation Results

Figure 3 and Figure 4 show the precision and recall curves for IP and CP for the 3 support thresholds (1, 2, and 3) at the file (a) and method (b) levels of granularity. Each data point represents the average precision and recall over all the transactions in the testing set. The lines connecting the precision-recall pairs at each threshold show the trade-off between precision and recall. Figure 5 shows the outcome of the ANOVA test. Note that the metric values in the charts are reported as fractions.

1) Interaction Prediction (IP): File level: From Figure 3(a), at the file-level IP, we can see that IM and the combination models resulted in similar performance, with the exception of CM. CqM and CrM achieved the highest recall value, with a gain of 3% as compared to IM and a loss of 1% in precision. CM exhibited the highest precision value, with a 4% increase compared to IM. However, CM returned the lowest recall value, with a 15% decrease as compared to IM. All the combination models, except CsM, achieved a higher likelihood value than IM. CpM achieved the maximum likelihood gain of 4%.

Method level: Figure 3(b) shows the precision-recall curve for the method-level IP. Both CM and CrM gained precision over IM. CM showed a 13% increase while losing 5% in recall, whereas CrM showed a 4% increase without any loss in recall. CrM showed a 2% recall increase without any loss in precision.
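The one-way ANOVA test used for the hypothesis tests (Section E, α = 0.05) can be sketched as follows. The per-model recall samples here are invented for illustration, not taken from the study's measurements.

```python
from math import fsum

def one_way_anova_f(groups):
    """One-way ANOVA F statistic: between-group variance over within-group variance."""
    k = len(groups)                                    # number of models compared
    n = sum(len(g) for g in groups)                    # total number of observations
    grand_mean = fsum(x for g in groups for x in g) / n
    means = [fsum(g) / len(g) for g in groups]
    ss_between = fsum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = fsum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical per-transaction recall values for three models.
recall_im = [0.40, 0.35, 0.50, 0.45, 0.42]
recall_cm = [0.25, 0.20, 0.30, 0.28, 0.22]
recall_cqm = [0.43, 0.38, 0.52, 0.47, 0.44]

f = one_way_anova_f([recall_im, recall_cm, recall_cqm])
# With 2 and 12 degrees of freedom, the critical F value at alpha = 0.05 is
# roughly 3.89; an F statistic above it indicates a significant difference.
print(f > 3.89)  # → True
```

In practice one would obtain the p-value directly (e.g., `scipy.stats.f_oneway`) rather than comparing against a tabulated critical value.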
Both CpM and CqM exhibited a 2% increase in likelihood. Usually, an increase in the minimum support should increase precision and decrease recall; however, CM exhibited a decrease in precision at the support of 3, and recall was the same across the support values for the method-level IP. This exception suggests that the coupling is perhaps stronger among committed elements than among interacted ones, i.e., the consequences of not applying the required co-changes are more severe than those of missing co-interactions.

For the method-level IP, CpM and CqM resulted in almost identical performance for precision and recall. From equation (5) and equation (6), we can see that the numbers of transactions in the training datasets for CpM and CqM are practically very close. In the example in Section II.E, the cardinalities of P and Q are 18 and 17. This similarity in the training datasets of CpM and CqM resulted in the same performance across the three support values for the method-level IP and two out of the three support values for the file-level IP.

From Figure 5, we can see that there are statistically significant differences between CM and the other models in terms of precision at both the file and method levels. Therefore, we reject H0-1 and H0-3. There are also statistically significant differences between CM and the other models in terms of recall at the file-level granularity; however, there are none at the method level. Therefore, we reject H0-2 and accept H0-4.

2) Commit Prediction (CP): File level: From Figure 4(a), for the file-level CP, the combined models did not show any improvements in precision as compared to CM. However, CqM gained 13% in recall at the loss of 2% in precision. CrM also showed a 10% increase in recall with a 1% loss in precision. All the combination models achieved a higher likelihood value than CM. CpM achieved a maximum likelihood gain of 15%.

Method level: Figure 4(b) shows the precision-recall curve for the method-level CP.
In terms of precision, the other models did not perform as well as CM. However, CrM displayed a 4% increase in recall with a 13% decrease in precision. All the combination models achieved a higher likelihood value than CM. CpM achieved a maximum likelihood gain of 14%. CM exhibited an increase in recall at the support of 3 for the method-level CP.

From Figure 2, the numbers of file-level and method-level transactions for CM are much larger than the respective numbers of transactions for CsM. In the example provided in Section II.E, the cardinalities of C and S are 8 and 4. Despite this large difference in the numbers of training transactions, CsM resulted in a higher recall value for both file- and method-level IP and CP. This fact indicates that coupling is determined not only by the number of transactions in the training history but also by the number of elements per transaction. From equation (8), the training dataset for CsM includes some transactions from the interaction dataset, and interaction transactions contain a larger number of elements per transaction than commit transactions, which resulted in a higher recall for CsM than for CM.

From Figure 5, we can see that there are statistically significant differences between CM and the other models in terms of precision at both the file and method levels. There are statistically significant differences in precision between IM and each of CpM, CqM, and CrM at the method-level CP. Also, statistically significant differences were observed between CsM and each of CpM and CqM in terms of precision at the method-level CP. Therefore, we reject H0-5 and H0-7. There are also statistically significant differences between CM and each of CpM, CqM, and CrM in recall at the file level. Therefore, we reject H0-6 and accept H0-8.

G. Answering Research Questions (RQs)

To answer the research questions, the performances of the different prediction models on each of the four testing sets were examined.
In answering RQ1, the average performances of IM and CM for IP and CP across the three support thresholds were compared at the file and method levels of granularity. For the file-level IP, IM performed better than CM, with 16% and 18% improvements in recall and likelihood respectively; however, IM exhibited a 3% decrease in precision. For the method-level IP, CM performed better than IM, with an 11% increase in precision at a loss of 4% in recall. For the file-level CP, IM outperformed CM with a gain of 8% in recall and 9% in likelihood, at a loss of 2% in precision. For the method-level CP, CM outperformed IM with a gain of 24% in precision and 8% in recall, at a loss of 9% in likelihood. Overall, CM is better in precision and IM is better in recall, except for the method-level CP.

Figure 3. Precision vs. recall curves of the different prediction models for IP on the Mylyn project at minimum supports of 1, 2, and 3.

Figure 4. Precision vs. recall curves of the individual and combined models for CP on the Mylyn dataset at minimum supports of 1, 2, and 3. Parts (a) and (b) are for files and methods.

Figure 5. A heat map summarizing the hypothesis test results across all the prediction models for the minimum support of 1. Significant difference: cells colored black indicate a significant difference at both the method and file levels; dark gray at only the file level; light gray at only the method level; white at neither level.

In answering RQ2, the average performances of the combined models across the three thresholds were compared with the average performances of IM for IP and with the average performances of CM for CP. For the file-level IP, the combination models outperformed IM with 1%, 1%, and 4% gains in precision, recall, and likelihood respectively. For the method-level IP, the combination models outperformed IM with a gain of 4% in precision, 1% in recall, and 2% in likelihood. Overall, the combination models did not register a promising result for IP; their maximum performance gains were 4% in likelihood for the file-level IP and 4% in precision for the method-level IP. The combination models, however, showed a promising result for the file-level CP, displaying 11% and 15% increases in recall and likelihood with a 1% trade-off in precision. For the method-level CP, CM performed better than the combination models: it exhibited 19% and 1% gains in precision and recall respectively, with a loss of 14% in likelihood.

In answering RQ3, the average performances of the combined models across the three thresholds were compared for the file and method levels of IP and CP. For the file-level IP, the difference in precision among the four combined models is 1%, and the differences in recall are between 1% and 2%. For the method-level IP, the differences in precision among the four combined models are between 1% and 2%, and the difference in recall is at most 3%. For the file-level CP, the difference in precision among the four combined models is 1%, and the differences in recall are between 1% and 5%. For the method-level CP, the differences in precision among the four combined models are between 1% and 2%, and the difference in recall is at most 2%. Significant differences are observed between CpM and CsM, and between CqM and CsM, for the precision of the method-level CP. Overall, CpM and CqM are better in recall and likelihood, and CrM and CsM are better in precision.

IV. THREATS TO VALIDITY

We discuss internal, construct, and external threats to the validity of the results of our empirical study.

Incomplete or Missing Interaction History: Although a common period was considered for extracting the interaction and commit datasets in the Mylyn dataset, the number of commit transactions is significantly higher than the number of interaction transactions. This difference may not be the result of a single task being defined for multiple commits, because there are many cases in which committed files were never part of one of the corresponding interaction transactions.

Data Extraction Errors: We used two adequately vetted tools to extract method-level interaction and commit transactions; however, it is possible that the unforeseen error rates of the two tools differed.

CVS to SVN Conversion: We do not know the error rate of CVS2SVN when grouping individual CVS file revisions into change sets. It may erroneously split a commit into multiple commits, or group multiple commits into one. There are 1366 more commits than interaction traces for the same period. This difference could be due to errors introduced by CVS2SVN.

Explicit Bug Id Linkage: We considered interactions and commits to be related only if an explicit bug id was mentioned in them. Other, implicit relationships were not considered.

Training and Testing Set Split: We considered only a 75%:25% split between the training and testing sets. It is possible that a different split point could produce different results.

Single Period of History: We considered only the history between June 18, 2007 and July 01. It is possible that this history is not reflective of the optimum results for all the models. A different history period might produce different results in terms of their relative performance.

Performance Metrics: We considered the precision and recall metrics for the evaluation. One could also use other derived metrics, such as the F-measure; however, we wanted to analyze the performance differences with multiple orthogonal metrics.

Only One System Considered: Due to the lack of adequate Mylyn interaction histories for open source projects, our validation study was performed only on a single system written in Java. It was the one with the largest available dataset within the Eclipse Foundation, with over 2600 fixed bug reports that contained at least one interaction trace attachment. The second and third largest projects (the Eclipse Platform and Modeling) had about 700 and 450 such bugs. Nonetheless, this fact may limit the generality of our results.

V. RELATED WORK

We discuss related evolutionary coupling mining approaches. Our goal is not to exhaustively detail this large body of literature, but to briefly discuss a few representatives.

A. Evolutionary Couplings from Programmers' Interactions

A number of research efforts have used interaction information to mine evolutionary couplings. Researchers have developed IDE plug-ins to capture programmers' interactions during programming activities [10, 11, 21]. NavTracks, a complementary tool to the Eclipse Package Explorer, keeps track of the navigation history of software developers. The tool provides information concerning the recent actions of a programmer on a local copy of a development project; this information was used to mine IC at the file-level granularity [11]. Team Track [10] also records programmers' interactions with projects, files, classes, and members by continuously tracking the position of the mouse cursor every second. The information was then used to provide navigation support to programmers unfamiliar with the code base. In HeatMaps [21], the interestingness of a programming element is determined by computing a Degree-of-Interest (DOI) value based on its historical selection and modification. If an artifact is found interesting, it is decorated with colors to indicate its importance to the task. Zou et al. [13] used interaction history to identify evolutionary information about a development process, such as that restructuring is more costly than other maintenance activities. Robbes et al. [7] developed an incremental change-based repository, built by retrieving program information from an IDE, which includes more information about the evolution of a system than a traditional SCM, to identify refactoring events. Parnin and Gorg [26] identified methods relevant to the current task by using programmers' interactions with an IDE. Kobayashi et al. [15] presented a Change Guide Graph (CGC) based on interaction information to guide programmers to the location of the next change. Each node in the graph represents a changed artifact and each edge represents a relation between consecutive changes.
Following the CGC, the next target in the change sequence can be identified. Logical couplings have also been detected by combining interaction history with other sources of information about a program. Schneider et al. [19] presented a visual tool for mining local interaction histories to help address some of the awareness problems experienced in distributed software development projects. Both interaction history and static dependencies were used to provide a set of potentially interesting elements for a programming task. Change histories from SCMs, such as CVS and SVN, do not track the sequence of edits in a change set. Robbes et al. [4] proposed an alternative approach to predict sequential change couplings from recorded programmers' activities in the IDE, and used the data to evaluate existing change prediction approaches.

B. Evolutionary Couplings from Commits

Ying et al. [9] used an association rule mining algorithm to mine evolutionary couplings from commits. They also provide an interestingness value for each recommendation, which indicates its surprise factor, i.e., entities that are not apparent to a developer from their primitive knowledge of the source code. Canfora et al. [27] used both CVS and Bugzilla data to perform impact analysis. Their method exploits information retrieval algorithms to link the change request description and the set of historical source files in repositories; they use textual similarity to retrieve past change requests (CRs) similar to a new CR. Fluri et al. [5] focused on adding structural change information to release history data. They discarded changes related to textual modifications, such as updates in license terms, because these could indicate false couplings between files. Kagdi et al. [16, 28] provide a model that combines evolutionary couplings with estimated changes identified by traditional impact analysis techniques. Zimmermann et al. [17] presented a tool, named ROSE, to mine evolutionary couplings from CVS commits.
They used a sliding window technique to identify commits and applied association rule mining. Similar to Ying et al. [9] and Zimmermann et al. [17], our approach uses association rule mining to obtain evolutionary couplings from commits. Other approaches that use static analysis for impact analysis are discussed in [1, 29, 30], and those that use dynamic analysis are discussed in [2, 31, 32]; their discussion is out of scope here.

C. Comparison of Our Approach with Existing Approaches

From the above discussion, it can be seen that none of these approaches used combinations of interaction and commit histories for IP and CP. We presented four combined models of commit and interaction histories at the file and method levels. Also, we mined CC and IC from each of these combined datasets. We performed two different empirical comparisons: one


More information

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

More information

How To Find Influence Between Two Concepts In A Network

How To Find Influence Between Two Concepts In A Network 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Influence Discovery in Semantic Networks: An Initial Approach Marcello Trovati and Ovidiu Bagdasar School of Computing

More information

Classification of Titanic Passenger Data and Chances of Surviving the Disaster Data Mining with Weka and Kaggle Competition Data

Classification of Titanic Passenger Data and Chances of Surviving the Disaster Data Mining with Weka and Kaggle Competition Data Proceedings of Student-Faculty Research Day, CSIS, Pace University, May 2 nd, 2014 Classification of Titanic Passenger Data and Chances of Surviving the Disaster Data Mining with Weka and Kaggle Competition

More information

Empirical study of Software Quality Evaluation in Agile Methodology Using Traditional Metrics

Empirical study of Software Quality Evaluation in Agile Methodology Using Traditional Metrics Empirical study of Software Quality Evaluation in Agile Methodology Using Traditional Metrics Kumi Jinzenji NTT Software Innovation Canter NTT Corporation Tokyo, Japan jinzenji.kumi@lab.ntt.co.jp Takashi

More information

Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

More information

Baseline Code Analysis Using McCabe IQ

Baseline Code Analysis Using McCabe IQ White Paper Table of Contents What is Baseline Code Analysis?.....2 Importance of Baseline Code Analysis...2 The Objectives of Baseline Code Analysis...4 Best Practices for Baseline Code Analysis...4 Challenges

More information

SOFTWARE TESTING TRAINING COURSES CONTENTS

SOFTWARE TESTING TRAINING COURSES CONTENTS SOFTWARE TESTING TRAINING COURSES CONTENTS 1 Unit I Description Objectves Duration Contents Software Testing Fundamentals and Best Practices This training course will give basic understanding on software

More information

Product Line Development - Seite 8/42 Strategy

Product Line Development - Seite 8/42 Strategy Controlling Software Product Line Evolution An infrastructure on top of configuration management Michalis Anastasopoulos michalis.anastasopoulos@iese.fraunhofer.de Outline Foundations Problem Statement

More information

Identifying Market Price Levels using Differential Evolution

Identifying Market Price Levels using Differential Evolution Identifying Market Price Levels using Differential Evolution Michael Mayo University of Waikato, Hamilton, New Zealand mmayo@waikato.ac.nz WWW home page: http://www.cs.waikato.ac.nz/~mmayo/ Abstract. Evolutionary

More information

http://www.jstor.org This content downloaded on Tue, 19 Feb 2013 17:28:43 PM All use subject to JSTOR Terms and Conditions

http://www.jstor.org This content downloaded on Tue, 19 Feb 2013 17:28:43 PM All use subject to JSTOR Terms and Conditions A Significance Test for Time Series Analysis Author(s): W. Allen Wallis and Geoffrey H. Moore Reviewed work(s): Source: Journal of the American Statistical Association, Vol. 36, No. 215 (Sep., 1941), pp.

More information

JOURNAL OF OBJECT TECHNOLOGY

JOURNAL OF OBJECT TECHNOLOGY JOURNAL OF OBJECT TECHNOLOGY Online at http://www.jot.fm. Published by ETH Zurich, Chair of Software Engineering JOT, 2007 Vol. 6, No. 1, January-February 2007 CM Configuration Change Management John D.

More information

sql-schema-comparer: Support of Multi-Language Refactoring with Relational Databases

sql-schema-comparer: Support of Multi-Language Refactoring with Relational Databases sql-schema-comparer: Support of Multi-Language Refactoring with Relational Databases Hagen Schink Institute of Technical and Business Information Systems Otto-von-Guericke-University Magdeburg, Germany

More information

Predicting the Stock Market with News Articles

Predicting the Stock Market with News Articles Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is

More information

tools that make every developer a quality expert

tools that make every developer a quality expert tools that make every developer a quality expert Google: www.google.com Copyright 2006-2010, Google,Inc.. All rights are reserved. Google is a registered trademark of Google, Inc. and CodePro AnalytiX

More information

Data Collection from Open Source Software Repositories

Data Collection from Open Source Software Repositories Data Collection from Open Source Software Repositories GORAN MAUΕ A, TIHANA GALINAC GRBAC SEIP LABORATORY FACULTY OF ENGINEERING UNIVERSITY OF RIJEKA, CROATIA Software Defect Prediction (SDP) Aim: Focus

More information

A Stock Pattern Recognition Algorithm Based on Neural Networks

A Stock Pattern Recognition Algorithm Based on Neural Networks A Stock Pattern Recognition Algorithm Based on Neural Networks Xinyu Guo guoxinyu@icst.pku.edu.cn Xun Liang liangxun@icst.pku.edu.cn Xiang Li lixiang@icst.pku.edu.cn Abstract pattern respectively. Recent

More information

UNIVERSITY OF WATERLOO Software Engineering. Analysis of Different High-Level Interface Options for the Automation Messaging Tool

UNIVERSITY OF WATERLOO Software Engineering. Analysis of Different High-Level Interface Options for the Automation Messaging Tool UNIVERSITY OF WATERLOO Software Engineering Analysis of Different High-Level Interface Options for the Automation Messaging Tool Deloitte Inc. Toronto, ON M5K 1B9 Prepared By Matthew Stephan Student ID:

More information

Surround SCM Best Practices

Surround SCM Best Practices Surround SCM Best Practices This document addresses some of the common activities in Surround SCM and offers best practices for each. These best practices are designed with Surround SCM users in mind,

More information

On Correlating Performance Metrics

On Correlating Performance Metrics On Correlating Performance Metrics Yiping Ding and Chris Thornley BMC Software, Inc. Kenneth Newman BMC Software, Inc. University of Massachusetts, Boston Performance metrics and their measurements are

More information

SOFTWARE PROCESS MINING

SOFTWARE PROCESS MINING SOFTWARE PROCESS MINING DR. VLADIMIR RUBIN LEAD IT ARCHITECT & CONSULTANT @ DR. RUBIN IT CONSULTING LEAD RESEARCH FELLOW @ PAIS LAB / HSE ANNOTATION Nowadays, in the era of social, mobile and cloud computing,

More information

Choosing the Best Classification Performance Metric for Wrapper-based Software Metric Selection for Defect Prediction

Choosing the Best Classification Performance Metric for Wrapper-based Software Metric Selection for Defect Prediction Choosing the Best Classification Performance Metric for Wrapper-based Software Metric Selection for Defect Prediction Huanjing Wang Western Kentucky University huanjing.wang@wku.edu Taghi M. Khoshgoftaar

More information

JRefleX: Towards Supporting Small Student Software Teams

JRefleX: Towards Supporting Small Student Software Teams JRefleX: Towards Supporting Small Student Software Teams Kenny Wong, Warren Blanchet, Ying Liu, Curtis Schofield, Eleni Stroulia, Zhenchang Xing Department of Computing Science University of Alberta {kenw,blanchet,yingl,schofiel,stroulia,xing}@cs.ualberta.ca

More information

An Oracle White Paper September 2011. Oracle Team Productivity Center

An Oracle White Paper September 2011. Oracle Team Productivity Center Oracle Team Productivity Center Overview An Oracle White Paper September 2011 Oracle Team Productivity Center Overview Oracle Team Productivity Center Overview Introduction... 1 Installation... 2 Architecture...

More information

CODE ASSESSMENT METHODOLOGY PROJECT (CAMP) Comparative Evaluation:

CODE ASSESSMENT METHODOLOGY PROJECT (CAMP) Comparative Evaluation: This document contains information exempt from mandatory disclosure under the FOIA. Exemptions 2 and 4 apply. CODE ASSESSMENT METHODOLOGY PROJECT (CAMP) Comparative Evaluation: Coverity Prevent 2.4.0 Fortify

More information

The Real Challenges of Configuration Management

The Real Challenges of Configuration Management The Real Challenges of Configuration Management McCabe & Associates Table of Contents The Real Challenges of CM 3 Introduction 3 Parallel Development 3 Maintaining Multiple Releases 3 Rapid Development

More information

CS229 Project Report Automated Stock Trading Using Machine Learning Algorithms

CS229 Project Report Automated Stock Trading Using Machine Learning Algorithms CS229 roject Report Automated Stock Trading Using Machine Learning Algorithms Tianxin Dai tianxind@stanford.edu Arpan Shah ashah29@stanford.edu Hongxia Zhong hongxia.zhong@stanford.edu 1. Introduction

More information

Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.

Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques. International Journal of Emerging Research in Management &Technology Research Article October 2015 Comparative Study of Various Decision Tree Classification Algorithm Using WEKA Purva Sewaiwar, Kamal Kant

More information

Module 10. Coding and Testing. Version 2 CSE IIT, Kharagpur

Module 10. Coding and Testing. Version 2 CSE IIT, Kharagpur Module 10 Coding and Testing Lesson 26 Debugging, Integration and System Testing Specific Instructional Objectives At the end of this lesson the student would be able to: Explain why debugging is needed.

More information

Oracle Real Time Decisions

Oracle Real Time Decisions A Product Review James Taylor CEO CONTENTS Introducing Decision Management Systems Oracle Real Time Decisions Product Architecture Key Features Availability Conclusion Oracle Real Time Decisions (RTD)

More information

Neovision2 Performance Evaluation Protocol

Neovision2 Performance Evaluation Protocol Neovision2 Performance Evaluation Protocol Version 3.0 4/16/2012 Public Release Prepared by Rajmadhan Ekambaram rajmadhan@mail.usf.edu Dmitry Goldgof, Ph.D. goldgof@cse.usf.edu Rangachar Kasturi, Ph.D.

More information

Discovering loners and phantoms in commit and issue data

Discovering loners and phantoms in commit and issue data Zurich Open Repository and Archive University of Zurich Main Library Strickhofstrasse 39 CH-8057 Zurich www.zora.uzh.ch Year: 2015 Discovering loners and phantoms in commit and issue data Schermann, Gerald;

More information

Error Log Processing for Accurate Failure Prediction. Humboldt-UniversitΓ€t zu Berlin

Error Log Processing for Accurate Failure Prediction. Humboldt-UniversitΓ€t zu Berlin Error Log Processing for Accurate Failure Prediction Felix Salfner ICSI Berkeley Steffen Tschirpke Humboldt-UniversitΓ€t zu Berlin Introduction Context of work: Error-based online failure prediction: error

More information

An Experiment on the Effect of Design Recording on Impact Analysis

An Experiment on the Effect of Design Recording on Impact Analysis An Experiment on the Effect of Design Recording on Impact Analysis F. Abbattista, F. Lanubile, G. Mastelloni, and G. Visaggio Dipartimento di Informatica University of Bari, Italy Abstract An experimental

More information

Data Migration Service An Overview

Data Migration Service An Overview Metalogic Systems Pvt Ltd J 1/1, Block EP & GP, Sector V, Salt Lake Electronic Complex, Calcutta 700091 Phones: +91 33 2357-8991 to 8994 Fax: +91 33 2357-8989 Metalogic Systems: Data Migration Services

More information

Characterizing and Predicting Blocking Bugs in Open Source Projects

Characterizing and Predicting Blocking Bugs in Open Source Projects Characterizing and Predicting Blocking Bugs in Open Source Projects Harold Valdivia Garcia and Emad Shihab Department of Software Engineering Rochester Institute of Technology Rochester, NY, USA {hv1710,

More information

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Eventia Log Parsing Editor 1.0 Administration Guide

Eventia Log Parsing Editor 1.0 Administration Guide Eventia Log Parsing Editor 1.0 Administration Guide Revised: November 28, 2007 In This Document Overview page 2 Installation and Supported Platforms page 4 Menus and Main Window page 5 Creating Parsing

More information

Assisting bug Triage in Large Open Source Projects Using Approximate String Matching

Assisting bug Triage in Large Open Source Projects Using Approximate String Matching Assisting bug Triage in Large Open Source Projects Using Approximate String Matching Amir H. Moin and GΓΌnter Neumann Language Technology (LT) Lab. German Research Center for Artificial Intelligence (DFKI)

More information

Deposit Identification Utility and Visualization Tool

Deposit Identification Utility and Visualization Tool Deposit Identification Utility and Visualization Tool Colorado School of Mines Field Session Summer 2014 David Alexander Jeremy Kerr Luke McPherson Introduction Newmont Mining Corporation was founded in

More information

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

More information

Mining Metrics to Predict Component Failures

Mining Metrics to Predict Component Failures Mining Metrics to Predict Component Failures Nachiappan Nagappan, Microsoft Research Thomas Ball, Microsoft Research Andreas Zeller, Saarland University Overview Introduction Hypothesis and high level

More information

Software Configuration Management. Context. Learning Objectives

Software Configuration Management. Context. Learning Objectives Software Configuration Management Wolfgang Emmerich Professor of Distributed Computing University College London http://sse.cs.ucl.ac.uk Context Requirements Inception Elaboration Construction Transition

More information

Assisting bug Triage in Large Open Source Projects Using Approximate String Matching

Assisting bug Triage in Large Open Source Projects Using Approximate String Matching Assisting bug Triage in Large Open Source Projects Using Approximate String Matching Amir H. Moin and GΓΌnter Neumann Language Technology (LT) Lab. German Research Center for Artificial Intelligence (DFKI)

More information

Measurement Information Model

Measurement Information Model mcgarry02.qxd 9/7/01 1:27 PM Page 13 2 Information Model This chapter describes one of the fundamental measurement concepts of Practical Software, the Information Model. The Information Model provides

More information

Interactive Recovery of Requirements Traceability Links Using User Feedback and Configuration Management Logs

Interactive Recovery of Requirements Traceability Links Using User Feedback and Configuration Management Logs Interactive Recovery of Requirements Traceability Links Using User Feedback and Configuration Management Logs Ryosuke Tsuchiya 1, Hironori Washizaki 1, Yoshiaki Fukazawa 1, Keishi Oshima 2, and Ryota Mibe

More information

WIRD AG Solution Proposal Project- & Portfolio-Management

WIRD AG Solution Proposal Project- & Portfolio-Management WIRD AG Solution Proposal Project- & Portfolio-Management Overview In order to address the need to control resources, time and cost in projects and in order to develop applications for System z, Wird AG,

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

A Case Retrieval Method for Knowledge-Based Software Process Tailoring Using Structural Similarity

A Case Retrieval Method for Knowledge-Based Software Process Tailoring Using Structural Similarity A Case Retrieval Method for Knowledge-Based Software Process Tailoring Using Structural Similarity Dongwon Kang 1, In-Gwon Song 1, Seunghun Park 1, Doo-Hwan Bae 1, Hoon-Kyu Kim 2, and Nobok Lee 2 1 Department

More information

Creating Short-term Stockmarket Trading Strategies using Artificial Neural Networks: A Case Study

Creating Short-term Stockmarket Trading Strategies using Artificial Neural Networks: A Case Study Creating Short-term Stockmarket Trading Strategies using Artificial Neural Networks: A Case Study Bruce Vanstone, Tobias Hahn Abstract Developing short-term stockmarket trading systems is a complex process,

More information

EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH CERN ACCELERATORS AND TECHNOLOGY SECTOR

EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH CERN ACCELERATORS AND TECHNOLOGY SECTOR EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH CERN ACCELERATORS AND TECHNOLOGY SECTOR CERN-ATS-2011-213 THE SOFTWARE IMPROVEMENT PROCESS - TOOLS AND RULES TO ENCOURAGE QUALITY K. Sigerud, V. Baggiolini, CERN,

More information

Component visualization methods for large legacy software in C/C++

Component visualization methods for large legacy software in C/C++ Annales Mathematicae et Informaticae 44 (2015) pp. 23 33 http://ami.ektf.hu Component visualization methods for large legacy software in C/C++ MΓ‘tΓ© CserΓ©p a, DΓ‘niel Krupp b a EΓΆtvΓΆs LorΓ‘nd University mcserep@caesar.elte.hu

More information

OMBEA Response. User Guide ver. 1.4.0

OMBEA Response. User Guide ver. 1.4.0 OMBEA Response User Guide ver. 1.4.0 OMBEA Response User Guide Thank you for choosing an Audience Response System from OMBEA. Please visit www.ombea.com to get the latest updates for your system. Copyright

More information

Performance evaluation of Web Information Retrieval Systems and its application to e-business

Performance evaluation of Web Information Retrieval Systems and its application to e-business Performance evaluation of Web Information Retrieval Systems and its application to e-business Fidel Cacheda, Angel ViΓ±a Departament of Information and Comunications Technologies Facultad de InformΓ‘tica,

More information

Understanding Web personalization with Web Usage Mining and its Application: Recommender System

Understanding Web personalization with Web Usage Mining and its Application: Recommender System Understanding Web personalization with Web Usage Mining and its Application: Recommender System Manoj Swami 1, Prof. Manasi Kulkarni 2 1 M.Tech (Computer-NIMS), VJTI, Mumbai. 2 Department of Computer Technology,

More information

A Business Process Driven Approach for Generating Software Modules

A Business Process Driven Approach for Generating Software Modules A Business Process Driven Approach for Generating Software Modules Xulin Zhao, Ying Zou Dept. of Electrical and Computer Engineering, Queen s University, Kingston, ON, Canada SUMMARY Business processes

More information