Predicting More from Less: Synergies of Learning

Ekrem Kocaguneli, Bojan Cukic, Huihua Lu
Lane Department of Computer Science and Electrical Engineering
West Virginia University, Morgantown, WV, USA

Abstract: Thanks to the ever-increasing importance of project data, its collection has been one of the primary focuses of software organizations. Data collection activities have resulted in the availability of massive amounts of data through software data repositories. This is great news for predictive modeling research in software engineering. However, the widely used supervised methods for predictive modeling require labeled data that is relevant to the local context of a project. This requirement cannot be met by many of the available data sets, which introduces new challenges for software engineering research. How can data be transferred between different contexts? How can an insufficient number of labeled instances be handled? In this position paper, we investigate synergies between different learning methods (transfer, semi-supervised and active learning) which may overcome these challenges.

I. INTRODUCTION

Predictive modeling and estimation methods are important research directions in software engineering (SE). The targets of predictive models differ [1]: software quality [10], software effort estimation [13], software defects [28], release scheduling [36] and so on. Thanks to the recent emphasis placed on data collection activities, we have access to massive publicly available sources of software engineering data. In fact, we have access to more open source project data than at any other time in history. For example, SourceForge currently hosts 300K projects with a user base of 2M, GoogleCode hosts 250K open source projects, and there is an abundance of SE repositories, e.g. ISBSG [23], PROMISE [27], Eclipse Bug Data [42] and TukuTuku. When an organization has no local data or the local data is outdated, i.e.
it is no longer representative of current practices, transferring the available data of other organizations or adapting the outdated data to current contexts may be helpful. The transfer of data between organizations or time frames is addressed by transfer learning research [25]. The publicly available data sets offer an opportunity to create better software engineering processes even for organizations without any local data of their own. Therefore, effective methods to transfer data from the domain of one organization to another appear to be an important, yet challenging, research direction. The performance of transfer learning (a.k.a. cross-company estimation in SE) methods remains questionable, with both supporting and opposing evidence [15], [41]. Recently, relevancy filtering has been reported to be a good alternative for improving the performance of transferred data, to the extent that its performance is statistically indistinguishable from that of local data [17], [38]. However, relevancy filtering approaches transfer only a limited number of training instances relevant to the local domain. When only a limited amount of labeled training instances is available, we typically try to supplement learning by predicting labels for the unlabeled training data. The predicted labels (so-called "estimated labels") are evaluated according to a confidence function and only the ones with high confidence are kept. This process is repeated until all the instances are labeled. This issue is handled by semi-supervised learning research [24]. The complete lack of labels in a given data set requires unsupervised learning methods that can provide a labeling order. This ordering is expected to start from the most informative instances and continue towards the least informative ones. This issue is handled by active learning research [9], [16], [40].
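The semi-supervised labeling loop just described (predict labels for unlabeled instances, keep only the high-confidence estimates, repeat until everything is labeled) can be sketched as follows. This is a minimal illustration rather than code from any cited study; the 1-NN learner, the distance-based confidence function, and the toy data are illustrative assumptions.

```python
# Illustrative self-training loop: label the unlabeled pool iteratively,
# keeping only estimates whose confidence clears a threshold.
# The 1-NN learner and confidence function are assumptions for this sketch.

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict_with_confidence(labeled, instance):
    """1-NN prediction; confidence decays with distance to the nearest labeled neighbor."""
    nearest = min(labeled, key=lambda xy: distance(xy[0], instance))
    conf = 1.0 / (1.0 + distance(nearest[0], instance))
    return nearest[1], conf

def self_train(labeled, unlabeled, threshold=0.5):
    """Iteratively move confidently labeled instances into the labeled pool."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    while unlabeled:
        scored = [(predict_with_confidence(labeled, x), x) for x in unlabeled]
        confident = [((lab, c), x) for (lab, c), x in scored if c >= threshold]
        if not confident:          # nothing clears the bar: relax the threshold
            threshold *= 0.9
            continue
        for (lab, _), x in confident:
            labeled.append((x, lab))
            unlabeled.remove(x)
    return labeled

# Toy usage: two labeled seed modules, three unlabeled ones.
seeds = [((0.0, 0.0), "clean"), ((5.0, 5.0), "defective")]
pool = [(0.5, 0.1), (4.8, 5.2), (1.0, 0.0)]
result = self_train(seeds, pool)
```

The confidence function here is deliberately naive; in practice the base learner and its confidence score (e.g. class probability estimates) would be supplied by the chosen supervised method.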
The common property of most estimation techniques is that they rely on supervised algorithms, i.e., they require training data with labels. Not all of the available datasets contain complete dependent variable information, a.k.a. labels [18]. The label information may be either non-existent or available for only a limited number of instances. This is due to the fact that it is often infeasible to collect the labels for all the instances in a large software dataset. (Note that the terms "dependent variable information", "label information" and "label" refer to the value of the feature(s) which are believed to be related to the independent variables and suitable for prediction by supervised algorithms. Hence, these terms are used interchangeably in the text.)

© 2013 IEEE. RAISE 2013, San Francisco, CA, USA. Accepted for publication by IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

In spite of the public availability of software engineering data, the aforementioned issues bring out the following challenges: How to transfer data between domains and projects? How to accommodate prediction problems for which only a limited amount of labeled instances is available? How to handle prediction problems in which no instances have labels? Transfer, semi-supervised and active learning, applied to software engineering data, are research areas that can help us answer specific questions related to contemporary prediction problems. We overview the state of the art of the applications of these techniques in SE (§II). Based on recent work, we

evaluate their strengths and weaknesses (§III). One limitation of the current picture is that these techniques are typically used in isolation. However, we will show that these prediction techniques complement each other's weak points (§IV). Their synergy will likely present opportunities for improved software engineering research. We conclude our discussion in §V.

II. CURRENT STATE-OF-THE-ART

In this section we cover the current state of the art in SE concerning transfer, semi-supervised and active learning. A summary of this section, in terms of the strengths and weaknesses of the learning methods as identified by the existing literature, can be found in Figure 2.

A. Transfer Learning

A supervised learning problem is often composed of two separate sets: a training set and a test set. Such a learning problem consists of: 1) a specific domain D, which is defined by a feature space and a marginal distribution over that space; 2) a task T, which is the combination of a label space and an objective estimation function. Transfer learning is a set of learning methods that allow the training and test sets to have different domains and/or tasks [25]. The formal definition, as given by Pan et al., is as follows: assuming we have a source domain D_S, a source task T_S, a target domain D_T and a target task T_T, transfer learning tries to improve an estimation method in D_T using the knowledge of D_S and T_S. Note that the assumption in the above definition is that D_S ≠ D_T or T_S ≠ T_T. There are various subgroups of transfer learning, which define the relationship between traditional machine learning methods and various transfer settings, e.g. see Table 1 of [30]. SE transfer learning studies (more popularly known as cross-company learning in SE) have the same task (estimating a dependent variable, e.g. estimating the fault proneness of modules, estimating the software development effort, etc.)
yet in different domains (data coming from different organizations or different time frames). Such transfer learning problems are classified as transductive transfer learning [2]. The current transfer learning results reported in SE have one thing in common: instability and significant variability of results. A literature review of software effort estimation (SEE) studies conducted by Kitchenham et al. reports equal evidence for and against the success of prior transfer learning studies [15]. A similar result is reported in the defect prediction domain. Zimmermann et al. found that estimation methods trained on cross-application data were inferior to those trained on within-application data [41]. From a total of 622 transfer and within data comparisons, they report that within data performed better in 618 cases. An interesting research direction related to transfer learning was proposed by Turhan et al. in their cross-company defect prediction study [38]. Supporting the evidence provided by Zimmermann et al. [41], Turhan et al. also reported that transferring all the available cross-company data yields poor estimation performance (very high false alarm rates). However, after instance selection pruned away irrelevant data during transfer learning, they found that the estimators built on transferred data were nearly equivalent to the estimators learned from within data [38]. Following the irrelevancy filtering idea, Kocaguneli et al. [16] used a variance-based instance selection for a study of transfer learning in SEE. In a limited study with three data sets, they found that, through instance selection, the performance differences between estimation methods trained on transferred and within data were not statistically significant. This study was then extended with 8 different data sets [17]. The results were identical: performance differences between models built from the within and transferred data are not significant.
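Relevancy filtering of the kind used by Turhan et al. [38] is commonly implemented as a nearest-neighbor filter: for each local instance, only the k nearest cross-company instances are retained for training. The sketch below illustrates the idea under simplifying assumptions of our own (Euclidean distance over raw features, a small k, and toy data); it is not the exact procedure of [38].

```python
# Minimal nearest-neighbor relevancy filter: for each local instance,
# keep the k nearest cross-company instances; train only on their union.
# Distance function, k, and the data below are illustrative choices.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def relevancy_filter(cross_data, local_data, k=2):
    """Return the subset of cross_data relevant to local_data (union of k-NN picks)."""
    kept = set()
    for local in local_data:
        ranked = sorted(range(len(cross_data)),
                        key=lambda i: euclidean(cross_data[i], local))
        kept.update(ranked[:k])  # indices of the k nearest cross instances
    return [cross_data[i] for i in sorted(kept)]

# Toy usage: only cross instances near the local context survive the filter.
cross = [(0.0, 0.0), (0.2, 0.1), (9.0, 9.0), (9.1, 8.9), (0.1, 0.3)]
local = [(0.0, 0.2)]
subset = relevancy_filter(cross, local, k=2)
```

Note that, as the studies above observe, the filtered subset can be a small fraction of the cross data: here two of the five cross instances survive.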
The quality of the data is a critical aspect of empirical SE [20]. Various filtering strategies appear to be strong candidates for improving the quality of SE data sets. The success of filtering cross data may also be due to the improved cross data quality from the perspective of a local data problem. However, that is merely our speculation, and empirical experimentation is needed to establish the relation between data quality and the filtering of cross data. The filtering of SE data sets has recently been attracting more attention, with promising results in terms of improving data set quality. For example, Liebchen et al. experiment on a proprietary data set with 3 different filtering-based data quality techniques: filtering; robust filtering; and filtering followed by polishing [20]. They report that all three techniques provide performance improvement over the use of the data set in an as-is manner. Another point to note from Liebchen et al.'s study is the fact that the filtering-based techniques may reduce the number of instances down to 1/3 or even 1/4 of the original data set size. The issue of data quality in SE data sets can also be observed in open source projects and data sets. Bird et al. investigate whether the collected bug fix data may be a biased sample of the entire population; hence they evaluate the impact of bug feature bias and commit feature bias on the quality of open source bug fix data sets [6]. They report that the data quality in bug fix data sets is significantly influenced by systematic bias. The bias inherent in the data also has a negative impact on the processes that utilize the results of prediction models built on the biased data sets [6]. In another study, Bachmann et al. investigate the extent of the sampling bias in another open source data set (the APACHE data set) [3]. The presented results are threatening regarding the core assumptions about data quality, e.g.
bugs often go unnoticed as they are shared through mailing lists instead of being recorded in a bug tracker. In other words, the data set is a heavily biased sample of the entire population. Unstable results, as seen in transfer learning studies [15]–[17], [38], [41], are nothing new to SE. Keung et al. investigated 90 software engineering estimation (SEE) methods using 20 data sets [14]. Their results behave in accordance with the prediction of Shepperd et al.'s earlier work [34]: changing conditions affect the performance of estimation methods. The issue of unstable results is also the main topic of a recent special issue of EMSE, where an in-depth

discussion is provided [29]. Menzies et al. report that the instability can possibly be explained by the heterogeneity of SE data sets [26]. They experiment with software effort and defect data sets using a clusterer called WHERE (to find local regions) and a contrast set learner called WHICH (to compare local and global treatments). Menzies et al. show that effort and defect data contain multiple small and local regions with different properties. Their recommendation is to focus on context-specific principles instead of global ones in order to arrive at stable conclusions. Bettenburg et al. also evaluate the performance of local and global models on effort and defect data sets [5]. They evaluate 3 different treatments: 1) global models on the whole data; 2) local models on subsets of the global data; 3) augmenting global models with lessons learned from the local models. They report that the local models tend to fit the data better in comparison to global models. However, they warn that local models are likely to point to too many different local lessons that may be far too context-specific. Hence, they recommend using global models that consider and combine the trends of local models. Recently, Turhan offered a formal treatment of the unstable results in SE, the so-called data set shift [37]. The same problem is also addressed under the names of concept shift, concept drift, changing environments, and contrast mining. Turhan evaluates the types of data shift issues proposed by Storkey [35] from an SE perspective. Storkey points out that dataset shift is a specific case of transfer learning. Transfer learning deals with cases where there are multiple training scenarios that are partially related and are used to predict in one specific scenario [35]. A very interesting transfer learning direction is the transfer of data between different time frames. This direction has been investigated by Lokan et al. [21], [22].
In [22], Lokan and Mendes, using chronological sets of instances, found that time frame divisions of instances did not affect prediction accuracy. In [21], they found that it is possible to suggest a window size of a time frame of past instances which can yield a performance increase in estimation. They also note that the size of the window frame is data set dependent. Successful application of data transfer between different time frames constitutes a very important finding. Given that relevant software projects from earlier time frames can be successfully identified and transferred to current time frames, older data sets are no longer obsolete and the effort invested in the collection of these datasets is not wasted. On the contrary, they can still provide training instances that are relevant to current contexts. Rahman et al. investigate resource-constrained transfer learning in the context of defect prediction [31], which is another promising direction for transfer learning studies. Rahman et al. note that, since the resources available for testing are limited, it is infeasible to test all the entities that a transfer learning-based prediction model identifies as defective. They test cross-defect prediction performance under different practical resource constraints. Rahman et al. report that cross-defect prediction models are as good as within-defect models. Furthermore, they note that cross-defect model performance is highly stable under different constraints.

B. Semi-supervised Learning

Accurate dependent variable information, which is critical for supervised methods, may not always be available for all or most of the training instances due to the cost and difficulty inherent in its collection. For example, accurate detection of fault prone modules enables organizations to produce high quality software [24], yet such a quality modeling activity requires the fault content knowledge of previously developed software modules.
Assessing the correctness of the fault content information is time consuming and costly. Effort estimation shares common concerns about label information. Although it is fairly easy to derive static code metrics from a software project, establishing the accurate effort invested in a project can require a considerable amount of time and money [18]. A research direction that relaxes the dependent variable information requirement is important for predictive modeling studies. One such predictive modeling technique is semi-supervised learning. Semi-supervised methods are a group of machine learning algorithms that learn from a set of training instances among which only a small subset has preassigned labels [8]. The fundamental idea is to learn from a small set of training instances with known labels and to supplement the learning with training instances whose labels are unknown. Despite the promise, semi-supervised learning appears to be less than thoroughly investigated in the context of SE. Early work on semi-supervised learning was performed in the software quality domain. Seliya et al. use an expectation maximization (EM) based approach, where unlabeled modules are viewed as missing information and semi-supervised learning can be reduced to the problem of completing missing data [32]. As the authors put it, the proposed semi-supervised algorithm is particularly useful when the available labeled instances are not enough to yield sufficient prediction performance. Another semi-supervised approach proposed by Seliya et al. is based on k-means clustering [33], which extends traditional unsupervised clustering methods for better partitioning of the data. The clustering-based semi-supervised algorithm aids software engineering experts when labeling clusters of modules as defective or non-defective. The experts may be able to make better estimations in comparison to unsupervised clustering algorithms. Li et al.
developed a framework which maps ensemble learning and random forests into a semi-supervised learning setting [19]. The random forest starts learning with the initially labeled small subset of training instances and provides the first set of estimates for the unlabeled training instances. In the following iteration, the training instances whose labels were estimated in the previous round are also included in the learning of the random forest. The random forest iterates over the unlabeled training data. The estimate of the ensemble is found via majority voting of the formed prediction trees. Recently, we started applying semi-supervised algorithms to software quality data. Our initial investigation started with the application of a self-training (another branch of semi-supervised

learning) method called Fitting-the-Confidence-Fits (FTcF), which iteratively trains a base supervised learner over the currently labeled training instances [24]. At each iteration, the unlabeled training instances whose estimated labels have a high prediction confidence score are moved into the set of labeled training instances. The algorithm terminates when there are no unlabeled training instances left to migrate. Prior to the semi-supervised iterations, the training set is given to a feature selector called Multi-Dimensional Scaling (MDS), so as to reduce the space of independent features. The results of our experiments are consistent with the prior literature and show that MDS followed by semi-supervised learning consistently outperforms the most successful supervised learners while using only a fraction of the labeled data. We experimented with 4 different software quality data sets from the NASA MDP repository (for details refer to Table 1 of [24]). A sample of the results on the PC4 data set is given in Figure 1.

Fig. 1. The performance of semi-supervised prediction experiments on the PC4 dataset (figure from [24]).

We use the following acronyms in Figure 1: SL for supervised learning without MDS; SL.MDS for supervised learning with MDS; FTcF for semi-supervised learning without MDS; FTcF.MDS for semi-supervised learning with MDS. The percentage values on the x-axis show what percentage of the labeled data (defective and non-defective) is made available for training w.r.t. the size of the entire data set. The y-axis shows either the area under the ROC curve (AUC) or the probability of correct detection (PD) of fault prone modules at different thresholds ({0.1, 0.5, 0.75}). For both performance measures, FTcF.MDS (i.e. the semi-supervised learner augmented with MDS) has the best performance, which shows that when the initial number of instances with available labels (in this case faulty or not
faulty) is limited, semi-supervised learners perform better than supervised learners. The same results are observed for the remaining 3 data sets. The ANOVA analysis reported in [24] confirms Figure 1: FTcF.MDS outperforms the other methods w.r.t. both AUC and PD with statistical significance.

C. Active Learning

Active learning methods start out as unsupervised methods, in the sense that they utilize an initially unlabeled data set. An active learning method can query an oracle for the labels of certain instances. However, each query comes with a cost: in SE, the cost of a query may imply a costly assessment of the dependent variable information by domain experts (inspection, testing, etc.). Ideally, an active learner is expected to ask for as few labels as possible while yielding a satisfactory estimation performance. The machine learning literature presents example uses of active learning to reduce data label requirements. For example, Dasgupta [9] seeks generalizable guarantees in active learning, using a greedy active learning heuristic that is able to deliver performance values as good as any other heuristic in terms of reducing the number of required labels [9]. Balcan et al. show that active learning provides the same performance as a supervised learner with substantially smaller sample sizes [4]. In [39], Wallace et al. propose a citation screening method based on active learning augmented with a priori expert knowledge. Practical application exemplars of active learning can be found in software testing [7], [40]. In Bowring et al.'s study [7], active learning is used to augment learners for the automatic classification of program behavior. They show that learners augmented with active learning yield a significant reduction in data labeling effort and can generate results comparable to supervised learning. Xie et al. [40] use human inspection as an active learning strategy for effective test generation and specification inference.
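Abstracting over these studies, the core active learning loop (repeatedly query an oracle for the label of the most informative instance, under a fixed label budget) can be sketched as follows. The margin-based uncertainty measure, the budget, and the toy data are illustrative assumptions, not the heuristics of any cited work.

```python
# Illustrative active-learning query loop: ask an oracle to label the instance
# the current model is least certain about, until the label budget runs out.
# The uncertainty measure (margin between the nearest labeled neighbor of each
# class) and the toy data below are assumptions for this sketch.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def uncertainty(labeled, x):
    """Small margin between the classes' nearest labeled neighbors = high uncertainty."""
    by_class = {}
    for inst, lab in labeled:
        by_class[lab] = min(dist(inst, x), by_class.get(lab, float("inf")))
    dists = sorted(by_class.values())
    return -(dists[1] - dists[0]) if len(dists) > 1 else 0.0

def active_learn(pool, oracle, seed, budget):
    """Query the oracle for the most uncertain pool instance, up to `budget` times."""
    labeled, pool = list(seed), list(pool)
    for _ in range(budget):
        if not pool:
            break
        query = max(pool, key=lambda x: uncertainty(labeled, x))
        pool.remove(query)
        labeled.append((query, oracle(query)))  # each oracle call carries a cost
    return labeled

# Toy usage: the oracle stands in for a hypothetical domain expert.
oracle = lambda x: "defective" if x[0] > 2 else "clean"
seed = [((0.0, 0.0), "clean"), ((5.0, 5.0), "defective")]
pool = [(2.4, 2.4), (0.1, 0.1), (4.9, 4.9)]
model_data = active_learn(pool, oracle, seed, budget=1)
```

With a budget of one query, only the borderline instance is sent to the expert; the two instances that closely resemble already-labeled data are never queried, which is the label-saving behavior the studies above exploit.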
In Xie et al.'s experiments, the number of tests selected for human inspection is reasonably low; the direct implication is that labeling required significantly less effort than screening all single test cases. Hassan and Xie list active learning as part of the future of software engineering [12]. Recently, Kocaguneli et al. used an active learning heuristic to reduce the data label requirements in software effort estimation data sets [18]. The proposed active learning solution requires at most 40% of the original data and can perform as well as state-of-the-art supervised learners, which require all the available instances and labels. The reduced set of instances that can provide performance values as good as using all the instances is called the essential content of the dataset. In this study, Kocaguneli et al. interviewed practitioners from industry and academia regarding the value added by active learning. The interviews revealed that the possible value-added inherent in active learning methods is twofold. Firstly, given the vast amount of data available today, it is impossible for practitioners to discuss each and every instance during meetings with other experts or clients. Active learning methods can greatly reduce the number of instances by providing an essential content of the data set to be discussed. The second

benefit of active learning methods comes from the fact that the cost associated with collecting dependent variable information is usually higher than the cost associated with the collection of the independent variables. One of the interviewees supports this idea with the example of $1.5M spent by NASA in the period 1987 to 1990 to understand the historical records of their software. Active learning solutions do not solve all the problems. The results presented in [18] are encouraging, yet these results come from a regression problem. Our preliminary results on the application of a similar active learning solution to defect prediction show that, for classification-type problems, the performance is susceptible to uneven class distributions.

Fig. 2. The current map of the strengths and challenges of different learning methods. The solid line indicates the synergy that is currently being followed, whereas dashed lines indicate the new synergies proposed in this work.

III. THE CURRENT CHALLENGES

In the previous section, we discussed the details of the solutions provided for the lack of local data (transfer learning) and for the lack of labeled instances (semi-supervised and active learning). We also discussed initial applications of these promising prediction techniques. Although these algorithms are beneficial for specific types of problems, they are not completely devoid of problems. For example, transfer learning methods are helpful because they provide means for data transfer between different domains or time frames. However, transfer learning methods that use cross-domain data in an inappropriate manner perform poorly. Through relevancy filtering of cross data, performance can be improved, yet such a filtering process is likely to reduce the amount of available data considerably. Among 21 pairs of cross- and within-domain data sets, Kocaguneli et al.
report that the amount of data transferred from cross data sources is at most 15% of the data set size and can be as low as 5% [17]. Semi-supervised learning methods enable learning where only a small subset of the instances is labeled [19], [24]. The downside of semi-supervised approaches is that they still require an initial subset of labeled training instances. In other words, they do not entirely mitigate the cost and time associated with data collection and data V&V effort. Additionally, since the estimated labels of the unlabeled training data depend on the small subset of initial labels, the correctness of the initial subset of labels plays a crucial role. Active learning methods do not initially require any labels [9], and they are helpful for discovering the essential content of a software engineering data set. On the other hand, similar to supervised learners, they are negatively affected by imbalanced class distributions [11], notoriously present in most software engineering data sets.

IV. SYNERGIES

The previous section discussed the strengths and weaknesses associated with the learning methods discussed here. Figure 2 presents this discussion in a more organized manner. In Figure 2, the relationships between the different learning techniques are shown with arrows. The solid line in Figure 2 shows the currently recognized synergy between supervised learning, transfer learning and prediction. The dashed lines show the synergies we feel are about to open new opportunities for software engineering research. To the best of our knowledge, the proposed synergies have not been investigated before in the SE community. However, we foresee that researching the proposed links between seemingly separate learning techniques can help us benefit from the volume of information available in the existing public data sets and solve the challenges of predictive engineering.

Synergy #1: This synergy is being investigated in the SE community, with recent promising developments. Following the link between supervised learning and transfer learning has shown us that the lack of local data may no longer be a problem for organizations. Although the initial results were unstable, with equal amounts of studies for and against the value of cross data [15], [41], we have recently seen that the performance of cross data can be significantly improved through relevancy filtering [16], [17], [38].

Synergy #2: The synergy between transfer learning and semi-supervised learning has not been investigated yet. The challenge with semi-supervised learning is that it requires a small set of initially labeled training data. Given the existence of a reliable small set of labeled training instances, semi-supervised learning is reported to be able to effectively augment the training process [19], [24], [32], [33]. We also know that transfer learning can filter the instances of a labeled cross-company data set down to a small subset of instances [17] that are relevant to a local context [16], [17], [38]. In the case that an organization has only unlabeled local data, it is possible that the initial small set of labeled training data (required for semi-supervised learning) can be provided via transfer learning. The transferred initial data can then be used to supplement the unlabeled local training data with estimated labels. Figure 3 visualizes the proposed synergy between transfer and semi-supervised learning.

Fig. 3. The proposed synergy between transfer and semi-supervised learning. Labeled cross-company data is filtered via transfer learning to its locally relevant subset. The locally relevant subset is used by semi-supervised learning to supplement the unlabeled within training data.

Synergy #3: Mountains of publicly available software engineering data impose certain threats too. Often, cross-domain data (data collected in another organization) may be very different w.r.t. a local context. The difference may be due to design and programming practices as well as different team environments. The size of the data sets previously employed in transfer learning experiments is on the order of a thousand instances. For example, the biggest data set used by Turhan et al. is 1109 instances [38]. However, in the era of big data, the sizes of the data sets in transfer learning experiments are bound to increase considerably. So when the cross data is too large (in comparison to the previously experimented quality and effort data sets) and possibly too noisy, it may be challenging for transfer learning methods to filter it down to only the locally relevant instances. Hence, before the transfer learning step, it may become good practice to work only with the essential content of a data set. Active learning seems to be effective in finding the essential content of small to medium sized SE data sets [18]. Researching the synergy between active learning (to find the essential content) and transfer learning (to transfer instances from the essential content to the local context) is a promising yet unexplored research direction. Figure 4 depicts the current transfer learning experiments and the proposed synergy between transfer and active learning.

Fig. 4. Current transfer learning experiments and the proposed Synergy #3. (a) Current transfer learning experiments. (b) The proposed synergy between active and transfer learning.

V. CONCLUSION

Predictive modeling studies in software engineering benefit from the large publicly available data sets. The availability of data presents promising opportunities for new research directions, but there are multiple challenges as well.
In this position paper, we focused on two specific challenges. We analyzed information transfer from data-rich domains into domains with limited data availability. We observed research trends in building prediction models when dependent variable information is scarce. We found synergies between the two problems in terms of learning techniques that complement each other. We discussed transfer learning studies as they relate to the former challenge, and semi-supervised and active learning research with respect to the latter. Their synergistic interactions appear to be mostly overlooked by the software engineering research community. We see a considerable research effort invested in these problem areas. However, the investigation of learning techniques that tackle the presented data-related challenges is new and far from having been exhausted. We believe that the research within each referenced learning method is critically important, yet identifying the strengths and weaknesses of

7 separate learning methods and following their synergies may enable much better use of data that we have today. REFERENCES [1] W. Afzal and R. Torkar. On the application of genetic programming for software engineering predictive modeling: A systematic review. Expert Systems with Applications, 38(9): , [2] A. Arnold, R. Nallapati, and W. Cohen. A comparative study of methods for transductive transfer learning. In ICDM 07: Seventh IEEE International Conference on Data Mining Workshops, pages 77 82, [3] A. Bachmann, C. Bird, F. Rahman, P. Devanbu, and A. Bernstein. The missing links: bugs and bug-fix commits. In Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering, FSE 10, pages , [4] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Proceedings of the 23rd international conference on Machine learning - ICML 06, pages 65 72, [5] N. Bettenburg, M. Nagappan, and A. Hassan. Think locally, act globally: Improving defect and effort prediction models. In Mining Software Repositories (MSR), th IEEE Working Conference on, pages 60 69, June. [6] C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov, and P. Devanbu. Fair and balanced?: bias in bug-fix datasets. In Proceedings of the the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, ESEC/FSE 09, pages , New York, NY, USA, ACM. [7] J. F. Bowring, J. M. Rehg, and M. J. Harrold. Active learning for automatic classification of software behavior. ACM SIGSOFT Software Engineering Notes, 29(4):195, July [8] O. Chapelle, B. Schlkopf, and A. Zien. Semi-supervised Learning. MIT Press, Cambridge, MA, USA, [9] S. Dasgupta. Analysis of a greedy active learning strategy. in Neural Information Processing Systems 17:, 1(x), [10] F. Deissenboeck, E. Juergens, K. Lochmann, and S. Wagner. Software quality models: Purposes, usage scenarios and requirements. 
In Software Quality, WOSQ 09. ICSE Workshop on, pages 9 14, may [11] S. Ertekin, J. Huang, L. Bottou, and L. Giles. Learning on the border: active learning in imbalanced data classification. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, CIKM 07, pages , New York, NY, USA, ACM. [12] A. Hassan and T. Xie. Software intelligence: the future of mining software engineering data. In Proceedings of the FSE/SDP workshop on Future of software engineering research, pages ACM, [13] M. Jorgensen and M. Shepperd. A Systematic Review of Software Development Cost Estimation Studies. IEEE Trans. Softw. Eng., 33(1):33 53, [14] J. Keung, E. Kocaguneli, and T. Menzies. Finding conclusion stability for selecting the best effort predictor in software effort estimation. Automated Software Engineering, pages 1 25, /s [15] B. A. Kitchenham, E. Mendes, and G. H. Travassos. Cross versus withincompany cost estimation studies: A systematic review. IEEE Trans. Softw. Eng., 33(5): , [16] E. Kocaguneli, G. Gay, Y. Yang, T. Menzies, and J. Keung. When to use data from other projects for effort estimation. In ASE 10: Proceedings of the International Conference on Automated Software Engineering (short paper), New York, NY, USA, [17] E. Kocaguneli and T. Menzies. How to find relevant data for effort estimation. In ESEM 11: International Symposium on Empirical Software Engineering and Measurement, [18] E. Kocaguneli, T. Menzies, J. Keung, D. Cok, and R. Madachy. Active learning and effort estimation: Finding the essential content of software effort estimation data. IEEE Trans. on Softw. Eng., PP(99):1, [19] M. Li, H. Zhang, R. Wu, and Z.-H. Zhou. Sample-based software defect prediction with active and semi-supervised learning. Automated Software Engineering, 19: , [20] G. Liebchen, B. Twala, and M. Shepperd. Filtering, robust filtering, polishing: Techniques for addressing quality in software data. 
In Empirical Software Engineering and Measurement, ESEM First International Symposium on, pages , Sept. [21] C. Lokan and E. Mendes. Applying moving windows to software effort estimation. In ESEM 09: Proceedings of the 3rd International Symposium on Empirical Software Engineering and Measurement, pages , [22] C. Lokan and E. Mendes. Using chronological splitting to compare cross- and single-company effort models: further investigation. In Proceedings of the Thirty-Second Australasian Conference on Computer Science - Volume 91, ACSC 09, pages 47 54, [23] C. Lokan, T. Wright, P. Hill, and M. Stringer. Organizational benchmarking using the isbsg data repository. Software, IEEE, 18(5):26 32, [24] H. Lu, B. Cukic, and M. Culp. Software defect prediction using semi-supervised learning with dimension reduction. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, ASE 2012, pages , [25] Y. Ma, G. Luo, X. Zeng, and A. Chen. Transfer learning for crosscompany software defect prediction. Information and Software Technology, 54(3): , [26] T. Menzies, A. Butcher, A. Marcus, T. Zimmermann, and D. Cok. Local vs. global models for effort estimation and defect prediction. In Automated Software Engineering (ASE), th IEEE/ACM International Conference on, pages , Nov. [27] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. The promise repository of empirical software engineering data, June [28] T. Menzies, Z. Milton, B. Turhan, B. Cukic, Y. Jiang, and A. Bener. Defect prediction from static code features: current results, limitations, new approaches. Automated Software Engineering, 17: , /s [29] T. Menzies and M. Shepperd. Special issue on repeatable results in software engineering prediction. Empirical Software Engineering, 17(1-2):1 17, [30] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10): , [31] F. Rahman, D. Posnett, and P. Devanbu. 
Recalling the imprecision of cross-project defect prediction. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE 12, pages 61:1 61:11, [32] N. Seliya and T. Khoshgoftaar. Software quality estimation with limited fault data: a semi-supervised learning perspective. Software Quality Journal, 15: , [33] N. Seliya and T. M. Khoshgoftaar. Software quality analysis of unlabeled program modules with semisupervised clustering. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 37(2): , march [34] M. Shepperd and C. Schofield. Estimating software project effort using analogies. IEEE Trans. Softw. Eng., 23(11): , [35] A. Storkey. When training and test sets are different: characterizing learning transfer. Dataset Shift in Machine Learning (J. Candela, M. Sugiyama, A. Schwaighofer and N. Lawrence, eds.). MIT Press, Cambridge, MA, pages 3 28, [36] M. Svahnberg, T. Gorschek, R. Feldt, R. Torkar, S. B. Saleem, and M. U. Shafique. A systematic review on strategic release planning models. Information and Software Technology, 52(3): , [37] B. Turhan. On the dataset shift problem in software engineering prediction models. Empirical Software Engineering, 17:62 74, [38] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5): , [39] B. Wallace, K. Small, C. Brodley, and T. Trikalinos. Active learning for biomedical citation screening. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages ACM, [40] T. Xie and D. Notkin. Mutually enhancing test generation and specification inference. Formal Approaches to Software Testing, pages , [41] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy. Crossproject defect prediction: A large scale experiment on data vs. domain vs. process. ESEC/FSE, pages , [42] T. Zimmermann, R. 
Premraj, and A. Zeller. Predicting defects for eclipse. In Predictor Models in Software Engineering, PROMISE 07: ICSE Workshops International Workshop on, page 9, may


More information

HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK

HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK 1 K.RANJITH SINGH 1 Dept. of Computer Science, Periyar University, TamilNadu, India 2 T.HEMA 2 Dept. of Computer Science, Periyar University,

More information

Strategic Online Advertising: Modeling Internet User Behavior with

Strategic Online Advertising: Modeling Internet User Behavior with 2 Strategic Online Advertising: Modeling Internet User Behavior with Patrick Johnston, Nicholas Kristoff, Heather McGinness, Phuong Vu, Nathaniel Wong, Jason Wright with William T. Scherer and Matthew

More information

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

More information

SPATIAL DATA CLASSIFICATION AND DATA MINING

SPATIAL DATA CLASSIFICATION AND DATA MINING , pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

More information

Data Quality in Empirical Software Engineering: A Targeted Review

Data Quality in Empirical Software Engineering: A Targeted Review Full citation: Bosu, M.F., & MacDonell, S.G. (2013) Data quality in empirical software engineering: a targeted review, in Proceedings of the 17th International Conference on Evaluation and Assessment in

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Software project cost estimation using AI techniques

Software project cost estimation using AI techniques Software project cost estimation using AI techniques Rodríguez Montequín, V.; Villanueva Balsera, J.; Alba González, C.; Martínez Huerta, G. Project Management Area University of Oviedo C/Independencia

More information

Clustering Technique in Data Mining for Text Documents

Clustering Technique in Data Mining for Text Documents Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor

More information

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type. Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada

More information

Using One-Versus-All classification ensembles to support modeling decisions in data stream mining

Using One-Versus-All classification ensembles to support modeling decisions in data stream mining Using One-Versus-All classification ensembles to support modeling decisions in data stream mining Patricia E.N. Lutu Department of Computer Science, University of Pretoria, South Africa Patricia.Lutu@up.ac.za

More information

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Processing and data collection of program structures in open source repositories

Processing and data collection of program structures in open source repositories 1 Processing and data collection of program structures in open source repositories JEAN PETRIĆ, TIHANA GALINAC GRBAC AND MARIO DUBRAVAC, University of Rijeka Software structure analysis with help of network

More information

Random forest algorithm in big data environment

Random forest algorithm in big data environment Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest

More information

Regression Testing Based on Comparing Fault Detection by multi criteria before prioritization and after prioritization

Regression Testing Based on Comparing Fault Detection by multi criteria before prioritization and after prioritization Regression Testing Based on Comparing Fault Detection by multi criteria before prioritization and after prioritization KanwalpreetKaur #, Satwinder Singh * #Research Scholar, Dept of Computer Science and

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

College information system research based on data mining

College information system research based on data mining 2009 International Conference on Machine Learning and Computing IPCSIT vol.3 (2011) (2011) IACSIT Press, Singapore College information system research based on data mining An-yi Lan 1, Jie Li 2 1 Hebei

More information

Towards Inferring Web Page Relevance An Eye-Tracking Study

Towards Inferring Web Page Relevance An Eye-Tracking Study Towards Inferring Web Page Relevance An Eye-Tracking Study 1, iconf2015@gwizdka.com Yinglong Zhang 1, ylzhang@utexas.edu 1 The University of Texas at Austin Abstract We present initial results from a project,

More information

Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

More information

Big Data Classification: Problems and Challenges in Network Intrusion Prediction with Machine Learning

Big Data Classification: Problems and Challenges in Network Intrusion Prediction with Machine Learning Big Data Classification: Problems and Challenges in Network Intrusion Prediction with Machine Learning By: Shan Suthaharan Suthaharan, S. (2014). Big data classification: Problems and challenges in network

More information

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework Jakrarin Therdphapiyanak Dept. of Computer Engineering Chulalongkorn University

More information

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

More information

Getting Even More Out of Ensemble Selection

Getting Even More Out of Ensemble Selection Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand qs12@cs.waikato.ac.nz ABSTRACT Ensemble Selection uses forward stepwise

More information

Analysis of Software Project Reports for Defect Prediction Using KNN

Analysis of Software Project Reports for Defect Prediction Using KNN , July 2-4, 2014, London, U.K. Analysis of Software Project Reports for Defect Prediction Using KNN Rajni Jindal, Ruchika Malhotra and Abha Jain Abstract Defect severity assessment is highly essential

More information

Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data

Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data Fifth International Workshop on Computational Intelligence & Applications IEEE SMC Hiroshima Chapter, Hiroshima University, Japan, November 10, 11 & 12, 2009 Extension of Decision Tree Algorithm for Stream

More information