Predicting More from Less: Synergies of Learning

Ekrem Kocaguneli, Bojan Cukic, Huihua Lu
Lane Department of Computer Science and Electrical Engineering
West Virginia University, Morgantown, WV, USA

Abstract: Thanks to the ever-increasing importance of project data, its collection has been one of the primary focuses of software organizations. Data collection activities have resulted in the availability of massive amounts of data through software data repositories. This is great news for predictive modeling research in software engineering. However, the widely used supervised methods for predictive modeling require labeled data that is relevant to the local context of a project. This requirement cannot be met by many of the available data sets, which introduces new challenges for software engineering research. How can data be transferred between different contexts? How can an insufficient number of labeled instances be handled? In this position paper, we investigate synergies between different learning methods (transfer, semi-supervised and active learning) which may overcome these challenges.

I. INTRODUCTION

Predictive modeling and estimation methods are important research directions in software engineering (SE). The targets of predictive models differ [1]: software quality [10], software effort estimation [13], software defects [28], release scheduling [36] and so on. Thanks to the recent emphasis placed on data collection activities, we have access to massive publicly available sources of software engineering data. In fact, we have access to more open source project data than at any other time in history. For example, SourceForge currently hosts 300K projects with a user base of 2M, GoogleCode hosts 250K open source projects, and there is an abundance of SE repositories, e.g. ISBSG [23], PROMISE [27], Eclipse Bug Data [42] and TukuTuku. When an organization has no local data or the local data is outdated, i.e.
it is no longer representative of current practices, transferring the available data of other organizations or adapting the outdated data to current contexts may be helpful. The transfer of data between organizations or time frames is addressed by transfer learning research [25]. The publicly available data sets offer an opportunity to create better software engineering processes even for organizations without any local data of their own. Therefore, effective methods to transfer data from the domain of one organization to another appear to be an important, yet challenging, research direction. The performance of transfer learning (a.k.a. cross-company estimation in SE) methods remains questionable, with both supporting and opposing evidence [15], [41]. Recently, relevancy filtering has been reported to be a good alternative for improving the performance of transferred data, to the extent that its performance is statistically indistinguishable from that of local data [17], [38]. However, relevancy filtering approaches transfer only a limited number of training instances relevant to the local domain. When only a limited amount of labeled training instances is available, we typically try to supplement learning by predicting labels for the unlabeled training data. The predicted labels (so-called "estimated labels") are evaluated according to a confidence function and only the ones with high confidence are kept. This process is repeated until all the instances are labeled. This issue is handled by semi-supervised learning research [24]. The complete lack of labels in a given data set requires unsupervised learning methods that can provide a labeling order. This ordering is expected to start from the most informative instances and continue towards the least informative ones. This issue is handled by active learning research [9], [16], [40].
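The semi-supervised labeling loop just described (predict labels for unlabeled instances, keep only the high-confidence estimates, repeat until everything is labeled) can be sketched as follows. This is a minimal illustration rather than code from any cited study; the 1-NN learner, the distance-based confidence function, and the toy data are illustrative assumptions.

```python
# Illustrative self-training loop: label the unlabeled pool iteratively,
# keeping only estimates whose confidence clears a threshold.
# The 1-NN learner and confidence function are assumptions for this sketch.

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict_with_confidence(labeled, instance):
    """1-NN prediction; confidence decays with distance to the nearest labeled neighbor."""
    nearest = min(labeled, key=lambda xy: distance(xy[0], instance))
    conf = 1.0 / (1.0 + distance(nearest[0], instance))
    return nearest[1], conf

def self_train(labeled, unlabeled, threshold=0.5):
    """Iteratively move confidently labeled instances into the labeled pool."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    while unlabeled:
        scored = [(predict_with_confidence(labeled, x), x) for x in unlabeled]
        confident = [((lab, c), x) for (lab, c), x in scored if c >= threshold]
        if not confident:          # nothing clears the bar: relax the threshold
            threshold *= 0.9
            continue
        for (lab, _), x in confident:
            labeled.append((x, lab))
            unlabeled.remove(x)
    return labeled

# Toy usage: two labeled seed modules, three unlabeled ones.
seeds = [((0.0, 0.0), "clean"), ((5.0, 5.0), "defective")]
pool = [(0.5, 0.1), (4.8, 5.2), (1.0, 0.0)]
result = self_train(seeds, pool)
```

The confidence function here is deliberately naive; in practice the base learner and its confidence score (e.g. class probability estimates) would be supplied by the chosen supervised method.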
The common property of most estimation techniques is that they rely on supervised algorithms, i.e., they require training data with labels. Not all of the available datasets contain complete dependent variable information, a.k.a. labels [18]. The label information may be either non-existent or available for only a limited number of instances. This is due to the fact that it is often infeasible to collect the labels for all the instances in a large software dataset. (Note that the terms "dependent variable information", "label information" and "label" refer to the value of the feature(s) which are believed to be related to the independent variables and suitable for prediction by supervised algorithms. Hence, these terms are used interchangeably in the text.)

© 2013 IEEE. RAISE 2013, San Francisco, CA, USA. Accepted for publication by IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

In spite of the public availability of software engineering data, the aforementioned issues bring out the following challenges: How to transfer data between domains and projects? How to accommodate prediction problems for which only a limited amount of labeled instances is available? How to handle prediction problems in which no instances have labels? Transfer, semi-supervised and active learning, applied to software engineering data, are research areas that can help us answer specific questions related to contemporary prediction problems. We overview the state of the art of the applications of these techniques in SE (§II). Based on recent work, we

evaluate their strengths and weaknesses (§III). One limitation of the current picture is that these techniques are typically used in isolation. However, we will show that these prediction techniques complement each other's weak points (§IV). Their synergy will likely present opportunities for improved software engineering research. We conclude our discussion in §V.

II. CURRENT STATE-OF-THE-ART

In this section we cover the current state of the art in SE concerning transfer, semi-supervised and active learning. A summary of this section, in terms of the strengths and weaknesses of the learning methods as identified by the existing literature, can be found in Figure 2.

A. Transfer Learning

A supervised learning problem is often composed of two separate sets: a training set and a test set. Such a learning problem consists of: 1) a specific domain D, which is defined by a feature space and a marginal distribution over that space; 2) a task T, which is the combination of a label space and an objective estimation function. Transfer learning is a set of learning methods that allow the training and test sets to have different domains and/or tasks [25]. The formal definition, as given by Pan et al., is as follows: assuming we have a source domain D_S, a source task T_S, a target domain D_T and a target task T_T, transfer learning tries to improve an estimation method in D_T using the knowledge of D_S and T_S. Note that the assumption in the above definition is that D_S ≠ D_T or T_S ≠ T_T. There are various subgroups of transfer learning, which define the relationship between traditional machine learning methods and various transfer settings, e.g. see Table 1 of [30]. SE transfer learning studies (more popularly known as cross-company learning in SE) have the same task (estimating a dependent variable, e.g. estimating the fault proneness of modules, estimating the software development effort, etc.)
yet in different domains (data coming from different organizations or different time frames). Such transfer learning problems are classified as transductive transfer learning [2]. The current transfer learning results reported in SE have one thing in common: instability and significant variability of results. A literature review of software effort estimation (SEE) studies conducted by Kitchenham et al. reports equal evidence for and against the success of prior transfer learning studies [15]. A similar result is reported in the defect prediction domain. Zimmermann et al. found that estimation methods trained on cross-application data were inferior to those trained on within-application data [41]. From a total of 622 transfer and within data comparisons, they report that within data performed better in 618 cases. An interesting research direction related to transfer learning was proposed by Turhan et al. in their cross-company defect prediction study [38]. Supporting the evidence provided by Zimmermann et al. [41], Turhan et al. also reported that transferring all the available cross-company data yields poor estimation performance (very high false alarm rates). However, after instance selection pruned away irrelevant data during transfer learning, they found that the estimators built on transferred data were nearly equivalent to the estimators learned from within data [38]. Following the irrelevancy filtering idea, Kocaguneli et al. [16] used a variance-based instance selection for a study of transfer learning in SEE. In a limited study with three data sets, they found that, through instance selection, the performance differences between estimation methods trained on transferred and within data were not statistically significant. This study was then extended with 8 different data sets [17]. The results were identical: performance differences between models built from the within and transferred data are not significant.
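Relevancy filtering of the kind used by Turhan et al. [38] is commonly implemented as a nearest-neighbor filter: for each local instance, only the k nearest cross-company instances are retained for training. The sketch below illustrates the idea under simplifying assumptions of our own (Euclidean distance over raw features, a small k, and toy data); it is not the exact procedure of [38].

```python
# Minimal nearest-neighbor relevancy filter: for each local instance,
# keep the k nearest cross-company instances; train only on their union.
# Distance function, k, and the data below are illustrative choices.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def relevancy_filter(cross_data, local_data, k=2):
    """Return the subset of cross_data relevant to local_data (union of k-NN picks)."""
    kept = set()
    for local in local_data:
        ranked = sorted(range(len(cross_data)),
                        key=lambda i: euclidean(cross_data[i], local))
        kept.update(ranked[:k])  # indices of the k nearest cross instances
    return [cross_data[i] for i in sorted(kept)]

# Toy usage: only cross instances near the local context survive the filter.
cross = [(0.0, 0.0), (0.2, 0.1), (9.0, 9.0), (9.1, 8.9), (0.1, 0.3)]
local = [(0.0, 0.2)]
subset = relevancy_filter(cross, local, k=2)
```

Note that, as the studies above observe, the filtered subset can be a small fraction of the cross data: here two of the five cross instances survive.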
The quality of the data is a critical aspect of empirical SE [20]. Various filtering strategies appear to be strong candidates for improving the quality of SE data sets. The success of filtering cross data may also be due to the improved cross data quality from the perspective of a local data problem. However, that is merely our speculation, and empirical experimentation is needed to establish the relation between data quality and the filtering of cross data. The filtering of SE data sets has recently been attracting more attention, with promising results in terms of improving data set quality. For example, Liebchen et al. experiment on a proprietary data set with 3 different filtering-based data quality techniques: filtering; robust filtering; and filtering followed by polishing [20]. They report that all three techniques provide performance improvement over the use of the data set in an as-is manner. Another point to note from Liebchen et al.'s study is the fact that the filtering-based techniques may reduce the number of instances down to 1/3 or even 1/4 of the original data set size. The issue of data quality in SE data sets can also be observed in open source projects and data sets. Bird et al. investigate whether the collected bug fix data may be a biased sample of the entire population; hence they evaluate the impact of bug feature bias and commit feature bias on the quality of open source bug fix data sets [6]. They report that the data quality in bug fix data sets is significantly influenced by systematic bias. The bias inherent in the data also has a negative impact on the processes that utilize the results of prediction models built on the biased data sets [6]. In another study, Bachmann et al. investigate the extent of the sampling bias in another open source data set (the APACHE data set) [3]. The presented results are threatening regarding the core assumptions about data quality, e.g.
bugs often go unnoticed as they are shared through mailing lists instead of being recorded in a bug tracker. In other words, the data set is a heavily biased sample of the entire population. Unstable results, as seen in transfer learning studies [15]–[17], [38], [41], are nothing new to SE. Keung et al. investigated 90 software engineering estimation (SEE) methods using 20 data sets [14]. Their results behave in accordance with the prediction of Shepperd et al.'s earlier work [34]: changing conditions affect the performance of estimation methods. The issue of unstable results is also the main topic of a recent special issue of EMSE, where an in-depth

discussion is provided [29]. Menzies et al. report that the instability can possibly be explained by the heterogeneity of SE data sets [26]. They experiment with software effort and defect data sets using a clusterer called WHERE (to find local regions) and a contrast set learner called WHICH (to compare local and global treatments). Menzies et al. show that effort and defect data contain multiple small and local regions with different properties. Their recommendation is to focus on context-specific principles instead of global ones in order to arrive at stable conclusions. Bettenburg et al. also evaluate the performance of local and global models on effort and defect data sets [5]. They evaluate 3 different treatments: 1) global models on the whole data; 2) local models on subsets of the global data; 3) augmenting global models with lessons learned from the local models. They report that the local models tend to fit the data better in comparison to global models. However, they warn that local models are likely to point to too many different local lessons that may be far too context-specific. Hence, they recommend using global models that consider and combine the trends of local models. Recently, Turhan offered a formal treatment of the unstable results in SE, the so-called data set shift [37]. The same problem is also addressed under the names of concept shift, concept drift, changing environments, and contrast mining. Turhan evaluates the types of data shift issues proposed by Storkey [35] from an SE perspective. Storkey points out that dataset shift is a specific case of transfer learning. Transfer learning deals with cases where there are multiple training scenarios that are partially related and are used to predict in one specific scenario [35]. A very interesting transfer learning direction is the transfer of data between different time frames. This direction has been investigated by Lokan et al. [21], [22].
In [22], Lokan and Mendes, using chronological sets of instances, found that time frame divisions of instances did not affect prediction accuracy. In [21], they found that it is possible to suggest a window size of a time frame of past instances which can yield a performance increase in estimation. They also note that the size of the window frame is data set dependent. Successful application of data transfer between different time frames constitutes a very important finding. Given that relevant software projects from earlier time frames can be successfully identified and transferred to current time frames, older data sets are no longer obsolete and the effort invested in the collection of these datasets is not wasted. On the contrary, they can still provide training instances that are relevant to current contexts. Rahman et al. investigate resource-constrained transfer learning in the context of defect prediction [31], which is another promising direction for transfer learning studies. Rahman et al. note that, since the resources available for testing are limited, it is infeasible to test all the entities that a transfer learning-based prediction model identifies as defective. They test cross-defect prediction performance under different practical resource constraints. Rahman et al. report that cross-defect prediction models are as good as within-defect models. Furthermore, they note that cross-defect model performance is highly stable under different constraints.

B. Semi-supervised Learning

Accurate dependent variable information, which is critical for supervised methods, may not always be available for all or most of the training instances due to the cost and difficulty inherent in its collection. For example, accurate detection of fault prone modules enables organizations to produce high quality software [24], yet such a quality modeling activity requires the fault content knowledge of previously developed software modules.
Assessing the correctness of the fault content information is time consuming and costly. Effort estimation shares common concerns about label information. Although it is fairly easy to derive static code metrics from a software project, establishing the accurate effort invested in a project can require a considerable amount of time and money [18]. A research direction that relaxes the dependent variable information requirement is important for predictive modeling studies. One such predictive modeling technique is semi-supervised learning. Semi-supervised methods are a group of machine learning algorithms that learn from a set of training instances among which only a small subset has preassigned labels [8]. The fundamental idea is to learn from a small set of training instances with known labels and to supplement the learning with training instances whose labels are unknown. Despite the promise, semi-supervised learning appears to be less than thoroughly investigated in the context of SE. Early work on semi-supervised learning was performed in the software quality domain. Seliya et al. use an expectation maximization (EM) based approach, where unlabeled modules are viewed as missing information and semi-supervised learning can be reduced to the problem of completing missing data [32]. As the authors put it, the proposed semi-supervised algorithm is particularly useful when the available labeled instances are not enough to yield sufficient prediction performance. Another semi-supervised approach proposed by Seliya et al. is based on k-means clustering [33], which extends traditional unsupervised clustering methods for better partitioning of the data. The clustering-based semi-supervised algorithm aids software engineering experts when labeling clusters of modules as defective or non-defective. The experts may be able to make better estimations in comparison to unsupervised clustering algorithms. Li et al.
developed a framework which maps ensemble learning and random forests into a semi-supervised learning setting [19]. The random forest starts learning with the initially labeled small subset of training instances and provides the first set of estimates for the unlabeled training instances. In the following iteration, the training instances whose labels were estimated in the previous round are also included in the learning of the random forest. The random forest iterates over the unlabeled training data. The estimate of the ensemble is found via majority voting of the formed prediction trees. Recently, we started applying semi-supervised algorithms to software quality data. Our initial investigation started with the application of a self-training (another branch of semi-supervised

learning) method called Fitting-the-Confidence-Fits (FTcF), which iteratively trains a base supervised learner over the currently labeled training instances [24]. At each iteration, the unlabeled training instances whose estimated labels have a high prediction confidence score are moved into the set of labeled training instances. The algorithm terminates when there are no unlabeled training instances left to migrate. Prior to the semi-supervised iterations, the training set is given to a feature selector called Multi-Dimensional Scaling (MDS), so as to reduce the space of independent features. The results of our experiments are consistent with the prior literature and show that MDS followed by semi-supervised learning consistently outperforms the most successful supervised learners while using only a fraction of the labeled data. We experimented with 4 different software quality data sets from the NASA MDP repository (for details refer to Table 1 of [24]). A sample of the results on the PC4 data set is given in Figure 1.

Fig. 1. The performance of semi-supervised prediction experiments on the PC4 dataset (figure from [24]).

We use the following acronyms in Figure 1: SL for supervised learning without MDS; SL.MDS for supervised learning with MDS; FTcF for semi-supervised learning without MDS; FTcF.MDS for semi-supervised learning with MDS. The percentage values on the x-axis show what percentage of the labeled data (defective and non-defective) is made available for training w.r.t. the size of the entire data set. The y-axis shows either the area under the ROC curve (AUC) or the probability of correct detection (PD) of fault prone modules at different thresholds ({0.1, 0.5, 0.75}). For both performance measures, FTcF.MDS (i.e. the semi-supervised learner augmented with MDS) has the best performance, which shows that when the initial number of instances with available labels (in this case faulty or not
faulty) is limited, semi-supervised learners perform better than supervised learners. The same results are observed for the remaining 3 data sets. The ANOVA analysis reported in [24] confirms Figure 1: FTcF.MDS outperforms the other methods w.r.t. both AUC and PD with statistical significance.

C. Active Learning

Active learning methods start out as unsupervised methods, in the sense that they utilize an initially unlabeled data set. An active learning method can query an oracle for the labels of certain instances. However, each query comes with a cost: in SE, the cost of a query may imply a costly assessment of the dependent variable information by domain experts (inspection, testing, etc.). Ideally, an active learner is expected to ask for as few labels as possible while yielding a satisfactory estimation performance. The machine learning literature presents example uses of active learning to reduce data label requirements. For example, Dasgupta [9] seeks generalizable guarantees in active learning, using a greedy active learning heuristic that is able to deliver performance values as good as any other heuristic in terms of reducing the number of required labels [9]. Balcan et al. show that active learning provides the same performance as a supervised learner with substantially smaller sample sizes [4]. In [39], Wallace et al. propose a citation screening method based on active learning augmented with a priori expert knowledge. Practical application exemplars of active learning can be found in software testing [7], [40]. In Bowring et al.'s study [7], active learning is used to augment learners for the automatic classification of program behavior. They show that learners augmented with active learning yield a significant reduction in data labeling effort and can generate results comparable to supervised learning. Xie et al. [40] use human inspection as an active learning strategy for effective test generation and specification inference.
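Abstracting over these studies, the core active learning loop (repeatedly query an oracle for the label of the most informative instance, under a fixed label budget) can be sketched as follows. The margin-based uncertainty measure, the budget, and the toy data are illustrative assumptions, not the heuristics of any cited work.

```python
# Illustrative active-learning query loop: ask an oracle to label the instance
# the current model is least certain about, until the label budget runs out.
# The uncertainty measure (margin between the nearest labeled neighbor of each
# class) and the toy data below are assumptions for this sketch.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def uncertainty(labeled, x):
    """Small margin between the classes' nearest labeled neighbors = high uncertainty."""
    by_class = {}
    for inst, lab in labeled:
        by_class[lab] = min(dist(inst, x), by_class.get(lab, float("inf")))
    dists = sorted(by_class.values())
    return -(dists[1] - dists[0]) if len(dists) > 1 else 0.0

def active_learn(pool, oracle, seed, budget):
    """Query the oracle for the most uncertain pool instance, up to `budget` times."""
    labeled, pool = list(seed), list(pool)
    for _ in range(budget):
        if not pool:
            break
        query = max(pool, key=lambda x: uncertainty(labeled, x))
        pool.remove(query)
        labeled.append((query, oracle(query)))  # each oracle call carries a cost
    return labeled

# Toy usage: the oracle stands in for a hypothetical domain expert.
oracle = lambda x: "defective" if x[0] > 2 else "clean"
seed = [((0.0, 0.0), "clean"), ((5.0, 5.0), "defective")]
pool = [(2.4, 2.4), (0.1, 0.1), (4.9, 4.9)]
model_data = active_learn(pool, oracle, seed, budget=1)
```

With a budget of one query, only the borderline instance is sent to the expert; the two instances that closely resemble already-labeled data are never queried, which is the label-saving behavior the studies above exploit.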
In Xie et al.'s experiments, the number of tests selected for human inspection is reasonably low; the direct implication is that labeling required significantly less effort than screening all single test cases. Hassan and Xie list active learning as part of the future of software engineering [12]. Recently, Kocaguneli et al. used an active learning heuristic to reduce the data label requirements in software effort estimation data sets [18]. The proposed active learning solution requires at most 40% of the original data and can perform as well as state-of-the-art supervised learners, which require all the available instances and labels. The reduced set of instances that can provide performance values as good as using all the instances is called the essential content of the dataset. In this study, Kocaguneli et al. interviewed practitioners from industry and academia regarding the value added by active learning. The interviews revealed that the possible value-added inherent in active learning methods is twofold. Firstly, given the vast amount of data available today, it is impossible for practitioners to discuss each and every instance during meetings with other experts or clients. Active learning methods can greatly reduce the number of instances by providing an essential content of the data set to be discussed. The second

benefit of active learning methods comes from the fact that the cost associated with collecting dependent variable information is usually higher than the cost associated with the collection of the independent variables. One of the interviewees supports this idea with the example of $1.5M spent by NASA in the period 1987 to 1990 to understand the historical records of their software. Active learning solutions do not solve all the problems. The results presented in [18] are encouraging, yet these results come from a regression problem. Our preliminary results on the application of a similar active learning solution to defect prediction show that, for classification-type problems, the performance is susceptible to uneven class distributions.

Fig. 2. The current map of the strengths and challenges of different learning methods. The solid line indicates the synergy that is currently being followed, whereas dashed lines indicate the new synergies proposed in this work.

III. THE CURRENT CHALLENGES

In the previous section, we discussed the details of the solutions provided for the lack of local data (transfer learning) and for the lack of labeled instances (semi-supervised and active learning). We also discussed initial applications of these promising prediction techniques. Although these algorithms are beneficial for specific types of problems, they are not completely devoid of problems. For example, transfer learning methods are helpful because they provide means for data transfer between different domains or time frames. However, transfer learning methods that use cross-domain data in an inappropriate manner perform poorly. Through relevancy filtering of cross data, performance can be improved, yet such a filtering process is likely to reduce the amount of available data considerably. Among 21 pairs of cross- and within-domain data sets, Kocaguneli et al.
report that the amount of data transferred from cross data sources is at most 15% of the data set size and can be as low as 5% [17]. Semi-supervised learning methods enable learning where only a small subset of the instances is labeled [19], [24]. The downside of semi-supervised approaches is that they still require an initial subset of labeled training instances. In other words, they do not entirely mitigate the cost and time associated with data collection and data V&V effort. Additionally, since the estimated labels of the unlabeled training data depend on the small subset of initial labels, the correctness of the initial subset of labels plays a crucial role. Active learning methods do not initially require any labels [9], and they are helpful for discovering the essential content of a software engineering data set. On the other hand, similar to supervised learners, they are negatively affected by imbalanced class distributions [11], notoriously present in most software engineering data sets.

IV. SYNERGIES

The previous section discussed the strengths and weaknesses associated with the learning methods discussed here. Figure 2 presents this discussion in a more organized manner. In Figure 2, the relationships between the different learning techniques are shown with arrows. The solid line in Figure 2 shows the currently recognized synergy between supervised learning, transfer learning and prediction. The dashed lines show the synergies we feel are about to open new opportunities for software engineering research. To the best of our knowledge, the proposed synergies have not been investigated before in the SE community. However, we foresee that researching the proposed links between seemingly separate learning techniques can help us benefit from the volume of information available in the existing public data sets and solve the challenges of predictive engineering.

Synergy #1: This synergy is being investigated in the SE community, with recent promising developments. Following the link between supervised learning and transfer learning has shown us that the lack of local data may no longer be a problem for organizations. Although the initial results were unstable, with equal amounts of studies for and against the value of cross data [15], [41], we have recently seen that the performance of cross data can be significantly improved through relevancy filtering [16], [17], [38].

Synergy #2: The synergy between transfer learning and semi-supervised learning has not been investigated yet. The challenge with semi-supervised learning is that it requires a small set of initially labeled training data. Given the existence of a reliable small set of labeled training instances, semi-supervised learning is reported to be able to effectively augment the training process [19], [24], [32], [33]. We also know that transfer learning can filter the instances of a labeled cross-company data set down to a small subset of instances [17] that are relevant to a local context [16], [17], [38]. In the case that an organization has only unlabeled local data, it is possible that the initial small set of labeled training data (required for semi-supervised learning) can be provided via transfer learning. The transferred initial data can then be used to supplement the unlabeled local training data with estimated labels. Figure 3 visualizes the proposed synergy between transfer and semi-supervised learning.

Fig. 3. The proposed synergy between transfer and semi-supervised learning. Labeled cross-company data is filtered via transfer learning to its locally relevant subset. The locally relevant subset is used by semi-supervised learning to supplement the unlabeled within training data.

Synergy #3: Mountains of publicly available software engineering data impose certain threats too. Often, cross-domain data (data collected in another organization) may be very different w.r.t. a local context. The difference may be due to design and programming practices as well as different team environments. The size of the data sets previously employed in transfer learning experiments is on the order of a thousand instances. For example, the biggest data set used by Turhan et al. is 1109 instances [38]. However, in the era of big data, the sizes of the data sets in transfer learning experiments are bound to increase considerably. So when the cross data is too large (in comparison to the previously experimented quality and effort data sets) and possibly too noisy, it may be challenging for transfer learning methods to filter it down to only the locally relevant instances. Hence, before the transfer learning step, it may become good practice to work only with the essential content of a data set. Active learning seems to be effective in finding the essential content of small to medium sized SE data sets [18]. Researching the synergy between active learning (to find the essential content) and transfer learning (to transfer instances from the essential content to the local context) is a promising yet unexplored research direction. Figure 4 depicts the current transfer learning experiments and the proposed synergy between transfer and active learning.

Fig. 4. Current transfer learning experiments and the proposed Synergy #3. (a) Current transfer learning experiments. (b) The proposed synergy between active and transfer learning.

V. CONCLUSION

Predictive modeling studies in software engineering benefit from the large publicly available data sets. The availability of data presents promising opportunities for new research directions, but there are multiple challenges as well.
In this position paper, we focused on two specific challenges. We analyzed information transfer from data-rich domains into domains with limited data availability. We observed research trends in building prediction models when dependent variable information is scarce. We found synergies between the two problems in terms of learning techniques that complement each other. We discussed transfer learning studies as they relate to the former challenge, and semi-supervised and active learning research with respect to the latter. Their synergistic interactions appear to be mostly overlooked by the software engineering research community. We see a considerable research effort invested in these problem areas. However, the investigation of learning techniques that tackle the presented data-related challenges is new and far from having been exhausted. We believe that the research within each referenced learning method is critically important, yet identifying the strengths and weaknesses of

7 separate learning methods and following their synergies may enable much better use of data that we have today. REFERENCES [1] W. Afzal and R. Torkar. On the application of genetic programming for software engineering predictive modeling: A systematic review. Expert Systems with Applications, 38(9): , [2] A. Arnold, R. Nallapati, and W. Cohen. A comparative study of methods for transductive transfer learning. In ICDM 07: Seventh IEEE International Conference on Data Mining Workshops, pages 77 82, [3] A. Bachmann, C. Bird, F. Rahman, P. Devanbu, and A. Bernstein. The missing links: bugs and bug-fix commits. In Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering, FSE 10, pages , [4] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Proceedings of the 23rd international conference on Machine learning - ICML 06, pages 65 72, [5] N. Bettenburg, M. Nagappan, and A. Hassan. Think locally, act globally: Improving defect and effort prediction models. In Mining Software Repositories (MSR), th IEEE Working Conference on, pages 60 69, June. [6] C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov, and P. Devanbu. Fair and balanced?: bias in bug-fix datasets. In Proceedings of the the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, ESEC/FSE 09, pages , New York, NY, USA, ACM. [7] J. F. Bowring, J. M. Rehg, and M. J. Harrold. Active learning for automatic classification of software behavior. ACM SIGSOFT Software Engineering Notes, 29(4):195, July [8] O. Chapelle, B. Schlkopf, and A. Zien. Semi-supervised Learning. MIT Press, Cambridge, MA, USA, [9] S. Dasgupta. Analysis of a greedy active learning strategy. in Neural Information Processing Systems 17:, 1(x), [10] F. Deissenboeck, E. Juergens, K. Lochmann, and S. Wagner. Software quality models: Purposes, usage scenarios and requirements. 
In Software Quality, WOSQ 09. ICSE Workshop on, pages 9 14, may [11] S. Ertekin, J. Huang, L. Bottou, and L. Giles. Learning on the border: active learning in imbalanced data classification. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, CIKM 07, pages , New York, NY, USA, ACM. [12] A. Hassan and T. Xie. Software intelligence: the future of mining software engineering data. In Proceedings of the FSE/SDP workshop on Future of software engineering research, pages ACM, [13] M. Jorgensen and M. Shepperd. A Systematic Review of Software Development Cost Estimation Studies. IEEE Trans. Softw. Eng., 33(1):33 53, [14] J. Keung, E. Kocaguneli, and T. Menzies. Finding conclusion stability for selecting the best effort predictor in software effort estimation. Automated Software Engineering, pages 1 25, /s [15] B. A. Kitchenham, E. Mendes, and G. H. Travassos. Cross versus withincompany cost estimation studies: A systematic review. IEEE Trans. Softw. Eng., 33(5): , [16] E. Kocaguneli, G. Gay, Y. Yang, T. Menzies, and J. Keung. When to use data from other projects for effort estimation. In ASE 10: Proceedings of the International Conference on Automated Software Engineering (short paper), New York, NY, USA, [17] E. Kocaguneli and T. Menzies. How to find relevant data for effort estimation. In ESEM 11: International Symposium on Empirical Software Engineering and Measurement, [18] E. Kocaguneli, T. Menzies, J. Keung, D. Cok, and R. Madachy. Active learning and effort estimation: Finding the essential content of software effort estimation data. IEEE Trans. on Softw. Eng., PP(99):1, [19] M. Li, H. Zhang, R. Wu, and Z.-H. Zhou. Sample-based software defect prediction with active and semi-supervised learning. Automated Software Engineering, 19: , [20] G. Liebchen, B. Twala, and M. Shepperd. Filtering, robust filtering, polishing: Techniques for addressing quality in software data. 
In Empirical Software Engineering and Measurement, ESEM First International Symposium on, pages , Sept. [21] C. Lokan and E. Mendes. Applying moving windows to software effort estimation. In ESEM 09: Proceedings of the 3rd International Symposium on Empirical Software Engineering and Measurement, pages , [22] C. Lokan and E. Mendes. Using chronological splitting to compare cross- and single-company effort models: further investigation. In Proceedings of the Thirty-Second Australasian Conference on Computer Science - Volume 91, ACSC 09, pages 47 54, [23] C. Lokan, T. Wright, P. Hill, and M. Stringer. Organizational benchmarking using the isbsg data repository. Software, IEEE, 18(5):26 32, [24] H. Lu, B. Cukic, and M. Culp. Software defect prediction using semi-supervised learning with dimension reduction. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, ASE 2012, pages , [25] Y. Ma, G. Luo, X. Zeng, and A. Chen. Transfer learning for crosscompany software defect prediction. Information and Software Technology, 54(3): , [26] T. Menzies, A. Butcher, A. Marcus, T. Zimmermann, and D. Cok. Local vs. global models for effort estimation and defect prediction. In Automated Software Engineering (ASE), th IEEE/ACM International Conference on, pages , Nov. [27] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. The promise repository of empirical software engineering data, June [28] T. Menzies, Z. Milton, B. Turhan, B. Cukic, Y. Jiang, and A. Bener. Defect prediction from static code features: current results, limitations, new approaches. Automated Software Engineering, 17: , /s [29] T. Menzies and M. Shepperd. Special issue on repeatable results in software engineering prediction. Empirical Software Engineering, 17(1-2):1 17, [30] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10): , [31] F. Rahman, D. Posnett, and P. Devanbu. 
Recalling the imprecision of cross-project defect prediction. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE 12, pages 61:1 61:11, [32] N. Seliya and T. Khoshgoftaar. Software quality estimation with limited fault data: a semi-supervised learning perspective. Software Quality Journal, 15: , [33] N. Seliya and T. M. Khoshgoftaar. Software quality analysis of unlabeled program modules with semisupervised clustering. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 37(2): , march [34] M. Shepperd and C. Schofield. Estimating software project effort using analogies. IEEE Trans. Softw. Eng., 23(11): , [35] A. Storkey. When training and test sets are different: characterizing learning transfer. Dataset Shift in Machine Learning (J. Candela, M. Sugiyama, A. Schwaighofer and N. Lawrence, eds.). MIT Press, Cambridge, MA, pages 3 28, [36] M. Svahnberg, T. Gorschek, R. Feldt, R. Torkar, S. B. Saleem, and M. U. Shafique. A systematic review on strategic release planning models. Information and Software Technology, 52(3): , [37] B. Turhan. On the dataset shift problem in software engineering prediction models. Empirical Software Engineering, 17:62 74, [38] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5): , [39] B. Wallace, K. Small, C. Brodley, and T. Trikalinos. Active learning for biomedical citation screening. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages ACM, [40] T. Xie and D. Notkin. Mutually enhancing test generation and specification inference. Formal Approaches to Software Testing, pages , [41] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy. Crossproject defect prediction: A large scale experiment on data vs. domain vs. process. ESEC/FSE, pages , [42] T. Zimmermann, R. 
Premraj, and A. Zeller. Predicting defects for eclipse. In Predictor Models in Software Engineering, PROMISE 07: ICSE Workshops International Workshop on, page 9, may


More information

HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK

HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK 1 K.RANJITH SINGH 1 Dept. of Computer Science, Periyar University, TamilNadu, India 2 T.HEMA 2 Dept. of Computer Science, Periyar University,

More information

Strategic Online Advertising: Modeling Internet User Behavior with

Strategic Online Advertising: Modeling Internet User Behavior with 2 Strategic Online Advertising: Modeling Internet User Behavior with Patrick Johnston, Nicholas Kristoff, Heather McGinness, Phuong Vu, Nathaniel Wong, Jason Wright with William T. Scherer and Matthew

More information

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

More information

SPATIAL DATA CLASSIFICATION AND DATA MINING

SPATIAL DATA CLASSIFICATION AND DATA MINING , pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

More information

Data Quality in Empirical Software Engineering: A Targeted Review

Data Quality in Empirical Software Engineering: A Targeted Review Full citation: Bosu, M.F., & MacDonell, S.G. (2013) Data quality in empirical software engineering: a targeted review, in Proceedings of the 17th International Conference on Evaluation and Assessment in

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Software project cost estimation using AI techniques

Software project cost estimation using AI techniques Software project cost estimation using AI techniques Rodríguez Montequín, V.; Villanueva Balsera, J.; Alba González, C.; Martínez Huerta, G. Project Management Area University of Oviedo C/Independencia

More information

Clustering Technique in Data Mining for Text Documents

Clustering Technique in Data Mining for Text Documents Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor

More information

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type. Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada

More information

Using One-Versus-All classification ensembles to support modeling decisions in data stream mining

Using One-Versus-All classification ensembles to support modeling decisions in data stream mining Using One-Versus-All classification ensembles to support modeling decisions in data stream mining Patricia E.N. Lutu Department of Computer Science, University of Pretoria, South Africa Patricia.Lutu@up.ac.za

More information

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Processing and data collection of program structures in open source repositories

Processing and data collection of program structures in open source repositories 1 Processing and data collection of program structures in open source repositories JEAN PETRIĆ, TIHANA GALINAC GRBAC AND MARIO DUBRAVAC, University of Rijeka Software structure analysis with help of network

More information

Random forest algorithm in big data environment

Random forest algorithm in big data environment Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest

More information

Regression Testing Based on Comparing Fault Detection by multi criteria before prioritization and after prioritization

Regression Testing Based on Comparing Fault Detection by multi criteria before prioritization and after prioritization Regression Testing Based on Comparing Fault Detection by multi criteria before prioritization and after prioritization KanwalpreetKaur #, Satwinder Singh * #Research Scholar, Dept of Computer Science and

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

College information system research based on data mining

College information system research based on data mining 2009 International Conference on Machine Learning and Computing IPCSIT vol.3 (2011) (2011) IACSIT Press, Singapore College information system research based on data mining An-yi Lan 1, Jie Li 2 1 Hebei

More information

Towards Inferring Web Page Relevance An Eye-Tracking Study

Towards Inferring Web Page Relevance An Eye-Tracking Study Towards Inferring Web Page Relevance An Eye-Tracking Study 1, iconf2015@gwizdka.com Yinglong Zhang 1, ylzhang@utexas.edu 1 The University of Texas at Austin Abstract We present initial results from a project,

More information

Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

More information

Big Data Classification: Problems and Challenges in Network Intrusion Prediction with Machine Learning

Big Data Classification: Problems and Challenges in Network Intrusion Prediction with Machine Learning Big Data Classification: Problems and Challenges in Network Intrusion Prediction with Machine Learning By: Shan Suthaharan Suthaharan, S. (2014). Big data classification: Problems and challenges in network

More information

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework Jakrarin Therdphapiyanak Dept. of Computer Engineering Chulalongkorn University

More information

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

More information

Getting Even More Out of Ensemble Selection

Getting Even More Out of Ensemble Selection Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand qs12@cs.waikato.ac.nz ABSTRACT Ensemble Selection uses forward stepwise

More information

Analysis of Software Project Reports for Defect Prediction Using KNN

Analysis of Software Project Reports for Defect Prediction Using KNN , July 2-4, 2014, London, U.K. Analysis of Software Project Reports for Defect Prediction Using KNN Rajni Jindal, Ruchika Malhotra and Abha Jain Abstract Defect severity assessment is highly essential

More information

Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data

Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data Fifth International Workshop on Computational Intelligence & Applications IEEE SMC Hiroshima Chapter, Hiroshima University, Japan, November 10, 11 & 12, 2009 Extension of Decision Tree Algorithm for Stream

More information