Software Defect Prediction Based on Classifiers Ensemble


Journal of Information & Computational Science 8: 16 (2011)

Tao WANG, Weihua LI, Haobin SHI, Zun LIU
School of Computer, Northwestern Polytechnical University, Xi'an, China

Abstract

Software defect prediction using classification algorithms has been advocated by many researchers. However, several recent studies have reported a performance bottleneck when a single classifier is applied, while classifier ensembles can improve classification performance beyond what a single classifier achieves. Motivated by these two observations, which suggest that defect prediction using classifier ensembles has not been fully exploited, we conduct a comparative study of various ensemble methods from the perspective of ensemble taxonomy. These methods include Bagging, Boosting, Random Tree, Random Forest, Random Subspace, Stacking, and Voting. We also compare these ensemble methods to a single classifier, Naïve Bayes. A series of benchmarking experiments on the public-domain MDP datasets shows that applying classifier ensemble methods to defect prediction can achieve better performance than using a single classifier. In particular, among the seven ensemble methods covered by our experiments, Voting and Random Forest showed a clear performance advantage over the others, and Stacking also showed good generalization ability.

Keywords: Software Defect Prediction; Classifiers Ensemble; Ensemble Methodology

1 Introduction

As software systems grow more and more complex, the probability that they contain defective modules grows as well. Meanwhile, software quality assurance is a resource- and time-consuming task, and budgets rarely allow complete testing of an entire system. Therefore, identifying which software modules are more likely to be defective can help us allocate limited time and resources effectively.
Given recent research in artificial intelligence, data mining algorithms such as Naïve Bayes, artificial neural networks (ANN), support vector machines (SVM), decision trees, logistic regression, and association rules are often used to automatically learn predictors of software quality, which identify defective software modules based on software metrics. It has been observed that the majority of a software system's faults are contained in a small number of modules [1, 2], so defect prediction can help us allocate test time and resources to the small number of modules that seem defect-prone, even though no defect predictor can be expected to identify every defective module. Corresponding author: Water@snnu.edu.cn (Tao WANG). Copyright 2011 Binary Information Press, December 2011.

Classification is one of the most popular approaches to software defect prediction [3-6]. Classifiers categorize modules, represented by a set of software complexity metrics or code attributes/features, into defect-prone and non-defect-prone classes by means of a classification model derived from data of previous development projects [7]. Well-known software complexity metrics include code size [8], Halstead's complexity [10], and McCabe's cyclomatic complexity [9]. Many recent studies have reported the advantages of classification models for defect prediction, such as those of Tim Menzies et al. [14], E. J. Weyuker et al. [11, 12], Qinbao Song et al. [13], and others [15-19]. Meanwhile, a "ceiling effect" of such predictors has been observed by several researchers [20-23]. These results show that better data mining techniques may not lead to better defect predictors; it appears that software metrics-based defect prediction has reached a performance bottleneck. Consequently, several studies have tried to break this "ceiling effect". Tim Menzies et al. tried to break the performance ceiling by improving the information content of the training data rather than applying better data mining techniques [20]. In another paper, Menzies et al. argued that learners must be chosen and customized to the goal at hand, and proposed a meta-learner framework that can tune a learner to specific local business goals rather than relying on the indiscriminate use of learners [21]. Hongyu Zhang et al. argued that better defect predictors can be trained from defect-dense components by focusing on the defect-rich portions of the training sets [22]. Yue Jiang et al. hypothesized that future research into defect prediction should shift its focus from designing better modeling algorithms towards improving the information content of the training data.
They also found that models that utilize a combination of design- and code-level metrics outperform models that use either metric set alone [23]. Yi Liu et al. presented a novel search-based approach to software quality modeling with multiple software project repositories, which includes three strategies for modeling with multiple projects: Baseline Classifier, Validation Classifier, and Validation-and-Voting Classifier [24]. These studies all reached positive conclusions about breaking the "ceiling effect" by different means. There is ample theoretical and empirical evidence that ensemble methodology, which combines the predictions of multiple classifiers, leads to more accurate decision-making [25-30, 46]. Nevertheless, few studies have reported defect prediction models based on ensembles of classifiers. A. Tosun et al. proposed an ensemble model that combines three different algorithms, Naïve Bayes, an artificial neural network, and Voting Feature Intervals, and considerably improved defect detection capability compared to the Naïve Bayes algorithm [31]; however, Tosun's work covered only one ensemble method and few of the NASA MDP [41] datasets. Jun Zheng analyzed three cost-sensitive boosting algorithms that boost neural networks for software defect prediction [32], but his work focused on boosting, which is only one classifier ensemble method. We believe that predicting software defects with classifier ensembles has not yet been fully explored, so this paper focuses on two questions: (a) how should a classifier ensemble be constructed? and (b) which popular classifier ensemble method is more effective for defect prediction? In Section 2 we review seven popular classifier ensemble methods from the perspective of a taxonomy for characterizing ensemble methods: Bagging, Boosting, Random Forest, Random Tree, Random Subspace, Stacking, and Voting.
In Section 3 we present experiments that evaluate the performance of these methods and compare them to a single classifier. Conclusions are given in Section 4.

2 Classifiers Ensemble Based Defect Prediction Models

Ensemble methodology imitates our second nature of seeking several opinions before making any crucial decision [27]. A classifier ensemble, which builds a classification model by integrating multiple classifiers, can be used to improve defect prediction performance. The main idea of ensemble methodology is to combine a set of learning models whose decisions are merged to improve the performance of the overall system. Indeed, ensemble methods can effectively exploit the diversity of various types of classifiers to reduce variance without increasing bias [26]. We first make the following assumptions. For the classification problem of defect prediction, a module m is represented by a set of code attributes, m : {a1, a2, ..., an}. The classifier's task is to decide whether m belongs to cd or cnd, where cd is the defective class label and cnd is the non-defective class label. Bagging stands for Bootstrap AGGregatING [33]. The main idea of Bagging is to construct each member of the ensemble from a different training dataset, and to combine the predictions by uniform averaging or by voting over class labels. A bootstrap samples N items uniformly at random with replacement; that is, each classifier is trained on a sample of examples taken with replacement from the training set, where each sample's size equals the size of the original training set. Bagging therefore produces a combined model that often performs better than a single model built from the original training set.
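As a concrete illustration (not the authors' implementation), the Bagging procedure of Algorithm 1 below can be sketched in Python; scikit-learn's GaussianNB is our stand-in for the Naïve Bayes base classifier used later in the paper, and the toy data is hypothetical:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def bagging_predict(X_train, y_train, X_test, n_members=10, seed=0):
    """Train each ensemble member on a bootstrap sample (N examples
    drawn with replacement) and combine predictions by majority vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    member_votes = []
    for _ in range(n_members):
        idx = rng.integers(0, n, size=n)            # bootstrap sample S_i
        clf = GaussianNB().fit(X_train[idx], y_train[idx])
        member_votes.append(clf.predict(X_test))
    # majority vote over members; labels assumed to be 0/1
    return (np.mean(member_votes, axis=0) >= 0.5).astype(int)

# toy usage: a single metric, two clearly separated classes
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(bagging_predict(X, y, np.array([[0.05], [1.15]])))  # -> [0 1]
```

Each member sees a slightly different bootstrap sample, which is where the variance reduction of Bagging comes from.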
Algorithm 1 Bagging
Input: the number of ensemble members M
Input: training set S = {(m1, c1), (m2, c2), ..., (mN, cN); c1, c2, ..., cN ∈ {cd, cnd}}
Input: testing set T
Training phase:
  for i = 1 to M do
    Draw (with replacement) a bootstrap sample set Si (N examples) from S
    Train a classifier Ci on Si and add it to the ensemble
  end for
Testing phase:
  for each t in T do
    Query all classifiers Ci
    Predict the class that receives the highest number of votes
  end for

Boosting is another popular ensemble method, and AdaBoost [34] is the best-known member of the Boosting family of algorithms. It trains models sequentially, with a new model trained at each round. AdaBoost constructs an ensemble by performing multiple iterations, each time using different example weights. The weights of incorrectly classified examples are increased, which ensures that misclassification errors on these examples count more heavily in the next iteration. This procedure yields a series of classifiers that complement one another, and the classifiers are combined by voting.

Algorithm 2 AdaBoost
Input: the number of ensemble members M
Input: training set S = {(m1, c1), (m2, c2), ..., (mN, cN); c1, c2, ..., cN ∈ {cd, cnd}}
Initialize: each training example weight wi = 1/N (i = 1 ... N)
Training phase:
  for x = 1 to M do
    Train a classifier Cx using the current example weights
    Compute a weighted error estimate: errx = (sum of wi over all incorrectly classified mi) / (sum of all wi)
    Compute a classifier weight: αx = log((1 - errx)/errx)/2
    For all correctly classified examples mi: wi ← wi · e^(-αx)
    For all incorrectly classified examples mi: wi ← wi · e^(αx)
    Normalize the weights wi so that they sum to 1
  end for
Testing phase:
  for each t in T do
    Query all classifiers Cx
    Predict the class that receives the highest sum of weights αx
  end for

An approach called randomized C4.5, more popularly known as "Random Tree", was proposed by Dietterich [35]. The main idea is to randomize the internal decisions of the learning algorithm. Specifically, it implements a modified version of the C4.5 learning algorithm in which the decision about which split to introduce at each internal node of the tree is randomized: at each node, the 20 best tests are determined, and the test actually used is chosen at random from among them. With continuous attributes, multiple tests on the same attribute may appear in the top 20. The Random Forest [36] approach to creating an ensemble also utilizes a random choice of attributes in the construction of each CART decision tree. The individual trees are constructed using a simple algorithm: each tree is left unpruned, and at each node, rather than choosing the best split among all attributes, the inducer randomly samples N of the attributes (where N is an input parameter) and chooses the best split among those variables. Classification of an unlabeled example is performed by majority vote [27]. The important advantages of the Random Forest method are its speed and its ability to handle a very large number of input attributes. Unlike Random Forest, the Random Subspace method creates a decision forest by utilizing a random selection of attributes when creating each decision tree [37]. A subset of attributes is uniformly and randomly selected and assigned to an arbitrary learning algorithm. In this way, Random Subspace increases the diversity between members of the ensemble. Ho's empirical studies suggested that good results can be obtained by uniformly and randomly choosing 50% of the attributes to create each decision tree in the ensemble, and that Random Subspace is better for datasets with a large number of attributes [37]. Stacking is another ensemble learning technique [38]. In the Stacking scheme there are two levels of models: the set of base models, called level-0 models, and the meta-model, called the level-1 model. The level-0 models are constructed from bootstrap samples of a dataset, and their outputs on a hold-out dataset are used as input to the level-1 model. The task of the level-1 model is to combine this set of outputs so as to correctly classify the target, thereby correcting any mistakes made by the level-0 models. Voting is a strategy for combining classifiers [39, 40]. Majority Voting and Weighted Majority Voting are the most popular Voting methods. In Majority Voting, each ensemble member votes

for one of the classes. The ensemble predicts the class with the highest number of votes. Weighted Majority Voting takes a weighted sum of the votes of the ensemble members, where the weights typically depend on each classifier's confidence in its prediction or on error estimates of the classifier. Kuncheva [39] proposed that classifier ensemble methods can be grouped by the way the ensemble is constructed: the "combination level" defines different ways of combining the classifier decisions; the "classifier level" indicates which base classifiers are used to constitute the ensemble; at the "attributes level", different attribute subsets are used by the classifiers; and the "data level" indicates which dataset is used to train each base classifier. In this taxonomy, Bagging and Boosting belong to the data level, Random Tree and Random Subspace to the attributes level, Stacking to the classifier level, and Voting to the combination level. The exception is Random Forest, which is commonly regarded as a hybrid of the Bagging algorithm and the Random Subspace algorithm.

3 Experiments and Results

3.1 Datasets and preprocessing

The datasets used in this paper come from the NASA IV&V Facility Metrics Data Program (the MDP datasets) [41]. As shown in Table 1, the MDP collection involves 14 datasets collected from real NASA software projects. Each dataset contains many software modules coded in one of several programming languages, including C, C++, Java, and Perl. The datasets vary in scale from 125/6k to 186/315k (number of modules / lines of code) and cover various types of code metrics, including code size, Halstead's complexity, and McCabe's cyclomatic complexity. MDP is a public-domain data program that can be used by any researcher and is the dataset basis of the PROMISE research community [42].
Adopting public-domain datasets is a benchmarking practice of defect prediction research that allows different researchers to compare their techniques conveniently [7, 14]. Our experiments were defined and motivated by such a baseline. In preparation, we performed the following two data preprocessing steps: 1. Replacing missing attribute values. We used the unsupervised filter ReplaceMissingValues in Weka [44] to replace all missing attribute values in each dataset. ReplaceMissingValues replaces missing values with the mode (for nominal attributes) or the mean (for numeric attributes) of the training data. In MDP, all attribute values are numeric, so ReplaceMissingValues replaces missing values with the means. 2. Discretizing attribute values. Numeric attributes were discretized with Weka's Discretize filter using unsupervised 10-bin discretization.

3.2 Accuracy indicators

We adopted two accuracy indicators to evaluate and compare the aforementioned ensemble algorithms: classification accuracy (AC) and area under curve (AUC) [43]. In a binary classification confusion matrix, a test example is called a true positive (TP) if it is predicted positive and is actually positive, a false positive (FP) if it is predicted positive but is actually negative, a true negative (TN) if it is predicted negative and is actually

negative, and a false negative (FN) if it is predicted negative but is actually positive. Thus we have AC = (TP + TN) / (TP + FP + TN + FN). AUC is the area under the ROC curve, that is, the integral of the ROC curve with the false positive rate on the x axis and the true positive rate on the y axis. The closer the ROC curve lies to the top-left of the coordinate system, the better the generalization ability of the corresponding classifier and the larger its AUC; AUC therefore quantitatively indicates a classifier's generalization ability. In accordance with the benchmarking procedure proposed by Stefan Lessmann et al. [7], we also performed the paired two-tailed t-test for all algorithms in the experiments, which reveals whether there is a statistically significant performance difference between algorithms. Specifically, the paired two-tailed t-test compares the means of two related samples to infer whether the difference between the two means is statistically significant.

Table 1: Dataset information of MDP

           CM1  JM1   KC1  KC2  KC3   KC4   MC1  MC2  MW1  PC1  PC2  PC3  PC4  PC5
Language   C    C     C++  C++  Java  Perl  C++  C    C    C    C    C    C    C++
LOC        20k  315k  43k  18k  18k   25k   63k  6k   8k   40k  26k  40k  36k  164k
Modules
Defective

3.3 Experimental procedure

To evaluate and compare the software defect prediction performance of the ensemble methods discussed above and of a single classifier, we employed the Weka tool, which implements all the algorithms needed for our experiments. The algorithm packages are Bagging, AdaBoostM1 (the most popular version of Boosting), NaïveBayes (a popular single classifier, simple and effective), RandomForest, RandomTree, RandomSubSpace, Stacking, and Vote. Bagging, AdaBoostM1, and RandomSubSpace are all meta-learners, so we assigned NaïveBayes as the base classifier of each of these algorithms.
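For concreteness, both indicators can also be computed outside Weka; the following sketch is our illustration using scikit-learn's roc_auc_score, with made-up labels and scores (1 = defective, 0 = non-defective):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def accuracy(y_true, y_pred):
    """AC = (TP + TN) / (TP + FP + TN + FN): the fraction of modules
    whose class label is predicted correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# made-up example: 2 defective and 3 non-defective modules
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0]             # hard labels, used for AC
scores = [0.9, 0.3, 0.35, 0.2, 0.1]  # predicted P(defective), used for AUC

print(accuracy(y_true, y_pred))       # -> 0.8 (one defective module missed)
print(roc_auc_score(y_true, scores))  # 5/6: one defective module is ranked
                                      # below one non-defective module
```

Note that AC needs hard class labels while AUC is computed from a ranking of the modules by predicted defect probability, which is why a classifier can score well on one indicator and poorly on the other.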
We selected four base classifiers for Vote: NaïveBayes, Logistic, LibSVM, and J48, because these are among the most popular algorithms in the software defect prediction research community. The combination rule of Vote is the average of probabilities. We also let the level-0 classifiers of Stacking be these four algorithms, with NaïveBayes as the level-1 classifier. In all experiments, we performed ten runs of 10-fold cross-validation. This yields 100 accuracy values and 100 AUC values for each algorithm on each dataset; the mean of these 100 values is the average accuracy or average AUC of that algorithm on that dataset.

Experimental procedure:
Input: the MDP datasets: D = {CM1, JM1, KC1, KC2, KC3, KC4, MC1, MC2, MW1, PC1, PC2, PC3, PC4, PC5}

Input: all algorithms: A = {Bagging, AdaBoostM1, NaïveBayes, RandomForest, RandomTree, RandomSubSpace, Stacking, Vote}
Dataset preprocessing:
  a. Replace missing attribute values: apply ReplaceMissingValues of Weka to D.
  b. Discretize attribute values: apply Discretize of Weka to D.
Cross-validation:
  for each dataset in D do
    for each algorithm in A do
      Perform 10-fold cross-validation.
      Perform the paired two-tailed t-test.
    end for
  end for
Output:
  a. Accuracy and standard deviation of the 8 classifiers on the 14 datasets.
  b. AUC and standard deviation of the 8 classifiers on the 14 datasets.
  c. The results of the paired two-tailed t-tests.

3.4 Results and discussion

The experimental results for the accuracy of the various algorithms on the various datasets are shown in Table 2. We draw the following observations from them. Because NaïveBayes is the base classifier of Bagging and AdaBoostM1, we first focused on the performance differences among these three algorithms. To our surprise, Bagging and AdaBoostM1 could hardly improve classification accuracy relative to NaïveBayes. On 11 of the datasets (all except KC4, PC3, and PC5), Bagging with NaïveBayes as base classifier even obtained lower accuracy than NaïveBayes; AdaBoostM1 obtained similar results on 9 datasets (all except MC1, MW1, PC1, PC2, and PC3). Although the overall mean accuracies of Bagging and AdaBoostM1 are both higher than that of NaïveBayes, the improvements do not even reach 0.5 percent. RandomSubSpace also uses NaïveBayes as its base classifier; however, it improved classification accuracy on 12 datasets (all except MC2 and JM1), and improved the overall mean accuracy by 1.12 percent over NaïveBayes. Vote obtained the best classification performance on 7 of the 14 datasets, and RandomForest on 4 datasets.
So only on 3 datasets was the best classification performance obtained by an algorithm other than these two. Looking at the overall means, we found that RandomForest, RandomTree, Stacking, and Vote performed clearly better than NaïveBayes and the other ensemble methods. Vote obtained the highest mean accuracy, 88.48%, followed by 87.90% for RandomForest; these two algorithms clearly outperformed the others.
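The paired two-tailed t-test used in these comparisons can be sketched with SciPy's ttest_rel; the accuracy vectors below are made up for illustration and are not the paper's experimental values:

```python
import numpy as np
from scipy.stats import ttest_rel

def compare(acc_a, acc_b, alpha=0.05):
    """Paired two-tailed t-test on matched accuracy samples of two
    classifiers; returns 'win', 'tie', or 'loss' for A vs. B."""
    _, p_value = ttest_rel(acc_a, acc_b)
    if p_value >= alpha:
        return "tie"   # no statistically significant difference
    return "win" if np.mean(acc_a) > np.mean(acc_b) else "loss"

# made-up per-fold accuracies for two classifiers
vote = [0.88, 0.90, 0.87, 0.89, 0.91, 0.88, 0.90, 0.89, 0.87, 0.90]
nb   = [0.83, 0.84, 0.80, 0.83, 0.86, 0.81, 0.85, 0.82, 0.82, 0.83]
print(compare(vote, nb))  # consistent gap across folds -> "win"
```

Counting these win/tie/loss outcomes per dataset pair is exactly how the w/t/l entries of the t-test tables are obtained.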

Table 2: AC (%) and standard deviation of the classifiers (Bagging, AdaBoostM1, NaïveBayes, RandomForest, RandomTree, RandomSubSpace, Stacking, Vote) on the 14 datasets

We conclude that some classifier ensemble methods can significantly improve classification performance over a single classifier. We performed the paired two-tailed t-test on the 100 accuracy values obtained by every algorithm pair on each dataset, as shown in Table 3. We set the significance level to 0.05; an entry w/t/l in Table 3 means that, statistically, the approach in the corresponding row wins on w datasets, ties on t datasets, and loses on l datasets compared to the approach in the corresponding column. We draw several observations from the test results. The score of Vote vs. NaïveBayes was 12 wins, 2 ties, and 0 losses: Vote overwhelmingly defeated NaïveBayes in the statistical sense. The score of Vote vs. RandomForest was 1 win, 12 ties, and 1 loss, so these two algorithms show no significant difference; however, Vote clearly outperformed the other ensemble algorithms. This observation coincides with the conclusions about Vote drawn from Table 2. RandomForest also had a clear statistical advantage over the other algorithms (including NaïveBayes), apart from Vote, on the majority of the datasets. Bagging vs. NaïveBayes had no wins and even one loss; similarly, AdaBoostM1 vs. NaïveBayes scored only 1 win and 13 ties. Furthermore, the t-test assumes that the two groups of samples follow a normal distribution.
This assumption may not be suitable for our experimental data, which are not guaranteed to follow a normal distribution and may contain outliers. So we had not

captured the entire information contained in the experimental data. We therefore employed boxplots to visually show the classification accuracy values and outliers of the various algorithms on the various datasets. As Figure 1 shows, on 12 of the datasets (all except KC2 and MC2), the accuracy values of Vote had a clearly superior distribution compared to the other algorithms: the boxplot of Vote is located higher in the coordinate system, and its interquartile range (IQR) is smaller, indicating a relatively more concentrated distribution. RandomForest also showed this distributional advantage over the other algorithms (apart from Vote) on all datasets except JM1, KC2, and MC2. These conclusions about Vote and RandomForest are broadly consistent with those drawn from Tables 2 and 3.

Table 3: The compared results of the paired two-tailed t-test on AC at the 0.05 significance level

                Bagging  AdaBoostM1  NaïveBayes  RandomForest  RandomTree  RandomSubSpace  Stacking
AdaBoostM1      2/12/0
NaïveBayes      1/13/0   0/13/1
RandomForest    10/3/1   9/4/1       10/3/1
RandomTree      6/6/2    4/7/3       6/6/2       0/6/8
RandomSubSpace  4/10/0   2/12/0      3/11/0      1/3/10        2/7/5
Stacking        4/7/3    3/8/3       4/7/3       0/5/9         1/9/4       4/6/4
Vote            12/2/0   12/2/0      12/2/0      1/12/1        9/5/0       12/2/0          11/3/0

Table 4: AUC and standard deviation of the classifiers (Bagging, AdaBoostM1, NaïveBayes, RandomForest, RandomTree, RandomSubSpace, Stacking, Vote) on the 14 datasets

Table 4 shows the detailed AUC values and standard deviations of the various algorithms on all datasets. Vote obtained the best AUC on 9 datasets and the best mean AUC over all datasets, so Vote presents the best generalization ability as measured by AUC. Surprisingly, however, RandomForest, which was outstanding in classification accuracy and the related statistical tests, had AUC values no better than those of the other algorithms apart from Vote and Stacking, outperforming only RandomTree. Meanwhile, we found that Stacking achieved very remarkable performance as indicated by AUC: it obtained the best AUC on 7 datasets and the second-best mean AUC over all datasets, behind only Vote. This means that Stacking also has outstanding generalization ability.

Table 5: The compared results of the paired two-tailed t-test on AUC at the 0.05 significance level

                Bagging  AdaBoostM1  NaïveBayes  RandomForest  RandomTree  RandomSubSpace  Stacking
AdaBoostM1      0/8/6
NaïveBayes      0/12/2   4/10/0
RandomForest    3/9/2    5/9/0       3/10/1
RandomTree      0/1/13   0/4/10      0/1/13      0/0/14
RandomSubSpace  1/11/2   6/8/0       1/12/1      1/10/3        12/2/0
Stacking        5/9/0    9/5/0       6/8/0       1/13/0        14/0/0      6/8/0
Vote            6/8/0    11/3/0      7/7/0       4/10/0        14/0/0      7/7/0           2/12/0

Table 5 shows the compared results of the paired two-tailed t-test on the AUC values; the significance level and the meaning of each entry are the same as in Table 3. The top three on this scoreboard are Vote, Stacking, and RandomForest, which statistically confirms the conclusions drawn from Table 4.

4 Conclusions

In this paper, we conducted a comparative study of seven classifier ensemble methods in the context of software defect prediction. These methods have different ensemble construction schemes, each focusing on a different "level".
The experimental results show that some of the seven algorithms, when applied to software defect prediction, achieve a clear performance improvement over a single classifier. Several indicators, including average classification accuracy, average AUC, and the boxplots, all provide evidence that Voting achieved the best performance among all the ensemble algorithms and the single classifier. Random Forest's performance is second only to Voting's, but its generalization ability is not better than Stacking's. All this evidence leads us to advocate applying the Voting algorithm to defect prediction. In future work, we plan to continue evaluating various ensemble algorithms with different base classifiers, to identify whether our conclusions are affected by the base classifiers adopted in our experiments.

Fig. 1: Accuracy boxplots of the various datasets classified by the 8 classifiers

Acknowledgements

This work is supported by the National High-Tech Research and Development Plan of China under grant No. 2006AA01Z406, the National Natural Science Foundation of China, the Doctorate Foundation of Northwestern Polytechnical University under grant No. CX200815, the Planned Science and Technology Project of Shaanxi Province, China under grant No. 2010JM8039, and the Youth Foundation of Shaanxi Normal University.

References

[1] C. Andersson. A Replicated Empirical Study of a Selection Method for Software Reliability Growth Models. Empirical Software Engineering, vol. 12, no. 2.
[2] N. E. Fenton and N. Ohlsson. Quantitative Analysis of Faults and Failures in a Complex Software System. IEEE Transactions on Software Engineering, vol. 26, no. 8.
[3] T. J. Ostrand, E. J. Weyuker and R. M. Bell. Predicting the location and number of faults in large software systems. IEEE Transactions on Software Engineering, vol. 31, no. 4.
[4] N. E. Fenton, P. Krause and M. Neil. Software Measurement: Uncertainty and Causal Modeling. IEEE Software, vol. 10, no. 4.
[5] N. Nagappan and T. Ball. Using software dependencies and churn metrics to predict field failures: An empirical case study. In Proceedings of the First International Symposium on Empirical Software Engineering and Measurement (ESEM 07).
[6] B. Turhan and A. Bener. A multivariate analysis of static code attributes for defect prediction. In Proceedings of the Seventh International Conference on Quality Software (QSIC 07).
[7] S. Lessmann, B. Baesens, C. Mues and S. Pietsch. Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering, vol. 34, no. 4.
[8] A. G. Koru, K. Emam, D. Zhang, H. Liu, and D. Mathew.
Theory of relative defect proneness. Empirical Software Engineering, vol. 13, no. 5.
[9] T. McCabe. A complexity measure. IEEE Transactions on Software Engineering, vol. 2, no. 4.
[10] M. H. Halstead. Elements of Software Science. Elsevier, North-Holland.
[11] E. J. Weyuker, T. J. Ostrand, and R. M. Bell. Adapting a Fault Prediction Model to Allow Widespread Usage. In Proceedings of the 4th International Workshop on Predictive Models in Software Engineering (PROMISE 08), Leipzig, Germany, May 12-13.
[12] E. J. Weyuker, T. J. Ostrand, and R. M. Bell. Do too many cooks spoil the broth? Using the number of developers to enhance defect prediction models. Empirical Software Engineering, vol. 13, no. 5.
[13] Q. Song and M. Shepperd. Software Defect Association Mining and Defect Correction Effort Prediction. IEEE Transactions on Software Engineering, vol. 32, no. 2, pages 69-82.
[14] T. Menzies, J. Greenwald, and A. Frank. Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering, vol. 33, no. 1, pages 2-13, 2007.

[15] N. Nagappan and T. Ball. Using software dependencies and churn metrics to predict field failures: An empirical case study. In Proceedings of the First International Symposium on Empirical Software Engineering and Measurement (ESEM 07), Washington, DC, USA.
[16] N. F. Schneidewind. Investigation of Logistic Regression as a Discriminant of Software Quality. In Proceedings of the IEEE CS Seventh Int'l Conf. Software Metrics Symp.
[17] L. Guo, B. Cukic, and H. Singh. Predicting Fault Prone Modules by the Dempster-Shafer Belief Networks. In Proceedings of the IEEE CS 18th Int'l Conf. Automated Software Eng.
[18] T. M. Khoshgoftaar and N. Seliya. Comparative Assessment of Software Quality Classification Techniques: An Empirical Case Study. Empirical Software Engineering, vol. 9, no. 3.
[19] N. Gayatri, S. Nickolas, A. Reddy, and R. Chitra. Performance analysis of data mining algorithms for software quality prediction. In Proceedings of the International Conference on Advances in Recent Technologies in Communication and Computing (ARTCom 09).
[20] T. Menzies, B. Turhan, A. Bener, G. Gay, B. Cukic, and Y. Jiang. Implications of Ceiling Effects in Defect Predictors. In Proceedings of the 4th International Workshop on Predictive Models in Software Engineering (PROMISE 08), Leipzig, Germany, May 12-13.
[21] T. Menzies, Z. Milton, B. Turhan, B. Cukic, Y. Jiang, and A. Bener. Defect prediction from static code features: current results, limitations, new approaches. Automated Software Engineering, vol. 17, no. 5.
[22] H. Zhang, A. Nelson and T. Menzies. On the Value of Learning from Defect Dense Components for Software Defect Prediction. In Proceedings of the 6th International Conference on Predictive Models in Software Engineering (PROMISE 10), Timisoara, Romania, Sep. 12-13.
[23] Y. Jiang, B. Cukic, T. Menzies, and N. Bartlow.
Comparing Design and Code Metrics for Software Quality Prediction. In the Proceedings of the 6th International Conference on Predictive Models in Software Engineering (PROMISE 10), Sep12 13, Timisoara, Romania, [24] Y. Liu, T. M. Khoshgoftaar, and N. Seliya. Evolutionary optimization of software quality modeling with multiple repositories. IEEE Transactions on Software Engineering, vol. 36, no. 6, pages , [25] C. D. Stefano, F. Fontanella, G. Folino and A. S. Freca. A Bayesian Approach for combining ensembles of GP classifiers. In the Proceedings of the 10th International Workshop Multiple Classifier Systems (MCS 2011), Naples, Italy, June 15 17, [26] D. Windridge. Tomographic Considerations in Ensemble Bias/Variance Decomposition. In the Proceedings of the 9th International Workshop Multiple Classifier Systems (MCS 2010), April 7 9, Cairo, Egypt, Lecture Notes in Computer Science, vol. 5997, pages 43 53, Apr [27] L. Rokach. Taxonomy for characterizing ensemble methods in classification tasks: A review and annotated bibliography. Computational Statistics & Data Analysis, vol. 53, no. 12, pages , 1 October [28] J. J. Rodrĺłguez, L. I. Kuncheva, and C. J. Alonso. Rotation forest: A new classifier ensemble method. IEEE Transactions on Patten Analysis and Machine Intelligence, vol. 28, no. 10, pages , Oct [29] R.E. Banfield, L.O. Hall, K.W. Bowyer, and W. P. Kegelmeyer. A Comparison of Decision Tree Ensemble Creation Techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pages , January [30] J. Canul-Reich, L. Shoemaker and L. O. Hall. Ensembles of Fuzzy Classifiers. In the Proceedings of IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 07), Imperial College, London, UK, July, 2007.

14 4254 T. WANG et al. / Journal of Information & Computational Science 8: 16 (2011) [31] A. Tosun. Ensemble of Software Defect Predictors: A Case Study. In the Proceedings of the 2nd International Symposium on Empirical Software Engineering and Measurement (ESEMąŕ08), Oct 9 10, Kaiserslauten, Germany, [32] J. Zheng. Cost-sensitive boosting neural networks for software defect prediction. Expert Systems with Applications, vol. 37, no. 6, pages , [33] L. Breiman. Bagging predictors. Machine Learning, vol. 24, no. 2, pages , [34] Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In the Proceedings of International Conference on Machine Learning, [35] T. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, vol. 40, no. 2, pages , [36] L. Breiman. Random forests. Machine Learning, vol. 45, no. 1, pages 5 32, [37] T. K. Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pages , [38] D. H. Wolpert. Stacked Generalization. Neural Networks, vol. 5, no. 2, pages , [39] L. I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. John Wiley and Sons, Inc [40] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas. On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pages , March [41] M. Chapman, P. Callis, and W. Jackson. Metrics Data Program. NASA IV and V Facility, [42] G. Boetticher, T. Menzies and T. Ostrand. PROMISE Repository of empirical software engineering data repository, West Virginia University, Department of Computer Science, [43] F. Provost and T. Fawcett. Robust Classification for Imprecise Environments. Machine Learning, vol. 42, no. 3, pages , [44] M. Hall, E. Frank, G. Holmes, B. Pfahringer, and P. Reutemann. the WEKA Data Mining Software: An Update; SIGKDD Explorations, Ian H. 
Witten, vol. 11, no. 1, pages 10 18, [45] J. Demšar. Statistical Comparisons of Classifiers over Multiple Data Sets, Journal of Machine Learning Research, vol. 7, no. 12, pages 1 30, [46] Y. ZHU, J. OU, G. CHEN, H. YU. An Approach for Dynamic Weighting Ensemble Classifiers Based on Cross-validation. Journal of Computational Information Systems, Vol. 6, no. 1, pages , 2010.


More information

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical

More information

FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS

FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS Breno C. Costa, Bruno. L. A. Alberto, André M. Portela, W. Maduro, Esdras O. Eler PDITec, Belo Horizonte,

More information

How To Solve The Class Imbalance Problem In Data Mining

How To Solve The Class Imbalance Problem In Data Mining IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART C: APPLICATIONS AND REVIEWS 1 A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches Mikel Galar,

More information

Increasing the Accuracy of Predictive Algorithms: A Review of Ensembles of Classifiers

Increasing the Accuracy of Predictive Algorithms: A Review of Ensembles of Classifiers 1906 Category: Software & Systems Design Increasing the Accuracy of Predictive Algorithms: A Review of Ensembles of Classifiers Sotiris Kotsiantis University of Patras, Greece & University of Peloponnese,

More information

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Jerzy B laszczyński 1, Krzysztof Dembczyński 1, Wojciech Kot lowski 1, and Mariusz Paw lowski 2 1 Institute of Computing

More information

Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel

Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Copyright 2008 All rights reserved. Random Forests Forest of decision

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set Overview Evaluation Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification

More information

Benchmarking of different classes of models used for credit scoring

Benchmarking of different classes of models used for credit scoring Benchmarking of different classes of models used for credit scoring We use this competition as an opportunity to compare the performance of different classes of predictive models. In particular we want

More information

A Systematic Review of Fault Prediction approaches used in Software Engineering

A Systematic Review of Fault Prediction approaches used in Software Engineering A Systematic Review of Fault Prediction approaches used in Software Engineering Sarah Beecham Lero The Irish Software Engineering Research Centre University of Limerick, Ireland Tracy Hall Brunel University

More information

EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH

EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH SANGITA GUPTA 1, SUMA. V. 2 1 Jain University, Bangalore 2 Dayanada Sagar Institute, Bangalore, India Abstract- One

More information

Beating the MLB Moneyline

Beating the MLB Moneyline Beating the MLB Moneyline Leland Chen llxchen@stanford.edu Andrew He andu@stanford.edu 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series

More information

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

More information

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,

More information

Analysis of Software Project Reports for Defect Prediction Using KNN

Analysis of Software Project Reports for Defect Prediction Using KNN , July 2-4, 2014, London, U.K. Analysis of Software Project Reports for Defect Prediction Using KNN Rajni Jindal, Ruchika Malhotra and Abha Jain Abstract Defect severity assessment is highly essential

More information

UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee

UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee 1. Introduction There are two main approaches for companies to promote their products / services: through mass

More information