Handling imbalanced datasets: A review
Sotiris Kotsiantis, Dimitris Kanellopoulos, Panayiotis Pintelas
Educational Software Development Laboratory, Department of Mathematics, University of Patras, Greece
[email protected], [email protected], [email protected]

Abstract. Learning classifiers from imbalanced or skewed datasets is an important topic that arises very often in practical classification problems. In such problems, almost all the instances are labelled as one class, while far fewer instances are labelled as the other, usually more important, class. It is obvious that traditional classifiers seeking accurate performance over the full range of instances are not suitable for imbalanced learning tasks, since they tend to classify all the data into the majority class, which is usually the less important class. This paper describes various techniques for handling imbalanced dataset problems. Of course, a single article cannot be a complete review of all the methods and algorithms, yet we hope that the references cited will cover the major theoretical issues, guiding the researcher in interesting research directions and suggesting possible bias combinations that have yet to be explored.

1 Introduction

High imbalance occurs in real-world domains where the decision system is aimed at detecting a rare but important case. The problem of imbalance has received more and more emphasis in recent years. Imbalanced data sets exist in many real-world domains, such as spotting unreliable telecommunication customers, detection of oil spills in satellite radar images, learning word pronunciations, text classification, detection of fraudulent telephone calls, and information retrieval and filtering tasks [27], [32], [33]. A number of solutions to the class-imbalance problem have previously been proposed at both the data and algorithmic levels.
At the data level [5], these solutions include many different forms of re-sampling such as random oversampling with replacement, random undersampling, directed oversampling (in which no new examples are created, but the choice of samples to replace is informed rather than random), directed undersampling (where, again, the choice of examples to eliminate is informed), oversampling with informed generation of new samples, and combinations of the above techniques. At the algorithmic level [26], solutions include adjusting the costs of the various classes so as to counter the class imbalance, adjusting the probabilistic estimate at the tree leaf (when working with decision trees), adjusting the decision threshold, and recognition-based (i.e., learning from one class) rather than discrimination-based (two class) learning. Mixture-of-experts approaches [20] (combining methods) have been
also used to handle class-imbalance problems. These methods combine the results of many classifiers, each usually induced after over-sampling or under-sampling the data with different over/under-sampling rates. The error-prone nature of small disjuncts is a direct result of rarity; therefore, an understanding of why small disjuncts are so error-prone helps explain why rarity is a problem. One explanation is that some small disjuncts may not represent rare, or exceptional, cases, but rather something else, such as noisy data. Thus, only small disjuncts that are meaningful should be kept. Most classifier induction systems have some means of preventing overfitting, so as to remove subconcepts (i.e., disjuncts) that are not believed to be meaningful. Inductive bias also plays a role with respect to rare classes: many induction systems tend to prefer the more common classes in the presence of uncertainty (i.e., they are biased in favor of the class priors). Gary Weiss [30] presents an overview of the field of learning from imbalanced data. He pays particular attention to the differences and similarities between the problems of rare classes and rare cases, and then discusses some of the common issues, and their range of solutions, in mining imbalanced datasets. This paper describes further techniques for handling imbalanced dataset problems. The reader should be cautioned that a single article cannot be a comprehensive review of all these techniques. Instead, our goal has been to provide a representative sample of existing lines of research in each area; in each of the areas listed, there are many other papers that detail the relevant work more comprehensively. The next section covers data-level techniques for handling imbalanced datasets, whereas algorithm-level techniques are described in Section 3. Section 4 presents combining methods for handling imbalanced datasets.
The metrics for evaluating the performance of classifiers in learning from imbalanced data are covered in Section 5. Section 6 deals with other problems related to imbalance, such as small disjuncts. Finally, the last section concludes this work.

2 Data level methods for handling imbalance

As we have already mentioned, data-level solutions include many different forms of re-sampling, such as random oversampling with replacement, random undersampling, directed oversampling, directed undersampling, oversampling with informed generation of new samples, and combinations of the above techniques.

2.1 Undersampling

Random under-sampling [23] is a non-heuristic method that aims to balance the class distribution through the random elimination of majority class examples. The rationale behind it is to try to balance out the dataset in an attempt to overcome the idiosyncrasies of the machine learning algorithm. The major drawback of random undersampling is that this method can discard potentially useful data that could be important for the induction process. Another problem with this approach is that the purpose of machine learning is for the classifier to estimate the probability distribution of the
target population. Since that distribution is unknown, we try to estimate the population distribution using a sample distribution. Statistics tells us that as long as the sample is drawn randomly, the sample distribution can be used to estimate the population distribution from which it was drawn. Hence, by learning the sample distribution we can learn to approximate the target distribution. Once we perform undersampling of the majority class, however, the sample can no longer be considered random. Given two examples E_i and E_j belonging to different classes, with d(E_i, E_j) denoting the distance between E_i and E_j, the pair (E_i, E_j) is called a Tomek link if there is no example E_l such that d(E_i, E_l) < d(E_i, E_j) or d(E_j, E_l) < d(E_i, E_j). If two examples form a Tomek link, then either one of them is noise or both are borderline. Tomek links can be used as an under-sampling method or as a data cleaning method: as an under-sampling method, only examples belonging to the majority class are eliminated; as a data cleaning method, examples of both classes are removed. Kubat and Matwin [24] randomly draw one majority class example and all examples from the minority class and put these examples in a subset C. Afterwards, a 1-NN rule over the examples in C is used to classify the examples of the original training set, and every misclassified example is moved into C. The idea behind this implementation of a consistent subset is to eliminate the examples from the majority class that are distant from the decision border, since these sorts of examples might be considered less relevant for learning.

2.2 Oversampling

Random over-sampling is a non-heuristic method that aims to balance the class distribution through the random replication of minority class examples. Several authors [5], [24] agree that random over-sampling can increase the likelihood of overfitting, since it makes exact copies of the minority class examples.
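Random over-sampling by replication can be sketched as follows (a minimal illustration; the function name and the use of Python's standard random module are our own):

```python
import random

def random_oversample(majority, minority, seed=0):
    """Balance the two classes by replicating randomly chosen
    minority examples (with replacement) until both classes
    contain the same number of examples."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    replicas = [rng.choice(minority) for _ in range(n_extra)]
    return majority + minority + replicas  # balanced training set
```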
In this way, a symbolic classifier, for instance, might construct rules that are apparently accurate but actually cover a single replicated example. In addition, oversampling can introduce an additional computational burden if the data set is already fairly large but imbalanced. SMOTE generates synthetic minority examples to over-sample the minority class [5]. Its main idea is to form new minority class examples by interpolating between several minority class examples that lie close together. For every minority example, its k (set to 5 in SMOTE) nearest neighbors of the same class are calculated, and some of them are randomly selected according to the over-sampling rate. New synthetic examples are then generated along the line segments between the minority example and its selected nearest neighbors. Thus, the overfitting problem is avoided, and the decision boundaries for the minority class spread further into the majority class space.

2.3 Feature selection for imbalanced datasets

Zheng et al. [35] suggest that existing measures used for feature selection are not very appropriate for imbalanced data sets. They propose a feature selection framework, which selects features for positive and negative classes separately and then
explicitly combines them. The authors show simple ways of converting existing measures so that they consider features for the negative and positive classes separately.

3 Algorithm level methods for handling imbalance

Drummond and Holte [9] report that when using C4.5's default settings, over-sampling is surprisingly ineffective, often producing little or no change in performance in response to modifications of misclassification costs and class distribution. Moreover, they note that over-sampling prunes less, and therefore generalizes less, than under-sampling, and that modifying C4.5's parameter settings to increase the influence of pruning and other overfitting-avoidance factors can re-establish the performance of over-sampling. For internally biasing the discrimination procedure, a weighted distance function is proposed in [1] for use in the classification phase of k-NN. The basic idea behind this weighted distance is to compensate for the imbalance in the training sample without actually altering the class distribution. Thus, unlike in the usual weighted k-NN rule, weights are assigned to the respective classes and not to the individual prototypes. In such a way, since the weighting factor is greater for the majority class than for the minority one, the distance to minority class prototypes becomes much lower than the distance to prototypes of the majority class. This produces a tendency for new patterns to find their nearest neighbor among the prototypes of the minority class. Another approach to dealing with imbalanced datasets using SVMs biases the algorithm so that the learned hyperplane is further away from the positive class. This is done to compensate for the skew associated with imbalanced datasets, which pushes the hyperplane closer to the positive class. The biasing can be accomplished in various ways; in [32] an algorithm is proposed that changes the kernel function to develop this bias.
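The class-weighted distance idea of [1] can be sketched as a weighted nearest-neighbor rule (a minimal illustration; the function name and the particular weight values are our own):

```python
import math

def weighted_nn_predict(x, prototypes, class_weights):
    """Classify x by the prototype minimizing a class-weighted
    Euclidean distance w_c * d(x, p). A larger weight for the
    majority class pushes its prototypes effectively farther away,
    so new patterns tend to match minority prototypes instead."""
    best_label, best_dist = None, float("inf")
    for p, label in prototypes:
        d = class_weights[label] * math.dist(x, p)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label
```

With equal weights this reduces to the ordinary 1-NN rule; increasing the majority-class weight shifts borderline patterns toward the minority class.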
Veropoulos et al. [28] suggested using different penalty constants for different classes of data, making errors on positive instances costlier than errors on negative instances.

3.1 Threshold method

Some classifiers, such as the Naïve Bayes classifier or some neural networks, yield a score that represents the degree to which an example is a member of a class. Such a ranking can be used to produce several classifiers by varying the threshold at which an example is assigned to a class [30].

3.2 One-class learning

An interesting aspect of one-class (recognition-based) learning is that, under certain conditions such as multi-modality of the domain space, one-class approaches to solving the classification problem may in fact be superior to discriminative (two-class) approaches such as decision trees or neural networks [17]. Ripper [7] is a rule induction system that uses a separate-and-conquer approach to iteratively build rules covering previously uncovered training examples. Each rule is grown by adding conditions until no negative examples are covered. It normally generates rules for each class, from the rarest class to the most common class. Given this architecture, it is quite straightforward to learn rules only for the minority class, a capability that Ripper provides. In particular, Raskutti and Kowalczyk [27] show that one-class learning is particularly useful on extremely unbalanced data sets composed of a high-dimensional noisy feature space. They argue that the one-class approach is related to aggressive feature selection methods, but is more practical, since feature selection can often be too expensive to apply.

3.3 Cost-sensitive learning

As we have already mentioned, changing the class distribution is not the only way to improve classifier performance when learning from imbalanced datasets. A different approach is to incorporate costs into decision-making by defining fixed and unequal misclassification costs between classes [8]. The cost model takes the form of a cost matrix, where the cost of classifying a sample from true class j as class i corresponds to the matrix entry λ_ij. This matrix is usually expressed in terms of average misclassification costs for the problem. The diagonal elements are usually set to zero, meaning that correct classification has no cost. We define the conditional risk of making decision α_i as:

R(α_i | x) = Σ_j λ_ij P(v_j | x)

The equation states that the risk of choosing class i is determined by the fixed misclassification costs and the uncertainty of our knowledge about the true class of x, expressed by the posterior probabilities.
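This minimum-risk decision rule can be sketched directly from the definition above (a minimal example; the function and variable names are our own):

```python
def min_risk_class(posteriors, cost):
    """Pick the decision a_i with minimum conditional risk
    R(a_i | x) = sum_j cost[i][j] * P(v_j | x).

    posteriors : list of P(v_j | x), one entry per class
    cost       : cost[i][j] = cost of deciding i when the true class is j
    """
    risks = [sum(c_ij * p_j for c_ij, p_j in zip(row, posteriors))
             for row in cost]
    return min(range(len(risks)), key=risks.__getitem__)
```

Note that with sufficiently asymmetric costs the rule can pick the minority class even when the majority class is more probable, which is exactly the cost-sensitive behavior described here.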
The goal in cost-sensitive classification is to minimize the cost of misclassification, which can be realized by choosing the class (v_j) with the minimum conditional risk.

4 Combining methods

A mixture-of-experts approach [10] has been used to combine the results of many classifiers, each induced after over-sampling or under-sampling the data with different over/under-sampling rates. This approach recognizes that it is still unclear which sampling method performs best and what sampling rate should be used, and that the proper choice is probably domain specific. Results indicate that the mixture-of-experts approach performs well, generally outperforming another method (AdaBoost) with respect to precision and recall on text classification problems, and doing especially well at covering the rare, positive examples. More detailed experiments are presented in [11].
Chan and Stolfo [4] run a set of preliminary experiments to identify a good class distribution and then sample in such a way as to generate multiple training sets with the desired class distribution. Each training set typically includes all minority-class examples and a subset of the majority-class examples; since each majority-class example is guaranteed to occur in at least one training set, no data is wasted. The learning algorithm is applied to each training set, and meta-learning is used to form a composite learner from the resulting classifiers. This approach can be used with any learning method, and Chan and Stolfo evaluate it using four different learning algorithms. The same basic approach for partitioning the data and learning multiple classifiers has been used with support vector machines; the resulting SVM ensemble [33] was shown to outperform both under-sampling and over-sampling. While these ensemble approaches are effective for dealing with rare classes, they assume that a good class distribution is known. This can be estimated using some preliminary runs, but doing so increases the time required to learn. Another method that uses this general approach employs a progressive-sampling algorithm to build larger and larger training sets, where the ratio of positive to negative examples added in each iteration is chosen based on the performance of the various class distributions evaluated in the previous iteration [31]. MetaCost [8] is another method for making a classifier cost-sensitive; it wraps a base learning algorithm in a cost-sensitive procedure. MetaCost estimates class probabilities using bagging, then re-labels the training instances with their minimum expected cost classes, and finally relearns a model using the modified training set.
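The data-partitioning step underlying Chan and Stolfo's ensemble approach can be sketched as follows (a simplified illustration; the function name, the target chunk size, and the handling of the remainder are our own choices):

```python
import random

def balanced_partitions(majority, minority, seed=0):
    """Split the majority class into disjoint chunks of roughly the
    minority-class size and pair each chunk with all minority
    examples, so that every majority example appears in at least one
    of the resulting (roughly balanced) training sets."""
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    k = len(minority)
    chunks = [shuffled[i:i + k] for i in range(0, len(shuffled), k)]
    return [chunk + minority for chunk in chunks]
```

A separate classifier would then be trained on each returned set and the predictions combined by meta-learning or voting.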
Boosting algorithms are iterative algorithms that place different weights on the training distribution in each iteration. After each iteration, boosting increases the weights associated with the incorrectly classified examples and decreases the weights associated with the correctly classified examples, forcing the learner to focus more on the incorrectly classified examples in the next iteration. Because rare classes/cases are more error-prone than common classes/cases, it is reasonable to believe that boosting may improve their classification performance, since, overall, it will increase the weights of the examples associated with these rare cases/classes. Note that because boosting effectively alters the distribution of the training data, one could consider it a type of advanced sampling technique. AdaBoost's weight-update rule has been made cost-sensitive, so that misclassified examples belonging to rare class(es) are assigned higher weights than those belonging to common class(es). The resulting system, AdaCost [12], has been empirically shown to produce lower cumulative misclassification costs than AdaBoost and thus, like other cost-sensitive learning methods, can be used to address the problem of rare classes. Rare-Boost [20] scales false-positive examples in proportion to how well they are distinguished from true-positive examples, and scales false-negative examples in proportion to how well they are distinguished from true-negative examples. Another algorithm that uses boosting to address the problems of rare classes is SMOTEBoost [6]. This algorithm recognizes that boosting may suffer from the same problems as over-sampling (e.g., overfitting), since boosting will tend to weight examples
belonging to the rare classes more than those belonging to the common classes, effectively duplicating some of the examples belonging to the rare classes. Instead of changing the distribution of the training data by updating the weights associated with each example, SMOTEBoost alters the distribution by adding new minority-class examples using the SMOTE algorithm. Kotsiantis and Pintelas [23] used three agents (the first learning with Naïve Bayes, the second with C4.5 and the third with 5-NN) on a filtered version of the training data and combined their predictions according to a voting scheme. This technique attempts to achieve diversity in the errors of the learned models by using different learning algorithms; the intuition is that models generated with different learning biases are more likely to make errors in different ways. They also applied feature selection to the training data, because in small data sets class imbalance affects the induction more, and feature selection makes the problem less difficult. Kaizhu Huang et al. [16] presented the Biased Minimax Probability Machine (BMPM) to resolve the imbalance problem. Given the reliable mean and covariance matrices of the majority and minority classes, BMPM can derive the decision hyperplane by adjusting the lower bound of the real accuracy on the testing set.

5 Evaluation metrics

Using terminology from information retrieval, the minority class has much lower precision and recall than the majority class. Many practitioners have observed that for extremely skewed class distributions the recall of the minority class is often 0: no classification rules are generated for the minority class. Accuracy places more weight on the common classes than on rare classes, which makes it difficult for a classifier to perform well on the rare classes. Because of this, additional metrics are coming into widespread use.
Most studies in imbalanced domains concentrate mainly on the two-class problem, since a multi-class problem can be reduced to a two-class problem. By convention, the class label of the minority class is positive, and the class label of the majority class is negative. Table 1 presents the most well-known evaluation metrics. TP and TN denote the number of positive and negative examples that are classified correctly, while FN and FP denote the number of misclassified positive and negative examples, respectively.

Accuracy = (TP + TN) / (TP + FN + FP + TN)   (1)
FP rate = FP / (TN + FP)   (2)
TP rate = Recall = TP / (TP + FN)   (3)
Precision = TP / (TP + FP)   (4)
F-value = ((1 + β²) · Recall · Precision) / (β² · Recall + Precision)   (5)

Table 1. Evaluation metrics
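For concreteness, the metrics of Table 1 can be computed directly from the confusion-matrix counts (a small helper of our own):

```python
def evaluation_metrics(tp, tn, fp, fn, beta=1.0):
    """Compute the metrics of Table 1 from raw confusion counts
    (assumes at least one example of each kind, so no zero divisions)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    fp_rate = fp / (tn + fp)
    recall = tp / (tp + fn)          # TP rate
    precision = tp / (tp + fp)
    f_value = ((1 + beta**2) * recall * precision) / (beta**2 * recall + precision)
    return {"accuracy": accuracy, "fp_rate": fp_rate,
            "recall": recall, "precision": precision, "f_value": f_value}
```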
In general, four criteria are used to evaluate the performance of classifiers in learning from imbalanced data: (1) the Minimum Cost criterion (MC), (2) the criterion of the Maximum Geometric Mean (MGM) of the accuracy on the majority class and the minority class, (3) the criterion of the Maximum Sum (MS) of the accuracy on the majority class and the minority class, and (4) the criterion of Receiver Operating Characteristic (ROC) analysis. The MC criterion [3] minimizes the cost measured by Cost = FP × C_FP + FN × C_FN, where C_FP is the cost of a false positive and C_FN is the cost of a false negative. However, the cost of misclassification is generally unknown in real cases, which restricts the usage of this measure. The MGM criterion maximizes the geometric mean of the accuracies [24], but it has a nonlinear form, which is not easy to optimize automatically. Comparatively, MS, which maximizes the sum of the accuracy on the positive class and the negative class (or, equivalently, the difference between the true-positive and false-positive probabilities) [13], has a linear form. Perhaps the most common metric is ROC analysis and the associated use of the area under the ROC curve (AUC) to assess overall classification performance [3]. AUC does not place more emphasis on one class over the other, so it is not biased against the minority class. ROC curves, like precision-recall curves, can also be used to assess different trade-offs: the number of positive examples correctly classified can be increased at the expense of introducing additional false positives. In detail, a ROC curve is a two-dimensional graph in which the TP rate is plotted on the y-axis and the FP rate on the x-axis. The FP rate (formula (2)) denotes the percentage of misclassified negative examples, and the TP rate (formula (3)) the percentage of correctly classified positive examples. The point (0, 1) is the ideal point for a learner.
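Generating a ROC curve by moving the decision threshold, together with a trapezoidal estimate of the AUC, can be sketched as follows (a minimal illustration for a scoring classifier; names are ours):

```python
def roc_points(scores, labels):
    """Sweep the decision threshold over the observed scores and
    return the resulting (FP rate, TP rate) points, sorted for
    plotting. labels are 1 (positive/minority) or 0 (negative)."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = set()
    for t in sorted(scores):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.add((fp / neg, tp / pos))
    return sorted(points | {(0.0, 0.0), (1.0, 1.0)})

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```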
A ROC curve depicts the relative trade-offs between benefits (TP rate) and costs (FP rate). Furthermore, the F-value (formula (5)) is also a popular evaluation metric for imbalance problems [10]. It is a combination of recall (formula (3)) and precision (formula (4)), which are effective metrics in the information retrieval community, where the imbalance problem also exists. The F-value is high when both recall and precision are high, and can be adjusted by changing the value of β, which controls the relative importance of recall vs. precision and is usually set to 1. Regardless of how we produce ROC curves (by sampling, by moving the decision threshold, or by varying the cost matrix), the problem still remains of selecting the single best method and the single best classifier for deployment in an intelligent system. If the binormal assumption holds, the variances of the two distributions are equal, and the error costs are the same, then the classifier at the apex of the dominant curve is the best choice. When applying machine learning to real-world problems, however, rarely would all of these assumptions hold, so to select a classifier we may need more information. If one ROC curve dominates all others, then the best method is the one that produced the dominant curve, which is also the curve with the largest area. This was generally true of our domains, but it is not true of others. To select a classifier from the dominant curve, we need additional information, such as a target false-positive rate. On the other hand, if multiple curves dominate in different parts of the ROC space, then we can use the ROC Convex Hull method to select the optimal classifier [26].
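The hull computation behind the ROC Convex Hull method can be sketched as a standard upper convex hull over (FP rate, TP rate) points (our own minimal implementation; for the selection step itself, see [26]):

```python
def roc_convex_hull(points):
    """Upper convex hull of a set of (FP rate, TP rate) points.
    Only classifiers on this hull can be optimal under some
    combination of error costs and class distribution."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Pop the previous point while it does not lie strictly
        # above the chord from its predecessor to p.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross >= 0:   # non-clockwise turn: hull[-1] is dominated
                hull.pop()
            else:
                break
        hull.append(p)
    return hull
```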
6 Other problems related to imbalance

It has been observed that in some domains, for instance the Sick data set, standard ML algorithms are capable of inducing good classifiers even from highly imbalanced training sets. This shows that class imbalance is not the only problem responsible for the decrease in performance of learning algorithms: the distribution of the data within each class is also relevant (between-class versus within-class imbalance) [17], [34]. Prati et al. [25] developed a systematic study aiming to question whether class imbalances hinder classifier induction or whether these deficiencies might be explained in other ways. Their study was conducted on a series of artificial data sets in order to fully control all the variables they wanted to analyze. The results of their experiments, using a discrimination-based inductive scheme, suggested that the problem is not solely caused by class imbalance, but is also related to the degree of data overlap among the classes. A number of papers have discussed the interaction between class imbalance and other issues, such as the small disjunct problem [19], rare cases [15], data duplication [22] and overlapping classes [29]. It was found that in certain cases, addressing the small disjunct problem with no regard for the class imbalance problem was sufficient to increase performance. The method for handling rare case disjuncts was found to be similar to m-estimation Laplace smoothing, but requires less tuning. It was also found that data duplication is generally harmful, although for classifiers such as Naive Bayes and Perceptrons with Margins, high degrees of duplication are necessary to harm classification [22]. It was argued that the reason class imbalance and overlapping classes are related is that misclassification often occurs near class boundaries, where overlap usually occurs as well.
Jo and Japkowicz's [21] experiments suggest that the problem is not directly caused by class imbalances, but rather that class imbalances may yield small disjuncts which, in turn, cause degradation. The resampling strategy proposed in [21] consists of clustering the training data of each class (separately) and performing random oversampling cluster by cluster. The idea is to consider not only the between-class imbalance (the imbalance occurring between the two classes) but also the within-class imbalance (the imbalance occurring between the subclusters of each class), and to oversample the dataset by rectifying these two types of imbalance simultaneously. Before performing random oversampling, the training examples in the minority and the majority classes are clustered. Once the training examples of each class have been clustered, oversampling starts. In the majority class, all the clusters except the largest one are randomly oversampled so as to obtain the same number of training examples as the largest cluster. Let maxclasssize be the resulting overall size of the majority class. In the minority class, each cluster is randomly oversampled until it contains maxclasssize/Nsmallclass examples, where Nsmallclass is the number of subclusters in the minority class. Altogether, the experiments support the hypothesis that cluster-based oversampling works better than simple oversampling or other methods for handling class imbalances or small disjuncts, especially when the number of training examples is small and the problem complex. The reason is that cluster-based resampling identifies rare cases and re-samples them individually, so as to avoid the creation of small disjuncts in the learned hypothesis.

7 Conclusion

In practice, it is often reported that cost-sensitive learning outperforms random re-sampling [18]. Clever re-sampling and combination methods can do considerably more than cost-sensitive learning, as they can provide new information or eliminate redundant information for the learning algorithm, as shown in [5], [6], [24], [14], [2]. The relationship between training set size and classification performance for imbalanced data sets seems to be the following: on small imbalanced data sets the minority class is poorly represented by an excessively reduced number of examples that might not be sufficient for learning, especially when a large degree of class overlap exists and the class is further divided into subclusters. For larger data sets, the effect of these complicating factors seems to be reduced, as the minority class is better represented by a larger number of examples.

References

[1] R. Barandela, J. S. Sánchez, V. García, and E. Rangel. Strategies for learning in class imbalance problems. Pattern Recognition, 36(3), 2003.
[2] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations, 6(1):20-29.
[3] A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7).
[4] P. K. Chan and S. J. Stolfo. Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining.
[5] N. V. Chawla, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16.
[6] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer.
SMOTEBoost: improving prediction of the minority class in boosting. In Proceedings of the Seventh European Conference on Principles and Practice of Knowledge Discovery in Databases, Dubrovnik, Croatia.
[7] W. W. Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning.
[8] P. Domingos. MetaCost: a general method for making classifiers cost-sensitive. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining. ACM Press, 1999.
[9] C. Drummond and R. C. Holte. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Data Sets II, 2003.
[10] A. Estabrooks and N. Japkowicz. A mixture-of-experts framework for learning from unbalanced data sets. In Proceedings of the 2001 Intelligent Data Analysis Conference, pages 34-43.
[11] A. Estabrooks, T. Jo, and N. Japkowicz. A multiple resampling method for learning from imbalanced data sets. Computational Intelligence, 20(1), 2004.
[12] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan. AdaCost: misclassification cost-sensitive boosting. In Proceedings of the Sixteenth International Conference on Machine Learning.
[13] J. W. Grzymala-Busse, L. K. Goodwin, and X. Zhang. Increasing sensitivity of preterm birth by changing rule strengths. Pattern Recognition Letters, (24).
[14] H. Guo and H. L. Viktor. Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. SIGKDD Explorations, 6(1):30-39.
[15] R. Hickey. Learning rare class footprints: the REFLEX algorithm. In Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets.
[16] K. Huang, H. Yang, I. King, and M. R. Lyu. Learning classifiers from imbalanced data based on biased minimax probability machine. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.
[17] N. Japkowicz. Concept-learning in the presence of between-class and within-class imbalances. In Proceedings of the Fourteenth Conference of the Canadian Society for Computational Studies of Intelligence, pages 67-77.
[18] N. Japkowicz and S. Stephen. The class imbalance problem: a systematic study. Intelligent Data Analysis, 6(5).
[19] N. Japkowicz. Class imbalance: are we focusing on the right issue? In Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets.
[20] M. V. Joshi, V. Kumar, and R. C. Agarwal. Evaluating boosting algorithms to classify rare cases: comparison and improvements. In First IEEE International Conference on Data Mining, November.
[21] T. Jo and N. Japkowicz. Class imbalances versus small disjuncts. SIGKDD Explorations, 6(1), 2004.
[22] A. Kolcz, A. Chowdhury, and J. Alspector. Data duplication: an imbalance problem? In Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets.
[23] S. Kotsiantis and P. Pintelas. Mixture of expert agents for handling imbalanced data sets. Annals of Mathematics, Computing & TeleInformatics, 1(1):46-55.
[24] M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: one-sided selection. In Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, Tennessee. Morgan Kaufmann.
[25] R. C. Prati, G. E. A. P. A. Batista, and M. C. Monard. Class imbalances versus class overlapping: an analysis of a learning system behavior. In MICAI (2004), LNAI.
[26] F. Provost and T. Fawcett. Robust classification for imprecise environments. Machine Learning, 42, 2001.
[27] B. Raskutti and A. Kowalczyk. Extreme re-balancing for SVMs: a case study. SIGKDD Explorations, 6(1):60-69.
[28] K. Veropoulos, C. Campbell, and N. Cristianini. Controlling the sensitivity of support vector machines. In Proceedings of the International Joint Conference on Artificial Intelligence, 1999.
[29] S. Visa and A. Ralescu. Learning imbalanced and overlapping classes using fuzzy sets. In Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets.
[30] G. Weiss. Mining with rarity: a unifying framework. SIGKDD Explorations, 6(1):7-19, 2004.
12 [31] G. M. Weiss, and F. Provost. Learning when training data are costly: the effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19: , [32] Wu, G. & Chang, E. (2003). Class-Boundary Alignment for Imbalanced Dataset Learning. In ICML 2003 Workshop on Learning from Imbalanced Data Sets II, Washington, DC. [33] R. Yan, Y. Liu, R. Jin, and A. Hauptmann. On predicting rare classes with SVM ensembles in scene classification. In IEEE International Conference on Acoustics, Speech and Signal Processing, [34] B. Zadrozny and C. Elkan. Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages , [35] Z. Zheng, X. Wu, and R. Srihari. Feature selection for text categorization on imbalanced data. SIGKDD Explorations, 6(1):80-89, Biography S. Kotsiantis received a diploma in mathematics, a Master and a Ph.D. degree in computer science from the University of Patras, Greece. His research interests are in the field of data mining and machine learning. He has more than 40 publications to his credit in international journals and conferences. D. Kanellopoulos received a diploma in electrical engineering and a Ph.D. degree in electrical and computer engineering from the University of Patras, Greece. He has more than 40 publications to his credit in international journals and conferences. P. Pintelas is a Professor in the Department of Mathematics, University of Patras, Greece. His research interests are in the field of educational software and machine learning. He has more than 100 publications in international journals and conferences.