Dealing with Multiple Classes in Online Class Imbalance Learning


Shuo Wang (University of Birmingham, UK), Leandro L. Minku (University of Leicester, UK), Xin Yao (University of Birmingham, UK)

Abstract

Online class imbalance learning deals with data streams having very skewed class distributions in a timely fashion. Although a few methods have been proposed to handle such problems, most of them focus on two-class cases. Multi-class imbalance imposes additional challenges in learning. This paper studies the combined challenges posed by multi-class imbalance and online learning, and aims at a more effective and adaptive solution. First, we introduce two resampling-based ensemble methods, called MOOB and MUOB, which can process multi-class data directly and strictly online with an adaptive sampling rate. Then, we look into the impact of multi-minority and multi-majority cases on MOOB and MUOB in comparison to other methods under stationary and dynamic scenarios. Both multi-minority and multi-majority make a negative impact. MOOB shows the best and most stable G-mean in most stationary and dynamic cases.

1 Introduction

In many real-world classification tasks, such as spam filtering [Nishida et al., 2008] and fault diagnosis in computer monitoring systems [Meseguer et al., 2010], data often arrive over time in streams of instances. Several online learning techniques have thus been developed to process each data example once on arrival, without storage and reprocessing. They maintain a current model that reflects the current data concept, so that a prediction can be made at each time step. These data stream applications are often also class imbalanced, i.e. some classes of data are heavily under-represented compared to other classes, e.g. the spam class and the class of faults. This skewed class distribution can cause great learning difficulties, because traditional machine learners tend to ignore or overfit the minority class [Weiss, 2004]. With the aim of tackling the combined issue of online learning and class imbalance learning, online class imbalance learning has been drawing growing attention.

Although a few methods have been proposed to deal with class imbalance online, they assume that the data have only two classes: one minority class and one majority class. This assumption does not hold in many real-world problems. For example, in fault detection of a real-time running engineering system (discriminating fault and non-fault classes), it is likely that more than one type of fault exists and needs to be recognised. Multi-class tasks have been shown to suffer more learning difficulties than two-class ones in offline learning, because multiple classes increase the data complexity and aggravate the imbalanced distribution [Wang and Yao, 2012]. We expect this difficulty to become even more aggravated in online learning scenarios, given that it is impossible to see the whole picture of the data and the data may be dynamically changing.

Very little work has addressed the multi-class issue in online class imbalance learning. It is still unclear what learning difficulties multiple classes can cause and how to handle them effectively. Therefore, a systematic study of multiple classes is necessary in order to gain a better understanding of their impact in online class imbalance learning, and a more effective and adaptive method suitable for both stationary and dynamic cases needs to be developed. This paper will focus on the following research questions: Q1. Can we develop a method able to process multi-class data directly and overcome class imbalance adaptively?
Q2. What is the impact of multi-class imbalance in online learning with a stationary imbalance status? Q3. What is the impact of multi-class imbalance in online learning with a dynamic imbalance status? It is worth mentioning that explicit concept drifts involving any changes in class-conditional probability density functions are not considered in this paper, for a clear observation of the impact of the number of classes. For Q1 (Section 3), we propose two ensemble learning methods, MOOB and MUOB, which can process multi-class data directly without using class decomposition and have an adaptive resampling technique to deal with class imbalance. For Q2 (Section 4), we fix the imbalance ratio between minority and majority classes and vary the number of minority and majority classes by generating several artificial data streams. We analyse how MOOB and MUOB are affected in comparison to other existing methods. For Q3 (Section 5), the imbalance ratio between minority and majority classes changes gradually. Class emergence and class disappearance are considered as two major class evolution types [Sun et al., 2016]. Both artificial and real-world data are discussed. In general, MOOB shows the best and most stable performance in both stationary and dynamic cases.

2 Background

As the basis of this paper, the research progress in online two-class imbalance learning and offline multi-class imbalance learning is reviewed in this section. Then, VWOS-ELM and CBCE are introduced; they are the only methods capable of learning multi-class imbalanced data streams so far. Research gaps that motivate our study are identified.

2.1 Online Two-Class Imbalance Learning

Different from incremental learning methods that store and process data in batches [Hoens et al., 2012a] [Hoens and Chawla, 2012], online learning methods learn from data one by one, without requiring any prior knowledge of the stream. Several online methods have been proposed to tackle class imbalance. For example, [Nguyen et al., 2011] proposed a Naive Bayes ensemble method based on random undersampling. [Wang et al., 2013] and [Wang et al., 2015] proposed OOB and UOB, which can tackle imbalanced data with a dynamically changing imbalance rate. However, they cannot balance multi-class data effectively and adaptively. Although the original OOB and UOB targeted general online cases, their sampling rates were not set consistently when the number of classes changes. Cost-sensitive methods, such as cost-sensitive Bagging and Boosting [Wang and Pineau, 2013], RLSACP [Ghazikhani et al., 2013], WOS-ELM [Mirza et al., 2013] and ESOS-ELM [Mirza et al., 2015b], set a different misclassification cost for each class. However, the costs are pre-defined, and their cost setting strategies are only designed for two-class cases.

2.2 Offline Multi-Class Imbalance Learning

In offline class imbalance learning, most work resorts to class decomposition schemes (e.g. one-against-all (OAA), one-against-one (OAO) and ECOC) to decompose multi-class problems into several two-class problems. For example, MC-HDDT [Hoens et al., 2012b] is a decision tree method based on OAA and ECOC. The imECOC method [Liu et al., 2013] extended the original ECOC for imbalanced data based on a BWC-weighting method. Even though class decomposition simplifies multi-class problems, it causes new issues, such as how to combine the binary classifiers. In the online learning context, maintaining and combining multiple binary classifiers is more difficult, because the number of classes may change and some classifiers can become outdated over time. Furthermore, each binary classifier is trained without full data knowledge. This can cause classification ambiguity or uncovered data regions with respect to each type of decomposition [Jin and Zhang, 2007]. Therefore, using class decomposition schemes is not a desirable way to learn multi-class data online.

Different from the above, a few approaches were proposed recently to tackle multi-class imbalance directly. EDS [Cao et al., 2013] used evolutionary search techniques to find the optimal cost setup of each class, which was then integrated into a multi-class cost-sensitive ensemble classifier. AdaBoost.NC [Wang and Yao, 2012] and CoMBo [Koço and Capponi, 2013] are two Boosting methods for learning multi-class imbalanced data directly. However, their core techniques are not applicable to online cases.

2.3 Online Multi-Class Imbalance Learning

VWOS-ELM was proposed very recently, aiming at class imbalance problems in multi-class data streams [Mirza et al., 2015a]. It is an ensemble method formed by multiple WOS-ELM base classifiers [Mirza et al., 2013]. WOS-ELM is a perceptron-based extreme learning machine. It requires a data set for initialisation before the sequential learning starts.
Different class weights are maintained to tackle class imbalance, based on the models' performance on a validation data set. However, once the validation data set no longer reflects the real status of the data, the class weights will not be accurate for learning. Besides, initialisation data are not always available. CBCE is a class-based ensemble method focusing on class evolution [Sun et al., 2016]. The one-against-all class decomposition technique is used for handling multiple classes, and undersampling is applied to overcome the class imbalance induced by class evolution. Although CBCE is shown to handle dynamic classes well, class decomposition is not an ideal way of processing multi-class imbalanced problems. In summary, no existing method can handle multi-class imbalance well in online learning. Meanwhile, there is no study looking into the fundamental learning issues caused by multiple classes.

3 Resampling-based Ensemble Methods

This section proposes two resampling-based ensemble methods, Multi-class Oversampling-based Online Bagging (MOOB) and Multi-class Undersampling-based Online Bagging (MUOB). As suggested by their names, they use oversampling or undersampling to overcome class imbalance, within the framework of Online Bagging (OB) [Oza, 2005]. Resampling is algorithm-independent, so any type of base classifier is allowed in the ensemble; to process multi-class data directly, Hoeffding trees or neural networks can be used, for example. Besides, resampling is one of the simplest and most effective class imbalance techniques in both offline and online class imbalance learning [Hulse et al., 2007] [Wang et al., 2015].

In order to tackle class imbalance through resampling under both stationary and dynamic scenarios, a time-decayed class size (i.e. class prior probability) [Wang et al., 2013] is adopted. It is a real-time indicator reflecting the current class imbalance status, and is used to decide the sampling rate adaptively in MOOB and MUOB. For a sequence of examples (x_t, y_t) arriving one at a time, x_t is an input vector belonging to the input space X observed at time step t, and y_t is the corresponding label belonging to the label set Y = {c_1, ..., c_N} (N > 2). For any c_k in Y, the time-decayed class size w_k^(t) indicates the occurrence probability of examples belonging to c_k. To reflect the current imbalance status of the data stream, w_k^(t) is incrementally updated at each time step, using a time decay (forgetting) factor that weakens the effect of old data. When a new example x_t arrives, w_k^(t) is updated by [Wang et al., 2013]:

    w_k^(t) = θ · w_k^(t-1) + (1 − θ) · 1[(x_t, c_k)],   (k = 1, ..., N)    (1)

where 1[(x_t, c_k)] = 1 if the true class label of x_t is c_k, and 0 otherwise. θ (0 < θ < 1) is the pre-defined time decay factor. It forces older data to have less impact on the class percentage, so that w_k^(t) is adjusted more based on new data. θ = 0.9 was shown to be a reasonable setting to balance the responding speed and the estimation variance [Wang et al., 2013].
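For concreteness, the update in Eq. (1) can be written as a short Python sketch; the function and variable names are ours, not the paper's:

```python
import numpy as np

def update_class_sizes(w, true_class, theta=0.9):
    """One update of the time-decayed class sizes in Eq. (1).

    w          -- length-N array holding w_1^(t-1), ..., w_N^(t-1)
    true_class -- index k such that the example arriving at time t has label c_k
    theta      -- time decay (forgetting) factor, 0 < theta < 1
    """
    indicator = np.zeros_like(w)
    indicator[true_class] = 1.0        # 1[(x_t, c_k)] is 1 only for the true class
    return theta * w + (1.0 - theta) * indicator

# Toy usage: the estimate drifts towards the classes seen most recently.
w = np.full(3, 1.0 / 3)                # start from a uniform estimate over 3 classes
for label in [0, 0, 1, 0, 2, 0]:
    w = update_class_sizes(w, label)
print(w, w.sum())                      # the sizes always sum to 1
```

Because the update is a convex combination of the previous sizes and a one-hot indicator, the sizes remain a valid probability distribution at every time step.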

With the class imbalance information, MOOB and MUOB integrate resampling into OB. OB is an online version of the offline Bagging [Breiman, 1996], designed for balanced data. It builds multiple base classifiers, and each classifier is trained K times on the current training example, where K follows the Poisson(λ = 1) distribution. In MOOB, oversampling is used to increase the chance of learning minority-class examples through λ, based on the current w_k^(t). Similarly, in MUOB, undersampling is used to reduce the chance of learning majority-class examples. Their training procedures at each time step t are given in Table 1.

Table 1: MOOB and MUOB Training Procedures.
Input: an ensemble with M classifiers, the current training example (x_t, y_t), where y_t corresponds to c_j in Y, and the current class sizes w^(t) = (w_1^(t), ..., w_k^(t), ..., w_N^(t)).
    w_min = min_{k=1,...,N} w_k^(t);  w_max = max_{k=1,...,N} w_k^(t)
    for each base learner f_m (m = 1, 2, ..., M) do
        if using MOOB: set K ~ Poisson(w_max / w_j^(t))
        if using MUOB: set K ~ Poisson(w_min / w_j^(t))
        update f_m K times using (x_t, y_t)
    end for

The minimum and maximum class sizes (w_min and w_max) among all classes are calculated at each time step. MOOB sets λ to w_max / w_j^(t), so that a smaller class has a larger sampling rate; MUOB sets λ to w_min / w_j^(t), so that a larger class has a smaller sampling rate. When the data stream is strictly balanced, MOOB and MUOB reduce to the traditional OB, where λ = 1. The advantages of MOOB and MUOB are: 1) they contain mechanisms to deal with a dynamic imbalance status, so that no validation data set is needed for updating class weights; 2) resampling allows the ensemble learner to use any type of base classifier, and neither a data set for initialising classifiers nor class decomposition for handling multiple classes is required.

In the following sections, the prequential performance of MOOB and MUOB will be examined, in comparison with VWOS-ELM and OB. VWOS-ELM is chosen because it is so far the only online learning algorithm aiming to solve multi-class imbalance directly. OB without any class imbalance technique is included as the baseline method. The prequential test used in our experiments is a popular performance evaluation strategy in online learning, in which each individual example is used to test the model before it is used for training, so that the performance measures can be incrementally updated and recorded at each time step [Minku, 2011]. In our tests, we record the recall of each class and G-mean as the evaluation metrics. They are the two most commonly used criteria in class imbalance learning, as they are insensitive to the imbalance rate. Recall is defined as the classification accuracy on a single class; it helps us to analyse the performance on one class. G-mean is an overall performance metric, defined as the geometric mean of recalls over all classes [Kubat and Matwin, 1997]; it helps us to understand how well the performance is balanced among classes. It is worth noting that F-score and AUC are also popular, but they are not suitable for multi-class analysis.
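A minimal sketch of one training step following Table 1 is given below. The function names are ours, and we assume a scikit-learn-style partial_fit interface for the base learners; this is an illustration of the procedure, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

def train_step(ensemble, x_t, y_t, w, method="MOOB"):
    """One MOOB/MUOB training step following Table 1 (our reading of it).

    ensemble -- list of M incremental base learners; we assume each supports
                a scikit-learn-style partial_fit update (e.g. an MLP or a
                Hoeffding tree with an online update)
    x_t, y_t -- current example; y_t is the index j of its class c_j
    w        -- current time-decayed class sizes w^(t)
    method   -- "MOOB" (oversampling) or "MUOB" (undersampling)
    """
    target = w.max() if method == "MOOB" else w.min()
    lam = target / w[y_t]                  # adaptive sampling rate lambda
    for f_m in ensemble:
        K = rng.poisson(lam)               # how many times f_m sees (x_t, y_t)
        for _ in range(K):
            f_m.partial_fit([x_t], [y_t])  # assumed incremental-update interface
```

Note that when the stream is balanced, all class sizes are equal, so lam = 1 for every class and the step degenerates to plain Online Bagging, as stated above.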
4 Multi-Class Learning from Stationary Data

In this section, we give an in-depth analysis of multi-class imbalance in stationary data streams. With a fixed imbalance ratio between the minority and majority classes, the impact of multiple classes can be easily observed. Two basic types of multi-class imbalance can occur: one majority and multiple minority classes (multi-minority), and one minority and multiple majority classes (multi-majority) [Wang and Yao, 2012]. We look into each type by producing artificial data sets with different numbers of minority/majority classes, aiming to find out the learning difficulties in each type and the differences between them.

4.1 Multi-Minority Data

For the multi-minority cases, we generate artificial data streams in which each data example has 2 numeric attributes. The number of minority classes is varied from 1 to 5 (denoted by c1, ..., c5), and there is only one majority class (denoted by c6). The imbalance ratio between each minority class and the majority class remains 3:7. Data points in each class are generated randomly from Gaussian distributions, where the mean and standard deviation of each class are random real values drawn from a fixed range. Across the different data streams, examples with the same class label follow the same Gaussian distribution. Table 2 summarizes the class numbers (N) and class sizes (P) (i.e. prior probabilities) of the generated multi-minority data.

Table 2: Class Settings for Multi-Minority Data.
ID     N_min        N_maj     P_min    P_maj
Bi     1 (c1)       1 (c6)    3/10     7/10
Min2   2 (c1,c2)    1 (c6)    3/13     7/13
Min3   3 (c1-c3)    1 (c6)    3/16     7/16
Min4   4 (c1-c4)    1 (c6)    3/19     7/19
Min5   5 (c1-c5)    1 (c6)    3/22     7/22

The prequential recall curves of c1 (minority) and c6 (majority) from MOOB, MUOB, OB and VWOS-ELM are compared in Fig. 1. Every method is run multiple times independently. Because all the compared methods have an ensemble learning framework, we use the same number of base classifiers for each; an odd number is chosen to avoid tied majority votes among the base classifiers. Considering that VWOS-ELM is a perceptron-based method, we use a multilayer perceptron (MLP) as the base classifier in MOOB, MUOB and OB for a fair comparison. Further discussion of using other base classifiers can be found in [Wang et al., 2015] for two-class cases.
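Since all methods are compared prequentially, the evaluation loop itself can be sketched as follows. This is a minimal test-then-train loop with cumulative per-class recall (the paper's recall indicator may also use time decay; this non-fading version is our simplification), and predict/train are placeholder callables wrapping the online learner:

```python
import numpy as np

def prequential_gmean(predict, train, stream, n_classes):
    """Prequential (test-then-train) evaluation tracking recall and G-mean.

    predict / train -- placeholder callables wrapping the online learner
    stream          -- iterable of (x_t, y_t) pairs
    """
    correct = np.zeros(n_classes)      # correct predictions per class
    seen = np.zeros(n_classes)         # examples observed per class
    gmean_curve = []
    for x_t, y_t in stream:
        if predict(x_t) == y_t:        # test on the example first ...
            correct[y_t] += 1
        seen[y_t] += 1
        train(x_t, y_t)                # ... then use it for training
        recalls = np.divide(correct, seen,
                            out=np.zeros(n_classes), where=seen > 0)
        observed = seen > 0            # average only over classes seen so far
        gmean = np.prod(recalls[observed]) ** (1.0 / observed.sum())
        gmean_curve.append(float(gmean))
    return gmean_curve
```

Because G-mean is a product of recalls, a single class with near-zero recall drives the whole metric towards zero, which is exactly why it is suited to measuring balanced performance across classes.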

VWOS-ELM requires a data set to initialise each base classifier and a validation data set to update class weights. According to the authors, the initialisation set needs to include examples from all classes. To meet this requirement, we use an initial fraction of each data stream as the initialisation and validation data, and we start tracing the prequential performance of all the methods from the end of this initial segment.

[Figure 1: Prequential recall curves of classes c1 (minority) and c6 (majority) in the multi-minority cases, with one panel per method (MOOB, MUOB, OB and VWOS-ELM) and N_min varied from 1 to 5.]

For MOOB and MUOB, it is clear that increasing the number of minority classes reduces both minority-class recall and majority-class recall. Multi-minority delays and worsens the recognition of minority-class examples, as well as majority-class examples. At the end of learning, the online learner shows worse recall on both types of classes in data with more minority classes. Because undersampling is a more aggressive way of emphasizing the minority class than oversampling, the minority-class recall recovers better in MUOB than in MOOB as more examples arrive.

The observation on VWOS-ELM is different: the negative impact of multi-minority is higher on the majority class than on the minority classes. When there are five minority classes in the data stream, the majority-class recall becomes nearly zero. The likely reason is that the base learner WOS-ELM in VWOS-ELM tends to over-emphasize the minority class, especially when the minority-class size is small [Wang et al., 2015]. So, when there are multiple minority classes, the performance on the only majority class in the data stream can be sacrificed greatly.

The final G-mean (i.e. G-mean at the last time step) of all the methods is shown in Table 3. The significantly best value is shown in boldface, based on the Wilcoxon signed rank test. We can see that when the number of minority classes is small (N_min = 1, 2, 3), OB performs the worst, while VWOS-ELM shows very good G-mean, because it is very aggressive at boosting minority-class recall. When N_min becomes 5, MOOB shows the best G-mean, and VWOS-ELM becomes the worst, caused by the severe performance degradation on the majority class shown in Fig. 1. In all the multi-minority cases, MOOB is better than or at least comparable to MUOB in G-mean. Although MUOB provides higher minority-class recall than MOOB in the above analysis, its majority-class recall is compromised more than that of MOOB. Therefore, MOOB provides the most stable performance among all.

4.2 Multi-Majority Data

For the multi-majority cases, we use the same data generation settings as in the previous section, producing four multi-majority data streams. Table 4 summarizes their class numbers and class sizes.

Table 4: Class Settings for Multi-Majority Data.
ID     N_min     N_maj        P_min    P_maj
Maj2   1 (c1)    2 (c5,c6)    3/17     7/17
Maj3   1 (c1)    3 (c4-c6)    3/24     7/24
Maj4   1 (c1)    4 (c3-c6)    3/31     7/31
Maj5   1 (c1)    5 (c2-c6)    3/38     7/38

We observe the prequential recall of c1 (minority) and c6 (majority) in the multi-majority data streams. We obtain very similar observations on the impact of multi-majority to those in the multi-minority cases (Fig. 1), so the recall curves for the multi-majority cases are omitted here for space reasons. Comparing the same method in the multi-minority and multi-majority cases, the final recall is very similar. Multi-majority is not shown to be more difficult than multi-minority. This is an interesting result.
In offline learning, multi-majority is a much more difficult case than multi-minority, because the minority class is overwhelmed by the large quantity of majority-class examples [Wang and Yao, 2012]. This is not observed in the online cases here. The reason may lie in other factors, e.g. the data distribution and the imbalance ratio, which will be investigated in our future work.

Table 3: Mean and standard deviation of final G-mean in multi-minority cases.
Method      Bi        Min2       Min3      Min4     Min5
MOOB        .69±.3    .625±.3    .42±.     2±.5     7±.7
MUOB        .694±.    .564±.7    64±.6     4±.24    .79±.23
VWOS-ELM    .7±.8     .492±.63   .49±.38   .29±.37  .26±
OB          ±.7       .468±.29   87±.7     .45±.    .43±.8

Table 5: Mean and standard deviation of final G-mean in multi-majority cases.
Method      Bi        Maj2       Maj3      Maj4     Maj5
MOOB        .69±.3    .558±.9    .42±.4    26±.8    77±.5
MUOB        .694±.    .564±.5    3±.9      65±.8    ±.5
VWOS-ELM    .7±.8     .74±.24    .68±.76   .27±.56  .26±
OB          ±.7       .52±.32    5±.33     .±.8     .38±.63

These results are further reflected in the final G-mean, shown in Table 5. Comparing the same method in Table 5 and Table 3, G-mean in the multi-majority cases is not worse than G-mean in the multi-minority cases, except for VWOS-ELM. MOOB shows the significantly best G-mean, especially in the cases with more majority classes. The fact that MUOB is worse than MOOB is interesting, as oversampling does not improve classification performance as much as undersampling and suffers from overfitting in offline learning [Wang and Yao, 2012]. On one hand, oversampling seems to strengthen performance stability and is less likely to cause overfitting in online learning. On the other hand, discarding examples by undersampling is more likely to cause insufficient learning in online cases.

5 Multi-Class Learning from Dynamic Data

In most online learning applications, the imbalance status does not stay static, and it is more likely that changes occur in a gradual manner in real-world scenarios. For example, in a faulty gearbox system, as the part wears out and the faulty condition worsens, faulty data will appear more and more frequently. Therefore, this section focuses on dynamic data streams with a gradual class imbalance change. Both artificial data and real-world data are discussed.

5.1 Artificial Data

Based on the class evolution types categorized in [Sun et al., 2016], we consider an emerging class and a disappearing class among both the minority and the majority classes. Concretely speaking, we generate three data streams, each of which has two minority classes (c1 and c2) and two majority classes (c3 and c4). How the class imbalance status changes in each stream is described in Table 6. The class imbalance change proceeds gradually between two fixed time steps. The first 5% of the data are used for initialisation and validation in VWOS-ELM. All the settings for the four learning methods remain the same as in Section 4.

Table 6: Class Settings for Dynamic Data. In DyMin, the priors of the two minority classes change over time (c1 gradually emerges while c2 gradually disappears); in DyMaj, the priors of the two majority classes change (c3 emerges while c4 disappears); in DyAll, all class priors change.

Fig. 2 shows the recall of the two dynamic minority classes in DyMin and the two dynamic majority classes in DyMaj. The curve tendency in DyMin is quite similar to the one in DyMaj.

[Figure 2: Prequential recall curves of the dynamic classes; (a) recall of c1 and c2 in DyMin, (b) recall of c3 and c4 in DyMaj.]
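To make the dynamic setup concrete, a stream with gradually drifting class priors can be generated as in the following sketch. The linear schedule, the prior values and the time-step parameters here are our illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_label(t, p_start, p_end, t_start, t_end):
    """Draw one class label while the class priors drift gradually.

    Priors are interpolated linearly from p_start to p_end between time
    steps t_start and t_end; outside that window they stay constant.
    The linear schedule is an illustrative choice, not the paper's spec.
    """
    frac = np.clip((t - t_start) / (t_end - t_start), 0.0, 1.0)
    p = (1.0 - frac) * np.asarray(p_start) + frac * np.asarray(p_end)
    return int(rng.choice(len(p), p=p / p.sum()))

# DyMin-style drift (illustrative priors): c1 emerges from prior 0 while c2
# disappears towards prior 0; the two majority priors stay fixed.
labels = [sample_label(t, [0.0, 0.3, 0.35, 0.35],
                          [0.3, 0.0, 0.35, 0.35], 1000, 1500)
          for t in range(5000)]
```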

For the emerging class (c1 in DyMin and c3 in DyMaj), all the methods have a growing recall, as the examples from this class arrive more and more frequently. MUOB has a better c1 recall and a worse c3 recall than MOOB, because c1 remains the minority in DyMin, whereas c3 gradually becomes a majority class in DyMaj, on which MUOB reduces its focus as it adjusts to the minority classes. For the disappearing class (c2 in DyMin and c4 in DyMaj), VWOS-ELM shows a significant drop. This is because this class is over-emphasized at the beginning with a very high recall, and the class weights are updated based on a separate validation set that does not reflect the current imbalance status; less focus is then placed on this class later on, even though it has become a minority class. The other methods show more stable performance on the disappearing class.

After the analysis within each class, we now compare the overall performance (G-mean) on all three dynamic data streams, shown in Table 7. MOOB performs the best, regardless of which class is dynamically changing. For MOOB and MUOB, G-mean is quite similar in DyMin and DyMaj; for VWOS-ELM and OB, however, G-mean in DyMaj is much worse than in DyMin. This is because the sampling rate in MOOB and MUOB is adaptive to the current imbalance status, whereas VWOS-ELM and OB have no such mechanism. Besides, DyMaj involves a larger change of class imbalance than DyMin. Based on the above analysis, we can see the benefit of the adaptive sampling used by MOOB and MUOB in dynamic data streams. Among all, MOOB has the best performance.

Table 7: Mean and standard deviation of final G-mean in dynamic artificial data.
Method      DyMin      DyMaj      DyAll
MOOB        .452±.8    .463±.7    .526±.7
MUOB        72±.32     89±.26     49±.43
VWOS-ELM    .85±.23    .66±.2     .35±.9
OB          36±.2      .49±.55    4±

5.2 Real-World Data

We now compare the four methods on two real-world data applications: online chess games [Žliobaitė, 2011] and the UDI TwitterCrawl data [Li et al., 2012]. The Chess data consist of the online game records of one player from 2007 to 2010. The task is to predict whether the player will win, lose or draw (3 classes). The original Tweet data include 50 million tweets posted mainly from 2008 to 2011. The task is to predict the tweet topic. In our experiment, we choose a time interval containing 8774 examples and covering 7 tweet topics. Because real-world data hardly remain static, both sets contain some gradual changes in class imbalance status.

The final G-mean is shown in Table 8. VWOS-ELM performs very well, giving the best G-mean on both data sets. Looking into its recall on each class, its majority-class performance (class 2 in Chess and class 4 in Tweet) is sacrificed, but not by as much as the improvement on the minority classes; therefore, its overall performance is high. However, we notice that its good performance relies heavily on the choice of the initialisation data set. MOOB comes second, with some improvement in minority-class recall compared to OB and quite high majority-class recall.

Table 8: Mean and standard deviation of final G-mean in Chess and Tweet.
Method      Chess     Tweet
MOOB        4±.9      46±.3
MUOB        37±.2     .2±.
VWOS-ELM    2±.3      72±.7
OB          .±.       .±.

For MUOB, its majority-class recall is sacrificed too much in Chess and its minority-class recall is not good enough in Tweet, so its performance is not satisfactory. Overall, VWOS-ELM and MOOB are better choices for these real-world data, where VWOS-ELM is more aggressive at finding minority-class examples and MOOB is better at balancing the performance among classes.

6 Conclusions

This paper investigates the multi-class problem in online class imbalance learning. We study three research issues: Q1, effective and adaptive multi-class learning methods for online data; Q2, multi-class imbalance in stationary data streams; Q3, multi-class imbalance in dynamic data streams. For Q1, we propose two ensemble learning methods, MOOB and MUOB.
To the best of our knowledge, they are the first methods that can simultaneously process multi-class data directly without using class decomposition and handle class imbalance adaptively and strictly online, without using any initialisation or validation data sets. For Q2, we investigate the online performance of MOOB and MUOB by varying the number of minority and majority classes respectively while fixing the class ratio between them, in comparison to VWOS-ELM and OB. All methods are negatively affected by both multi-minority and multi-majority, especially the minority-class recall of MOOB and the majority-class recall of VWOS-ELM; MOOB shows the best and most stable G-mean. For Q3, we generate multi-class data with gradually emerging and disappearing classes. For more convincing results, we also select two real-world data sets that are collected over time. MOOB is the best at G-mean in most cases; VWOS-ELM is the most aggressive at boosting minority-class performance, which benefits the real-world data classification, but it suffers a performance reduction on the disappearing class in the artificial data. We also show the benefit of using the adaptive sampling rate in MOOB and MUOB. In the near future, we would like to study more types of multi-class data streams, e.g. with more severe imbalance status or different data distributions. It is also important to consider concept drifts in multi-class imbalanced data streams.

Acknowledgments

This work was supported by EPSRC (Grant No. EP/K001523/1). Xin Yao was also supported by a Royal Society Wolfson Research Merit Award.

References

[Breiman, 1996] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[Cao et al., 2013] P. Cao, B. Li, D. Zhao, and O. Zaiane. A novel cost sensitive neural network ensemble for multi-class imbalance data learning. In The 2013 International Joint Conference on Neural Networks, pages 1–8, 2013.
[Ghazikhani et al., 2013] A. Ghazikhani, R. Monsefi, and H. S. Yazdi. Recursive least square perceptron model for non-stationary and imbalanced data stream classification. Evolving Systems, 4(2):119–131, 2013.
[Hoens and Chawla, 2012] T. R. Hoens and N. V. Chawla. Learning in non-stationary environments with class imbalance. In 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 168–176, 2012.
[Hoens et al., 2012a] T. R. Hoens, R. Polikar, and N. V. Chawla. Learning from streaming data with concept drift and imbalance: an overview. Progress in Artificial Intelligence, 1(1):89–101, 2012.
[Hoens et al., 2012b] T. R. Hoens, Q. Qian, N. V. Chawla, and Z. Zhou. Building decision trees for the multi-class imbalance problem. In Proceedings of the 16th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 122–134, 2012.
[Hulse et al., 2007] J. V. Hulse, T. M. Khoshgoftaar, and A. Napolitano. Experimental perspectives on learning from imbalanced data. In Proceedings of the 24th International Conference on Machine Learning, 2007.
[Jin and Zhang, 2007] R. Jin and J. Zhang. Multi-class learning by smoothed boosting. Machine Learning, 67(3):207–227, 2007.
[Koço and Capponi, 2013] S. Koço and C. Capponi. On multi-class classification through the minimization of the confusion matrix norm. In JMLR: Workshop and Conference Proceedings 29, 2013.
[Kubat and Matwin, 1997] M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: one-sided selection. In Proceedings of the 14th International Conference on Machine Learning, pages 179–186, 1997.
[Li et al., 2012] R. Li, S. Wang, H. Deng, R. Wang, and K. Chang. Towards social user profiling: unified and discriminative influence model for inferring home locations. In Proceedings of the 18th ACM SIGKDD, pages 1023–1031, 2012.
[Liu et al., 2013] X. Liu, Q. Li, and Z. Zhou. Learning imbalanced multi-class data with optimal dichotomy weights. In 2013 IEEE 13th International Conference on Data Mining (ICDM), 2013.
[Meseguer et al., 2010] J. Meseguer, V. Puig, and T. Escobet. Fault diagnosis using a timed discrete-event approach based on interval observers: application to sewer networks. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 40(5):900–916, 2010.
[Minku, 2011] L. L. Minku. Online Ensemble Learning in the Presence of Concept Drift. PhD thesis, School of Computer Science, The University of Birmingham, 2011.
[Mirza et al., 2013] B. Mirza, Z. Lin, and K. Toh. Weighted online sequential extreme learning machine for class imbalance learning. Neural Processing Letters, 38, 2013.
[Mirza et al., 2015a] B. Mirza, Z. Lin, J. Cao, and X. Lai. Voting based weighted online sequential extreme learning machine for imbalance multi-class classification. In 2015 IEEE International Symposium on Circuits and Systems (ISCAS), 2015.
[Mirza et al., 2015b] B. Mirza, Z. Lin, and N. Liu. Ensemble of subset online sequential extreme learning machine for class imbalance and concept drift. Neurocomputing, 149:316–329, 2015.
[Nguyen et al., 2011] H. M. Nguyen, E. W. Cooper, and K. Kamei. Online learning from imbalanced data streams. In International Conference of SoCPaR, 2011.
[Nishida et al., 2008] K. Nishida, S. Shimada, S. Ishikawa, and K. Yamauchi. Detecting sudden concept drift with knowledge of human behavior. In IEEE International Conference on Systems, Man and Cybernetics, 2008.
[Oza, 2005] N. C. Oza. Online bagging and boosting. In IEEE International Conference on Systems, Man and Cybernetics, 2005.
[Sun et al., 2016] Y. Sun, K. Tang, L. L. Minku, S. Wang, and X. Yao. Online ensemble learning of data streams with gradually evolved classes. IEEE Transactions on Knowledge and Data Engineering, (Accepted), 2016.
[Wang and Pineau, 2013] B. Wang and J. Pineau. Online ensemble learning for imbalanced data streams. ArXiv e-prints, 2013.
[Wang and Yao, 2012] S. Wang and X. Yao. Multi-class imbalance problems: analysis and potential solutions. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 42(4):1119–1130, 2012.
[Wang et al., 2013] S. Wang, L. L. Minku, and X. Yao. A learning framework for online class imbalance learning. In IEEE Symposium on Computational Intelligence and Ensemble Learning (CIEL), pages 36–45, 2013.
[Wang et al., 2015] S. Wang, L. L. Minku, and X. Yao. Resampling-based ensemble methods for online class imbalance learning. IEEE Transactions on Knowledge and Data Engineering, 27(5):1356–1368, 2015.
[Weiss, 2004] G. M. Weiss. Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter, 6(1):7–19, 2004.
[Žliobaitė, 2011] I. Žliobaitė. Combining similarity in time and space for training set formation under concept drift. Intelligent Data Analysis, 15(4):589–611, 2011.


More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

Incremental SampleBoost for Efficient Learning from Multi-Class Data Sets

Incremental SampleBoost for Efficient Learning from Multi-Class Data Sets Incremental SampleBoost for Efficient Learning from Multi-Class Data Sets Mohamed Abouelenien Xiaohui Yuan Abstract Ensemble methods have been used for incremental learning. Yet, there are several issues

More information

BALANCE LEARNING TO RANK IN BIG DATA. Guanqun Cao, Iftikhar Ahmad, Honglei Zhang, Weiyi Xie, Moncef Gabbouj. Tampere University of Technology, Finland

BALANCE LEARNING TO RANK IN BIG DATA. Guanqun Cao, Iftikhar Ahmad, Honglei Zhang, Weiyi Xie, Moncef Gabbouj. Tampere University of Technology, Finland BALANCE LEARNING TO RANK IN BIG DATA Guanqun Cao, Iftikhar Ahmad, Honglei Zhang, Weiyi Xie, Moncef Gabbouj Tampere University of Technology, Finland {name.surname}@tut.fi ABSTRACT We propose a distributed

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

A Data Generator for Multi-Stream Data

A Data Generator for Multi-Stream Data A Data Generator for Multi-Stream Data Zaigham Faraz Siddiqui, Myra Spiliopoulou, Panagiotis Symeonidis, and Eleftherios Tiakas University of Magdeburg ; University of Thessaloniki. [siddiqui,myra]@iti.cs.uni-magdeburg.de;

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

More information

Online k-means Clustering of Nonstationary Data

Online k-means Clustering of Nonstationary Data 15.097 Prediction Proect Report Online -Means Clustering of Nonstationary Data Angie King May 17, 2012 1 Introduction and Motivation Machine learning algorithms often assume we have all our data before

More information

HUAWEI Advanced Data Science with Spark Streaming. Albert Bifet (@abifet)

HUAWEI Advanced Data Science with Spark Streaming. Albert Bifet (@abifet) HUAWEI Advanced Data Science with Spark Streaming Albert Bifet (@abifet) Huawei Noah s Ark Lab Focus Intelligent Mobile Devices Data Mining & Artificial Intelligence Intelligent Telecommunication Networks

More information

Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.

Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques. International Journal of Emerging Research in Management &Technology Research Article October 2015 Comparative Study of Various Decision Tree Classification Algorithm Using WEKA Purva Sewaiwar, Kamal Kant

More information

Evaluating Algorithms that Learn from Data Streams

Evaluating Algorithms that Learn from Data Streams João Gama LIAAD-INESC Porto, Portugal Pedro Pereira Rodrigues LIAAD-INESC Porto & Faculty of Sciences, University of Porto, Portugal Gladys Castillo University Aveiro, Portugal jgama@liaad.up.pt pprodrigues@fc.up.pt

More information

Efficient Streaming Classification Methods

Efficient Streaming Classification Methods 1/44 Efficient Streaming Classification Methods Niall M. Adams 1, Nicos G. Pavlidis 2, Christoforos Anagnostopoulos 3, Dimitris K. Tasoulis 1 1 Department of Mathematics 2 Institute for Mathematical Sciences

More information

Inductive Learning in Less Than One Sequential Data Scan

Inductive Learning in Less Than One Sequential Data Scan Inductive Learning in Less Than One Sequential Data Scan Wei Fan, Haixun Wang, and Philip S. Yu IBM T.J.Watson Research Hawthorne, NY 10532 {weifan,haixun,psyu}@us.ibm.com Shaw-Hwa Lo Statistics Department,

More information

Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel

Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Copyright 2008 All rights reserved. Random Forests Forest of decision

More information

New Ensemble Combination Scheme

New Ensemble Combination Scheme New Ensemble Combination Scheme Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE Abstract Recently many statistical learning techniques are successfully developed and used in several areas However,

More information

A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms

A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms Ian Davidson 1st author's affiliation 1st line of address 2nd line of address Telephone number, incl country code 1st

More information

A new Approach for Intrusion Detection in Computer Networks Using Data Mining Technique

A new Approach for Intrusion Detection in Computer Networks Using Data Mining Technique A new Approach for Intrusion Detection in Computer Networks Using Data Mining Technique Aida Parbaleh 1, Dr. Heirsh Soltanpanah 2* 1 Department of Computer Engineering, Islamic Azad University, Sanandaj

More information

The Predictive Data Mining Revolution in Scorecards:

The Predictive Data Mining Revolution in Scorecards: January 13, 2013 StatSoft White Paper The Predictive Data Mining Revolution in Scorecards: Accurate Risk Scoring via Ensemble Models Summary Predictive modeling methods, based on machine learning algorithms

More information

1 Maximum likelihood estimation

1 Maximum likelihood estimation COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

More information

Load Balancing Algorithm Based on Services

Load Balancing Algorithm Based on Services Journal of Information & Computational Science 10:11 (2013) 3305 3312 July 20, 2013 Available at http://www.joics.com Load Balancing Algorithm Based on Services Yufang Zhang a, Qinlei Wei a,, Ying Zhao

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

Equity forecast: Predicting long term stock price movement using machine learning

Equity forecast: Predicting long term stock price movement using machine learning Equity forecast: Predicting long term stock price movement using machine learning Nikola Milosevic School of Computer Science, University of Manchester, UK Nikola.milosevic@manchester.ac.uk Abstract Long

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information