Classifier and Cluster Ensembles for Mining Concept Drifting Data Streams


Peng Zhang, Xingquan Zhu, Jianlong Tan, Li Guo
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
QCIS Centre, Faculty of Eng. & IT, Univ. of Technology, Sydney, NSW 2007, Australia
{zhangpeng, tjl,

Abstract. Ensemble learning is a commonly used tool for building prediction models from data streams, due to its intrinsic merits in handling large volumes of stream data. Despite its extraordinary success in stream data mining, existing ensemble models in stream data environments mainly fall into the ensemble classifier category, without realizing that building classifiers requires a labor-intensive labeling process; it is often the case that we have only a small number of labeled samples to train a few classifiers, while a large number of unlabeled samples are available to build clusters from data streams. Accordingly, in this paper we propose a new ensemble model that combines both classifiers and clusters for mining data streams. We argue that the main challenges of this new ensemble model are that (1) clusters formulated from data streams only carry cluster IDs, with no genuine class label information, and (2) the concept drifting underlying the data streams makes it even harder to combine clusters and classifiers into one ensemble framework. To handle challenge (1), we present a label propagation method that infers each cluster's class label by making full use of both the class label information from the classifiers and the internal structure information from the clusters. To handle challenge (2), we present a new weighting schema that weights all base models according to their consistencies with the up-to-date base model. As a result, all classifiers and clusters can be combined, through a weighted average mechanism, for prediction. Experiments on real-world data streams demonstrate that our method outperforms simple classifier ensembles and cluster ensembles for stream data mining.

Keywords: Data Stream Mining, Classification, Ensemble Learning, Concept Drifting.

I. INTRODUCTION

Building prediction models from data streams is one of the most important research fields in the data stream mining community [2]. Recently, many ensemble models have been proposed to build prediction models from concept drifting data streams [3, 6, 11]. Different from traditional incremental and online learning approaches that merely rely on a single model [4, 5], ensemble learning employs a divide-and-conquer approach that first splits continuous data streams into small data chunks and then builds light-weight base classifiers from those chunks. At the final stage, all base classifiers are combined for prediction. By doing so, an ensemble model enjoys a number of advantages, such as scaling up to large volumes of stream data, adapting quickly to new concepts, achieving lower variance than a single model, and being easy to parallelize. Nevertheless, most existing ensemble models [3, 6, 11, 17, 18] make the implicit assumption that all base models are classifiers, and the underlying technical solutions mainly focus on ensemble classifiers. A much more realistic situation is that, in many real-world applications, we can build only a few classifiers, but a large number of unlabeled clusters, from stream data [9, 16]. As a result, it is essential to combine both classifiers and clusters, as an ensemble, for accurate prediction.
Compared to existing ensemble classifiers, building ensemble models with both classifiers and clusters imposes two extra challenges: (1) how to acquire the genuine class label information of each unlabeled cluster? Clustering models only assign each cluster a cluster ID instead of a genuine class label, so the underlying challenge is to find the mapping relationship between cluster IDs and genuine class labels; (2) how to assign proper weights to all base classifiers and clusters, such that the ensemble predictor is able to handle concept drifting? In previous ensemble classifiers, all base classifiers are weighted according to their prediction accuracies on the up-to-date chunk [11]. However, in our study, such a weighting schema does not work very well because the up-to-date chunks are mostly unlabeled.

In light of the above challenges, in this paper we present a weighted ensemble of classifiers and clusters to mine concept drifting data streams. To address challenge (1), we first build a graph to represent all classifiers and clusters. Using this graph, we present a new label propagation mechanism that first propagates label information from all classifiers to the clusters, and then iteratively refines the results by propagating similarities among all clusters. To address challenge (2), we propose a new consistency-based weighting method that assigns each base model a weight value according to its consistency with respect to the up-to-date base model. By doing so, we are able to combine both classifiers and clusters for accurate prediction through a weighted averaging schema.

The rest of the paper is organized as follows: Section 2 surveys related work; Section 3 describes the ensemble learning model in detail; Section 4 gives experimental results; and we conclude the paper in Section 5.

II. RELATED WORK

Our work is, in fact, a combination of ensemble classifiers on data streams, ensemble clusters, and transfer learning across multiple data sources.

Ensemble classifiers on data streams provide a generic framework for handling massive-volume data streams with concept drifting. The idea is to partition continuous data streams into small data chunks, from which a number of base classifiers are built and combined for prediction. For example, Wang et al. [11] first proposed a weighted ensemble classifier framework and demonstrated that their model outperforms a single learning model. Inspired by their work, various ensemble models have been proposed, such as ensembles of different learning algorithms [6] and ensembles of active learners [18], to name a few. Our work differs from the above efforts because we consider both classifiers and clusters for ensemble learning.

Ensemble clusters [14] aim to combine multiple clusterings for prediction. For a given test set, each clustering derives a label vector. Noticing that some label vectors may conflict with each other, most state-of-the-art ensemble cluster models employ a Jaccard distance metric to minimize the discrepancy between each pair of label vectors. Although such a label-vector-based consensus method performs well on static databases, it cannot be directly applied to data stream scenarios for two reasons: (1) the label vector structure grows in proportion to the example size, which is unsuitable for data streams with large volumes; and (2) label vectors cannot handle concept drifting [8]. Considering these issues, we alternatively use cluster centers to measure the similarity between each pair of unlabeled clusters.

From a transfer learning perspective [10], a recent work on knowledge transfer learning [7] also studies combining supervised and unsupervised models built from different domains for prediction, which is very similar to our setting. The difference between our work and theirs is fourfold: (1) we aim to mine from data streams, whereas they aim to mine from static databases; (2) we use cluster centers as the basic unit to propagate class labels, whereas they use label vectors; (3) in order to tackle concept drifting during the label propagation process, we first propagate label information from all classifiers to the clusters, then estimate the class label of each cluster and iteratively refine these labels by propagating label information among all clusters, whereas the work reported in [7] only propagates class labels from one classifier as initialization, which makes it sensitive to concept drifting; (4) we aim to build an active ensemble model, where the ensemble can be constructed before the arrival of test examples, whereas their solution belongs to the lazy learning category, where model training is delayed until test examples are observed.

III. PROBLEM FORMULATION AND SOLUTIONS

Consider a data stream S consisting of an infinite number of data records (x_i, y_i), where x_i ∈ R^d denotes an instance containing d-dimensional attributes, and y_i ∈ Y = {c_1, ..., c_r} represents its class label (note that only a small portion of the data records are labeled and actually carry labeling information y_i). In order to build the ensemble model, we partition the stream data into a number of data chunks, as shown in Fig. 1. From the most recent n data chunks, we build n base models λ_1, ..., λ_n. Without loss of generality, we assume the former a models λ_1, ..., λ_a are classifiers, while the remaining b models λ_{a+1}, ..., λ_n are clustering models. Note that 1 ≤ a, b ≤ n and a + b = n.
Our goal is to combine all n base models to construct an ensemble model E, such that for each yet-to-come record x, the ensemble E can assign x a class label y satisfying Eq. (1):

    y = \arg\max_{y \in Y} P(y \mid x, E)    (1)

where the posterior probability P(y | x, E) is the weighted average of all n base models, as shown in Eq. (2):

    P(y \mid x, E) = \sum_{i=1}^{n} w_i P(y \mid x, \lambda_i)    (2)

If all n base models in ensemble E could directly derive a class label y for a test example x, then Eq. (2) would degenerate to the generic ensemble classifier models that have been extensively studied in existing research [11, 3, 6, 17]. Nevertheless, in our setting, each clustering model can only assign x a cluster ID that does not carry any class label information. Formally, for a test example x, a clustering model λ_j (a < j ≤ n) will assign it a group ID g_k^j with probability P(g_k^j | x, λ_j), instead of a genuine class label probability P(y | x, λ_j). To bridge these two different probabilities, we introduce a conditional probability P(y | g_k^j) which reflects the mapping relationship between cluster ID g_k^j and genuine class label y ∈ Y. By doing so, for each clustering model λ_j, the posterior probability P(y | x, λ_j) can be estimated by integrating all the mappings, as shown in Eq. (3):

    P(y \mid x, \lambda_j) = \sum_{k=1}^{v} P(y \mid g_k^j) P(g_k^j \mid x, \lambda_j)    (3)

where v is the number of groups in λ_j and r is the total number of classes. Consequently, the weighted ensemble P(y | x, E) in Eq. (2) can be revised to Eq. (4):

    P(y \mid x, E) = \sum_{i=1}^{a} w_i P(y \mid x, \lambda_i) + \sum_{j=a+1}^{n} \sum_{k=1}^{v} w_j P(y \mid g_k^j) P(g_k^j \mid x, \lambda_j)    (4)

where the first term represents the utility from the a classifiers, and the second term represents the utility from the b clustering models. To estimate Eq. (4), two issues need to be considered:
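For concreteness, the weighted prediction rule of Eqs. (1)-(4) can be sketched in Python as follows. This is an illustrative sketch only: the `predict_proba`/`group_proba` interfaces and all names are assumptions, not the paper's implementation, and the cluster-to-class maps P(y | g_k^j) are the quantities estimated in Sections III-B and III-C below.

```python
import numpy as np

def ensemble_posterior(x, classifiers, clusterers, cluster_label_maps, weights):
    """Weighted ensemble posterior P(y | x, E) of Eq. (4).

    classifiers        : a models; predict_proba(x) -> (r,) class distribution.
    clusterers         : b models (at least one); group_proba(x) -> (v,)
                         distribution P(g_k^j | x, lambda_j) over their v groups.
    cluster_label_maps : per clusterer, a (v, r) matrix whose k-th row is the
                         inferred mapping P(y | g_k^j).
    weights            : consistency weights w_1..w_n for all n = a + b models.
    """
    a = len(classifiers)
    posterior = np.zeros(cluster_label_maps[0].shape[1])       # (r,) accumulator
    for w, clf in zip(weights[:a], classifiers):
        posterior += w * clf.predict_proba(x)                  # first term of Eq. (4)
    for w, clu, M in zip(weights[a:], clusterers, cluster_label_maps):
        posterior += w * (clu.group_proba(x) @ M)              # sum_k P(y|g_k) P(g_k|x)
    return posterior / posterior.sum()                         # normalize

# Prediction rule of Eq. (1):
# y_hat = ensemble_posterior(x, classifiers, clusterers, maps, weights).argmax()
```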

(1) For each cluster g_k^j, we need to estimate its conditional probability P(y | g_k^j), which maps a group ID to a genuine class label y. Intuitively, if we do not have any prior knowledge on the probability P(y | g_k^j), we can use all the remaining (n-1) models from ensemble E to estimate it. This is because the classification models {λ_1, ..., λ_a} can directly provide label information for group g_k^j, while the clustering models can provide inner structure information for deriving P(y | g_k^j): if two groups share the same structure, such as a cluster or a manifold, they are likely to share the same class label [13]. Hence, the conditional probability P(y | g_k^j) can be estimated by combining all (n-1) base models, i.e., by estimating the posterior probability P(y | λ_1, ..., λ_{j-1}, λ_{j+1}, ..., λ_n, g_k^j). To achieve this goal, we propose a label propagation method that first propagates class labels from all classifiers to the clusters as an initialization, and then iteratively refines these initial values by propagating labels among all clusters.

(2) We need to assign a proper weight w_i to each base model λ_i to alleviate concept drifting in stream data. In traditional weighted ensemble classifiers [11], a commonly used method is to assign each base model a weight value reversely proportional to the classifier's error on the up-to-date chunk, because the up-to-date chunk is likely to reflect the genuine concept underlying the data stream. However, if the majority of samples in the up-to-date chunk are unlabeled (which is often the case in our setting), such a weighting approach does not work very well. Alternatively, we weight each base model according to its consistency with the up-to-date base model.

To implement the above two methods, we first construct a graph to represent all classifiers and clustering models, and then use the graph to realize label propagation and weighting for all models.

Figure 1. An illustration of the weighted ensemble classifier and cluster model: base models with weights w_1, w_2, ..., w_{n-1}, w_n are built from the labeled and unlabeled chunks of the data stream.

A. Represent all models in a graph

In order to propagate information among all models, we transform all models into a graph G = (V, E), where the vertex set V represents the cluster centers of each model, and each edge in E represents the similarity between a pair of vertexes in V. The procedure of constructing graph G can be described as follows. Each time a new data chunk D_i (assumed unlabeled) arrives, we first cluster all examples in D_i into v groups g_1^i, ..., g_v^i, with the cluster centers taken as v vertexes and included in graph G. The edge between each new group g_k^i (1 ≤ k ≤ v) and each existing vertex in G (i.e., a group g_u^j) is calculated using a Euclidean-distance-based metric, as shown in Eq. (5):

    \mathrm{Sim}_g(g_k^i, g_u^j) = d^{-1}(g_k^i, g_u^j) = \| g_k^i - g_u^j \|_2^{-2}    (5)

where the function Sim_g(g_k^i, g_u^j) represents the similarity between two groups g_k^i and g_u^j (i ≠ j). By doing so, we can summarize all incoming clusters into graph G. If chunk D_i is labeled, we will also build a classifier λ_i on D_i using a classification algorithm (e.g., a decision tree), with the v groups g_1^i, ..., g_v^i corresponding to classifier λ_i suitably labeled.

B. Propagate label information from classifiers to clusters

As discussed above, the key to constructing our ensemble model is to estimate the conditional probability P(y | g_k^j), which reflects the mapping relationship between group ID g_k^j and genuine class label y ∈ Y. To achieve this goal, we first propagate class label information from the classifiers {λ_1, ..., λ_a} to each unlabeled cluster g_k^j. In other words, for each g_k^j, we combine the class labels from all classifiers to estimate its class label. This is equivalent to the maximization of the following utility function:

    y_k^j = \arg\max_{y \in Y} P(y \mid \lambda_1, ..., \lambda_a, g_k^j)    (6)

where y_k^j denotes the estimated class label of the unlabeled group g_k^j, and λ_1, ..., λ_a are the classifiers. Assuming the classifiers are independent of each other, the objective function in Eq. (6) can be revised accordingly, as shown in Eq. (7):

    P(y \mid \lambda_1, ..., \lambda_a, g_k^j) \propto \sum_{i=1}^{a} \sum_{h=1}^{v} P(y \mid g_h^i) P(g_h^i \mid g_k^j)    (7)

where g_h^i represents the h-th (1 ≤ h ≤ v) group in base classification model λ_i, and P(y | g_h^i) represents the probability that group g_h^i belongs to class y. Since all groups g_h^i corresponding to the classifiers λ_i are labeled, the probability P(y | g_h^i) is known a priori. Therefore, the difficulty of computing Eq. (7) is the calculation of the conditional probability P(g_h^i | g_k^j). Without loss of generality, we let P(g_h^i | g_k^j) ∝ Sim_g(g_h^i, g_k^j). Then Eq. (7) can be finally revised to Eq. (8):

    y_k^j = \arg\max_{y \in Y} \sum_{i=1}^{a} \sum_{h=1}^{v} \mathrm{Sim}_g(g_h^i, g_k^j) \, \mathrm{Label}(g_h^i)    (8)

where the function Label(g_h^i) = P(y | g_h^i) is the class label of group g_h^i corresponding to classifier λ_i. Eq. (8) can be easily solved in O(av) time.
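As a rough illustration of Eq. (5) and the similarity-weighted vote of Eq. (8), the following sketch assumes cluster centers are stored as NumPy vectors; the helper names are hypothetical, not from the paper:

```python
import numpy as np

def sim_g(center_a, center_b):
    """Eq. (5): group similarity as the inverse squared Euclidean distance
    between two cluster centers (infinite if the centers coincide)."""
    d2 = np.sum((np.asarray(center_a) - np.asarray(center_b)) ** 2)
    return 1.0 / d2 if d2 > 0 else np.inf

def initial_cluster_label(g_center, labeled_groups):
    """Eq. (8): estimate the class label of an unlabeled cluster by a
    similarity-weighted vote over all labeled groups g_h^i of the a
    classifiers; labeled_groups is a list of (center, label) pairs."""
    scores = {}
    for center, label in labeled_groups:
        scores[label] = scores.get(label, 0.0) + sim_g(g_center, center)
    return max(scores, key=scores.get)   # argmax_y of the weighted vote
```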

The essence of Eq. (8) is to combine all classifiers through a weighted averaging mechanism to estimate the genuine class label of each unlabeled cluster g_k^j. Compared to previous label propagation methods that only use one single classifier to predict class labels for unlabeled data (models) [7, 13, 15], our method, which combines all classifiers for estimation, is more robust to concept drifting. Nevertheless, a possible limitation is that the estimated class label y_k^j for an unlabeled cluster g_k^j is a rough solution which still needs to be refined, because the number of classifiers is usually quite limited for inferring an optimal class label for each unlabeled cluster. Therefore, in the next subsection, we use the internal structure information from the clustering models to refine y_k^j.

C. Propagate internal structural information among clusters

In this subsection, we iteratively refine the class label of each unlabeled cluster according to its internal structure information. Our motivation is that if two clusters share the same structure, such as a cluster or a manifold, they are likely to share the same class label. Formally, let Q_{m×c} = [P(y | g_1^{a+1})^T, ..., P(y | g_k^j)^T, ..., P(y | g_v^n)^T]^T be the matrix of conditional probabilities that we are aiming for, where the subscript m represents the total number of unlabeled groups. Let F_{m×c} = [y_1^{(a+1)T}, ..., y_k^{jT}, ..., y_v^{nT}]^T be the matrix of initial class labels, with each entry calculated using Eq. (8). We construct a similarity matrix S, where each non-diagonal entry S_{i,j} = Sim_λ(λ_i, λ_j) represents the similarity between two clustering models, and all diagonal entries S_{i,i} = 0. Based on the similarity matrix S, we also construct a normalized matrix H = U^{-1/2} S U^{-1/2}, where U is a diagonal matrix with its (i, i)-element equal to the sum of the i-th row of S. Accordingly, the objective of label propagation among all clusters is to minimize the difference between two clusters when their similarity is high, while also minimizing each cluster's deviation from the initial label obtained from Eq. (8):

    \min_Q \Psi(Q) = \frac{1}{2} \sum_{i,j=1}^{m} S_{ij} \left\| \frac{Q_i}{\sqrt{U_{ii}}} - \frac{Q_j}{\sqrt{U_{jj}}} \right\|^2 + \eta \sum_{i=1}^{m} \| Q_i - F_i \|^2    (9)

where η ∈ [0, 1] is a trade-off parameter controlling the preference between the two terms of the objective function. The essence of Eq. (9) is to propagate label information among all clusters according to their internal structure. To solve Eq. (9), we differentiate the objective function with respect to Q, and the final solution can be written as Q* = (1-α)(I - αH)^{-1} F, where the parameter α = 1/(1+η) can be set manually. To avoid the calculation of a matrix inverse, the iterative update Q(τ+1) = αHQ(τ) + (1-α)F can be used to compute the optimal value Q*. It generates a sequence {Q(0), ..., Q(i), ...} which finally converges to Q* [15].

D. Weight Value Calculation for All Base Models

Another issue in calculating Eq. (4) is to properly calculate the weight value of each base model. Weighting methods have been extensively studied in previous ensemble classifiers to alleviate concept drifting. The most intuitive method is to weight each base classifier according to its prediction accuracy on the up-to-date chunk [11]. In our setting, since most instances in the up-to-date chunks are unlabeled, previous weighting methods are not applicable. Therefore, we propose a general method that calculates the weight value of each base model according to its consistency with respect to the up-to-date base model. Formally, let λ_n be the up-to-date model; then for a base model λ_i (1 ≤ i < n), its weight w_i can be computed using Eq. (10):

    w_i = \frac{1}{Z} \mathrm{Sim}_\lambda(\lambda_i, \lambda_n)    (10)

where Z = \sum_{i=1}^{n} \mathrm{Sim}_\lambda(\lambda_i, \lambda_n) is a normalization factor.
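The refinement of Eq. (9) and the weighting of Eq. (10) can be sketched as below. This is a minimal NumPy sketch of the paper's stated iteration Q(τ+1) = αHQ(τ) + (1-α)F; the matrix shapes and the numerical guard on the row sums are assumptions for illustration:

```python
import numpy as np

def refine_labels(S, F, alpha=0.618, n_iter=15):
    """Iterative solution of Eq. (9): Q(t+1) = alpha*H*Q(t) + (1-alpha)*F,
    which converges to Q* = (1-alpha)(I - alpha*H)^{-1} F [15].
    S: (m, m) similarity matrix among the m unlabeled groups (zero diagonal).
    F: (m, c) initial label matrix from Eq. (8), one row per group."""
    deg = np.maximum(S.sum(axis=1), 1e-12)      # row sums: diagonal of U
    U_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    H = U_inv_sqrt @ S @ U_inv_sqrt             # H = U^{-1/2} S U^{-1/2}
    Q = F.astype(float).copy()
    for _ in range(n_iter):
        Q = alpha * (H @ Q) + (1 - alpha) * F   # one propagation step
    return Q

def consistency_weights(sims_to_latest):
    """Eq. (10): w_i = Sim_lambda(lambda_i, lambda_n) / Z, normalized so
    that the weights of all n base models sum to one."""
    s = np.asarray(sims_to_latest, dtype=float)
    return s / s.sum()
```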
E. The Ensemble Model

We summarize the procedure of building our ensemble model in Algorithm 1. Each time a new data chunk arrives, we first cluster it into v groups (if the data chunk is labeled, we also build a classifier). After that, we update the graph G by incorporating the v new clusters and meanwhile discarding outdated ones. Next, we use the label propagation method to estimate the class labels of the unlabeled groups in the clustering models. At the last step, all base models are weighted according to their consistencies with the up-to-date model. By doing so, the ensemble model can be trained to predict each yet-to-come test example using Eq. (4).

From Algorithm 1, we can observe that both the label propagation and the model weighting operations run at the group level. Therefore, we expect that our ensemble model can be trained in linear time with respect to the total number of clusters in graph G. Due to space limitations, we omit the algorithm complexity analysis here.

Algorithm 1: The Ensemble Classifier and Cluster Model
Require: Graph G representing models λ_1, ..., λ_n; an up-to-date chunk D_{n+1}; number of groups v; parameter α.
Step 1: Cluster D_{n+1} into v groups {g_1^{n+1}, ..., g_v^{n+1}};
Step 2: If D_{n+1} is labeled, then assign class labels to the newly built v groups, and meanwhile build a classifier λ_{n+1};
Step 3: Update graph G by adding the v new vertexes and removing the outdated vertexes g_1^1, ..., g_v^1;
Step 4: Estimate the class label of each unlabeled group using Eqs. (8) and (9);
Step 5: Use Eq. (10) to estimate each model's weight;
Step 6: Build the ensemble model E using the weighted average of all n base models;
Step 7: Output the ensemble model E.
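Algorithm 1 could be driven per chunk roughly as follows. Every helper here (`kmeans_groups`, `train_classifier`, the graph methods, `initial_labels`, `build_ensemble`) is a hypothetical name standing in for the corresponding step, not an API from the paper:

```python
def process_chunk(G, chunk, v, alpha, labels=None):
    """One pass of Algorithm 1 over a newly arrived chunk (sketch only)."""
    groups = kmeans_groups(chunk, v)                    # Step 1: v groups
    if labels is not None:                              # Step 2: labeled chunk
        clf = train_classifier(chunk, labels)
        G.add_classifier(clf, groups, labels)
    G.add_vertexes(groups)                              # Step 3: new vertexes...
    G.remove_oldest_vertexes(v)                         # ...and slide the window
    F = initial_labels(G)                               # Step 4: Eq. (8)
    Q = refine_labels(G.similarity_matrix(), F, alpha)  #         Eq. (9)
    w = consistency_weights(G.sims_to_latest_model())   # Step 5: Eq. (10)
    return build_ensemble(G.base_models(), Q, w)        # Steps 6-7
```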

IV. EXPERIMENTS

In this section, we report our experimental studies to validate the claim that our ensemble method achieves better performance than existing ensemble models, such as ensemble classifiers and ensemble clusters.

A. Experimental settings

Benchmark Data Streams. We use two real-world data streams, which can be downloaded from the UCI Machine Learning Repository [1], as benchmark data. The Malicious URLs Detection stream contains both malicious and benign URL streams. The malicious URL stream is obtained from a large web mail provider, whose live, real-time feed supplies 6,000-7,500 examples of spam and phishing URLs per day. The benign URL stream is drawn from Yahoo's directory listing. For every incoming URL, features are obtained by querying DNS, WHOIS, blacklist, and geographic information servers, as well as by processing IP-address-related and lexical features. For simplicity, in our experiments we use the first 128 continuous features of the first week's data for analysis. The mining task is to detect malicious URLs, such as spam, phishing, and exploits. The Intrusion Detection stream consists of a series of TCP connection records for a local area network. Each example in this data stream corresponds to a connection, which is either a normal connection or an attack. The attack types include denial-of-service (DOS), unauthorized access from a remote machine (R2L), unauthorized access to local root privileges (U2R), and surveillance and other probing (Probing). The mining task is to correctly detect attacks online.

Benchmark Methods. For comparison purposes, we implement the following four models: (1) Ensemble classifiers. Previous ensemble classifier models on data streams can be categorized into two types: ensemble classifiers built on different data chunks using the same learning algorithm [11] (denoted as EC1), and ensemble classifiers built with different learning algorithms on the up-to-date chunk [6] (denoted as EC2). In EC1, we use a decision tree as the base learning algorithm. In EC2, we use a logistic regression classifier, a decision tree, and naive Bayes as the base algorithms. (2) Ensemble clusters. Similar to the ensemble classifiers, we implement two types of ensemble clusters: ensemble clusters built on different data chunks [8] (denoted as EU1), and ensemble clusters built with different learning algorithms (denoted as EU2). k-means is used as the base clustering algorithm; in EU2, all clustering models are built using the k-means algorithm with different initializations. Since ensemble clusters only output fake group IDs, we map the outputs of the clustering algorithms to the best possible class prediction. All learning algorithms used here are implemented in the WEKA data mining package [12]. For simplicity, we refer to the proposed ensemble classification and clustering model on data streams as the ECU model. In ECU, we use a decision tree as the base classification algorithm and k-means as the base clustering algorithm. The parameters of ECU are set as follows: α in solving Q is set to 0.618, and the largest number of iterative steps for solving Q is set to 15.

Measurements. To assess algorithm performance, we use average accuracy (Aacc), average mean square error (Amse), average ranking (AR), variance of ranking (SR), and the number of times a method ranks best (#W) and worst (#L) as our performance metrics. A good prediction model should have high accuracy, a ranking order close to 1, more winning times (#W), and fewer losing times (#L). Meanwhile, if an algorithm has a smaller SR, it is more stable.
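Given a chunk-by-chunk accuracy table (rows = chunks, columns = methods), the ranking-based measures could be computed as in the sketch below. The arbitrary tie-breaking and the exact definitions of AR/SR here are assumptions for illustration, not the paper's evaluation protocol:

```python
import numpy as np

def stream_metrics(acc_per_chunk_by_method):
    """Aacc, average rank (AR), rank variance (SR), and win/loss counts
    (#W/#L) per method; input rows = chunks, columns = methods."""
    A = np.asarray(acc_per_chunk_by_method, dtype=float)
    # rank 1 = best (highest accuracy) on each chunk; ties broken arbitrarily
    ranks = (-A).argsort(axis=1).argsort(axis=1) + 1
    return {
        "Aacc": A.mean(axis=0),
        "AR": ranks.mean(axis=0),
        "SR": ranks.var(axis=0),
        "#W": (ranks == 1).sum(axis=0),
        "#L": (ranks == A.shape[1]).sum(axis=0),
    }
```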
B. Comparisons with other ensemble models

In Tables 1 and 2, we report the performance of all benchmark ensemble methods on the real-world data streams. It is clear that ECU always achieves the best performance with respect to the given measurements. These results assert that ECU outperforms the other existing ensemble models on data streams when the streams provide both classifiers and clustering models. The main reason is that our method, by combining all classifiers and clustering models, achieves lower variance and thus better performance. Besides, when comparing the ensemble classifier models EC1 and EC2 with the ensemble clustering models EU1 and EU2, we can observe that the ensemble classifier models always perform better than the ensemble cluster models. This demonstrates that, although there are a large number of clustering models, we are unable to construct a satisfactory ensemble model by merely combining unlabeled clustering models. Fortunately, adding a small portion of classifiers into the ensemble clusters, as the ECU model does, significantly improves performance.

In Figure 2, we show chunk-by-chunk comparisons among the EC1, EU1, and ECU models on the two real-world data streams (note that since EC2 and EU2 are inferior to EC1 and EU1, respectively, we do not include these two models in this figure). The results also validate our conclusion that the ECU model performs best compared to existing ensemble classifier and ensemble cluster models on data streams with a limited number of classifiers but a large number of clusters.

Table I. Comparisons on malicious website detection (measures Aacc, Amse, AR, SR, #W, #L for methods EC1, EC2, EU1, EU2, ECU).

Table II. Comparisons on intrusion detection (measures Aacc, Amse, AR, SR, #W, #L for methods EC1, EC2, EU1, EU2, ECU).

Figure 2. Chunk-by-chunk comparisons on the two real-world data streams: (a) malicious URLs detection; (b) intrusion detection. The parameter settings for these two data streams are as follows: the chunk size is set to 500, with 100 data chunks in total and only 20% of the examples labeled. It is obvious that ECU performs better than the ensemble classifier and ensemble cluster models on both data streams.

V. CONCLUSIONS

Due to its inherent merits in handling drifting concepts and large data volumes, ensemble learning has attracted much attention in stream data mining research. Many recently proposed sophisticated data stream mining models are based on the ensemble learning framework, with the majority of ensemble models falling into the ensemble classifier category. Nevertheless, as labeling training samples is a labor-intensive and expensive process, in a practical stream data mining scenario it is often the case that we have very few labeled training samples (to build classifiers), but a large number of unlabeled samples available to build clusters. It would be a waste to only consider classifiers in an ensemble model, as most existing solutions do. Accordingly, in this paper we present a new ensemble learning method which relaxes the original ensemble models to accommodate both classifiers and clusters through a weighted average mechanism. In order to handle concept drifting, we also propose a new consistency-based weighting schema that weights all base models according to their consistencies with respect to the up-to-date model. Experimental results on real-world data streams demonstrate that our ensemble model outperforms existing ensemble models for mining concept drifting data streams.

F. Comparisons with a benchmark incremental semi-supervised model

In Figure 6, we list comparison results between the ensemble model ECU and the incremental model SC with respect to different concept drifting scenarios on synthetic data streams. The detailed parameter settings are listed

ACKNOWLEDGMENT

This research was partially supported by the National Science Foundation of China (NSFC) under Grant No. , the Basic Research Program of China (973 Program) under Grant No. 2007CB311100, and the Australian Research Council's Discovery Projects funding scheme (project number DP ).

REFERENCES

[1] A. Asuncion, D. Newman. UCI Machine Learning Repository. Irvine, CA, 2007.
[2] C. Aggarwal. Data Streams: Models and Algorithms. Springer, 2007.
[3] A. Bifet, G. Holmes, B. Pfahringer et al. New ensemble methods for evolving data streams. In Proc. of KDD 2009.
[4] P. Domingos, G. Hulten. Mining high-speed data streams. In Proc. of KDD 2000.
[5] C. Domeniconi, D. Gunopulos. Incremental support vector machine construction. In Proc. of ICDM 2001.
[6] J. Gao, W. Fan, J. Han. On appropriate assumptions to mine data streams: analysis and practice. In Proc. of ICDM 2007.
[7] J. Gao, W. Fan, Y. Sun, J. Han. Heterogeneous source consensus learning via decision propagation and negotiation. In Proc. of KDD 2009.
[8] P. Hore, L. Hall, D. Goldgof. A scalable framework for cluster ensembles. Pattern Recognition, Vol. 42, 2009.
[9] M. Masud, J. Gao, L. Khan et al. A practical approach to classify evolving data streams: training with limited amount of labeled data. In Proc. of ICDM 2008.
[10] S. Pan, Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 2010.
[11] H. Wang, W. Fan, P. Yu, J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proc. of KDD 2003.
[12] I. Witten, E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.
[13] O. Chapelle, B. Schölkopf, A. Zien. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.
[14] A. Strehl et al. Cluster ensembles: a knowledge reuse framework for combining multiple partitions. JMLR, Vol. 3, 2002.
[15] D. Zhou, O. Bousquet, T. Lal, J. Weston, B. Schölkopf. Learning with local and global consistency. In NIPS 2003.
[16] P. Zhang, X. Zhu, L. Guo. Mining data streams with labeled and unlabeled training examples. In Proc. of ICDM 2009.
[17] P. Zhang, X. Zhu, Y. Shi. Categorizing and mining concept drifting data streams. In Proc. of KDD 2008.
[18] X. Zhu, P. Zhang, X. Lin, Y. Shi. Active learning from data streams. In Proc. of ICDM 2007.
