Classifier and Cluster Ensembles for Mining Concept Drifting Data Streams

Peng Zhang, Xingquan Zhu, Jianlong Tan, Li Guo
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
QCIS Centre, Faculty of Eng. & IT, Univ. of Technology, Sydney, NSW 2007, Australia
{zhangpeng, tjl, guoli}@ict.ac.cn; xqzhu@it.uts.edu.au

Abstract — Ensemble learning is a commonly used tool for building prediction models from data streams, due to its intrinsic merits in handling large volumes of stream data. Despite its extraordinary success in stream data mining, existing ensemble models in stream data environments mainly fall into the ensemble classifiers category, without recognizing that building classifiers requires a labor-intensive labeling process: it is often the case that we have only a small number of labeled samples to train a few classifiers, while a large number of unlabeled samples are available to build clusters from the data streams. Accordingly, in this paper we propose a new ensemble model which combines both classifiers and clusters for mining data streams. We argue that the main challenges of this new ensemble model are that (1) clusters formed from data streams only carry cluster IDs, with no genuine class label information, and (2) concept drifting underlying the data streams makes it even harder to combine clusters and classifiers into one ensemble framework. To handle challenge (1), we present a label propagation method that infers each cluster's class label by making full use of both the class label information from the classifiers and the internal structure information from the clusters. To handle challenge (2), we present a new weighting schema that weights all base models according to their consistencies with the up-to-date base model. As a result, all classifiers and clusters can be combined, through a weighted average mechanism, for prediction. Experiments on real-world data streams demonstrate that our method outperforms simple classifier ensembles and cluster ensembles for stream data mining.

Keywords — Data Stream Mining, Classification, Ensemble Learning, Concept Drifting.

I. INTRODUCTION

Building prediction models from data streams is one of the most important research fields in the data stream mining community [2]. Recently, many ensemble models have been proposed to build prediction models from concept drifting data streams [3, 6, 11]. Different from traditional incremental and online learning approaches that merely rely on a single model [4, 5], ensemble learning employs a divide-and-conquer approach that first splits continuous data streams into small data chunks, and then builds light-weight base classifiers from these small chunks. At the final stage, all base classifiers are combined for prediction. By doing so, an ensemble model enjoys a number of advantages, such as scaling up to large volumes of stream data, adapting quickly to new concepts, achieving lower variance than a single model, and being easy to parallelize. Nevertheless, most existing ensemble models [3, 6, 11, 17, 18] make the implicit assumption that all base models are classifiers, and the underlying technical solutions mainly focus on ensembles of classifiers. A much more realistic situation is that, in many real-world applications, we can build only a few classifiers, but a large number of unlabeled clusters, from the stream data [9, 16]. As a result, it is essential to combine both classifiers and clusters, as an ensemble, for accurate prediction.
Compared to existing ensembles of classifiers, building ensemble models with both classifiers and clusters imposes two extra challenges: (1) how to acquire the genuine class label information of each unlabeled cluster? Clustering models only assign each cluster a cluster ID instead of a genuine class label, so the underlying challenge is to find the mapping relationship between cluster IDs and genuine class labels; and (2) how to assign proper weights to all base classifiers and clusters, such that the ensemble predictor is able to handle concept drifting? In previous ensembles of classifiers, all base classifiers are weighted according to their prediction accuracies on the up-to-date chunk [11]. However, in our study such a weighting schema does not work very well, because the up-to-date chunks are mostly unlabeled.

In light of the above challenges, in this paper we present a weighted ensemble of classifiers and clusters to mine concept drifting data streams. To address challenge (1), we first build a graph to represent all classifiers and clusters. Using this graph, we present a new label propagation mechanism that first propagates label information from all classifiers to the clusters, and then iteratively refines the results by propagating similarities among all clusters. To address challenge (2), we propose a new consistency-based weighting method that assigns each base model a weight value according to its consistency with respect to the up-to-date base model. By doing so, we are able to combine both classifiers and clusters for accurate prediction through a weighted averaging schema.

The rest of the paper is organized as follows: Section 2 surveys related work; Section 3 describes the ensemble learning model in detail; Section 4 gives experimental results; and we conclude the paper in Section 5.

II. RELATED WORK

Our work is, in fact, a combination of ensemble classifiers on data streams, ensemble clustering, and transfer learning across multiple data sources.

Ensemble classifiers on data streams provide a generic framework for handling massive volumes of streaming data with concept drifting. The idea is to partition continuous data streams into small data chunks, from which a number of base classifiers are built and combined for prediction. For example, Wang et al. [11] first proposed a weighted ensemble classifier framework and demonstrated that their model outperforms a single learning model. Inspired by their work, various ensemble models have been proposed, such as ensembles of different learning algorithms [6] and ensembles of active learners [18], to name a few. Our work differs from the above efforts because we consider both classifiers and clusters for ensemble learning.

Ensemble clustering [14] aims to combine multiple clusterings for prediction. For a given test set, each clustering derives a label vector. Noticing that some label vectors may conflict with each other, most state-of-the-art ensemble clustering models employ a Jaccard distance metric to minimize the discrepancy between each pair of label vectors. Although such a label-vector-based consensus method performs well on static databases, it cannot be directly applied to data stream scenarios for two reasons: (1) the label vector structure is proportional to the example size, which is unsuitable for data streams with large volumes; and (2) label vectors cannot handle concept drifting [8]. Considering these issues, we alternatively use cluster centers to measure the similarity between each pair of unlabeled clusters.

From the transfer learning perspective [10], a recent work on knowledge transfer learning [7] also studies combining both supervised and unsupervised models built from different domains for prediction, which is very similar to our setting. The difference between our work and theirs is fourfold: (1) we aim to mine from data streams, whereas they aim to mine from static databases; (2) we use cluster centers as the basic unit to propagate class labels, whereas they use label vectors; (3) in order to tackle concept drifting during the label propagation process, we first propagate label information from all classifiers to the clusters, then estimate the class label of each cluster and iteratively refine the labels by propagating label information among all clusters, whereas the work reported in [7] only propagates class labels from one classifier as initialization, which makes it sensitive to concept drifting; and (4) we aim to build an active ensemble model, where the ensemble can be constructed before the arrival of test examples, whereas their solution belongs to the lazy learning category, where model training is delayed until test examples are observed.

III. PROBLEM FORMULATION AND SOLUTIONS

Consider a data stream S consisting of an infinite number of data records (x_i, y_i), where x_i ∈ R^d denotes an instance containing d-dimensional attributes, and y_i ∈ Y = {c_1, ..., c_r} represents its class label (note that only a small portion of the data records are labeled and actually contain labeling information y_i). In order to build the ensemble model, we partition the stream data into a number of data chunks, as shown in Fig. 1. From the most recent n data chunks, we build n base models λ_1, ..., λ_n. Without loss of generality, we assume the former a models λ_1, ..., λ_a are classifiers, while the remaining b models λ_{a+1}, ..., λ_n are clustering models. Note that 1 ≤ a, b ≤ n and a + b = n.
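To make this setup concrete, the following is a minimal sketch (not the authors' code) of building one base model per chunk; it assumes scikit-learn's DecisionTreeClassifier and KMeans as stand-ins for the WEKA decision tree and k-means learners used later in the experiments, with an illustrative chunk size and number of groups v.

```python
# A sketch of per-chunk base-model construction: labeled chunks yield
# classifiers, unlabeled chunks yield clustering models (assumption:
# scikit-learn stand-ins for the WEKA learners used in the paper).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

def build_base_model(X_chunk, y_chunk=None, v=5, seed=0):
    """Return ('classifier', model) for a labeled chunk,
    or ('clusterer', model) with v groups for an unlabeled one."""
    if y_chunk is not None:
        return "classifier", DecisionTreeClassifier(random_state=seed).fit(X_chunk, y_chunk)
    return "clusterer", KMeans(n_clusters=v, n_init=10, random_state=seed).fit(X_chunk)

# Example: one labeled chunk (gives a classifier) and one unlabeled chunk (gives a clusterer).
rng = np.random.default_rng(0)
X1, y1 = rng.normal(size=(500, 4)), rng.integers(0, 2, size=500)
X2 = rng.normal(size=(500, 4))
base_models = [build_base_model(X1, y1), build_base_model(X2)]
```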
Our goal is to combine all n base models to construct an ensemble model E, such that for a yet-to-come record x, the ensemble E can assign x a class label y which satisfies Eq. (1),

$y = \arg\max_{y \in Y} P(y \mid x, E)$  (1)

where the posterior probability P(y | x, E) is the weighted average of all n base models, as shown in Eq. (2),

$P(y \mid x, E) = \sum_{i=1}^{n} w_i P(y \mid x, \lambda_i)$  (2)

If all n base models in ensemble E can directly derive a class label y for a test example x, then Eq. (2) degenerates to the generic ensemble classifier models that have been extensively studied in existing research [11, 3, 6, 17]. Nevertheless, in our setting, a clustering model can only assign x a cluster ID that does not carry any class label information. Formally, for a test example x, a clustering model λ_j (a < j ≤ n) will assign it a group ID g_j^k with probability P(g_j^k | x, λ_j) (1 ≤ k ≤ r), instead of a genuine class label probability P(y | x, λ_j). To bridge these two probabilities, we introduce a conditional probability P(y | g_j^k), which reflects the mapping relationship between cluster ID g_j^k and genuine class label y ∈ Y. By doing so, for each clustering model λ_j, the posterior probability P(y | x, λ_j) can be estimated by integrating all r mappings, as shown in Eq. (3),

$P(y \mid x, \lambda_j) = \sum_{k=1}^{r} P(y \mid g_j^k)\, P(g_j^k \mid x, \lambda_j)$  (3)

where r is the total number of classes. Consequently, the weighted ensemble P(y | x, E) in Eq. (2) can be revised to the following Eq. (4),

$P(y \mid x, E) = \sum_{i=1}^{a} w_i P(y \mid x, \lambda_i) + \sum_{j=a+1}^{n} \sum_{k=1}^{r} w_j P(y \mid g_j^k)\, P(g_j^k \mid x, \lambda_j)$  (4)

where the first term represents the utility from the a classifiers, and the second term represents the utility from the b clustering models.
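As a concrete illustration of Eqs. (1)-(4), the sketch below (a stand-in under stated assumptions, not the authors' implementation) computes the weighted posterior by summing classifier probabilities and cluster-mapped probabilities. Here cluster_label_probs holds the P(y | g_j^k) values that Sections III-B and III-C estimate, and a hard cluster assignment stands in for P(g_j^k | x, λ_j).

```python
# A minimal sketch of the weighted prediction of Eqs. (1)-(4); not the authors'
# code. Assumes scikit-learn style models: classifiers expose predict_proba
# (class columns assumed aligned across chunks) and clusterers expose predict.
import numpy as np

def ensemble_predict_proba(x, classifiers, clusterers, cluster_label_probs,
                           w_cls, w_clu, n_classes):
    """P(y|x,E) = sum_i w_i P(y|x,lam_i)
                + sum_j w_j sum_k P(y|g_j^k) P(g_j^k|x,lam_j)   (Eq. (4))."""
    p = np.zeros(n_classes)
    for w, clf in zip(w_cls, classifiers):
        p += w * clf.predict_proba([x])[0]          # classifier term P(y|x, lambda_i)
    for w, clu, label_probs in zip(w_clu, clusterers, cluster_label_probs):
        k = clu.predict([x])[0]                     # hard assignment: P(g_j^k|x, lambda_j) = 1
        p += w * label_probs[k]                     # mapped term P(y|g_j^k)
    return p / max(p.sum(), 1e-12)

def ensemble_predict(x, *args, **kwargs):
    """Eq. (1): pick the class maximizing the weighted posterior."""
    return int(np.argmax(ensemble_predict_proba(x, *args, **kwargs)))
```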

To estimate Eq. (4), two issues need to be considered:

(1) For each cluster g_j^k, we need to estimate its conditional probability P(y | g_j^k), which maps the group ID to a genuine class label y. Intuitively, when we don't have any prior knowledge on the probability P(y | g_j^k), we can use all the remaining (n − 1) models from the ensemble E to estimate this probability. This is because the classification models {λ_1, ..., λ_a} can directly provide label information for group g_j^k, while the clustering models can provide inner structure information for deriving P(y | g_j^k): if two groups lie on the same structure, such as a cluster or a manifold, they are likely to share the same class label [13]. Hence, the conditional probability P(y | g_j^k) can be estimated by combining all (n − 1) base models, i.e., by estimating the posterior probability P(y | λ_1, ..., λ_{j−1}, λ_{j+1}, ..., λ_n, g_j^k). To achieve this goal, we propose a label propagation method that first propagates class labels from all classifiers to the clusters as an initialization, and then iteratively refines these initial values by propagating labels among all clusters.

(2) How to assign a proper weight w_i to each base model λ_i to alleviate concept drifting in the stream data? In traditional weighted ensemble classifiers on data streams [11], a commonly used method is to weight each base model according to its prediction accuracy on the up-to-date chunk, because the up-to-date chunk is likely to reflect the emerging concept underlying the data stream. However, if the majority of samples in the up-to-date chunk are unlabeled (which is often the case in our setting), such a weighting approach does not work very well. Alternatively, we weight each base model according to its consistency with the up-to-date base model.

To implement the above two methods, we first construct a graph to represent all classifiers and clustering models, and then use the graph to realize label propagation and weighting for all models.

[Figure 1. An illustration of the weighted ensemble classifiers and clusters model: labeled chunks of the data stream contribute classifiers, unlabeled chunks contribute clusters, and all base models are combined with weights w_1, ..., w_n.]

A. Represent all models in a graph

In order to propagate information among all models, we transform all models into a graph G = (V, E), where the vertex set V represents the cluster centers of each model, and each edge in set E represents the similarity between a pair of vertexes in V. The procedure of constructing graph G is as follows. Each time a new data chunk D_i (assumed unlabeled) arrives, we first cluster all examples in D_i into v groups g_1^i, ..., g_v^i, with the cluster centers taken as v vertexes and included in graph G. The edge between each new group g_k^i (1 ≤ k ≤ v) and each existing vertex in G (i.e., group g_u^j) is calculated using a Euclidean-distance-based metric, as shown in Eq. (5),

$\mathrm{Sim}_g(g_k^i, g_u^j) = d^{-1}(g_k^i, g_u^j) = \|g_k^i - g_u^j\|_2^{-1}$  (5)

where the function Sim_g(g_k^i, g_u^j) represents the similarity between two groups g_k^i and g_u^j (i ≠ j). By doing so, we can summarize all incoming clusters into graph G. If chunk D_i is labeled, we also build a classifier λ_i on D_i using a classification algorithm (e.g., decision tree), with the v groups g_1^i, ..., g_v^i corresponding to classifier λ_i suitably labeled.

B. Propagate label information from classifiers to clusters

As discussed above, the key to constructing our ensemble model is to estimate the conditional probability P(y | g_j^k), which reflects the mapping relationship between group ID g_j^k and genuine class label y ∈ Y. To achieve this goal, we first propagate class label information from the classifiers {λ_1, ..., λ_a} to each unlabeled cluster g_j^k. In other words, for g_j^k, we combine the class labels from all classifiers to estimate its class label. This is equivalent to the maximization of the following utility function,

$y_j^k = \arg\max_{y \in Y} P(y \mid \lambda_1, \ldots, \lambda_a, g_j^k)$  (6)

where y_j^k denotes the estimated class label of the unlabeled group g_j^k, and λ_1, ..., λ_a are the classifiers. Assuming the classifiers are independent of each other, the objective function in Eq. (6) can be revised accordingly, as shown in Eq. (7),

$P(y \mid \lambda_1, \ldots, \lambda_a, g_j^k) \propto \sum_{i=1}^{a} \sum_{h=1}^{v} P(y \mid g_h^i)\, P(g_h^i \mid g_j^k)$  (7)

where g_h^i represents the h-th (1 ≤ h ≤ v) group in base classification model λ_i, and P(y | g_h^i) represents the probability that group g_h^i belongs to class y. Since all groups g_h^i corresponding to the classifiers λ_i are labeled, the probability P(y | g_h^i) is known a priori. Therefore, the difficulty of computing Eq. (7) lies in the calculation of the conditional probability P(g_h^i | g_j^k). Without loss of generality, we let P(g_h^i | g_j^k) ∝ Sim_g(g_h^i, g_j^k). Then Eq. (7) can be finally revised to Eq. (8),

$y_j^k = \arg\max_{y \in Y} \sum_{i=1}^{a} \sum_{h=1}^{v} \mathrm{Sim}_g(g_h^i, g_j^k)\, \mathrm{Label}(g_h^i)$  (8)

where the function Label(g_h^i) = P(y | g_h^i) is the class label of group g_h^i corresponding to classifier λ_i. Therefore, Eq. (8) can be easily solved in O(av) time.
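A small sketch of Eqs. (5) and (8) follows, under stated assumptions: group centers are plain NumPy vectors, labeled_dist[h] is the class distribution of labeled group g_h^i (known a priori from the labeled chunk), and a small epsilon guards the inverse distance. Function and variable names here are illustrative, not from the paper.

```python
# Sketch of Eq. (5) (inverse Euclidean similarity between group centers) and
# Eq. (8) (similarity-weighted voting from labeled classifier groups to
# initialize each unlabeled cluster's class distribution).
import numpy as np

def sim_g(c1, c2, eps=1e-12):
    """Eq. (5): Sim_g(g_k^i, g_u^j) = 1 / ||c1 - c2||_2 (eps avoids division by zero)."""
    return 1.0 / (np.linalg.norm(c1 - c2) + eps)

def init_cluster_labels(unlabeled_centers, labeled_centers, labeled_dist):
    """Eq. (8): for each unlabeled group, accumulate Sim_g-weighted label votes
    from all labeled groups; each row of the result is a class distribution
    whose argmax is the initial estimate y_j^k."""
    F = np.zeros((len(unlabeled_centers), labeled_dist.shape[1]))
    for j, c_u in enumerate(unlabeled_centers):
        for h, c_l in enumerate(labeled_centers):
            F[j] += sim_g(c_u, c_l) * labeled_dist[h]
        F[j] /= F[j].sum()        # normalize each row to a distribution
    return F
```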

The essence of Eq. (8) is to combine all classifiers through a weighted averaging mechanism to estimate the genuine class label of an unlabeled cluster g_j^k. Compared to previous label propagation methods that only use one single classifier to predict class labels for unlabeled data (models) [7, 13, 15], our method, which combines all classifiers for estimation, is more robust to concept drifting. Nevertheless, a possible limitation is that the estimated class label y_j^k for the unlabeled cluster g_j^k is a rough solution which still needs to be refined, because the number of classifiers is usually quite limited for inferring an optimal class label for each unlabeled cluster. Therefore, in the next subsection, we use the internal structure information from the clustering models to refine y_j^k.

C. Propagate internal structural information among clusters

In this subsection, we iteratively refine the class label of each unlabeled cluster according to its internal structure information. Our motivation is that if two clusters lie on the same structure, such as a cluster or a manifold, they are likely to share the same class label. Formally, let Q_{m×c} = [P(y | g_1^{a+1})^T, ..., P(y | g_j^k)^T, ..., P(y | g_v^n)^T]^T be the matrix of conditional probabilities that we are aiming for, where the subscript m represents the total number of unlabeled groups and c the number of classes. Let F_{m×c} = [(y_1^{a+1})^T, ..., (y_j^k)^T, ..., (y_v^n)^T]^T be the matrix of initial class labels, with each entry calculated using Eq. (8). We construct a similarity matrix S, where each non-diagonal entry S_{i,j} = Sim_λ(λ_i, λ_j) represents the similarity between two clustering models, and all diagonal entries S_{i,i} = 0. Based on the similarity matrix S, we also construct a normalized matrix H = U^{−1/2} S U^{−1/2}, where U is the diagonal matrix whose (i, i)-th element equals the sum of the i-th row of S. Accordingly, the objective of label propagation among all clusters is to minimize the difference between two clusters if their similarity is high, while also minimizing each cluster's deviation from the initial label obtained from Eq. (8),

$\min \Psi(Q) = \frac{1}{2} \sum_{i,j=1}^{m} S_{ij} \left\| \frac{Q_i}{\sqrt{U_{ii}}} - \frac{Q_j}{\sqrt{U_{jj}}} \right\|^2 + \eta \sum_{i=1}^{m} \|Q_i - F_i\|^2$  (9)

where η ∈ [0, 1] is a trade-off parameter controlling the preference between the two terms of the objective function. The essence of Eq. (9) is to propagate label information among all clusters according to their internal structure. To solve Eq. (9), we differentiate the objective function with respect to Q; the final solution can be described as Q* = (1 − α)(I − αH)^{−1} F, where the parameter α = 1/(1 + η) can be set manually. To avoid the calculation of a matrix inverse, an iterative method Q(τ + 1) = αHQ(τ) + (1 − α)F can be used to compute the optimal value Q*. It generates a sequence {Q(0), ..., Q(i), ...} which finally converges to Q* [15].

D. Weight Value Calculation for All Base Models

Another issue in calculating Eq. (4) is how to properly set the weight value of each base model. Weighting methods have been extensively studied in previous ensembles of classifiers to alleviate concept drifting. The most intuitive method is to weight each base classifier according to its prediction accuracy on the up-to-date chunk [11]. In our setting, since most instances in the up-to-date chunks are unlabeled, these previous weighting methods are not applicable. Therefore, we propose a general method that calculates the weight value of each base model according to its consistency with respect to the up-to-date base model. Formally, let λ_n be the up-to-date model; then for a base model λ_i (1 ≤ i < n), its weight w_i can be computed using Eq. (10),

$w_i = \frac{1}{Z} \mathrm{Sim}_\lambda(\lambda_i, \lambda_n)$  (10)

where $Z = \sum_{i=1}^{n} \mathrm{Sim}_\lambda(\lambda_i, \lambda_n)$ is a regularization factor.
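The refinement of Section III-C and the weighting of Eq. (10) can be sketched as below. The sketch follows the iterative scheme Q(τ+1) = αHQ(τ) + (1−α)F with H = U^{-1/2} S U^{-1/2}; since the paper does not spell out Sim_λ, the inverse distance between mean cluster centers is used here purely as an illustrative stand-in.

```python
# Sketch of the Eq. (9) refinement (Zhou et al. [15] style iteration) and the
# consistency-based weights of Eq. (10). S is the cluster-similarity matrix
# (zero diagonal); F is the initial label matrix from Eq. (8).
import numpy as np

def refine_labels(F, S, alpha=0.618, n_iter=15):
    """Iterate Q <- alpha * H @ Q + (1 - alpha) * F, which converges to
    Q* = (1 - alpha) (I - alpha H)^{-1} F."""
    u = S.sum(axis=1)
    U_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(u, 1e-12)))
    H = U_inv_sqrt @ S @ U_inv_sqrt
    Q = F.copy()
    for _ in range(n_iter):
        Q = alpha * H @ Q + (1.0 - alpha) * F
    return Q / Q.sum(axis=1, keepdims=True)   # renormalize rows as P(y|g_j^k)

def consistency_weights(model_centers):
    """Eq. (10): w_i = Sim_lambda(lambda_i, lambda_n) / Z for the earlier models,
    with lambda_n the up-to-date model. Sim_lambda is approximated here by the
    inverse distance between mean cluster centers (an assumption, not the
    paper's definition). model_centers: list of (v, d) center arrays."""
    latest = model_centers[-1].mean(axis=0)
    sims = np.array([1.0 / (np.linalg.norm(c.mean(axis=0) - latest) + 1e-12)
                     for c in model_centers[:-1]])
    return sims / sims.sum()                  # Z normalizes the weights
```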
E. The Ensemble Model

We summarize in Algorithm 1 the procedure of building our ensemble model. Each time a new data chunk arrives, we first cluster it into v groups (if the data chunk is labeled, we also build a classifier). After that, we update graph G by incorporating the new v clusters and meanwhile discarding outdated ones. Next, we use the label propagation method to estimate the class labels of the unlabeled groups in the clustering models. At the last step, all base models are weighted according to their consistencies with the up-to-date model. By doing so, the ensemble model can be trained to predict a yet-to-come test example using Eq. (4). From Algorithm 1, we can observe that both the label propagation and the model weighting operations run at the group level. Therefore, we expect that our ensemble model can be trained in linear time with respect to the total number of clusters in graph G. Due to space limitations, we omit the algorithm complexity analysis here.

Algorithm 1: The Ensemble Classifier and Cluster Model
Require: Graph G which represents models λ_1, ..., λ_n; an up-to-date chunk D_{n+1}; the number of groups v; parameter α.
Step 1: Cluster D_{n+1} into v groups {g_1^{n+1}, ..., g_v^{n+1}};
Step 2: If D_{n+1} is labeled, then assign class labels to the newly built v groups, and meanwhile build a classifier λ_{n+1};
Step 3: Update graph G by adding the v new vertexes and removing the outdated vertexes g_1^1, ..., g_v^1;
Step 4: Estimate the class label of each unlabeled group using Eqs. (8) and (9);
Step 5: Use Eq. (10) to estimate each model's weight;
Step 6: Build the ensemble model E using the weighted average of all n base models;
Step 7: Output the ensemble model E.
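To show how Algorithm 1's steps fit together, here is a high-level driver sketch of the per-chunk loop. It assumes the helper functions from the earlier sketches (build_base_model, init_cluster_labels, refine_labels, consistency_weights) are in scope, uses a deque as the sliding window of base models, and is only an illustration of the control flow, not the authors' implementation.

```python
# Sketch of Algorithm 1's per-chunk update loop. Assumes the helpers defined
# in the earlier sketches are available; a deque plays the role of graph G's
# sliding window of base models (outdated vertexes are discarded, Step 3).
from collections import deque

def process_chunk(window, X_chunk, y_chunk=None, v=5, n_max=10):
    # Steps 1-2: cluster the chunk into v groups; build a classifier if labeled.
    kind, model = build_base_model(X_chunk, y_chunk, v=v)
    window.append((kind, model, X_chunk))
    # Step 3: drop the oldest base model once the window exceeds n_max chunks.
    if len(window) > n_max:
        window.popleft()
    # Steps 4-5 would then run over the models currently in the window:
    #   F = init_cluster_labels(...); Q = refine_labels(F, S)   # Eqs. (8)-(9)
    #   w = consistency_weights(...)                            # Eq. (10)
    # Step 6: the weighted ensemble of Eq. (4) is rebuilt from (Q, w).
    return window

window = deque()
```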

IV. EXPERIMENTS

In this section, we report our experimental studies to validate the claim that our ensemble method achieves better performance than existing ensemble models, such as ensembles of classifiers and ensembles of clusters.

A. Experimental settings

Benchmark Data Streams. We use two real-world data streams, which can be downloaded from the UCI Machine Learning Repository [1], as benchmark data. The Malicious URLs Detection stream contains both malicious and benign URL streams. The malicious URL stream is obtained from a large web mail provider, whose live, real-time feed supplies 6,000-7,500 examples of spam and phishing URLs per day. The benign URL stream is drawn from Yahoo's directory listing. For every incoming URL, 3,231,961 features are obtained by querying DNS, WHOIS, blacklist and geographic information servers, as well as by processing IP-address-related and lexical features. For simplicity, in our experiments we use the former 128 continuous features of the first week's data for analysis. The mining task is to detect malicious URLs, such as spam, phishing, and exploits. The Intrusion Detection stream consists of a series of TCP connection records for a local area network. Each example in this data stream corresponds to a connection, which is either a normal connection or an attack. The attack types include denial-of-service (DOS), unauthorized access from a remote machine (R2L), unauthorized access to local root privileges (U2R), and surveillance and other probing (Probing). The mining task is to correctly detect attacks online.

Benchmark Methods. For comparison purposes, we implement the following four models. (1) Ensemble classifiers. Previous ensemble classifier models on data streams can be categorized into two types: ensemble classifiers built on different data chunks using the same learning algorithm [11] (denoted as EC1), and ensemble classifiers built with different learning algorithms on the up-to-date chunk [6] (denoted as EC2). In EC1, we use Decision Tree as the basic learning algorithm. In EC2, we use a logistic regression classifier, a decision tree, and Naive Bayes as the basic algorithms. (2) Ensemble clusters. Similar to the ensemble classifiers, we implement two types of ensemble clusters: ensemble clusters built on different data chunks [8] (denoted as EU1), and ensemble clusters built from different learning algorithms (denoted as EU2). k-means is used as the basic clustering algorithm; in EU2, all clustering models are built using the k-means algorithm with different initializations. Since ensemble clusters only output fake group IDs, we map the outputs of the clustering algorithms to the best possible class prediction. All learning algorithms used here are implemented in the WEKA data mining package [12]. For simplicity, we refer to the proposed ensemble classification and clustering model on data streams as the ECU model. In ECU, we use Decision Tree as the basic classification algorithm and k-means as the basic clustering algorithm. The parameters of ECU are set as follows: α in solving Q is set to 0.618, and the largest number of iterative steps for solving Q is set to 15.

Measurements. To assess algorithm performance, we use average accuracy (Aacc), average mean square error (Amse), average ranking (AR), variance of ranking (SR), and the number of times a method ranks best (#W) and worst (#L) as our performance measures. A good prediction model should have high accuracy, a ranking order close to 1, more winning times (#W), and fewer losing times (#L). Meanwhile, if an algorithm has a smaller SR, it is more stable.
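For reference, a small sketch of how these measures can be computed from per-chunk accuracy and mean-square-error matrices is given below; the exact aggregation used in the paper is assumed to be a plain mean over chunks, and rank ties are broken arbitrarily here.

```python
# Sketch of the performance measures: average accuracy (Aacc), average mean
# square error (Amse), average ranking (AR), variance of ranking (SR), and the
# counts of best (#W) and worst (#L) ranks per chunk.
import numpy as np

def summarize(acc, mse):
    """acc, mse: arrays of shape (n_chunks, n_methods); higher acc / lower mse is better."""
    aacc = acc.mean(axis=0)
    amse = mse.mean(axis=0)
    ranks = (-acc).argsort(axis=1).argsort(axis=1) + 1     # rank 1 = best per chunk
    ar = ranks.mean(axis=0)                                # average ranking
    sr = ranks.var(axis=0)                                 # variance of ranking
    wins = (ranks == 1).sum(axis=0)                        # #W: times ranked best
    losses = (ranks == ranks.max(axis=1, keepdims=True)).sum(axis=0)  # #L: times worst
    return aacc, amse, ar, sr, wins, losses
```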
B. Comparisons with other ensemble models

In Tables I and II, we report the performance of all benchmark ensemble methods on the two real-world data streams. It is clear that ECU always achieves the best performance with respect to the given measurements. These results assert that ECU outperforms the other existing ensemble models on data streams when the streams yield both classifiers and clustering models. The main reason is that our method, by combining all classifiers and clustering models, achieves lower variance and thus better performance. Besides, when comparing the ensemble classifier models EC1 and EC2 with the ensemble clustering models EU1 and EU2, we observe that the ensemble classifier models always perform better than the ensemble cluster models. This demonstrates that although there are a large number of clustering models, we are unable to construct a satisfactory ensemble model by merely combining unlabeled clustering models. Fortunately, adding a small portion of classifiers into the ensemble of clusters, as the ECU model does, significantly improves performance.

In Figure 2, we show chunk-by-chunk comparisons among the EC1, EU1, and ECU models on the two real-world data streams (since EC2 and EU2 are inferior to EC1 and EU1, respectively, we do not include these two models in the figure). The results also validate our conclusion that the ECU model performs best compared to existing ensemble classifier and ensemble cluster models on data streams with a limited number of classifiers but a large number of clusters.

Table I. COMPARISONS ON MALICIOUS WEBSITE DETECTION

Measures   EC1      EC2      EU1      EU2      ECU
Aacc.      0.8279   0.7951   0.7790   0.7509   0.8732
Amse.      0.2036   0.2127   0.2675   0.2097   0.1292
AR.        1.7704   2.0069   2.2379   2.2397   1.6717
SR.        0.8582   0.8651   0.8096   0.9182   0.557
#W         35       20       18       9        42
#L         16       16       27       37       5

Table II. COMPARISONS ON INTRUSION DETECTION

Measures   EC1      EC2      EU1      EU2      ECU
Aacc.      0.9795   0.9723   0.9626   0.9601   0.9890
Amse.      0.0176   0.0259   0.0262   0.0320   0.0031
AR.        1.7379   1.8278   2.0320   2.5667   1.0300
SR.        0.0738   0.0847   0.1226   0.1259   0.0607
#W         31       22       17       15       46
#L         7        12       36       37       4

[Figure 2. Chunk-by-chunk accuracy comparisons on the two real-world data streams: (a) malicious URLs detection, (b) intrusion detection. The chunk size is set to 500, with 100 data chunks in total and only 20% of the examples labeled. ECU performs better than the ensemble classifier and ensemble cluster models on both data streams.]

V. CONCLUSIONS

Due to its inherent merits in handling drifting concepts and large data volumes, ensemble learning has traditionally attracted much attention in stream data mining research. Many recently proposed sophisticated data stream mining models are based on the ensemble learning framework, with the majority of ensemble models falling into the ensemble classifiers category. Nevertheless, as labeling training samples is a labor-intensive and expensive process, in a practical stream data mining scenario it is often the case that we may have very few labeled training samples (to build classifiers), but a large number of unlabeled samples available to build clusters. It would be a waste to only consider classifiers in an ensemble model, as most existing solutions do. Accordingly, in this paper we present a new ensemble learning method which relaxes the original ensemble models to accommodate both classifiers and clusters through a weighted average mechanism. In order to handle concept drifting, we also propose a new consistency-based weighting schema that weights all base models according to their consistencies with respect to the up-to-date model. Experimental results on real-world data streams demonstrate that our ensemble model outperforms existing ensemble models for mining concept drifting data streams.

ACKNOWLEDGMENT

This research was partially supported by the National Science Foundation of China (NSFC) under Grant No. 61003167, the Basic Research Program of China (973 Program) under Grant No. 2007CB311100, and the Australian Research Council's Discovery Projects funding scheme (project number DP1093762).

REFERENCES

[1] A. Asuncion, D. Newman, UCI Machine Learning Repository, Irvine, CA, 2007.
[2] C. Aggarwal, Data Streams: Models and Algorithms, Springer.
[3] A. Bifet, G. Holmes, B. Pfahringer et al., New ensemble methods for evolving data streams, in Proc. of KDD, 2009.
[4] P. Domingos, G. Hulten, Mining high-speed data streams, in Proc. of KDD, 2000.
[5] C. Domeniconi, D. Gunopulos, Incremental Support Vector Machine Construction, in Proc. of ICDM, 2001.
[6] J. Gao, W. Fan, J. Han, On appropriate assumptions to mine data streams: analysis and practice, in Proc. of ICDM, 2007.
[7] J. Gao, W. Fan, Y. Sun, J. Han, Heterogeneous source consensus learning via decision propagation and negotiation, in Proc. of KDD, 2009.
[8] P. Hore, L. Hall, D. Goldgof, A scalable framework for cluster ensembles, Pattern Recognition, Vol. 42, 2009.
[9] M. Masud, J. Gao, L. Khan et al., A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data, in Proc. of ICDM, 2008.
[10] S. Pan, Q. Yang, A Survey on Transfer Learning, IEEE Transactions on Knowledge and Data Engineering, 2009.
[11] H. Wang, W. Fan, P. Yu, J. Han, Mining concept-drifting data streams using ensemble classifiers, in Proc. of KDD, 2003.
[12] I. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.
[13] O. Chapelle, B. Schölkopf, A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, MA, 2006.
[14] A. Strehl et al., Cluster Ensembles - A Knowledge Reuse Framework for Combining Partitionings, JMLR, Vol. 3, 2002.
[15] D. Zhou, O. Bousquet, T. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, in NIPS, 2004.
[16] P. Zhang, X. Zhu, L. Guo, Mining data streams with labeled and unlabeled training examples, in Proc. of ICDM, 2009.
[17] P. Zhang, X. Zhu, Y. Shi, Categorizing and mining concept drifting data streams, in Proc. of KDD, 2008.
[18] X. Zhu, P. Zhang, X. Lin, Y. Shi, Active learning from data streams, in Proc. of ICDM, 2007.