MARLINE: Multi-Source Mapping Transfer Learning for Non-Stationary Environments

Transcription

1 MARLINE: Mult-Source Mappng Transfer Learnng for Non-Statonary Envronments Honghu Du School of Informatcs Unversty of Lecester Lecester, Unted Kngdom Leandro L. Mnku School of Computer Scence Unversty of Brmngham Brmngham, Unted Kngdom Huyu Zhou School of Informatcs Unversty of Lecester Lecester, Unted Kngdom Abstract Concept drft s a major problem n onlne learnng due to ts mpact on the predctve performance of data stream mnng systems. Recent studes have started explorng data streams from dfferent sources as a strategy to tackle concept drft n a gven target doman. These approaches make the assumpton that at least one of the source models represents a concept smlar to the target concept, whch may not hold n many real-world scenaros. In ths paper, we propose a novel approach called Mult-source mappng wth transfer LearnIng for Nonstatonary Envronments (MARLINE). MARLINE can beneft from knowledge from multple data sources n non-statonary envronments even when source and target concepts do not match. Ths s acheved by projectng the target concept to the space of each source concept, enablng multple source sub-classfers to contrbute towards the predcton of the target concept as part of an ensemble. Experments on several synthetc and real-world datasets show that MARLINE was more accurate than several state-of-the-art data stream learnng approaches. Index Terms concept drfts, non-statonary envronment, mult-sources, transfer learnng. I. INTRODUCTION The need for effcent streamng data analytcs has rapdly grown n recent years [1]. A data stream can be defned as a sequence of observatons that contnuously arrve over tme, occurrng n many applcatons, such as credt card approval, fraud detecton, and software defect predcton [2]. A key challenge n data stream learnng s that the jont probablty dstrbuton of an applcaton may change over tme,.e., there may be concept drft [3]. Learnng from data streams that may suffer from concept drfts s frequently referred to as learnng n non-statonary envronments [4], [5], whereas a gven jont probablty dstrbuton can be treated as a concept [2], [6]. Data stream learnng algorthms must be able to adapt and swftly react to concept drfts to avod poor predctons [1]. Usng nformaton learned from dfferent data sources s a feasble way to speed up the learnng of a new target concept and mprove the accuracy of the predctons. Ths can be consdered as transfer learnng [7]. However, transfer learnng has been usually used off-lne, requrng the entre tranng set to be avalable before tranng commences. Whle a few Authors n order of contrbuton. L. Mnku was supported by EPSRC Grant No. EP/R006660/2 and H. Zhou was supported by EU Horzon 2020 DOMINOES Project (grant number: ). recent studes appled transfer learnng n non-statonary data streamng envronments [5], most of the approaches presume a smlarty exstng between the source and target concepts [2], [8], [9]. Ths assumpton often fals to hold n practce. For example, the concept underlyng the predcton for bke rental demands n Washngton D.C. and London are dfferent due to dfferent weather patterns and consumer behavours n these two ctes. However, data streams of these two locatons are avalable [10], [11] and could potentally be used to mprove the predctve performance of data stream learnng approaches. Another example s software effort estmaton, where data streams descrbng software projects developed by dfferent companes may be used to mprove software effort estmaton n a gven company, despte havng dfferent underlyng dstrbutons [12]. Therefore, ths paper ams to answer the followng research questons: Can mult-source transfer learnng help us to mprove the predctve performance n non-statonary envronments where source and target data streams do not share the same concept? If so, how? To answer these questons, we hereby propose a novel approach, namely Mult-source mappng transfer LearnIng for Non-statonary Envronments (MARLINE). MARLINE s the frst approach desgned to beneft from multple source data streams even when sources and target could have consderably dfferent concepts. It acheves that by projectng the target concept to the space of each source concept through a novel mappng mechansm, enablng multple source sub-classfers to contrbute towards the predcton of the target concept as part of an ensemble. Our experments show that MARLINE can mprove the predctve performance of the exstng approaches over tme and quckly obtan good performance at the early learnng stage or after the concept drft occurs, even though there are only a few tranng examples avalable. II. RELATED WORK Several approaches have been proposed for data stream learnng n non-statonary envronments [4], [6]. Among these, approaches able to learn example-by-example (onlne) rather than chunk-by-chunk [13] are partcularly relevant to our paper. They have the potental to adapt to concept drfts faster than chunk-based approaches. Such approaches can be further

2 dvded nto actve and passve approaches [4], [13]. Actve approaches trgger the adaptaton mechansms when concept drft detecton methods alert [4], [14]. Examples of drft detecton methods nclude the Drft Detecton Method (DDM) [15] and Drft Detecton Methods based on the Hoeffdng s nequalty (HDDM) (HDDM A and HDDM W ) [16]. Passve approaches contnuously adapt to concept drft wthout relyng on explct concept drft detecton [4], [13]. A popular passve approach s Dynamc Weghted Majorty (DWM) [17]. In spte of ther learnng capacty, none of these approaches uses transfer learnng or operates n mult-source scenaros. Very few approaches have used transfer learnng n nonstatonary envronments [5]. Onlne nductve parameter transfer learnng approaches nclude Dynamc Cross-company Mapped Model Learnng (Dycom) [12], Dversty for Dealng wth Drfts (DDD) [18] and Onlne Wndow Adjustment Algorthm (OWA) [19]. Dycom treats source data offlne, whereas DDD and OWA do not use source data, transferrng knowledge only from the mmedate prevous target concept to the current target concept. A new chunk-based nductve parameter transfer approach called Dversty and Transferbased Ensemble Learnng (DTEL) was proposed n [20]. Smlar to DDD, DTEL does not consder dfferent sources but uses hstorcal target concepts. In addton, ths s a chunkbased approach. Recently, two transductve transfer learnng approaches called MultStream Classfcaton usng Relatve Densty Rato (MSCRDR) [8] and Cross-doman Multstream Classfcaton (COMC) [9] have been proposed to handle multple sources, wth the assumpton that the target stream s unlabelled and the source s labelled. However, these two algorthms requre both the source and the target to share the same task, even after a concept drft happens. Mult-source transfer learnng for non-statonary envronments (Melane) [2] s another onlne transfer learnng method that can learn from multple sources. However, Melane only benefts from source concepts that are smlar to the target. Overall, there s no exstng approach to perform transfer learnng usng multple non-statonary data sources that may have dfferent concepts from those of the target. III. PROBLEM STATEMENT Let {x, y } denote an example receved at a gven pont n tme at a data stream wth doman D = {X, p (x)} and task T = {Y, p (y x)}, where {S 1, S 2,, S n, T }, S n s the nth source data stream, T s the target data stream, x X, X s a d-dmensonal feature space, y { 1, +1} s the class label, p (x) s the margnal probablty dstrbuton and p (y x) s the posteror probablty dstrbuton. All sources and target streams may suffer concept drft. We wll enumerate the concepts p j seen n a data stream usng a sequental dentfer j. Whenever a concept drft occurs, we ncrement j. We use J to denote the number of concepts observed so far n data stream. Note that a gven tranng example, doman and task are all assocated wth the gven concept. Therefore, they are actually all ndexed by j as shown n {x j, yj }, Dj, and T j, but we wll leave ths ndex mplct. S n T M x y d J p j x j H j H K λ s TABLE I: Key Notaton and Descrpton. The data stream from source n The target data stream Set of sources and target data streams seen so far Feature vector from data stream The correspondng label of x Dmensonalty of the feature space Number of concepts from data stream observed so far The jth concept of data stream, 0 < j J The projecton of x T on the jth concept of stream The base learnng ensemble classfer traned by the jth concept of data stream Pool of base learnng ensembles for data stream Base learnng ensemble sze The kth sub-classfer of H j Sum of s probablstc predctons for target examples that t correctly (s = sc) or ncorrectly (s = sw) classfes s predctve performance on current target concept At the begnnng of the data stream or after a concept drft, due to the lack of the target data representng the new concept, the performance of the predctve models s usually poor. Therefore, the am of mult-source transfer s to mprove predctve performance n non-statonary envronments and speed up learnng especally n the begnnng of the data stream or after the occurrence of concept drfts by usng the data from multple sources. We wll nvestgate nductve transfer learnng (.e. T S T T whle D S D T or D S = D T ), as concept drfts may cause changes n T T and T S over tme. IV. PROPOSED METHOD Ths secton ntroduces MARLINE. An overvew of MAR- LINE s tranng framework s gven n Fgure 1, and ts Java mplementaton s avalable n [21]. MARLINE consders that we have multple nput data streams (represented n grey n the fgure), where the concepts of the sources and target streams may be dfferent. However, classfers learned from the source concepts may stll be used to mprove the predctve performance n the target doman. For that, each concept observed from each data stream s learnt by an ndependent onlne learnng ensemble, whch we refer to as base learnng ensemble (shown n yellow). To dentfy dfferent concepts, a concept drft detecton method s used. Whenever a new target example needs to be predcted, MARLINE uses a mappng between the current target concept and each source concept, so that the target example s geometrcally projected onto the space of the source concept (the projectons are shown n purple). Through ths mappng, the target example can be predcted by dfferent sub-classfers (shown n orange) traned by dfferent base learnng ensembles. MARLINE weghts each sub-classfer dependng on how useful t s for predctng the projected target (shown n cyan-blue). The predcton by MARLINE s based on the weghted majorty vote of the predctons of all the sub-classfers that compose all the source and target ensembles. The set of all weghted sub-classfers s referred to as the MARLINE ensemble. Secton IV-A ntroduces MARLINE s tranng procedure. Secton IV-B presents the mappng procedure for concept projectons, whch s undertaken wth respect to the centrods of

3 Fg. 1: Overvew of the proposed MARLINE tranng framework. each concept. The centrods calculaton s explaned n Secton IV-C. Secton IV-D dscusses the weghtng of sub-classfers that compose the MARLINE ensemble used for predctons. Secton IV-E explans the votng procedure used for makng predctons. Secton IV-F shows the tme complexty. A. Tranng The pseudo-code of MARLINE s tranng process s shown n Algorthm 1. The set M {S 1,, S n, T } s the set of all the sources and target for whch an onlne base learnng ensemble has already been generated. When a new example (x, y ) s receved from a source or target data stream for the frst tme, the proposed method creates one onlne base learnng ensemble H 1 for ths source or target (lnes 3 to 6). Any onlne learnng ensemble method can be used, e.g., onlne boostng or baggng [22]. Ensembles are used here because ther dversty ncreases the chances that at least some of the sub-classfers become useful for predctng the target [2]. As the concept of each data stream may change due to concept drfts, each source/target s assocated wth a pool of base learnng ensembles H, where each ensemble may represent a dfferent concept observed from that source/target. Each ensemble H j wthn the pool contans K sub-classfers, where 1 k K. The newly created ensemble H 1 s added to ts correspondng ensemble pool H (lne 7). Ths pool receves addtonal ensembles when suffers from concept drfts, as explaned later n ths secton. After havng added H 1 to H, the method dscussed n Secton IV-C s used to calculate the centrod assocated wth the concept p 1 (lne 8). Ths centrod s used to create the mappng from the target to the source concepts, as explaned n Secton IV-B. Lne 9 s used to ntalse the weghts assocated wth each sub-classfer. These weghts are used to dentfy whch sub-classfers are more mportant for predctng the projected target, and ther calculaton s explaned n Secton IV-D. Whenever a new tranng example (x, y ) s receved from stream, MARLINE runs a drft detecton method on. Any drft detecton method could be used, e.g., HDDM [16]. If the drft detecton method requres montorng a predctve model representng, the most recent ensemble H J s used. If a concept drft s detected, the proposed method creates a new onlne base learnng ensemble H J+1 wth the ntalsaton of ts weghts and connecton wth the pool of ensembles H (lnes 11 to 15). If the new example belongs to the target doman, all the values of λ sc, λ sw,, M, j J, k K for all the sub-classfers are reset (lnes 16 to 18) so that all the sub-classfers weghts can start adaptng to the new target dstrbuton. After checkng for concept drft, the most recent ensemble created for the source or target s traned on the current example (x, y ) (lne 20). The centrod of concept p J s H J updated (lne 21), as explaned n Secton IV-C. If the new tranng example belongs to the target stream, the sub-classfers weghtng scheme dscussed n Secton IV-D s appled (lne 23). The mappng procedure shown n Secton IV-B s used n the weghtng scheme. B. Mappng Procedure The man purpose of the mappng procedure s to create the projecton (x j ) of the current target example (x T ) on the source concept p j. Therefore, gven a current target example, the classfers traned wth a certan source concept can make a predcton on the projecton of ths target example based on the knowledge learned from the source concept. After a concept drft has been detected n the target data stream, any prevous concept n that stream s also regarded as a source concept. From ths pont onward, we wll use the term source+ nstead of source when the past target concepts are ncluded as sources. Therefore, and j n the source+ concept p j are defned as follows: {, j {S 1,, S n }, 0 < j J } {, j = T, 0 < j < J }. To seek the projecton (x j ) of target example (x T ) on p j, a mappng functon between p j and pj T T s requred. However, for onlne learnng we only store the latest example n memory

4 Algorthm 1 Learnng Procedure of MARLINE. Input: Data streams D, {S 1, S 2,, S n, T } Parameter: Onlne Base Learnng Ensemble Algorthm; Base Learnng Ensemble Sze K; Drft Detecton Method; Tme Forgettng Factor 0 < θ 1; Performance Index 0 σ 1 1: Set H =, J = 0, M = ; {S 1, S 2,, S n, T } 2: whle Receve a new example (x, y ), {S 1, S 2,, S n, T } do 3: f / M then 4: M M 5: J 1 (Number of onlne base learnng ensembles assocated to s 1) 6: Intalse onlne base learnng ensemble H J 7: H H H J 8: Calculate centrods c,j as shown n Secton IV-C, used to create the mappng as shown n Secton IV-B 9: λ sc h J,k 0, λ sw h J,k 0, J,k 1, k K 10: end f 11: f DrftDetecton (x, y ) = true then 12: Intalse new onlne base learnng ensemble H J+1 13: J J : H H H J 15: λ sc 0, λ sw 1, k K h J,k 16: f = T then 17: λ sc h J,k 0, λ sw 0, J,k 0, 1, M, j J, k K 18: end f 19: end f 20: OnlneBaseLearnerAlgorthm{H J, (x, y )} 21: Update centrods c,j as shown n Secton IV-C, used to create the mappng as shown n Secton IV-B 22: f = T then 23: Update each sub-classfer s weght as shown n Secton IV-D 24: end f 25: end whle over tme. To buld the mappng functon between the source+ and target concepts wthout retrevng the overall hstorcal data, we propose the followng procedure. Consder a par of reference ponts n a gven source+ concept, and another par n the target concept. For nstance, for a sngle concept p j, the par of reference ponts can be the centrods of the dstrbutons p j (x y), y [ 1, +1],.e.: c y,j = [c1, c 2,, c d ]; y [ 1, +1] (1) The calculaton of c y,j s shown n Secton IV-C. We connect the pars of reference ponts of a gven concept usng a vector V,j = c y=1,j c y= 1,j. The transformaton matrx R between V,j and V T,JT can be consdered as the mappng functon of any two vectors between p j and pj T T : V,j = R V T,JT (2) Therefore, the source+ vector V,j can be seen as a projecton of the target vector V T,JT on p j. The transformaton matrx R can be calculated by Eqs. (3)- (5), where I d s the dentty matrx wth d dmensons V,j u = V -1, V v = T,JT,j V -1 (3) T,JT A = I d ( u + ( u + v ) -1 I d v ) 2 ( u + v ) -1 ( u + (4) v ) R = (A v -1 A v 2 v -1 v ) V,j (5) V T,JT When a new example (x T ) from the target doman s receved, the vector between the new example and one centrod (c y=1 T,J T ) of the target concept can be computed as follows: V T = x T c y=1 T,J T (6) The projecton of V T on p j can be calculated as follows: V I j = R V T (7) The projecton (x j ) s the sum of cy=1,j and V I j : x j = c y=1,j + V I j (8) C. Calculatng and Updatng the Centrods For savng computatonal tme, we update the centrods of each concept nstead of updatng transformaton matrx R at each tme step. The transformaton matrx R wll be calculated whenever a target example needs to be predcted. The centrods of the concept p j (x, y) are dynamcally updated based on examples (x, y ) receved from data stream durng the tme wndow snce the concept j has become actve. Durng ths tme wndow, f no example wth label y has been seen before, the centrod c y=y,j s set as x. Otherwse, t s updated as follows: sumc y=y,j c y=y = θsumc y=y,j + x (9),J = sumcy=y,j L t =1 θ(t 1) (10) where L s the number of the tranng examples receved n the data stream snce the concept p j has become actve, and θ, 0 < θ 1, s a pre-defned forgettng factor used to reduce the weght gven to the hstorcal examples. It helps to deal wth non-statonary envronments. The summaton n the denomnator s a normalsaton factor, whch can be updated n an onlne manner. D. Sub-Classfers Weghtng When an example (x T, y T ) from the target doman s receved, all the sub-classfers weghts ω h are updated. We assgn larger weghts to the sub-classfers whch focus on harder classfed examples. The sub-classfer s weght ω h depends on the correspondng sub-classfer s performance

5 on the current target concept p J T T. All the sub-classfers performances, M, 0 < j J, 0 < k K are updated based on the correspondng projecton (x j, y T ) of the current target example (x T, y T ), whch has been obtaned by the mappng procedure explaned n Secton IV-B. When = T, j = J T, the projecton (x j, y T ) s set to the current target example (x T, y T ). The sub-classfers performance s ntalsed wth = 1, M, 0 < j J, 0 < k K. To update each subclassfer s performance, the current example s weght needs to be calculated. We get the predcton of each sub-classfer on the correspondng target example projecton. The example s weght SW SC can be calculated as follows: SC J K P ( (x j ) = y T ) (11) SW M j=1 k=1 J K M j=1 k=1 P ( (x j ) y T ) (12) where P (h(x) = y) s the probablty of class y, estmated by the sub-classfer h(x), and s the performance computed based on the target examples receved before recevng (x T, y T ). The example s weght SW SC s used to ndcate how confdent the MARLINE ensemble s for the current example. We expect that the more confdent the MARLINE ensemble s, the smaller weght the example receves, and vce versa. Consder that λ sc s the sum of the probablstc predctons of sub-classfer for the target examples that t correctly classfes and λ sw s the sum of the probablstc predctons for msclassfed target examples. The performance of each sub-classfer can be computed ncrementally by: λ sc λ sw θλ sc + SW SC θλ sw + SW SC λ sc P ( (x j ) = y T ) SC P ( λ sc SW + λ sw (x j ) y T ) where 0 < θ 1 s a forgettng factor and α (13) (14), (15) P ( (x j )=y T ) represents how much contrbuton sub-classfer makes n the MARLINE ensemble to vote for the projecton (x j ) of (x T ) wth label y T. Thus, s the current performance percentage of each sub-classfer, gvng more focus to more recently arrvng target examples. The weghts of all the sub-classfers assocated wth > σ are assgned to ther predctve performance α h normalsed by the sum of the predctve performances of all the sub-classfers assocated wth can be formulated as follows: ω h = 1 J K M j =1 k=1 ( j,k >σ? α j,k h SC > σ. The weght ω h α :0) h, > σ 0, otherwse (16) where σ s a pre-defned parameter, (testcondton? v1 : v2) retreves v1 f testcondton s true, and v2 otherwse. E. Votng Procedure for Makng Predctons When a predcton s needed for a target nstance (x T ), we multply the correspondng weghts of the sub-classfers wth the probablstc predcton made by each sub-classfer on ther correspondng projecton (x j ) of the current target example. All sub-classfers, M, j {1,, J }, k {1,, K} are consdered for ths purpose. Afterwards, we obtan the sum of the weghted predcton probabltes of all classes and use majorty vote to decde the predcted class. F. Tme Complexty Analyss When learnng a target tranng example, MARLINE s tranng tme complexty s O(f DD +f H +(J S1 +J S2 + +J Sn + J T )d 2 +(J S1 +J S2 + +J Sn +J T )K f h ), where f H, f DD and f h are the tme complextes for tranng the base learnng ensemble wth the example, runnng the drft detecton method and gettng the predcton from a sub-classfer. When learnng a source tranng example, MARLINE s tranng tme complexty s O(f DD + f H + d). MARLINE s tme complexty for predcton s O((J S1 + J S2 + + J Sn + J T )d 2 + (J S1 + J S2 + + J Sn )K f h Detals on the complexty estmaton can be found n the Supplementary Materal of ths paper [23]. V. EXPERIMENTS SETUP We evaluate MARLINE under several dfferent condtons, ncludng statonary envronments, non-statonary envronments wth dfferent types of concept drfts, and dfferent target data stream szes. Artfcal datasets enable us to better understand when and how MARLINE can be helpful. Real world datasets enable us to check whether MARLINE can work well n practce. A. Datasets We use the same three artfcal datasets of smlar target and sources as those of [2] and generated addtonal datasets where the source was non-smlar to the targets. These datasets have two numerc features and one bnary output. The examples belongng to each output class were generated by a Gaussan dstrbuton as shown n Table II, where each dataset s composed of several (target and sources) data streams. The three datasets wth smlar sources use only the target and smlar source data streams from Table II, whereas the three datasets wth non-smlar source use only the target and nonsmlar source data stream from Table II. The datasets smulate a statonary envronment (no drft) and two types of nonstatonary envronments (abrupt drft and ncremental drft) on the target stream. Each dataset also has three dfferent versons based on dfferent class sze scenaros (small, medum, large) by varyng the number of target tranng examples of each class n 50, 500, Each source stream has 5000 tranng examples of each class wthout any concept drft. The real-world datasets are acqured from London bke sharng dataset [10] and Bke Sharng n Washngton D.C.

6 dataset [11]. The task s to classfy whether rental bkes are n low or hgh demand. We use the medan of the total count of the rental bkes of a gven dataset to ndcate low and hgh demands n ths dataset. We select the features shared by the two datasets (actual temperature, feelng temperature, humdty and wnd speeds) to unfy the feature dmenson. Each dataset s dvded nto three sub-datasets based on holday, weekend and weekday and compose the followng three scenaros: We choose weekdays from Washngton D.C. as the source, and (1) holdays and (2) weekends n London are the targets. We make weekends n Washngton D.C. as the source and (3) weekdays n London are the target. The am of these three dfferent target sub-datasets s to create small, medum and large stream szes (Holday: 384, Weekend: 4970, Weekday: 12060). B. Benchmark Methods and Evaluaton Measures MARLINE was compared aganst Melane [2], Adaptve Random Forest [14], Dynamc Weghted Majorty (DWM) [17], Onlne Baggng [22], Onlne Boostng [22], Onlne Baggng wth Drft Detecton and Onlne Boostng wth Drft Detecton. Melane was chosen for the comparson because t s the state-of-the-art mult-source transfer learnng approach for non-statonary data streams. As MSCRDR [8] and COMC [9], Melane s only able beneft from source concepts when they share the same task as the target concept. However, dfferent from these approaches, Melane has the advantage of beng able to detect when source tasks are dssmlar to the target, avodng to hnder predctve performance on the target when that s the case. Comparng MARLINE aganst Melane wll reveal whether MARLINE s mappng functon s helpful to mprove predctve performance aganst an approach that s only able to beneft from source concepts when they are smlar to the target concept. Adaptve Random Forest and DWM were chosen because they are popular data stream learnng approaches avalable n the Massve Onlne Analyss (MOA) tool [24]. Comparng aganst them shows whether MARLINE can outperform the popular approaches. Onlne baggng and onlne boostng [22] were ncluded as baselne ensemble approaches wth no strat- TABLE II: The parameters of the Gaussans of Artfcal Datasets. Datasets Doman Type Data Stream Class 0 Centre Class 1 Centre Target Target (2,3) (7,8) No Drft Datasets Smlar Source (2,1) (7,8) Non-Smlar Source (-2,-3) (-7,2) Before Drft (2,3) (7,8) Target Target After Drft (2,9) (5,4) Abrupt Drft Datasets Smlar Source (2,9) (5,4) Non-Smlar Source (-2,-3) (-7,2) Target Target Before Drft (2,3) (7,8) After Drft (2,3) (7,8) Source 1 (2,3) (7,8) Source 2 (3,4) (6,7) Incremental Drft Datasets Source 3 (4,5) (5,6) Smlar Source 4 (5,6) (4,5) Source 5 (6,7) (3,4) Source 6 (7,8) (2,3) Non-Smlar Source (-2,-3) (-7,2) ( ) 1 0 The covarance matrx for each class of each doman s except for 0 ( 2 ) 2 0 target n the no drft and non-smlar datasets, whch uses for both 0 2 classes. For the ncremental drft, the centres of the Gaussan of class 0 and 1 move towards each other by one unt at each 100, 1000 and tme steps for the datasets wth class szes of 50, 500, 5000, respectvely, untl the Gaussans of class 0 and 1 swap locaton. Ths leads to ntermedate concepts that are equvalent to each of the smlar sources 1 to 6. egy to deal wth concept drft. They provde a desred lower bound for the predctve performance acheved by MARLINE and any other approaches for non-statonary envronments. Addtonally, they were also appled n combnaton wth a drft detecton method, whch resets the models upon drft alarm, to enable these approaches to cope wth drfts. Comparng aganst the combnaton shows whether or not MARLINE s able to beneft from sources n general. MARLINE wthout source data streams was also used n the comparson because mappng s also performed between the old target concepts and the current target concept. Includng ths approach shows whether or not t s benefcal to use dfferent sources especally for the ntal learnng stage, rather than only mappng between old and new target concepts. Both MARLINE and Melane have been nvestgated wth onlne baggng and onlne boostng as base learnng ensemble methods [22]. All ensemble approaches used Hoeffdng trees [25] as the basc unts of learnng, except for one of the compared approaches (ARF), whch s based on ARFHoeffdng Tree [14]. Two drft detecton methods (DDM [15] and HDDM A [16]) were used for all the approaches that requre drft detecton. DDM s a well known method. HDDM A has been recently shown to perform well compared to other drft detecton methods when confgurng ensembles [26]. Thrty runs were performed for all the compared approaches, except for DWM [17], whch s determnstc and requres a sngle run. The average accuracy across the 30 runs s reported. Grd search was used to tune the hyperparameters of each approach on each dataset based on a prelmnary run. For Onlne Baggng and Boostng, the szes of the sub-classfers vared n 1:1:30. For DWM, β vared n 0:0.1:1, perod p = 1, and the weght threshold for removng sub-classfers was For ARF, the number of trees vared n 10:1:30 (MOA restrcts the mnmum ARF ensemble sze to 10). For Melane, the forgettng factor vared n 0.9 : 0.01 : 1. For MARLINE, θ = 0.9 : 0.01 : 1 and σ = 0.1 : 0.1 : 1. The grd searches results are n the Supplementary Materal of ths paper [23]. For the artfcal data streams, the predctve performance s calculated prequentally and reset upon the real locaton of the drfts [18]. In the artfcal datasets, we know exactly when the concept drfts happen. Ths evaluaton framework wll reset the accuracy to zero when the concept drfts (or the ncrements of an ncremental drft) occur. Ths enables us to measure the performance on each concept separately wthout beng affected by the prevous concepts. For real world data streams, the predctve performance of all the approaches s evaluated usng sldng wndows [27] wth the sze of 10% of the target data stream. Fredman and ther Nemeny post-hoc tests were used to compare the predctve performance of all approaches, on each dataset. VI. EXPERIMENT RESULTS A. Comparson on Artfcal Datasets 1) Experments wth Non-Smlar Source: The experment ams to nvestgate whether or not the use of very dfferent

7 Dataset TABLE III: Fredman Ranks on Each Dataset. Non-Smlar Source Smlar Source Real-World Data No Drft Abrupt Incremental No Drft Abrupt Incremental Holday Weekend Weekday Class Sze or Target Stream Sze Marlne(DDM(Onlne Baggng)) wth source Marlne(DDM(Onlne Boostng)) wth source Marlne(HDDMA(Onlne Baggng)) wth source Marlne(HDDMA(Onlne Boostng)) wth source Marlne(DDM(Onlne Baggng)) wthout source Marlne(DDM(Onlne Boostng)) wthout source Marlne(HDDMA(Onlne Baggng)) wthout source Marlne(HDDMA(Onlne Boostng)) wthout source Melane(DDM(Onlne Baggng)) wth source Melane(DDM(Onlne Boostng)) wth source Melane(HDDMA(Onlne Baggng)) wth source Melane(HDDMA(Onlne Boostng)) wth source Melane(DDM(Onlne Baggng)) wthout source Melane(DDM(Onlne Boostng)) wthout source Melane(HDDMA(Onlne Baggng)) wthout source Melane(HDDMA(Onlne Boostng)) wthout source DDM(Onlne Baggng) DDM(Onlne Boostng) HDDMA(Onlne Baggng) HDDMA(Onlne Boostng) Onlne Baggng Onlne Boostng Adaptve Random Forest(DDM) Adaptve Random Forest(HDDMA) Dynamc Weghted Majorty Fredman s p-values were always < The best approach has ts rankng n red wth grey background and the approaches not sgnfcantly dfferent from t accordng to the Nemeny test are n bold wth grey background. Mean accuracy and standard devatons are n the Supplementary Materal [23]. (a) No Drft; class sze of 50 (b) No Drft; class sze of 5000 (a) Holday (b) Weekend (c) Abrupt; class sze of 50 (d) Abrupt; class sze of 5000 (e) Incremental; class sze of 50 (f) Incremental; class sze of 5000 Fg. 2: Average Accuracy wth Non-Smlar Source. concept sources by MARLINE can help us to mprove the predctve performance. The Fredman rankng of the approaches on each dataset s shown n Table III. We can see that MARLINE wth source s amongst the best performers under dfferent amounts of the tranng data and dfferent types of drfts, as shown by the table cells hghlghted n grey, except the ncremental drfts wth the class sze of 500. MARLINE wthout source s sometmes amongst the best, demonstratng that mappng the new target concept to the space of the hstorcal target concepts s also benefcal. Fgure 2 shows (c) Weekday (d) Weekday Fg. 3: Accuracy on Real World Datasets. some representatve results across tme. Other fgures were omtted due to space restrctons. 2) Experments wth Smlar Source: Melane was desgned to transfer knowledge wth smlar sources and target concepts, beng thus expected to acheve the best performance for these data streams. Based on Fredman and Nemeny tests shown n Table III, Melane outperforms the other approaches n most scenaros. However, MARLINE wth source also acheves compettve results, smlar to those of Melane. In some cases, the performance of MARLINE wth source s better than that of Melane, e.g., MARLINE(HDDM A (Onlne Boostng)) wth source and Melane (HDDM A (Onlne Boostng)) wth source wth the class sze of 500 on the abrupt drft dataset. B. Comparson on Real World Datasets London and Washngton D.C. bke sharng data were collected from dfferent sources, so ther nput and output spaces are qute dfferent (see Table IV). Therefore, the concepts (both n terms of domans and tasks) of the two datasets are supposed to be very dfferent. From Table III,

8 TABLE IV: Input Space and Rental Count for Real World Dataset. Feature AT FT HD WS RC London 1.5 : 34 6 : : : : 7860 Washngton D.C 0.02 : 1 0 : 1 0 : 1 0 : : 977 AT: Actual Temperature; FT Feelng Temperature; HD: Humdty; WS: Wnd Speed; RC: Rental Count. MARLINE(HDDM A (Onlne Baggng)) wth source has the best performance on all the real world datasets. MARLINE wthout source s the 2nd best. From Fgure 3, we can see that the accuracy of MARLINE wth source s qute smlar to that of MARLINE wthout source. Ths may be due to the adaptve mechansm of MARLINE. It s worth notng that when the concept of the stream was easer to learn (as for the artfcal datasets), then MARLINE was most helpful n the begnnng of the stream and rght after drfts. Ths s because, wth tme ncreasng, every method can learn the concept well (Fgure 2). However, when the concept was more complex (as n the real world datasets), then MARLINE provded great help throughout tme (Fgure 3). Addtonal related analyses are n the Supplementary Materal [23]. C. Contrbuton of Source+ Sub-classfers To further nvestgate the mportance of ndvdual subclassfers n MARLINE, we select two datasets (Abrupt wth non-smlar source and class sze of 5000; and Weekday) and plot the average weght ratos of all the source+ sub-classfers over 30 runs n Fgure 4. The total weght rato of the source+ sub-classfers s calculated as: W eghtrato = J K M, T j=1 k=1 ω h J T 1 + j=1 K k=1 ω h T Ths s the sum of the total weghts assgned to the source and hstorcal target sub-classfers n the MARLINE ensemble. From Fgure 4a, before the concept drft occurs, the mean total weght of the source sub-classfers durng ths perod s 25.86%. After concept drft, due to past target subclassfers jonng the MARLINE ensemble, the mportance of the source+ sub-classfers ncreases and the mean total weght of the source+ sub-classfers s 48.24%. The average total weght s even larger for the real world dataset (see Fgure 4b). From Fgure 4b, we notce that the source+ classfers can sgnfcantly contrbute towards the predctons throughout tme (the mean of the total weghts over the whole data stream s 94.88%). Ths may be due to the fact that the real world dataset poses more challenges to the target sub-classfers, whch struggle to mantan ther performance on the artfcal datasets. We also notce some spkes and sudden drops n the total weghts over tme. Ths suggests that the weghtng mechansm s affected by nose. In our future work, we wll nvestgate whether or not other weghtng mechansms can mprove the predctve performance of MARLINE further. VII. SENSITIVITY ANALYSIS As MARLINE has a few hyperparameters, t s mportant to conduct a study to understand (Q1) how dfferent MARLINE (a) Non-smlar; Abrupt; Sze 5000 (b) Weekday Fg. 4: Sources+ Sub-classfers Average Total Weght (Over 30 Runs) by MARLINE (HDDM A(Onlne Baggng)) wth source. ensemble compostons (dfferent types of base learnng ensemble wth dfferent szes K and performance ndex σ) affect the predctve performance, and (Q2) the nfluence of dfferent drft detecton methods and forgettng factors θ used to handle dfferent types of concept drft. Secton VII-B1 answers these questons based on artfcal datasets. The real world datasets are also used to support the analyss n Secton VII-B2. A. Expermental Desgn To nvestgate (Q1) and (Q2), Analyss of Varance (ANOVA) [28] was performed to analyse the nfluence of each hyperparameter as well as ts nteractons wth others on the average prequental accuracy. The step-wse changes of each hyperparameter are defned to cover the range of the best hyperparameter values selected by the grd search for dfferent datasets n Secton V. For (Q1), the followng factors are nvestgated: base learner wth two levels (BLM: Onlne Baggng and Boostng), base learnng ensemble sze K {10, 20, 30} and performance ndex σ {0.0, 0.2, 0.4, 0.6}. As these are all subject-based factors, a Repeated Measures ANOVA desgn s used. For (Q2), the followng factors are nvestgated: drft detecton method (DD: DDM and HDDM A ), forgettng factor θ {0.9, 0.92, 0.94, 0.96, 0.98, 1} and drft type (DT: No Drft, Abrupt, Incremental). The last s only consdered when we apply the artfcal datasets. As the frst two factors are wthnsubject factors and the last s a between-subjects factor, a splt plot (mxed) ANOVA desgn s adopted for the artfcal datasets and a Repeated Measures ANOVA desgn s adopted for the real world datasets. Thrty runs for each combnaton of the factors are carred out on each dataset. Mauchly s sphercty test [29] s used wth a level of sgnfcance of 0.05 to evaluate whether or not the sphercty assumpton s volated. If volated, the ANOVA s p-values are corrected to take that nto account. If the epslon estmate s below 0.75, the Greenhouse Gesser correcton [30] s adopted to correct the degrees of freedom of the F-dstrbuton. Otherwse, the Huynh Feldt correcton [31] s adopted to make t less conservatve [32]. B. Results Table V and VI present the ANOVA results for the artfcal and real world datasets, respectvely. The Sum of squares (SS), degrees of freedom (DF), mean squares (MS), test F statstcs (F) and partal eta-squared (η 2 p) are reported.

9 TABLE V: ANOVA Results for Artfcal Datasets. Factor/Int. SS DF MS F ηp 2 Test of Wthn-Subjects Effects (Q1) σ K K * σ BLM * K BLM BLM * K * σ BLM * σ Test of Wthn-Subjects Effects (Q2) DD * DT θ θ * DT DD * θ * DT DD DD * θ Test of Between-Subjects Effects (Q2) DT BLM: Base Learner Method; DD: Drft Detector Method; DT: Drft Type. The p-value s always less than 0.001, except for DD * θ, whch s (a) Effect of K*σ wth Onlne Baggng (c) Effect of θ*dt wth DDM (b) Effect of K*σ wth Onlne Boostng (d) Effect of θ*dt wth HDDM A Fg. 5: Plots of margnal means on Artfcal Datasets. 1) Analyss Usng Artfcal Datasets: As t can be observed from Table V, performance ndex σ, base learnng ensemble sze K and nteracton K σ have a large mpact (η 2 p 0.131), whereas the other factors and nteractons have a small mpact (η 2 p 0.017). Therefore, the MARLINE ensemble composton factors σ and K have more nfluence on the accuracy. Fgures 5a and 5b llustrate the mpact of factors σ, K, base ensemble learner method and ther nteracton. The two plots are farly smlar to each other, confrmng that the nteracton BLM * K * σ has a small mpact. A large σ = 0.06 s detrmental to the accuracy wth worse accuracy obtaned especally wth a smaller ensemble sze K. Ths s because ths performance ndex s dffcult to be reached by the subclassfers. Therefore, most sub-classfers wll have weght zero n the MARLINE ensemble, effectvely decreasng ts sze and dversty. When σ 0.4, dfferent ensemble szes K and σ values lead to smlar accuracy, where a smaller ensemble sze, e.g. K = 10, leads to slghtly worse accuracy and σ = 0.4 leads to slghtly better accuracy when we use Onlne Baggng. Table V also shows that the forgettng factor θ, the nteracton between the drft detectng method and the drft type DD * DT and the nteracton θ * DT have a medum mpact (0.071 η 2 p 0.076). So, the drft detecton method and θ play mportant and probably dverse roles when handlng dfferent types of concept drfts. Fgures 5c and 5d llustrate the effect of the factors θ, DD and DT and ther nteracton. The two plots show farly smlar patterns, verfyng that the nteracton DD * θ * DT has a small mpact. When the drft appears abruptly, ndependent of the drft detecton method (DT), θ = 0.9 result n the best accuracy. As θ ncreases, the accuracy slghtly decreases. When the drft type s ncremental, the accuracy has larger drops wth θ 0.96, compared wth the abrupt concept drft. As the concept drfts n the ncremental datasets are more dffcult to be detected than the concept drfts n the abrupt datasets, t s reasonable that θ wll take more responsbltes to cope wth concept drfts when drft detecton does not perform well. When the dataset has no drft, there s no sgnfcant dfference between the accuracy obtaned by dfferent drft detecton methods, whch we confrmed by addtonal pared T tests wth Bonferron correctons. Furthermore, the accuracy changes very slghtly when we change θ. Therefore, we summarse that: Q1: Large performance ndex σ and small ensemble szes K can be detrmental to the predctve performance of MARLINE, whereas n general σ = 0.4 assocated wth K 20 led to better results. Q2: If there are concept drfts n the data steam, when the drft detecton method cannot detect the concept drfts accurately, a small value for the forgettng factor (0.9 θ 0.94) normally can help MARLINE to ncrease predctve performance on handlng concept drfts. When there s no concept drft, a small value for the forgettng factor wll not hurt the performance ether. 2) Analyss Usng Real World Datasets: Table VI shows the tests of wthn-subjects performed on the real world datasets. The plots of margnal means are shown n Fgure 6. The results are n general smlar to the results on artfcal datasets, though certan effects and dfferences n the magntude of the predctve performance were larger. We can see that σ, K and nteracton σ K have large effect sze (η 2 p 0.161). Meanwhle, θ has a very large effect sze (η 2 p = 0.495) and the choce of drft detecton method also has a large effect sze (η 2 p = 0.111). Ths could be because n the real world datasets, the concept drfts are a mx of dfferent types of concept drft, whch makes them more dffcult to be detected. Therefore, MARLINE reles more on θ to cope wth the concept drfts. Fgures 6a and 6b show smlar trends to the artfcal datasets. However, when σ 0.4, the mprovement n accuracy wth a greater σ s more sgnfcant, confrmng that both the sze and the qualty (performance ndex) of the subclassfers are mportant. Fgure 6c also shows that HDDM A performs better on the real world datasets, n lne wth the experments shown n Secton V. Also, we fnd that smaller θ values beneft the accuracy more.

10 TABLE VI: ANOVA Results for Real World Datasets. Factor/Int. SS DF MS F ηp 2 Test of Wthn-Subjects Effects (Q1) σ K * σ K BLM BLM * σ BLM * K * σ BLM * K Test of Wthn-Subjects Effects (Q2) θ DD DD * θ BLM: Base Learner Method; DD: Drft Detector Method; DT: Drft Type. The p-value s always less than (a) Effect of K*σ wth Onlne Baggng (c) Effect of DD*θ (b) Effect of K*σ wth Onlne Boostng Fg. 6: Plots of margnal means on Real World Datasets. VIII. CONCLUSION In ths paper, we focus on a general and challengng problem learnng from very dfferent concepts n data stream mnng. By mappng the target concept to the space of each source+ concept, the sub-classfers that closely match the part of the projecton of the target concept are gven hgher weghts n the MARLINE ensemble, beng able to acheve better performance n non-statonary envronments. We carred out extensve experments and the results demonstrate that our proposed MARLINE s effectve. A senstvty analyss s also presented. Future work ncludes the nvestgaton of strateges to reduce the sze of MARLINE s classfer pool; nvestgaton of dfferent weghtng schemes to further mprove accuracy; analyss of the computatonal tme taken to run the approach, complementng ts tme complexty analyss; experments wth more data streams, base learners and drft detecton methods; and an nvestgaton of senstvty to nose. REFERENCES [1] J. Lu, A. Lu, F. Dong, F. Gu, J. Gama, and G. Zhang, Learnng under concept drft: A revew, IEEE TKDE, [2] H. Du, L. L. Mnku, and H. Zhou, Mult-source transfer learnng for non-statonary envronments, n IJCNN, 2019, pp [3] A. C. Pocock, P. Yapans, J. Snger, M. Luján, and G. Brown, Onlne non-statonary boostng. n MCS, 2010, pp [4] G. Dtzler, M. Rover, C. Alpp, and R. Polkar, Learnng n nonstatonary envronments: A survey, IEEE CIM, vol. 10, no. 4, pp , [5] L. L. Mnku, Transfer learnng n non-statonary envronments, n Learnng from Data Streams n Evolvng Envronments, 2019, pp [6] J. Gama, I. Žlobatė, A. Bfet, M. Pechenzky, and A. Bouchacha, A survey on concept drft adaptaton, ACM CSUR, vol. 46, no. 4, p. 44, [7] S. J. Pan and Q. Yang, A survey on transfer learnng, IEEE TKDE, vol. 22, no. 10, pp , [8] B. Dong, Y. Gao, S. Chandra, and L. Khan, Multstream classfcaton wth relatve densty rato estmaton, n AAAI, vol. 33, 2019, pp [9] H. Tao, Z. Wang, Y. L, M. Zaman, and L. Khan, Comc: A framework for onlne cross-doman multstream classfcaton, n IJCNN, 2019, pp [10] London bke sharng dataset, london-bke-sharng-dataset, [11] H. Fanaee-T and J. Gama, Event labelng combnng ensemble detectors and background knowledge, PRAI, vol. 2, no. 2-3, pp , [12] L. L. Mnku and X. Yao, How to make best use of cross-company data n software effort estmaton? n ICSE, 2014, pp [13] B. Krawczyk, L. L. Mnku, J. Gama, J. Stefanowsk, and M. Woźnak, Ensemble learnng for data stream analyss: A survey, Informaton Fuson, vol. 37, pp , [14] H. M. Gomes, A. Bfet, J. Read, J. P. Barddal, F. Enembreck, B. Pfharnger, G. Holmes, and T. Abdessalem, Adaptve random forests for evolvng data stream classfcaton, Machne Learnng, vol. 106, no. 9-10, pp , [15] J. Gama, P. Medas, G. Castllo, and P. Rodrgues, Learnng wth drft detecton, n SBIA, 2004, pp [16] I. Frías-Blanco, J. del Campo-Ávla, G. Ramos-Jmenez, R. Morales- Bueno, A. Ortz-Díaz, and Y. Caballero-Mota, Onlne and nonparametrc drft detecton methods based on hoeffdng s bounds, IEEE TKDE, vol. 27, no. 3, pp , [17] J. Z. Kolter and M. A. Maloof, Dynamc weghted majorty: An ensemble method for drftng concepts, JMLR, vol. 8, no. Dec, pp , [18] L. L. Mnku and X. Yao, DDD: a new ensemble approach for dealng wth concept drft, IEEE TKDE, vol. 24, no. 4, pp , [19] P. Zhao, S. C. Ho, J. Wang, and B. L, Onlne transfer learnng, Artfcal Intellgence, vol. 216, pp , [20] Y. Sun, K. Tang, Z. Zhu, and X. Yao, Concept drft adaptaton by explotng hstorcal knowledge, IEEE TNNLS, [21] H. Du, Code repostory for MARLINE: Mult-source mappng transferlearnng for non-statonary envronments, MARLINE, [22] N. C. Oza, Onlne baggng and boostng, n SMC, vol. 3, 2005, pp [23] H. Du, L. L. Mnku, and H. Zhou, Supplementary materal for MARLINE: Mult-source mappng transferlearnng for non-statonary envronments, [24] A. Bfet, G. Holmes, R. Krkby, and B. Pfahrnger, Moa: Massve onlne analyss, JMLR, vol. 11, no. May, pp , [25] P. Domngos and G. Hulten, Mnng hgh-speed data streams, n KDD, 2000, pp [26] R. S. M. de Barros and S. G. T. de Carvalho Santos, An overvew and comprehensve comparson of ensembles for concept drft, Informaton Fuson, vol. 52, pp , [27] J. Gama, R. Sebastão, and P. P. Rodrgues, Issues n evaluaton of stream learnng algorthms, n KDD, 2009, pp [28] D. C. Montgomery, Desgn and analyss of experments. John wley & sons, [29] J. W. Mauchly, Sgnfcance test for sphercty of a normal n-varate dstrbuton, The Annals of Mathematcal Statstcs, vol. 11, no. 2, pp , [30] S. W. Greenhouse and S. Gesser, On methods n the analyss of profle data, Psychometrka, vol. 24, no. 2, pp , [31] H. Huynh and L. S. Feldt, Estmaton of the box correcton for degrees of freedom from sample data n randomzed block and spltplot desgns, Journal of educatonal statstcs, vol. 1, no. 1, pp , [32] J. Verma, Repeated measures desgn for emprcal researchers. John Wley & Sons, 2015.