An Adaptive Regression Tree for Non-stationary Data Streams

ABSTRACT
Data streams are endless flows of data produced at high speed, in large volume, and usually in non-stationary environments. These characteristics make them a fitting model for a significant part of data mining applications. Learning from non-stationary streaming data has received much attention in recent years. The main property of these streams is the occurrence of concept drifts. Existing methods handle concept drifts by updating their models either on a regular basis or upon detecting them. Most successful methods use ensemble algorithms to deal with the many concepts found in the stream; however, they usually lack acceptable running time. Using decision trees has been shown to be a powerful approach for accurate and fast learning of data streams. In this paper, we present an incremental regression tree that can predict the target variable of newly arriving instances. The labels of the instances are first predicted by the regression tree. Then, the true labels are revealed to the algorithm. The labeled instances are used to incrementally construct the tree. The tree is updated when concept drifts occur, either by altering its structure or by updating its embedded models. Experimental results show the effectiveness of our algorithm in both speed and accuracy in comparison to the best state-of-the-art methods.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning - concept learning, regression trees, data streams, non-stationary environments; I.5.2 [Pattern Recognition]: Design Methodologies - classifier design and evaluation

General Terms
Algorithms, Design, Performance.

Keywords
regression tree, model tree, data stream, concept drift, non-stationary environment, drift detection

1. INTRODUCTION
One very important class of regression models is regression trees.
Their significance stems from the fact that, in practice, a single regression model cannot address many regression problems; hence, regression tree algorithms, which work by recursively partitioning the data and fitting local models in the leaves, can be an adequate solution. As with decision trees, regression trees are generally built top-down by partitioning a feature space; the main difference is that the dependent variable to be predicted is continuous [1]. From another perspective, there are many situations in which incremental learning is required instead of a batch processing technique. This is because many sources produce continuous flows of data in the form of streams. As a convenient example, consider an agent operating in a real-time environment that may need to constantly process the latest sensor information to determine its next action [2]. Data streams are distinguished by three major characteristics: 1) they are open-ended, 2) they flow at high speed, and 3) they are generated from non-stationary distributions, introducing drift into the target function. Given these features, online learning algorithms that address data streams should have certain capabilities: keeping up with the rate at which data arrive, using a single pass over the data and fixed memory, maintaining a prediction model at any time, and being able to keep the model consistent with the most recent data. Traditional estimates of prediction error are mostly based on the assumption that training instances are observed from the Independent and Identically Distributed (I.I.D.) domain data [3]. It is important to note that the I.I.D. condition is not valid in streams in which concept drift occurs, but it is rational to assume that small batches of data satisfy the I.I.D. condition [4].
In this paper we present an algorithm named Adaptive Regression Tree (ART) that focuses on incrementally building a fast and accurate regression tree based on recently seen instances in a timely manner. As its name suggests, the most important advantage of ART is its adaptability, that is, the ability to detect concept drifts and adapt the tree structure effectively to the new concept. In fact, upon detecting concept drift in a portion of the tree, the associated subtree is updated rather than discarded. This preserves the speed and accuracy of the algorithm. The rest of the paper is organized as follows: In section 2, related work is discussed. The proposed method is presented in section 3. Section 4 evaluates the method and compares it with some previous algorithms. Finally, section 5 concludes the paper and suggests some future research directions.

2. RELATED WORK
Instances of data streams fall into two categories: stationary and time-changing. Many approaches are used in the literature for processing open-ended but stationary streams in dynamic environments. Some methods use decision trees for learning evolving data streams. Examples of efforts in this field include [5-7]. In [7] one such algorithm, the Very Fast Decision Tree (VFDT), is proposed. VFDT is a decision tree learning algorithm that dynamically adjusts its bias as new instances arrive. When an instance is available, it traverses the tree, following the branch corresponding to the attribute's value in the instance. When it reaches a leaf, the sufficient statistics are updated. Then, each possible condition based on attribute values is evaluated. If there is enough statistical support in favor of one test over the others, the leaf is changed to a decision node. The new decision node will have as many descendant leaves as the number of possible values for the chosen attribute (therefore this tree is not necessarily binary). The decision nodes only maintain the information about the split-test installed within them. To the best of our knowledge, however, there are few methods addressing the problem of incremental learning using model trees [8]. Time-changing data streams, on the other hand, have attracted less interest. Two serious obstacles in this realm are the detection of drifts and the development of effective algorithms to cope with the unique characteristics of streams with drift. [9] is an example that proposes a method for dealing with the former. An idea that has successfully addressed the latter is the use of decision trees [6, 8]. In [6] VFDT has been extended to deal with non-stationary data streams and the so-called CVFDT algorithm is introduced. By storing some statistical variables associated with each node over a window of instances, the algorithm is able to perform regular periodic checks of its splitting decisions. If some split is recognized to be invalid, a new decision tree rooted at the related node starts growing. In [8] a fast incremental model tree with drift detection (FIMT-DD) is proposed. The algorithm starts with an empty leaf and reads instances in order of arrival. Each instance is routed to a leaf where the necessary statistics are updated. Given the first portion of instances, the algorithm finds the best split for each attribute, and then ranks the attributes according to some evaluation measure. If the splitting criterion is satisfied, it makes a split on the best attribute, creating two new leaves. When new instances arrive at a recently created split, they are passed down along the branches corresponding to the outcome of the split test on their values. Change detection tests are updated with every instance from the stream. If a change is detected, an adaptation of the tree structure is performed.

3. PROPOSED METHODOLOGY
3.1 Overall Algorithm
In this section, a learning algorithm, named Adaptive Regression Tree (ART), is presented.
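Before detailing ART, the "enough statistical support" check used by the VFDT family discussed above can be illustrated with the Hoeffding bound; the following is a hedged sketch only (the exact criterion and constants differ between VFDT, CVFDT and FIMT-DD, and the function names here are illustrative, not taken from any of those implementations):

```cpp
#include <cassert>
#include <cmath>

// Hoeffding bound: with probability 1 - delta, the true mean of a
// random variable with range R lies within epsilon of the mean
// observed over n samples.
double hoeffding_bound(double R, double delta, long n) {
    return std::sqrt(R * R * std::log(1.0 / delta) / (2.0 * n));
}

// A split is statistically supported when the gap between the best and
// second-best attribute scores exceeds the bound.
bool split_supported(double best, double second_best,
                     double R, double delta, long n) {
    return (best - second_best) > hoeffding_bound(R, delta, n);
}
```

With a wide score gap the split is accepted after few instances; a narrow gap requires more evidence before the leaf is converted to a decision node.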
The main goal is to incrementally build a regression tree that can predict the target variable of newly arriving instances with high accuracy and in a timely manner. The labels of the instances are first predicted by the regression tree. Then, the true labels are revealed to the algorithm to be used to update the current tree. Since the stream of instances may contain concept drifts, the tree is adapted in a way that makes it consistent with new concepts. The adaptation process either alters the tree structure or updates the nodes' regression models while leaving the structure unchanged. A number of variables for drift detection, expanding the tree, and prediction of the target labels are stored in each node and are updated while processing consecutive instances. The main novelty of the proposed algorithm is in its handling of drifts. ART seeks to adapt the tree so that it becomes suitable for the present concept instead of building a new subtree for that region. The ART algorithm is shown in Procedure 1. The algorithm starts with a single leaf (line 1). When a new instance arrives, it is classified by the regression tree at hand (line 3). For this purpose, the instance is passed over the nodes starting from the root down toward a leaf of the tree. In the leaf, a very simple model predicts the label of the instance. The simple rule to build a model at each leaf is to calculate the average of the labels of the instances that have reached that leaf. This average is used as the model's output. Then the real label is uncovered and the process of updating the tree begins. Once more, the instance is traversed through the tree. At each node it visits, the variables associated with the drift detection test, called the PH test, are updated according to this instance (line 6) and the PH test is run first.
The drift detection test will be discussed in section 3.3.

Input: data stream along with description of attributes
Output: predicted labels of instances
1  root ← init_tree()
2  foreach instance i do
3    l[i] ← root.predict(i)
4    cur_node ← root
5    while (cur_node ≠ null) do // for cur_node
6      update_ph()
7      if PHTestPassed()
8        tagnode(drifted)
9        resetsubtree()
10     if isdrifted() or isleaf()
11       updateebst(i)
12     if !isleaf()
13       if isdrifted()
14         if adaptrequired()
15           adapt()
16       if adapted() or driftpassed()
17         resetebst()
18         setdrifted(false)
19       cur_node ← cur_node.next(i)
20     else // leaf
21       if !driftednow()
22         update_model(i)
23       if splittestpassed()
24         split()
Procedure 1. Overall pseudocode of the ART algorithm

If drift has occurred (line 7), the node is marked as drifted and all variables associated with this node and its successors are reset to an initial value, since the previous variables no longer describe the environment properly (lines 8, 9). A summary of the attribute values of previous instances is stored in so-called E-BST data structures in each node. These statistics help identify the best attribute to split over. In the case that the node is either a leaf or is faced with drift, a split may be required and thus these structures should be updated with the new instance (lines 10, 11). Otherwise, no updates are performed, because large E-BST structures can make the updating process very inefficient, so they are to be kept as small as possible. Details are explained in section 3.2. A non-leaf drifted node is given the opportunity to become consistent with new concepts by updating the models of its subtree. This prevents structural alteration of the tree. Only if it fails to update itself appropriately does the altering process go through (line 15).
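The simple leaf model described above, a running average of the labels of the instances reaching the leaf, can be maintained incrementally in constant memory. A minimal sketch (class and member names are illustrative, not taken from the ART implementation):

```cpp
#include <cassert>
#include <cmath>

// Leaf model: predicts the mean of the labels seen so far at this leaf,
// updated incrementally without storing the labels themselves.
class LeafModel {
    long n_ = 0;
    double mean_ = 0.0;
public:
    double predict() const { return mean_; }   // current running average
    void update(double label) {                // incremental mean update
        ++n_;
        mean_ += (label - mean_) / n_;
    }
};
```

Following the predict-then-update protocol of Procedure 1, predict() is called first (line 3) and update() only after the true label is revealed (line 22).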
If the drift is not due to a change in the order of discriminative attributes, structural altering should not be performed, so a strict condition (line 14) is set to prevent the tree from these unnecessary changes. In addition, compared to structural altering, adapting the tree just by updating the leaves' models is preferable because it consumes much less time. Section 3.4 describes the structural altering process in detail. If the altering process is done, or a sufficient number of instances have been processed after the occurrence of a drift, the E-BST structures are reset and the node is unmarked (lines 16-18). If the node under review is a leaf, its regression model is updated provided that no concept drift has just occurred (lines 21, 22). This condition on updating the models is used to alleviate the problem of noisy instances for the tree. In fact, even one noisy instance may cause a concept drift detection. Therefore, the updating of the models is not done right after detecting a concept drift. As the final step of each loop, if the necessary conditions for a split (see section 3.2) are fulfilled in the leaf, then a split occurs (lines 23, 24).

3.2 Split Test
Once a sufficient number of instances reach a leaf, it has the potential to be split over some attribute. The very simple split condition that must be satisfied within a leaf is that the number of instances reaching that leaf is at least N_split, which is a parameter of the algorithm. In order to make a split, the best split value over each attribute has to be identified. Because attributes are assumed to be numerical, any real value can potentially act as a split value, and thus the process of finding the best one is not straightforward. Furthermore, given the intrinsic characteristics of data streams, storing the attribute values of arriving instances is not practical. Consequently, a method based on an extended binary search tree (E-BST), proposed by [8], is used in ART. As the E-BST method suggests, for each possible split value h, a corresponding value measured by Standard Deviation Reduction (SDR) is calculated. That is:

SDR(h) = sd(S) - \frac{N_L}{N} sd(S_L) - \frac{N_R}{N} sd(S_R), (1)

sd(S) = \sqrt{\frac{1}{N} \left( \sum_{i=1}^{N} y_i^2 - \frac{1}{N} \left( \sum_{i=1}^{N} y_i \right)^2 \right)}, (2)

where S is a set of size N containing the observed values y_i, and S_L (S_R) is the set of size N_L (N_R) containing the values for which the attribute is less than (greater than or equal to) h. For each attribute, the best value to split over is then the one having the maximum SDR. Finally, the attribute corresponding to the highest SDR value is chosen as the splitter of that leaf. As a new instance arrives, SDR values change and should be recalculated, but for running-time purposes this is done only when the number of instances is a multiple of a user-defined parameter, N_min. There are also two monitoring leverages that prevent the tree from exceeding a normal size. The first one is a constraint on a variable, called split_count_A, that determines how many times a split has been made over an attribute A within a single path from the root to a specified node.
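Equations (1)-(2) can be evaluated from running sums (count, sum, and sum of squares), which is what makes the E-BST approach feasible without storing raw values. A minimal sketch under that assumption; the struct below is illustrative and stands in for the per-split-point statistics the actual E-BST maintains:

```cpp
#include <cassert>
#include <cmath>

// Running statistics for a set of target values: count, sum, sum of squares.
struct Stats {
    long n = 0;
    double sum = 0.0, sumsq = 0.0;
    void add(double y) { ++n; sum += y; sumsq += y * y; }
    double sd() const {                    // equation (2)
        return std::sqrt((sumsq - sum * sum / n) / n);
    }
};

// Equation (1): standard deviation reduction achieved by a split that
// partitions the values into `left` and `right`.
double sdr(const Stats& all, const Stats& left, const Stats& right) {
    return all.sd() - (double)left.n / all.n * left.sd()
                    - (double)right.n / all.n * right.sd();
}
```

A split that separates two tight clusters of target values yields a large SDR, while a split that leaves both sides as spread out as the whole set yields an SDR near zero.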
If the attribute has been used max_split times, there will be no more splits on it for successors of that node. The second one is a constraint on the depth of the tree. If splitting a leaf would make the depth exceed the so-called max_height value, the leaf will not be split anymore. max_height and max_split are two parameters of the algorithm that are to be tuned.

3.3 Drift Detection
When a substantial decrease in accuracy is detected, a concept drift has occurred. A simple test, named Page-Hinckley (PH), proposed in [10], is used with some modifications to detect drift. This test works on the cumulative prediction error up to the new instance compared to that of previous instances, with the underlying aim of monitoring the tree's accuracy. Suppose that for instance i, l_i and p_i denote its true and predicted labels, respectively. Then:

e_i = |l_i - p_i|. (3)

If the total number of instances reaching a node, starting from the previous time a drift took place, is n_t, then the average error over these instances is

\bar{e}_T = \frac{1}{n_t} \sum_{i=1}^{n_t} e_i, (4)

and, where \alpha is a constant, define:

m_T = \sum_{i=1}^{n_t} (e_i - \bar{e}_i - \alpha), (5)

M_T = \min_{t \le T} m_t, (6)

PH = m_T - M_T. (7)

When a new instance arrives, if the value of m_T shows a substantial change from its previous values, that is, if the difference between m_T and its minimum over the period since the last occurrence of drift exceeds a predefined threshold \lambda (i.e. PH > \lambda), then a concept drift has been detected. Once a drift is detected, all variables of the PH test, including \bar{e}_T, m_T, M_T and PH, are reset to zero, in order to be able to test for later drifts. A similar drift detection mechanism is used in [8], but it is based on changes in the values of the labels of the instances reaching a node, whereas that of ART is based on the accuracy of the models.

3.4 Altering Tree Structure
In the case of detecting a concept drift, two options are available. One is to disregard the subtree at which the drifted node is rooted and rebuild it from scratch.
The other is to maintain the current subtree and try to apply some structural changes in order to adapt the tree to the new concept. The former option is too costly in both time and performance to be used in data streams. Therefore we suggest using the latter. Nevertheless, structural adaptation is also a time-consuming procedure, so it is done only under certain circumstances, explained below. To put it briefly, if drift is identified in a node for the first time, the tree's structure is not adapted immediately. Only if further investigation, specifically detection of drift for a second time, confirms that the node really needs a change in the structure of its subtree is the adaptation procedure invoked. When the Page-Hinckley test for drift detection (discussed in section 3.3) turns out true for a node, the node is marked as drifted. In this case, the only reaction is to reset all variables, including E-BST structures, PH test variables and models in leaves, to an initial value. This is done for this node as well as for all nodes that have the current node as their predecessor. These resets are due to the fact that the E-BSTs have to be rebuilt from scratch to identify the best attribute after the drift, and the PH test should be restarted to check for further drifts. As new instances continue to arrive, two different situations are likely to happen in this node. First, the node may recover by updating its model in a way adjusted to the new instances; second, the PH test may warn about a second occurrence of drift in that node. If the former takes place, the tree is rescued from structural changes and time is saved as well. The drifted tag is removed after N_min instances reach the node without a sign of a second drift. On the contrary, if the latter happens, then a structural adaptation should be done. From a conceptual point of view, an adaptation in the structure is supposed to replace the attribute used to split a node with another attribute that is more consistent with the new concept. The procedure for finding a better attribute is the same as that followed when splitting a node. That is, following the split test, the attribute that currently has the maximum SDR value is the new attribute we prefer to substitute for the old one. This new attribute and the previously used one are related to each other in one of the four cases discussed here:

Case I. The two attributes are identical. In this case, no additional changes are required and the drifted tag is removed.

Case II. The new attribute is the attribute used to split one of the two children of this node, and these two children have different splitting attributes. Figure 1 shows this situation. The tags on nodes denote the name of the attribute on which the split occurs. Assume that node A is to be adapted. Further, assume the new attribute, known to be the best after the drift, is B, exactly the one used in the node's left child. The adaptation process changes the subtree structure in the way shown in the figure. The two nodes denoted by A in the adapted tree are copies of node A in the original tree, identical in each and every aspect.

Figure 1. Altering tree structure (case II)

In the special case that B and C are leaves, the new attribute simply substitutes for the old one and all of the variables of these two children are reset. The reason for the reset is that the instances seen before the adaptation should not influence the model that is going to be built after the adaptation. As is clear in the figure, a new attribute, namely B, is added to the paths A-C-G and A-C-F. This may put the tree beyond the constraints of the maximum split number and/or maximum height defined for it.
Thus, a pruning procedure is activated which traverses all the nodes below node B and prunes the subtree beneath any node in which at least one of these two rules is violated. Finally, the models and PH test variables of the right subtree, C, F, G and all of their children, are reset to an initial value. Exactly the same discussion holds for the case in which the new attribute is C, the one used in the right child of the current node.

Case III. The new attribute is the attribute used to split one of the two children of this node, and these two children have identical splitting attributes. Figure 2 shows this situation. Structural adaptation is done in a slightly different manner from that of case II.

Figure 2. Altering tree structure (case III)

Case IV. Otherwise. This case is concerned with situations in which the new attribute is neither identical to the old splitting attribute nor the one used in one of the children. This reveals that this attribute is either used in indirect children of that node or is not used at all. Whichever state it is, a change more drastic than the ones described in cases II and III would have to be applied. To put it another way, a global change in the structure of the tree would be required in this case, which might even lead to higher prediction error. Furthermore, the change replacing the old attribute by the new one can take place with a short delay in later adaptations of the tree, provided that those adaptations fall into case II or III. The sensible idea behind these adaptations is that the previously built structure of the tree is worth keeping. The new structure should be as close as possible to the old one; that is, instances should be routed through the same path to the node that they would have reached before the adaptation. This is because the model adopted for an instance is still the best model for that instance even in the first moments after the drift has occurred.
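The drift-detection machinery of section 3.3, equations (3)-(7), reduces to a handful of accumulators. A hedged sketch of a PH detector over absolute prediction errors; class and member names are illustrative, and the reset-on-detection behavior follows the description above:

```cpp
#include <cassert>
#include <cmath>
#include <algorithm>

// Page-Hinckley test over errors e_i = |l_i - p_i|.
// alpha: the constant of equation (5); lambda: the drift threshold.
class PageHinckley {
    double alpha_, lambda_;
    long n_ = 0;
    double mean_ = 0.0, mT_ = 0.0, MT_ = 0.0;
public:
    PageHinckley(double alpha, double lambda)
        : alpha_(alpha), lambda_(lambda) {}
    // Feed one error; returns true if drift is signalled (PH > lambda).
    bool add(double e) {
        ++n_;
        mean_ += (e - mean_) / n_;            // running average, eq. (4)
        mT_ += e - mean_ - alpha_;            // cumulative deviation, eq. (5)
        MT_ = std::min(MT_, mT_);             // minimum so far, eq. (6)
        if (mT_ - MT_ > lambda_) {            // PH = m_T - M_T, eq. (7)
            n_ = 0; mean_ = mT_ = MT_ = 0.0;  // reset after detection
            return true;
        }
        return false;
    }
};
```

While the error stays stable, m_T drifts slowly downward along with its minimum and PH stays near zero; a sustained jump in error pushes m_T up and away from M_T until the threshold fires.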
4. EXPERIMENTAL RESULTS
In order to evaluate ART, an implementation was developed 1 in the C++ language. Computer experiments were conducted based on this implementation. A major hurdle in regression problems, especially in the presence of concept drift, is the lack of suitable datasets. We found five real datasets appropriate for our testing purpose. In the following, first these datasets are described. Then parameter tunings and evaluation metrics are discussed. Finally, results of the proposed method on the introduced datasets are reported. The algorithm is compared with the FIMT-DD method. According to [8], FIMT-DD outperformed other algorithms proposed in the field of regression on data streams with concept drift. In addition, this method also uses a regression tree as its model infrastructure and is the method most similar to ours. The experiments show the effectiveness of our algorithm in comparison to the FIMT-DD method.

4.1 Datasets
We used five real datasets on which the performance of the algorithm is evaluated. Three of these datasets were manipulated in order to make them appropriate for the regression problem, while the other two remained unchanged.

Sensor: The dataset introduced by [11] consists of about 2,219,000 records and 5 attributes, namely time, humidity, light, voltage and temperature. These records include the information collected from 54 sensors deployed in the Intel Berkeley Research laboratory over a two-month period. The type and place of concept drift is not specified. This dataset is prepared for classification tasks and the target label is the sensor ID. Thus, in order to make it usable for regression, a few modifications are necessary. The sensor ID is omitted and the temperature is used as the target value. Noisy data are also removed. The final dataset contains 4 attributes, one target attribute and 1,500,000 instances.

Table 1. Comparison of MSE and time (s) of ART and FIMT-DD on all datasets (Sensor, Electricity, Houses, Dodgers, Airline)

Electricity: The data were first described in [12]. They were collected from the Australian New South Wales Electricity Market. In this market, the prices are affected by demand and supply and are set every five minutes. The dataset contains 45,312 instances dated from 7 May 1996 to 5 December 1998. Each instance of the dataset refers to a period of 30 minutes. The dataset has 8 attributes and the original goal is to predict whether prices go up or down compared to the last 24 hours. For our regression purpose, it is modified to have 6 attributes including date and price. The target is to predict electricity demand.

Houses: The dataset first appeared in [13]. It contains 20,640 instances on housing prices in California with 9 economic covariates.

Dodgers: This loop sensor data, accessible from [14], was collected for the Glendale on-ramp of the 101 North freeway in Los Angeles. It is close enough to the stadium to see unusual traffic after a Dodgers game, but not so close that the signal for the extra traffic is overly obvious. The observations were taken over 25 weeks, with 288 time slices per day (5-minute count aggregates). The original goal is to predict the presence of a baseball game at the stadium. The dataset is changed so that the number of cars passing the freeway, given the 5 attributes month, day, year, hour and minute, is the target variable. Records containing missing values are removed and the final size of the dataset is 47,497.

Airline: Introduced by [8], this dataset consists of about 116 million records and 13 attributes. The records include flight arrival and departure information for commercial flights within the USA, from 1987 to 2008. The target value is the arrival delay. The last 1 million records are used in our tests.

1 Accessible from {first_author_home_page/art/code}
4.2 Parameters
All experiments were run on an Intel Pentium 3.4 GHz CPU with 2 GB RAM. Furthermore, some variables must be initialized to constant values before the algorithms start to run. The same values of these parameters are used for the first four datasets. However, for the airline dataset different values have to be selected, due to its different properties. These parameters are categorized into three parts:

a. E-BST parameters: including N_split, used for the split test, and N_min, denoting the chunk size on which the split condition is tested. N_split and N_min are set to 60 and 200 for all datasets.

b. PH test parameters: including α, used to calculate the PH formula, and λ, the drift detection threshold. α and λ are set to 20 and … for the first four datasets, and … and 0.01 for the airline dataset.

c. Pruning parameters: including max_split, identifying how many times a split on each attribute is allowed within a single path, and max_height, identifying the maximum allowed height of the tree. max_split and max_height are set to 4 and 15 for the first four datasets and 2 and 3 for the airline dataset.

Two sets of default parameters are used in the experiments of the FIMT-DD method. In order to provide a fair comparison, the best set is used for all datasets [8].

4.3 Evaluation Method and Metrics
In data stream mining, the most frequently used measure for evaluating the predictive error of a model is the predictive sequential, or prequential, error. For each instance of the stream, the current model makes a prediction based only on the instance's attribute values. The prequential error is computed from an accumulated sum of a loss function between the predicted and real values [15]. Here, the Mean Squared Error (MSE) is used to compare prequential errors of the regression models. Running time is also an important evaluation factor. The results given in this section are averages over 5 runs on each configuration of the algorithms.
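The prequential protocol just described (predict first, then learn from the revealed label while accumulating the squared loss) can be sketched as follows; the class name is illustrative:

```cpp
#include <cassert>
#include <cmath>

// Prequential MSE: each instance is first predicted, then its true label
// is revealed; the squared loss is accumulated over the stream.
class PrequentialMSE {
    long n_ = 0;
    double sse_ = 0.0;
public:
    void record(double predicted, double actual) {
        double err = predicted - actual;
        sse_ += err * err;   // accumulate squared loss
        ++n_;
    }
    double mse() const { return n_ ? sse_ / n_ : 0.0; }
};
```

Because every prediction is made before the corresponding label is used for training, the accumulated MSE reflects true out-of-sample performance on the stream.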
4.4 Results
Running the ART and FIMT-DD algorithms on the five described datasets leads to the results reported in Table 1. ART shows lower errors in all cases and better running times in most of them. Specifically, for the sensor, houses and airline datasets, both the MSEs and running times of ART are lower than those of FIMT-DD. The running time for the sensor dataset is much lower (by about 20%), while for the houses and airline datasets the running times are almost the same. For the other two datasets, i.e. electricity and dodgers, ART shows a significantly lower MSE at the cost of more running time. By setting the max_split and max_height parameters to lower values, better running times could also be obtained. The main advantage of ART over FIMT-DD that leads to these notable results is that in ART a drifted area within the tree is not discarded completely but is instead restructured to adapt to the new concept. In FIMT-DD, by contrast, a drifted subtree is replaced entirely by a new tree built upon instances that arrived after the drift. In the following diagrams, a comparison of the MSE of ART and FIMT-DD is given for each dataset separately. The horizontal axis shows the arrival of instances in chronological order. The variations of MSE according to concept drifts can be seen in these plots.

Figure 3. Comparison of MSE on the sensor dataset

Figure 4. Comparison of MSE on the electricity dataset
Figure 5. Comparison of MSE on the houses dataset
Figure 6. Comparison of MSE on the dodgers dataset
Figure 7. Comparison of MSE on the airline dataset

5. CONCLUSION AND FUTURE WORKS
We proposed a regression tree algorithm, named ART, for data streams in the presence of concept drift. In this method, a tree is built incrementally for predicting the target value of incoming instances. After the prediction, the real label of each instance is revealed to the algorithm and used to update the tree. In the case of detecting concept drifts in a portion of the tree, either altering the tree structure or updating the nodes' regression models is done to handle them. This method was compared to FIMT-DD and was shown to improve accuracy or running time on the used datasets. Future research directions might include the following. First, new conditions for altering the tree structure could be tested. Second, other efficient methods could be used instead of the current drift detection or splitting tests. Third, the algorithm could be run on more real datasets in order to achieve more reliable results.

6. REFERENCES
[1] Malerba, D., Appice, A., Ceci, M., and Monopoli, M. Trading-off local versus global effects of regression nodes in model trees. In Proceedings of the 13th International Symposium on Foundations of Intelligent Systems, LNCS, vol. 2366. Springer, Berlin.
[2] Potts, D., and Sammut, C. Incremental learning of linear model trees. J. Mach. Learn. Res. 6.
[3] Rodrigues, P. P., Gama, J., and Bosnic, Z. Online reliability estimates for individual predictions in data streams. In Proceedings of the IEEE International Conference on Data Mining Workshops. IEEE Computer Society, Los Alamitos, CA.
[4] Hosseini, M. J., Ahmadi, Z., and Beigy, H., 2011. Pool and Accuracy Based Stream Classification: A new ensemble algorithm on data stream classification using recurring concepts detection. In Proceedings of the 11th International Conference on Data Mining Workshops, Vancouver, Canada.
[5] Gama, J., Medas, P., and Rodrigues, P. Learning Decision Trees from Dynamic Data Streams. In Proceedings of the 2005 ACM Symposium on Applied Computing.
[6] Hulten, G., Spencer, L., and Domingos, P. Mining Time-changing Data Streams. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[7] Domingos, P., and Hulten, G. Mining High-speed Data Streams. In Proceedings of the ACM Sixth International Conference on Knowledge Discovery and Data Mining.
[8] Ikonomovska, E., Gama, J., and Dzeroski, S. Learning model trees from evolving data streams. Data Mining and Knowledge Discovery, vol. 23. Kluwer Academic Publishers, Hingham, MA, USA.
[9] Song, X., Wu, M., Jermaine, C., and Ranka, S. Statistical change detection for multidimensional data. In Proceedings of the 13th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
[10] Mouss, H., Mouss, D., Mouss, N., and Sefouhi, L. Test of Page-Hinckley, an approach for fault detection in an agroalimentary production system. In Proceedings of the 5th Asian Control Conference, vol. 2. IEEE Computer Society, Los Alamitos, CA.
[11] Zhu, X. Stream Data Mining Repository. Accessed on Sep. 2012.
[12] Harries, M. Splice-2 comparative evaluation: Electricity pricing. Technical report, The University of New South Wales.
[13] Pace, R. K., and Barry, R. Sparse Spatial Autoregressions. Statistics and Probability Letters, vol. 83, no. 3.
[14] Dodgers loop sensor data, accessed in July.
[15] Gama, J. Knowledge Discovery from Data Streams. Chapman & Hall/CRC.


Outline BST Operations Worst case Average case Balancing AVL Red-black B-trees. Binary Search Trees. Lecturer: Georgy Gimel farb

Outline BST Operations Worst case Average case Balancing AVL Red-black B-trees. Binary Search Trees. Lecturer: Georgy Gimel farb Binary Search Trees Lecturer: Georgy Gimel farb COMPSCI 220 Algorithms and Data Structures 1 / 27 1 Properties of Binary Search Trees 2 Basic BST operations The worst-case time complexity of BST operations

More information

Pattern-Aided Regression Modelling and Prediction Model Analysis

Pattern-Aided Regression Modelling and Prediction Model Analysis San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Fall 2015 Pattern-Aided Regression Modelling and Prediction Model Analysis Naresh Avva Follow this and

More information

Understanding Web personalization with Web Usage Mining and its Application: Recommender System

Understanding Web personalization with Web Usage Mining and its Application: Recommender System Understanding Web personalization with Web Usage Mining and its Application: Recommender System Manoj Swami 1, Prof. Manasi Kulkarni 2 1 M.Tech (Computer-NIMS), VJTI, Mumbai. 2 Department of Computer Technology,

More information

MAXIMAL FREQUENT ITEMSET GENERATION USING SEGMENTATION APPROACH

MAXIMAL FREQUENT ITEMSET GENERATION USING SEGMENTATION APPROACH MAXIMAL FREQUENT ITEMSET GENERATION USING SEGMENTATION APPROACH M.Rajalakshmi 1, Dr.T.Purusothaman 2, Dr.R.Nedunchezhian 3 1 Assistant Professor (SG), Coimbatore Institute of Technology, India, rajalakshmi@cit.edu.in

More information