Incremental Learning

Abdelhamid Bouchachia, Department of Informatics, University of Klagenfurt, Universitaetsstr, Klagenfurt, 9020 Austria

INTRODUCTION

Data mining and knowledge discovery are about creating a comprehensible model of the data. Such a model may take different forms, ranging from simple association rules to complex reasoning systems. One of the fundamental requirements this model has to fulfill is adaptivity: the process of knowledge extraction should remain continually maintainable and open to future updates as new data become available. We refer to this process as knowledge learning.

Knowledge learning systems are traditionally built from data samples in an off-line, one-shot experiment. Once the learning phase is over, the learning system can neither learn further knowledge from new data nor update itself in the future. In this chapter, we consider the problem of incremental learning (IL). We show how, in contrast to off-line or batch learning, IL learns knowledge, be it symbolic (e.g., rules) or sub-symbolic (e.g., numerical values), from data that evolves over time. The basic idea motivating IL is that as new data points arrive, new knowledge elements may be created and existing ones may be modified, allowing the knowledge base (respectively, the system) to evolve over time. Thus, the acquired knowledge becomes self-corrective in light of new evidence. This update is of paramount importance to ensure the adaptivity of the system. However, it should be meaningful (by capturing only interesting events brought by the arriving data) and sensitive (by safely ignoring unimportant events).

Perceptually, IL is a fundamental problem of cognitive development. Indeed, the perceiver usually learns how to make sense of its sensory inputs in an incremental

manner via a filtering procedure.

In this chapter, we outline the background of IL from two perspectives, machine learning and data mining, before highlighting our IL research, the challenges, and the future trends of IL.

BACKGROUND

IL is a key issue in applications where data arrives over long periods of time and/or where storage capacities are very limited. Most of the knowledge learning literature reports on learning models that are one-shot experiments. Once the learning stage is over, the induced knowledge is no longer updated. Thus, the performance of the system depends heavily on the data used during the learning (knowledge extraction) phase, and shifts of trends in the arriving data cannot be accounted for.

Algorithms with an IL ability are of increasing importance in many innovative applications, e.g., video streams, stock market indexes, intelligent agents, and user profile learning. Hence, there is a need to devise learning mechanisms that are capable of accommodating new data in an incremental way while keeping the system in use. This problem has been studied in the framework of adaptive resonance theory (Carpenter et al., 1991), which was proposed to deal efficiently with the stability-plasticity dilemma. Formally, a learning algorithm is totally stable if it keeps the acquired knowledge in memory without any catastrophic forgetting; however, it is not required to accommodate new knowledge. Conversely, a learning algorithm is completely plastic if it is able to continually learn new knowledge without any requirement to preserve the knowledge previously learned. Resolving the dilemma means accommodating new data (plasticity) without forgetting (stability) by generating knowledge

elements over time whenever the new data conveys new knowledge elements worth considering.

Basically, there are two schemes to accommodate new data. Retraining the algorithm from scratch using both old and new data is known as the revolutionary strategy. In contrast, an evolutionary strategy continues to train the algorithm using only the new data (Michalski, 1985). The first scheme fulfills only the stability requirement, whereas the second is a typical IL scheme that is able to fulfill both stability and plasticity. The goal is to make a tradeoff between the stability and plasticity ends of the learning spectrum, as shown in Fig. 1.

Figure 1: Learning spectrum (labels in the figure: incremental learning, favoring stability, favoring plasticity)

As noted in (Polikar et al., 2000), there are many approaches referring to some aspects of IL. They exist under different names, such as on-line learning, constructive learning, lifelong learning, and evolutionary learning. Therefore, a definition of IL turns out to be vital:

- IL should be able to accommodate plasticity by learning knowledge from new data. This data can refer either to the already known structure or to a new structure of the system.
- IL can use only new data and should not have access at any time to the previously used data to update the existing system.
- IL should be able to observe the stability of the system by avoiding forgetting.

It is worth noting that IL research flows in three directions: clustering, classification, and association rules mining. In the context of classification and clustering, many IL approaches have been introduced. A typical incremental approach is discussed in (Parikh & Polikar, 2007), which consists of combining an ensemble of multilayer perceptron networks (MLP) to

accommodate new data. Related work was done in (Chakraborty & Pal, 2003), also using MLPs. Note here that stand-alone MLPs, like many other neural networks, need retraining in order to learn from the new data. Other IL algorithms were proposed in (Fritzke, 1994) and in (Domeniconi & Gunopulos, 2001). The former algorithm is based on radial basis function networks (RBFs), while the latter aims at constructing incremental support vector machine classifiers.

Actually, there exist four neural models that are inherently incremental: (i) adaptive resonance theory (ART) (Carpenter et al., 1991), (ii) min-max neural networks (Simpson, 1992), (iii) nearest generalized exemplar (Salzberg, 1991), and (iv) the growing neural gas model (Fritzke, 1995). The first three incremental models aim at learning hyper-rectangle categories, while the last one aims at building point-prototyped categories.

It is important to mention that there exist many classification approaches that are referred to as IL approaches and which rely on neural networks. These range from retraining on misclassified samples to various weighting schemes (Freeman & Saad, 1997; Grippo, 2000). All of them are about sequential learning, where input samples are sequentially, but iteratively, presented to the algorithm. However, sequential learning works only in closed environments, where the classes to be learned have to be reflected by the readily available training data; more importantly, prior knowledge can also be forgotten if the classes are unbalanced.

In contrast to sub-symbolic learning, few authors have studied incremental symbolic learning, where the problem is incrementally learning simple classification rules (Maloof & Michalski, 2004; Reinke & Michalski, 1988; Utgoff, 1988).

In addition, the concept of incrementality has been discussed in the context of association rules mining (ARM).
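For a rule X → Y, support is the fraction of transactions that contain both X and Y, and confidence is that count divided by the number of transactions containing X. The sketch below is a minimal illustration, assuming small transactions and itemsets of at most two items; it is not one of the published incremental ARM algorithms, and the class name `IncrementalRuleStats` and its parameters are hypothetical. It shows how both measures can be refreshed from a new batch alone when itemset counts are kept between updates.

```python
from collections import Counter
from itertools import combinations


class IncrementalRuleStats:
    """Keeps itemset counts so rule statistics can be refreshed from new data only."""

    def __init__(self, max_size=2):
        self.max_size = max_size   # largest itemset tracked (assumption of this sketch)
        self.counts = Counter()    # frozenset(itemset) -> number of transactions containing it
        self.n_transactions = 0

    def update(self, batch):
        """Fold new transactions into the counts; earlier transactions are not revisited."""
        for transaction in batch:
            items = sorted(set(transaction))
            self.n_transactions += 1
            for size in range(1, self.max_size + 1):
                for itemset in combinations(items, size):
                    self.counts[frozenset(itemset)] += 1

    def support(self, itemset):
        if self.n_transactions == 0:
            return 0.0
        return self.counts[frozenset(itemset)] / self.n_transactions

    def confidence(self, antecedent, consequent):
        # Only meaningful when antecedent plus consequent has at most max_size items.
        seen = self.counts[frozenset(antecedent)]
        if seen == 0:
            return 0.0
        both = self.counts[frozenset(antecedent) | frozenset(consequent)]
        return both / seen


stats = IncrementalRuleStats()
stats.update([["bread", "milk"], ["bread", "butter"], ["milk"]])  # initial database
stats.update([["bread", "milk", "butter"]])                       # the increment
print(stats.support(["bread", "milk"]))       # 2 of 4 transactions: 0.5
print(stats.confidence(["bread"], ["milk"]))  # 2 of 3 bread transactions
```

Counting every itemset up to a fixed size grows quickly with transaction length, so this only conveys the bookkeeping idea; practical algorithms prune candidates by minimum support.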
The goal of ARM is to generate all association rules of the form X → Y whose support and confidence are greater than a user-specified minimum support and minimum

confidence, respectively. The motivation underlying incremental ARM stems from the fact that databases grow over time. The association rules mined need to be updated as new items are inserted in the database. Incremental ARM aims at using only the incremental part to infer new rules. However, this is usually done by processing the incremental part separately and scanning the older database if necessary. Some of the algorithms proposed are FUP (Cheung et al., 1996), temporal windowing (Rainsford et al., 1997), and DELI (Lee & Cheung, 1997).

In contrast to static databases, IL is more visible in data stream ARM. The nature of the data imposes such an incremental treatment: data usually arrives continually in the form of high-speed streams. IL is particularly relevant for online streams since data is discarded as soon as it has been processed. Many algorithms have been introduced to maintain association rules (Charikar et al., 2004; Chang & Lee, 2004; Domingos & Hulten, 2000; Giannella et al., 2003; Lin et al., 2005; Yu et al., 2004). Furthermore, many classification and clustering algorithms, which are not fully incremental, have been developed in the context of stream data (Aggarwal et al., 2004; Guha et al., 2000; Last, 2002).

FOCUS

IL has a large spectrum of investigation facets. In the following, we focus on classification and clustering, which are key issues in many domains such as data mining, pattern recognition, knowledge discovery, and machine learning. In particular, we focus on two research avenues which we have investigated: (i) incremental fuzzy classifiers (IFC) (Bouchachia & Mittermeir, 2006) and (ii) incremental learning by function decomposition (ILFD) (Bouchachia, 2006a).

The motivation behind IFC is to infer knowledge in the form of fuzzy rules from data that

evolves over time. To accommodate IL, appropriate mechanisms are applied in all steps of the fuzzy system construction:

(1) Incremental supervised clustering: Given a labeled data set, the first step is to cluster this data with the aim of achieving high purity and separability of clusters. To do that, we have introduced a clustering algorithm that is incremental and supervised. These two characteristics are vital for the whole process. The resulting labeled cluster prototypes are projected onto each feature axis to generate fuzzy partitions.

(2) Fuzzy partitioning and accommodation of change: Fuzzy partitions are generated in two steps. Initially, each cluster is mapped onto a triangular partition. Then, in order to optimize the shape of the partitions and the number and complexity of the rules, these triangular partitions are aggregated. As new data arrives, the partitions are systematically updated without referring to the previously used data. The consequents of the rules are then updated accordingly.

(3) Incremental feature selection: To find the most relevant features (which results in compact and transparent rules), an incremental version of Fisher's interclass separability criterion is devised. As new data arrives, some features may be replaced by new ones in the rules. Hence, the rule premises are dynamically updated. At any time in the life of a classifier, the rule base should reflect the semantic contents of the already used data. To the best of our knowledge, there is no previous work on feature selection algorithms that observe the notion of incrementality.

In another research axis, IL has been thoroughly investigated in the context of neural networks. In (Bouchachia, 2006a; Bouchachia, 2006b) we have proposed a novel IL algorithm

based on function decomposition (ILFD) that is realized by a neural network. ILFD uses clustering and vector quantization techniques to deal with classification tasks. The main motivation behind ILFD is to enable on-line classification of data lying in different regions of the space, which allows generating non-convex partitions and, more generally, disconnected partitions (not lying in the same contiguous space). Hence, each class can be approximated by a sufficient number of categories centered around their prototypes.

Furthermore, ILFD differs from the aforementioned learning techniques (see BACKGROUND) with respect to the following aspects:

- Most of those techniques rely on geometric shapes to represent the categories, such as hyper-rectangles and hyper-ellipses, whereas the ILFD approach is not explicitly based on a particular shape, since one can use different types of distances to obtain different shapes.
- Usually, there is no explicit mechanism (except for the neural gas model) to deal with redundant and dead categories. The ILFD approach uses two procedures to get rid of them. The first is called the dispersion test and aims at eliminating redundant category nodes. The second is called the staleness test and aims at pruning categories that become stale.
- While all of those techniques modify the position of the winner when presenting the network with a data vector, the learning mechanism in ILFD consists of reinforcing the winning category from the class of the data vector and pushing away the second winner from a neighboring class to reduce the overlap between categories.
- While the other approaches are either self-supervised or need to match the input with all existing categories, ILFD compares the input only with categories having the same label

as the input in the first place and then with categories from other labels distinctively.
- ILFD can also deal with the problem of partially labeled data. Indeed, even unlabeled data can be used during the training stage.

Moreover, the characteristics of ILFD can be compared to other models such as fuzzy ARTMAP (FAM), min-max neural networks (MMNN), nearest generalized exemplar (NGE), and growing neural gas (GNG), as shown in Tab. I (Bouchachia et al., 2007).

TABLE I: Characteristics of some IL algorithms

Characteristics            FAM       MMNN      NGE       GNG         ILFD
Online learning            Y         Y         Y         Y           Y
Type of prototypes         Hyperbox  Hyperbox  Hyperbox  Graph node  Cluster center
Generation control         Y         Y         Y         Y           Y
Shrinking of prototypes    N         Y         Y         U           U
Deletion of prototypes     N         N         N         Y           Y
Overlap of prototypes      Y         N         N         U           U
Growing of prototypes      Y         Y         Y         U           U
Noise resistance           U         Y         U         U           U
Sensitivity to data order  Y         Y         Y         Y           Y
Normalization              Y         Y         Y/N       N           Y/N

Legend: Y: yes, N: no, U: unknown/undefined

In our research, we have tried to stick to the spirit of IL. To put it clearly, an IL algorithm, in our view, should fulfill the following characteristics:

- Ability of life-long learning and of dealing with plasticity and stability
- Old data is never used in subsequent stages
- No prior knowledge about the (topological) structure of the system is needed

- Ability to incrementally tune the structure of the system
- No prior knowledge about the statistical properties of the data is needed
- No prior knowledge about the number of existing classes and the number of categories per class is needed, and no prototype initialization is required

FUTURE TRENDS

The problem of incrementality remains a key aspect in learning systems. The goal is to achieve adaptive systems that are equipped with self-correction and evolution mechanisms. However, many issues, which can be seen as shortcomings of existing IL algorithms, remain open and are therefore worth investigating:

- Order of data presentation: All of the proposed IL algorithms suffer from sensitivity to the order of data presentation. Usually, the inferred classifiers are biased by this order. Indeed, different presentation orders result in different classifier structures and therefore in different accuracy levels. It is therefore worthwhile to develop algorithms whose behavior is independent of the order of data presentation, which is usually a desired property.
- Category proliferation: In the context of clustering and classification, category proliferation refers to the problem of generating a large number of categories. This number is in general proportional to the granularity of categories. In other terms, a fine category size implies a large number of categories, and a larger size implies fewer categories. Usually, there is a parameter in each IL algorithm that controls the process of category generation. The problem here is: what is the appropriate value of such a parameter? This is clearly related to the problem of plasticity, which plays a central role in IL algorithms.
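The effect of such a control parameter, and the order sensitivity noted above, can be illustrated with a minimal sketch. It is a generic single-pass nearest-prototype scheme, not any of the algorithms surveyed in this chapter; the function name `incremental_categories` and the `radius` threshold are assumptions made for illustration.

```python
import math


def incremental_categories(stream, radius):
    """Single-pass prototype learning: each sample is seen once, then discarded.

    `radius` is a hypothetical granularity parameter: a sample farther than
    `radius` from every existing prototype creates a new category.
    """
    prototypes = []  # each entry is [center, number_of_samples_absorbed]
    for x in stream:
        best, best_dist = None, math.inf
        for proto in prototypes:
            d = math.dist(x, proto[0])
            if d < best_dist:
                best, best_dist = proto, d
        if best is None or best_dist > radius:
            prototypes.append([tuple(x), 1])  # new category
        else:
            best[1] += 1
            rate = 1.0 / best[1]              # running mean of absorbed samples
            best[0] = tuple(c + rate * (xi - c) for c, xi in zip(best[0], x))
    return prototypes


# Granularity: same data, two values of the control parameter.
samples = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9), (3.0, 3.0), (3.1, 2.9)]
fine = incremental_categories(samples, radius=0.1)
coarse = incremental_categories(samples, radius=0.5)
print(len(fine), len(coarse))        # 6 3

# Order sensitivity: same data and parameter, two presentation orders.
in_order = incremental_categories([(0.0,), (0.3,), (0.6,)], radius=0.5)
shuffled = incremental_categories([(0.0,), (0.6,), (0.3,)], radius=0.5)
print(len(in_order), len(shuffled))  # 1 2
```

With the small radius every sample becomes its own category, whether it is a rare event or noise; with the larger radius the same data yields three categories, and merely reordering three points changes the number of categories from one to two.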

Hence, the question: how can we distinguish between rare events and outliers? What is the controlling parameter value that allows making such a distinction? This remains a difficult issue.
- Number of parameters: One of the most important shortcomings of the majority of IL algorithms is the large number of user-specified parameters involved. It is usually hard to find the optimal values of these parameters. Furthermore, they are very sensitive to the data: in general, obtaining high accuracy requires changing the settings from one data set to another. In this context, there is a real need to develop algorithms that do not depend heavily on many parameters or that can optimize such parameters.
- Self-consciousness & self-correction: The problem of distinguishing between noisy input data and rare events is crucial not only for category generation but also for correction. In the current approaches, IL systems cannot correct wrong decisions made previously, because each sample is treated once and any decision about it has to be taken at that time. Now, assume that at processing time a sample x was considered noise while in reality it was a rare event, and that at a later stage the same rare event is discovered by the system. In the ideal case, the system should then recall that the sample x has to be reconsidered. Current algorithms are not able to adjust the system by re-examining old decisions. Thus, IL systems have to be equipped with some memory in order to become smart enough.
- Data drift: One of the most difficult questions worth looking at is related to drift. Little, if any, attention has been paid to the application and evaluation of the aforementioned IL algorithms in the context of drifting data, although the change of

environment is one of the crucial assumptions of all these algorithms. Furthermore, there are many publicly available data sets for testing systems within a static setting, but there are very few benchmark data sets for dynamically changing problems. Those that exist are usually artificial sets. It is very important for the IL community to have a repository, similar to the UCI repository at Irvine, in order to evaluate the proposed algorithms in evolving environments.

As a final aim, research in the IL framework has to focus on incremental but stable algorithms that are transparent, self-corrective, less sensitive to the order of data arrival, and whose parameters are less sensitive to the data itself.

CONCLUSION

Building adaptive systems that are able to deal with nonstandard settings of learning is one of the key research avenues in machine learning, data mining, and knowledge discovery. Adaptivity can take different forms, but the most important one is certainly incrementality. Such systems are continuously updated as more data becomes available over time. The appealing features of IL, if taken into account, will help integrate intelligence into knowledge learning systems. In this chapter, we have tried to outline the current state of the art in this research area and to show the main problems that remain unsolved and require further investigation.

REFERENCES

Aggarwal, C., Han, J., Wang, J., & Yu, P. (2004). On demand classification of data streams. International Conference on Knowledge Discovery and Data Mining, pages:

Bouchachia, A. & Mittermeir, R. (2006). Towards fuzzy incremental classifiers. Soft Computing, 11(2): , January
Bouchachia, A., Gabrys, B. & Sahel, Z. (2007). Overview of some incremental learning algorithms. To appear in Proc. of the 16th IEEE International Conference on Fuzzy Systems, IEEE Computer Society, 2007
Bouchachia, A. (2006a). Learning with incrementality. The 13th International Conference on Neural Information Processing, LNCS 4232, pages:
Bouchachia, A. (2006b). Incremental learning via function decomposition. The 5th International Conference on Machine Learning and Applications, pages: 63-68, IEEE Computer Society,
Carpenter, G., Grossberg, S., & Rosen, D. (1991). Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 4(6):
Chakraborty, D., & Pal, N. (2003). A novel training scheme for multilayered perceptrons to realize proper generalization and incremental learning. IEEE Transactions on Neural Networks, 14(1):1-14.
Chang, J., & Lee, W. (2004). A sliding window method for finding recently frequent itemsets over online data streams. Journal of Information Science and Engineering, 20(4):
Charikar, M., Chen, K., & Farach-Colton, M. (2004). Finding frequent items in data streams. International Colloquium on Automata, Languages and Programming, pages:
Cheung, D., Han, J., Ng, V., & Wong, C. (1996). Maintenance of discovered association rules in large databases: An incremental updating technique. IEEE International Conference on Data Mining,
Domingos, P., & Hulten, G. (2000). Mining high-speed data streams. The ACM 6th International Conference on Knowledge Discovery and Data Mining, pages:

Domeniconi, C. & Gunopulos, D. (2001). Incremental support vector machine construction. International Conference on Data Mining, pages:
Freeman, J. & Saad, D. (1997). On-line learning in radial basis function networks. Neural Computation, 9:
Fritzke, B. (1994). Fast learning with incremental RBF networks. Neural Processing Letters, 1(1):25.
Fritzke, B. (1995). A growing neural gas network learns topologies. Advances in Neural Information Processing Systems, pages
Giannella, C., Han, J., Pei, J., Yan, X., & Yu, P. (2003). Mining frequent patterns in data streams at multiple time granularities. Workshop on Data Mining: Next Generation Challenges and Future Directions, AAAI.
Grippo, L. (2000). Convergent on-line algorithms for supervised learning in neural networks. IEEE Trans. on Neural Networks, 11:
Guha, S., Mishra, N., Motwani, R., & O'Callaghan, L. (2000). Clustering data streams. IEEE Symposium on Foundations of Computer Science, pages:
Last, M. (2002). Online classification of non-stationary data streams. Intelligent Data Analysis, 6(2):
Lee, S., & Cheung, D. (1997). Maintenance of discovered association rules: when to update? SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.
Lin, C., Chiu, D., Wu, Y., & Chen, A. (2005). Mining frequent itemsets from data streams with a time-sensitive sliding window. International SIAM Conference on Data Mining.
Maloof, M., & Michalski, R. (2004). Incremental learning with partial instance memory. Artificial Intelligence, 154:
Michalski, R. (1985). Knowledge repair mechanisms: evolution vs. revolution. International

Machine Learning Workshop, pages
Parikh, D. & Polikar, R. (2007). An ensemble-based incremental learning approach to data fusion. IEEE Transactions on Systems, Man and Cybernetics, 37(2):
Polikar, R., Udpa, L., Udpa, S. & Honavar, V. (2000). Learn++: An incremental learning algorithm for supervised neural networks. IEEE Trans. on Systems, Man, and Cybernetics, 31(4):
Rainsford, C., Mohania, M., & Roddick, J. (1997). A temporal windowing approach to the incremental maintenance of association rules. International Database Workshop, Data Mining, Data Warehousing and Client/Server Databases, pages:
Reinke, R., & Michalski, R. (1988). Machine intelligence, chapter: Incremental learning of concept descriptions: a method and experimental results, pages
Salzberg, S. (1991). A nearest hyperrectangle learning method. Machine Learning, 6:
Simpson, P. (1992). Fuzzy min-max neural networks. Part 1: Classification. IEEE Trans. Neural Networks, 3(5):
Utgoff, P. (1988). ID5: An incremental ID3. International Conference on Machine Learning, pages
Yu, J., Chong, Z., Lu, H., & Zhou, A. (2004). False positive or false negative: mining frequent itemsets from high speed transactional data streams. International Conference on Very Large Databases, pages:

KEY TERMS AND THEIR DEFINITIONS

Knowledge learning: The process of automatically extracting knowledge from data.

Incrementality: The characteristic of an algorithm that is capable of processing data which arrives over time sequentially, in a stepwise manner, without referring to the previously seen data.

Stability: A learning algorithm is totally stable if it keeps the acquired knowledge in memory without any catastrophic forgetting.

Plasticity: A learning algorithm is completely plastic if it is able to continually learn new knowledge without any requirement to preserve previously seen data.

Data drift: Unexpected change over time of the data values (according to one or more dimensions).

Keywords: online learning, incrementality, adaptivity, model evolution, stability-plasticity
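The interplay of the terms defined above can be illustrated with a deliberately simple sketch: a single numeric estimate updated one sample at a time, with no past samples stored. The function `track` and its `rate` parameter are hypothetical and chosen for illustration only. A rate close to 0 favors stability, and a rate close to 1 favors plasticity.

```python
def track(stream, rate):
    """Update one estimate per incoming sample; previously seen data is never revisited."""
    estimate = None
    history = []
    for x in stream:
        if estimate is None:
            estimate = x                       # first sample initializes the estimate
        else:
            estimate += rate * (x - estimate)  # move a fraction of the way toward x
        history.append(estimate)
    return history


# Synthetic stream with an abrupt drift after 20 samples (illustrative data).
stream = [0.0] * 20 + [10.0] * 5
stable = track(stream, rate=0.05)   # slow to change: retains old knowledge
plastic = track(stream, rate=0.9)   # quick to change: forgets old knowledge
print(round(stable[-1], 2), round(plastic[-1], 2))
```

Five samples after the drift, the low-rate estimate has moved only part of the way toward the new value, while the high-rate estimate has almost fully adopted it and retains nearly nothing of the earlier regime.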