Incremental Clustering for Mining in a Data Warehousing Environment


Martin Ester, Hans-Peter Kriegel, Jörg Sander, Michael Wimmer, Xiaowei Xu
Institute for Computer Science, University of Munich
Oettingenstr. 67, D München, Germany
{ester | kriegel | sander | wimmerm | wu}@informatik.uni-muenchen.de

Abstract

Data warehouses provide many opportunities for performing data mining tasks such as classification and clustering. Typically, updates are collected and applied to the data warehouse periodically in a batch mode, e.g., during the night. Then, all patterns derived from the warehouse by some data mining algorithm have to be updated as well. Due to the very large size of the databases, it is highly desirable to perform these updates incrementally. In this paper, we present the first incremental clustering algorithm. Our algorithm is based on the clustering algorithm DBSCAN which is applicable to any database containing data from a metric space, e.g., to a spatial database or to a WWW-log database. Due to the density-based nature of DBSCAN, the insertion or deletion of an object affects the current clustering only in the neighborhood of this object. Thus, efficient algorithms can be given for incremental insertions and deletions to an existing clustering. Based on the formal definition of clusters, it can be proven that the incremental algorithm yields the same result as DBSCAN. A performance evaluation of IncrementalDBSCAN on a spatial database as well as on a WWW-log database is presented, demonstrating the efficiency of the proposed algorithm. IncrementalDBSCAN yields significant speed-up factors over DBSCAN even for large numbers of daily updates in a data warehouse.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 24th VLDB Conference, New York, USA, 1998

1 Introduction

Many companies have recognized the strategic importance of the knowledge hidden in their large databases and, therefore, have built data warehouses. A data warehouse is a collection of data from multiple sources, integrated into a common repository and extended by summary information (such as aggregate views) for the purpose of analysis [MQM 97]. When speaking of a data warehousing environment, we do not anticipate any special architecture but we address an environment with the following two characteristics:
(1) Derived information is present for the purpose of analysis.
(2) The environment is dynamic, i.e. many updates occur.
In such an environment, either manual analyses supported by appropriate visualization tools or (semi)automatic data mining may be performed. Data mining has been defined as the application of data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data [FPS 96]. Several data mining tasks have been identified [FPS 96], e.g., clustering, classification and summarization. Typical results of data mining are as follows:
- Clusters of items which are typically bought together by some set of customers (clustering in a data warehouse storing sales transactions).
- Symptoms distinguishing disease A from disease B (classification in a medical data warehouse).
- Description of the typical WWW access patterns (summarization in the data warehouse of an internet provider).

The task considered in this paper is clustering [KR 90], i.e. grouping the objects of a database into meaningful subclasses. Recently, several clustering algorithms for mining in large databases have been developed [NH 94], [ZRL 96], [EKSX 96]. Typically, a data warehouse is not updated immediately when insertions and deletions on the operational databases occur. Updates are collected and applied to the data warehouse
periodically in a batch mode, e.g., each night [MQM 97]. Then, all patterns derived from the warehouse by data mining algorithms have to be updated as well. This update must be efficient enough to be finished when the warehouse has to be available for users again, e.g., the next morning. Due to the very large size of the databases, it is highly desirable to perform these updates incrementally ([FAAM 97], [Huy 97]), so as to consider only the old clusters and the objects inserted or deleted during the day, instead of applying the clustering algorithm to the (very large) updated database.

Maintenance of derived information such as views and summary tables has been an active area of research [MQM 97], [Huy 97]. The problem of incrementally updating mined patterns on changes of the database, however, has just recently started to receive more investigation. [CHNW 96] and [FAAM 97] propose efficient methods for incrementally modifying a set of association rules mined from a database. [EW 98] introduces generalization algorithms for incremental summarization in a data warehousing environment.

In this paper, we present the first incremental clustering algorithm for mining in a data warehousing environment. Our algorithm is based on DBSCAN [EKSX 96], [SEKX 98], which is an efficient clustering algorithm for metric databases (that is, databases with a distance function for pairs of objects). Due to the density-based nature of DBSCAN, the insertion or deletion of an object affects the current clustering only in the neighborhood of this object. We demonstrate the high efficiency of incremental clustering on a spatial database [Gue 94] as well as on a WWW access log database [MJHS 96].

The rest of this paper is organized as follows. We discuss related work on clustering algorithms in section 2. In section 3, we briefly introduce the clustering algorithm DBSCAN. The algorithms for incrementally updating a clustering on insertions and deletions of the database are presented in section 4 and an extensive performance evaluation is reported in section 5. Section 6 concludes with a summary and some directions for future research.

2 Related Work

The problem of incrementally updating mined patterns after making changes to the database has just recently started to receive more attention.

The task of mining association rules has been introduced by [AS 94]. An association rule is a rule $I_1 \Rightarrow I_2$ where $I_1$ and $I_2$ are disjoint subsets of a set of items $I$. For a given
database DB of transactions (i.e. each record contains a set of items bought by some customer in one transaction), all association rules should be discovered having a support of at least minsupport and a confidence of at least minconfidence in DB. The subsets of $I$ that have at least minsupport in DB are called frequent sets.

[FAAM 97] describes two typical scenarios for mining association rules in a dynamic database. For example, in a medical database, one may seek associations between treatments and results. The database is constantly updated and at any given time, the medical researcher is interested in obtaining the current associations. In a database containing news articles, e.g., patterns of co-occurrence amongst the topics of articles may be of interest. An economic analyst receives a lot of new articles every day and would like to find relevant associations based on all current articles. [CHNW 96] proposes to apply a non-incremental algorithm for mining association rules to the newly inserted database objects, i.e. to the increment of the database, and then to combine the frequent sets of both the database and the increment. The incremental algorithms presented in [FAAM 97] are based on information about the frequency of attribute pairs and border sets, respectively. While the space overhead for keeping track of these frequencies is small, the incremental algorithms yield a speed-up of several orders of magnitude compared to the non-incremental algorithm.

Summarization, e.g., by generalization, is another important task of data mining. Attribute-oriented generalization [HCC 93] of a relation is the process of replacing the attribute values by a more general value, one attribute at a time, until the number of tuples of the relation becomes less than a specified threshold. The more general value is taken from a concept hierarchy which is typically available for most attributes in a data warehouse. [EW 98] presents algorithms for incremental attribute-oriented generalization with the conflicting goals of good
efficiency and minimal overly generalization. The algorithms for incremental insertions and deletions are based on the materialization of a relation at an intermediate generalization level, i.e. the anchor relation. Experiments demonstrate that incremental generalization can be performed efficiently at a low degree of overly generalization.

This paper focuses on the data mining task of clustering and, in the following, we review clustering algorithms from a data mining perspective.

Partitioning algorithms construct a partition of a database DB of n objects into a set of k clusters where k is an input parameter. Each cluster is represented by the center of gravity of the cluster (k-means) or by one of the objects of the cluster located near its center (k-medoid) [KR 90] and each object is assigned to the cluster with its representative closest to the considered object. Typically, partitioning algorithms start with an initial partition of DB and then use an iterative control strategy to optimize the clustering quality, e.g., the average distance of an object to its representative. [NH 94] explores partitioning algorithms for mining in spatial databases. An algorithm called CLARANS (Clustering Large Applications based on RANdomized Search) is introduced which is more effective and more efficient than previous partitioning algorithms.

Hierarchical algorithms create a hierarchical decomposition of DB. The hierarchical decomposition is represented by a dendrogram, a tree that iteratively splits DB into smaller subsets until each subset consists of only one object. In such a hierarchy, each level of the tree represents a clustering of DB. The basic hierarchical clustering algorithm works as follows ([Sib 73], [Bou 96]). Initially, each object is placed in a unique cluster. For each pair of clusters, some value of dissimilarity or distance is computed. For instance, the distance may be the minimum distance of all pairs of points from the two clusters (single-link method). [Bou 96] discusses alternative definitions of the distance and shows that, in general, no one approach outperforms any other in terms of clustering quality. In every step, the clusters with the minimum distance in the current clustering are merged until all points are contained in one cluster.
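
The basic hierarchical procedure described above can be illustrated with a minimal sketch. It assumes points in a Euclidean vector space and uses a naive search over all cluster pairs; the function name `single_link` and the returned merge history are illustrative choices, not part of the cited algorithms.

```python
from itertools import combinations
from math import dist


def single_link(points):
    """Naive agglomerative single-link clustering (illustrative sketch).

    Returns the merge history as (distance, cluster_a, cluster_b) tuples,
    ordered from the first merge to the last; clusters are frozensets of
    point indices.
    """
    # Initially, each object is placed in its own cluster.
    clusters = [frozenset([i]) for i in range(len(points))]

    def link(a, b):
        # Single-link distance: minimum distance over all cross-cluster pairs.
        return min(dist(points[i], points[j]) for i in a for j in b)

    merges = []
    while len(clusters) > 1:
        # Merge the two clusters with the minimum distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: link(*pair))
        merges.append((link(a, b), a, b))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(a | b)
    return merges
```

Each entry of the merge history corresponds to one level of the dendrogram. The repeated scan over all cluster pairs makes this sketch impractical for large databases, which is consistent with the efficiency concerns raised for these algorithms.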

None of the above algorithms is efficient on large databases. Therefore, some focusing techniques have been proposed to increase the efficiency of clustering algorithms. [EKX 95] presents an R*-tree based focusing technique (1) creating a sample of the database that is drawn from each R*-tree data page and (2) applying the clustering algorithm only to that sample.

[ZRL 96] proposes a special data structure to condense information about subclusters of points. A Clustering Feature (CF) is a triple that contains the number of points, the linear sum and the square sum of all points in the cluster. Clustering features are organized in a height balanced tree, i.e. the CF-tree. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) [ZRL 96] is a CF-tree based multiphase clustering method. First, the database is scanned to build an initial in-memory CF-tree. In an optional second phase, this CF-tree can be further reduced until a desired number of leaf nodes is reached. In phase 3, an arbitrary clustering algorithm is used to cluster the CF-values stored in the leaf nodes of the CF-tree. Note that the CF-tree is an incremental structure but phase 3 of BIRCH is non-incremental.

Recently, a new type of single scan clustering algorithms has been introduced. The basic idea of a single scan algorithm is to group neighboring objects of the database into clusters based on a local cluster condition, thus performing only one scan through the database. Single scan clustering algorithms are very efficient if the retrieval of the neighborhood of an object is efficiently supported by the DBMS. Different cluster conditions yield different cluster definitions and algorithms. For instance, DBSCAN (Density Based Spatial Clustering of Applications with Noise) [EKSX 96], [SEKX 98] relies on a density-based notion of clusters.

We use DBSCAN as a base for our incremental clustering algorithm due to the following reasons. First, DBSCAN is one of the most efficient algorithms on large databases. Second, whereas BIRCH is applicable
only to spatial databases (Euclidean vector space), DBSCAN can be applied to any database containing data from a metric space (only assuming a distance function).

3 The Algorithm DBSCAN

The key idea of density-based clustering is that for each object of a cluster the neighborhood of a given radius (Eps) has to contain at least a minimum number of objects (MinPts), i.e. the cardinality of the neighborhood has to exceed some threshold. We will first give a short introduction to DBSCAN including the definitions which are required for incremental clustering. For a detailed presentation of DBSCAN see [EKSX 96].

Definition 1: (directly density-reachable) An object $p$ is directly density-reachable from an object $q$ wrt. Eps and MinPts in the set of objects $D$ if
1) $p \in N_{Eps}(q)$ ($N_{Eps}(q)$ is the subset of $D$ contained in the Eps-neighborhood of $q$)
2) $Card(N_{Eps}(q)) \ge MinPts$.

Definition 2: (density-reachable) An object $p$ is density-reachable from an object $q$ wrt. Eps and MinPts in the set of objects $D$, denoted as $p >_D q$, if there is a chain of objects $p_1, \ldots, p_n$, $p_1 = q$, $p_n = p$ such that $p_i \in D$ and $p_{i+1}$ is directly density-reachable from $p_i$ wrt. Eps and MinPts.

Density-reachability is a canonical extension of direct density-reachability. This relation is transitive, but it is not symmetric. Although not symmetric in general, it is obvious that density-reachability is symmetric for objects $o$ with $Card(N_{Eps}(o)) \ge MinPts$. Two "border objects" of a cluster are possibly not density-reachable from each other because there are not enough objects in their Eps-neighborhoods. However, there must be a third object in the cluster from which both "border objects" are density-reachable. Therefore, we introduce the notion of density-connectivity.

Definition 3: (density-connected) An object $p$ is density-connected to an object $q$ wrt. Eps and MinPts in the set of objects $D$ if there is an object $o \in D$ such that both $p$ and $q$ are density-reachable from $o$ wrt. Eps and MinPts in $D$.

Density-connectivity is a symmetric relation. Figure 1 illustrates the definitions on a sample
database of objects from a 2-dimensional vector space. Note however, that the above definitions only require a distance measure and will also apply to data from a metric space.

[Figure 1: density-reachability and density-connectivity. Left panel: $p$ is density-reachable from $q$, but $q$ is not density-reachable from $p$. Right panel: $p$ and $q$ are density-connected to each other by $o$.]

A cluster is defined as a set of density-connected objects which is maximal wrt. density-reachability and the noise is the set of objects not contained in any cluster.

Definition 4: (cluster) Let $D$ be a set of objects. A cluster $C$ wrt. Eps and MinPts in $D$ is a non-empty subset of $D$ satisfying the following conditions:
1) Maximality: $\forall p, q \in D$: if $p \in C$ and $q >_D p$ wrt. Eps and MinPts, then also $q \in C$.
2) Connectivity: $\forall p, q \in C$: $p$ is density-connected to $q$ wrt. Eps and MinPts in $D$.

Definition 5: (noise) Let $C_1, \ldots, C_k$ be the clusters wrt. Eps and MinPts in $D$. Then, we define the noise as the set of objects in the database $D$ not belonging to any cluster $C_i$, i.e. $noise = \{p \in D \mid \forall i: p \notin C_i\}$.

We omit the term "wrt. Eps and MinPts" in the following whenever it is clear from the context. There are two different kinds of objects in a clustering: core objects (satisfying condition 2 of definition 1) and non-core objects (otherwise). In the following, we will refer to this characteristic of an object as the core object property of the object. The non-core objects in turn are either border objects (not a core object but density-reachable from another core object) or noise objects (not a core object and not density-reachable from other objects).

The algorithm DBSCAN was designed to efficiently discover the clusters and the noise in a database according to
the above definitions. The procedure for finding a cluster is based on the fact that a cluster is uniquely determined by any of its core objects: First, given an arbitrary object $p$ for which the core object condition holds, the set $\{o \mid o >_D p\}$ of all objects $o$ density-reachable from $p$ in $D$ forms a complete cluster $C$ and $p \in C$. Second, given a cluster $C$ and an arbitrary core object $p \in C$, $C$ in turn equals the set $\{o \mid o >_D p\}$ (cf. lemma 1 and 2 in [EKSX 96]).

To find a cluster, DBSCAN starts with an arbitrary object $p$ in $D$ and retrieves all objects of $D$ density-reachable from $p$ with respect to Eps and MinPts. If $p$ is a core object, this procedure yields a cluster with respect to Eps and MinPts. If $p$ is a border object, no objects are density-reachable from $p$ and $p$ is assigned to the noise. Then, DBSCAN visits the next object of the database $D$.

The retrieval of density-reachable objects is performed by successive region queries. A region query returns all objects intersecting a specified query region. Such queries are supported efficiently by spatial access methods such as R*-trees [BKSS 90] for data from a vector space or M-trees [CPZ 97] for data from a metric space. The algorithm DBSCAN is sketched in figure 2.

Algorithm DBSCAN (D, Eps, MinPts)
// Precondition: All objects in D are unclassified.
FORALL objects o in D DO:
    IF o is unclassified
        call function expand_cluster to construct a cluster wrt. Eps and MinPts containing o.

FUNCTION expand_cluster (o, D, Eps, MinPts):
    retrieve the Eps-neighborhood N_Eps(o) of o;
    IF |N_Eps(o)| < MinPts // i.e. o is not a core object
        mark o as noise and RETURN;
    ELSE // i.e. o is a core object
        select a new cluster-id and mark all objects in N_Eps(o) with this current cluster-id;
        push all objects from N_Eps(o)\{o} onto the stack seeds;
        WHILE NOT seeds.empty() DO
            currentObject := seeds.top();
            seeds.pop(); // remove currentObject before new seeds are pushed
            retrieve the Eps-neighborhood N_Eps(currentObject) of currentObject;
            IF |N_Eps(currentObject)| >= MinPts
                select all objects in N_Eps(currentObject) that are not yet classified or are marked as noise,
                push the unclassified objects onto seeds and mark all of these objects with current cluster-id;
        RETURN

Figure 2: Algorithm DBSCAN

4 IncrementalDBSCAN

DBSCAN, as introduced in [EKSX 96], is applied to a static database. In a data warehouse, however, the databases may have frequent updates and thus may be rather dynamic. For example, in a WWW access log database, we may want to find and monitor groups of similar access patterns by clustering the access sequences of different users. These patterns may change over time because each day new log entries are added to the database and old entries (past a user-supplied expiration date) are deleted. After insertions and deletions to the database, the clustering discovered by DBSCAN has to be updated. In section 4.1, we examine which part of an existing clustering is affected by an update of the database. We present algorithms for incremental updates of a clustering after insertions (section 4.2) and deletions (section 4.3). Based on the formal notion of clusters, it can be proven that the incremental algorithm yields the same result as the non-incremental DBSCAN algorithm. This is an important advantage of our approach.

4.1 Affected Objects

We want to show that changes of some clustering of a database $D$ are restricted to a neighborhood of an inserted or deleted object $p$. Objects contained in $N_{Eps}(p)$ can change their core object property, i.e. core objects may become non-core objects and vice versa. The objects contained in $N_{2Eps}(p) \setminus N_{Eps}(p)$ keep their core object property, but non-core objects may change their connection status, i.e. border objects may become noise objects or vice versa, because their Eps-neighborhood may contain objects with a changed core object property. For all objects outside of $N_{2Eps}(p)$, it holds that neither these objects themselves nor objects in their Eps-neighborhood change their core object property. Therefore, the connection status of these objects is unchanged.

After the insertion of some object $p$, non-core objects (border objects or noise objects) in $N_{Eps}(p)$ may become core
objects implying that new density connections may be established, i.e. chains $p_1, \ldots, p_n$, $p_1 = r$, $p_n = s$ with $p_{i+1}$ directly density-reachable from $p_i$ for two objects $r$ and $s$ may arise which were not density-reachable from each other before the insertion. Then, one of the $p_i$ for $i < n$ must be contained in $N_{Eps}(p)$.

When deleting some object $p$, core objects in $N_{Eps}(p)$ may become non-core objects implying that density connections may be removed, i.e. there may no longer be a chain $p_1, \ldots, p_n$, $p_1 = r$, $p_n = s$ with $p_{i+1}$ directly density-reachable from $p_i$ for two objects $r$ and $s$ which were density-reachable from each other before the deletion. Again, one of the $p_i$ for $i < n$ must be contained in $N_{Eps}(p)$.

Figure 3 illustrates our discussion using a sample database of 2D objects and an object $p$ to be inserted or to be deleted. The objects $a$ and $b$ are density-connected wrt. Eps as depicted and MinPts = 4 without using one of the elements of $N_{Eps}(p)$. Therefore, $a$ and $b$ belong to the same cluster independently from $p$. On the other hand, the objects $d$ and $e$ in $D \setminus N_{Eps}(p)$ are only density-connected via $c$ in $N_{Eps}(p)$ if
the object $p$ is present, so that the cluster membership of $d$ and $e$ is affected by $p$.

[Figure 3: Affected objects in a sample database. The figure shows the objects $a$, $b$, $c$, $d$, $e$ and $p$, the neighborhood $N_{Eps}(p)$ and the set $Affected_D(p)$.]

In general, on an insertion or deletion of an object $p$, the set of affected objects, i.e. objects which may potentially change cluster membership after the update, is the set of objects in $N_{Eps}(p)$ plus all objects density-reachable from one of these objects in $D \cup \{p\}$. The cluster membership of all other objects not in the set of affected objects will not change. This is the intuition of the following definition and lemma. In particular, the lemma states that a cluster c in the database is independent of an insertion or deletion of an object $p$ if a core object of the cluster is outside the set $Affected_D(p)$. Note that a cluster is uniquely determined by any of its core objects. Therefore, by definition of $Affected_D(p)$ it follows that if one core object of a cluster is outside (inside) $Affected_D(p)$ then all core objects of the cluster are outside (inside) the set $Affected_D(p)$.

Definition 6: (affected objects) Let $D$ be a database of objects and $p$ be some object (either in or not in $D$). We define the set of objects in $D$ affected by the insertion or deletion of $p$ as
$Affected_D(p) = N_{Eps}(p) \cup \{q \mid \exists o \in N_{Eps}(p) \wedge q >_{D \cup \{p\}} o\}$.

Lemma 1: Let $D$ be a set of objects and $p$ be some object. Then
$\forall o \in D: o \notin Affected_D(p) \Rightarrow \{q \mid q >_{D \setminus \{p\}} o\} = \{q \mid q >_{D \cup \{p\}} o\}$.

Proof (sketch):
1) $\subseteq$: because $D \setminus \{p\} \subseteq D \cup \{p\}$.
2) $\supseteq$: if $q \in \{q \mid q >_{D \cup \{p\}} o\}$, then there is some chain $q_1, \ldots, q_n$, $q_1 = o$, $q_n = q$, $q_{i+1} \in N_{Eps}(q_i)$ and $q_i$ is a core object in $D \cup \{p\}$ for all $i < n$ and, for all $i$, it holds that $q_i >_{D \cup \{p\}} o$. Because $q_i$ is a core object for all $i < n$ and density-reachability is symmetric for core objects, it also holds that $o >_{D \cup \{p\}} q_i$. If there existed an $i < n$ such that $q_i \in N_{Eps}(p)$, then $q_i >_{D \cup \{p\}} p$ implying also $o >_{D \cup \{p\}} p$ due to the transitivity of density-reachability. By definition of the set $Affected_D(p)$ it now follows that $o \in Affected_D(p)$, in contrast to the assumption. Thus, $q_i \notin N_{Eps}(p)$ for all
$i < n$, implying that all the objects $q_i$, $i < n$, are core objects independent of $p$ and also $q_n \ne p$ because otherwise $q_{n-1} \in N_{Eps}(p)$. Thus, the chain $q_1, \ldots, q_n$ exists also in the set $D \setminus \{p\}$ and then $q \in \{q \mid q >_{D \setminus \{p\}} o\}$.

Due to lemma 1, after inserting or deleting an object $p$, it is sufficient to reapply DBSCAN to the set $Affected_D(p)$ in order to update the clustering. For that purpose, however, it is not necessary to retrieve the set first and then apply the clustering algorithm. We simply have to start a restricted version of DBSCAN which does not loop over the whole database to start expanding a cluster but only over certain "seed" objects which are all located in the neighborhood of $p$. These "seed" objects are core objects after the update operation which are located in the Eps-neighborhood of a core object in $D \cup \{p\}$ which in turn is located in $N_{Eps}(p)$. This is the content of the next lemma.

Lemma 2: Let $D$ be a set of objects. Additionally, let $D^* = D \cup \{p\}$ after insertion of an object $p$ or $D^* = D \setminus \{p\}$ after deletion of $p$ and let $c$ be a core object in $D^*$.
$C = \{o \mid o >_{D^*} c\}$ is a cluster in $D^*$ and $C \subseteq Affected_D(p)$ $\Leftrightarrow$ $\exists q, q'$: $q \in N_{Eps}(q')$, $q' \in N_{Eps}(p)$, $c >_D q$, $q$ is core object in $D^*$ and $q'$ is core object in $D \cup \{p\}$.

Proof (sketch): If $D^* = D \cup \{p\}$ or $c \in N_{Eps}(p)$, the lemma is obvious by definition of $Affected_D(p)$. Therefore, we consider only the case $D^* = D \setminus \{p\}$ and $c \notin N_{Eps}(p)$.
"=>": $C \subseteq Affected_D(p)$ and $C \ne \emptyset$. Then, there exists $o \in N_{Eps}(p)$ and $c >_{D \cup \{p\}} o$, i.e. there is a chain of directly density-reachable objects from $o$ to $c$. Now, because $c \notin N_{Eps}(p)$ we can construct a chain $o = o_1, \ldots, o_n = c$, $o_{i+1} \in N_{Eps}(o_i)$ with the property that there is $j \le n$ such that for all $k$, $j \le k \le n$, $o_k \notin N_{Eps}(p)$ and for all $k$, $1 \le k < j$, $o_k \in N_{Eps}(p)$. Then $q = o_j \in N_{Eps}(o_{j-1})$, $q' = o_{j-1} \in N_{Eps}(p)$, $c >_D o_j$, $o_j$ is a core object in $D^*$ and $o_{j-1}$ is a core object in $D \cup \{p\}$.
"<=": obviously, $C = \{o \mid o >_{D^*} c\}$ is a cluster (see the comments on the algorithm after definition 5). By assumption, $c$ is density-reachable from a core object $q$ in $D^*$ and $q$ is density-reachable from an object $q' \in N_{Eps}(p)$ in $D \cup \{p\}$. Then also $c$ and hence all objects in $C$ are density-reachable from $q'$ in $D \cup \{p\}$. Thus, $C \subseteq Affected_D(p)$.

Due to lemma 2, the general strategy for updating a clustering would be to start the DBSCAN algorithm only with core objects that are in the Eps-neighborhood of a (previous) core object in $N_{Eps}(p)$. However, it is not necessary to rediscover density-connections which are known from the previous clustering and which are not changed by the update operation. For that purpose, we only need to look at core objects in the Eps-neighborhood of those objects that change their core object property as a result of the update. In case of an insertion, these objects may be connected after the insertion. In case of a deletion, density connections between them may be lost. In general, this information can be determined by using very few region queries. The remaining information needed to adjust the clustering can be derived from the cluster membership before the update. Definition 7 introduces the formal notions which are necessary to describe this approach. Remember: objects with a changed core object property are all located in $N_{Eps}(p)$.

Definition 7: (seed objects for the update) Let $D$ be a set of objects and $p$ be an object to be inserted or deleted. Then, we define the following notions:
$UpdSeed_{Ins} = \{q \mid q$ is a core object in $D \cup \{p\}$, $\exists q'$: $q'$ is core object in $D \cup \{p\}$ but not in $D$ and $q \in N_{Eps}(q')\}$
$UpdSeed_{Del} = \{q \mid q$ is a core object in $D \setminus \{p\}$, $\exists q'$: $q'$ is core object in $D$ but not in $D \setminus \{p\}$ and $q \in N_{Eps}(q')\}$
We call the objects $q \in UpdSeed$ "seed objects for the update".

Note that these sets can be computed rather efficiently if we additionally store for each object the number of objects in its neighborhood when initially clustering the database. Then, we need only to perform a single region query for the object $p$ to be inserted or deleted to detect all objects $q'$ with a changed core object property (i.e. objects in $N_{Eps}(p)$ with number = MinPts-1 in case of an insertion, objects in $N_{Eps}(p)$ with number = MinPts in case of a deletion). Only for these objects $q'$ (if there are any) do we have to retrieve $N_{Eps}(q')$ to determine all objects $q$ in the set UpdSeed. Since at this point of time the Eps-neighborhood of $p$ is still in main memory, we first check this set for neighbors of $q'$ and perform an additional region query only if there are more objects in the neighborhood of $q'$ than already contained in $N_{Eps}(p)$. Our experiments, however, indicate that objects with a changed core object property after an update (different from the inserted or deleted object $p$) are not very frequent (see section 5). Therefore, in most cases we just have to perform the Eps-neighborhood query for $p$ and to change the counter for the number of objects in the neighborhood of the retrieved objects.

4.2 Insertions

When inserting a new object $p$, new density-connections may be established, but none are removed. In this case, it is sufficient to restrict the application of the clustering procedure to the set $UpdSeed_{Ins}$. If we have to change cluster membership for an object from cluster C to cluster D, we perform the same change of cluster membership for all other objects in C. Changing cluster membership of these objects does not involve the application of the clustering algorithm but can be handled by simply storing the information about which clusters have been merged.

When inserting an object $p$ into the database $D$, we can distinguish the following cases:
(1) (Noise) $UpdSeed_{Ins}$ is empty, i.e. there are no "new" core objects after insertion of $p$. Then, $p$ is a noise object and nothing else is changed.
(2) (Creation) $UpdSeed_{Ins}$ contains only core objects which did not belong to a cluster before the insertion of $p$, i.e. they were noise objects or equal to $p$, and
a new cluster containing these noise objects as well as $p$ is created.
(3) (Absorption) $UpdSeed_{Ins}$ contains core objects which were members of exactly one cluster $C$ before the insertion. The object $p$ and possibly some noise objects are absorbed into cluster $C$.
(4) (Merge) $UpdSeed_{Ins}$ contains core objects which were members of several clusters before the insertion. All these clusters and the object $p$ are merged into one cluster.

Figure 4 illustrates the most simple forms of the different cases when inserting an object $p$ into a sample database of 2D points, using parameters Eps as depicted and MinPts = 3.

[Figure 4: The different cases of the insertion algorithm. Panels: case 1: noise; case 2: creation; case 3: absorption; case 4: merge.]

Figure 5 presents a more complicated example of merging clusters when inserting an object $p$. In this example the value for Eps is as depicted and MinPts = 6. Then, the inserted point $p$ is not a core object, but $o_1$, $o_2$, $o_3$ and $o_4$ are core objects after the update. The previous clustering can be adapted by analyzing only the Eps-neighborhood of these objects: cluster A is merged with cluster B and C because $o_1$ and $o_4$ as well as $o_2$ and $o_3$ are mutually directly density-reachable, implying the merge of B and C. The changing of cluster membership for objects in case of merging clusters can be done very efficiently by simply storing the information about the clusters that have been merged. Note that this kind of "transitive" merging can only occur if MinPts is larger than 5, because otherwise $p$ would be a core object and then all objects in $N_{Eps}(p)$ would already be density-reachable from $p$.

[Figure 5: "Transitive" merging of clusters A, B, C by the insertion algorithm. The figure shows the inserted object $p$, the objects $o_1$, $o_2$, $o_3$, $o_4$, and the objects from clusters A, B and C.]
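
The insertion cases can be made concrete with a small, insert-only sketch. It rests on several assumptions that are not part of the text above: a linear scan stands in for the index-supported region query, merged clusters are relabeled eagerly instead of storing merge information, and the names `IncrementalClustering`, `region_query` and `insert` are hypothetical. The sketch joins clusters per new core object, so that only clusters actually connected through a new core object are merged. It also assigns a non-core object p to the cluster of an existing core neighbor when no new core objects arise; this border handling is an added assumption, chosen so that the labels agree with a from-scratch DBSCAN run.

```python
from math import dist

NOISE = -1  # label for objects that belong to no cluster


class IncrementalClustering:
    """Insert-only sketch; a linear scan stands in for an index region query."""

    def __init__(self, eps, min_pts):
        self.eps, self.min_pts = eps, min_pts
        self.points = []   # all inserted objects
        self.count = []    # stored neighborhood sizes |N_Eps(o)|, o included
        self.label = []    # cluster id per object, or NOISE
        self.next_id = 0

    def region_query(self, i):
        p = self.points[i]
        return [j for j, q in enumerate(self.points) if dist(p, q) <= self.eps]

    def is_core(self, i):
        return self.count[i] >= self.min_pts

    def insert(self, p):
        self.points.append(p)
        self.count.append(0)
        self.label.append(NOISE)
        i = len(self.points) - 1
        neighbors = self.region_query(i)           # N_Eps(p), contains p itself
        self.count[i] = len(neighbors)
        new_cores = [i] if self.is_core(i) else []
        for j in neighbors:
            if j != i:
                self.count[j] += 1
                if self.count[j] == self.min_pts:  # counter was MinPts-1
                    new_cores.append(j)
        for q_new in new_cores:
            hood = self.region_query(q_new)
            # Cluster ids of the labeled core objects around the new core object.
            ids = {self.label[q] for q in hood
                   if self.is_core(q) and self.label[q] != NOISE}
            if ids:                     # absorption (one id) or merge (several)
                target = min(ids)
            else:                       # creation of a new cluster
                target = self.next_id
                self.next_id += 1
            if len(ids) > 1:            # merge: relabel the absorbed clusters
                self.label = [target if lab in ids else lab for lab in self.label]
            for q in hood:              # former noise joins the cluster
                if self.label[q] == NOISE:
                    self.label[q] = target
        if self.label[i] == NOISE:
            # Assumption: p is a border object if an existing core object is near.
            for j in neighbors:
                if j != i and self.is_core(j):
                    self.label[i] = self.label[j]
                    break
        return self.label[i]
```

The stored counters make it possible to detect objects with a changed core object property from the single region query for p; additional region queries are issued only for those objects. The eager relabeling step costs time linear in the database size and is used here only to keep the sketch short.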

4.3 Deletions

As opposed to an insertion, when deleting an object $p$, density-connections may be removed, but no new connections are established. The difficult case for deletion occurs when the cluster $C$ of $p$ is no longer density-connected via (previous) core objects in $N_{Eps}(p)$ after deleting $p$. In this case, we do not know in general how many objects we have to check before it can be determined whether $C$ has to be split or not. In most cases, however, this set of objects is very small because the split of a cluster is not very frequent and in general a non-split situation will be detected in a small neighborhood of the deleted object $p$.

When deleting an object $p$ from the database $D$ we can distinguish the following cases:
(1) (Removal) $UpdSeed_{Del}$ is empty, i.e. there are no core objects in the neighborhood of objects that may have lost their core object property after the deletion of $p$. Then $p$ is deleted from $D$ and possibly other objects in $N_{Eps}(p)$ change from a former cluster $C$ to noise. If this happens, the cluster $C$ is completely removed because then $C$ cannot have core objects outside of $N_{Eps}(p)$.
(2) (Reduction) All objects in $UpdSeed_{Del}$ are directly density-reachable from each other. Then $p$ is deleted from $D$ and some objects in $N_{Eps}(p)$ may become noise.
(3) (Potential Split) The objects in $UpdSeed_{Del}$ are not directly density-reachable from each other. These objects belonged to exactly one cluster $C$ before the deletion of $p$. Now we have to check whether or not these objects are density-connected by other objects in the former cluster $C$. Depending on the existence of such density-connections, we can distinguish a split and a non-split situation.

Figure 6 illustrates the different cases when deleting $p$ from a sample database of 2D points using parameters Eps as depicted and MinPts = 3. Note that the situations described in case 3 may occur simultaneously.

[Figure 6: The different cases of the deletion algorithm. Panels: case 1: removal; case 2: reduction; case 3: split and no split.]

If case (3) occurs, then the
clustering procedure must also consider objects outside of $UpdSeed_{Del}$, but it stops in case of a non-split situation as soon as the objects from the set $UpdSeed_{Del}$ are density-connected to each other. Case (3) is implemented by a procedure similar to the function expand_cluster in algorithm DBSCAN (see figure 2) starting in parallel from the elements of the set $UpdSeed_{Del}$. The main difference is that the candidates for further expansion are managed in a queue instead of a stack. Thus, a breadth-first search for the missing density-connections is performed which is more efficient than a depth-first search due to the following reasons:

- In a non-split situation, we stop as soon as all members of $UpdSeed_{Del}$ are found to be density-connected to each other. The breadth-first search implies that density-connections with the minimum number of objects (requiring the minimum number of region queries) are detected first.
- A split situation is in general the more expensive case because the parts of the cluster to be split actually have to be discovered. The algorithm stops when all but the last part have been visited. Usually, a cluster is split only into two parts and one of them is relatively small. Using breadth-first search we only have to visit the smaller part and a small percentage of the larger one.

5 Performance Evaluation

In this section, we evaluate the efficiency of IncrementalDBSCAN versus DBSCAN. We present an experimental evaluation using a 2D spatial database as well as a WWW access log database. For this purpose, we implemented both algorithms in C++ based on implementations of the R*-tree [BKSS 90] (for the 2D spatial database) and the M-tree [CPZ 97] (for the WWW log database), respectively. Furthermore, we present an analytical comparison of both algorithms and derive the speed-up factors for typical parameter values depending on the database size and the number of updates.

For the first set of experiments, we used a synthetic database of 1,000,000 2D points with k = 40 clusters of similar sizes. 21.7% of all points are
noise, uniformly distributed outside of the clusters, and all other points are uniformly distributed inside the clusters with a significantly higher density than the noise. In this database, the goal of clustering is to discover groups of neighboring objects. A typical real world application for this type of database is clustering earthquake epicenters stored in an earthquake catalog. Earthquake epicenters occur along seismically active faults, and are measured with some errors, so that over time observed earthquake epicenters should be clustered along such seismic faults [AF 96]. In this type of application, there are only insertions.

The Euclidean distance was used as distance function and an R*-tree [BKSS 90] as an index structure. Eps was set to 4.48 and MinPts was set to 30. Note that the MinPts value had to be rather large due to the high percentage of noise. We performed experiments on several other synthetic 2D databases with n varying from 100,000 to 1,000,000, k varying from 7 to 40 and with the noise percentage varying from 10% up to 20%. Since we always obtained similar results, we restrict the discussion to the above database.
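To make the role of the two parameters concrete, the following Python sketch shows a region query under the Euclidean distance and the resulting core object test with Eps = 4.48 and MinPts = 30. It is a minimal illustration only: the function names and the tiny grid data set are hypothetical, and a linear scan stands in for the R*-tree that the experiments actually used.

```python
import math

def region_query(points, center, eps):
    """Return the indices of all points within distance eps of center.

    Linear scan for illustration; an index structure such as an R*-tree
    would answer the same query without visiting every point.
    """
    return [i for i, q in enumerate(points) if math.dist(center, q) <= eps]

def is_core_object(points, index, eps, min_pts):
    """A point is a core object if its Eps-neighborhood (which includes the
    point itself) contains at least min_pts points."""
    return len(region_query(points, points[index], eps)) >= min_pts

EPS = 4.48     # parameter value reported for the 2D spatial database
MIN_PTS = 30   # parameter value reported for the 2D spatial database

# Hypothetical data: a dense 6 x 6 grid with spacing 1.0 plus one distant point.
grid = [(float(x), float(y)) for x in range(6) for y in range(6)]
points = grid + [(100.0, 100.0)]

center_index = 14  # the grid point (2.0, 2.0)
corner_index = 0   # the grid point (0.0, 0.0)
print(is_core_object(points, center_index, EPS, MIN_PTS))
print(is_core_object(points, corner_index, EPS, MIN_PTS))
```

In this toy data set the interior grid point sees all 36 grid points within Eps and is therefore a core object, while the corner point sees only 22 and is not, which illustrates how a large MinPts value suppresses sparse regions.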

romblon.informatik.uni-muenchen.de loa - [04/Mar/1997:01:44: ] "GET /~loa/ HTTP/1.0"
romblon.informatik.uni-muenchen.de loa - [04/Mar/1997:01:45: ] "GET /~loa// HTTP/1.0"
fiersegacoj unknown - [04/Mar/1997:01:58: ] "GET /dbs/orada.html HTTP/1.0"
scootera-deccom unknown - [04/Mar/1997:02:08: ] "GET /dbs/kriegel_e.html HTTP/1.0"

Figure 7: Sample WWW access log entries

For the second set of experiments, we used a WWW access log database of the Institute for Computer Science of the University of Munich. This database contains 1,400,000 entries following the Common Log Format specified as part of the HTTP protocol [Luo 95]. Figure 7 depicts some sample log entries. All log entries with identical IP address and user id within a given maximum time gap are grouped into a session, and redundant entries, i.e. entries with filename suffixes such as .gif, .jpeg, and .jpg, are removed [MJHS 96]. A session has the following structure:

session ::= <ip_address, user_id, [url_1, ..., url_k]>

In this application, the goal of clustering is to discover groups of similar sessions. A WWW provider may use the discovered clusters as follows:

- The users associated with the sessions of a cluster form some kind of user group which may be used to develop marketing strategies.
- The URLs of the sessions contained in a cluster seem to be logically correlated and should be made easily accessible from each other via appropriate links.

Entries are deleted from the WWW access log database after six months. Assuming a constant daily number of WWW accesses, the numbers of insertions and deletions are the same. We used the following distance function for pairs of sessions $s_1$ and $s_2$:

$$dist(s_1, s_2) = \frac{Cardinality(s_1 \setminus s_2) + Cardinality(s_2 \setminus s_1)}{Cardinality(s_1) + Cardinality(s_2)}$$

The domain of dist is the interval [0, 1], dist(s, s) = 0, dist is symmetric and it fulfills the triangle inequality. Other distance functions may use the hierarchy of the directories to define the degree of similarity between two URLs. The database was indexed by an M-tree
[CPZ 97]. Eps was set to 0.4 and MinPts to 2.

In the following, we compare the performance of IncrementalDBSCAN versus DBSCAN. Typically, the number of page accesses is used as a cost measure for database algorithms because the I/O time heavily dominates CPU time. In both algorithms, region queries are the only operations requiring page accesses. Since the number of page accesses of a single region query is the same for DBSCAN and for IncrementalDBSCAN, we only have to compare the number of region queries. Thus, we use the number of region queries as the cost measure for our comparison. Note that we are not interested in the absolute performance of the two algorithms but only in their relative performance, i.e. in the speed-up factor as defined below. To validate this approach, we performed a set of experiments on our test databases and found that the experimental speed-up factor always was slightly larger than the analytically derived speed-up factor (experimental value 1.6 times the expected value in all experiments).

DBSCAN performs exactly one region query for each of the n objects of the database (see algorithm in figure 2), i.e. the cost of DBSCAN for clustering n objects, denoted by $Cost_{DBSCAN}(n)$, is

$$Cost_{DBSCAN}(n) = n$$

The number of region queries performed by IncrementalDBSCAN depends on the application and, therefore, it must be determined experimentally. In general, a deletion affects more objects than an insertion. Thus, we introduce two parameters $r_{ins}$ and $r_{del}$ denoting the average number of region queries for an incremental insertion and deletion, respectively. Let $f_{ins}$ and $f_{del}$ denote the percentage of insertions and deletions, respectively, in the number of all incremental updates. Then, the cost of IncrementalDBSCAN for performing m incremental updates, denoted by $Cost_{IncrementalDBSCAN}(m)$, is as follows:

$$Cost_{IncrementalDBSCAN}(m) = m \cdot (f_{ins} \cdot r_{ins} + f_{del} \cdot r_{del})$$

Table 1 lists the parameters of our performance evaluation and the values obtained for the 2D spatial as well as for the WWW-log database. To determine the average values ($r_{ins}$ and $r_{del}$), the whole databases were incrementally inserted and deleted, although $f_{del} = 0$ for the 2D spatial database.

Table 1: Parameters of the performance evaluation

Parameter | Meaning                                                        | Value for 2D spatial | Value for WWW-log
n         | number of database objects                                     | 1,000,000            | 69,000
m         | number of (incremental) updates                                | varying              | varying
r_ins     | average number of region queries for an incremental insertion |                      |
r_del     | average number of region queries for an incremental deletion  |                      |
f_del     | relative frequency of deletions in the number of all updates  | 0                    | 0.5
f_ins     | relative frequency of insertions in the number of all updates | 1 - f_del            | 1 - f_del

Now, we can calculate the speed-up factor of IncrementalDBSCAN versus DBSCAN. We define the speed-up factor as the ratio of the cost of DBSCAN (applied to the database after all insertions and deletions) and the cost of m calls of IncrementalDBSCAN (once for each of the insertions and deletions), i.e.:

$$SpeedupFactor = \frac{Cost_{DBSCAN}(n + f_{ins} \cdot m - f_{del} \cdot m)}{Cost_{IncrementalDBSCAN}(m)} = \frac{n + f_{ins} \cdot m - f_{del} \cdot m}{m \cdot (f_{ins} \cdot r_{ins} + f_{del} \cdot r_{del})}$$

Figure 8 and figure 9 depict the speed-up factors depending on n for several values of m. For relatively small numbers of daily updates, e.g., m = 1,000 and n = 1,000,000, we obtain speed-up factors of 633 for the 2D spatial database and 260 for the WWW-log database. Even for rather large numbers of daily updates, e.g., m = 25,000 and n = 1,000,000, IncrementalDBSCAN yields speed-up factors of 26 for the 2D spatial database and 10 for the WWW-log database.

Figure 8: Speed-up factors for 2D spatial database (x-axis: size of database (n); y-axis: speed-up factor; one curve per number of updates m: 1,000; 5,000; 10,000; 25,000; and larger values)

Figure 9: Speed-up factors for WWW-log database (x-axis: size of database (n); y-axis: speed-up factor; one curve per number of updates m: 1,000; 5,000; 10,000; 25,000; and larger values)

When setting the speed-up factor to 1.0, we obtain the number of updates (denoted by MaxUpdates) up to which the multiple application of IncrementalDBSCAN for each update is
more efficient than the single application of DBSCAN to the whole updated database. Figure 10 depicts the values of MaxUpdates depending on n for $f_{del}$ values of up to 0.5, which is the maximum value to be expected in most real applications. This figure was derived by setting $r_{ins}$ to 1.34 and $r_{del}$ to 6.75, which are the averages over the respective values obtained for our test databases. Note that, in contrast to the significant differences of other characteristics of the two applications, the differences of both $r_{ins}$ and $r_{del}$ are rather small, indicating that the average values are a realistic choice for many applications. The MaxUpdates values obtained are much larger than the actual numbers of daily updates in most real databases. For databases without deletions (that is, $f_{del} = 0$), MaxUpdates is approximately 3 * n, i.e. the cost for 3 * n updates on a database of n objects using IncrementalDBSCAN is the same as the cost of DBSCAN on the updated database containing 4 * n objects. Even in the worst case of $f_{del} = 0.5$, MaxUpdates is approximately 0.25 * n. These results clearly emphasize the relevance of incremental clustering.
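The cost model of this section can be checked numerically. The following Python sketch is illustrative only: the function names are not from the paper, and the closed form for MaxUpdates is obtained here by setting the speed-up factor to 1.0 and solving for m, under the assumption that the average cost per update exceeds f_ins - f_del so that a positive solution exists.

```python
def cost_dbscan(n):
    """DBSCAN performs one region query per object."""
    return n

def cost_incremental(m, f_del, r_ins, r_del):
    """Region queries for m incremental updates; f_ins is taken as 1 - f_del."""
    f_ins = 1.0 - f_del
    return m * (f_ins * r_ins + f_del * r_del)

def speedup_factor(n, m, f_del, r_ins, r_del):
    """Cost of DBSCAN on the updated database divided by the cost of m updates."""
    f_ins = 1.0 - f_del
    updated_size = n + f_ins * m - f_del * m
    return cost_dbscan(updated_size) / cost_incremental(m, f_del, r_ins, r_del)

def max_updates(n, f_del, r_ins, r_del):
    """Number of updates m at which the speed-up factor equals 1.0.

    From n + (f_ins - f_del) * m = m * c with c = f_ins * r_ins + f_del * r_del
    it follows that m = n / (c - (f_ins - f_del)).
    """
    f_ins = 1.0 - f_del
    per_update = f_ins * r_ins + f_del * r_del
    return n / (per_update - (f_ins - f_del))

# Averaged values quoted in the text for r_ins and r_del.
R_INS, R_DEL = 1.34, 6.75
n = 1_000_000
for f_del in (0.0, 0.5):
    print(f_del, max_updates(n, f_del, R_INS, R_DEL) / n)
```

With these averaged values the ratio MaxUpdates / n comes out at about 2.94 for f_del = 0 and about 0.247 for f_del = 0.5, in line with the approximations 3 * n and 0.25 * n stated above.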

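The session distance function used for the WWW-log experiments in this section can also be written out in a few lines. The sketch below is an illustration under stated assumptions: each session is reduced to the set of its distinct URLs (so Cardinality counts distinct URLs), the function name and the example URLs are hypothetical, and the handling of two empty sessions is a choice made here, not something the paper specifies.

```python
def session_distance(urls_1, urls_2):
    """Normalized symmetric difference of two sessions' URL sets.

    dist = (|s1 \\ s2| + |s2 \\ s1|) / (|s1| + |s2|), a value in [0, 1].
    """
    s1, s2 = set(urls_1), set(urls_2)
    if not s1 and not s2:
        return 0.0  # assumption: two empty sessions count as identical
    return (len(s1 - s2) + len(s2 - s1)) / (len(s1) + len(s2))

# Hypothetical sessions sharing two of their three URLs.
session_a = ["/dbs/", "/dbs/a.html", "/dbs/b.html"]
session_b = ["/dbs/a.html", "/dbs/b.html", "/dbs/c.html"]

EPS = 0.4  # parameter value reported for the WWW-log database
d = session_distance(session_a, session_b)
print(d, d <= EPS)
```

Here the two sessions differ in one URL each, giving a distance of 2/6, which lies within Eps = 0.4, so each session would fall into the other's Eps-neighborhood. Identical sessions have distance 0 and sessions without common URLs have distance 1.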
Figure 10: MaxUpdates depending on database size for different relative frequencies of deletions (x-axis: size of database (n); y-axis: MaxUpdates; one curve per relative frequency of deletions (f_del))

6 Conclusions

Data warehouses provide a great deal of opportunities for performing data mining tasks such as classification and clustering. Typically, updates are collected and applied to the data warehouse periodically in a batch mode, e.g., during the night. Then, all patterns derived from the warehouse by some data mining algorithm have to be updated as well.

In this paper, we presented the first incremental clustering algorithm, based on DBSCAN, for mining in a data warehousing environment. DBSCAN requires only a distance function and, therefore, it is applicable to any database containing data from a metric space. Due to the density-based nature of DBSCAN, the insertion or deletion of an object affects the current clustering only in a small neighborhood of this object. Thus, efficient algorithms could be given for incremental insertions and deletions to a clustering. Based on the formal definition of clusters, it was proven that the incremental algorithm yields the same result as DBSCAN.

A performance evaluation of IncrementalDBSCAN versus DBSCAN using a spatial database as well as a WWW-log database was presented, demonstrating the efficiency of the proposed algorithm. For relatively small numbers of daily updates, e.g., 1,000 updates in a database of size 1,000,000, IncrementalDBSCAN yielded speed-up factors of several hundred. Even for rather large numbers of daily updates, e.g., 25,000 updates in a database of 1,000,000 objects, we obtained speed-up factors of more than 10 versus DBSCAN.

In this paper, we assumed that the parameter values Eps and MinPts of DBSCAN do not change significantly when inserting and deleting objects. However, there may be applications where this assumption does not hold, i.e. the parameters may change after many updates of the database. In our future work, we plan to investigate this case. In this paper, sets of updates are processed one at a time without considering the relationships between the single updates. In the future, bulk insertions and deletions will be considered to further improve the efficiency of IncrementalDBSCAN.

Acknowledgments

We thank Marco Patella for the M-tree implementation and Franz Krojer for providing us with the WWW access log database.

References

[AF 96] Allard D. and Fraley C.: Non Parametric Maximum Likelihood Estimation of Features in Spatial Point Process Using Voronoi Tessellation, Journal of the American Statistical Association, December 1997. [also http://www.stat.washington.edu/techreports/ tr293rs]

[AS 94] Agrawal R., Srikant R.: Fast Algorithms for Mining Association Rules, Proc. 20th Int. Conf. on Very Large Data Bases, Santiago, Chile, 1994.

[BKSS 90] Beckmann N., Kriegel H.-P., Schneider R., Seeger B.: The R*-tree: An Efficient and Robust Access Method for Points and Rectangles, Proc. ACM SIGMOD Int. Conf. on Management of Data, Atlantic City, NJ, 1990.

[Bou 96] Bouguettaya A.: On-Line Clustering, IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 2, 1996.

[CHNW 96] Cheung D. W., Han J., Ng V. T., Wong Y.: Maintenance of Discovered Association Rules in Large Databases: An Incremental Technique, Proc. 12th Int. Conf. on Data Engineering, New Orleans, USA, 1996.

[CPZ 97] Ciaccia P., Patella M., Zezula P.: M-tree: An Efficient Access Method for Similarity Search in Metric Spaces, Proc. 23rd Int. Conf. on Very Large Data Bases, Athens, Greece, 1997.

[EKSX 96] Ester M., Kriegel H.-P., Sander J., Xu X.: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, 1996.

[EKX 95] Ester M., Kriegel H.-P., Xu X.: Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification, Proc. 4th Int. Symp. on Large Spatial Databases, Portland, ME, 1995, in: Lecture Notes in Computer Science, Vol. 951, Springer, 1995, pp. 67-82.

[EW 98] Ester M., Wittmann R.: Incremental Generalization for Mining in a Data Warehousing Environment, Proc. 6th Int. Conf. on Extending Database Technology, Valencia, Spain, 1998, in: Lecture Notes in Computer Science, Vol. 1377, Springer, 1998.

[FAAM 97] Feldman R., Aumann Y., Amir A., Mannila H.: Efficient Algorithms for Discovering Frequent Sets in Incremental Databases, Proc. ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, AZ, 1997.

[FPS 96] Fayyad U., Piatetsky-Shapiro G., and Smyth P.: Knowledge Discovery and Data Mining: Towards a Unifying Framework, Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, 1996.

[Gue 94] Gueting R. H.: An Introduction to Spatial Database Systems, The VLDB Journal, Vol. 3, No. 4, October 1994.

[HCC 93] Han J., Cai Y., Cercone N.: Data-driven Discovery of Quantitative Rules in Relational Databases, IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 1, 1993.

[Huy 97] Huyn N.: Multiple-View Self-Maintenance in Data Warehousing Environments, Proc. 23rd Int. Conf. on Very Large Data Bases, Athens, Greece, 1997.

[KR 90] Kaufman L., Rousseeuw P. J.: Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.

[Luo 95] Luotonen A.: The common log file format, http://www.w3.org/pub/www/, 1995.

[MJHS 96] Mobasher B., Jain N., Han E.-H., Srivastava J.: Web Mining: Pattern Discovery from World Wide Web Transactions, Technical Report, University of Minnesota, 1996.

[MQM 97] Mumick I. S., Quass D., Mumick B. S.: Maintenance of Data Cubes and Summary Tables in a Warehouse, Proc. ACM SIGMOD Int. Conf. on Management of Data, 1997.

[NH 94] Ng R. T., Han J.: Efficient and Effective Clustering Methods for Spatial Data Mining, Proc. 20th Int. Conf. on Very Large Data Bases, Santiago, Chile, 1994.

[SEKX 98] Sander J., Ester M., Kriegel H.-P., Xu X.: Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and its Applications, will appear in: Data Mining and Knowledge Discovery, Kluwer Academic Publishers, Vol. 2, 1998.

[Sib 73] Sibson R.: SLINK: an optimally efficient algorithm for the single-link cluster method, The Computer Journal, Vol. 16, No. 1, 1973.

[ZRL 96] Zhang T., Ramakrishnan R., Livny M.: BIRCH: An Efficient Data Clustering Method for Very Large Databases, Proc. ACM SIGMOD Int. Conf. on Management of Data, 1996.