Incremental Clustering for Mining in a Data Warehousing Environment

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Michael Wimmer, Xiaowei Xu
Institute for Computer Science, University of Munich
Oettingenstr. 67, D München, Germany
{ester | kriegel | sander | wimmerm | xwxu}@informatik.uni-muenchen.de

Abstract

Data warehouses provide a great deal of opportunities for performing data mining tasks such as classification and clustering. Typically, updates are collected and applied to the data warehouse periodically in a batch mode, e.g., during the night. Then, all patterns derived from the warehouse by some data mining algorithm have to be updated as well. Due to the very large size of the databases, it is highly desirable to perform these updates incrementally. In this paper, we present the first incremental clustering algorithm. Our algorithm is based on the clustering algorithm DBSCAN which is applicable to any database containing data from a metric space, e.g., to a spatial database or to a WWW-log database. Due to the density-based nature of DBSCAN, the insertion or deletion of an object affects the current clustering only in the neighborhood of this object. Thus, efficient algorithms can be given for incremental insertions and deletions to an existing clustering. Based on the formal definition of clusters, it can be proven that the incremental algorithm yields the same result as DBSCAN. A performance evaluation of IncrementalDBSCAN on a spatial database as well as on a WWW-log database is presented, demonstrating the efficiency of the proposed algorithm. IncrementalDBSCAN yields significant speed-up factors over DBSCAN even for large numbers of daily updates in a data warehouse.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 24th VLDB Conference, New York, USA, 1998

1 Introduction

Many companies have recognized the strategic importance of the knowledge hidden in their large databases and, therefore, have built data warehouses. A data warehouse is a collection of data from multiple sources, integrated into a common repository and extended by summary information (such as aggregate views) for the purpose of analysis [MQM 97]. When speaking of a data warehousing environment, we do not anticipate any special architecture but we address an environment with the following two characteristics:
(1) Derived information is present for the purpose of analysis.
(2) The environment is dynamic, i.e. many updates occur.
In such an environment, either manual analyses supported by appropriate visualization tools or (semi)automatic data mining may be performed.

Data mining has been defined as the application of data analysis and discovery algorithms that - under acceptable computational efficiency limitations - produce a particular enumeration of patterns over the data [FPS 96]. Several data mining tasks have been identified [FPS 96], e.g., clustering, classification and summarization. Typical results of data mining are as follows:
- Clusters of items which are typically bought together by some set of customers (clustering in a data warehouse storing sales transactions).
- Symptoms distinguishing disease A from disease B (classification in a medical data warehouse).
- Description of the typical WWW access patterns (summarization in the data warehouse of an internet provider).

The task considered in this paper is clustering [KR 90], i.e. grouping the objects of a database into meaningful subclasses. Recently, several clustering algorithms for mining in large databases have been developed [NH 94], [ZRL 96], [EKSX 96].

Typically, a data warehouse is not updated immediately when insertions and deletions on the operational databases occur. Updates are collected and applied to the data warehouse
periodically in a batch mode, e.g., each night [MQM 97]. Then, all patterns derived from the warehouse by data mining algorithms have to be updated as well. This update must be efficient enough to be finished when the warehouse has to be available for users again, e.g., the next morning. Due to the very large size of the databases, it is highly desirable to perform these updates incrementally ([FAAM 97], [Huy 97]), so as to consider only the old clusters and the objects inserted or deleted during the day, instead of applying the clustering algorithm to the (very large) updated database.

Maintenance of derived information such as views and summary tables has been an active area of research [MQM 97], [Huy 97]. The problem of incrementally updating mined patterns on changes of the database, however, has just recently started to receive more investigation. [CHNW 96] and [FAAM 97] propose efficient methods for incrementally modifying a set of association rules mined from a database. [EW 98] introduces generalization algorithms for incremental summarization in a data warehousing environment.

In this paper, we present the first incremental clustering algorithm. Our algorithm is based on DBSCAN [EKSX 96], [SEKX 98] which is an efficient clustering algorithm for metric databases (that is, databases with a distance function for pairs of objects) for mining in a data warehousing environment. Due to the density-based nature of DBSCAN, the insertion or deletion of an object affects the current clustering only in the neighborhood of this object. We demonstrate the high efficiency of incremental clustering on a spatial database [Gue 94] as well as on a WWW access log database [MJHS 96].

The rest of this paper is organized as follows. We discuss related work on clustering algorithms in section 2. In section 3, we briefly introduce the clustering algorithm DBSCAN. The algorithms for incrementally updating a clustering on insertions and deletions of the database are presented in section 4 and an extensive performance evaluation is reported in section 5. Section 6 concludes with a summary and some directions for future research.

2 Related Work

The problem of incrementally updating mined patterns after making changes to the database has just recently started to receive more attention.

The task of mining association rules has been introduced by [AS 94]. An association rule is a rule $I_1 \Rightarrow I_2$ where $I_1$ and $I_2$ are disjoint subsets of a set of items $I$. For a given
database DB of transactions (i.e. each record contains a set of items bought by some customer in one transaction), all association rules should be discovered having a support of at least minsupport and a confidence of at least minconfidence in DB. The subsets of $I$ that have at least minsupport in DB are called frequent sets.

[FAAM 97] describes two typical scenarios for mining association rules in a dynamic database. For example, in a medical database, one may seek associations between treatments and results. The database is constantly updated and at any given time, the medical researcher is interested in obtaining the current associations. In a database containing news articles, e.g., patterns of co-occurrence amongst the topics of articles may be of interest. An economic analyst receives many new articles every day and would like to find relevant associations based on all current articles. [CHNW 96] proposes to apply a non-incremental algorithm for mining association rules to the newly inserted database objects, i.e. to the increment of the database, and then to combine the frequent sets of both the database and the increment. The incremental algorithms presented in [FAAM 97] are based on information about the frequency of attribute pairs and border sets respectively. While the space overhead for keeping track of these frequencies is small, the incremental algorithms yield a speed-up of several orders of magnitude compared to the non-incremental algorithm.

Summarization, e.g., by generalization, is another important task of data mining. Attribute-oriented generalization [HCC 93] of a relation is the process of replacing the attribute values by a more general value, one attribute at a time, until the number of tuples of the relation becomes less than a specified threshold. The more general value is taken from a concept hierarchy which is typically available for most attributes in a data warehouse. [EW 98] presents algorithms for incremental attribute-oriented generalization with the conflicting goals of good efficiency and minimal overly generalization. The algorithms for incremental insertions and deletions are based on the materialization of a relation at an intermediate generalization level, i.e. the anchor relation. Experiments demonstrate that incremental generalization can be performed efficiently at a low degree of overly generalization.

This paper focuses on the data mining task of clustering and, in the following, we review clustering algorithms from a data mining perspective.

Partitioning algorithms construct a partition of a database DB of n objects into a set of k clusters where k is an input parameter. Each cluster is represented by the center of gravity of the cluster (k-means) or by one of the objects of the cluster located near its center (k-medoid) [KR 90] and each object is assigned to the cluster with its representative closest to the considered object. Typically, partitioning algorithms start with an initial partition of DB and then use an iterative control strategy to optimize the clustering quality, e.g., the average distance of an object to its representative. [NH 94] explores partitioning algorithms for mining in spatial databases. An algorithm called CLARANS (Clustering Large Applications based on RANdomized Search) is introduced which is more effective and more efficient than previous partitioning algorithms.

Hierarchical algorithms create a hierarchical decomposition of DB. The hierarchical decomposition is represented by a dendrogram, a tree that iteratively splits DB into smaller subsets until each subset consists of only one object. In such a hierarchy, each level of the tree represents a clustering of DB. The basic hierarchical clustering algorithm works as follows ([Sib 73], [Bou 96]). Initially, each object is placed in a unique cluster. For each pair of clusters, some value of dissimilarity or distance is computed. For instance, the distance may be the minimum distance of all pairs of points from the two clusters (single-link method). [Bou 96] discusses alternative definitions of the distance and shows that, in general, no one approach outperforms any other in terms of clustering quality. In every step, the clusters with the minimum distance in the current clustering are merged until all points are contained in one cluster.
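
The basic hierarchical procedure described above can be sketched in a few lines. The following is a minimal illustration of the single-link method, not code from the cited works: the function name single_link, the stopping parameter k and the use of Euclidean distance on coordinate tuples are assumptions made for this example. It recomputes all pairwise cluster distances in every step, so it is only suitable for very small inputs.

```python
from itertools import combinations
from math import dist


def single_link(points, k=1):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [[p] for p in points]  # initially, each object is its own cluster
    merges = []                       # (distance, merged cluster) for every step
    while len(clusters) > max(k, 1):
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # single-link distance: minimum over all pairs of points
            d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        merges.append((d, merged))
    return clusters, merges
```

The list of merges, read in order, corresponds to the levels of the dendrogram from the leaves upwards; running the loop down to k = 1 yields the complete hierarchy.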

None of the above algorithms is efficient on large databases. Therefore, some focusing techniques have been proposed to increase the efficiency of clustering algorithms. [EKX 95] presents an R*-tree based focusing technique (1) creating a sample of the database that is drawn from each R*-tree data page and (2) applying the clustering algorithm only to that sample.

[ZRL 96] proposes a special data structure to condense information about subclusters of points. A Clustering Feature (CF) is a triple that contains the number of points, the linear sum and the square sum of all points in the cluster. Clustering features are organized in a height balanced tree, i.e. the CF-tree. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) [ZRL 96] is a CF-tree based multiphase clustering method. First, the database is scanned to build an initial in-memory CF-tree. In an optional second phase, this CF-tree can be further reduced until a desired number of leaf nodes is reached. In phase 3 an arbitrary clustering algorithm is used to cluster the CF-values stored in the leaf nodes of the CF-tree. Note that the CF-tree is an incremental structure but phase 3 of BIRCH is non-incremental.

Recently, a new type of single scan clustering algorithms has been introduced. The basic idea of a single scan algorithm is to group neighboring objects of the database into clusters based on a local cluster condition, thus performing only one scan through the database. Single scan clustering algorithms are very efficient if the retrieval of the neighborhood of an object is efficiently supported by the DBMS. Different cluster conditions yield different cluster definitions and algorithms. For instance, DBSCAN (Density Based Spatial Clustering of Applications with Noise) [EKSX 96], [SEKX 98] relies on a density-based notion of clusters.

We use DBSCAN as a base for our incremental clustering algorithm due to the following reasons. First, DBSCAN is one of the most efficient algorithms on large databases. Second, whereas BIRCH is applicable only to spatial databases (Euclidean vector space), DBSCAN can be applied to any database containing data from a metric space (only assuming a distance function).

3 The Algorithm DBSCAN

The key idea of density-based clustering is that for each object of a cluster the neighborhood of a given radius (Eps) has to contain at least a minimum number of objects (MinPts), i.e. the cardinality of the neighborhood has to exceed some threshold. We will first give a short introduction to DBSCAN including the definitions which are required for incremental clustering. For a detailed presentation of DBSCAN see [EKSX 96].

Definition 1: (directly density-reachable) An object p is directly density-reachable from an object q wrt. Eps and MinPts in the set of objects D if
1) $p \in N_{Eps}(q)$ ($N_{Eps}(q)$ is the subset of D contained in the Eps-neighborhood of q)
2) $Card(N_{Eps}(q)) \geq MinPts$.

Definition 2: (density-reachable) An object p is density-reachable from an object q wrt. Eps and MinPts in the set of objects D, denoted as $p >_D q$, if there is a chain of objects $p_1, \ldots, p_n$, $p_1 = q$, $p_n = p$ such that $p_i \in D$ and $p_{i+1}$ is directly density-reachable from $p_i$ wrt. Eps and MinPts.

Density-reachability is a canonical extension of direct density-reachability. This relation is transitive, but it is not symmetric. Although not symmetric in general, it is obvious that density-reachability is symmetric for objects o with $Card(N_{Eps}(o)) \geq MinPts$. Two border objects of a cluster are possibly not density-reachable from each other because there are not enough objects in their Eps-neighborhoods. However, there must be a third object in the cluster from which both border objects are density-reachable. Therefore, we introduce the notion of density-connectivity.

Definition 3: (density-connected) An object p is density-connected to an object q wrt. Eps and MinPts in the set of objects D if there is an object $o \in D$ such that both p and q are density-reachable from o wrt. Eps and MinPts in D.

Density-connectivity is a symmetric relation. Figure 1 illustrates the definitions on a sample
database of objects from a 2-dimensional vector space. Note however, that the above definitions only require a distance measure and will also apply to data from a metric space.

[Figure 1: density-reachability and density-connectivity. Panels: p density-reachable from q, q not density-reachable from p; p and q density-connected to each other by o.]

A cluster is defined as a set of density-connected objects which is maximal wrt. density-reachability and the noise is the set of objects not contained in any cluster.

Definition 4: (cluster) Let D be a set of objects. A cluster C wrt. Eps and MinPts in D is a non-empty subset of D satisfying the following conditions:
1) Maximality: $\forall p, q \in D$: if $p \in C$ and $q >_D p$ wrt. Eps and MinPts, then also $q \in C$.
2) Connectivity: $\forall p, q \in C$: p is density-connected to q wrt. Eps and MinPts in D.

Definition 5: (noise) Let $C_1, \ldots, C_k$ be the clusters wrt. Eps and MinPts in D. Then, we define the noise as the set of objects in the database D not belonging to any cluster $C_i$, i.e. $noise = \{p \in D \mid \forall i: p \notin C_i\}$.

We omit the term "wrt. Eps and MinPts" in the following whenever it is clear from the context. There are two different kinds of objects in a clustering: core objects (satisfying condition 2 of definition 1) and non-core objects (otherwise). In the following, we will refer to this characteristic of an object as the core object property of the object. The non-core objects in turn are either border objects (not a core object but density-reachable from another core object) or noise objects (not a core object and not density-reachable from other objects).

The algorithm DBSCAN was designed to efficiently discover the clusters and the noise in a database according to the above definitions. The procedure for finding a cluster is based on the fact that a cluster is uniquely determined by any of its core objects: First, given an arbitrary object p for which the core object condition holds, the set $\{o \mid o >_D p\}$ of all objects o density-reachable from p in D forms a complete cluster C and $p \in C$. Second, given a cluster C and an arbitrary core object $p \in C$, C in turn equals the set $\{o \mid o >_D p\}$ (cf. lemma 1 and 2 in [EKSX 96]).

To find a cluster, DBSCAN starts with an arbitrary object p in D and retrieves all objects of D density-reachable from p with respect to Eps and MinPts. If p is a core object, this procedure yields a cluster with respect to Eps and MinPts. If p is a border object, no objects are density-reachable from p and p is assigned to the noise. Then, DBSCAN visits the next object of the database D.

The retrieval of density-reachable objects is performed by successive region queries. A region query returns all objects intersecting a specified query region. Such queries are supported efficiently by spatial access methods such as R*-trees [BKSS 90] for data from a vector space or M-trees [CPZ 97] for data from a metric space. The algorithm DBSCAN is sketched in figure 2.

Algorithm DBSCAN (D, Eps, MinPts)
  // Precondition: All objects in D are unclassified.
  FORALL objects o in D DO:
    IF o is unclassified
      call function expand_cluster to construct a cluster wrt. Eps and MinPts containing o.

FUNCTION expand_cluster (o, D, Eps, MinPts):
  retrieve the Eps-neighborhood N_Eps(o) of o;
  IF |N_Eps(o)| < MinPts   // i.e. o is not a core object
    mark o as noise and RETURN;
  ELSE                     // i.e. o is a core object
    select a new cluster-id and mark all objects in N_Eps(o) with this current cluster-id;
    push all objects from N_Eps(o)\{o} onto the stack seeds;
    WHILE NOT seeds.empty() DO
      currentObject := seeds.top();
      seeds.pop();         // remove currentObject before new seeds are pushed
      retrieve the Eps-neighborhood N_Eps(currentObject) of currentObject;
      IF |N_Eps(currentObject)| >= MinPts
        select all objects in N_Eps(currentObject) that are not yet classified or are marked as noise,
        push the unclassified objects onto seeds and mark all of these objects with current cluster-id;
    RETURN

Figure 2: Algorithm DBSCAN

4 IncrementalDBSCAN

DBSCAN, as introduced in [EKSX 96], is applied to a static database. In a data warehouse, however, the databases may have frequent updates and thus may be rather dynamic. For example, in a WWW access log database, we may want to find and monitor groups of similar access patterns by clustering the access sequences of different users. These patterns may change over time because each day new log entries are added to the database and old entries (past a user-supplied expiration date) are deleted. After insertions and deletions to the database, the clustering discovered by DBSCAN has to be updated. In section 4.1, we examine which part of an existing clustering is affected by an update of the database. We present algorithms for incremental updates of a clustering after insertions (section 4.2) and deletions (section 4.3). Based on the formal notion of clusters, it can be proven that the incremental algorithm yields the same result as the non-incremental DBSCAN algorithm. This is an important advantage of our approach.

4.1 Affected Objects

We want to show that changes of some clustering of a database D are restricted to a neighborhood of an inserted or deleted object p. Objects contained in $N_{Eps}(p)$ can change their core object property, i.e. core objects may become non-core objects and vice versa. The objects contained in $N_{2Eps}(p) \setminus N_{Eps}(p)$ keep their core object property, but non-core objects may change their connection status, i.e. border objects may become noise objects or vice versa, because their Eps-neighborhood may contain objects with a changed core object property. For all objects outside of $N_{2Eps}(p)$, it holds that neither these objects themselves nor objects in their Eps-neighborhood change their core object property. Therefore, the connection status of these objects is unchanged.

After the insertion of some object p, non-core objects (border objects or noise objects) in $N_{Eps}(p)$ may become core
objects implying that new density connections may be established, i.e. chains $p_1, \ldots, p_n$, $p_1 = r$, $p_n = s$ with $p_{i+1}$ directly density-reachable from $p_i$ for two objects r and s may arise which were not density-reachable from each other before the insertion. Then, one of the $p_i$ for $i < n$ must be contained in $N_{Eps}(p)$.

When deleting some object p, core objects in $N_{Eps}(p)$ may become non-core objects implying that density connections may be removed, i.e. there may no longer be a chain $p_1, \ldots, p_n$, $p_1 = r$, $p_n = s$ with $p_{i+1}$ directly density-reachable from $p_i$ for two objects r and s which were density-reachable from each other before the deletion. Again, one of the $p_i$ for $i < n$ must be contained in $N_{Eps}(p)$.

Figure 3 illustrates our discussion using a sample database of 2D objects and an object p to be inserted or to be deleted. The objects a and b are density-connected wrt. Eps as depicted and MinPts = 4 without using one of the elements of $N_{Eps}(p)$. Therefore, a and b belong to the same cluster independently from p. On the other hand, the objects d and e in $D \setminus N_{Eps}(p)$ are only density-connected via c in $N_{Eps}(p)$ if the object p is present, so that the cluster membership of d and e is affected by p.

[Figure 3: Affected objects in a sample database]

In general, on an insertion or deletion of an object p, the set of affected objects, i.e. objects which may potentially change cluster membership after the update, is the set of objects in $N_{Eps}(p)$ plus all objects density-reachable from one of these objects in $D \cup \{p\}$. The cluster membership of all other objects not in the set of affected objects will not change. This is the intuition of the following definition and lemma. In particular, the lemma states that a cluster C in the database is independent of an insertion or deletion of an object p if a core object of the cluster is outside the set $Affected_D(p)$. Note that a cluster is uniquely determined by any of its core objects. Therefore, by definition of $Affected_D(p)$ it follows that if one core object of a cluster is outside (inside) $Affected_D(p)$ then all core objects of the cluster are outside (inside) the set $Affected_D(p)$.

Definition 6: (affected objects) Let D be a database of objects and p be some object (either in or not in D). We define the set of objects in D affected by the insertion or deletion of p as
$Affected_D(p) = N_{Eps}(p) \cup \{q \mid \exists o \in N_{Eps}(p) \wedge q >_{D \cup \{p\}} o\}$.

Lemma 1: Let D be a set of objects and p be some object. Then
$\forall o \in D$: $o \notin Affected_D(p) \Rightarrow \{q \mid q >_{D \setminus \{p\}} o\} = \{q \mid q >_{D \cup \{p\}} o\}$.

Proof (sketch):
1) $\subseteq$: because $D \setminus \{p\} \subseteq D \cup \{p\}$.
2) $\supseteq$: if $q \in \{q \mid q >_{D \cup \{p\}} o\}$, then there is some chain $q_1, \ldots, q_n$, $q_1 = o$, $q_n = q$, $q_{i+1} \in N_{Eps}(q_i)$ and $q_i$ is a core object in $D \cup \{p\}$ for all $i < n$ and, for all i, it holds that $q_i >_{D \cup \{p\}} o$. Because $q_i$ is a core object for all $i < n$ and density-reachability is symmetric for core objects, it also holds that $o >_{D \cup \{p\}} q_i$. If there existed an $i < n$ such that $q_i \in N_{Eps}(p)$, then $q_i >_{D \cup \{p\}} p$ implying also $o >_{D \cup \{p\}} p$ due to the transitivity of density-reachability. By definition of the set $Affected_D(p)$ it now follows that $o \in Affected_D(p)$, in contrast to the assumption. Thus, $q_i \notin N_{Eps}(p)$ for all
$i < n$ implying that all the objects $q_i$, $i < n$, are core objects independent of p and also $q_n \neq p$ because otherwise $q_{n-1} \in N_{Eps}(p)$. Thus, the chain $q_1, \ldots, q_n$ exists also in the set $D \setminus \{p\}$ and then $q \in \{q \mid q >_{D \setminus \{p\}} o\}$.

Due to lemma 1, after inserting or deleting an object p, it is sufficient to reapply DBSCAN to the set $Affected_D(p)$ in order to update the clustering. For that purpose, however, it is not necessary to retrieve the set first and then apply the clustering algorithm. We simply have to start a restricted version of DBSCAN which does not loop over the whole database to start expanding a cluster but only over certain "seed"-objects which are all located in the neighborhood of p. These "seed"-objects are core objects after the update operation which are located in the Eps-neighborhood of a core object in $D \cup \{p\}$ which in turn is located in $N_{Eps}(p)$. This is the content of the next lemma.

Lemma 2: Let D be a set of objects. Additionally, let $D^* = D \cup \{p\}$ after insertion of an object p or $D^* = D \setminus \{p\}$ after deletion of p and let c be a core object in $D^*$. Then
$C = \{o \mid o >_{D^*} c\}$ is a cluster in $D^*$ and $C \subseteq Affected_D(p)$ $\Leftrightarrow$ $\exists q, q'$: $q \in N_{Eps}(q')$, $q' \in N_{Eps}(p)$, $c >_{D^*} q$, q is core object in $D^*$ and q' is core object in $D \cup \{p\}$.

Proof (sketch): If $D^* = D \cup \{p\}$ or $c \in N_{Eps}(p)$, the lemma is obvious by definition of $Affected_D(p)$. Therefore, we consider only the case $D^* = D \setminus \{p\}$ and $c \notin N_{Eps}(p)$.
$\Rightarrow$: $C \subseteq Affected_D(p)$ and $C \neq \emptyset$. Then, there exists $o \in N_{Eps}(p)$ and $c >_{D \cup \{p\}} o$, i.e. there is a chain of directly density-reachable objects from o to c. Now, because $c \notin N_{Eps}(p)$ we can construct a chain $o = o_1, \ldots, o_n = c$, $o_{i+1} \in N_{Eps}(o_i)$ with the property that there is $j \leq n$ such that for all k, $j \leq k \leq n$, $o_k \notin N_{Eps}(p)$ and for all k, $1 \leq k < j$, $o_k \in N_{Eps}(p)$. Then $q = o_j \in N_{Eps}(o_{j-1})$, $q' = o_{j-1} \in N_{Eps}(p)$, $c >_{D^*} o_j$, $o_j$ is a core object in $D^*$ and $o_{j-1}$ is a core object in $D \cup \{p\}$.
$\Leftarrow$: obviously, $C = \{o \mid o >_{D^*} c\}$ is a cluster (see the comments on the algorithm after definition 5). By assumption, c is density-reachable from a core object q in $D^*$ and q is density-reachable from an object $q' \in N_{Eps}(p)$ in $D \cup \{p\}$. Then also c and hence all objects in C are density-reachable from q' in $D \cup \{p\}$. Thus, $C \subseteq Affected_D(p)$.

Due to lemma 2, the general strategy for updating a clustering would be to start the DBSCAN algorithm only with core objects that are in the Eps-neighborhood of a (previous) core object in $N_{Eps}(p)$. However, it is not necessary to rediscover density-connections which are known from the previous clustering and which are not changed by the update operation. For that purpose, we only need to look at core objects in the Eps-neighborhood of those objects that change their core object property as a result of the update. In case of an insertion, these objects may be connected after the insertion. In case of a deletion, density connections between them may be lost. In general, this information can be determined by using very few region queries. The remaining information needed to adjust the clustering can be derived from the cluster membership before the update. Definition 7 introduces the formal notions which are necessary to describe this approach. Remember: objects with a changed core object property are all located in $N_{Eps}(p)$.

Definition 7: (seed objects for the update) Let D be a set of objects and p be an object to be inserted or deleted. Then, we define the following notions:
$UpdSeed_{Ins} = \{q \mid q$ is a core object in $D \cup \{p\}$, $\exists q'$: q' is core object in $D \cup \{p\}$ but not in D and $q \in N_{Eps}(q')\}$
$UpdSeed_{Del} = \{q \mid q$ is a core object in $D \setminus \{p\}$, $\exists q'$: q' is core object in D but not in $D \setminus \{p\}$ and $q \in N_{Eps}(q')\}$
We call the objects $q \in UpdSeed$ "seed objects for the update".

Note that these sets can be computed rather efficiently if we additionally store for each object the number of objects in its neighborhood when initially clustering the database. Then, we need only to perform a single region query for the object p to be inserted or deleted to detect all objects q' with a changed core object property (i.e. objects in $N_{Eps}(p)$ with number = MinPts-1 in case of an insertion, objects in $N_{Eps}(p)$ with number = MinPts in case of a deletion). Only for these objects q' (if there are any) do we have to retrieve $N_{Eps}(q')$ to determine all objects q in the set UpdSeed. Since at this point of time the Eps-neighborhood of p is still in main memory we first check this set for neighbors of q' and perform an additional region query only if there are more objects in the neighborhood of q' than already contained in $N_{Eps}(p)$. Our experiments, however, indicate that objects with a changed core object property after an update (different from the inserted or deleted object p) are not very frequent (see section 5). Therefore, in most cases we just have to perform the Eps-neighborhood query for p and to change the counter for the number of objects in the neighborhood of the retrieved objects.

4.2 Insertions

When inserting a new object p, new density-connections may be established, but none are removed. In this case, it is sufficient to restrict the application of the clustering procedure to the set $UpdSeed_{Ins}$. If we have to change cluster membership for an object from C to D we perform the same change of cluster membership for all other objects in C. Changing cluster membership of these objects does not involve the application of the clustering algorithm but can be handled by simply storing the information about which clusters have been merged.

When inserting an object p into the database D, we can distinguish the following cases:
(1) (Noise) $UpdSeed_{Ins}$ is empty, i.e. there are no "new" core objects after insertion of p. Then, p is a noise object and nothing else is changed.
(2) (Creation) $UpdSeed_{Ins}$ contains only core objects which did not belong to a cluster before the insertion of p, i.e. they were noise objects or equal to p, and
a new cluster containing these noise objects as well as p is created.
(3) (Absorption) $UpdSeed_{Ins}$ contains core objects which were members of exactly one cluster C before the insertion. The object p and possibly some noise objects are absorbed into cluster C.
(4) (Merge) $UpdSeed_{Ins}$ contains core objects which were members of several clusters before the insertion. All these clusters and the object p are merged into one cluster.

Figure 4 illustrates the most simple forms of the different cases when inserting an object p into a sample database of 2D points, using parameters Eps as depicted and MinPts = 3.

[Figure 4: The different cases of the insertion algorithm (case 1: noise, case 2: creation, case 3: absorption, case 4: merge)]

Figure 5 presents a more complicated example of merging clusters when inserting an object p. In this example the value for Eps is as depicted and MinPts = 6. Then, the inserted point p is not a core object, but $o_1$, $o_2$, $o_3$ and $o_4$ are core objects after the update. The previous clustering can be adapted by analyzing only the Eps-neighborhood of these objects: cluster A is merged with cluster B and C because $o_1$ and $o_4$ as well as $o_2$ and $o_3$ are mutually directly density-reachable, implying the merge of B and C. The changing of cluster membership for objects in case of merging clusters can be done very efficiently by simply storing the information about the clusters that have been merged. Note that this kind of "transitive" merging can only occur if MinPts is larger than 5, because otherwise p would be a core object and then all objects in $N_{Eps}(p)$ would already be density-reachable from p.

[Figure 5: "Transitive" merging of clusters A, B, C by the insertion algorithm]
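
The case distinction for insertions can be illustrated with a small sketch. The code below only determines $UpdSeed_{Ins}$ and the applicable case; it does not relabel any objects. Everything in it is an assumption made for illustration: the names neighbors and insertion_case, the representation of objects as distinct coordinate tuples, the labels dictionary, and the linear scan that stands in for a region query on an index. It also recomputes neighborhood sizes instead of using the stored counters described in section 4.1.

```python
from math import dist

NOISE = None  # label of objects that belong to no cluster (assumed convention)


def neighbors(db, o, eps):
    """Eps-neighborhood of o in db; a linear scan stands in for a region query."""
    return [q for q in db if dist(o, q) <= eps]


def insertion_case(db, labels, p, eps, min_pts):
    """Return (case name, UpdSeed_Ins) for inserting p; labels holds the old clustering."""
    new_db = db + [p]

    def is_core(o, data):
        return len(neighbors(data, o, eps)) >= min_pts

    # Only objects in N_Eps(p) can change their core object property.
    changed = [q for q in neighbors(new_db, p, eps)
               if is_core(q, new_db) and (q == p or not is_core(q, db))]
    # Seed objects: core objects in the Eps-neighborhood of a "new" core object.
    seeds = {q for c in changed
             for q in neighbors(new_db, c, eps) if is_core(q, new_db)}
    # Clusters to which the seed objects belonged before the insertion.
    old_clusters = {labels[q] for q in seeds
                    if q != p and labels[q] is not NOISE}
    if not seeds:
        return "noise", seeds
    if not old_clusters:
        return "creation", seeds
    if len(old_clusters) == 1:
        return "absorption", seeds
    return "merge", seeds
```

For example, with Eps = 1 and MinPts = 3, inserting (2, 0) between a cluster on (0, 0), (0.5, 0), (1, 0) and a second cluster on (3, 0), (3.5, 0), (4, 0) makes the new point a core object whose neighborhood contains core objects of both clusters, so the sketch reports the merge case.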

4.3 Deletions

As opposed to an insertion, when deleting an object p, density-connections may be removed, but no new connections are established. The difficult case for deletion occurs when the cluster C of p is no longer density-connected via (previous) core objects in $N_{Eps}(p)$ after deleting p. In this case, we do not know in general how many objects we have to check before it can be determined whether C has to be split or not. In most cases, however, this set of objects is very small because the split of a cluster is not very frequent and in general a non-split situation will be detected in a small neighborhood of the deleted object p.

When deleting an object p from the database D we can distinguish the following cases:
(1) (Removal) $UpdSeed_{Del}$ is empty, i.e. there are no core objects in the neighborhood of objects that may have lost their core object property after the deletion of p. Then p is deleted from D and possibly other objects in $N_{Eps}(p)$ change from a former cluster C to noise. If this happens, the cluster C is completely removed because then C cannot have core objects outside of $N_{Eps}(p)$.
(2) (Reduction) All objects in $UpdSeed_{Del}$ are directly density-reachable from each other. Then p is deleted from D and some objects in $N_{Eps}(p)$ may become noise.
(3) (Potential Split) The objects in $UpdSeed_{Del}$ are not directly density-reachable from each other. These objects belonged to exactly one cluster C before the deletion of p. Now we have to check whether or not these objects are density-connected by other objects in the former cluster C. Depending on the existence of such density-connections, we can distinguish a split and a non-split situation.

Figure 6 illustrates the different cases when deleting p from a sample database of 2D points using parameters Eps as depicted and MinPts = 3. Note that the situations described in case 3 may occur simultaneously.

[Figure 6: The different cases of the deletion algorithm (case 1: removal, case 2: reduction, case 3: split and no split)]

If case (3) occurs, then the
clustering procedure must also consider objects outside of UpdSeed_Del, but it stops in case of a non-split situation as soon as the objects from the set UpdSeed_Del are density-connected to each other. Case (3) is implemented by a procedure similar to the function expand_cluster in algorithm DBSCAN (see figure 2) starting in parallel from the elements of the set UpdSeed_Del. The main difference is that the candidates for further expansion are managed in a queue instead of a stack. Thus, a breadth-first search for the missing density-connections is performed which is more efficient than a depth-first search due to the following reasons:

- In a non-split situation, we stop as soon as all members of UpdSeed_Del are found to be density-connected to each other. The breadth-first search implies that density-connections with the minimum number of objects (requiring the minimum number of region queries) are detected first.
- A split situation is in general the more expensive case because the parts of the cluster to be split actually have to be discovered. The algorithm stops when all but the last part have been visited. Usually, a cluster is split only into two parts and one of them is relatively small. Using breadth-first search we only have to visit the smaller part and a small percentage of the larger one.

5 Performance Evaluation

In this section, we evaluate the efficiency of IncrementalDBSCAN versus DBSCAN. We present an experimental evaluation using a 2D spatial database as well as a WWW access log database. For this purpose, we implemented both algorithms in C++ based on implementations of the R*-tree [BKSS 90] (for the 2D spatial database) and the M-tree [CPZ 97] (for the WWW log database) respectively. Furthermore, we present an analytical comparison of both algorithms and derive the speed-up factors for typical parameter values depending on the database size and the number of updates.

For the first set of experiments, we used a synthetic database of 1,000,000 2D points with k = 40 clusters of similar sizes. 21.7% of all points are
noise, uniformly distributed outside of the clusters, and all other points are uniformly distributed inside the clusters with a significantly higher density than the noise. In this database, the goal of clustering is to discover groups of neighboring objects. A typical real world application for this type of database is clustering earthquake epicenters stored in an earthquake catalog. Earthquake epicenters occur along seismically active faults, and are measured with some errors, so that over time observed earthquake epicenters should be clustered along such seismic faults [AF 96]. In this type of application, there are only insertions.

The Euclidean distance was used as distance function and an R*-tree [BKSS 90] as an index structure. Eps was set to 4.48 and MinPts was set to 30. Note that the MinPts value had to be rather large due to the high percentage of noise. We performed experiments on several other synthetic 2D databases with n varying from 100,000 to 1,000,000, k varying from 7 to 40 and with the noise percentage varying from 10% up to 20%. Since we always obtained similar results, we restrict the discussion to the above database.
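Looking back at case (3) of Section 4.3, the parallel breadth-first check for a potential split can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the C++ implementation evaluated here: `check_split` and `core_neighbors` are hypothetical names, `core_neighbors` stands in for a region query that returns only the core objects directly density-reachable from its argument, and the union-find bookkeeping used to record which searches have met is an illustrative choice that the text does not prescribe.

```python
from collections import deque


def find(parent, i):
    """Return the representative of search i (union-find with path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def check_split(seeds, core_neighbors):
    """Group the seeds by density-connectivity using parallel breadth-first searches.

    seeds          -- distinct core objects, standing in for UpdSeed_Del
    core_neighbors -- hypothetical callable: object -> core objects directly
                      density-reachable from it (counted as one region query)
    Returns (groups, region_queries); a single group means "no split".
    """
    parent = list(range(len(seeds)))             # which searches have met so far
    owner = {s: i for i, s in enumerate(seeds)}  # object -> search that reached it first
    queues = [deque([s]) for s in seeds]         # FIFO queues give breadth-first order
    region_queries = 0

    while True:
        roots = {find(parent, i) for i in range(len(seeds))}
        if len(roots) == 1:
            break                                # non-split: all seeds are connected
        active = {find(parent, i) for i, q in enumerate(queues) if q}
        if len(active) <= 1:
            break                                # split: all but the last part explored
        for i, q in enumerate(queues):           # advance every search by one object
            if not q:
                continue
            obj = q.popleft()
            region_queries += 1
            for nb in core_neighbors(obj):
                if nb not in owner:
                    owner[nb] = i
                    q.append(nb)
                else:
                    a, b = find(parent, i), find(parent, owner[nb])
                    if a != b:
                        parent[a] = b            # two searches met: merge them

    groups = {}
    for i, s in enumerate(seeds):
        groups.setdefault(find(parent, i), set()).add(s)
    return list(groups.values()), region_queries
```

Calling `check_split(seeds, graph.__getitem__)` with an adjacency dictionary `graph` is enough to try it on toy data. Because every search advances by one object per round, a small detached part is exhausted quickly and the loop ends without walking through the whole of the larger part, which mirrors the argument given for preferring a queue over a stack.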

romblon.informatik.uni-muenchen.de loa - [04/Mar/1997:01:44: ] "GET /~loa/ HTTP/1.0"
romblon.informatik.uni-muenchen.de loa - [04/Mar/1997:01:45: ] "GET /~loa// HTTP/1.0"
fixer.sega.co.jp unknown - [04/Mar/1997:01:58: ] "GET /dbs/orada.html HTTP/1.0"
scooter.pa-x.dec.com unknown - [04/Mar/1997:02:08: ] "GET /dbs/kriegel_e.html HTTP/1.0"

Figure 7: Sample WWW access log entries

For the second set of experiments, we used a WWW access log database of the Institute for Computer Science of the University of Munich. This database contains 1,400,000 entries following the Common Log Format specified as part of the HTTP protocol [Luo 95]. Figure 7 depicts some sample log entries. All log entries with identical IP address and user id within a given maximum time gap are grouped into a session and redundant entries, i.e. entries with filename suffixes such as gif, jpeg, and jpg are removed [MJHS 96]. A session has the following structure:

session ::= <ip_address, user_id, [url_1, ..., url_k]>

In this application, the goal of clustering is to discover groups of similar sessions. A WWW provider may use the discovered clusters as follows:

- The users associated with the sessions of a cluster form some kind of user group which may be used to develop marketing strategies.
- The URLs of the sessions contained in a cluster seem to be logically correlated and should be made easily accessible from each other via appropriate links.

Entries are deleted from the WWW access log database after six months. Assuming a constant daily number of WWW accesses, the numbers of insertions and deletions are the same. We used the following distance function for pairs of sessions s_1 and s_2:

$$
\mathit{dist}(s_1, s_2) = \frac{\mathit{Cardinality}(s_1 \setminus s_2) + \mathit{Cardinality}(s_2 \setminus s_1)}{\mathit{Cardinality}(s_1) + \mathit{Cardinality}(s_2)}
$$

The domain of dist is the interval [0, 1], dist(s, s) = 0, dist is symmetric and it fulfills the triangle inequality. Other distance functions may use the hierarchy of the directories to define the degree of similarity between two URLs. The database was indexed by an M-tree
[CPZ 97]. Eps was set to 0.4 and MinPts to 2.

In the following, we compare the performance of IncrementalDBSCAN versus DBSCAN. Typically, the number of page accesses is used as a cost measure for database algorithms because the I/O time heavily dominates CPU time. In both algorithms, region queries are the only operations requiring page accesses. Since the number of page accesses of a single region query is the same for DBSCAN and for IncrementalDBSCAN, we only have to compare the number of region queries. Thus, we use the number of region queries as the cost measure for our comparison. Note that we are not interested in the absolute performance of the two algorithms but only in their relative performance, i.e. in the speed-up factor as defined below. To validate this approach, we performed a set of experiments on our test databases and found that the experimental speed-up factor always was slightly larger than the analytically derived speed-up factor (experimental value 1.6 times the expected value in all experiments).

DBSCAN performs exactly one region query for each of the n objects of the database (see algorithm in figure 2), i.e. the cost of DBSCAN for clustering n objects, denoted by Cost_DBSCAN(n), is

$$
\mathit{Cost}_{DBSCAN}(n) = n
$$

The number of region queries performed by IncrementalDBSCAN depends on the application and, therefore, it must be determined experimentally. In general, a deletion affects more objects than an insertion. Thus, we introduce two parameters r_ins and r_del denoting the average number of region queries for an incremental insertion and deletion, respectively. Let f_ins and f_del denote the fraction of insertions and deletions, respectively, among all incremental updates. Then, the cost of IncrementalDBSCAN for performing m incremental updates, denoted by Cost_IncrementalDBSCAN(m), is as follows:

$$
\mathit{Cost}_{IncrementalDBSCAN}(m) = m \cdot (f_{ins} \cdot r_{ins} + f_{del} \cdot r_{del})
$$

Table 1 lists the parameters of our performance evaluation and the values obtained for the 2D spatial as well as for the WWW-log database. To determine the average values

(r_ins and r_del), the whole databases were incrementally inserted and deleted, although f_del = 0 for the 2D spatial database.

Table 1: Parameters of the performance evaluation

Parameter | Meaning | Value for 2D spatial | Value for WWW-log
n | number of database objects | 1,000,000 | 69,000
m | number of (incremental) updates | varying | varying
r_ins | average number of region queries for an incremental insertion | |
r_del | average number of region queries for an incremental deletion | |
f_del | relative frequency of deletions in the number of all updates | 0 | 0.5
f_ins | relative frequency of insertions in the number of all updates (1 - f_del) | |

Now, we can calculate the speed-up factor of IncrementalDBSCAN versus DBSCAN. We define the speed-up factor as the ratio of the cost of DBSCAN (applied to the database after all insertions and deletions) and the cost of m calls of IncrementalDBSCAN (once for each of the insertions and deletions), i.e.:

$$
\mathit{SpeedupFactor} = \frac{\mathit{Cost}_{DBSCAN}(n + f_{ins} \cdot m - f_{del} \cdot m)}{\mathit{Cost}_{IncrementalDBSCAN}(m)} = \frac{n + f_{ins} \cdot m - f_{del} \cdot m}{m \cdot (f_{ins} \cdot r_{ins} + f_{del} \cdot r_{del})}
$$

Figure 8 and figure 9 depict the speed-up factors depending on n for several values of m. For relatively small numbers of daily updates, e.g. m = 1,000 and n = 1,000,000, we obtain speed-up factors of 633 for the 2D spatial database and 260 for the WWW-log database. Even for rather large numbers of daily updates, e.g. m = 25,000 and n = 1,000,000, IncrementalDBSCAN yields speed-up factors of 26 for the 2D spatial database and 10 for the WWW-log database.

[Figure 8: Speed-up factors for 2D spatial database. x-axis: size of database (n); y-axis: speed-up factor; one curve per number of updates (m).]

[Figure 9: Speed-up factors for WWW-log database. x-axis: size of database (n); y-axis: speed-up factor; one curve per number of updates (m).]

When setting the speed-up factor to 1.0, we obtain the number of updates (denoted by MaxUpdates) up to which the multiple application of IncrementalDBSCAN for each update is
more efficient than the single application of DBSCAN to the whole updated database. Figure 10 depicts the values of MaxUpdates depending on n for f_del values of up to 0.5 which is the maximum value to be expected in most real applications. This figure was derived by setting r_ins to 1.34 and r_del to 6.75 which are the averages over the respective values obtained for our test databases. Note that (in contrast to the significant differences of other characteristics of the two applications) the differences of both r_ins and r_del are rather small, indicating that the average values are a realistic choice for many applications. The MaxUpdates values obtained are much larger than the actual numbers of daily updates in most real databases. For databases without deletions (that is, f_del = 0), MaxUpdates is approximately 3 * n, i.e. the cost for 3 * n updates on a database of n objects using IncrementalDBSCAN is the same as the cost of DBSCAN on the updated database containing 4 * n objects. Even in the worst case of f_del = 0.5, MaxUpdates is approximately 0.25 * n. These results clearly emphasize the relevance of incremental clustering.
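The cost model of this section is simple enough to evaluate directly. The sketch below is an illustration under stated assumptions: the function names are hypothetical, f_ins is taken as 1 - f_del, and the averaged region-query counts quoted above are read as r_ins = 1.34 and r_del = 6.75. `max_updates` solves SpeedupFactor = 1 for m, which gives m = n / (f_ins * r_ins + f_del * r_del - (f_ins - f_del)).

```python
def cost_dbscan(n):
    """Region queries DBSCAN needs for n objects: exactly one per object."""
    return n


def cost_incremental(m, f_del, r_ins, r_del):
    """Region queries for m incremental updates, assuming f_ins = 1 - f_del."""
    f_ins = 1.0 - f_del
    return m * (f_ins * r_ins + f_del * r_del)


def speedup_factor(n, m, f_del, r_ins, r_del):
    """Cost of DBSCAN on the updated database divided by the incremental cost."""
    f_ins = 1.0 - f_del
    n_after = n + f_ins * m - f_del * m   # database size after all m updates
    return cost_dbscan(n_after) / cost_incremental(m, f_del, r_ins, r_del)


def max_updates(n, f_del, r_ins, r_del):
    """Number of updates m at which the speed-up factor drops to 1.0."""
    f_ins = 1.0 - f_del
    denom = f_ins * r_ins + f_del * r_del - (f_ins - f_del)
    if denom <= 0:
        return float("inf")               # incremental updates never cost more
    return n / denom


# Averages quoted in the text, read here as 1.34 and 6.75 (an assumption).
R_INS, R_DEL = 1.34, 6.75
n = 1_000_000
for f_del in (0.0, 0.5):
    m = max_updates(n, f_del, R_INS, R_DEL)
    print(f"f_del = {f_del}: MaxUpdates is about {m / n:.2f} * n")
```

With these inputs the break-even point comes out near 2.94 * n without deletions and near 0.25 * n for f_del = 0.5, in line with the approximate values of 3 * n and 0.25 * n stated above.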

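The session distance used for the WWW-log experiments can likewise be written down in a few lines. This is a minimal sketch, assuming that a session is reduced to the set of its URLs (duplicates and order ignored) and that two empty sessions are given distance 0; `session_distance` is a hypothetical name and neither assumption is stated in the text.

```python
def session_distance(s1, s2):
    """Share of URLs that occur in only one of the two sessions.

    s1, s2 -- iterables of URL strings; each is treated as a set (assumption).
    """
    a, b = set(s1), set(s2)
    total = len(a) + len(b)
    if total == 0:
        return 0.0                      # convention for two empty sessions (assumption)
    return (len(a - b) + len(b - a)) / total


# Two sessions sharing one of their two URLs each.
d = session_distance(["/dbs/index.html", "/dbs/staff.html"],
                     ["/dbs/staff.html", "/dbs/papers.html"])
print(d)
```

The value is 0 for identical sessions and 1 for sessions with no URL in common, and swapping the arguments does not change it. One caveat worth checking on real data: for the three sessions {a}, {b} and {a, b} this ratio evaluates to 1, 1/3 and 1/3, so the triangle inequality is not guaranteed for arbitrary sets.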
[Figure 10: MaxUpdates depending on database size for different relative frequencies of deletions. x-axis: size of database (n); y-axis: MaxUpdates; one curve per relative frequency of deletions (f_del).]

6 Conclusions

Data warehouses provide a great deal of opportunities for performing data mining tasks such as classification and clustering. Typically, updates are collected and applied to the data warehouse periodically in a batch mode, e.g. during the night. Then, all patterns derived from the warehouse by some data mining algorithm have to be updated as well. In this paper, we presented the first incremental clustering algorithm, based on DBSCAN, for mining in a data warehousing environment. DBSCAN requires only a distance function and, therefore, it is applicable to any database containing data from a metric space. Due to the density-based nature of DBSCAN, the insertion or deletion of an object affects the current clustering only in a small neighborhood of this object. Thus, efficient algorithms could be given for incremental insertions and deletions to a clustering. Based on the formal definition of clusters, it was proven that the incremental algorithm yields the same result as DBSCAN.

A performance evaluation of IncrementalDBSCAN versus DBSCAN using a spatial database as well as a WWW-log database was presented, demonstrating the efficiency of the proposed algorithm. For relatively small numbers of daily updates, e.g. 1,000 updates in a database of size 1,000,000, IncrementalDBSCAN yielded speed-up factors of several hundred. Even for rather large numbers of daily updates, e.g. 25,000 updates in a database of 1,000,000 objects, we obtained speed-up factors of more than 10 versus DBSCAN.

In this paper, we assumed that the parameter values Eps and MinPts of DBSCAN do not change significantly when inserting and deleting objects. However, there may be applications where this assumption does not hold, i.e. the parameters may change after many updates of the database. In our future work, we plan to investigate this case. In this paper, sets of updates are processed one at a time without considering the relationships between the single updates. In the future, bulk insertions and deletions will be considered to further improve the efficiency of IncrementalDBSCAN.

Acknowledgments

We thank Marco Patella for the M-tree implementation and Franz Krojer for providing us with the WWW access log database.

References

[AF 96] Allard D. and Fraley C.: Non Parametric Maximum Likelihood Estimation of Features in Spatial Point Process Using Voronoi Tessellation, Journal of the American Statistical Association, December 1997. [also http://www.stat.washington.edu/techreports/ tr293rs]
[AS 94] Agrawal R., Srikant R.: Fast Algorithms for Mining Association Rules, Proc. 20th Int. Conf. on Very Large Data Bases, Santiago, Chile, 1994.
[BKSS 90] Beckmann N., Kriegel H.-P., Schneider R., Seeger B.: The R*-tree: An Efficient and Robust Access Method for Points and Rectangles, Proc. ACM SIGMOD Int. Conf. on Management of Data, Atlantic City, NJ, 1990.
[Bou 96] Bouguettaya A.: On-Line Clustering, IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 2, 1996.
[CHNW 96] Cheung D. W., Han J., Ng V. T., Wong Y.: Maintenance of Discovered Association Rules in Large Databases: An Incremental Technique, Proc. 12th Int. Conf. on Data Engineering, New Orleans, USA, 1996.
[CPZ 97] Ciaccia P., Patella M., Zezula P.: M-tree: An Efficient Access Method for Similarity Search in Metric Spaces, Proc. 23rd Int. Conf. on Very Large Data Bases, Athens, Greece, 1997.
[EKSX 96] Ester M., Kriegel H.-P., Sander J., Xu X.: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, 1996.
[EKX 95] Ester M., Kriegel H.-P., Xu X.: Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification, Proc. 4th Int. Symp. on Large Spatial Databases, Portland, ME, 1995, in: Lecture Notes in Computer Science, Vol. 951, Springer, 1995, 67-82.

[EW 98] Ester M., Wittmann R.: Incremental Generalization for Mining in a Data Warehousing Environment, Proc. 6th Int. Conf. on Extending Database Technology, Valencia, Spain, 1998, in: Lecture Notes in Computer Science, Vol. 1377, Springer, 1998.
[FAAM 97] Feldman R., Aumann Y., Amir A., Mannila H.: Efficient Algorithms for Discovering Frequent Sets in Incremental Databases, Proc. ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, AZ, 1997.
[FPS 96] Fayyad U., Piatetsky-Shapiro G., and Smyth P.: Knowledge Discovery and Data Mining: Towards a Unifying Framework, Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, 1996.
[Gue 94] Gueting R. H.: An Introduction to Spatial Database Systems, The VLDB Journal, Vol. 3, No. 4, October 1994.
[HCC 93] Han J., Cai Y., Cercone N.: Data-driven Discovery of Quantitative Rules in Relational Databases, IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 1, 1993.
[Huy 97] Huyn N.: Multiple-View Self-Maintenance in Data Warehousing Environments, Proc. 23rd Int. Conf. on Very Large Data Bases, Athens, Greece, 1997.
[KR 90] Kaufman L., Rousseeuw P. J.: Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.
[Luo 95] Luotonen A.: The common log file format, http://www.w3.org/pub/www/, 1995.
[MJHS 96] Mobasher B., Jain N., Han E.-H., Srivastava J.: Web Mining: Pattern Discovery from World Wide Web Transactions, Technical Report, University of Minnesota, 1996.
[MQM 97] Mumick I. S., Quass D., Mumick B. S.: Maintenance of Data Cubes and Summary Tables in a Warehouse, Proc. ACM SIGMOD Int. Conf. on Management of Data, 1997.
[NH 94] Ng R. T., Han J.: Efficient and Effective Clustering Methods for Spatial Data Mining, Proc. 20th Int. Conf. on Very Large Data Bases, Santiago, Chile, 1994.
[SEKX 98] Sander J., Ester M., Kriegel H.-P., Xu X.: Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and its Applications, will appear in: Data Mining and Knowledge Discovery, Kluwer Academic Publishers, Vol. 2, 1998.
[Sib 73] Sibson R.: SLINK: an optimally efficient algorithm for the single-link cluster method, The Computer Journal, Vol. 16, No. 1, 1973.
[ZRL 96] Zhang T., Ramakrishnan R., Livny M.: BIRCH: An Efficient Data Clustering Method for Very Large Databases, Proc. ACM SIGMOD Int. Conf. on Management of Data, 1996.


FDA CFR PART 11 ELECTRONIC RECORDS, ELECTRONIC SIGNATURES Document: MRM-1004-GAPCFR11 (0005) Page: 1 / 18 FDA CFR PART 11 ELECTRONIC RECORDS, ELECTRONIC SIGNATURES AUDIT TRAIL ECO # Version Change Descrition MATRIX- 449 A Ga Analysis after adding controlled documents

More information

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties

More information

Identifying erroneous data using outlier detection techniques

Identifying erroneous data using outlier detection techniques Identifying erroneous data using outlier detection techniques Wei Zhuang 1, Yunqing Zhang 2 and J. Fred Grassle 2 1 Department of Computer Science, Rutgers, the State University of New Jersey, Piscataway,

More information

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool. International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 9, Issue 8 (January 2014), PP. 19-24 Comparative Analysis of EM Clustering Algorithm

More information

Buffer Capacity Allocation: A method to QoS support on MPLS networks**

Buffer Capacity Allocation: A method to QoS support on MPLS networks** Buffer Caacity Allocation: A method to QoS suort on MPLS networks** M. K. Huerta * J. J. Padilla X. Hesselbach ϒ R. Fabregat O. Ravelo Abstract This aer describes an otimized model to suort QoS by mean

More information

Stochastic Derivation of an Integral Equation for Probability Generating Functions

Stochastic Derivation of an Integral Equation for Probability Generating Functions Journal of Informatics and Mathematical Sciences Volume 5 (2013), Number 3,. 157 163 RGN Publications htt://www.rgnublications.com Stochastic Derivation of an Integral Equation for Probability Generating

More information

On the predictive content of the PPI on CPI inflation: the case of Mexico

On the predictive content of the PPI on CPI inflation: the case of Mexico On the redictive content of the PPI on inflation: the case of Mexico José Sidaoui, Carlos Caistrán, Daniel Chiquiar and Manuel Ramos-Francia 1 1. Introduction It would be natural to exect that shocks to

More information

Risk and Return. Sample chapter. e r t u i o p a s d f CHAPTER CONTENTS LEARNING OBJECTIVES. Chapter 7

Risk and Return. Sample chapter. e r t u i o p a s d f CHAPTER CONTENTS LEARNING OBJECTIVES. Chapter 7 Chater 7 Risk and Return LEARNING OBJECTIVES After studying this chater you should be able to: e r t u i o a s d f understand how return and risk are defined and measured understand the concet of risk

More information

Chapter 20: Data Analysis

Chapter 20: Data Analysis Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

More information

Multiperiod Portfolio Optimization with General Transaction Costs

Multiperiod Portfolio Optimization with General Transaction Costs Multieriod Portfolio Otimization with General Transaction Costs Victor DeMiguel Deartment of Management Science and Oerations, London Business School, London NW1 4SA, UK, [email protected] Xiaoling Mei

More information

Implementation of Statistic Process Control in a Painting Sector of a Automotive Manufacturer

Implementation of Statistic Process Control in a Painting Sector of a Automotive Manufacturer 4 th International Conference on Industrial Engineering and Industrial Management IV Congreso de Ingeniería de Organización Donostia- an ebastián, etember 8 th - th Imlementation of tatistic Process Control

More information

Managing specific risk in property portfolios

Managing specific risk in property portfolios Managing secific risk in roerty ortfolios Andrew Baum, PhD University of Reading, UK Peter Struemell OPC, London, UK Contact author: Andrew Baum Deartment of Real Estate and Planning University of Reading

More information

Failure Behavior Analysis for Reliable Distributed Embedded Systems

Failure Behavior Analysis for Reliable Distributed Embedded Systems Failure Behavior Analysis for Reliable Distributed Embedded Systems Mario Tra, Bernd Schürmann, Torsten Tetteroo {tra schuerma tetteroo}@informatik.uni-kl.de Deartment of Comuter Science, University of

More information

Multi-Channel Opportunistic Routing in Multi-Hop Wireless Networks

Multi-Channel Opportunistic Routing in Multi-Hop Wireless Networks Multi-Channel Oortunistic Routing in Multi-Ho Wireless Networks ANATOLIJ ZUBOW, MATHIAS KURTH and JENS-PETER REDLICH Humboldt University Berlin Unter den Linden 6, D-99 Berlin, Germany (zubow kurth jr)@informatik.hu-berlin.de

More information

Franck Cappello and Daniel Etiemble LRI, Université Paris-Sud, 91405, Orsay, France Email: [email protected]

Franck Cappello and Daniel Etiemble LRI, Université Paris-Sud, 91405, Orsay, France Email: fci@lri.fr MPI versus MPI+OenMP on the IBM SP for the NAS Benchmarks Franck Caello and Daniel Etiemble LRI, Université Paris-Sud, 945, Orsay, France Email: [email protected] Abstract The hybrid memory model of clusters

More information

F inding the optimal, or value-maximizing, capital

F inding the optimal, or value-maximizing, capital Estimating Risk-Adjusted Costs of Financial Distress by Heitor Almeida, University of Illinois at Urbana-Chamaign, and Thomas Philion, New York University 1 F inding the otimal, or value-maximizing, caital

More information

Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network

Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Finding a Needle in a Haystack: Pinointing Significant BGP Routing Changes in an IP Network Jian Wu, Zhuoqing Morley Mao University of Michigan Jennifer Rexford Princeton University Jia Wang AT&T Labs

More information

Learning Human Behavior from Analyzing Activities in Virtual Environments

Learning Human Behavior from Analyzing Activities in Virtual Environments Learning Human Behavior from Analyzing Activities in Virtual Environments C. BAUCKHAGE 1, B. GORMAN 2, C. THURAU 3 & M. HUMPHRYS 2 1) Deutsche Telekom Laboratories, Berlin, Germany 2) Dublin City University,

More information

Service Network Design with Asset Management: Formulations and Comparative Analyzes

Service Network Design with Asset Management: Formulations and Comparative Analyzes Service Network Design with Asset Management: Formulations and Comarative Analyzes Jardar Andersen Teodor Gabriel Crainic Marielle Christiansen October 2007 CIRRELT-2007-40 Service Network Design with

More information

The fast Fourier transform method for the valuation of European style options in-the-money (ITM), at-the-money (ATM) and out-of-the-money (OTM)

The fast Fourier transform method for the valuation of European style options in-the-money (ITM), at-the-money (ATM) and out-of-the-money (OTM) Comutational and Alied Mathematics Journal 15; 1(1: 1-6 Published online January, 15 (htt://www.aascit.org/ournal/cam he fast Fourier transform method for the valuation of Euroean style otions in-the-money

More information

The risk of using the Q heterogeneity estimator for software engineering experiments

The risk of using the Q heterogeneity estimator for software engineering experiments Dieste, O., Fernández, E., García-Martínez, R., Juristo, N. 11. The risk of using the Q heterogeneity estimator for software engineering exeriments. The risk of using the Q heterogeneity estimator for

More information

Sage HRMS I Planning Guide. The Complete Buyer s Guide for Payroll Software

Sage HRMS I Planning Guide. The Complete Buyer s Guide for Payroll Software I Planning Guide The Comlete Buyer s Guide for Payroll Software Table of Contents Introduction... 1 Recent Payroll Trends... 2 Payroll Automation With Emloyee Self-Service... 2 Analyzing Your Current Payroll

More information

COST CALCULATION IN COMPLEX TRANSPORT SYSTEMS

COST CALCULATION IN COMPLEX TRANSPORT SYSTEMS OST ALULATION IN OMLEX TRANSORT SYSTEMS Zoltán BOKOR 1 Introduction Determining the real oeration and service costs is essential if transort systems are to be lanned and controlled effectively. ost information

More information

A Certification Authority for Elliptic Curve X.509v3 Certificates

A Certification Authority for Elliptic Curve X.509v3 Certificates A Certification Authority for Ellitic Curve X509v3 Certificates Maria-Dolores Cano, Ruben Toledo-Valera, Fernando Cerdan Det of Information Technologies & Communications Technical University of Cartagena

More information

Visual Data Mining with Pixel-oriented Visualization Techniques

Visual Data Mining with Pixel-oriented Visualization Techniques Visual Data Mining with Pixel-oriented Visualization Techniques Mihael Ankerst The Boeing Company P.O. Box 3707 MC 7L-70, Seattle, WA 98124 [email protected] Abstract Pixel-oriented visualization

More information

A Multivariate Statistical Analysis of Stock Trends. Abstract

A Multivariate Statistical Analysis of Stock Trends. Abstract A Multivariate Statistical Analysis of Stock Trends Aril Kerby Alma College Alma, MI James Lawrence Miami University Oxford, OH Abstract Is there a method to redict the stock market? What factors determine

More information

Chapter 7. Cluster Analysis

Chapter 7. Cluster Analysis Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. Density-Based Methods 6. Grid-Based Methods 7. Model-Based

More information

Title: Stochastic models of resource allocation for services

Title: Stochastic models of resource allocation for services Title: Stochastic models of resource allocation for services Author: Ralh Badinelli,Professor, Virginia Tech, Deartment of BIT (235), Virginia Tech, Blacksburg VA 2461, USA, [email protected] Phone : (54) 231-7688,

More information

Secure synthesis and activation of protocol translation agents

Secure synthesis and activation of protocol translation agents Home Search Collections Journals About Contact us My IOPscience Secure synthesis and activation of rotocol translation agents This content has been downloaded from IOPscience. Please scroll down to see

More information

CABRS CELLULAR AUTOMATON BASED MRI BRAIN SEGMENTATION

CABRS CELLULAR AUTOMATON BASED MRI BRAIN SEGMENTATION XI Conference "Medical Informatics & Technologies" - 2006 Rafał Henryk KARTASZYŃSKI *, Paweł MIKOŁAJCZAK ** MRI brain segmentation, CT tissue segmentation, Cellular Automaton, image rocessing, medical

More information

ANALYSING THE OVERHEAD IN MOBILE AD-HOC NETWORK WITH A HIERARCHICAL ROUTING STRUCTURE

ANALYSING THE OVERHEAD IN MOBILE AD-HOC NETWORK WITH A HIERARCHICAL ROUTING STRUCTURE AALYSIG THE OVERHEAD I MOBILE AD-HOC ETWORK WITH A HIERARCHICAL ROUTIG STRUCTURE Johann Lóez, José M. Barceló, Jorge García-Vidal Technical University of Catalonia (UPC), C/Jordi Girona 1-3, 08034 Barcelona,

More information

Sage Document Management. User's Guide Version 12.1

Sage Document Management. User's Guide Version 12.1 Sage Document Management User's Guide Version 12.1 NOTICE This is a ublication of Sage Software, Inc. Version 12.1. November, 2012 Coyright 2012. Sage Software, Inc. All rights reserved. Sage, the Sage

More information

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

More information

Simulink Implementation of a CDMA Smart Antenna System

Simulink Implementation of a CDMA Smart Antenna System Simulink Imlementation of a CDMA Smart Antenna System MOSTAFA HEFNAWI Deartment of Electrical and Comuter Engineering Royal Military College of Canada Kingston, Ontario, K7K 7B4 CANADA Abstract: - The

More information

Risk in Revenue Management and Dynamic Pricing

Risk in Revenue Management and Dynamic Pricing OPERATIONS RESEARCH Vol. 56, No. 2, March Aril 2008,. 326 343 issn 0030-364X eissn 1526-5463 08 5602 0326 informs doi 10.1287/ore.1070.0438 2008 INFORMS Risk in Revenue Management and Dynamic Pricing Yuri

More information

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in

More information

NUTSS: A SIP-based Approach to UDP and TCP Network Connectivity

NUTSS: A SIP-based Approach to UDP and TCP Network Connectivity NUTSS: A SIP-based Aroach to UDP and TCP Network Connectivity Saikat Guha Det. of Comuter Science Cornell University Ithaca, NY 14853 [email protected] Yutaka Takeda Panasonic Communications San Diego,

More information

A Two-Step Method for Clustering Mixed Categroical and Numeric Data

A Two-Step Method for Clustering Mixed Categroical and Numeric Data Tamkang Journal of Science and Engineering, Vol. 13, No. 1, pp. 11 19 (2010) 11 A Two-Step Method for Clustering Mixed Categroical and Numeric Data Ming-Yi Shih*, Jar-Wen Jheng and Lien-Fu Lai Department

More information

Machine Learning with Operational Costs

Machine Learning with Operational Costs Journal of Machine Learning Research 14 (2013) 1989-2028 Submitted 12/11; Revised 8/12; Published 7/13 Machine Learning with Oerational Costs Theja Tulabandhula Deartment of Electrical Engineering and

More information

Sage Document Management. User's Guide Version 13.1

Sage Document Management. User's Guide Version 13.1 Sage Document Management User's Guide Version 13.1 This is a ublication of Sage Software, Inc. Version 13.1 Last udated: June 19, 2013 Coyright 2013. Sage Software, Inc. All rights reserved. Sage, the

More information

Optimal Routing and Scheduling in Transportation: Using Genetic Algorithm to Solve Difficult Optimization Problems

Optimal Routing and Scheduling in Transportation: Using Genetic Algorithm to Solve Difficult Optimization Problems By Partha Chakroborty Brics "The roblem of designing a good or efficient route set (or route network) for a transit system is a difficult otimization roblem which does not lend itself readily to mathematical

More information