# Online Hierarchical Clustering in a Data Warehouse Environment

Save this PDF as:

Size: px
Start display at page:

## Transcription

3 reachability diagram dataset dendrogram Figure 1. A reachability diagram (left) and a dendrogram (right) for a sample data set (middle). tween clusters (e.g. single-link, average-link, completelink) usually favor clusters of certain shapes. In [1] the hierarchical algorithm called OPTICS is presented which uses a density-based clustering notion. In particular, OPTICS takes the density in the neighborhood of an object into account, using two input parameters, ε and minpts. Thecore distance of o D, denoted by CDist(o), measures the density around o and is computed as CoreDist(o) =nn-dist minpts (o) if N ε (o) minpts else CoreDist(o) =. N ε (o) denotes the ε-neighborhood of o and nn-dist k (o) denotes the k-nearest neighbor distance of o. Thereachabil- ity distance of an object p Drelative from object o D w.r.t. ε Ê and minpts Æ is then defined as RDist(o, p) = max(cdist(o), δ(o, p)), where δ is the distance function. The reachability distance of p from o is an asymmetric distance measure that takes the density around o (its core distance) into account. Let us point out that for ε= and minpts=2 the reachability distance of p from o is δ(o, p)), since CDist(o) =δ(o, p)). The OPTICS algorithm computes a so-called cluster ordering of a database w.r.t. the two input parameters ε and minpts. In addition, the core distance and a suitable reachability distance is stored for each object. This information is enough to extract the hierarchical clustering structure of the data set. The algorithm starts with an arbitrary object o D, assigns a reachability distance of to o and expands the cluster order if the core distance of o is smaller than the input parameter ε. The expansion is performed by inserting the objects p N ε (o) into a seed list. This seed list is organized as a heap, storing that object q as first object in the list, having the minimum reachability distance to all the already processed objects. The next object to be inserted in the cluster ordering is always the first object in the seed list. If the core distance of this object is smallerorequaltoε, all objects in the ε-neighborhood are again inserted into or updated in the seed list. If the seed list is empty and there are still some not yet processed objects in D, we have a so-called jump. OPTICS selects another arbitrary not yet handled object in D to further expand the cluster ordering CO as described above. The resulting cluster ordering can be visualized very intuitively and clearly by means of a so-called reachability diagram, even for very large datasets. A reachability diagram is a 2D visualization of a cluster ordering, where the objects are plotted according to the cluster ordering along the x-axis, and the reachability distance of each object along the y-axis (cf. Figure 1 (left)). Clusters are valleys in the diagram. In contrast to other methods, OPTICS does not favor clusters of a particular size or shape. Single-link clustering is in fact a special parametrisation of OP- TICS. A single-link clustering can be computed using OPTICS with ε= and minpts=2. 3 Online Bulk Updates of Hierarchical Clustering Structures As discussed above, usually a reachability diagram can be transformed into a dendrogram and vice versa [12]. Due to these relationships, we can develop online algorithms for bulk insertions and deletions using reachability diagrams as a representation of a hierarchical clustering structure without loosing generality. We propose algorithmic schemata called OnlineOP- TICS that can be used for efficient online clustering in both worlds agglomerative single-link clustering and density-based clustering. In the following, we call the insertion or deletion of a bulk of objects an update operation. The bulk of update objects is denoted by U. We further assume D to be a database of n objects and δ to be a metric distance function on the objects in D. 3.1 General Ideas and Concepts OPTICS computes a cluster ordering. In order to maintain a cluster ordering, we first formalize it. 3

4 o is predecessor of p and q: PRE(p)=o and PRE(q)=o p and q are the successors of o: SUCC(o) = {p,q} reachability distance o p q cluster ordering Figure 2. Visualization of predecessor and successors. Definition 1 (cluster ordering) Let minpts Æ, ε Ê, and CO be a permutation of the objects in D. Eacho Dhas additional attributes o.p, o.c and o.r, where o.p {1,...,n} symbolizes the position of o in CO. We call CO a cluster ordering w.r.t. ε and minpts if the following conditions hold: (1) p CO : p.c = CDist(p) (2) p CO : p.p = i p.r =min{rdist(x, y) x.p < i y.p i}, where min =. Condition (1) states that p.c is the core distance of object p. Intuitively, condition (2) states that the order is built by starting at an arbitrary object in D and then selecting at each position i in CO that object p having the minimum reachability distance from any object before i. p.r is this minimum reachability distance assigned to object p during the generation of CO. The object p which is responsible for the choice of object o at position o.p is called the predecessor of o in CO, denoted by Pre(o). We say that o has been reached from its predecessor. If o has not been reached from any other object, its reachability distance o.r in CO is, and thus, its predecessor is undefined. Analogously, the successors of on object o in CO, denoted by Suc(o), are those objects having o as their predecessor, i.e. Suc(o) ={p CO Pre(p) =o}. The successors of an object o include all objects in the cluster ordering that have been reached from o. Obviously, there may also be objects that do not have any successors. Both concepts of predecessor and successors are visualized in Figure 2. As stated above, the cluster ordering is exactly that representation we need for an efficient online reconstruction of the hierarchical clustering structure. In particular, for each object o in the cluster ordering, we compute its position (o.p ), its core distance (o.c), the suitable reachability distance (o.r), andinaddi- tion to the original OPTICS algorithm its predecessor (Pre(o)) and its successors (Suc(o)) on the fly during the original OPTICS run or during reorganization. 3.2 Affected Objects In case of an update operation, condition (2) of Definition 1 may be violated. Recomputing the entire cluster ordering using the offline OPTICS algorithm requires the computation of one range query for each object in the database. However, in most cases, several parts of the cluster ordering remain unchanged. For these parts of the cluster ordering, we do not need to computerangequeries. Theideaofanonlineupdateis to save unnecessary range queries and distance calculations during reconstruction. In the following, we say that if we decide to compute a range query around an object, this object is affected by the update and needs reorganization. To determine the objects that are affected by update operations, we make the following considerations. Due to an update operation, the core distance of some objects may change. As a consequence, the reachability distances of some objects that were reached from these objects in the cluster ordering may also change, causing the above mentioned violation of condition (2) in Definition 1. We call the objects with changing core distances directly affected objects. Let NN k (q) bethe set of the k-nearest neighbors of q D.Thesetofreverse k-nearest neighbors of an object p Dis defined as Rev k (p) ={q D p NN k (q)}. Definition 2 (directly affected objects) Let U be the update set and CO be a cluster ordering of D w.r.t. ε Ê and minpts Æ. Thesetofdirectly affected objects due to an update operation, denoted by DAff(U), is defined as DAff(U) ={o Rev minpts (u) u U δ(o, u) ε}. In fact, not all objects in DAff(U) must change their core distances. It may happen that an object o DAff(U) has a core distance of already. Then, in case of deletion, the core distance of o obviously remains unchanged. In case of insertion, the change of the core distance of o depends on whether the minptsnearest neighbor distance of o decreases under the limit of ε. Obviously, an object not belonging to DAff(U) cannot change its core distance due to the update operation. The set DAff(U) can be efficiently computed using the following lemma. Lemma 1 Let U be the update set and CO be a cluster ordering of D. Then the following holds: 4

5 DAff(U) = {o u U o N ε (u) δ(u, o) CDist(o)}. Proof. Let X := {o u U o N ε (u) δ(u, o) CDist(o)}. We show that DAff(U) =X: o DAff(U) :u U δ(o, u) ε o Rev minpts (u) u U o N ε (u) o Rev minpts (u) u U o N ε (u) u NN minpts (o) u U o N ε (u) δ(u, o) nn-dist minpts (o) u U o N ε (u) δ(u, o) CDist(o) o X Lemma 1 states that we can filter out a lot of objects not belonging to DAff(U) by computing only one range query around each update object u U. In addition, for all u Uwe only have to test the objects o N ε (u) whether o Rev minpts (u) holds. The idea is that for all o Rev minpts (u), it holds that δ(u, o) CDist(o). If the update operation is an insertion, the core distance of o decreases, whereas if the update operation is a deletion, the core distance of o increases. Due to mutating core distances, reachability distances of some objects may change. We call these objects indirectly affected. A changing reachability distance may cause the violation of condition (2) in Definition 1. If a reachability distance of an object o decreases due to a changed core distance, o may move forward in the cluster ordering, otherwise o may move backwards. In addition, if an object o has changed its postion in the cluster ordering, the successors and the predecessor (if not yet handled) might also be affected because these objects may now be reached earlier or later, i.e. may be placed at a different position in the new cluster ordering. We call these objects that may be indirectly affected by the reorganization of an object o the potential successors of o. The set of potential successors Suc(o) of an affected object o CO is recursively defined: (1) if p Suc(o) thenp Suc pot (o). (2) if q Suc pot (o) andp Suc(q) and x Suc pot (o) x.p = p.p 1 x.r p.r then p Suc pot (o). Condition (1) states that all direct successors of o are also candidates for reorganization. The second condition collects recursively the direct successors of potential successors, as long as their assigned reachability distance increases. If the reachability distance decreases, we enter a new cluster where the objects are more densely packed. In this case, the objects can only be affected by the local neighborhood. The intuition behind condition (2) is visualized in Figure 3. The recursive collection of the potential successors of object o stops at object p. All objects in C will be reached from o q p o p C q CO Figure 3. The recursive collection of potential successors of o stops at p and may restart at q. p or another object in C but not from o. The recursive collection may restart at q if q or any object after q is the successor of p or of any object before p. Lemma 2 Let U be the update set, CO be a cluster ordering of D, ando, p CO. If p Suc pot (o), then o Pre(p). Proof. Let p Suc pot (o). We show o Pre(p) by recursion over Suc pot (o): (1) p Suc(o) o Pre(p). (2) q Suc pot (o) and p Suc(q) and x Suc pot (o) x.p = p.p 1 x.r > p.r: Since x has been reached earlier than p from the objects before x, x.p = p.p 1, and x.r > p.r, p must have been reached from x, i.e. Pre(p) =x o Pre(p). Lemma 2 states, that only the points in PotSuc(o) can be affected by a reorganization of an object o. Now we are able to develop online versions to handle insertions and deletions efficiently. In the following, CO old denotes the original cluster ordering before the update. OnlineOPTICS will create CO new, the new (valid) cluster ordering after the update, by performing a single pass over CO old. 3.3 Online Bulk Insertions The pseudo code of the online bulk insertion algorithm is depicted in Figure 4. In the first step of the insertion of all objects in U, the core distances of each o DAff(U) are updated and all objects u Uare inserted into the seed list OrderedSeeds with u.r = and Pre(u) =. Thisisbecauseitisnotyetclear, from which object the objects in U are reached in CO new. After that, the reorganization is performed, imitating the original OPTICS algorithm. We manage the objects that need reorganization in the seed list. In each iteration of the reconstruction loop, we compare the next not yet handled object c in CO old with the 5

6 bulkinsert(setofobjects U, ClusterOrdering CO old ) // all o CO old and u U marked as not yet handled // CO new is an empty cluster ordering for each u Udo u.p := n +1; u.c := CDist(u); for each o DAff(U) do update the core distance of o; insert u Uinto OrderedSeeds with u.r = and Pre(u) = ; while CO old contains unhandled objects or OrderedSeeds do c := first not yet handled object in CO old ; s := first not yet handled object in OrderedSeeds; if s.r > c.r or (s.r = c.r and s.p > c.p ) then l:= c; appendl to CO new; else l:= s; OrderedSeeds.remove(s); append l to CO new; mark l as handled; if l is chosen from OrderedSeeds or l DAff(U) then OrderedSeeds.update(Suc pot (l), l); else for each u U: u do if u is not yet handled and l.c ε and u N ε (l) then OrderedSeeds.update({u}, l); Figure 4. Algorithm bulkinsert. first object s in OrderedSeeds. The object among the two (c and s) which has the smaller reachability distance value is appended to CO new. If the reachability distance values of both objects are equal, we append that object having the smaller position. After the insertion of an object l into the new cluster ordering CO new,wehavetoupdateorderedseeds. This is done by the method update which is an adoption of the corresponding method in [1]. If the recently processed object l is derived from the original cluster ordering CO old, we have to test which update objects u Uhave been already processed. If any u Uhas not been processed and l.c and δ(l, u) ε, u would have been inserted/updated in OrderedSeeds in the original OPTICS run. Thus, l may be a potential predecessor of u and OrderedSeeds has to be updated accordingly. Other connections are not affected, since we store the objects that need reorganization in the seed list. If l is derived from OrderedSeeds or from DAff(U), some connections may need reorganization. Thus, all not yet processed potential successors x Suc pot (l) have to be inserted/updated in the seed list. The reorganization stops if the original cluster ordering CO old does not contain unprocessed objects any more and the seed list is empty. Lemma 3 The online bulk insert algorithm produces a valid cluster ordering w.r.t. Definition 1. Proof. Due to space limitations, we refer to the long version of this paper for the proof. bulkdelete(setofobjects U, ClusterOrdering CO old ) // all o CO old marked as not yet handled // CO new is an empty cluster ordering for each u U mark u as handled; rs := ; for each o DAff(U) do update the core distance of o; insert Suc pot (o) into rs; insert r rs into OrderedSeeds with r.r = and Pre(r) = ; while CO old contains unhandled objects or OrderedSeeds do c := first unhandled object in CO old not contained in rs; s := first unhandled object in OrderedSeeds; if s.r c.r or Pre(c) is not yet handled then l:=s; OrderedSeeds.remove(s); append l to CO new; else l:=c; appendl to CO new; mark l as handled; if l.c ε then OrderedSeeds.updateAll(l, ε); OrderedSeeds.update(Suc (l), l); Figure 5. Algorithm bulkdelete. 3.4 Online Bulk Deletion The pseudo code of the online bulk deletion is depicted in Figure 5. In the first step, all objects u Uare marked as handled. In addition, for each o DAff(U) the core distance of each o is updated and its potential successors Suc pot (o) are inserted into OrderedSeeds with a reachability distance of and (yet) undefined predecessor. After that, the reorganization is performed, again simulating the original OPTICS algorithm. In each iteration of the reconstruction loop, we compare the reachability distance of the next not yet handled object c in CO old which is not contained in DAff(U) withthat of the first object s in OrderedSeeds. If the predecessor of c is not yet processed (i.e. inserted into CO new ), c cannot be taken from CO old and cannot be appended to CO new. Otherwise, that object of c and s, having the smaller reachability distance value, is appended to CO new. After the insertion of an object l into the new cluster ordering CO new,wehavetoupdateorderedseeds. This is done by the methods update and updateall which are both adoptions of the according method in [1]. The reorganization stops if the original cluster ordering CO old does not contain unprocessed objects any more and the seed list is empty. Lemma 4 The online bulk delete algorithm produces a valid cluster ordering w.r.t. Definition 1. Proof. Due to space limitations, we refer to the long version of this paper for the proof. 6

7 speed-up factor 10,0 9,0 8,0 7,0 6,0 5,0 4,0 3,0 2,0 1,0 0, deleted objects 6000 deleted objects 9000 deleted objects 3000 inserted objects 6000 inserted object 9000 inserted objects number of database tuples Figure 6. Speed-up w.r.t database size. 12,0 speed-up factor 14,0 12,0 10,0 8,0 6,0 4,0 2,0 0, deleted objects 3000 deleted objects 5000 deleted objects 1000 inserted objects 3000 inserted objects 5000 inserted objects epsilon Figure 8. Speed-up w.r.t the choice of ε. 10,0 2D delete speed-up factor 8,0 6,0 4,0 5D delete 9D delete 2D insert 5D insert 9D insert CPU + I/O-Time [min] OPTICS Online-OPTICS Delete Online-OPTICS Insert 2, , number of update objects number of update objects Figure 7. Speed-up w.r.t data dimensionality. Figure 9. Results on real-world TV data. 4 Experimental Evaluation We compared OnlineOPTICS with offline OPTICS. We on several synthetic databases and a real-world database containing TV-snapshots represented by 64D color histograms. All tests were run under Linux on a workstation with a 3.2 GHz CPU, and hard disk withatransferrateof20mbyte/sforthesequential scan. Random page accesses need 8ms including seek time, latency delay, and transfer time. Unless otherwise specified, we used an X-Tree [3] to speed-up the range-queries required by both algorithms. The speed-up factors of OnlineOPTICS over the offline version w.r.t. the database size are depicted in Figure 6. As it can be seen, the performance gain achieved by the online update algorithm is growing with an increasing number of data objects both for insertions and deletions. In fact, an online update of nearly 10% of the data objects is still more efficient than an offline update. We illustrate the speed-up factors of OnlineOPTICS over offline OPTICS w.r.t. the dimensionality of the database in Figure 7. With increasing dimensionality of the database, the speed-up factors slightly decrease due to the performance deterioration of the underlying index. The only parameter of offline OPTICS that affects the performance is ε. In [1] the authors state, that ε has to be chosen large enough. In particular, ε has to be chosen large enough, such that there is no jump in the ordering. In general, the optimal choice of ε is quite hard to anticipated beforehand, so we copmared OnlineOPTICS to offline OPTICS w.r.t. several choices of ε. The results are depicted in Figure 8 showing that the higher the value for ε, the bigger is the performance gain of the online update. This can be explained by the decreasing selectivity of the underlying index. We applied offline OPTICS and OnlineOPTICS to the above mentioned real-world dataset of 64D color histograms representing TV snapshots. The results are illustrated in Figure 9, confirming the observations made on the synthetic data. As it can be seen, OnlineOPTICS also achieved impressive speed-up factors over OPTICS for both insertion and deletion of a bulk of objects. We also tested offline OPTICS and OnlineOPTICS computing all range queries on top of the sequential scan using a synthetic database of 2D feature vectors. 7

### An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data K. Mumtaz 1 Assistant Professor, Department of MCA Vivekanandha Institute of Information and Management Studies, Tiruchengode,

### Visual Data Mining with Pixel-oriented Visualization Techniques

Visual Data Mining with Pixel-oriented Visualization Techniques Mihael Ankerst The Boeing Company P.O. Box 3707 MC 7L-70, Seattle, WA 98124 mihael.ankerst@boeing.com Abstract Pixel-oriented visualization

### A Novel Density based improved k-means Clustering Algorithm Dbkmeans

A Novel Density based improved k-means Clustering Algorithm Dbkmeans K. Mumtaz 1 and Dr. K. Duraiswamy 2, 1 Vivekanandha Institute of Information and Management Studies, Tiruchengode, India 2 KS Rangasamy

### Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in

### OPTICS: Ordering Points To Identify the Clustering Structure

Proc. ACM SIGMOD 99 Int. Conf. on Management of Data, Philadelphia PA, 1999. OPTICS: Ordering Points To Identify the Clustering Structure Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander

### Data Mining Cluster Analysis: Advanced Concepts and Algorithms. ref. Chapter 9. Introduction to Data Mining

Data Mining Cluster Analysis: Advanced Concepts and Algorithms ref. Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar 1 Outline Prototype-based Fuzzy c-means Mixture Model Clustering Density-based

### A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases

Published in the Proceedings of 14th International Conference on Data Engineering (ICDE 98) A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases Xiaowei Xu, Martin Ester, Hans-Peter

### THE concept of Big Data refers to systems conveying

EDIC RESEARCH PROPOSAL 1 High Dimensional Nearest Neighbors Techniques for Data Cleaning Anca-Elena Alexandrescu I&C, EPFL Abstract Organisations from all domains have been searching for increasingly more

### Clustering: Techniques & Applications. Nguyen Sinh Hoa, Nguyen Hung Son. 15 lutego 2006 Clustering 1

Clustering: Techniques & Applications Nguyen Sinh Hoa, Nguyen Hung Son 15 lutego 2006 Clustering 1 Agenda Introduction Clustering Methods Applications: Outlier Analysis Gene clustering Summary and Conclusions

### Clustering & Visualization

Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.

### DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

### MODULE 15 Clustering Large Datasets LESSON 34

MODULE 15 Clustering Large Datasets LESSON 34 Incremental Clustering Keywords: Single Database Scan, Leader, BIRCH, Tree 1 Clustering Large Datasets Pattern matrix It is convenient to view the input data

### Extensive Survey on Hierarchical Clustering Methods in Data Mining

Extensive Survey on Hierarchical Clustering Methods in Data Mining Dipak P Dabhi 1, Mihir R Patel 2 1Dipak P Dabhi Assistant Professor, Computer and Information Technology (C.E & I.T), C.G.P.I.T, Gujarat,

### Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

### Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

### The DC-Tree: A Fully Dynamic Index Structure for Data Warehouses

Published in the Proceedings of 16th International Conference on Data Engineering (ICDE 2) The DC-Tree: A Fully Dynamic Index Structure for Data Warehouses Martin Ester, Jörn Kohlhammer, Hans-Peter Kriegel

### The Role of Visualization in Effective Data Cleaning

The Role of Visualization in Effective Data Cleaning Yu Qian Dept. of Computer Science The University of Texas at Dallas Richardson, TX 75083-0688, USA qianyu@student.utdallas.edu Kang Zhang Dept. of Computer

### The DC-tree: A Fully Dynamic Index Structure for Data Warehouses

The DC-tree: A Fully Dynamic Index Structure for Data Warehouses Martin Ester, Jörn Kohlhammer, Hans-Peter Kriegel Institute for Computer Science, University of Munich Oettingenstr. 67, D-80538 Munich,

### Philosophies and Advances in Scaling Mining Algorithms to Large Databases

Philosophies and Advances in Scaling Mining Algorithms to Large Databases Paul Bradley Apollo Data Technologies paul@apollodatatech.com Raghu Ramakrishnan UW-Madison raghu@cs.wisc.edu Johannes Gehrke Cornell

### Binary Coded Web Access Pattern Tree in Education Domain

Binary Coded Web Access Pattern Tree in Education Domain C. Gomathi P.G. Department of Computer Science Kongu Arts and Science College Erode-638-107, Tamil Nadu, India E-mail: kc.gomathi@gmail.com M. Moorthi

### Clustering Data Streams

Clustering Data Streams Mohamed Elasmar Prashant Thiruvengadachari Javier Salinas Martin gtg091e@mail.gatech.edu tprashant@gmail.com javisal1@gatech.edu Introduction: Data mining is the science of extracting

### Linköpings Universitet - ITN TNM033 2011-11-30 DBSCAN. A Density-Based Spatial Clustering of Application with Noise

DBSCAN A Density-Based Spatial Clustering of Application with Noise Henrik Bäcklund (henba892), Anders Hedblom (andh893), Niklas Neijman (nikne866) 1 1. Introduction Today data is received automatically

### Indexing Techniques in Data Warehousing Environment The UB-Tree Algorithm

Indexing Techniques in Data Warehousing Environment The UB-Tree Algorithm Prepared by: Yacine ghanjaoui Supervised by: Dr. Hachim Haddouti March 24, 2003 Abstract The indexing techniques in multidimensional

### Hierarchical Clustering. Clustering Overview

lustering Overview Last lecture What is clustering Partitional algorithms: K-means Today s lecture Hierarchical algorithms ensity-based algorithms: SN Techniques for clustering large databases IRH UR ata

### SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

### Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis Abdun Mahmood, Christopher Leckie, Parampalli Udaya Department of Computer Science and Software Engineering University of

### A Study on the Hierarchical Data Clustering Algorithm Based on Gravity Theory

A Study on the Hierarchical Data Clustering Algorithm Based on Gravity Theory Yen-Jen Oyang, Chien-Yu Chen, and Tsui-Wei Yang Department of Computer Science and Information Engineering National Taiwan

### Fig. 1 A typical Knowledge Discovery process [2]

Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering

### Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,

### 6. If there is no improvement of the categories after several steps, then choose new seeds using another criterion (e.g. the objects near the edge of

Clustering Clustering is an unsupervised learning method: there is no target value (class label) to be predicted, the goal is finding common patterns or grouping similar examples. Differences between models/algorithms

### Smart-Sample: An Efficient Algorithm for Clustering Large High-Dimensional Datasets

Smart-Sample: An Efficient Algorithm for Clustering Large High-Dimensional Datasets Dudu Lazarov, Gil David, Amir Averbuch School of Computer Science, Tel-Aviv University Tel-Aviv 69978, Israel Abstract

### A Comparative Study of clustering algorithms Using weka tools

A Comparative Study of clustering algorithms Using weka tools Bharat Chaudhari 1, Manan Parikh 2 1,2 MECSE, KITRC KALOL ABSTRACT Data clustering is a process of putting similar data into groups. A clustering

### . Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties

### Cluster Description Formats, Problems and Algorithms

Cluster Description Formats, Problems and Algorithms Byron J. Gao Martin Ester School of Computing Science, Simon Fraser University, Canada, V5A 1S6 bgao@cs.sfu.ca ester@cs.sfu.ca Abstract Clustering is

### Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004

### TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

### EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

### Clustering methods for Big data analysis

Clustering methods for Big data analysis Keshav Sanse, Meena Sharma Abstract Today s age is the age of data. Nowadays the data is being produced at a tremendous rate. In order to make use of this large-scale

### Analysis of Algorithms I: Binary Search Trees

Analysis of Algorithms I: Binary Search Trees Xi Chen Columbia University Hash table: A data structure that maintains a subset of keys from a universe set U = {0, 1,..., p 1} and supports all three dictionary

### Clustering UE 141 Spring 2013

Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or

### Chapter 7. Cluster Analysis

Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. Density-Based Methods 6. Grid-Based Methods 7. Model-Based

### Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004

### Data Warehousing and Data Mining

Data Warehousing and Data Mining Lecture Clustering Methods CITS CITS Wei Liu School of Computer Science and Software Engineering Faculty of Engineering, Computing and Mathematics Acknowledgement: The

### Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 10 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme Technische Universität Braunschweig

### Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

### Clustering Hierarchical clustering and k-mean clustering

Clustering Hierarchical clustering and k-mean clustering Genome 373 Genomic Informatics Elhanan Borenstein The clustering problem: A quick review partition genes into distinct sets with high homogeneity

### Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

### Static Data Mining Algorithm with Progressive Approach for Mining Knowledge

Global Journal of Business Management and Information Technology. Volume 1, Number 2 (2011), pp. 85-93 Research India Publications http://www.ripublication.com Static Data Mining Algorithm with Progressive

### Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

### Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

### Clustering on Large Numeric Data Sets Using Hierarchical Approach Birch

Global Journal of Computer Science and Technology Software & Data Engineering Volume 12 Issue 12 Version 1.0 Year 2012 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global

### CHAPTER 3 DATA MINING AND CLUSTERING

CHAPTER 3 DATA MINING AND CLUSTERING 3.1 Introduction Nowadays, large quantities of data are being accumulated. The amount of data collected is said to be almost doubled every 9 months. Seeking knowledge

### GraphZip: A Fast and Automatic Compression Method for Spatial Data Clustering

GraphZip: A Fast and Automatic Compression Method for Spatial Data Clustering Yu Qian Kang Zhang Department of Computer Science, The University of Texas at Dallas, Richardson, TX 75083-0688, USA {yxq012100,

### Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental

### Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

### DWMiner : A tool for mining frequent item sets efficiently in data warehouses

DWMiner : A tool for mining frequent item sets efficiently in data warehouses Bruno Kinder Almentero, Alexandre Gonçalves Evsukoff and Marta Mattoso COPPE/Federal University of Rio de Janeiro, P.O.Box

### Anomaly Detection using multidimensional reduction Principal Component Analysis

IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 1, Ver. II (Jan. 2014), PP 86-90 Anomaly Detection using multidimensional reduction Principal Component

### Benchmarking Cassandra on Violin

Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

### Creating Synthetic Temporal Document Collections for Web Archive Benchmarking

Creating Synthetic Temporal Document Collections for Web Archive Benchmarking Kjetil Nørvåg and Albert Overskeid Nybø Norwegian University of Science and Technology 7491 Trondheim, Norway Abstract. In

### Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

### On Multiple Query Optimization in Data Mining

On Multiple Query Optimization in Data Mining Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland {marek,mzakrz}@cs.put.poznan.pl

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical

### Map-Reduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programming Model

Map-Reduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programming Model Subramanyam. RBV, Sonam. Gupta Abstract An important problem that appears often when analyzing data involves

### SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION. Marc-Olivier Briat, Jean-Luc Monnot, Edith M.

SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION Abstract Marc-Olivier Briat, Jean-Luc Monnot, Edith M. Punt Esri, Redlands, California, USA mbriat@esri.com, jmonnot@esri.com,

### An Empirical Study of Two MIS Algorithms

An Empirical Study of Two MIS Algorithms Email: Tushar Bisht and Kishore Kothapalli International Institute of Information Technology, Hyderabad Hyderabad, Andhra Pradesh, India 32. tushar.bisht@research.iiit.ac.in,

### SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs

SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs Fabian Hueske, TU Berlin June 26, 21 1 Review This document is a review report on the paper Towards Proximity Pattern Mining in Large

### Performance of KDB-Trees with Query-Based Splitting*

Performance of KDB-Trees with Query-Based Splitting* Yves Lépouchard Ratko Orlandic John L. Pfaltz Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science University of Virginia Illinois

### Physical Data Organization

Physical Data Organization Database design using logical model of the database - appropriate level for users to focus on - user independence from implementation details Performance - other major factor

### Protein Protein Interaction Networks

Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

### Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism

Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism Jianqiang Dong, Fei Wang and Bo Yuan Intelligent Computing Lab, Division of Informatics Graduate School at Shenzhen,

### Forschungskolleg Data Analytics Methods and Techniques

Forschungskolleg Data Analytics Methods and Techniques Martin Hahmann, Gunnar Schröder, Phillip Grosse Prof. Dr.-Ing. Wolfgang Lehner Why do we need it? We are drowning in data, but starving for knowledge!

### Recent Advances in Clustering: A Brief Survey

Recent Advances in Clustering: A Brief Survey S.B. KOTSIANTIS, P. E. PINTELAS Department of Mathematics University of Patras Educational Software Development Laboratory Hellas {sotos, pintelas}@math.upatras.gr

### Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 9, Issue 8 (January 2014), PP. 19-24 Comparative Analysis of EM Clustering Algorithm

### Multimedia Databases. Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.

Multimedia Databases Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 14 Previous Lecture 13 Indexes for Multimedia Data 13.1

### Clustering Techniques

Clustering Techniques Marco BOTTA Dipartimento di Informatica Università di Torino botta@di.unito.it www.di.unito.it/~botta/didattica/clustering.html Data Clustering Outline What is cluster analysis? What

### Chapter 20: Data Analysis

Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

### A comparison of various clustering methods and algorithms in data mining

Volume :2, Issue :5, 32-36 May 2015 www.allsubjectjournal.com e-issn: 2349-4182 p-issn: 2349-5979 Impact Factor: 3.762 R.Tamilselvi B.Sivasakthi R.Kavitha Assistant Professor A comparison of various clustering

### A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets

A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets Preeti Baser, Assistant Professor, SJPIBMCA, Gandhinagar, Gujarat, India 382 007 Research Scholar, R. K. University,

### International Journal of Enterprise Computing and Business Systems. ISSN (Online) : 2230

Analysis and Study of Incremental DBSCAN Clustering Algorithm SANJAY CHAKRABORTY National Institute of Technology (NIT) Raipur, CG, India. Prof. N.K.NAGWANI National Institute of Technology (NIT) Raipur,

### Efficient Storage and Temporal Query Evaluation of Hierarchical Data Archiving Systems

Efficient Storage and Temporal Query Evaluation of Hierarchical Data Archiving Systems Hui (Wendy) Wang, Ruilin Liu Stevens Institute of Technology, New Jersey, USA Dimitri Theodoratos, Xiaoying Wu New

### Cluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico

Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

### A Survey on Outlier Detection Techniques for Credit Card Fraud Detection

IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 2, Ver. VI (Mar-Apr. 2014), PP 44-48 A Survey on Outlier Detection Techniques for Credit Card Fraud

### Duplicate Detection Algorithm In Hierarchical Data Using Efficient And Effective Network Pruning Algorithm: Survey

www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 12 December 2014, Page No. 9766-9773 Duplicate Detection Algorithm In Hierarchical Data Using Efficient

### Scientific Report. BIDYUT KUMAR / PATRA INDIAN VTT Technical Research Centre of Finland, Finland. Raimo / Launonen. First name / Family name

Scientific Report First name / Family name Nationality Name of the Host Organisation First Name / family name of the Scientific Coordinator BIDYUT KUMAR / PATRA INDIAN VTT Technical Research Centre of

### Clustering Techniques: A Brief Survey of Different Clustering Algorithms

Clustering Techniques: A Brief Survey of Different Clustering Algorithms Deepti Sisodia Technocrates Institute of Technology, Bhopal, India Lokesh Singh Technocrates Institute of Technology, Bhopal, India

### Graph Algorithms. Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

Graph Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text Introduction to Parallel Computing, Addison Wesley, 3. Topic Overview Definitions and Representation Minimum

### Dynamic Data in terms of Data Mining Streams

International Journal of Computer Science and Software Engineering Volume 2, Number 1 (2015), pp. 1-6 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining

### Email Spam Detection Using Customized SimHash Function

International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email

### IncSpan: Incremental Mining of Sequential Patterns in Large Database

IncSpan: Incremental Mining of Sequential Patterns in Large Database Hong Cheng Department of Computer Science University of Illinois at Urbana-Champaign Urbana, Illinois 61801 hcheng3@uiuc.edu Xifeng

### MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

### Fuzzy Duplicate Detection on XML Data

Fuzzy Duplicate Detection on XML Data Melanie Weis Humboldt-Universität zu Berlin Unter den Linden 6, Berlin, Germany mweis@informatik.hu-berlin.de Abstract XML is popular for data exchange and data publishing

### Data Warehousing und Data Mining

Data Warehousing und Data Mining Multidimensionale Indexstrukturen Ulf Leser Wissensmanagement in der Bioinformatik Content of this Lecture Multidimensional Indexing Grid-Files Kd-trees Ulf Leser: Data

### Caching XML Data on Mobile Web Clients

Caching XML Data on Mobile Web Clients Stefan Böttcher, Adelhard Türling University of Paderborn, Faculty 5 (Computer Science, Electrical Engineering & Mathematics) Fürstenallee 11, D-33102 Paderborn,

### Classification of Websites as Sets of Feature Vectors

Classification of Websites as Sets of Feature Vectors Hans-Peter Kriegel Institute for Computer Science University of Munich Oettingenstr. 67 D-80538 Munich, Germany kriegel@dbs.informatik.uni-muenchen.de

### An Adaptive Index Structure for High-Dimensional Similarity Search

An Adaptive Index Structure for High-Dimensional Similarity Search Abstract A practical method for creating a high dimensional index structure that adapts to the data distribution and scales well with

### Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...