Linköpings Universitet - ITN TNM033 2011-11-30 DBSCAN. A Density-Based Spatial Clustering of Application with Noise



Similar documents
An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

Public Transportation BigData Clustering

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets

A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Clustering methods for Big data analysis

A Comparative Study of clustering algorithms Using weka tools

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

The Role of Visualization in Effective Data Cleaning

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Cluster Analysis: Advanced Concepts

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Clustering: Techniques & Applications. Nguyen Sinh Hoa, Nguyen Hung Son. 15 lutego 2006 Clustering 1

Data Mining: Foundation, Techniques and Applications

Chapter 7. Cluster Analysis

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

OUTLIER ANALYSIS. Data Mining 1

Scalable Cluster Analysis of Spatial Events

Chapter ML:XI (continued)

Clustering Techniques: A Brief Survey of Different Clustering Algorithms

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

ANALYSIS OF VARIOUS CLUSTERING ALGORITHMS OF DATA MINING ON HEALTH INFORMATICS

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Pre-Analysis and Clustering of Uncertain Data from Manufacturing Processes

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Introduction to Data Mining

A comparison of various clustering methods and algorithms in data mining

A Review of Clustering Methods forming Non-Convex clusters with, Missing and Noisy Data

Identifying erroneous data using outlier detection techniques

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

[EN-024]Clustering radar tracks to evaluate efficiency indicators

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

The Data Mining Process

International Journal of Enterprise Computing and Business Systems. ISSN (Online) : 2230

Algorithms and Applications for Spatial Data Mining

Building Data Cubes and Mining Them. Jelena Jovanovic

Cluster Analysis: Basic Concepts and Algorithms

Using an Ontology-based Approach for Geospatial Clustering Analysis

Using Data Mining for Mobile Communication Clustering and Characterization

A Method for Decentralized Clustering in Large Multi-Agent Systems

Comparison the various clustering algorithms of weka tools

Synflood Spoof Source DDOS Attack Defence Based on Packet ID Anomaly Detection PIDAD

Information Management course

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

The Scientific Data Mining Process

Clustering & Visualization

On Clustering Validation Techniques

Data Mining Process Using Clustering: A Survey

GraphZip: A Fast and Automatic Compression Method for Spatial Data Clustering

Cluster Analysis: Basic Concepts and Algorithms

SCAN: A Structural Clustering Algorithm for Networks

For supervised classification we have a variety of measures to evaluate how good our model is Accuracy, precision, recall

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Density-Based Clustering over an Evolving Data Stream with Noise

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Multi-represented knn-classification for Large Class Sets

Forschungskolleg Data Analytics Methods and Techniques

Car Insurance. Jan Tomášek Štěpán Havránek Michal Pokorný

Cluster Analysis: Basic Concepts and Methods

Tianjun Fu and Hsinchun Chen, Artificial Intelligence Lab, University of Arizona Christopher C. Yang 1,2 and Tobun D. Ng 2

On Data Clustering Analysis: Scalability, Constraints and Validation

Introduction. A. Bellaachia Page: 1

OPTICS: Ordering Points To Identify the Clustering Structure

Knowledge Discovery in Data with FIT-Miner

Unsupervised Data Mining (Clustering)

Categorical Data Visualization and Clustering Using Subjective Factors

Context-aware taxi demand hotspots prediction

Network Intrusion Detection using Data Mining Technique

Data Clustering Techniques Qualifying Oral Examination Paper

The Predictive Data Mining Revolution in Scorecards:

Data Mining + Business Intelligence. Integration, Design and Implementation

Quality Assessment in Spatial Clustering of Data Mining

SPATIAL DATA CLASSIFICATION AND DATA MINING

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

A Review on Clustering and Outlier Analysis Techniques in Datamining

An Introduction to Cluster Analysis for Data Mining

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

PhoCA: An extensible service-oriented tool for Photo Clustering Analysis

An Evaluation of Machine Learning Method for Intrusion Detection System Using LOF on Jubatus

Conclusions and Future Directions

Crime Hotspots Analysis in South Korea: A User-Oriented Approach

Adaptive Anomaly Detection for Network Security

Location Analytics for Financial Services. An Esri White Paper October 2013

Why is Internal Audit so Hard?

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Clustering Using Data Mining Techniques

Using Map-Reduce for Large Scale Analysis of Graph-Based Data

A Two-Step Method for Clustering Mixed Categroical and Numeric Data

A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images

Clustering of Documents for Forensic Analysis

Distributed forests for MapReduce-based machine learning

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

Using multiple models: Bagging, Boosting, Ensembles, Forests

Transcription:

DBSCAN A Density-Based Spatial Clustering of Application with Noise Henrik Bäcklund (henba892), Anders Hedblom (andh893), Niklas Neijman (nikne866) 1

1. Introduction Today data is received automatically from many different kinds of equipments. Satellites, x-rays and traffic cameras are just a few of them. To make this information/data understandable for us, it has to be processed. When working with large data sets it is in most scenarios useful to be able to separate information by dividing the data into smaller categories, and eventually, to do class identification. Not least is this important when treating large spatial databases. A satellite, for example, gathers images as it travels around our earth. It is desired to classify what parts of the images are houses, cars, roads, lakes, forests, etc. Since the image database is big, a good classification algorithm is needed. Classification can, for instance, be done with the help of clustering algorithms, which clumps similar data together into different clusters. However, using clustering algorithms involves some problems: It can often be difficult to know which input parameters that should be used for a specific database, if the user does not have enough knowledge of the domain. Furthermore, spatial data sets can contain huge amounts of data, and trying to find cluster patterns in several dimensions is very computationally costly. Short computing time is always favorable. Last, the shapes of the clusters can be arbitrary and in bad cases very complex. Finding these shapes can be very cumbersome. There are some well-used clustering algorithms out there; one of them is the famous CLARANS. Other methods are K-means, K-medoid, Hierarchical Clustering and Self-Organized Maps. Nevertheless, none of these algorithms can handle all these three mentioned problems in a good way. This report will not discuss these methods but focus on the DBSCAN [1] (Density Based Spatial Clustering of Applications with Noise) algorithm, which introduces solutions to these problems. The following structure will be used in this paper. Section 2 will discuss how the DBSCAN algorithm works in. Section 3 will present various possible applications for the DBSCAN and some comparisons with CLARANS, in terms of efficiency. Finally, section 4 contains the conclusion of this paper and sums up the positive and negative aspects of the DBSCAN algorithm. 2

2. The DBSCAN algorithm The DBSCAN algorithm can identify clusters in large spatial data sets by looking at the local density of database elements, using only one input parameter. Furthermore, the user gets a suggestion on which parameter value that would be suitable. Therefore, minimal knowledge of the domain is required. The DBSCAN can also determine what information should be classified as noise or outliers. In spite of this, its working process is quick and scales very well with the size of the database almost linearly. By using the density distribution of nodes in the database, DBSCAN can categorize these nodes into separate clusters that define the different classes. DBSCAN can find clusters of arbitrary shape, as can be seen in figure 1 [1]. However, clusters that lie close to each other tend to belong to the same class. Figure 1. The node distribution of three different databases, taken from SEQUOIA 2000 benchmark database. The following section will describe further how the DBSCAN algorithm works. Its computing process is based on six rules or definitions, creating two lemmas. Definition 1: (The Eps-neighborhood of a point) N Eps (p) = {q ϵ D dist(p,q)<eps} For a point to belong to a cluster it needs to have at least one other point that lies closer to it than the distance Eps. Definition 2: (Directly density-reachable) There are two kinds of points belonging to a cluster; there are border points and core points, as can be seen in figure 2 [1]. Figure 2. Border- and core points The Eps-neighborhood of a border point tends to have significantly less points than the Epsneighborhood of a core point. [1]. The border points will still be a part of the cluster and in order to include these points, they must belong to the Eps-neighborhood of a core point q as seen in figure 3 [1]. 3

1) p ϵ N Eps (q) In order for point q to be a core point it needs to have a minimum number of points within its Epsneighborhood. 2) N Eps (q) MinPts (core point condition) Figure 3. Point p is directly density-reachable from point q but not vice versa. Definition 3: (Density-reachable) A point p is density-reachable from a point q with respect to Eps and MinPts if there is a chain of points p 1,p n, p 1 =q, p n =p such that p i+1 is directly density-reachable from p i. [1] Figure 4 [1] shows an illustration of a density-reachable point. Figure 4. Point p is density-reachable from point q and not vice versa. Definition 4: (Density-connected) There are cases when two border points will belong to the same cluster but where the two border points don t share a specific core point. In these situations the points will not be density-reachable from each other. There must however be a core point q from which they are both density-reachable. Figure 5 [1] shows how density connectivity works. A point p is density-connected to a point q with respect to Eps and MinPts if there is a point o such that both, p and q are density-reachable from o with respect to Eps and MinPts. Figure 5. Density connectivity. 4

Definition 5: (cluster) If point p is a part of a cluster C and point q is density-reachable from point p with respect to a given distance and a minimum number of points within that distance, then q is also a part of cluster C. 1) " p, q: if p ϵ C and q is density-reachable from p with respect to Eps and MinPts, then q ϵ C. Two points belongs to the same cluster C, is the same as saying that p is density-connected to q with respect to the given distance and the number of points within that given distance. 2) " p, q ϵ C: p is density-connected to q with respect to Eps and MinPts. Definition 6: (noise) Noise is the set of points, in the database, that don t belong to any of the clusters. Lemma 1: A cluster can be formed from any of its core points and will always have the same shape. Lemma 2: Let p be a core point in cluster C with a given minimum distance (Eps) and a minimum number of points within that distance (MinPts). If the set O is density-reachable from p with respect to the same Eps and MinPts, then C is equal to the set O. To find a cluster, DBSCAN starts with an arbitrary point p and retrieves all points density-reachable from p with respect to Eps and MinPts. If p is a core point, this procedure yields a cluster with respect to Eps and MinPts (see Lemma 2). If p is a border point then no points are density-reachable from p and DBSCAN visits the next point of the database. [1] 3. Applications An example of software program that has the DBSCAN algorithm implemented is WEKA. The following of this section gives some examples of practical application of the DBSCAN algorithm. Satellites images A lot of data is received from satellites all around the world and this data have to be translated into comprehensible information, for instance, classifying areas of the satellite-taken images according to forest, water and mountains. Before the DBSCAN algorithm can classify these three elements in the database, some work have to be done with image processing. Once the image processing is done, the data appears as spatial data where the DBSCAN can classify the clusters as desired. X-ray crystallography X-ray crystallography is another practical application that locates all atoms within a crystal, which results in a large amount of data. The DBSCAN algorithm can be used to find and classify the atoms in the data. 5

Anomaly Detection in Temperature Data This kind of application focuses on pattern anomalies in data, which is important in several cases, e.g. credit fraud, health condition etc. This application measures anomalies in temperatures [3], which is relevant due to the environmental changes (global warming). It can also discover equipment errors and so forth. These unusual patterns need to be detected and examined to get control over the situation. The DBSCAN algorithm has the capability to discover such patterns in the data. 4. Comparisons (DBSCAN vs. CLARANS) Through the original report [1], the DBSCAN algorithm is compared to another clustering algorithm. This one is called CLARANS (Clustering Large Applications based on RANdomized Search). It is an improvement of the k-medoid algorithms (one object of the cluster located near the center of the cluster, instead of the gravity point of the cluster, i.e. k-means). The good properties compared to k-medoid are that CLARANS works efficient for databases with about a thousand objects. When the database grows larger, CLARANS will fall behind because the algorithm temporarily stores all the objects in the main memory, i.e. the run time will increase. The following part shows that the difference in performance between DBSCAN and CLARANS is vast. The used databases are the ones shown in figure 1 and the result from using CLARANS is shown in figure 6 [1]. Figure 6. The classification of the CLARANS algorithm. As seen in the figure 6, CLARANS did not manage to classify all the clusters correctly for the three different databases, in contrast to DBSCAN that does, as shown in figure 7 [1]. Figure 7. The classification of the DBSCAN algorithm. 6

So far the DBSCAN algorithm have done a good job classifying all the clusters. Nevertheless, it is also superior when comparing the run time between the two algorithms. Graph 1 shows the comparisons, and the resulting differences in time are large. Data taken from paper [1]. Graph 1. Run time in seconds. To summarize the table, the DBSCAN has an almost linear increase in computing time, relative to the number of points in the database. The CLARANS gain, however, is exponential and gets outperformed with a factor of 250 to 1900 in this test. The factor will continue to get bigger as the size of the database increases. 5. Conclusion Classification on large spatial databases is a difficult and computational heavy task. Using clustering algorithms is one way of doing it, and we have seen that there are several methods out there that get the job done. However, we have also seen the problems included when using these methods; finding the right input parameters, localizing clusters of arbitrary shapes and, last but not least, doing the whole process in a reasonable time. The DBSCAN algorithm has a solution to all these problems as it can find those complicated cluster shapes in a very quickly, with only one assigned input parameter. A value for this parameter is also suggested to the user. We have discussed the efficiency, in terms of accuracy, of the DBSCAN algorithm compared to the famous CLARANS algorithm. The results were extremely one-sided. In some cases with huge databases, the DBSCAN algorithm was over 1900 times faster and still ended up with the better final result. Its only real weakness is that it has some difficulties in distinguishing separated clusters if they are located too close to each other, even though they have different densities. 7

Bibliography [1] - A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.71.1980 [2] - Advances in knowledge discovery and data mining: 9th Pacific-Asia conference, PAKDD 2005, Hanoi, Vietnam, May 18-20, 2005; proceedings Tu Bao Ho, David Cheung, Huan Liu [3] - Anomaly Detection in Temperature Data Using DBSCAN Algorithm: Erciyes Univeristy, http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5946052&tag=1 Mete Celic, Filiz Dadaser-Celic, Ahmet Sakir DOKUZ 8