An Enhanced Clustering Algorithm to Analyze Spatial Data

Similar documents
Comparison and Analysis of Various Clustering Methods in Data mining On Education data set Using the weak tool

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

A comparison of various clustering methods and algorithms in data mining

A Comparative Study of clustering algorithms Using weka tools

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering UE 141 Spring 2013

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Using Data Mining for Mobile Communication Clustering and Characterization

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Cluster Analysis: Advanced Concepts

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Information Retrieval and Web Search Engines

How To Cluster

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Neural Networks Lesson 5 - Cluster Analysis

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

Social Media Mining. Data Mining Essentials

Forschungskolleg Data Analytics Methods and Techniques

Chapter ML:XI (continued)

Hadoop SNS. renren.com. Saturday, December 3, 11

Machine Learning using MapReduce

Cluster Analysis: Basic Concepts and Methods

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Unsupervised Data Mining (Clustering)

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

Statistical Databases and Registers with some datamining

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

Unsupervised learning: Clustering

International Journal of Advance Research in Computer Science and Management Studies

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Original Article Survey of Recent Clustering Techniques in Data Mining

An Overview of Knowledge Discovery Database and Data mining Techniques

K-Means Clustering Tutorial

Cluster analysis Cosmin Lazar. COMO Lab VUB

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Cluster Analysis. Isabel M. Rodrigues. Lisboa, Instituto Superior Técnico

On Clustering Validation Techniques

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Chapter 7. Cluster Analysis

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Data Mining: Foundation, Techniques and Applications

Quality Assessment in Spatial Clustering of Data Mining

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Environmental Remote Sensing GEOG 2021

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

ANALYSIS OF VARIOUS CLUSTERING ALGORITHMS OF DATA MINING ON HEALTH INFORMATICS

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

Clustering: Techniques & Applications. Nguyen Sinh Hoa, Nguyen Hung Son. 15 lutego 2006 Clustering 1

Rule based Classification of BSE Stock Data with Data Mining

A Two-Step Method for Clustering Mixed Categroical and Numeric Data

A Review on Clustering and Outlier Analysis Techniques in Datamining

ANALYSIS & PREDICTION OF SALES DATA IN SAP- ERP SYSTEM USING CLUSTERING ALGORITHMS

152 International Journal of Computer Science and Technology. Paavai Engineering College, India

SoSe 2014: M-TANI: Big Data Analytics

King Saud University

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Standardization and Its Effects on K-Means Clustering Algorithm

Clustering Techniques: A Brief Survey of Different Clustering Algorithms

Introduction to Clustering

DataMining Clustering Techniques in the Prediction of Heart Disease using Attribute Selection Method

Use of Data Mining Techniques to Improve the Effectiveness of Sales and Marketing

SCAN: A Structural Clustering Algorithm for Networks

Resource-bounded Fraud Detection

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework

Web Document Clustering

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

K-means Clustering Technique on Search Engine Dataset using Data Mining Tool

Bisecting K-Means for Clustering Web Log data

Grid Density Clustering Algorithm

Classification algorithm in Data mining: An Overview

How To Cluster On A Large Data Set

Clustering Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

CLUSTER ANALYSIS FOR SEGMENTATION

Hadoop Operations Management for Big Data Clusters in Telecommunication Industry

Distances, Clustering, and Classification. Heatmaps

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

Vector Quantization and Clustering

Clustering Connectionist and Statistical Language Processing

Movie Classification Using k-means and Hierarchical Clustering

Fuzzy Clustering Technique for Numerical and Categorical dataset

Support Vector Machines with Clustering for Training with Very Large Datasets

Semi-Supervised and Unsupervised Machine Learning. Novel Strategies

Data, Measurements, Features

Cluster Analysis: Basic Concepts and Algorithms

COURSE RECOMMENDER SYSTEM IN E-LEARNING

Transcription:

International Journal of Engineering and Technical Research (IJETR) ISSN: 2321-0869, Volume-2, Issue-7, July 2014 An Enhanced Clustering Algorithm to Analyze Spatial Data Dr. Mahesh Kumar, Mr. Sachin Yadav Abstract Cluster analysis is one of the basic tools for exploring the underlying structure of a given data set and is being applied in a wide variety of engineering and scientific disciplines such as medicine, psychology, biology, sociology, pattern recognition, and image processing. The primary objective of cluster analysis is to partition a given data set of multidimensional vectors (patterns) into so-called homogeneous clusters such that patterns within a cluster are more similar to each other than patterns belonging to different clusters. Cluster analysis aims to group data on the basis of similarities and dissimilarities among the data elements. The process can be performed in a supervised, semi-supervised or unsupervised manner. In the paper we have proposed an enhanced algorithm for clustering using K Means technique of spatial data by which results through WEKA GUI Chooser tool are quite satisfactory. Though the results of the enhanced algorithm are compared to the most common known technique such as K Means, Euclidian distance algorithm and DBSCAN algorithm etc. In future it could be beneficiary for searching of the huge data in various sector of life as data is increasing day by day. Index Terms Cluster, Pattern recognition, Spatial, Supervised and Unsupervised Manner. I. INTRODUCTION Clustering involves identifying a finite set of categories (clusters) to describe the data. The clusters can be mutually exclusive, hierarchical or overlapping. Each member of a cluster should be very similar to other members in its cluster and dissimilar to other clusters. Techniques for creating clusters include partitioning (often using the k-means algorithm) and hierarchical methods (which group objects into a tree of clusters), as well as grid, model, and density-based methods that subset. It also called characterization or generalization [1]. generates values of 0 for documents exhibiting no agreement among the assigned indexed terms, and 1 when perfect agreement is detected. Intermediate values are obtained for cases of partial agreement. C. Average Similarity : If the similarity measure is computed for all pairs of documents ( d i, d j ) except when i=j, an average value average similarity is obtainable. specifically, average similarity = constant similar ( d i, d j ), where i=1,2,.n and j=1,2,.n and i < > j D. Threshold : The lowest possible input value of similarity required to join two objects in one cluster. E. Similarity Matrix : Similarity between objects calculated by the function similar (d i,,d j ), represented in the form of a matrix is called a similarity matrix. III. CLUSTERING TECHNIQUES There are so many clustering technique evolved till now but we will see few algorithms among them. A. K Means Clustering Algorithm : K-means is a typical non supervised clustering algorithm in data mining and which is widely used for clustering large set of data. It is a partitioning clustering algorithm, this method is to classify the given data objects into k different clusters through the iterative, converging to a local minimum. So the results of generated clusters are compact and independent. The algorithm consists of two separate phases. The first phase selects k centers randomly, where the value k is fixed in advance [2]. II. FEW TERMS ASSOCIATED WITH CLUSTERING A cluster is an ordered list of objects, which have some common characteristics. The objects belong to an interval [a, b], in our case [0, 1]. A. Distance between two Clusters : The distance between two clusters involves some or all elements of the two clusters. The clustering method determines how the distance should be computed. B. Similarity : A similarity measure similar (d i, d j ) can be used to represent the similarity between the documents. Typical similarity Manuscript received July 16, 2014. Dr. Mahesh Kumar, Department of Computer Science & Engineering, MRKIET Rewari, India Mr. Sachin Yadav, Department of Computer Science & Engineering, MRKIET Rewari, India Figure 1 K-means clustering 53 www.erpublication.org

An Enhanced Clustering Algorithm to Analyze Spatial Data The next phase is to take each data object to the nearest centre. Euclidean distance is generally considered to determine the distance between each data object and the cluster centres. When all the data objects are included in some clusters, the first step is completed and an early grouping is done by recalculating the average of the early formed clusters. This iterative process continues repeatedly until the criterion function becomes the minimum. B. Hierarchical Clustering Algorithm Hierarchical clustering algorithms according to the method that produce clusters can be further divided into: Agglomerative algorithms. They produce a sequence of clustering schemes of decreasing number of clusters at east step. The clustering scheme produced at each step results from the previous one by merging the two closest clusters into one. Divisive algorithms. These algorithms produce a sequence of clustering schemes of increasing number of clusters at each step. Contrary to the agglomerative algorithms the clustering produced at each step results from the previous one by splitting a cluster into two. A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds [3]. C. Density based Clustering Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold. DBSCAN and OPTICS are two typical algorithms of this kind. Subspace clustering methods look for clusters that can only be seen in a particular projection (subspace, manifold) of the data. These methods thus can ignore irrelevant attributes. The key idea of density based clustering is that for each object of a cluster the neighbourhood of a given radius ɛ has to contain at least a minimum number of µ objects, i.e. the cardinality of the neighbourhood has to exceed a given threshold [4]. (sometimes not found in commercial data mining software), it has become one of the most widely used data mining systems. WEKA also became one of the favorite vehicles for data mining research and helped to advance it by making many powerful features available to all. The WEKA project aims to provide a comprehensive collection of machine learning algorithms and data pre-processing tools to researchers and practitioners alike. It allows users to quickly try out and compare different machine learning methods on new data sets. To perform the enhanced algorithm we have used the following data set as shown below in the table. Table 1 DATA SET USED This data sets is described and used in WEKA tool. It can be downloaded from archive.ics.uci.edu/ml/datasets.html. The image recognition dataset has 17 attributes and 20,000 instances and Seeds data set has 7 attribute and 210 instances. When we clustered these datasets in the WEKA tool we get the following results: Table 2 Results using K Means using s=10 Table 3 Results using Improved K Means using s=10 Now using the result shown above, we can draw the distribution for two clusters in the table: Table 4 Cluster-0 and cluster-1distribution IV. EXPERIMENTAL SETUP AND RESULTS For clustering using different algorithms as mentioned earlier, WEKA GUI Chooser tool has been used. WEKA is a landmark system in the history of the data mining and machine learning research communities and is freely available for download. WEKA offers many powerful features As here in table E stands for error measured. 54 www.erpublication.org

Now using the result shown below, we can draw the distribution in the form of a pie chart: International Journal of Engineering and Technical Research (IJETR) ISSN: 2321-0869, Volume-2, Issue-7, July 2014 As from above figures and table it can be easily concluded that the results shown by the improved K-Means Algorithm results are better, thus can be used for the data set used in different segment of life as Medical, transportation etc. V. CONCLUSION From the comparative study it can be easily concluded that instances of a data set are more closely clustered as there is less difference in the two clustered formed using the modified algorithm. Thus the proposed algorithm can be used further for data mining in the medical, transportation, social work analysis, pattern matching. REFERENCES [1] Anil K. Jain, Data Clustering: 50 Years Beyond K-Means 19th International Conference on Pattern Recognition (ICPR), Tampa, FL, December 8, 2008. [2] Prof. Pier Luca Lanzi, Clustering Partationing (Spring 2009) [3] Halkidi M., Batistakis Y., Vazirgiannis M., On Clustering Validation Techniques, Journal of Intelligent Information Systems, 17:2/3, 107 145, 2001. [4] [52] Hand D., Mannila H., Smyth P., Principles of Data Mining, MIT Press, Cambridge, MA. ISBN 0-262 - 08290-X, 2001. [5] http://www. en.wikipedia.org [6] http://www.google.com [7] http://www.archive.ics.uci.edu/ml/datasets.html 55 www.erpublication.org