A Comparative Study of clustering algorithms Using weka tools

Similar documents
Comparison and Analysis of Various Clustering Methods in Data mining On Education data set Using the weak tool

A comparison of various clustering methods and algorithms in data mining

Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.

Clustering methods for Big data analysis

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

The Role of Visualization in Effective Data Cleaning

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Chapter 7. Cluster Analysis

Clustering: Techniques & Applications. Nguyen Sinh Hoa, Nguyen Hung Son. 15 lutego 2006 Clustering 1

Context-aware taxi demand hotspots prediction

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

Clustering UE 141 Spring 2013

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Comparison the various clustering algorithms of weka tools

Proposed Application of Data Mining Techniques for Clustering Software Projects

DATA MINING TECHNIQUES AND APPLICATIONS

GraphZip: A Fast and Automatic Compression Method for Spatial Data Clustering

Linköpings Universitet - ITN TNM DBSCAN. A Density-Based Spatial Clustering of Application with Noise

International Journal of Enterprise Computing and Business Systems. ISSN (Online) : 2230

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

STOCK MARKET TRENDS USING CLUSTER ANALYSIS AND ARIMA MODEL

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Public Transportation BigData Clustering

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Information Management course

Clustering Techniques: A Brief Survey of Different Clustering Algorithms

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n

Quality Assessment in Spatial Clustering of Data Mining

ANALYSIS OF VARIOUS CLUSTERING ALGORITHMS OF DATA MINING ON HEALTH INFORMATICS

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

How To Analyse A Website With Weka

Cluster Analysis: Advanced Concepts

How To Solve The Kd Cup 2010 Challenge

A Study of Web Log Analysis Using Clustering Techniques

A Review of Clustering Methods forming Non-Convex clusters with, Missing and Noisy Data

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Customer Classification And Prediction Based On Data Mining Technique

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

Introduction to Data Mining

Clustering Model for Evaluating SaaS on the Cloud

Introduction Predictive Analytics Tools: Weka

An Overview of Knowledge Discovery Database and Data mining Techniques

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

An Introduction to WEKA. As presented by PACE

Constrained Clustering of Territories in the Context of Car Insurance

Knowledge Discovery in Data with FIT-Miner

EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH

COURSE RECOMMENDER SYSTEM IN E-LEARNING

Web Document Clustering

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Enhanced Boosted Trees Technique for Customer Churn Prediction Model

Comparison of K-means and Backpropagation Data Mining Algorithms

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning

Original Article Survey of Recent Clustering Techniques in Data Mining

Data Mining 資 料 探 勘. 分 群 分 析 (Cluster Analysis)

DATA MINING WITH DIFFERENT TYPES OF X-RAY DATA

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Grid Density Clustering Algorithm

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Advance Research in Computer Science and Management Studies

How To Understand How Weka Works

Network Intrusion Detection using Data Mining Technique

How To Cluster On A Large Data Set

Using Data Mining Techniques for Performance Analysis of Software Quality

Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News

Clustering Data Streams

Three Perspectives of Data Mining

SoSe 2014: M-TANI: Big Data Analytics

Clustering Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Cluster Analysis. Isabel M. Rodrigues. Lisboa, Instituto Superior Técnico

Introduction. A. Bellaachia Page: 1

Cluster Analysis: Basic Concepts and Algorithms

Unsupervised Data Mining (Clustering)

Data Mining with Weka

Data Warehousing and Data Mining

CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES *

Data Mining Process Using Clustering: A Survey

How To Predict Web Site Visits

K-means Clustering Technique on Search Engine Dataset using Data Mining Tool

Use of Data Mining Techniques to Improve the Effectiveness of Sales and Marketing

Application of Data Mining Techniques in Intrusion Detection

Data Clustering Using Data Mining Techniques

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Cluster Analysis: Basic Concepts and Methods

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

ASSOCIATION RULE MINING ON WEB LOGS FOR EXTRACTING INTERESTING PATTERNS THROUGH WEKA TOOL

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Map-Reduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programming Model

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

Transcription:

A Comparative Study of clustering algorithms Using weka tools Bharat Chaudhari 1, Manan Parikh 2 1,2 MECSE, KITRC KALOL ABSTRACT Data clustering is a process of putting similar data into groups. A clustering algorithm partitions a data set into several groups based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. This paper analyze the three major clustering algorithms: K-Means, Hierarchical clustering and Density based clustering algorithm and compare the performance of these three major clustering algorithms on the aspect of correctly class wise cluster building ability of algorithm. Performance of the 3 techniques are presented and compared using a clustering tool WEKA. Keywords: Data mining algorithms, Weka tools, K-means algorithms, Hierarchical clustering, Density based clustering algorithm etc. 1. INTRODUCTION Knowledge discovery process consists of an iterative sequence of steps such as data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation and knowledge presentation. Data mining functionalities are characterization and discrimination, mining frequent patterns, association, correlation, classification and prediction, cluster analysis, outlier analysis and evolution analysis [1]. Three of the major data mining techniques are regression, classification and clustering. CLUSTERING is a data mining technique to group the similar data into a cluster and dissimilar data into different clusters. I am using Weka data mining tools for clustering. 2. WEKA Weka is a landmark system in the history of the data mining and machine learning research communities, because it is the only toolkit that has gained such widespread adoption and survived for an extended period of time[2]. The Weka or woodhen (Gallirallus australis) is an endemic bird of New Zealand. It provides many different algorithms for data mining and machine learning. Weka is open source and freely available. It is also platform-independent. Figure 1 View of weka tool Volume 1, Issue 2, October 2012 Page 154

The GUI Chooser consists of four buttons: Explorer: An environment for exploring data with WEKA. Experimenter: An environment for performing experiments and conducting statistical tests between learning schemes. Knowledge Flow: This environment supports essentially the same functions as the Explorer but with a dragand-drop interface. One advantage is that it supports incremental learning. Simple CLI: Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command line interface. Figure 2 Clustering Algorithms 3. CLUSTERING ALGORITHMS AND TECHNIQUES Many algorithms exist for clustering. Following figures showing three major clustering methods and their approach for clustering. 3.1 K-means Clustering The term "k-means" was first used by James MacQueen in 1967 [3], though the idea goes back to 1957 [4]. The standard algorithm was first proposed by Stuart Lloyd in 1957 as a technique for pulse-code modulation, though it wasn't published until 1982. K-means is a widely used partitional clustering method in the industries. The K-means algorithm is the most commonly used partitional clustering algorithm because it can be easily implemented and is the most efficient one in terms of the execution time. Here s how the algorithm works [5]: K-Means Algorithm: The algorithm for partitioning, where each cluster s center is represented by mean value of objects in the cluster. Input: k: the number of clusters. D: a data set containing n objects. Output: A set of k clusters. Method: 1. Arbitrarily choose k objects from D as the initial cluster centers. 2. Repeat. 3. (re)assign each object to the cluster to which the object is most similar using Eq. 1, based on the mean value of the objects in the cluster. 4. Update the cluster means, i.e. calculate the mean value of the objects for each cluster. 5. until no change. Volume 1, Issue 2, October 2012 Page 155

Figure 3 Result of K-means Clustering This figure show that the result of k-means clustring Methods using WEKA tool. After that we saved the result, the result will be saved in the ARFF file formate. We also open this file in the ms exel. 3.2 Hierarchical clustering HIERARCHICAL CLUSTERING builds a cluster hierarchy or, in other words, a tree of clusters, also known as a dendrogram[6]. Agglomerative (bottom up) 1. Start with 1 point (singleton). 2. Recursively add two or more appropriate clusters. 3. Stop when k number of clusters is achieved. Divisive (top down) 1. Start with a big cluster. 2. Recursively divides into smaller clusters. 3. Stop when k number of clusters is achieved. Figure 4 Result of Hierarchical Clustering This figure show that the result of Hierarchical clustering Methods with single-linkage [7] between data points using WEKA tool. Volume 1, Issue 2, October 2012 Page 156

3.3 Density based clustering Density-based clustering algorithms try to find clusters based on density of data points in a region. The key idea of density-based clustering is that for each instance of a cluster the neighborhood of a given radius (Eps) has to contain at least a minimum number of instances (MinPts). One of the most well known density-based clustering algorithms is the DBSCAN [7]. DBSCAN separates data points into three classes: Core points: These are points that are at the interior of a cluster. Border points: A border point is a point that is not a core point, but it falls within the neighborhood of a core point. Noise points: A noise point is any point that is not a core point or a border point. To find a cluster, DBSCAN starts with an arbitrary instance (p) in data set (D) and retrieves all instances of D with respect to Eps and Min Pts. The algorithm makes use of a spatial data structure(r*tree) to locate points within Eps distance from the core points of the clusters [8]. Another density based algorithm OPTICS is introduced in [9], which is an interactive clustering algorithm, works by creating an ordering of the data set representing its density-based clustering structure. Figure 5 Result of Density based Clustering. Above figure showing the result of density-based clustering methods using WEKA tool. 4. COMPARISON Above section involves the study of each of the three techniques introduced previously using Weka Clustering Tool on a set of banking data consists of 11 attributes and 600 entries [10]. Clustering of the data set is done with each of the clustering algorithm using Weka tool and the results are: Table 2 Comparison result of algorithms using weka tool Name No. of clusters Cluster Instances K-Means 2 0: 247(41%) 1: 353(59%) No. of Iterations Within clusters sum of squared errors 3 1737.279199193 316 Time taken to build model 0.23 Log likelihood Uncluste red Instances 0 Hierarchic al clustering 2 0: 599(100%) 1: 1 (0%) 4.67 0 densitybased clustering 2 0:230 (38%) 1:370 (62%) 0.05-22.19241 0 Volume 1, Issue 2, October 2012 Page 157

5. CONCLUSION After analyzing the results of testing the algorithms we can obtain the following conclusions: The performance of K-Means algorithm is better than Hierarchical Clustering algorithm. All the algorithms have some ambiguity in some (noisy) data when clustered. Density based clustering algorithm is not suitable for data with high variance in density. K-Means algorithm is produces quality clusters when using huge dataset. Hierarchical clustering algorithm is more sensitive for noisy data. REFERENCES [1]Han J. and Kamber M.: Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, 2000. [2]Sapna Jain, M Afshar Aalam and M N Doja, K-means clustering using weka interface, Proceedings of the 4th National Conference; INDIACom-2010. [3]MacQueen J. B., "Some Methods for classification and Analysis of Multivariate Observations", Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability.University of California Press. 1967, pp. 281 297. [4]Lloyd, S. P. "Least square quantization in PCM". IEEE Transactions on Information Theory 28, 1982,pp. 129 137. [5]Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers,second Edition, (2006). [6]Manish Verma, Mauly Srivastava, Neha Chack, Atul Kumar Diswar and Nidhi Gupta, A Comparative Study of Various Clustering Algorithms in Data Mining, International Journal of Engineering Research and Applications (IJERA) Vol. 2, Issue 3, May-Jun 2012, pp.1379-1384 [7]Timonthy C. Havens. Clustering in relational data and ontologies July 2010. [8]Xu R. Survey of clustering algorithms.ieee Trans. Neural Networks 2005. [9]B OHM, C., KAILING, K., KRIEGEL, H.-P., AND KR OGER, P. 2004. Density connected clustering with local subspace preferences.in Proceedings of the 4th International Conference on Data Mining (ICDM). [10] Ian Written and Eibe Frank, Data Mining, Practical Machine Learning Tools and Techniques,3rd Ed., Morgan Kaufmann, 2011. AUTHOR Mr.Bharat Chaudhari doing ME in Department of Computer Science and Engg from Kalol Institute of Technology & Research Centre, Kalol-382721,Gujarat Technological University.published his paper titled A Comparative Study of clustering algorithms using weka tools..in this paper we are trying to compare major clustering algorithms using weka tools. Mr. Manan Parikh doing ME in Department of Computer Science and Engg from Kalol Institute of Technology & Research Centre, Kalol-382721, Gujarat Technological University.published his paper titled A Comparative Study of clustering algorithms using weka tools. Volume 1, Issue 2, October 2012 Page 158