Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

Similar documents
Linköpings Universitet - ITN TNM DBSCAN. A Density-Based Spatial Clustering of Application with Noise

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

A comparison of various clustering methods and algorithms in data mining

Public Transportation BigData Clustering

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

A Comparative Study of clustering algorithms Using weka tools

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Grid Density Clustering Algorithm

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

ANALYSIS OF VARIOUS CLUSTERING ALGORITHMS OF DATA MINING ON HEALTH INFORMATICS

Comparison the various clustering algorithms of weka tools

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Social Media Mining. Data Mining Essentials

Data Mining: Foundation, Techniques and Applications

A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets

8. Machine Learning Applied Artificial Intelligence

An Introduction to Data Mining

Comparison and Analysis of Various Clustering Methods in Data mining On Education data set Using the weak tool

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

Clustering methods for Big data analysis

Rule based Classification of BSE Stock Data with Data Mining

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

Forschungskolleg Data Analytics Methods and Techniques

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Prediction of Heart Disease Using Naïve Bayes Algorithm

How To Cluster

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Web Document Clustering

Clustering Techniques: A Brief Survey of Different Clustering Algorithms

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Categorical Data Visualization and Clustering Using Subjective Factors

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

A Two-Step Method for Clustering Mixed Categroical and Numeric Data

International Journal of Advance Research in Computer Science and Management Studies

Crime Hotspots Analysis in South Korea: A User-Oriented Approach

Cluster Analysis: Basic Concepts and Algorithms

Web Usage Mining: Identification of Trends Followed by the user through Neural Network

Intellectual Performance Analysis of Students by Using Data Mining Techniques

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Data Mining with Weka

Quality Assessment in Spatial Clustering of Data Mining

Using Data Mining for Mobile Communication Clustering and Characterization

Clustering Connectionist and Statistical Language Processing

Utility-Based Fraud Detection

An Overview of Knowledge Discovery Database and Data mining Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

A Review of Data Mining Techniques

Clustering UE 141 Spring 2013

What is Data Mining? Data Mining (Knowledge discovery in database) Data mining: Basic steps. Mining tasks. Classification: YES, NO

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Machine Learning using MapReduce

Supervised and unsupervised learning - 1

Data Mining & Data Stream Mining Open Source Tools

COURSE RECOMMENDER SYSTEM IN E-LEARNING

DATA MINING TECHNIQUES AND APPLICATIONS

Neural Networks Lesson 5 - Cluster Analysis

Bayesian networks - Time-series models - Apache Spark & Scala

Inner Classification of Clusters for Online News

Network Intrusion Detection using Data Mining Technique

MS1b Statistical Data Mining

EFFICIENCY OF DECISION TREES IN PREDICTING STUDENT S ACADEMIC PERFORMANCE

Advances in Natural and Applied Sciences

A Survey on Outlier Detection Techniques for Credit Card Fraud Detection

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

Specific Usage of Visual Data Analysis Techniques

Chapter 12 Discovering New Knowledge Data Mining

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

DATA MINING AND CUSTOMER RELATIONSHIP MANAGEMENT FOR CLIENTS SEGMENTATION

Clustering Technique in Data Mining for Text Documents

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Azure Machine Learning, SQL Data Mining and R

SCAN: A Structural Clustering Algorithm for Networks

Probabilistic Latent Semantic Analysis (plsa)

COC131 Data Mining - Clustering

FOREX TRADING PREDICTION USING LINEAR REGRESSION LINE, ARTIFICIAL NEURAL NETWORK AND DYNAMIC TIME WARPING ALGORITHMS

Visualization of Breast Cancer Data by SOM Component Planes

Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

Introduction to Data Mining

Chapter ML:XI (continued)

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Clustering on Large Numeric Data Sets Using Hierarchical Approach Birch

Role of Neural network in data mining

Big Data with Rough Set Using Map- Reduce

Designing and Embodiment of Software that Creates Middle Ware for Resource Management in Embedded System

Unsupervised Data Mining (Clustering)

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Pentaho Data Mining Last Modified on January 22, 2007

Transcription:

International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 9, Issue 8 (January 2014), PP. 19-24 Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool. Prajwala T R 1, Sangeeta V I 2 Assistant professor, Dept. of CSE, PESIT, Bangalore Abstract:- Machine learning is type of artificial intelligence wherein computers make predictions based on data. Clustering is organizing data into clusters or groups such that they have high intra-cluster similarity and low inter cluster similarity. The two clustering algorithms considered are EM and Density based algorithm. EM algorithm is general method of finding the maximum likelihood estimate of data distribution when data is partially missing or hidden. In Density based clustering, clusters are dense regions in the data space, separated by regions of lower object density. The comparison between the above two algorithms is carried out using open source tool called WEKA, with the Weather dataset as it s input. Keywords:- Machine learning, Unsupervised learning, supervised learning, EM clustering, Density based clustering, WEKA, Likelihood I. INTRODUCTION Machine learning is type of artificial intelligence wherein computers make predictions based on data. Machine learning broadly classified into supervised classification and unsupervised classification. In supervised systems, the data as presented to a machine learning algorithm is fully labelled. In supervised learning the variables can be split into two groups: explanatory variables and one (or more) dependent variables[1]. The target of the analysis is to specify a relationship between the explanatory variables and the dependent variable. In unsupervised learning situations all variables are treated in the same way, there is no distinction between explanatory and dependent variables. Unsupervised systems are not provided any training examples. Supervised learning includes classification and regression techniques. Classification technique involves identifying category of new dataset. Regression is a statistical method of identifying relationship between variables of dataset[11]. One of unsupervised learning technique is clustering. clustering is organizing data into clusters or groups such that they have high intra-cluster similarity and low inter cluster similarity. There are different types of clustering techniques namely K-means clustering, Hierarchical clustering, Exception-maximization clustering and density based clustering[10]. WEKA is one of the open source tool, is a collection of machine learning algorithms for solving realworld. It is written in Java and runs on almost any platform[5]. 1. Clustering Technique Clustering is the unsupervised classification of patterns - observations, data items, or feature vectors into groups (clusters) which have same features. The two properties of a cluster are i. High intra cluster similarity. ii. Low inter cluster similarity. Consider the following example, Figure 1:set of elements in dataset 19

Figure 2:clustered elements in dataset Figure 1 shows the set of data elements. Based on the positions in the figure1,the data elements are grouped into clusters C1,C2 and C3 as shown in figure 2[12]. 2. EM(Expectation maximization) algorithm It is general method of finding the maximum likelihood estimate of data distribution when data is partially missing or hidden[3]. The two steps are: 1. E (Exception) step- This step is responsible to estimate the probability of each element belong to each cluster - P(C _j x_ k ). Each element is composed by an attribute vector (x k ). The relevance degree of the points of each cluster is given by the likelihood of each element attribute in comparison with the attributes of the other elements of cluster C j. Where, x is input dataset. M is the total number of clusters t is an instance and initial instance is zero. 2. M (maximization) step-this step is responsible to estimate the parameters of the probability distribution of each class for the next step. First is computed the mean (μ j ) of class j obtained through the mean of all points in function of the relevance degree of each point. The covariance matrix at each iteration is calculated using Bayes theorem. The probability of occurrence of each class is computed through the mean of probabilities (C_ j ) in function of the relevance degree of each point from the class. Figure 3: Flowchart for EM algorithm 20

Where, x is input dataset. M is the total number of clusters t is an instance and initial instance is zero[8]. 4. Density based clustering The basic idea of density based clustering is clusters are dense regions in the data space, separated by regions of lower object density. Intuition for the formalization of the basic idea is [2], i. For any point in a cluster, the local point density around that point has to exceed some threshold ii. The set of points from one cluster is spatially connected Two global parameters are[6]: i. Є(Eps):Maximum radius of the neighbourhood ii. MinPts: Minimum number of points in an Є -neighbourhood of that point Core object is object with at least MinPts objects within a radius Є -neighborhood. Border object is object that on the border of a cluster. Figure 4: Illustration of global parameters of Density based clustering algorithm 4.1 Density-reachable and Density connectivity Є -Neighborhood Objects within a radius of Є from an objectdensity reachable- An object q is directly density-reachable from object p if p is a core object and q is in p s Є neighborhood[6]. Figure 5:Illustration of density reachablity Density-Connected-A pair of points p and q are density-connected if they are commonly density-reachable from a point o[12]. Figure 6: Illustration of density connection of points 21

Figure 7: flowchart for Density based algorithm 5. Comparison of EM and density based algorithm using WEKA tool WEKA(Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software. The WEKA workbench contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality[5]. The EM algorithm is run using Weather dataset. The figure 6 shows the output for EM algorithm. There are five attributes namely outlook, Humidity, temperature, windy, play. There are 14 instances. Figure 8: EM clusterer output. 22

The Density based algorithm is run using Weather dataset. The figure 7 shows the output for EM algorithm. There are five attributes namely outlook, Humidity, temperature, windy, play. There are 14 instances. Figure 9: Density based clusterer output Comparison between EM and Density based algorithm is shown in Table 1. Log-Likeli-hood Time taken to Clustered build the model instances EM -4.2017 0.06 seconds 1 algorithm Density -4.0778 0.02 seconds 2 based algorithm Table 1: comparison between EM and Density based algorithm Likelihood is often used as a synonym for probability. It is more convenient to work with the natural logarithm of the likelihood function, called the log-likelihood. Log likelihood here refers to probability of identifying correct group of data elements. In terms of likelihood EM algorithm is better than density based algorithm, referred to Table 1. From Table 1 we can infer that Density based algorithm takes less time than EM algorithm to build the model. Conclusion Clustering is organizing data into clusters or groups such that they have high intra-cluster similarity and low inter cluster similarity. EM algorithm is general method of finding the maximum likelihood estimate of data distribution when data is partially missing or hidden. Density based clustering, clusters are dense regions in the data space, separated by regions of lower object density. WEKA an open source tool is used for comparing the above two algorithm. In terms of likelihood EM algorithm is better than density based algorithm, referred to Table 1. From Table 1 we can infer that Density based algorithm takes less time than EM algorithm to build the model. 23

REFERENCES [1]. Statistical pattern recognition: a review, Pattern Analysis and Machine Intelligence, Kyu-Young Whang, IEEE Transactions, August 2002, P: 4-37 [2]. A top-down approach for density-based clustering using multidimensional indexes Jae-Joon Hwan, Kyu-Young Whang, Yang-Sae Moon, Byung-Suk Lee, The Journal of Systems and Software 73 (2004) 169 180 [3]. The study of EM algorithm based on forward sampling. Peng Shangu, Wang Xiwu ; Zhong Qigen Electronics, Communications and Control (ICECC), 2011, Pages 4597 4600 [4]. A fast density based clustering algorithm for spatial database system. Computer and Communication Technology (ICCCT), 2011 2nd International Conference, Pages: 1652 1656 [5]. Comparison of clustering algorithms using WEKA tool, Narendra Sharma, Aman Bajpai, Mr. Ratnesh Litoriya, International Journal of Emerging Technology and Advanced Engineering, (ISSN 2250-2459, Volume 2, Issue 5, May 2012. [6]. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining,2012 [7]. A Density-Based Clustering Structure Mining Algorithm for Data Streams,Huan Wang, Yanwei Yu, Qin Wang, Yadong Wan, proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, August 2012, Pages 69-76 [8]. A Fast Convergence Clustering Algorithm Merging MCMC and EM Methods,David Sergio Matusevich, Carlos Ordonez, Veerabhadran Baladandayuthapani, proceedings of the 22nd ACM international conference on Conference on information & knowledge management, October 2013, Pages 1525-1528 [9]. Data Clustering: A Review, A.K.JAIN Michigan State University [10]. A Few Useful Things to Know about Machine Learning, Department of Computer Science and Engineering University of Washington [11]. http://home.deib.polimi.it/matteucc/clustering/tutorial_html/ [12]. http://www.cs.put.poznan.pl/jstefanowski/sed/dm-7clusteringnew.pdf 24