Study of Euclidean and Manhattan Distance Metrics using Simple K-Means Clustering

Similar documents
Data Mining Project Report. Document Clustering. Meryem Uzun-Per

K Thangadurai P.G. and Research Department of Computer Science, Government Arts College (Autonomous), Karur, India.

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/2016 16/03/2016

Social Media Mining. Data Mining Essentials

Comparison of K-means and Backpropagation Data Mining Algorithms

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

A Two-Step Method for Clustering Mixed Categroical and Numeric Data

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE

Grid Density Clustering Algorithm

An Introduction to Data Mining

Comparison and Analysis of Various Clustering Methods in Data mining On Education data set Using the weak tool

International Journal of Advance Research in Computer Science and Management Studies

USING THE AGGLOMERATIVE METHOD OF HIERARCHICAL CLUSTERING AS A DATA MINING TOOL IN CAPITAL MARKET 1. Vera Marinova Boncheva

The Scientific Data Mining Process

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Map-Reduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programming Model

Machine Learning using MapReduce

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

BIRCH: An Efficient Data Clustering Method For Very Large Databases

Chapter 7. Cluster Analysis

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning

Personalized Hierarchical Clustering

Enhanced Boosted Trees Technique for Customer Churn Prediction Model

STOCK MARKET TRENDS USING CLUSTER ANALYSIS AND ARIMA MODEL

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

Random forest algorithm in big data environment

How To Classify Data Stream Mining

Bisecting K-Means for Clustering Web Log data

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

An Overview of Knowledge Discovery Database and Data mining Techniques

Information Retrieval and Web Search Engines

Neural Networks Lesson 5 - Cluster Analysis

A Novel Approach for Network Traffic Summarization

DATA MINING USING INTEGRATION OF CLUSTERING AND DECISION TREE

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

Distances, Clustering, and Classification. Heatmaps

Quality Assessment in Spatial Clustering of Data Mining

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

Rule based Classification of BSE Stock Data with Data Mining

A Comparative Study of clustering algorithms Using weka tools

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Proposed Application of Data Mining Techniques for Clustering Software Projects

A QoS-Aware Web Service Selection Based on Clustering

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

PREDICTING STOCK PRICES USING DATA MINING TECHNIQUES

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

A Relevant Document Information Clustering Algorithm for Web Search Engine

Manjeet Kaur Bhullar, Kiranbir Kaur Department of CSE, GNDU, Amritsar, Punjab, India

Data Mining Analytics for Business Intelligence and Decision Support

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

IMPROVISATION OF STUDYING COMPUTER BY CLUSTER STRATEGIES

Meta-learning. Synonyms. Definition. Characteristics

Natural Language Querying for Content Based Image Retrieval System

Data, Measurements, Features

A New Approach for Evaluation of Data Mining Techniques

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

A Fast Fraud Detection Approach using Clustering Based Method

Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism

Customer Classification And Prediction Based On Data Mining Technique

Clustering on Large Numeric Data Sets Using Hierarchical Approach Birch

Adaptive Framework for Network Traffic Classification using Dimensionality Reduction and Clustering

How To Cluster

Clustering Technique in Data Mining for Text Documents

The SPSS TwoStep Cluster Component

Introduction to Machine Learning Using Python. Vikram Kamath

Chapter ML:XI (continued)

Using Data Mining for Mobile Communication Clustering and Characterization

EVALUATING THE EFFECT OF DATASET SIZE ON PREDICTIVE MODEL USING SUPERVISED LEARNING TECHNIQUE

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: X DATA MINING TECHNIQUES AND STOCK MARKET

Standardization and Its Effects on K-Means Clustering Algorithm

A Survey on Intrusion Detection System with Data Mining Techniques

Chapter 20: Data Analysis

Data Mining Solutions for the Business Environment

Specific Usage of Visual Data Analysis Techniques

The Data Mining Process

Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.

Clustering Marketing Datasets with Data Mining Techniques

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

Unsupervised Data Mining (Clustering)

Web Document Clustering

Fuzzy Clustering Technique for Numerical and Categorical dataset

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

International Journal of Innovative Research in Computer and Communication Engineering

A Review of Data Mining Techniques

IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

AN EFFICIENT SELECTIVE DATA MINING ALGORITHM FOR BIG DATA ANALYTICS THROUGH HADOOP

Detecting Spam Using Spam Word Associations

Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control

Transcription:

Study of Euclidean and Manhattan Distance Metrics using Simple K-Means Clustering

Deepak Sinwar #1, Rahul Kaushik *2
# Assistant Professor, * M.Tech Scholar, Department of Computer Science & Engineering, BRCM College of Engineering & Technology, Bahal

Abstract - Clustering is the task of assigning a set of objects into groups, called clusters, such that objects in the same cluster are more similar to each other than to those in other clusters. Clustering is generally used to find the similar, dissimilar and outlier items in databases. The main idea behind clustering is the distance between the data items. The work carried out in this paper is based on the study of two popular distance metrics, viz. Euclidean and Manhattan. A series of experiments has been performed to validate the study, using two real and one synthetic dataset with simple K-Means clustering. The theoretical analysis and experimental results show that the Euclidean method outperforms the Manhattan method in terms of the number of iterations performed during centroid calculation.

Keywords - Clustering, Euclidean, Manhattan, Distance, K-Means, Outliers

I. INTRODUCTION
Data mining, one of the essential steps of the "Knowledge Discovery from Databases (KDD)" process and a fairly young, interdisciplinary field of computer science, is the process that attempts to discover interesting yet hidden patterns in large data sets. It utilizes methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, complexity considerations, post-processing of discovered structures, visualization, and online updating.
Commonly used data mining tasks [1] are classified as:

Classification is the task of generalizing a known structure to apply to new data for which no classification is present, for example the classification of records on the basis of a class attribute. Prediction and regression are also considered part of classification methods.

Clustering is the task of discovering groups in a data set on the basis of the similarities of data items within clusters and their dissimilarities to items outside them. Anomaly detection (outlier/change/deviation detection) is also considered part of clustering techniques. It is generally used for the identification of unusual or abnormal data records or errors, which can sometimes be interesting; in both cases, outliers may require further investigation and processing.

Association rule mining (dependency modelling) is the task of finding interesting associations between various attributes of the dataset, generally based on new, interesting yet hidden patterns. For example, a supermarket might gather data on customer purchasing habits; using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.

A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters. Clustering is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. The goal of clustering is thus to determine the intrinsic grouping in a set of unlabeled data. But

how to decide what constitutes a good clustering? It can be shown that there is no absolute best criterion which is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding natural clusters and describing their unknown properties (natural data types), in finding useful and suitable groupings (useful data classes), or in finding unusual data objects (outlier detection). The distance metric is the main criterion in clustering methods like simple K-Means; generally we may prefer to choose either Euclidean or Manhattan distance for this purpose. In the next section of the paper we describe the background work related to clustering and distance metrics, section III elaborates the experimental work, and the conclusion with future work is discussed in section IV.

II. BACKGROUND WORK
Several distance metrics, such as the L1 metric (Manhattan Distance), the L2 metric (Euclidean Distance) and the Vector Cosine Angle Distance (VCAD), have been proposed in the literature for measuring similarity between feature vectors [6]. Measures of distance have always been a part of human history. Euclidean Distance (ED) is one such measure, and its permanence is a testament to its simplicity and longevity. Although other methods to determine distance have emerged, Euclidean Distance remains the conventional method to measure distance between two objects in space, when that space meets the prerequisites [9]. The Euclidean distance [7] of two n-dimensional vectors, x and y, is defined as:

    d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)

and the Manhattan (or city block) distance [7] is defined as:

    d(x, y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|

Euclidean and Manhattan are popularly used distance metrics to determine the similarities between the data items of a cluster. Vadivel et al.
[2] have performed a comparative study of both these metrics in the application of an image retrieval system. From their experimental results they conclude that the Manhattan distance gives the best performance in terms of precision of retrieved images. There may be cases where one measure performs better than the other; this depends entirely upon the criterion adopted, the parameters used to validate the study, and so on.

There are two categories of clustering algorithms (Kaufman and Rousseeuw, 1990): partitioning clustering and hierarchical clustering. Partitioning Clustering (PC) (Duda and Hart, 1973; Kaufman and Rousseeuw, 1990) starts with an initial partition, then iteratively tries all possible moving or swapping of data points from one group to another to optimize the objective measurement function. Each cluster is represented either by the centroid of the cluster (K-MEANS) or by one object centrally located in the cluster (K-MEDOIDS). It guarantees convergence to a local minimum, but the quality of the local minimum is very sensitive to the initial partition, and the worst-case time complexity is exponential. Hierarchical Clustering (HC) (Duda and Hart, 1973; Murtagh, 1983) does not try to find the best clusters; instead it keeps merging the closest pair (agglomerative HC), or splitting the farthest pair (divisive HC), of objects to form clusters. With a reasonable distance measurement, the best time complexity of a practical HC algorithm is O(N^2). On the other hand, an efficient and scalable data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) was proposed [13], based on a new in-memory data structure called the CF-tree, which serves as an in-memory summary of the data distribution. The authors implemented it in a system and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability and scalability; they also compared it with other available methods.
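As a concrete illustration of the two metrics discussed in this section, the following minimal Python sketch (our own illustrative code, not from the paper; function names are assumptions) computes the Euclidean (L2) and Manhattan (L1) distances between two n-dimensional vectors exactly as defined above:

```python
import math

def euclidean_distance(x, y):
    """L2 metric: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan_distance(x, y):
    """L1 (city block) metric: sum of absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# The classic 3-4-5 right triangle: L2 gives 5.0, L1 gives 7.
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
print(manhattan_distance([0, 0], [3, 4]))  # 7
```

Note that for the same pair of points the Manhattan distance is never smaller than the Euclidean distance, which is one reason the two metrics can drive K-Means assignments (and hence iteration counts) differently.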
Finally, BIRCH is applied to solve two real-life problems: one is building an iterative and interactive pixel classification tool, and the other is generating the initial codebook for image compression. Most data clustering algorithms in statistics are distance-based approaches. That is, (1) they assume that there is a distance measurement between any two instances (or data points), and that this measurement can be used for making similarity decisions; and (2) they represent clusters by some kind of center measure.

III. EXPERIMENTAL WORK
A series of experiments has been performed to validate the study. We have used two real and one synthetic data sets, viz. Iris, Diabetes and BIRCH respectively, for our purpose. The details of the datasets are given below:

    Relation Name       Number of Instances   Type of Data
    Iris                150                   Numeric
    Diabetes            768                   Numeric
    BIRCH (Synthetic)   136                   Numeric

As shown in Table-I, these three datasets were tested for studying the two basic distance metrics, viz. Euclidean and Manhattan. All the experiments were performed on an Intel Core(TM) i3-370M with 2 GB DDR3 memory. We have used WEKA 3.7.10 as our development tool for clustering of data items. Table-I depicts the summary of the experiments.

Table-I: Summary of iterations performed by Simple K-Means Clustering using Euclidean and Manhattan distances (rows: numClusters 2 to 7; columns: BIRCH, Iris, Diabetes) [individual cell values not recoverable from the transcription]
Total no. of iterations performed using Euclidean Distance: 17
Total no. of iterations performed using Manhattan Distance: 146

The experiments used the Simple K-Means clustering method provided within the WEKA data mining tool. Other configurations were left unaltered except the numClusters parameter, whose value was varied from 2 to 7 for all experiments. Clustering was performed on these datasets one by one, setting the distanceFunction once to Euclidean Distance and once to Manhattan Distance.

Fig. 1: Number of iterations performed on Iris Dataset
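The iteration counts reported above come from WEKA's SimpleKMeans. As a rough sketch of what is being counted, the following pure-Python K-Means loop (our own illustrative code under assumed names, not WEKA's implementation) tallies centroid-update iterations until the cluster assignments stabilize, with the distance function pluggable so either metric can be compared:

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def kmeans(points, k, distance, seed=0, max_iter=100):
    """Return (assignments, iterations): a cluster index per point and the
    number of assign/update iterations until assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive init: k random data points
    assignments = None
    for iteration in range(1, max_iter + 1):
        # Assign each point to its nearest centroid under the chosen metric.
        new_assignments = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            return assignments, iteration  # converged: no point moved
        assignments = new_assignments
        # Recompute each non-empty cluster's centroid as the member mean.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, max_iter

# Two well-separated 2-D blobs; both metrics should settle in a few iterations.
data = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
        [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]]
labels, iters = kmeans(data, k=2, distance=euclidean)
```

Swapping `distance=euclidean` for `distance=manhattan` and comparing the returned iteration counts mirrors, in miniature, the experiment the paper runs in WEKA (where only the distanceFunction setting is changed between runs).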

Fig. 2: Number of iterations performed on Diabetes Dataset

Fig. 3: Number of iterations performed on BIRCH Dataset

As shown in Figures 1, 2 and 3, the number of iterations performed by Simple K-Means Clustering using the Euclidean distance is lower than with the Manhattan distance. These iterations are counted simply as the centroid-point calculations performed during the overall clustering process.

IV. CONCLUSIONS
This paper focuses on the study of two popular distance metrics, viz. Euclidean and Manhattan, which are generally used during the clustering process. We have used Simple K-Means as the clustering mechanism to validate our study. The theoretical analysis and experimental results show that the Euclidean method outperforms the Manhattan method in terms of the number of iterations performed during centroid calculation. Two real and one synthetic datasets were used during the overall process of the comparative study. This work may be extended by taking more clustering algorithms with high-dimensional real datasets.

REFERENCES
[1] A. Ghoting, S. Parthasarathy & M.E. Otey, "Fast mining of distance-based outliers in high-dimensional datasets", Data Mining & Knowledge Discovery, 2008, 16, pp. 349-364
[2] A. Vadivel, A.K. Majumdar & S. Sural, "Performance comparison of distance metrics in content based image retrieval applications", Intl. Conference on Information Technology
[3] F. Angiulli, "Distance-based outlier queries in data streams: the novel task and algorithms", Data Min Knowl Disc, Springer, 2010, 20, pp. 290-324
[4] F. Angiulli, R. Ben-Eliyahu & L. Palopoli, "Outlier detection using default reasoning", Elsevier's Artificial Intelligence 172 (2008) 1837-1872
[5] H. Yu, J. Yang, J. Han & X. Li, "Making SVMs Scalable to Large Data Sets using Hierarchical Cluster Indexing", Data Mining and Knowledge Discovery, 11, 295-321, 2005
[6] J. Hafner, H. Sawhney, W. Equitz, M. Flickner & W. Niblack, "Efficient Color Histogram Indexing for Quadratic Form Distance Functions", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 7, pp. 729-736, 1995
[7] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, USA, 2001
[8] K.A. Yoon, D.W. Bae, "A pattern-based outlier detection method identifying abnormal attributes in software project data", Elsevier's Information and Software Technology 52 (2010) 137-151
[9] L.D. Chase, "Euclidean Distance", College of Natural Resources, Colorado State University, Fort Collins, Colorado, USA, 84-146-94, NR 505, December 8, 2008
[10] M. Bouguessa, "Clustering categorical data in projected spaces", Data Mining & Knowledge Discovery, Springer, 2013, DOI 10.1007/s10618-013-0336-8
[11] M. Zhou & J. Tao, "An Outlier Mining Algorithm Based on Attribute Entropy", Elsevier's Procedia Environmental Sciences 11 (2011) 13-138
[12] S. Wu & S. Wang, "Information-Theoretic Outlier Detection for Large-Scale Categorical Data", IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 3, March 2013, pp. 589-603
[13] T. Zhang, R. Ramakrishnan & M. Livny, "BIRCH: A New Data Clustering Algorithm and Its Applications", Data Mining and Knowledge Discovery, 1, 1997, pp. 141-182
[14] W. Wong, W. Liu & M. Bennamoun, "Tree-Traversing Ant Algorithm for term clustering based on featureless similarities", Data Mining & Knowledge Discovery, 2007, 15, pp. 349-381
[15] Z. Xue, Y. Shang & A. Feng, "Semi-supervised outlier detection based on fuzzy rough C-means clustering", Elsevier's Mathematics and Computers in Simulation 80 (2010) 1911-1921
[16] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann & I.H. Witten (2009), "The WEKA Data Mining Software: An Update", SIGKDD Explorations, Volume 11, Issue 1

[9] L. D. Chase, " Distance", College of Natural Resources, Colorado State University, Fort Collins, Colorado, USA, 84-146-94, NR 505, December 8, 008 [10] M. Bouguessa, " Clustering categorical data in projected spaces", Data Mining & Knowledge Discovery, Springer, 013, 10618-013-0336-8 [11] M. Zhou & J. Tao, "An Outlier Mining Algorithm Based on Attribute Entropy", Elsevier's Procedia Environmental Sciences 11 (011) 13 138 [1] S. Wu & S. Wang, "Information-Theoretic Outlier Detection for Large-Scale Categorical Data", IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 5, NO. 3, MARCH 013, pp. 589-603 [13] T.Zhang, R. Ramakrishnan & M. Livny, " BIRCH: A New Data Clustering Algorithm and Its Applications", Data Mining and Knowledge Discovery, 1, 1997, pp. 141 18 [14] W. Wong, W. Liu & M. Bennamoun, " Tree-Traversing Ant Algorithm for term clustering based on featureless similarities", Data Mining & Knowledge Discovery, 007, 15, pp. 349 381 [15] Z. Xue, Y. Shang & A. Feng, " Semi-supervised outlier detection based on fuzzy rough C-means clustering", Elsevier's Mathematics and Computers in Simulation 80 (010) 1911 191 [16] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann & I. H. Witten (009); The WEKA Data Mining Software: An Update; SIGKDD Explorations, Volume 11, Issue 1. Page 74