The Result Analysis of the Cluster Methods by the Classification of Municipalities

Similar documents
LVQ Plug-In Algorithm for SQL Server

Comparison of K-means and Backpropagation Data Mining Algorithms

Data Mining Solutions for the Business Environment

Standardization and Its Effects on K-Means Clustering Algorithm

Chapter 12 Discovering New Knowledge Data Mining

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

PRACTICAL DATA MINING IN A LARGE UTILITY COMPANY

Using Data Mining for Mobile Communication Clustering and Characterization

Data Mining: A Preprocessing Engine

Segmentation of stock trading customers according to potential value

USING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Comparison of Supervised and Unsupervised Learning Classifiers for Travel Recommendations

Data Mining and Neural Networks in Stata

How To Use Neural Networks In Data Mining

A Study of Web Log Analysis Using Clustering Techniques

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

Visualization of Breast Cancer Data by SOM Component Planes

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

USING SELF-ORGANISING MAPS FOR ANOMALOUS BEHAVIOUR DETECTION IN A COMPUTER FORENSIC INVESTIGATION

Clustering Connectionist and Statistical Language Processing

Performance Analysis of Data Mining Techniques for Improving the Accuracy of Wind Power Forecast Combination

Review on Financial Forecasting using Neural Network and Data Mining Technique

Enhanced Boosted Trees Technique for Customer Churn Prediction Model

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

Quality Assessment in Spatial Clustering of Data Mining

Self Organizing Maps: Fundamentals

College information system research based on data mining

On the use of Three-dimensional Self-Organizing Maps for Visualizing Clusters in Geo-referenced Data

Usage of Data Mining Techniques on Marketing Research Data

Social Media Mining. Data Mining Essentials

Pattern Recognition Using Feature Based Die-Map Clusteringin the Semiconductor Manufacturing Process

Chapter ML:XI (continued)

An Overview of Knowledge Discovery Database and Data mining Techniques

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

Visualization of textual data: unfolding the Kohonen maps.

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

Customer Classification And Prediction Based On Data Mining Technique

USING SELF-ORGANIZED MAPS AND ANALYTIC HIERARCHY PROCESS FOR EVALUATING CUSTOMER PREFERENCES IN NETBOOK DESIGNS

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Prediction of Heart Disease Using Naïve Bayes Algorithm

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning

CONTEMPORARY DECISION SUPPORT AND KNOWLEDGE MANAGEMENT TECHNOLOGIES

PLAANN as a Classification Tool for Customer Intelligence in Banking

Analysis of Performance Metrics from a Database Management System Using Kohonen s Self Organizing Maps

Multiscale Object-Based Classification of Satellite Images Merging Multispectral Information with Panchromatic Textural Features

Subject Description Form

ultra fast SOM using CUDA

Potential Value of Data Mining for Customer Relationship Marketing in the Banking Industry

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

UNIVERSITY OF BOLTON SCHOOL OF ENGINEERING MS SYSTEMS ENGINEERING AND ENGINEERING MANAGEMENT SEMESTER 1 EXAMINATION 2015/2016 INTELLIGENT SYSTEMS

USING THE AGGLOMERATIVE METHOD OF HIERARCHICAL CLUSTERING AS A DATA MINING TOOL IN CAPITAL MARKET 1. Vera Marinova Boncheva

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

Improving students learning process by analyzing patterns produced with data mining methods

Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

D A T A M I N I N G C L A S S I F I C A T I O N

Credit Card Fraud Detection Using Self Organised Map

Hierarchical Cluster Analysis Some Basics and Algorithms

Comparison and Analysis of Various Clustering Methods in Data mining On Education data set Using the weak tool

DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM

Customer Data Mining and Visualization by Generative Topographic Mapping Methods

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

Classification of Engineering Consultancy Firms Using Self-Organizing Maps: A Scientific Approach

TIM 50 - Business Information Systems

Clustering UE 141 Spring 2013

Data Mining 資 料 探 勘. 分 群 分 析 (Cluster Analysis)

Monitoring of Complex Industrial Processes based on Self-Organizing Maps and Watershed Transformations

A Neural Network based Approach for Predicting Customer Churn in Cellular Network Services

Cluster Analysis for Evaluating Trading Strategies 1

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

Self Organizing Maps for Visualization of Categories

Assessing Data Mining: The State of the Practice

S.Thiripura Sundari*, Dr.A.Padmapriya**

degrees of freedom and are able to adapt to the task they are supposed to do [Gupta].

Sensitivity Analysis for Data Mining

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

A Comparative Study of clustering algorithms Using weka tools

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization

The Research of Data Mining Based on Neural Networks

Advanced Web Usage Mining Algorithm using Neural Network and Principal Component Analysis

PREDICTING STOCK PRICES USING DATA MINING TECHNIQUES

How To Cluster

An Introduction to Neural Networks

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar

INTERACTIVE DATA EXPLORATION USING MDS MAPPING

Visualization of large data sets using MDS combined with LVQ.

Keywords Data Mining, Knowledge Discovery, Direct Marketing, Classification Techniques, Customer Relationship Management

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

Optional Insurance Compensation Rate Selection and Evaluation in Financial Institutions

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

Data Mining Applications in Higher Education

FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

A Regression Approach for Forecasting Vendor Revenue in Telecommunication Industries

Web Usage Mining: Identification of Trends Followed by the user through Neural Network

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

Transcription:

The Result Analysis of the Cluster Methods by the Classification of Municipalities PAVEL PETR, KAŠPAROVÁ MILOSLAVA System Engineering and Informatics Institute Faculty of Economics and Administration University of Pardubice Studentská 95, 532 10 Pardubice Czech Republic miloslava.kasparova@upce.cz http://www.upce.cz Abstract: The goal of this paper is the result analysis of chosen cluster methods by the classification of municipalities. A nonhierarchical cluster method with a variable number of clusters is used for the determination of a convenient number of clusters. A Kohonen map is chosen as an alternative approach to standard cluster methods. Nonhierarchical cluster methods with the fixed number of clusters are used for the classification of municipalities, too. The results of methods are analyzed. The classification of municipalities is the starting basis for the creation of a prediction model (PM). The PM is determined for the employees of the Regional Authority to support their decision making by financing of the subsidized bus connection lines. Key words: cluster analysis, cluster method, neural nets, clustering, municipalities 1 Introduction It was possible to do the municipality classification for a PM creation. The PM is meant for the employees of the Regional Authority to support their decision making by the financing of the subsidized bus connection lines. The municipalities classification in to clusters will result in a creation of municipality input chains which are in bus line connections. Every municipality that is in the bus line connection will be represented by the number of clusters. The classification is performed by chosen methods of cluster analysis [2, 3, 4]. Cluster analysis groups data objects into clusters in such a way that the objects belonging to the same cluster are similar, while those belonging to different ones are dissimilar. An existence of n objects is an initial condition for the usage of the cluster analysis. The n municipalities where the object j is the municipality j, are the objects of the clustering. Every object is described by p characteristics. A vector of measurement O j that contains p characteristics in formula (1): O j = {z 1j, z 2j,, z pj } (1) for j-th object O j, (j = 1, 2,, n). The input set of the objects which are determined for the clustering, it is possible to write in formula of objects matrix O. The general formula of objects matrix O is following (2): z11 z12... z1 p z21 z22... z2 p O =, (2) z j1 z j2... z jp zn1 zn2... zn p where : n object characteristics number; p number of characteristics. The task of clustering is then to divide the set of objects into the disjunctive clusters. The decision making about the object clustering in cluster is realized on the basis of the similarity by application of metric [3, 4, 5]. A sum of the square errors to centre of clusters E [4] is chosen as a criterion of the quality of clustering. It is defined in this way: Let Ω = {M 1, M 2,,, M k } is the clustering of objects set in k clusters M 1 = {O 11, O 12,, O 1n 1 }, M 2 = {O 21, O 22,, O 2n 2 },, M k = {O k1, O k2,, O kn k },

where O hj is object j of h-th cluster M h. Then E is determined in formula (3): k nh = d 2 ( Ohj,Th ) h= 1 j= 1 where: ( ) 2 O, T E, (3) d - the square of the Euclidian hj h metric of object O hj to the centre T h of cluster M h. T h - the centre of cluster M h ; it is determined by the vector of mean values of characteristics i of objects in cluster M h in formula T h = (t h1, t h2,, t hp ), for its characteristics i, where i = 1, 2,, p, is (4): n 1 h t hi = zhji, (4) nh j= 1 where: n h - number of objects in cluster M h ; z hji - characteristic i of object j in cluster M h. A set of typical point is defined for the realization of cluster methods. The typical point characterizes the typical object municipality. It comes out from p-dimensional Euclidian space E p. It is more approaches for the determination of initial typical points, for example [4]: the selection of first s points from randomly organized set of points; marking of points by serial numbers 1, 2,,n and a generation randomly s serial numbers from set {1, 2,, n}, points which belong to them then are initial typical points; the selection s points which are chosen by an analyst. 2 Problem Definition The goal is to classify municipalities on the basis of the same or similar values of their characteristics. The classification of municipalities is the starting basis for PM [1]. It is realized by values of the given characteristics using the cluster analysis methods. Characteristics of municipalities are following: KO, PCO, PK, KV Information land registers POC, PMO, PZO, POV15-64, PVOC, PVZ, PVM Population SKO, POS, POL, ZDZ Occurrence of facilities PLY, VOD, KAN Technical facilities of the municipality M-O Other The metric of input data was created firstly. It contents 452 municipalities of the chosen region. The municipalities were identified by the municipality code (KO) and characterized by the number of municipality parts (PCO), the number of land registers (PK), the local area in hectares (KV), the number of population in municipalities (POC), the number of men (PMO) and women (PZO) that live in the municipality, the number of population at the age 15 to 64 (POV15-64), the average age of population in municipalities (PVOC), the average age of women (PVZ) and men (PVM), the occurrence of: schools (SKO), post offices (POS), the police (POL), medical facilities (ZDZ); technical facilities of municipalities the gas (PLY), the duct (VOD), the sewerage (KAN) and the determination of municipalities or a village (M-O) according to [6]. The reason of municipalities classification is: a generalization of municipalities characteristics; a creation of the simplified input chain configuration. 3 Cluster Analysis Methods The municipalities classification was realized in two steps. Firstly it was necessary to determine a convenient number of clusters. There were used a nonhierarchical cluster method [7] with a variable number of clusters and a Kohonen map [8, 9, 10]. Secondly nonhierarchical cluster methods with a fixed number of clusters were used: the Forgy method [4, 11], the K-Means method [2, 3] and the Jancey method [4, 12]. The classification of municipalities in the fixed number was realized by these methods. The parameter E in formula (3) was a criterion of the quality of the clustering. 3.1 Nonhierarchical Cluster Method For the solution of an algorithm are determined the following input parameters: the set of the input objects, which is the matrix of municipalities O in formula (2); the matrix of the typical points T(s,p), where s is the number of inputs typical points (s = 15) and p is the number of the municipality (the approach the selection of the first s points was chosen for the creation of the typical point matrix); the value of a mergence threshold and dividing of clusters H, where H = 0.6; maximal number of iterations I where I = 2n; minimal number of objects in cluster MP where MP {1, 2,, 15}. The iterations of algorithm consist of steps. The algorithm is described in [7].

3.1.1 Results of the Nonhierarchical Cluster Method The 69 clusters were created on the basis of the parameter MP, for MP = 1. The resulting parameter E was E = 58.3. The objects were classified in to 60 clusters by the parameter MP = 2 and with E = 60.12. The lowest k = 16 was extracted by MP = 15 and E = 79.88. By the increase of the parameter MP the number of the created clusters k is decreased. A graphical representation of the dependence of E and k on MP is in Fig. 1. k [ ] 80 70 60 50 40 30 20 10 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 MP k Fig. 1 The graphical representation of the dependence of E and k on MP The increase of the parameter PM in the cluster results in the increase of the criterion E. The low values of criterion E were achieved by the creation of the big number of clusters 38 k 69. The stabilization of cluster number was done by the value of the parameter 11 MP 15. The values E = 82.14 and k = 20 were for MP = 11, the values E = 79.88 and k = 16 was for MP = 15, more in the Table 1. Table 1: Number of the clusters k and the parameter E by the dependence on the parameter MP MP 11 12 13 14 15 E 82.14 78.88 86.84 86.84 79.88 k 20 20 18 18 16 E 100 90 80 70 60 50 40 30 20 10 0 E [ ] multidimensional proximity relationships between the clusters. The Kohonen map consists of two layers of neurons or units: an input layer and an output layer. The input layer is fully connected to the output layer, and each connection has an associated weight. Another way to think of the network structure is to think of each output layer unit having an associated center, represented as a vector of inputs to which it most strongly responds (where each element of the center vector is a weight from the output unit to the corresponding input unit) [8]. The parameters of the Kohonen map are represented as weights between input units and output units, or alternately, as a cluster center associated with each output unit, more for example in [3, 8]. 3. 2. 1 Results of the Kohonen Map The clustering differed in dimension and time of neural net training in interval from 1 to 10 minutes. The clustering of objects in a big number of clusters by different dimensions of topological grid of the Kohonen map results from the achieved results. The lowest number min = 18 was achieved by the map dimension 4 x 5 and the time of neural net training 2 minutes. The biggest number max = 56 was obtained by the map dimension 8 x 8. 3. 3 Result Comparison The goal of the methods was to determine the appropriate number of clusters of municipalities. It was achieved the appropriate number opt by the big value of the criterion E by the nonhierarchical cluster method. rises by the dimensions increase of the Kohonen map. According to a visualization of resulting clusters of objects by different dimensions it is possible to find out hypothetical clusters of objects. An example of representation of clusters is in Fig. 2. 3.2. Kohonen Map The Kohonen map (self-organizing map) was used as an alternative cluster method. The Kohonen map [8, 9, 10] is a special kind of neural network that performs unsupervised learning. It takes the input vectors in formula (1) for j = 1, 2, n and performs a type of spatially organized clustering, or feature mapping, to group similar records together and collapse the input space to a two-dimensional space that approximates the Fig. 2 The graphical representation of the supposed number of clusters from the Kohonen map

3. 4 Nonhierarchical Cluster Methods with Fixed Number of Clusters The nonhierarchical cluster methods by the fixed number of clusters are used in the next step for municipalities classification. There are: the K-Means method, the Forgy method and the Jancey method. The input matrix of objects O in formula (2) is loaded at the start of the clustering by these methods. The clustering is realized for k = 15, 16, 17, 18, 19, 20. 3. 4. 1 K-Means Method The k-means method [3, 4, 8] is a clustering method, used to group records based on similarity of values for a set of input fields. The basic idea is to try to discover k clusters in such a way that the records within each cluster are similar to each other and distinct from records in other clusters. K-means is an iterative algorithm; an initial set of clusters is defined, and the clusters are repeatedly updated until no more improvement is possible (or the number of iterations exceeds a specified limit) [8]. A cluster algorithm is described for example in [8].The criterion of the quality of clustering is the parameter E in formula (3). Results of the K-Means Method The values of parameter E for every k are in the Table 2. Table 2: Results by the K-Means method 134.30 133.41 131.64 122.18 121.14 112.80 3. 4. 2 Forgy Method The method [11] is based on the alternation of two steps. The first step is a calculation of the typical points of clusters of objects. The second step is the creation of clusters of objects by the clustering of every object to the typical point to which the object is nearest. The iterations are realized till a stable dividing in the final clusters is created, more in [4, 11]. The criterion of the quality of clustering is the parameter E in formula (3). Table 3: Results by the Forgy method 128.84 130.68 117.61 116.00 116.63 113.03 3. 4. 3 Jancey Method The Jancey method [4, 12] is almost the same as the Forgy method. It differs in the way of the calculation of new typical points of the created clusters [4]. The criterion of the quality of clustering is the parameter E in formula (3). Results of the Jancey Method The values of parameter E for every k are in the Table 4. Table 4: Results by the Jancey method 122.57 3. 5 Result Comparison 126.22 118.51 112.30 112.22 111.23 It was achieved the lowest parameter E = 112.80 by number = 20 by the K-Means method. The considerable decrease of parameter E = 122.18 was by k = 18. It was achieved the lowest parameter E = 113.03 by k = 20 by the Forgy method. The considerable decrease of parameter E = 117.61 was by k = 17. By the Jancey method almost the same results of parameter E by k = 18 and k = 19 were achieved, very similar to k = 20, too. The graphical representation of parameter E for every k and the methods: Jancey, Forgy and K-Means is in Fig. 3. Results of the Forgy Method The values of parameter E for every k are in the Table3.

E [ ] 140 135 References: 130 125 120 115 110 105 100 k [ ] Forgy Method Jancey Method K-Means Method Fig. 3 The graphical representation of the dependence of E on every k 4. Conclusions By results in Fig. 3 the clustering of objects in 18 clusters is possible. This number of clusters is determined on the basis of the values that are achieved by the Jancey method. The results of clustering by this method with the fixed number of clusters for k = 18 are in Table 5. Table 5: Results of the clustering by methods: Forgy, Jancey and K-Means The method Forgy Jancey K-Means 18 18 18 square deviations E 116,00 112,30 122,18 On the basis of the clustering of n municipalities in k clusters it is possible to create a model of classification of municipalities by means of decision trees [5, 13]. According to the rules from the decision trees it will be possible to classify new municipalities automatically on the basis of characteristics values in the cluster for next processing in PM. [1] Kašparová, M. Prediction Models Analysis of Financing of Basic Transport Services. WSEAS Transactions on Systems, January 2006, vol. 5, no. 1, s. 211-218. ISSN: 1109-2777. [2] Freitas, A. A. Data Mining and Knowledge Discovery with Evolutionary Algorithms, Berlin: Springer, 2002. [3] Guidici, P. Applied Data Mining: Statistical Methods for Business and Industry, West Sussex: Wiley, 2003. [4] Lukasová, A. - Šarmanová, J. Metody shlukové analýzy, SNTL Nakladatelství technické literatury v Praze, 1985. [5] Han, J. - Kamber, M. Data Mining: Concepts and Techniques, Morgan Kaufmann Press, 2001. [6] Zákon č. 129/2000 Sb., o krajích (krajské zřízení). [7] Petr, P. Terciální zpracování radiolokační informace v heterogenní síti čidel. [dissertation], VVTŠ, Liptovský Mikuláš, 1987. [8] SPSS Inc. Clementine 7.0 User s Guide, 2002. [9] Berry, M. J. A. - Linoff, G. S. Data Mining Techniques: For Marketing, Sales, and Customers Relationship Management, Wiley, 2004. [10] Pyle, D., Business Modeling and Data Mining, Morgan Kaufmann Publishers, 2003. [11] Forgy, E. W., Cluster Analysis of Multivariate Data: Efficiency Versus Interpretability of Classifications. In: Biometrics Soc. Meetings, Riverside, 1965. [12] Jancey, R. C. Multidimensional Group Analysis, Austral J. Botany, 14, 1966. [13] Rusell, S. J. - Norvig, P., Artificial Intelligence: A Modern Approach, Prentice Hall, 2002.