Comparing large datasets structures through unsupervised learning

Guénaël Cabanes and Younès Bennani

LIPN-CNRS, UMR 7030, Université de Paris 13
99, Avenue J-B. Clément, Villetaneuse, France

Abstract. In data mining, the problem of measuring similarities between different subsets of data is an important issue which has received little attention so far. In this paper, a novel method based on unsupervised learning is proposed. Different subsets of a dataset are characterized by means of a model which implicitly corresponds to a set of prototypes, each one capturing a different modality of the data. Structural differences between two subsets are then reflected in the corresponding models. Differences between models are detected using a similarity measure based on data density. Experiments on synthetic and real datasets illustrate the effectiveness, efficiency, and insights provided by our approach.

1 Introduction

In recent years, the size of datasets has grown exponentially; studies show that the amount of data doubles every year. However, the ability to analyze these data remains inadequate. Mining these data to measure similarities between different datasets has therefore become an important issue, one which has been little investigated up to now. A major application is the analysis of time-evolving datasets: a model of the data structure is computed over different periods of time, and the models are compared to detect when changes occurred. There are many other possible applications as well, such as comparing large datasets, merging clusterings, and measuring stability.

As the study of data streams and large databases is a difficult problem, because of the computing costs and the large storage volumes involved, two issues appear to play a key role in such an analysis: (i) a good condensed description of the data properties [1,2] and (ii) a measure capable of detecting changes in the data structure [3,4]. In this paper we propose a new algorithm which is able to perform these two tasks. The solution we propose is an algorithm which first constructs an abstract representation of the datasets to compare and then evaluates the dissimilarity between them based on this representation. The abstract representation is based on the learning of a variant of the Self-Organizing Map (SOM) [5], enriched with structural information extracted from the data. We then propose a method to estimate, from this abstract representation, the underlying data density function. The dissimilarity is a measure of the divergence between the two estimated densities.

* This work was supported in part by the CADI project (No. ANR-07 TLOG 003) financed by the ANR (Agence Nationale de la Recherche).

A great advantage of this method is that each enriched SOM is at the same time a very informative and a highly condensed description of the data structure, which can easily be stored for future use. Moreover, as the algorithm is very efficient both in terms of computational complexity and in terms of memory requirements, it can be used to compare large datasets or to detect structural changes in data streams.

The remainder of this paper is organized as follows. Section 2 presents the new algorithm. Section 3 describes the validation protocol and some results. Conclusions and future work perspectives are given in Section 4.

2 A new two-level algorithm to compare data structures

The basic assumption in this work is that data are described as vectors of numeric attributes and that the datasets to compare are of the same type. First, each dataset is modeled with an enriched Self-Organizing Map (SOM) model (adapted from [6]), constructing an abstract representation which is intended to capture the essential data structure. Then, the density function of each dataset is estimated from its abstract representation. Finally, the datasets are compared using a dissimilarity measure based on these density functions. The idea is to use the dimension reduction and fast learning capabilities of the SOM at a first level to construct a new reduced vector space, and then to apply further analyses in this new space. Such approaches are called two-level methods; they are known to greatly reduce the computational time, the effects of noise and the curse of dimensionality [6]. Furthermore, they allow some visual interpretation of the results using the two-dimensional map generated by the SOM.

2.1 Abstract algorithm schema

The algorithm proceeds in three steps:

1. The first step is the learning of the enriched SOM. During the learning, each SOM prototype is extended with novel information extracted from the data. This structural information will be used in the second step to infer the density function. More specifically, the attributes added to each prototype are:
   - Density modes: a measure of the data density surrounding the prototype (local density), i.e. of the amount of data present in the corresponding area of the input space. We use a Gaussian kernel estimator [7] for this task.
   - Local variability: a measure of the variability of the data represented by the prototype, defined as the average distance between the prototype and the data it represents.
   - Neighborhood: a neighborhood measure between prototypes. The neighborhood value of two prototypes is the number of data that are well represented by both.
2. The second step is the construction, from each enriched SOM, of a density function estimating the density at any point of the input space. This function is constructed by induction from the information associated with the prototypes of the SOM, and is represented as a mixture model of spherical normal functions.
3. The last step accomplishes the comparison of two different datasets, using a dissimilarity measure able to compare the two density functions constructed in the previous step.

2.2 Prototypes enrichment

In this step, some global information is extracted from the data and stored in the prototypes during the learning of the SOM. The Kohonen SOM can be classified as a competitive unsupervised learning neural network [5]. A SOM consists of a two-dimensional map of $M$ neurons (units), connected to $n$ inputs according to $n$ weight connections $w_j = (w_{1j}, \ldots, w_{nj})$ (also called prototypes) and to their neighbors through topological links. The training set is used to organize the map under topological constraints of the input space; the SOM thus determines an optimal spatial organization of the input data. In our algorithm, the SOM prototypes are enriched with new numerical values extracted from the dataset. The enrichment algorithm proceeds as follows (a NumPy sketch of the update step is given after the listing):

Input: the data $X = \{x_k\}_{k=1}^{N}$.

Output: the density $D_i$ and the local variability $s_i$ associated with each prototype $w_i$; the neighborhood value $v_{i,j}$ associated with each pair of prototypes $(w_i, w_j)$.

Algorithm:

1. Initialization: initialize the SOM parameters and, for all $i, j$, initialize to zero the local densities ($D_i$), the neighborhood values ($v_{i,j}$), the local variabilities ($s_i$) and the numbers of data represented by each $w_i$ ($N_i$).
2. Choose randomly a datum $x_k \in X$: compute $d(w_i, x_k)$, the Euclidean distance between $x_k$ and each prototype $w_i$, and find the two closest prototypes (BMUs: Best Matching Units) $w_{u^*}$ and $w_{u^{**}}$:
   $$u^* = \arg\min_i d(w_i, x_k) \quad \text{and} \quad u^{**} = \arg\min_{i \neq u^*} d(w_i, x_k).$$
3. Update the structural values:
   - Number of data: $N_{u^*} = N_{u^*} + 1$.
   - Variability: $s_{u^*} = s_{u^*} + d(w_{u^*}, x_k)$.
   - Density: $\forall i,\; D_i = D_i + \frac{1}{\sqrt{2\pi}\,h} e^{-\frac{d(w_i, x_k)^2}{2h^2}}$.
   - Neighborhood: $v_{u^*,u^{**}} = v_{u^*,u^{**}} + 1$.
4. Update the SOM prototypes $w_i$ as defined in [5].
5. Repeat steps 2 to 4, $T$ times.
6. Final structural values: $\forall i,\; s_i = s_i / N_i$ and $D_i = D_i / N$.
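A minimal NumPy sketch of steps 2, 3 and 6 is given below; the SOM weight update of step 4 follows the standard rule of [5] and is not shown, and the array names are ours, not part of any published API:

```python
import numpy as np

def enrichment_update(x, W, N_count, s, D, v, h):
    """Steps 2-3 of the enrichment for a single datum x.

    W: (M, n) prototype matrix; N_count, s, D: length-M float arrays;
    v: (M, M) neighborhood counts; h: kernel bandwidth.
    """
    dists = np.linalg.norm(W - x, axis=1)   # d(w_i, x_k) for every prototype
    u, u2 = np.argsort(dists)[:2]           # the two best matching units
    N_count[u] += 1                         # one more datum represented by w_u*
    s[u] += dists[u]                        # accumulate local variability
    # every prototype receives a Gaussian density contribution
    D += np.exp(-dists ** 2 / (2 * h ** 2)) / (np.sqrt(2 * np.pi) * h)
    v[u, u2] += 1                           # neighborhood value of the BMU pair

def finalize(N, N_count, s, D):
    """Step 6: s_i = s_i / N_i and D_i = D_i / N."""
    np.divide(s, N_count, out=s, where=N_count > 0)
    D /= N
```

Interleaving `enrichment_update` with the usual SOM weight update over $T$ random draws, then calling `finalize`, reproduces the listing above.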

In this study we used the default parameters of the SOM Toolbox [8] for the learning of the SOM, with $T = \max(N, 50M)$ as in [8]. The number $M$ of prototypes must be neither too small (the SOM would not fit the data well) nor too large (time consuming); choosing $M$ close to $\sqrt{N}$ seems to be a good trade-off [8]. The last parameter to choose is the bandwidth $h$. The choice of $h$ is important for good results, but its optimal value is difficult and time consuming to compute (see [9]). A heuristic that seems relevant and gives good results is to define $h$ as the average distance between a prototype and its closest neighbor [6].

At the end of this process, each prototype is associated with a density and a variability value, and each pair of prototypes is associated with a neighborhood value. The substantial information about the structure of the data is captured by these values; it is then no longer necessary to keep the data in memory.

2.3 Estimation of the density function

The objective of this step is to estimate the density function which associates a density value with each point of the input space. We already have an estimate of this function at the positions of the prototypes (namely $D_i$), and we must infer from these values an approximation of the whole function. Our hypothesis is that this function may be properly approximated as a mixture of Gaussian kernels, each kernel $K_i$ being a Gaussian function centered on a prototype:

$$f(x) = \sum_{i=1}^{M} \alpha_i K_i(x) \quad \text{with} \quad K_i(x) = \frac{1}{\sqrt{2\pi}\, h_i} e^{-\frac{d(w_i, x)^2}{2 h_i^2}}$$

The most popular method to fit mixture models (i.e. to find the $h_i$ and $\alpha_i$) is the expectation-maximization (EM) algorithm [10]. However, this algorithm needs to work in the data input space; since we work here on enriched SOMs instead of the datasets themselves, we cannot use it. Thus, we propose the following heuristic for choosing $h_i$:

$$h_i = \frac{\sum_j \frac{v_{i,j}}{N_i + N_j} \left( s_i N_i + d_{i,j} N_j \right)}{\sum_j v_{i,j}}$$

where $d_{i,j}$ is the Euclidean distance between $w_i$ and $w_j$. The idea is that $h_i$ should be the standard deviation of the data represented by $K_i$. These data are also represented by $w_i$ and its neighbors, so $h_i$ depends on the variability $s_i$ computed for $w_i$ and on the distances $d_{i,j}$ between $w_i$ and its neighbors, weighted by the number of data represented by each prototype and by the connectivity value between $w_i$ and its neighborhood.
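A NumPy sketch of this heuristic and of the resulting mixture follows, under the assumption that every prototype has at least one neighbor with a nonzero connectivity value (`v`, `N_count` and `s` come from the enrichment sketch above; `d` is the matrix of pairwise prototype distances):

```python
import numpy as np

def bandwidths(v, N_count, s, d):
    """h_i as the connectivity-weighted standard deviation estimate."""
    vs = v + v.T                           # symmetrize the BMU-pair counts
    h = np.empty(len(N_count), dtype=float)
    for i in range(len(N_count)):
        j = np.flatnonzero(vs[i])          # neighbors of w_i
        num = vs[i, j] / (N_count[i] + N_count[j]) \
            * (s[i] * N_count[i] + d[i, j] * N_count[j])
        h[i] = num.sum() / vs[i, j].sum()
    return h

def density(x, W, alpha, h):
    """f(x) = sum_i alpha_i K_i(x) with spherical Gaussian kernels."""
    dist2 = np.sum((W - x) ** 2, axis=1)
    K = np.exp(-dist2 / (2 * h ** 2)) / (np.sqrt(2 * np.pi) * h)
    return float(alpha @ K)
```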

Now, since the density $D_i$ at each prototype $w_i$ is known ($f(w_i) = D_i$), we can use a gradient descent method to determine the weights $\alpha_i$. The $\alpha_i$ are initialized with the values of $D_i$, then gradually adjusted to better fit $D_i = \sum_{j=1}^{M} \alpha_j K_j(w_i)$. To do so, we optimize the following criterion:

$$\alpha^* = \arg\min_{\alpha} \frac{1}{M} \sum_{i=1}^{M} \left[ \left( \sum_{j=1}^{M} \alpha_j K_j(w_i) \right) - D_i \right]^2$$

We thus obtain a density function that models the dataset represented by the enriched SOM.

2.4 Algorithm complexity

The complexity of the algorithm scales as $O(T \cdot M)$, with $T$ the number of learning steps and $M$ the number of prototypes in the SOM. It is recommended to set at least $T > 10M$ for a good convergence of the SOM [5]; in this study we use $T = \max(N, 50M)$ as in [8]. This means that if $N > 50M$ (large database), the complexity of the algorithm is $O(N \cdot M)$, i.e. linear in $N$ for a fixed size of the SOM. The whole process is therefore very fast and well suited to the treatment of large databases. Very large databases can also be handled by fixing $T < N$, which is similar to working on a random subsample of the database. This is much faster than traditional density estimators such as the kernel estimator [7] (which also needs to keep all the data in memory) or the Gaussian mixture model [11] estimated with the EM algorithm (whose convergence can become extraordinarily slow [12,13]).

2.5 The dissimilarity measure

We can now define a measure of dissimilarity between two datasets $A$ and $B$ represented by two enriched SOMs, $SOM_A = \left[ \{w_i^A\}_{i=1}^{M_A}, f_A \right]$ and $SOM_B = \left[ \{w_j^B\}_{j=1}^{M_B}, f_B \right]$, where $M_A$ and $M_B$ are the numbers of prototypes in models $A$ and $B$, and $f_A$ and $f_B$ are the density functions of $A$ and $B$ computed as in Section 2.3. The dissimilarity between $A$ and $B$ is given by:

$$CBd(A, B) = \frac{1}{M_A} \sum_{i=1}^{M_A} f_A(w_i^A) \log\left( \frac{f_A(w_i^A)}{f_B(w_i^A)} \right) + \frac{1}{M_B} \sum_{j=1}^{M_B} f_B(w_j^B) \log\left( \frac{f_B(w_j^B)}{f_A(w_j^B)} \right)$$

The idea is to compare the density functions $f_A$ and $f_B$ at each prototype $w$ of $A$ and $B$: if the distributions are identical, these two values must be very close. This measure is an adaptation of the weighted Monte Carlo approximation of the symmetrical Kullback-Leibler divergence (see [14]), using the prototypes of a SOM as a sample of the database.
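The weight fitting and the measure can be sketched as follows; the learning rate, number of steps, nonnegativity clip and the `eps` guard against log(0) are assumptions of this sketch, not prescribed by the text:

```python
import numpy as np

def fit_alpha(W, D, h, lr=0.01, n_steps=1000):
    """Gradient descent on the quadratic criterion, alpha_i initialized at D_i."""
    M = len(D)
    # K[i, j] = K_j(w_i): kernel j evaluated at prototype i
    d2 = np.sum((W[:, None, :] - W[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2 * h[None, :] ** 2)) / (np.sqrt(2 * np.pi) * h[None, :])
    alpha = np.asarray(D, dtype=float).copy()
    for _ in range(n_steps):
        resid = K @ alpha - D              # fitted density minus target D_i
        alpha -= lr * (K.T @ resid) / M    # gradient step on the squared criterion
        alpha = np.clip(alpha, 0.0, None)  # keep mixture weights nonnegative
    return alpha

def cbd(W_A, f_A, W_B, f_B, eps=1e-12):
    """CBd(A, B): symmetrized KL estimate sampled at the prototypes."""
    left = np.mean([f_A(w) * np.log((f_A(w) + eps) / (f_B(w) + eps))
                    for w in W_A])
    right = np.mean([f_B(w) * np.log((f_B(w) + eps) / (f_A(w) + eps))
                     for w in W_B])
    return left + right
```

With `alpha_A = fit_alpha(W_A, D_A, h_A)` and `f_A = lambda x: density(x, W_A, alpha_A, h_A)` (and likewise for B), `cbd(W_A, f_A, W_B, f_B)` yields the dissimilarity.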

3 Validation

3.1 Description of the datasets used

In order to demonstrate the performance of the proposed dissimilarity measure, we used nine artificial dataset generators and one real dataset. Ring 1, Ring 2, Ring 3, Spiral 1 and Spiral 2 are five two-dimensional non-convex distributions with different densities and variances: Ring 1 is a ring of radius 1 (high density), Ring 2 a ring of radius 3 (low density) and Ring 3 a ring of radius 5 (average density); Spiral 1 and Spiral 2 are two parallel spirals whose density decreases with the radius. Datasets from these distributions can be generated randomly. Noise 1 to Noise 4 are four different two-dimensional distributions, each composed of several Gaussian distributions plus heavy homogeneous noise. Finally, the Shuttle real dataset comes from the UCI repository; it is a nine-dimensional dataset with 58,000 instances, divided into seven classes, with approximately 80% of the data belonging to class 1.

3.2 Validity of the dissimilarity measure

It is impossible to prove that two distributions are exactly the same. A low dissimilarity value is only consistent with similar distributions, although it does of course give an indication of the similarity between the two sample distributions; on the other hand, a very high dissimilarity does show, at the given level of significance, that the distributions are different. Hence, if our dissimilarity measure is effective, it should be possible to compare different datasets (with the same attributes) and detect the presence of similar distributions, i.e. the dissimilarity between datasets generated from the same distribution law must be much smaller than the dissimilarity between datasets generated from very different distributions. To test this hypothesis we applied the following protocol:

1. We generated 250 different datasets from the Ring 1, Ring 2, Ring 3, Spiral 1 and Spiral 2 distributions (50 datasets from each). Each set contains between 500 and data.
2. For each of these datasets, we learned an enriched SOM to obtain a set of representative prototypes of the data. The number of prototypes is randomly chosen from 50 to.
3. We computed a density function for each SOM and compared the SOMs to one another with the proposed dissimilarity measure.
4. Each SOM was labeled according to the distribution it represents (the labels are "ring 1", "ring 2", "ring 3", "spiral 1" and "spiral 2"). We then calculated an index of compactness and separability of the SOMs sharing the same label, using the generalized Dunn index [15]. This index is larger when SOMs with the same label are similar to one another and dissimilar to SOMs with other labels, according to the dissimilarity function used (a sketch of such an index is given after this list).
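One common form of the generalized Dunn index over a precomputed dissimilarity matrix is sketched below; the exact generalization of [15] used in the experiments is not spelled out in the text, so this form is an assumption:

```python
import numpy as np

def dunn_index(dissim, labels):
    """Smallest between-label dissimilarity divided by largest
    within-label diameter; larger means more compact and separated."""
    labels = np.asarray(labels)
    groups = [np.flatnonzero(labels == c) for c in np.unique(labels)]
    diameter = max(dissim[np.ix_(g, g)].max() for g in groups)
    separation = min(dissim[np.ix_(g1, g2)].min()
                     for i, g1 in enumerate(groups)
                     for g2 in groups[i + 1:])
    return separation / diameter
```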

We used the same protocol with Noise 1 to Noise 4 and with the Shuttle database. Two kinds of distributions were extracted from the Shuttle database by random sub-sampling (with replacement): data from class 1 ("Shuttle 1") and data from the other classes ("Shuttle 2").

We compared the results with some distance-based measures usually used to compare two sets of data (here, the two sets of prototypes of the SOMs): the average distance (Ad: the average distance over all pairs of prototypes in the two SOMs), the minimum distance (Md: the smallest Euclidean distance between prototypes of the two SOMs) and the Ward distance (Wd: the distance between the two centroids, weighted according to the numbers of prototypes in the two SOMs) [16]. A sketch of these three baselines is given after Table 1. All results are reported in Table 1. Visual inspection of the dissimilarity matrices obtained from the different measures is consistent with the values of the Dunn index (Table 1) in showing that the proposed density-based similarity measure (CBd) is much more effective than the distance-based measures (see Fig. 1 for an example).

Fig. 1. Visualizations of the dissimilarity matrices obtained from the different measures ((a) Ad, (b) Md, (c) Wd, (d) CBd) for the comparisons of the Noise 1 to Noise 4 distributions. Darker cells mean higher similarity between the two distributions. Datasets from the same distribution are sorted, so their comparisons appear as four boxes along the diagonal of the matrix.

As shown in Table 1, our density-based dissimilarity measure is much more effective than measures that only use the distances between prototypes. Indeed, the Dunn index for the density-based measure is much higher than for the distance-based ones on the three kinds of datasets tested: non-convex data, noisy data and real data. This means that our dissimilarity measure is much more effective for the detection of similar datasets.

Table 1. Dunn index obtained from the various dissimilarity measures when comparing the different data distributions.

Distributions to compare          Average   Minimum   Ward   Proposed
Ring 1 to 3 + Spiral 1 and 2
Noise 1 to 4
Shuttle 1 and 2
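For reference, the three distance-based baselines can be sketched over two prototype sets as follows (the Ward weighting here follows the classical linkage formula, which we assume is the one meant in [16]):

```python
import numpy as np

def baseline_distances(W_A, W_B):
    """Average (Ad), minimum (Md) and Ward (Wd) distances."""
    d = np.linalg.norm(W_A[:, None, :] - W_B[None, :, :], axis=-1)
    ad = d.mean()                                  # mean over all prototype pairs
    md = d.min()                                   # closest pair of prototypes
    n_a, n_b = len(W_A), len(W_B)
    gap = np.sum((W_A.mean(axis=0) - W_B.mean(axis=0)) ** 2)
    wd = n_a * n_b / (n_a + n_b) * gap             # weighted squared centroid distance
    return ad, md, wd
```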

4 Conclusion and future work

In this article, we proposed a new algorithm for modeling data structure, based on the learning of a SOM, together with a measure of dissimilarity between cluster structures. The advantages of this algorithm are not only its low computational cost and low memory requirements, but also the high accuracy achieved in fitting the structure of the modeled datasets. These properties make it possible to apply the algorithm to the analysis of large databases, and especially of large data streams, which require both speed and economy of resources. The results obtained on artificial and real datasets are very encouraging.

The continuation of our work will focus on validating the algorithm on more real case studies. More specifically, it will be applied to the analysis of real data streams commonly used as benchmarks for this class of algorithms. We will also try to extend the method to structured or symbolic datasets, via kernel-based SOM algorithms. Finally, we wish to deepen the concept of stability applied to this kind of analysis.

References

1. Gehrke, J., Korn, F., Srivastava, D.: On computing correlated aggregates over continual data streams. In: SIGMOD Conference (2001)
2. Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: VLDB (2002)
3. Cao, F., Ester, M., Qian, W., Zhou, A.: Density-based clustering over an evolving data stream with noise. In: 2006 SIAM Conference on Data Mining (2006)
4. Aggarwal, C., Yu, P.: A survey of synopsis construction methods in data streams. In: Aggarwal, C. (ed.): Data Streams: Models and Algorithms. Springer (2007)
5. Kohonen, T.: Self-Organizing Maps. Springer-Verlag, Berlin (2001)
6. Cabanes, G., Bennani, Y.: A local density-based simultaneous two-level algorithm for topographic clustering. In: IJCNN, IEEE (2008)
7. Silverman, B.: Using kernel density estimates to investigate multi-modality. Journal of the Royal Statistical Society, Series B 43 (1981)
8. Vesanto, J.: Neural network tool for data mining: SOM Toolbox (2000)
9. Sain, S., Baggerly, K., Scott, D.: Cross-validation of multivariate densities. Journal of the American Statistical Association 89 (1994)
10. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39 (1977)
11. Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97(458) (2002)
12. Redner, R.A., Walker, H.F.: Mixture densities, maximum likelihood and the EM algorithm. SIAM Review 26(2) (1984)
13. Park, H., Ozeki, T.: Singularity and slow convergence of the EM algorithm for Gaussian mixtures. Neural Processing Letters 29 (2009)
14. Hershey, J.R., Olsen, P.A.: Approximating the Kullback-Leibler divergence between Gaussian mixture models. In: IEEE International Conference on Acoustics, Speech and Signal Processing, Volume 4 (2007)
15. Dunn, J.C.: Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics 4 (1974)
16. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Upper Saddle River, NJ, USA (1988)
