Comparing large datasets structures through unsupervised learning
Guénaël Cabanes and Younès Bennani
LIPN-CNRS, UMR 7030, Université de Paris 13
99, Avenue J-B. Clément, Villetaneuse, France

Abstract. In data mining, the problem of measuring similarities between different subsets of a dataset is an important issue which has been little investigated up to now. In this paper, a novel method based on unsupervised learning is proposed. Different subsets of a dataset are characterized by means of a model which implicitly corresponds to a set of prototypes, each one capturing a different modality of the data. Structural differences between two subsets are then reflected in their corresponding models, and differences between models are detected using a similarity measure based on data density. Experiments on synthetic and real datasets illustrate the effectiveness, efficiency, and insights provided by our approach.

1 Introduction

In recent years, the size of datasets has grown exponentially: studies show that the amount of stored data doubles every year, yet the ability to analyze these data has not kept pace. The problem of mining these data to measure similarities between different datasets is therefore an important issue which has been little investigated up to now. A major application is the analysis of time-evolving datasets, by computing a model of the data structure over different periods of time and comparing the models to detect changes as they occur. There are, however, many other possible applications, such as large dataset comparison, cluster merging, stability measurement, and so on. As the study of data streams and large databases is a difficult problem, because of the computing costs and the large storage volumes involved, two issues appear to play a key role in such an analysis: (i) a good condensed description of the data properties [1,2] and (ii) a measure capable of detecting changes in the data structure [3,4].
In this paper we propose a new algorithm able to perform these two tasks. The solution consists of an algorithm that first constructs an abstract representation of the datasets to compare, and then evaluates the dissimilarity between them based on this representation. The abstract representation is based on the learning of a variant of the Self-Organizing Map (SOM) [5], enriched with structural information extracted from the data. We then propose a method to estimate, from this abstract representation, the underlying data density function.

(This work was supported in part by the CADI project (ANR-07 TLOG 003) financed by the ANR, the Agence Nationale de la Recherche.)

The dissimilarity is a measure of the divergence
between the two estimated densities. A great advantage of this method is that each enriched SOM is at once a highly informative and a highly condensed description of the data structure, which can easily be stored for future use. Moreover, as the algorithm is very efficient both in computational complexity and in memory requirements, it can be used to compare large datasets or to detect structural changes in data streams.

The remainder of this paper is organized as follows. Section 2 presents the new algorithm. Section 3 describes the validation protocol and some results. Conclusions and future work perspectives are given in Section 4.

2 A new two-level algorithm to compare data structures

The basic assumption in this work is that the data are described as vectors of numeric attributes and that the datasets to compare have the same type. First, each dataset is modeled with an enriched Self-Organizing Map (SOM) model (adapted from [6]), yielding an abstract representation intended to capture the essential data structure. Then, the density function of each dataset is estimated from its abstract representation. Finally, the datasets are compared using a dissimilarity measure based on these density functions. The idea is to combine the dimension-reduction and fast-learning capabilities of the SOM in a first level to construct a new, reduced vector space, and then to apply further analyses in this new space. Such approaches are called two-level methods; they are known to greatly reduce computational time, the effects of noise, and the curse of dimensionality [6]. Furthermore, they allow some visual interpretation of the result through the two-dimensional map generated by the SOM.

2.1 Abstract algorithm schema

The algorithm proceeds in three steps:

1. The first step is the learning of the enriched SOM. During the learning, each SOM prototype is extended with novel information extracted from the data.
This structural information will be used in the second step to infer the density function. More specifically, the attributes added to each prototype are:

- Density mode: a measure of the data density surrounding the prototype (local density), i.e. of the amount of data present in the corresponding area of the input space. We use a Gaussian kernel estimator [7] for this task.
- Local variability: a measure of the variability of the data represented by the prototype, defined as the average distance between the prototype and the data it represents.
- Neighborhood: a measure associated with each pair of prototypes, namely the number of data that are well represented by both of them.

2. The second step is the construction, from each enriched SOM, of a density function used to estimate the density over the input space. This function is induced from the information associated with the prototypes of the SOM, and is represented as a mixture model of spherical normal functions.
3. The last step performs the comparison of two datasets, using a dissimilarity measure between the two density functions constructed in the previous steps.

2.2 Prototypes enrichment

In this step, global information is extracted from the data and stored in the prototypes during the learning of the SOM. The Kohonen SOM can be classified as a competitive unsupervised learning neural network [5]. A SOM consists of a two-dimensional map of M neurons (units), connected to the n inputs by n weight connections w_j = (w_{1j}, ..., w_{nj}) (also called prototypes) and to their neighbors by topological links. The training set is used to organize the map under topological constraints of the input space; an optimal spatial organization of the input data is thus determined by the SOM. In our algorithm, the SOM prototypes are enriched by adding new numerical values extracted from the dataset. The enrichment algorithm proceeds as follows:

Input: the data X = {x_k}, k = 1, ..., N.
Output: the density D_i and the local variability s_i associated with each prototype w_i; the neighborhood value v_{i,j} associated with each pair of prototypes (w_i, w_j).

Algorithm:
1. Initialization: initialize the SOM parameters; for all i, j, set to zero the local densities D_i, the neighborhood values v_{i,j}, the local variabilities s_i and the number N_i of data represented by w_i.
2. Choose randomly a data point x_k in X: compute d(w_i, x_k), the Euclidean distance between x_k and each prototype w_i, and find the two closest prototypes (the Best Matching Units, BMUs) w_u and w_{u'}: u = arg min_i d(w_i, x_k) and u' = arg min_{i≠u} d(w_i, x_k).
3. Update the structural values:
   - Number of data: N_u = N_u + 1.
   - Variability: s_u = s_u + d(w_u, x_k).
   - Density: for all i, D_i = D_i + (1 / (√(2π) h)) exp(−d(w_i, x_k)² / (2h²)).
   - Neighborhood: v_{u,u'} = v_{u,u'} + 1.
4. Update the SOM prototypes w_i as defined in [5].
5. Repeat steps 2 to 4, T times.
6. Compute the final structural values: for all i, s_i = s_i / N_i and D_i = D_i / N.

In this study we used the default parameters of the SOM Toolbox [8] for the learning of the SOM, with T = max(N, 50M) as in [8]. The number M of prototypes must be neither too small (the SOM would not fit the data well) nor too large (time consuming); choosing M close to √N seems to be a good trade-off [8]. The last parameter to choose is the bandwidth h. The choice of h is important for good results, but its optimal value is difficult and time consuming to compute (see [9]). A heuristic that seems relevant and gives good results is to define h as the average distance between a prototype and its closest neighbor [6].

At the end of this process, each prototype is associated with a density value and a variability value, and each pair of prototypes with a neighborhood value. The essential information about the structure of the data is captured by these values, so it is no longer necessary to keep the data in memory.

2.3 Estimation of the density function

The objective of this step is to estimate the density function that associates a density value with each point of the input space. We already have an estimate of the value of this function at the position of each prototype (namely D_i); we must infer from these values an approximation of the whole function. Our hypothesis is that this function can be properly approximated by a mixture of Gaussian kernels, each kernel K_i being a Gaussian function centered on a prototype. The density function can therefore be written as:

f(x) = Σ_{i=1}^{M} α_i K_i(x)   with   K_i(x) = (1 / (N √(2π) h_i)) exp(−d(w_i, x)² / (2 h_i²))

The most popular method to fit mixture models (i.e., to find the h_i and α_i) is the expectation-maximization (EM) algorithm [10]. However, this algorithm needs to work in the data input space.
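The enrichment phase of Section 2.2 (steps 1 to 6), together with the bandwidth heuristic just described, can be sketched as follows. This is a minimal Python/NumPy illustration, not the authors' implementation: the SOM weight update of step 4 is omitted, the pass runs once over the data rather than T sampled steps, and `prototypes`, `X` and `h` are hypothetical inputs.

```python
import numpy as np

def bandwidth_heuristic(prototypes):
    """h: the average distance between a prototype and its closest neighbour."""
    P = np.asarray(prototypes, dtype=float)
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude self-distances
    return d.min(axis=1).mean()

def enrich(prototypes, X, h):
    """One enrichment pass: accumulate for each prototype the local density D_i,
    the local variability s_i, the data count N_i, and the pairwise
    neighborhood values v_{i,j} (the SOM weight update itself is omitted)."""
    P = np.asarray(prototypes, dtype=float)
    M = len(P)
    D = np.zeros(M)                           # local densities
    s = np.zeros(M)                           # summed distances to the BMU
    N = np.zeros(M)                           # data represented per prototype
    v = np.zeros((M, M))                      # neighborhood values
    for x in np.asarray(X, dtype=float):
        dist = np.linalg.norm(P - x, axis=1)
        u, u2 = np.argsort(dist)[:2]          # the two best matching units
        N[u] += 1
        s[u] += dist[u]
        D += np.exp(-dist**2 / (2 * h**2)) / (np.sqrt(2 * np.pi) * h)
        v[u, u2] += 1                         # x is well represented by both BMUs
        v[u2, u] += 1
    s = np.divide(s, N, out=np.zeros_like(s), where=N > 0)   # s_i / N_i
    D /= len(X)                                              # D_i / N
    return D, s, N, v
```

After this pass the data can be discarded: D, s, N and v are the condensed description used in the following steps.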
As we work here on the enriched SOM instead of the dataset, we cannot use the EM algorithm. We therefore propose the following heuristic to choose h_i:

h_i = [ Σ_j (v_{i,j} / (N_i + N_j)) (s_i N_i + d_{i,j} N_j) ] / [ Σ_j v_{i,j} ]

where d_{i,j} is the Euclidean distance between w_i and w_j. The idea is that h_i is the standard deviation of the data represented by K_i. These data are also represented by w_i and its neighbors, so h_i depends on the variability s_i computed for w_i and on the distances d_{i,j} between w_i and its neighbors, weighted by the number of data represented by each prototype and by the neighborhood values between w_i and its neighbors.

Now, since the density D_i at each prototype w_i is known (f(w_i) = D_i), we can use a gradient descent method to determine the weights α_i. The α_i are initialized with the
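The h_i heuristic and the gradient-descent fit of the α_i can be sketched as below. This is an illustrative sketch under stated assumptions: the kernel normalisation follows the mixture formula given above, and the learning rate `lr` and step count `steps` are arbitrary choices not taken from the paper.

```python
import numpy as np

def kernel_bandwidths(prototypes, s, N, v):
    """Per-kernel bandwidth h_i: a neighborhood-weighted combination of the
    local variability s_i and the distances d_{i,j} to connected prototypes."""
    P = np.asarray(prototypes, dtype=float)
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    h = np.zeros(len(P))
    for i in range(len(P)):
        num = den = 0.0
        for j in range(len(P)):
            if v[i, j] > 0:
                num += v[i, j] / (N[i] + N[j]) * (s[i] * N[i] + d[i, j] * N[j])
                den += v[i, j]
        h[i] = num / den if den > 0 else s[i]  # isolated prototype: fall back on s_i
    return h

def fit_alphas(prototypes, D, h, n_total, lr=0.1, steps=500):
    """Fit the mixture weights alpha_i by gradient descent on the squared error
    between the mixture evaluated at the prototypes and the densities D_i."""
    P = np.asarray(prototypes, dtype=float)
    d2 = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=-1)
    # K[i, j] = K_j(w_i), with the kernel normalisation used in the text
    K = np.exp(-d2 / (2 * h[None, :] ** 2)) / (n_total * np.sqrt(2 * np.pi) * h[None, :])
    alpha = np.array(D, dtype=float)          # initialise with the densities D_i
    for _ in range(steps):
        residual = K @ alpha - D
        alpha -= lr * (K.T @ residual) / len(D)
    return alpha
```

The fallback for prototypes without any neighborhood connection is an added safeguard, not part of the original formulation.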
values of D_i, then gradually adjusted to better fit D_i = Σ_{j=1}^{M} α_j K_j(w_i). To do this, we optimize the following criterion:

α* = arg min_α (1/M) Σ_{i=1}^{M} ( Σ_{j=1}^{M} α_j K_j(w_i) − D_i )²

Thus we now have a density function that is a model of the dataset represented by the enriched SOM.

2.4 Algorithm complexity

The complexity of the algorithm scales as O(T M), with T the number of learning steps and M the number of prototypes in the SOM. It is recommended to set at least T > 10M for a good convergence of the SOM [5]; in this study we use T = max(N, 50M) as in [8]. This means that if N > 50M (a large database), the complexity of the algorithm is O(N M), i.e., linear in N for a fixed size of the SOM. The whole process is therefore very fast and well suited to the treatment of large databases. Very large databases can also be handled by fixing T < N (which is similar to working on a random subsample of the database). This is much faster than traditional density estimators such as the kernel estimator [7] (which also needs to keep all data in memory) or a Gaussian mixture model [11] estimated with the EM algorithm (whose convergence can become extraordinarily slow [12,13]).

2.5 The dissimilarity measure

We can now define a measure of dissimilarity between two datasets A and B, represented by two SOMs: SOM_A = ({w_i^A}_{i=1}^{M_A}, f_A) and SOM_B = ({w_j^B}_{j=1}^{M_B}, f_B), where M_A and M_B are the numbers of prototypes in models A and B, and f_A and f_B are the density functions of A and B computed as in Section 2.3. The dissimilarity between A and B is given by:

CBd(A, B) = (1/M_A) Σ_{i=1}^{M_A} f_A(w_i^A) log( f_A(w_i^A) / f_B(w_i^A) ) + (1/M_B) Σ_{j=1}^{M_B} f_B(w_j^B) log( f_B(w_j^B) / f_A(w_j^B) )

The idea is to compare the density functions f_A and f_B at each prototype of A and B: if the distributions are identical, these two values must be very close for every prototype. This measure is an adaptation of the weighted Monte Carlo approximation of the symmetrical Kullback-Leibler divergence (see [14]), using the prototypes of a SOM as a sample of the database.
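Given the two estimated density functions, the dissimilarity can be sketched as follows. This is a minimal illustration, assuming `f_a` and `f_b` are callables evaluating the estimated densities; the `eps` guard against vanishing densities is an addition for numerical safety, not part of the original measure.

```python
import numpy as np

def cbd(protos_a, f_a, protos_b, f_b, eps=1e-12):
    """Density-based dissimilarity between two models: a symmetrised,
    prototype-sampled Kullback-Leibler-style divergence. f_a and f_b are
    callables returning the estimated density at a given point."""
    term_a = np.mean([f_a(w) * np.log((f_a(w) + eps) / (f_b(w) + eps))
                      for w in protos_a])
    term_b = np.mean([f_b(w) * np.log((f_b(w) + eps) / (f_a(w) + eps))
                      for w in protos_b])
    return term_a + term_b
```

When both models describe the same distribution the log-ratios vanish and the measure is close to zero; when the densities diverge at the prototypes, both terms grow.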
3 Validation

3.1 Description of the datasets used

In order to demonstrate the performance of the proposed dissimilarity measure, we used nine artificial dataset generators and one real dataset. Ring 1, Ring 2, Ring 3, Spiral 1 and Spiral 2 are five two-dimensional non-convex distributions with different densities and variances: Ring 1 is a ring of radius 1 (high density), Ring 2 a ring of radius 3 (low density), and Ring 3 a ring of radius 5 (average density); Spiral 1 and Spiral 2 are two parallel spirals whose density decreases with the radius. Datasets from these distributions can be generated randomly. Noise 1 to Noise 4 are four different two-dimensional distributions, each composed of several Gaussian distributions plus heavy homogeneous noise. Finally, the Shuttle real dataset comes from the UCI repository. It is a nine-dimensional dataset whose instances are divided into seven classes; approximately 80% of the data belongs to class 1.

3.2 Validity of the dissimilarity measure

It is impossible to prove that two distributions are exactly the same: a low dissimilarity value is only consistent with a similar distribution, although it does give an indication of the similarity between the two sample distributions. On the other hand, a very high dissimilarity does show, at a given level of significance, that the distributions are different. Thus, if our dissimilarity measure is effective, it should be possible to compare different datasets (with the same attributes) to detect the presence of similar distributions: the dissimilarity between datasets generated from the same distribution law must be much smaller than the dissimilarity between datasets generated from very different distributions. To test this hypothesis we applied the following protocol:

1. We generated 250 different datasets from the Ring 1, Ring 2, Ring 3, Spiral 1 and Spiral 2 distributions (50 datasets from each).
Each set contains between 500 and data points.
2. For each of these datasets, we learned an enriched SOM to obtain a set of representative prototypes of the data. The number of prototypes was randomly chosen from 50 to .
3. We computed a density function for each SOM and compared the SOMs to each other with the proposed dissimilarity measure.
4. Each SOM was labeled according to the distribution it represents (the labels are ring 1, ring 2, ring 3, spiral 1 and spiral 2). We then calculated an index of compactness and separability of the SOMs having the same label, using the generalized Dunn index [15]. The bigger this index, the more the SOMs with the same label are similar to each other and dissimilar to SOMs with other labels, according to the dissimilarity function used.
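Step 4 of the protocol can be reproduced with any dissimilarity matrix. The sketch below shows one simple form of a generalized Dunn index over labelled models (the exact generalization used in [15] may differ; this is the smallest between-label dissimilarity divided by the largest within-label dissimilarity):

```python
import numpy as np

def dunn_index(dissim, labels):
    """Smallest between-label dissimilarity divided by the largest
    within-label dissimilarity (one simple generalisation of Dunn's index)."""
    dissim = np.asarray(dissim, dtype=float)
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = dissim[same & off_diag]           # pairs of models with the same label
    inter = dissim[~same]                     # pairs of models with different labels
    return inter.min() / intra.max()
```

Values well above 1 mean that models of the same distribution are mutually closer than models of different distributions.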
We used the same protocol with Noise 1 to Noise 4 and with the Shuttle database. Two kinds of distributions were extracted from the Shuttle database by random sub-sampling (with replacement): data from class 1 (Shuttle 1) and data from the other classes (Shuttle 2). We compared the results with distance-based measures usually used to compare two sets of data (here, two sets of prototypes from the SOMs): the average distance (Ad: the average distance between all pairs of prototypes in the two SOMs), the minimum distance (Md: the smallest Euclidean distance between prototypes of the two SOMs), and the Ward distance (Wd: the distance between the two centroids, weighted according to the number of prototypes in the two SOMs) [16]. All results are given in Table 1.

Visual inspection of the dissimilarity matrices obtained from the different measures is consistent with the values of the Dunn index (Table 1) and shows that the proposed density-based similarity measure (CBd) is much more effective than the distance-based measures (see Fig. 1 for an example).

Fig. 1. Visualizations of the dissimilarity matrices obtained from the different measures, (a) Ad, (b) Md, (c) Wd, (d) CBd, for the comparisons of the Noise 1 to 4 distributions. Darker cells mean higher similarity between the two distributions. Datasets from the same distribution are sorted, so their comparisons appear as four boxes along the diagonal of the matrix.

As shown in Table 1, our dissimilarity measure using density is much more effective than measures that only use the distances between prototypes. Indeed, the Dunn index for the density-based measure is much higher than for the distance-based ones on the three kinds of datasets tested: non-convex data, noisy data and real data. This means that our dissimilarity measure is much more effective for the detection of similar datasets.

Table 1.
Value of the Dunn index obtained from various dissimilarity measures when comparing the various data distributions.

Distributions to compare        Average   Minimum   Ward   Proposed
Ring 1 to 3 + Spiral 1 and 2
Noise 1 to 4
Shuttle 1 and 2
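The three distance-based baselines (Ad, Md, Wd) compared against CBd can be sketched as below. The Ward form used here (squared distance between centroids weighted by the set sizes) is one common formulation consistent with the description above, assumed rather than taken verbatim from [16]:

```python
import numpy as np

def baseline_distances(A, B):
    """The three distance-based baselines: average distance (Ad), minimum
    distance (Md), and a Ward-style distance (Wd) between the centroids,
    weighted by the number of prototypes in each SOM."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    ad = d.mean()                             # Ad: mean over all prototype pairs
    md = d.min()                              # Md: closest pair of prototypes
    na, nb = len(A), len(B)
    wd = na * nb / (na + nb) * np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)
    return ad, md, wd
```

Because these baselines see only prototype positions, not densities, they cannot separate distributions that overlap spatially but differ in density, which is precisely where CBd gains its advantage.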
4 Conclusion and future works

In this article, we proposed a new algorithm for modeling data structure, based on the learning of a SOM, together with a measure of dissimilarity between cluster structures. The advantages of this algorithm are not only its low computational cost and low memory requirements, but also the high accuracy achieved in fitting the structure of the modeled datasets. These properties make it possible to apply the algorithm to the analysis of large databases, and especially of large data streams, which requires both speed and economy of resources. The results obtained on artificial and real datasets are very encouraging.

Future work will focus on validating the algorithm on more real case studies. More specifically, it will be applied to the analysis of real data streams commonly used as benchmarks for this class of algorithms. We will also try to extend the method to structured or symbolic datasets, via kernel-based SOM algorithms. Finally, we wish to deepen the concept of stability applied to this kind of analysis.

References

1. Gehrke, J., Korn, F., Srivastava, D.: On computing correlated aggregates over continual data streams. In: SIGMOD Conference (2001)
2. Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: VLDB (2002)
3. Cao, F., Ester, M., Qian, W., Zhou, A.: Density-based clustering over an evolving data stream with noise. In: 2006 SIAM Conference on Data Mining (2006)
4. Aggarwal, C., Yu, P.: A survey of synopsis construction methods in data streams. In: C. Aggarwal (ed.): Data Streams: Models and Algorithms. Springer (2007)
5. Kohonen, T.: Self-Organizing Maps. Springer-Verlag, Berlin (2001)
6. Cabanes, G., Bennani, Y.: A local density-based simultaneous two-level algorithm for topographic clustering. In: IJCNN, IEEE (2008)
7. Silverman, B.: Using kernel density estimates to investigate multi-modality.
Journal of the Royal Statistical Society, Series B 43 (1981)
8. Vesanto, J.: Neural network tool for data mining: SOM Toolbox (2000)
9. Sain, S., Baggerly, K., Scott, D.: Cross-validation of multivariate densities. Journal of the American Statistical Association 89 (1994)
10. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39 (1977)
11. Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97(458) (2002)
12. Redner, R.A., Walker, H.F.: Mixture densities, maximum likelihood and the EM algorithm. SIAM Review 26(2) (1984)
13. Park, H., Ozeki, T.: Singularity and slow convergence of the EM algorithm for Gaussian mixtures. Neural Processing Letters 29 (2009)
14. Hershey, J.R., Olsen, P.A.: Approximating the Kullback-Leibler divergence between Gaussian mixture models. In: IEEE International Conference on Acoustics, Speech and Signal Processing, Volume 4 (2007)
15. Dunn, J.C.: Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics 4 (1974)
16. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Upper Saddle River, NJ, USA (1988)
More informationComponent Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
More informationPERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA
PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA Prakash Singh 1, Aarohi Surya 2 1 Department of Finance, IIM Lucknow, Lucknow, India 2 Department of Computer Science, LNMIIT, Jaipur,
More informationModel-Based Cluster Analysis for Web Users Sessions
Model-Based Cluster Analysis for Web Users Sessions George Pallis, Lefteris Angelis, and Athena Vakali Department of Informatics, Aristotle University of Thessaloniki, 54124, Thessaloniki, Greece gpallis@ccf.auth.gr
More informationVisualizing class probability estimators
Visualizing class probability estimators Eibe Frank and Mark Hall Department of Computer Science University of Waikato Hamilton, New Zealand {eibe, mhall}@cs.waikato.ac.nz Abstract. Inducing classifiers
More informationAnalyzing The Role Of Dimension Arrangement For Data Visualization in Radviz
Analyzing The Role Of Dimension Arrangement For Data Visualization in Radviz Luigi Di Caro 1, Vanessa Frias-Martinez 2, and Enrique Frias-Martinez 2 1 Department of Computer Science, Universita di Torino,
More informationUsing Smoothed Data Histograms for Cluster Visualization in Self-Organizing Maps
Technical Report OeFAI-TR-2002-29, extended version published in Proceedings of the International Conference on Artificial Neural Networks, Springer Lecture Notes in Computer Science, Madrid, Spain, 2002.
More informationOn Density Based Transforms for Uncertain Data Mining
On Density Based Transforms for Uncertain Data Mining Charu C. Aggarwal IBM T. J. Watson Research Center 19 Skyline Drive, Hawthorne, NY 10532 charu@us.ibm.com Abstract In spite of the great progress in
More informationHow To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
More informationPrinciples of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
More informationLarge-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
More informationMobile Phone APP Software Browsing Behavior using Clustering Analysis
Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationIntroduction to Machine Learning Using Python. Vikram Kamath
Introduction to Machine Learning Using Python Vikram Kamath Contents: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Introduction/Definition Where and Why ML is used Types of Learning Supervised Learning Linear Regression
More informationPredict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons
More informationData topology visualization for the Self-Organizing Map
Data topology visualization for the Self-Organizing Map Kadim Taşdemir and Erzsébet Merényi Rice University - Electrical & Computer Engineering 6100 Main Street, Houston, TX, 77005 - USA Abstract. The
More informationChapter 7. Cluster Analysis
Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. Density-Based Methods 6. Grid-Based Methods 7. Model-Based
More informationA Survey on Outlier Detection Techniques for Credit Card Fraud Detection
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 2, Ver. VI (Mar-Apr. 2014), PP 44-48 A Survey on Outlier Detection Techniques for Credit Card Fraud
More information6.2.8 Neural networks for data mining
6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural
More informationModelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic
More informationAnalysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet
More informationLocal Anomaly Detection for Network System Log Monitoring
Local Anomaly Detection for Network System Log Monitoring Pekka Kumpulainen Kimmo Hätönen Tampere University of Technology Nokia Siemens Networks pekka.kumpulainen@tut.fi kimmo.hatonen@nsn.com Abstract
More informationLoad balancing in a heterogeneous computer system by self-organizing Kohonen network
Bull. Nov. Comp. Center, Comp. Science, 25 (2006), 69 74 c 2006 NCC Publisher Load balancing in a heterogeneous computer system by self-organizing Kohonen network Mikhail S. Tarkov, Yakov S. Bezrukov Abstract.
More informationThe Role of Visualization in Effective Data Cleaning
The Role of Visualization in Effective Data Cleaning Yu Qian Dept. of Computer Science The University of Texas at Dallas Richardson, TX 75083-0688, USA qianyu@student.utdallas.edu Kang Zhang Dept. of Computer
More informationClustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016
Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with
More informationInternational Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 12, December 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationData Mining and Neural Networks in Stata
Data Mining and Neural Networks in Stata 2 nd Italian Stata Users Group Meeting Milano, 10 October 2005 Mario Lucchini e Maurizo Pisati Università di Milano-Bicocca mario.lucchini@unimib.it maurizio.pisati@unimib.it
More informationData Preprocessing. Week 2
Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.
More informationSupervised Feature Selection & Unsupervised Dimensionality Reduction
Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or
More informationSelf Organizing Maps for Visualization of Categories
Self Organizing Maps for Visualization of Categories Julian Szymański 1 and Włodzisław Duch 2,3 1 Department of Computer Systems Architecture, Gdańsk University of Technology, Poland, julian.szymanski@eti.pg.gda.pl
More informationResource-bounded Fraud Detection
Resource-bounded Fraud Detection Luis Torgo LIAAD-INESC Porto LA / FEP, University of Porto R. de Ceuta, 118, 6., 4050-190 Porto, Portugal ltorgo@liaad.up.pt http://www.liaad.up.pt/~ltorgo Abstract. This
More informationHigh-dimensional labeled data analysis with Gabriel graphs
High-dimensional labeled data analysis with Gabriel graphs Michaël Aupetit CEA - DAM Département Analyse Surveillance Environnement BP 12-91680 - Bruyères-Le-Châtel, France Abstract. We propose the use
More informationClustering Time Series Based on Forecast Distributions Using Kullback-Leibler Divergence
Clustering Time Series Based on Forecast Distributions Using Kullback-Leibler Divergence Taiyeong Lee, Yongqiao Xiao, Xiangxiang Meng, David Duling SAS Institute, Inc 100 SAS Campus Dr. Cary, NC 27513,
More informationClass #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
More informationTRAINING A LIMITED-INTERCONNECT, SYNTHETIC NEURAL IC
777 TRAINING A LIMITED-INTERCONNECT, SYNTHETIC NEURAL IC M.R. Walker. S. Haghighi. A. Afghan. and L.A. Akers Center for Solid State Electronics Research Arizona State University Tempe. AZ 85287-6206 mwalker@enuxha.eas.asu.edu
More informationCluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico
Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from
More informationForschungskolleg Data Analytics Methods and Techniques
Forschungskolleg Data Analytics Methods and Techniques Martin Hahmann, Gunnar Schröder, Phillip Grosse Prof. Dr.-Ing. Wolfgang Lehner Why do we need it? We are drowning in data, but starving for knowledge!
More informationSupervised and unsupervised learning - 1
Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in
More informationCluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009
Cluster Analysis Alison Merikangas Data Analysis Seminar 18 November 2009 Overview What is cluster analysis? Types of cluster Distance functions Clustering methods Agglomerative K-means Density-based Interpretation
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationSelf-Organizing g Maps (SOM) COMP61021 Modelling and Visualization of High Dimensional Data
Self-Organizing g Maps (SOM) Ke Chen Outline Introduction ti Biological Motivation Kohonen SOM Learning Algorithm Visualization Method Examples Relevant Issues Conclusions 2 Introduction Self-organizing
More informationViSOM A Novel Method for Multivariate Data Projection and Structure Visualization
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 1, JANUARY 2002 237 ViSOM A Novel Method for Multivariate Data Projection and Structure Visualization Hujun Yin Abstract When used for visualization of
More informationANALYSIS OF VARIOUS CLUSTERING ALGORITHMS OF DATA MINING ON HEALTH INFORMATICS
ANALYSIS OF VARIOUS CLUSTERING ALGORITHMS OF DATA MINING ON HEALTH INFORMATICS 1 PANKAJ SAXENA & 2 SUSHMA LEHRI 1 Deptt. Of Computer Applications, RBS Management Techanical Campus, Agra 2 Institute of
More informationMedical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu
Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
More informationUnsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning
Unsupervised Learning and Data Mining Unsupervised Learning and Data Mining Clustering Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression...
More informationFUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM
International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT
More informationMachine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut.
Machine Learning and Data Analysis overview Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz psyllabus Lecture Lecturer Content 1. J. Kléma Introduction,
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationCustomer Data Mining and Visualization by Generative Topographic Mapping Methods
Customer Data Mining and Visualization by Generative Topographic Mapping Methods Jinsan Yang and Byoung-Tak Zhang Artificial Intelligence Lab (SCAI) School of Computer Science and Engineering Seoul National
More informationAn Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis]
An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] Stephan Spiegel and Sahin Albayrak DAI-Lab, Technische Universität Berlin, Ernst-Reuter-Platz 7,
More informationEmotion Detection from Speech
Emotion Detection from Speech 1. Introduction Although emotion detection from speech is a relatively new field of research, it has many potential applications. In human-computer or human-human interaction
More informationModels of Cortical Maps II
CN510: Principles and Methods of Cognitive and Neural Modeling Models of Cortical Maps II Lecture 19 Instructor: Anatoli Gorchetchnikov dy dt The Network of Grossberg (1976) Ay B y f (
More informationClustering. Data Mining. Abraham Otero. Data Mining. Agenda
Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in
More informationOn the effects of dimensionality on data analysis with neural networks
On the effects of dimensionality on data analysis with neural networks M. Verleysen 1, D. François 2, G. Simon 1, V. Wertz 2 * Université catholique de Louvain 1 DICE, 3 place du Levant, B-1348 Louvain-la-Neuve,
More information