# Visualization of textual data: unfolding the Kohonen maps.


Ludovic Lebart

CNRS - GET - ENST, 46 rue Barrault, 75013 Paris, France

**Abstract.** The Kohonen self-organizing maps (SOM) can be viewed as a visualisation tool that performs a sort of compromise between a high-dimensional set of clusters and the two-dimensional plane generated by some principal axes technique. The paper proposes, through Contiguity Analysis, a set of linear projectors providing a representation as close as possible to a SOM map. In so doing, we can assess the locations of the points representing the elements via a partial bootstrap procedure.

**Keywords:** Contiguity analysis, Kohonen maps, SOM, Bootstrap.

## 1 Introduction

For many users of visualisation tools, the Kohonen self-organizing maps (SOM) outperform both the usual clustering techniques and the principal axes techniques (principal components analysis, correspondence analysis, etc.). Indeed, the displays of identifiers of words (or text units) within rectangular or hexagonal cells allow for clear and legible printouts. The SOM grid, basically non-linear, can then be viewed as a compromise between a high-dimensional set of clusters and the planes generated by pairs of principal axes. One can regret, however, the absence of assessment procedures and of valid statistical inference. The paper proposes, through Contiguity Analysis (briefly reviewed in section 2), a set of linear projectors providing a representation as close as possible to a SOM map (sections 3 and 4). An example of application is given in section 5. Via a partial bootstrap procedure, we can then provide these representations with confidence areas (e.g. ellipses) around the locations of the words (section 6).

## 2 Brief reminder about contiguity analysis

Let us consider a set of multivariate observations (n observations described by p variables, leading to an (n, p) matrix X) having an a priori graph structure.
The n observations are also the n vertices of a symmetric graph G, whose associated (n, n) matrix is M ($m_{ii'} = 1$ if vertices i and i' are joined by an edge, $m_{ii'} = 0$ otherwise). We denote by N the (n, n) diagonal matrix

having the degree $n_i$ of each vertex i as diagonal element, with $n_i = \sum_{i'} m_{ii'}$. y is the vector whose i-th component is $y_i$, and U designates the square matrix such that $u_{ij} = 1$ for all i and j. For a variable y taking a value on each vertex i of the symmetric graph G, the local variance $v^*(y)$ is defined as:

$$v^*(y) = \frac{1}{n} \sum_i \left(y_i - m_i^*\right)^2, \qquad \text{where } m_i^* = \frac{1}{n_i} \sum_{i'} m_{ii'}\, y_{i'}.$$

$m_i^*$ is the average of the values adjacent to vertex i. Note that if G is a complete graph (all pairs (i, i') joined by an edge), $v^*(y)$ is nothing but v(y), the classical empirical variance. When the observations are distributed randomly on the graph, both $v^*(y)$ and v(y) are estimates of the variance of y. The contiguity ratio (analogous to the Geary contiguity ratio [Geary, 1954]) is written:

$$c(y) = v^*(y)/v(y).$$

It can be generalized (a) to different distances between vertices in the graph, and (b) to multivariate observations (both generalizations are dealt with in [Lebart, 1969]). This section is devoted to the second generalization: multivariate observations having an a priori graph structure. The multivariate analogue of the local variance is the local covariance matrix $V^*$, given by (with the notation defined above):

$$V^* = \frac{1}{n}\, X'(I - N^{-1}M)'(I - N^{-1}M)\,X.$$

The diagonalization of the corresponding local correlation matrix (Local Principal Component Analysis [Aluja Banet and Lebart, 1984]) produces a description of the local correlations that can be compared with the results of a PCA. Comparisons between the local and global correlation matrices can be done through Procrustean analysis (see [Gower and Dijksterhuis, 2004]). If the graph is made of k disjoint complete subgraphs, $V^*$ coincides with the classical within-class covariance matrix used in linear discriminant analysis. If the graph is complete (associated matrix U, defined above), then $V^*$ is the classical global covariance matrix V.
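The local variance and contiguity ratio can be computed directly from the adjacency matrix. A minimal numpy sketch (the function name and the toy two-clique graph are illustrative, not from the paper):

```python
import numpy as np

def contiguity_ratio(y, M):
    """Contiguity ratio c(y) = v*(y) / v(y) for a variable y observed
    on the vertices of a symmetric graph with 0/1 adjacency matrix M."""
    n = len(y)
    deg = M.sum(axis=1)                    # vertex degrees n_i
    m_star = (M @ y) / deg                 # average of the values adjacent to each vertex
    v_star = np.mean((y - m_star) ** 2)    # local variance v*(y)
    v = np.var(y)                          # classical empirical variance v(y)
    return v_star / v

# Two cliques of 4 vertices, y constant within each clique:
# no variation along any edge, so the contiguity ratio is 0.
M = np.kron(np.eye(2), np.ones((4, 4))) - np.eye(8)
y = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
print(contiguity_ratio(y, M))   # 0.0
```

On the complete graph the ratio is close to 1, in line with the remark above that $v^*(y)$ and v(y) then estimate the same variance.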
Let u be a vector defining a linear combination u(i) of the p variables at vertex i:

$$u(i) = \sum_j u_j\, y_{ij} = u' y_i.$$

The local variance of u(i) is $v^*(u) = u'V^*u$, and the contiguity coefficient of u(i) can be written:

$$c(u) = \frac{u'V^*u}{u'Vu}.$$

Contiguity Analysis is the search for the u that minimizes c(u). It produces linear functions having the property of minimal contiguity. Instead of assigning an observation to a specific class (as discriminant analysis does), these functions allow one to assign it to a specific part of the graph. Contiguity Analysis can therefore be used to discriminate between overlapping classes.
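Since c(u) is a Rayleigh quotient, minimizing it amounts to solving the generalized eigenproblem $V^*u = \lambda V u$ and keeping the smallest eigenvalues. A numpy-only sketch under that reading, assuming centred data and an invertible global covariance matrix (function and variable names are illustrative):

```python
import numpy as np

def contiguity_axes(X, M, n_axes=2):
    """Axes u minimizing c(u) = u'V*u / u'Vu, via the generalized
    eigenproblem V* u = lambda V u solved by Cholesky whitening."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)               # centre the data
    deg = M.sum(axis=1)                   # vertex degrees n_i
    P = np.eye(n) - M / deg[:, None]      # I - N^{-1} M
    V_star = Xc.T @ P.T @ P @ Xc / n      # local covariance matrix V*
    V = Xc.T @ Xc / n                     # global covariance matrix V
    L = np.linalg.cholesky(V)             # V = L L'
    Li = np.linalg.inv(L)
    w, Q = np.linalg.eigh(Li @ V_star @ Li.T)   # ascending eigenvalues
    U = Li.T @ Q                          # back-transform: u = L^{-T} q
    return w[:n_axes], U[:, :n_axes]

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
M = np.triu(rng.random((30, 30)) < 0.2, 1).astype(float)
M = M + M.T                               # symmetric 0/1 graph, no self-loops
i = np.arange(30)
M[i, (i + 1) % 30] = M[(i + 1) % 30, i] = 1   # add a ring so every degree >= 1
vals, axes = contiguity_axes(X, M)
```

The first column of `axes` is the direction of minimal contiguity; its coefficient c(u) equals the smallest generalized eigenvalue.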

## 3 SOM maps and associated graphs

The self-organizing maps (SOM) [Kohonen, 1989] aim at clustering a set of multivariate observations. The obtained clusters are displayed as the vertices of a rectangular (chessboard-like) or hexagonal graph. The distances between vertices on the graph are supposed to reflect, as much as possible, the distances between clusters in the initial space. Let us summarize the principles of the algorithm. The size of the graph, and consequently the number of clusters, is chosen a priori (for example, a square grid with 5 rows and 5 columns, leading to 25 clusters). The algorithm is similar to the MacQueen algorithm [MacQueen, 1967] in its on-line version, and to the k-means algorithm [Forgy, 1984] in its batch version. Let us consider n points in a p-dimensional space (the rows of the (n, p) matrix X). At the outset, each cluster k is assigned a provisional centre $C_k$ with p components (e.g. chosen at random). At each step t, the element $x_{i(t)}$ is assigned to its nearest provisional centre $C_{k(t)}$. This centre, together with its neighbours on the grid, is then modified according to the formula:

$$C_{k(t+1)} = C_{k(t)} + \varepsilon(t)\,\big(x_{i(t)} - C_{k(t)}\big).$$

In this formula, $\varepsilon(t)$ is an adaptation parameter ($0 \le \varepsilon \le 1$) which is a slowly decreasing function of t, as is usual in stochastic approximation algorithms. The process is reiterated and eventually stabilizes, but the partition obtained may depend on the initial choice of the centres. In the batch version of the algorithm, the centres are updated only after a complete pass over the data. Figure 1 represents a stylised symmetric (70, 70) matrix $M_0$ associated with a partition of n = 70 elements into k = 8 classes (or clusters). Rows and columns represent the same set of n elements (elements belonging to the same class of the partition form a subset of consecutive rows and columns). The graph consists of 8 cliques. All the cells of the black blocks contain the value 1.
All the cells outside these diagonal blocks contain the value 0. The 8 classes of this partition have been obtained through a SOM algorithm on a square 3 x 3 grid (with one empty class). The left-hand matrix of figure 1 does not take the topology of the grid into account: links between elements exist only within clusters. In the right-hand matrix of figure 1, two elements i and j are linked ($m_{ij} = 1$) in the graph if they belong either to the same cluster or to contiguous clusters. Owing to the small size of the SOM grid (figure 2), diagonal adjacency is not taken into account (e.g. elements belonging to cluster 7 are considered contiguous to those of clusters 4 and 8, but not to the elements of cluster 5). Similarly to the matrices $M_0$ and $M_1$, a matrix $M_2$ can be defined that extends the definition of the edges of the graph to diagonal links. In the simple example of figure 3, the elements of cluster 7, for example, are then considered contiguous to the elements of clusters 4, 8, and 5.
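The on-line update rule above can be sketched as follows. This is a deliberately minimal SOM, assuming a step-function neighbourhood (the winner plus its horizontal and vertical grid neighbours, matching the contiguity used for $M_1$) and a linearly decreasing $\varepsilon(t)$; practical implementations typically use a Gaussian neighbourhood that also shrinks over time:

```python
import numpy as np

def som_online(X, rows=3, cols=3, n_iter=2000, eps0=0.5, seed=0):
    """Minimal on-line SOM: C_k(t+1) = C_k(t) + eps(t) * (x_i(t) - C_k(t)),
    applied to the winning centre and its grid neighbours."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    C = X[rng.choice(n, rows * cols, replace=True)].astype(float)  # random initial centres
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)])
    for t in range(n_iter):
        eps = eps0 * (1 - t / n_iter)                # slowly decreasing adaptation parameter
        x = X[rng.integers(n)]                       # draw one observation
        k = np.argmin(((C - x) ** 2).sum(axis=1))    # winning centre
        d = np.abs(grid - grid[k]).sum(axis=1)       # Manhattan distance on the grid
        neigh = d <= 1                               # winner + horizontal/vertical neighbours
        C[neigh] += eps * (x - C[neigh])
    labels = ((C[None, :, :] - X[:, None, :]) ** 2).sum(-1).argmin(1)
    return C, labels

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
C, labels = som_online(X)                            # 9 centres, one label per observation
```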

Fig. 1. Stylised incidence matrices $M_0$ of the graph associated with a simple partition (left) and $M_1$, relating to a SOM map (right). All the cells in the white areas contain the value 0, whereas those in the black areas contain the value 1.

Fig. 2. The a priori SOM grid.

## 4 Linear projectors onto the best SOM plane

The matrices $M_0$, $M_1$ and $M_2$ are easily obtained as by-products of the SOM algorithm. In the case of a contiguity analysis involving the graph $G_0$ (whose associated matrix is $M_0$), the local variance coincides with the within-class variance, and the result is a classical Fisher linear discriminant analysis (LDA). In the plane spanned by the first two principal axes, the clusters are then optimally located in the sense of the LDA criterion. In the cases of contiguity analyses using the graphs $G_1$ or $G_2$ (associated matrices $M_1$ or $M_2$), the principal planes strive to reconstitute the positions of the clusters in the SOM map. In the initial p-dimensional space, the SOM map can be schematised by the graph whose vertices are the centroids of the clusters, two vertices being joined by an edge if the corresponding clusters are contiguous in the grid used by the algorithm. This graph, embedded in a high-dimensional space, is partially or totally unfolded by the contiguity analysis. The following example shows the different phases of the procedure.

## 5 An example of application

An open-ended question was included in a multinational survey conducted in seven countries (Japan, France, Germany, Italy, the Netherlands, United Kingdom, USA) in the late nineteen-eighties [Hayashi et al., 1992]. The respondents were asked: "What is the single most important thing in life for you?". The illustrative example is limited to the British sample. The

counts for the first phase of numeric coding are as follows: out of 1043 responses, there are … occurrences (tokens) and … distinct words (types). When the words appearing at least 25 times are selected, there remain 9815 occurrences of 88 distinct words.

Fig. 3. A (3 x 3) Kohonen map applied to the words used in the 1043 responses.

In this example we focus on a partitioning of the sample into 9 categories, obtained by cross-tabulating age (3 categories) with educational level (3 categories). The nine identifiers combine the age categories (-30, 30-55, +55) with the educational levels (low, medium, high). Note that the SOM map (figure 3) provides a simultaneous representation of the words and of the categories of respondents. This is due to the fact that the input data are the coordinates provided by a correspondence analysis of the lexical contingency table cross-tabulating the words and the categories. Figure 4 represents the plane spanned by the first two axes of the contiguity analysis using the matrix $M_1$. We can check that the graph describing the SOM map (whose vertices C1, C2, ..., C9 are the centroids of the elements of the corresponding cells of figure 3) is, in this particular case, a satisfactory representation of the initial map: the pattern of the nine centroids is similar to the original grid of figure 3. The background of figure 5 is identical to that of figure 4; it contains in addition the convex hulls of the nine clusters C1, C2, ..., C9. Each of those convex hulls corresponds exactly (apart from some duplicated or hidden points) to a cell of figure 3, and these hulls are relatively well separated. In fact, figure 5 contains much more information than figure 3, since we now have an idea of the shapes and sizes of the clusters, of the

degree to which they overlap. We are now aware of their relative distances and, another piece of information missing in figure 3, we can observe the configurations of the elements within each cluster.

Fig. 4. Principal plane of the contiguity analysis using matrix $M_1$. The points C1, C2, ..., C9 represent the centroids of the 9 clusters derived from the SOM map.

Fig. 5. Principal plane of the contiguity analysis using matrix $M_1$, with both the centroids of the 9 clusters and their convex hulls.
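The incidence matrices $M_0$ and $M_1$ of section 3 can be assembled directly from the cluster labels and the grid geometry. A sketch assuming cells numbered row by row on a rows x cols grid (indices are 0-based here, so the paper's cluster 7 corresponds to index 6):

```python
import numpy as np

def incidence_matrices(labels, rows, cols):
    """M0: elements linked iff they belong to the same cluster.
    M1: also linked when their clusters are horizontally/vertically
    contiguous on the SOM grid (diagonal adjacency ignored)."""
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)])
    # cluster-level contiguity: cells at Manhattan distance <= 1 on the grid
    cell_adj = np.abs(grid[:, None] - grid[None, :]).sum(-1) <= 1
    M0 = (labels[:, None] == labels[None, :]).astype(int)
    M1 = cell_adj[labels[:, None], labels[None, :]].astype(int)
    return M0, M1

# six elements assigned to cells 0, 0, 1, 3, 4, 8 of a 3 x 3 grid
labels = np.array([0, 0, 1, 3, 4, 8])
M0, M1 = incidence_matrices(labels, 3, 3)
```

With this numbering, cell 0 (top-left) is contiguous to cells 1 and 3 but not to the diagonal cell 4, mirroring the cluster-7 example in the text.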

## 6 Assessing SOM maps through partial bootstrap

We are provided at this stage with a tool allowing us to explore a continuous space. We can take advantage of having a projection onto a plane (and possibly onto a higher-dimensional space, although the outputs are then much more complicated) to project bootstrap replicates of the original data set. This can be done within the framework of a partial bootstrap procedure. In the context of principal axes techniques (such as SVD, PCA, and also contiguity analysis), bootstrap resampling [Efron and Tibshirani, 1993] is used to produce confidence areas on two-dimensional displays. The bootstrap replication scheme allows one to draw confidence ellipses for both active elements (i.e. elements participating in building the principal axes) and supplementary elements (projected a posteriori).

Fig. 6. Bootstrap confidence ellipses of the 5 words freedom, health, money, peace, wife, in the same principal contiguity plane as in figures 4 and 5.

In the example of the previous section, the words are the rows of a contingency table. Perturbing such a table under a bootstrap resampling procedure leads to new coordinates for the replicated rows. Without recomputing the whole contiguity analysis for each replicated sample (the conservative procedure of total bootstrap), one can project the replicated rows as supplementary elements onto a common reference space, exemplified above by figures 4 and 5. On that same space, figure 6 shows a sample of the replicates of five points (small stars visible around the words freedom, health, money, peace, wife) and the confidence ellipses that contain approximately 90% of these replicated points. Such procedures of partial bootstrap [Lebart,

2004] give satisfactory estimates of the relative uncertainty about the location of the points. Although the backgrounds of figures 5 and 6 are the same, it is preferable, to keep the results legible, to draw the confidence ellipses on a separate figure. It can be seen, for instance, that the words freedom and money, both belonging to cluster C4, behave differently with respect to resampling variability: the location of freedom is much fuzzier, and that word could belong to some neighbouring clusters as well.

## 7 Conclusion

We have intended to immerse the SOM maps, obtained through an algorithm often viewed as a black box, into an analytical framework (the linear algebra of contiguity analysis) and into an inferential setting as well (bootstrap resampling techniques). This does not question the undeniable clarity and readability of the SOM maps, but it may help to assess their scientific status: like most exploratory tools, they help to rapidly uncover patterns, and they should be complemented with statistical procedures whenever deeper interpretation is needed.

## References

- [Aluja Banet and Lebart, 1984] T. Aluja Banet and L. Lebart. Local and partial principal component analysis and correspondence analysis. In T. Havranek, Z. Sidak, and M. Novak, editors, COMPSTAT Proceedings, 1984.
- [Efron and Tibshirani, 1993] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, New York, 1993.
- [Forgy, 1984] R. Forgy. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. In Biometric Society Meetings, page 768.
- [Geary, 1954] R. C. Geary. The contiguity ratio and statistical mapping. The Incorporated Statistician, 1954.
- [Gower and Dijksterhuis, 2004] J. C. Gower and G. B. Dijksterhuis. Procrustes Problems. Oxford Statistical Science Series, Oxford, 2004.
- [Hayashi et al., 1992] C. Hayashi, T. Suzuki, and M. Sasaki. Data Analysis for Comparative Social Research: International Perspective. North-Holland, Amsterdam, 1992.
- [Kohonen, 1989] T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, Berlin, 1989.
- [Lebart, 1969] L. Lebart. Analyse statistique de la contiguïté. Publications de l'ISUP, 1969.
- [Lebart, 2004] L. Lebart. Validation techniques in text mining. In S. Sirmakessis, editor, Text Mining and its Applications, 2004.
- [MacQueen, 1967] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
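The partial bootstrap of section 6 can be sketched as follows. Each replicate redraws the cell counts of the lexical contingency table, recomputes the row profiles, and projects them as supplementary points onto a fixed reference plane, without recomputing the analysis; confidence ellipses then follow from the covariance of each word's replicated coordinates. In this sketch the reference axes `U` are random placeholders standing in for the contiguity axes of section 4, and the multinomial resampling of the whole table is one common way to bootstrap a contingency table:

```python
import numpy as np

def partial_bootstrap(counts, U, n_rep=100, seed=0):
    """Project bootstrap replicates of the rows of a contingency table
    onto a fixed reference plane U (p categories x 2 axes)."""
    rng = np.random.default_rng(seed)
    reps = np.empty((n_rep, counts.shape[0], U.shape[1]))
    total = counts.sum()
    flat = counts.ravel() / total                 # cell probabilities of the observed table
    for b in range(n_rep):
        resampled = rng.multinomial(total, flat).reshape(counts.shape)
        # row profiles of the replicated table (guard against empty rows)
        profiles = resampled / np.maximum(resampled.sum(1, keepdims=True), 1)
        reps[b] = profiles @ U                    # projection as supplementary elements
    return reps

rng = np.random.default_rng(2)
counts = rng.integers(1, 20, size=(88, 9))        # words x categories table (illustrative)
U = rng.normal(size=(9, 2))                       # placeholder reference axes
reps = partial_bootstrap(counts, U)               # (n_rep, words, 2) replicated coordinates
```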


Coordinating unit: Teaching unit: Academic year: Degree: ECTS credits: 015 70 - FIB - Barcelona School of Informatics 715 - EIO - Department of Statistics and Operations Research 73 - CS - Department of

### Nonlinear Iterative Partial Least Squares Method

Numerical Methods for Determining Principal Component Analysis Abstract Factors Béchu, S., Richard-Plouet, M., Fernandez, V., Walton, J., and Fairley, N. (2016) Developments in numerical treatments for

### Clustering & Visualization

Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.

### USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS

USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS Natarajan Meghanathan Jackson State University, 1400 Lynch St, Jackson, MS, USA natarajan.meghanathan@jsums.edu

### Multivariate Analysis of Variance (MANOVA): I. Theory

Gregory Carey, 1998 MANOVA: I - 1 Multivariate Analysis of Variance (MANOVA): I. Theory Introduction The purpose of a t test is to assess the likelihood that the means for two groups are sampled from the

### Constrained Clustering of Territories in the Context of Car Insurance

Constrained Clustering of Territories in the Context of Car Insurance Samuel Perreault Jean-Philippe Le Cavalier Laval University July 2014 Perreault & Le Cavalier (ULaval) Constrained Clustering July

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### Zachary Monaco Georgia College Olympic Coloring: Go For The Gold

Zachary Monaco Georgia College Olympic Coloring: Go For The Gold Coloring the vertices or edges of a graph leads to a variety of interesting applications in graph theory These applications include various

### Effective Linear Discriminant Analysis for High Dimensional, Low Sample Size Data

Effective Linear Discriant Analysis for High Dimensional, Low Sample Size Data Zhihua Qiao, Lan Zhou and Jianhua Z. Huang Abstract In the so-called high dimensional, low sample size (HDLSS) settings, LDA

### Data Analysis Tools. Tools for Summarizing Data

Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool

### CITY UNIVERSITY OF HONG KONG 香 港 城 市 大 學. Self-Organizing Map: Visualization and Data Handling 自 組 織 神 經 網 絡 : 可 視 化 和 數 據 處 理

CITY UNIVERSITY OF HONG KONG 香 港 城 市 大 學 Self-Organizing Map: Visualization and Data Handling 自 組 織 神 經 網 絡 : 可 視 化 和 數 據 處 理 Submitted to Department of Electronic Engineering 電 子 工 程 學 系 in Partial Fulfillment

### CLUSTER ANALYSIS FOR SEGMENTATION

CLUSTER ANALYSIS FOR SEGMENTATION Introduction We all understand that consumers are not all alike. This provides a challenge for the development and marketing of profitable products and services. Not every

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

### An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data K. Mumtaz 1 Assistant Professor, Department of MCA Vivekanandha Institute of Information and Management Studies, Tiruchengode,

### The Result Analysis of the Cluster Methods by the Classification of Municipalities

The Result Analysis of the Cluster Methods by the Classification of Municipalities PAVEL PETR, KAŠPAROVÁ MILOSLAVA System Engineering and Informatics Institute Faculty of Economics and Administration University

### Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs

CSE599s: Extremal Combinatorics November 21, 2011 Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs Lecturer: Anup Rao 1 An Arithmetic Circuit Lower Bound An arithmetic circuit is just like

### Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

### How to report the percentage of explained common variance in exploratory factor analysis

UNIVERSITAT ROVIRA I VIRGILI How to report the percentage of explained common variance in exploratory factor analysis Tarragona 2013 Please reference this document as: Lorenzo-Seva, U. (2013). How to report

### 2x + y = 3. Since the second equation is precisely the same as the first equation, it is enough to find x and y satisfying the system

1. Systems of linear equations We are interested in the solutions to systems of linear equations. A linear equation is of the form 3x 5y + 2z + w = 3. The key thing is that we don t multiply the variables

### Divide-and-Conquer. Three main steps : 1. divide; 2. conquer; 3. merge.

Divide-and-Conquer Three main steps : 1. divide; 2. conquer; 3. merge. 1 Let I denote the (sub)problem instance and S be its solution. The divide-and-conquer strategy can be described as follows. Procedure

### USING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS

USING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS Koua, E.L. International Institute for Geo-Information Science and Earth Observation (ITC).

### OLAP Visualization Operator for Complex Data

OLAP Visualization Operator for Complex Data Sabine Loudcher and Omar Boussaid ERIC laboratory, University of Lyon (University Lyon 2) 5 avenue Pierre Mendes-France, 69676 Bron Cedex, France Tel.: +33-4-78772320,

### Introduction to Principal Components and FactorAnalysis

Introduction to Principal Components and FactorAnalysis Multivariate Analysis often starts out with data involving a substantial number of correlated variables. Principal Component Analysis (PCA) is a

### NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

### Statistical and computational challenges in networks and cybersecurity

Statistical and computational challenges in networks and cybersecurity Hugh Chipman Acadia University June 12, 2015 Statistical and computational challenges in networks and cybersecurity May 4-8, 2015,

### Neural Networks Kohonen Self-Organizing Maps

Neural Networks Kohonen Self-Organizing Maps Mohamed Krini Christian-Albrechts-Universität zu Kiel Faculty of Engineering Institute of Electrical and Information Engineering Digital Signal Processing and

### EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

### Big Data: Rethinking Text Visualization

Big Data: Rethinking Text Visualization Dr. Anton Heijs anton.heijs@treparel.com Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### Lecture 20: Clustering

Lecture 20: Clustering Wrap-up of neural nets (from last lecture Introduction to unsupervised learning K-means clustering COMP-424, Lecture 20 - April 3, 2013 1 Unsupervised learning In supervised learning,

### Solution of Linear Systems

Chapter 3 Solution of Linear Systems In this chapter we study algorithms for possibly the most commonly occurring problem in scientific computing, the solution of linear systems of equations. We start

Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

### Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar What is data exploration? A preliminary exploration of the data to better understand its characteristics.