Identification of noisy variables for nonmetric and symbolic data in cluster analysis


 Blanche Stewart
 3 years ago
 Views:
Transcription
1 Identification of noisy variables for nonmetric and symbolic data in cluster analysis Marek Walesiak and Andrzej Dudek Wroclaw University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, Jelenia Gora, Poland Abstract. A proposal of an extended version of the HINoV method for the identification of the noisy variables (Carmone et al [1999]) for nonmetric, mixed, and symbolic interval data is presented in this paper. Proposed modifications are evaluated on simulated data from a variety of models. The models contain the known structure of clusters. In addition, the models contain a different number of noisy (irrelevant) variables added to obscure the underlying structure to be recovered. 1 Introduction Choosing variables is the one of the most important steps in a cluster analysis. Variables used in applied clustering should be selected and weighted carefully. In a cluster analysis we should include only those variables that are believed to help to discriminate the data (see Milligan [1996], p. 348). Two classes of approaches, while choosing the variables for cluster analysis, can facilitate a cluster recovery in the data (see e.g. Gnanadesikan et al [1995]; Milligan [1996], pp ): variable selection (selecting a subset of relevant variables), variable weighting (introducing relative importance of the variables according to their weights). Carmone et al [1999] discussed the literature on the variable selection and weighting (the characteristics of six methods and their limitations) and proposed the HINoV method for the identification of the noisy variables, in the area of the variable selection, to remedy problems with these methods. They demonstrated its robustness with metric data and kmeans algorithm. The authors suggest further studies of the HINoV method with different types of data and other clustering algorithms on p In this paper we propose extended version of the HINoV method for nonmetric, mixed, and symbolic interval data. The proposed modifications are evaluated for eight clustering algorithms on simulated data from a variety of models.
2 2 Marek Walesiak and Andrzej Dudek 2 Characteristics of the HINoV method and its modifications Algorithm of Heuristic Identification of Noisy Variables (HINoV) method for metric data (see Carmone et al [1999]) is following: 1. A data matrix [x ij ] containing n objects and m normalized variables measured on a metric scale (i = 1,..., n; j = 1,..., m) is a starting point. 2. Cluster, via kmeans method, the observed data separately for each j th variable for a given number of clusters u. It is possible to use clustering methods based on a distance matrix (pam or any hierarchical agglomerative method: single, complete, average, mcquitty, median, centroid, Ward). 3. Calculate adjusted Rand indices R jl (j, l = 1,..., m) for partitions formed from all distinct pairs of the m variables (j l). Due to a fact that adjusted Rand index is symmetrical we need to calculate m(m 1)/2 values. 4. Construct m m adjusted Rand matrix (parim). Sum rows or columns for each jth variable R j = m R jl (topri): l=1 Variable parim topri M 1 R R 1m R 1 M 2 R R 2m R M m R m1 R m2... R m 5. Rank topri values R 1, R 2,..., R m in a decreasing order (stopri) and plot the scree diagram. The size of the topri values indicate a contribution of that variable to the cluster structure. A scree diagram identifies sharp changes in the topri values. Relatively lowvalued topri variables (the noisy variables) are identified and eliminated from the further analysis (say h variables). 6. Run a cluster analysis (based on the same classification method) with the selected m h variables. The modification of the HINoV method for nonmetric data (where number of objects is much more than a number of categories) differs in steps 1, 2, and 6 (see Walesiak [2005]): 1. A data matrix [x ij ] containing n objects and m ordinal and/or nominal variables is a starting point. 2. For each jth variable we receive natural clusters, where the number of clusters equals the number of categories for that variable (for instance five for Likert scale or seven for semantic differential scale). 6. Run a cluster analysis with one of clustering methods based on a distance appropriate to nonmetric data (GDM2 for ordinal data see Jajuga et al [2003]; Sokal and Michener distance for nominal data) with the selected m h variables. The modification of the HINoV method for symbolic interval data differs in steps 1 and 2:
3 Identification of noisy variables for nonmetric and symbolic data A symbolic data array containing n objects and m symbolic interval variables is a starting point. 2. Cluster the observed data with one of clustering methods (pam, single, complete, average, mcquitty, median, centroid, Ward) based on a distance appropriate to the symbolic interval data (e.g. Hausdorff distance  see Billard and Diday [2006], p. 246) separately for each jth variable for a given number of clusters u. Functions HINoV.Mod and HINoV.Symbolic of clustersim computer program working in R allow adequately using mixed (metric, nonmetric), and the symbolic interval data. The proposed modifications of the HINoV method are evaluated on simulated data from a variety of models. 3 Simulation models We generate data sets in eleven different scenarios. The models contain the known structure of clusters. In the models 211 the noisy variables are simulated independently from the uniform distribution. Model 1. No cluster structure. 200 observations are simulated from the uniform distribution over the unit hypercube in 10 dimensions (see Tibshirani et al [2001], p. 418). Model 2. Two elongated clusters in 5 dimensions (3 noisy variables). Each cluster contains 50 observations. The observations in each of the two clusters are independent bivariate normal random variables with means (0, 0), (1, 5), and covariance matrix (σ jj = 1, σ jl = 0.9). Model 3. Three elongated clusters in 7 dimensions (5 noisy variables). Each cluster is randomly chosen to have 60, 30, 30 observations, and the observations are independently drawn from bivariate normal distribution with means (0, 0), (1.5, 7), (3, 14) and covariance matrix (σ jj = 1, σ jl = 0.9). Model 4. Three elongated clusters in 10 dimensions (7 noisy variables). Each cluster is randomly chosen to have 70, 35, 35 observations, and the observations are independently drawn from multivariate normal distribution with means (1.5, 6, 3), (3, 12, 6), (4.5, 18, 9), and identity covariance matrix, where σ jj = 1 (1 j 3), σ 12 = σ 13 = 0.9, and σ 23 = 0.9. Model 5. Five clusters in 3 dimensions that are not well separated (1 noisy variable). Each cluster contains 25 observations. The observations are independently drawn from bivariate normal distribution with means (5, 5), ( 3, 3), (3, 3), (0, 0), ( 5, 5), and identity covariance matrix (σ jj = 1, σ jl = 0.9). Model 6. Five clusters in 5 dimensions that are not well separated (2 noisy variables). Each cluster contains 30 observations. The observations are independently drawn from multivariate normal distribution with means (5, 5, 5), ( 3, 3, 3), (3, 3, 3), (0, 0, 0), ( 5, 5, 5), and covariance matrix, where σ jj = 1 (1 j 3), and σ jl = 0.9 (1 j l 3).
4 4 Marek Walesiak and Andrzej Dudek Model 7. Five clusters in 10 dimensions (8 noisy variables). Each cluster is randomly chosen to have 50, 20, 20, 20, 20 observations, and the observations are independently drawn from bivariate normal distribution with means (0, 0), (0, 10), (5, 5), (10, 0), (10, 10), and identity covariance matrix (σ jj = 1, σ jl = 0). Model 8. Five clusters in 9 dimensions (6 noisy variables). Each cluster contains 30 observations. The observations are independently drawn from multivariate normal distribution with means (0, 0, 0), (10, 10, 10), ( 10, 10, 10), (10, 10, 10), ( 10, 10, 10), and identity covariance matrix, where σ jj = 3 (1 j 3), and σ jl = 2 (1 j l 3). Model 9. Four clusters in 6 dimensions (4 noisy variables). Each cluster is randomly chosen to have 50, 50, 25, 25 observations, and the observations are independently drawn from bivariate normal distribution with means ( 4, 5), (5, 14), (14, 5), (5, 4), and identity covariance matrix (σ jj = 1, σ jl = 0). Model 10. Four clusters in 12 dimensions (9 noisy variables). Each cluster contains 30 observations. The observations are independently drawn from multivariate normal distribution with means ( 4, 5, 4), (5, 14, 5), (14, 5, 14), (5, 4, 5), and identity covariance matrix, where σ jj = 1 (1 j 3), and σ jl = 0 (1 j l 3). Model 11. Four clusters in 10 dimensions (9 noisy variables). Each cluster contains 35 observations. The observations on the first variable are independently drawn from univariate normal distribution with means 2, 4, 10, 16 respectively, and identity variance σj 2 = 0.5 (1 j 4). Ordinal data. The clusters in models 111 contain continuous data and a discretization process is performed on each variable to obtain ordinal data. [ The number of categories ]/ k determines the width of each class intervals: max{x ij ) min{x ij } k. Independently for each variable each class interval i i receive category 1,..., k and the actual value of variable x ij is replaced by these categories. In simulation study k = 5 (for k = 7 we have received similar results). Symbolic interval data. To obtain symbolic interval data the data were generated for each model twice into sets A and B and minimal (maximal) value of {a ij, b ij } is treated as the beginning (the end) of an interval. Fifty realizations were generated from each setting. 4 Discussion on the simulation results In testing the robustness of the HINoV modified algorithm using simulated ordinal or symbolic interval data, the major criterion was the identification of the noisy variables. The HINoVselected variables contain variables with the highest topri values. In models 211 the number of nonnoisy variables is known. Due to this fact, in simulation study, the number of the HINoVselected variables equals the number of nonnoisy variables in each model.
5 Identification of noisy variables for nonmetric and symbolic data... 5 When the noisy variables were identified, the next step was to run the one of clustering methods based on distance matrix (pam, single, complete, average, mcquitty, median, centroid, Ward) with the nonnoisy subset of variables (HINoVselected variables) and with all variables. Then each clustering result was compared with the known cluster structure from models 211 using Hubert and Arabie s [1985] corrected Rand index (see Table 1 and 2). Table 1. Cluster recovery for all variables and HINoVselected subsets of variables for ordinal data (five categories) by experimental model and clustering method Clustering method Model pam ward single complete average mcquitty median centroid a b a b a b a b a b a b a b a b a b b r ccr 98.22% 98.00% 94.44% 90.67% 97.11% 89.56% 98.89% 98.44% 11 a , , b , , ,06419 a (b) values represent Hubert and Arabie s adjusted Rand indices averaged over fifty replications for each model with all variables (with HINoVselected variables); r = b ā; ccr corrected cluster recovery. Some conclusions can be drawn from the simulations results: 1. The cluster recovery that used only the HINoVselected variables for ordinal data (Table 1) and symbolic interval data (Table 2) was better than the one that used all variables for all models 210 and each clustering method. 2. Among 450 simulated data sets (nine models with 50 runs) the HINoV method was better (see ccr in Table 1 and 2):
6 6 Marek Walesiak and Andrzej Dudek Table 2. Cluster recovery for all variables and HINoVselected subsets of variables for symbolic interval data by experimental model and clustering method Model Clustering method pam ward single complete average mcquitty median centroid 2 a b a b a b a b a b a b a b a b a b b r ccr 94.67% 91.78% 97.33% 99.11% 96.22% 96.44% 99.56% 99.78% 11 a b a (b); r = b ā; ccr see Table 1. from 89.56% (mcquitty) to 98.89% (median) of runs for ordinal data, from 91.78% (ward) to 99,78% (centroid) of runs for symbolic interval data. 3. Figure 1 shows the relationship between the values of adjusted Rand indices averaged over fifty replications and models 210 with the HINoVselected variables ( b) and values showing an improvement ( r) of average adjusted Rand indices (cluster recovery with the HINoV selected variables against all variables) separately for eight clustering methods and types of data (ordinal, symbolic interval). Based on adjusted Rand indices averaged over fifty replications and models 210 the improvements in cluster recovery (HINoV selected variables against all variables) are varying: for ordinal data from (mcquitty) to (centroid), for symbolic interval data from (ward) to (centroid).
7 Identification of noisy variables for nonmetric and symbolic data pam ward mcquitty average complete single centroid median 0.9 average pam ward centroid b mcquitty complete median 0.6 single ordinal data symbolic data r Fig. 1. The relationship between values of b and r Source: own research 5 Conclusions The HINoV algorithm has limitations for analyzing nonmetric and symbolic interval data almost the same as the ones mentioned in Carmone et al [1999] article for metric data. First, the HINoV is of a little use with a nonmetric data set or a symbolic data array in which all variables are noisy (no cluster structure see model 1). In this situation topri values are similar and close to zero (see Table 3). Table 3. Mean and standard deviation of topri values for 10 variables in model 1 Ordinal data with five categories Symbolic data array Variable mean sd mean sd Second, the HINoV method depends on the relationship between pairs of variables. If we have only one variable with a cluster structure and the others are noisy, the HINoV will not be able to isolate this nonnoisy variable (see Table 4).
8 8 Marek Walesiak and Andrzej Dudek Table 4. Mean and standard deviation of topri values for 10 variables in model 11 Ordinal data with five categories Symbolic data array Variable mean sd mean sd Third, if all variables have the same cluster structure (no noisy variables) the topri values will be large and similar for all variables. The suggested selection process using a scree diagram will be ineffective. Fourth, an important problem is to decide on a proper number of clusters in stage two of the HINoV algorithm with symbolic interval data. To resolve this problem we should initiate the HINoV algorithm with a different number of clusters. References BILLARD, L., DIDAY, E. (2006): Symbolic data analysis. Conceptual statistics and data mining, Wiley, Chichester. CARMONE, F.J., KARA, A. and MAXWELL, S. (1999): HINoV: a new method to improve market segment definition by identifying noisy variables, Journal of Marketing Research, vol. 36, November, GNANADESIKAN, R., KETTENRING, J.R., and TSAO, S.L. (1995): Weighting and selection of variables for cluster analysis, Journal of Classification, vol. 12, no. 1, HUBERT, L.J., ARABIE, P. (1985): Comparing partitions, Journal of Classification, vol. 2, no. 1, JAJUGA, K., WALESIAK, M., BAK, A. (2003): On the General Distance Measure, In: M., Schwaiger, and O., Opitz (Eds.), Exploratory data analysis in empirical research, SpringerVerlag, Berlin, Heidelberg, MILLIGAN, G.W. (1996): Clustering validation: results and implications for applied analyses, In: P., Arabie, L.J., Hubert, G., de Soete (Eds.), Clustering and classification, World Scientific, Singapore, TIBSHIRANI, R., WALTHER, G., HASTIE, T. (2001): Estimating the number of clusters in a data set via the gap statistic, Journal of the Royal Statistical Society, ser. B, vol. 63, part 2, WALESIAK, M. (2005): Variable selection for cluster analysis approaches, problems, methods, Plenary Session of the Committee on Statistics and Econometrics of the Polish Academy of Sciences, 15, March, Wroclaw.
NEW METHOD OF VARIABLE SELECTION FOR BINARY DATA CLUSTER ANALYSIS
STATISTICS IN TRANSITION new series, June 2016 295 STATISTICS IN TRANSITION new series, June 2016 Vol. 17, No. 2, pp. 295 304 NEW METHOD OF VARIABLE SELECTION FOR BINARY DATA CLUSTER ANALYSIS Jerzy Korzeniewski
More informationSTANDARDISATION OF DATA SET UNDER DIFFERENT MEASUREMENT SCALES. 1 The measurement scales of variables
STANDARDISATION OF DATA SET UNDER DIFFERENT MEASUREMENT SCALES Krzysztof Jajuga 1, Marek Walesiak 1 1 Wroc law University of Economics, Komandorska 118/120, 53345 Wroc law, Poland Abstract: Standardisation
More informationThere are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:
Statistics: Rosie Cornish. 2007. 3.1 Cluster Analysis 1 Introduction This handout is designed to provide only a brief introduction to cluster analysis and how it is done. Books giving further details are
More informationCluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico
Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from
More informationA Kmeanslike Algorithm for Kmedoids Clustering and Its Performance
A Kmeanslike Algorithm for Kmedoids Clustering and Its Performance HaeSang Park*, JongSeok Lee and ChiHyuck Jun Department of Industrial and Management Engineering, POSTECH San 31 Hyojadong, Pohang
More informationA ROUGH CLUSTER ANALYSIS OF SHOPPING ORIENTATION DATA. Kevin Voges University of Canterbury. Nigel Pope Griffith University
Abstract A ROUGH CLUSTER ANALYSIS OF SHOPPING ORIENTATION DATA Kevin Voges University of Canterbury Nigel Pope Griffith University Mark Brown University of Queensland Track: Marketing Research and Research
More informationCLUSTERING is a process of partitioning a set of objects into
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 27, NO. 5, MAY 2005 657 Automated Variable Weighting in kmeans Type Clustering Joshua Zhexue Huang, Michael K. Ng, Hongqiang Rong,
More information15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
More informationClustering & Association
Clustering  Overview What is cluster analysis? Grouping data objects based only on information found in the data describing these objects and their relationships Maximize the similarity within objects
More informationLeast Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN13: 9780470860809 ISBN10: 0470860804 Editors Brian S Everitt & David
More informationUnsupervised learning: Clustering
Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What
More informationStandardization and Its Effects on KMeans Clustering Algorithm
Research Journal of Applied Sciences, Engineering and Technology 6(7): 3993303, 03 ISSN: 0407459; eissn: 0407467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03
More informationDimensionality Reduction: Principal Components Analysis
Dimensionality Reduction: Principal Components Analysis In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely
More informationSPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING
AAS 07228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations
More informationRobust Outlier Detection Technique in Data Mining: A Univariate Approach
Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,
More informationBasic Questions in Cluster Analysis
Cluster Analysis 1 Cluster analysis, like reduced space analysis (factor analysis), is concerned with data matrices in which the variables have not been partitioned beforehand into criterion versus predictor
More informationMultivariate Analysis of Ecological Data
Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical
More informationFoundation of Quantitative Data Analysis
Foundation of Quantitative Data Analysis Part 1: Data manipulation and descriptive statistics with SPSS/Excel HSRS #10  October 17, 2013 Reference : A. Aczel, Complete Business Statistics. Chapters 1
More informationPerformance Metrics for Graph Mining Tasks
Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical
More informationCluster Analysis using R
Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other
More informationClustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016
Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with
More informationClassify then Summarize or Summarize then Classify
Classify then Summarize or Summarize then Classify DIMACS, Rutgers University Piscataway, NJ 08854 Workshop Honoring Edwin Diday held on September 4, 2007 What is Cluster Analysis? Software package? Collection
More informationMedical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu
Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
More informationNeural Networks Lesson 5  Cluster Analysis
Neural Networks Lesson 5  Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt.  Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29
More informationWhy Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts
More informationClustering and Data Mining in R
Clustering and Data Mining in R Workshop Supplement Thomas Girke December 10, 2011 Introduction Data Preprocessing Data Transformations Distance Methods Cluster Linkage Hierarchical Clustering Approaches
More informationApplied Multivariate Analysis
Neil H. Timm Applied Multivariate Analysis With 42 Figures Springer Contents Preface Acknowledgments List of Tables List of Figures vii ix xix xxiii 1 Introduction 1 1.1 Overview 1 1.2 Multivariate Models
More informationData Mining Project Report. Document Clustering. Meryem UzunPer
Data Mining Project Report Document Clustering Meryem UzunPer 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. Kmeans algorithm...
More informationTime series clustering and the analysis of film style
Time series clustering and the analysis of film style Nick Redfern Introduction Time series clustering provides a simple solution to the problem of searching a database containing time series data such
More informationLecture 2: Descriptive Statistics and Exploratory Data Analysis
Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals
More informationLecture 20: Clustering
Lecture 20: Clustering Wrapup of neural nets (from last lecture Introduction to unsupervised learning Kmeans clustering COMP424, Lecture 20  April 3, 2013 1 Unsupervised learning In supervised learning,
More informationFUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM
International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 3448 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT
More informationFig. 1 A typical Knowledge Discovery process [2]
Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering
More informationIBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
More informationSteven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 306022501
CLUSTER ANALYSIS Steven M. Ho!and Department of Geology, University of Georgia, Athens, GA 306022501 January 2006 Introduction Cluster analysis includes a broad suite of techniques designed to find groups
More informationUNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS
UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable
More informationLatent Class (Finite Mixture) Segments How to find them and what to do with them
Latent Class (Finite Mixture) Segments How to find them and what to do with them Jay Magidson Statistical Innovations Inc. Belmont, MA USA www.statisticalinnovations.com Sensometrics 2010, Rotterdam Overview
More informationIBM SPSS Direct Marketing 22
IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release
More informationExploratory Data Analysis with MATLAB
Computer Science and Data Analysis Series Exploratory Data Analysis with MATLAB Second Edition Wendy L Martinez Angel R. Martinez Jeffrey L. Solka ( r ec) CRC Press VV J Taylor & Francis Group Boca Raton
More informationDoptimal plans in observational studies
Doptimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
More informationBehavioral Market Segmentation of Binary Guest Survey Data with Bagged Clustering
Behavioral Market Segmentation of Binary Guest Survey Data with Bagged Clustering Sara Dolničar 1 and Friedrich Leisch 1 Department of Tourism and Leisure Studies, Vienna University of Economics and Business
More informationVisualization of textual data: unfolding the Kohonen maps.
Visualization of textual data: unfolding the Kohonen maps. CNRS  GET  ENST 46 rue Barrault, 75013, Paris, France (email: ludovic.lebart@enst.fr) Ludovic Lebart Abstract. The Kohonen self organizing
More informationData Preparation and Statistical Displays
Reservoir Modeling with GSLIB Data Preparation and Statistical Displays Data Cleaning / Quality Control Statistics as Parameters for Random Function Models Univariate Statistics Histograms and Probability
More informationVisualization of large data sets using MDS combined with LVQ.
Visualization of large data sets using MDS combined with LVQ. Antoine Naud and Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87100 Toruń, Poland. www.phys.uni.torun.pl/kmk
More informationSTATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Webbased Analytics Table
More informationClustering in Machine Learning. By: Ibrar Hussain Student ID:
Clustering in Machine Learning By: Ibrar Hussain Student ID: 11021083 Presentation An Overview Introduction Definition Types of Learning Clustering in Machine Learning Kmeans Clustering Example of kmeans
More informationComputer program review
Journal of Vegetation Science 16: 355359, 2005 IAVS; Opulus Press Uppsala.  Ginkgo, a multivariate analysis package  355 Computer program review Ginkgo, a multivariate analysis package Bouxin, Guy Haute
More informationUse of Data Mining Techniques to Improve the Effectiveness of Sales and Marketing
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 4, April 2015,
More informationStatistics Review PSY379
Statistics Review PSY379 Basic concepts Measurement scales Populations vs. samples Continuous vs. discrete variable Independent vs. dependent variable Descriptive vs. inferential stats Common analyses
More informationData Mining Cluster Analysis: Advanced Concepts and Algorithms. ref. Chapter 9. Introduction to Data Mining
Data Mining Cluster Analysis: Advanced Concepts and Algorithms ref. Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar 1 Outline Prototypebased Fuzzy cmeans Mixture Model Clustering Densitybased
More informationSTATISTICS AND DATA ANALYSIS IN GEOLOGY, 3rd ed. Clarificationof zonationprocedure described onpp. 238239
STATISTICS AND DATA ANALYSIS IN GEOLOGY, 3rd ed. by John C. Davis Clarificationof zonationprocedure described onpp. 3839 Because the notation used in this section (Eqs. 4.8 through 4.84) is inconsistent
More informationPrinciple Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression
Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression Saikat Maitra and Jun Yan Abstract: Dimension reduction is one of the major tasks for multivariate
More informationClustering and Cluster Evaluation. Josh Stuart Tuesday, Feb 24, 2004 Read chap 4 in Causton
Clustering and Cluster Evaluation Josh Stuart Tuesday, Feb 24, 2004 Read chap 4 in Causton Clustering Methods Agglomerative Start with all separate, end with some connected Partitioning / Divisive Start
More informationFactor Analysis. Principal components factor analysis. Use of extracted factors in multivariate dependency models
Factor Analysis Principal components factor analysis Use of extracted factors in multivariate dependency models 2 KEY CONCEPTS ***** Factor Analysis Interdependency technique Assumptions of factor analysis
More informationClustering UE 141 Spring 2013
Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or
More informationThe Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon
The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon ABSTRACT Effective business development strategies often begin with market segmentation,
More informationMultivariate Statistical Inference and Applications
Multivariate Statistical Inference and Applications ALVIN C. RENCHER Department of Statistics Brigham Young University A WileyInterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim
More informationTradeOff between Disclosure Risk and Information Loss Using Multivariate Microaggregation: A Case Study on Business Data
TradeOff between Disclosure Ris and Information Loss Using Multivariate Microaggregation: A Case Study on Business Data Josep A. Sànchez, Julià Urrutia, and Enric Ripoll Institut d Estadística de Catalunya
More informationNCSS Statistical Software. KMeans Clustering
Chapter 446 Introduction The kmeans algorithm was developed by J.A. Hartigan and M.A. Wong of Yale University as a partitioning technique. It is most useful for forming a small number of clusters from
More informationDegeneracy on Kmeans clustering
on Kmeans clustering Abdulrahman Alguwaizani Assistant Professor Math Department College of Science King Faisal University Kingdom of Saudi Arabia Office # 0096635899717 Mobile # 00966505920703 Email:
More informationCrowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach
Outline Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach Jinfeng Yi, Rong Jin, Anil K. Jain, Shaili Jain 2012 Presented By : KHALID ALKOBAYER Crowdsourcing and Crowdclustering
More informationMultivariate Analysis
Table Of Contents Multivariate Analysis... 1 Overview... 1 Principal Components... 2 Factor Analysis... 5 Cluster Observations... 12 Cluster Variables... 17 Cluster KMeans... 20 Discriminant Analysis...
More informationA successful market segmentation initiative answers the following critical business questions: * How can we a. Customer Status.
MARKET SEGMENTATION The simplest and most effective way to operate an organization is to deliver one product or service that meets the needs of one type of customer. However, to the delight of many organizations
More informationSTOCK MARKET TRENDS USING CLUSTER ANALYSIS AND ARIMA MODEL
Stock AsianAfrican Market Trends Journal using of Economics Cluster Analysis and Econometrics, and ARIMA Model Vol. 13, No. 2, 2013: 303308 303 STOCK MARKET TRENDS USING CLUSTER ANALYSIS AND ARIMA MODEL
More informationBootstrapping Big Data
Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu
More informationData Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining
Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distancebased Kmeans, Kmedoids,
More informationImportant Characteristics of Cluster Analysis Techniques
Cluster Analysis Can we organize sampling entities into discrete classes, such that withingroup similarity is maximized and amonggroup similarity is minimized according to some objective criterion? Sites
More informationBUSINESS ANALYTICS. Data Preprocessing. Lecture 3. Information Systems and Machine Learning Lab. University of Hildesheim.
Tomáš Horváth BUSINESS ANALYTICS Lecture 3 Data Preprocessing Information Systems and Machine Learning Lab University of Hildesheim Germany Overview The aim of this lecture is to describe some data preprocessing
More informationA Solution Manual and Notes for: Exploratory Data Analysis with MATLAB by Wendy L. Martinez and Angel R. Martinez.
A Solution Manual and Notes for: Exploratory Data Analysis with MATLAB by Wendy L. Martinez and Angel R. Martinez. John L. Weatherwax May 7, 9 Introduction Here you ll find various notes and derivations
More informationResearch Online. University of Wollongong. Sara Dolnicar University of Wollongong, Publication Details
University of Wollongong Research Online Faculty of Commerce  Papers (Archive) Faculty of Business 2003 Using cluster analysis for market segmentation  typical misconceptions, established methodological
More informationClustering. 15381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is
Clustering 15381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv BarJoseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is
More informationProbability and Statistics
CHAPTER 2: RANDOM VARIABLES AND ASSOCIATED FUNCTIONS 2b  0 Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute  Systems and Modeling GIGA  Bioinformatics ULg kristel.vansteen@ulg.ac.be
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationExploratory data analysis (Chapter 2) Fall 2011
Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,
More informationChapter 7. Cluster Analysis
Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. DensityBased Methods 6. GridBased Methods 7. ModelBased
More informationHow to report the percentage of explained common variance in exploratory factor analysis
UNIVERSITAT ROVIRA I VIRGILI How to report the percentage of explained common variance in exploratory factor analysis Tarragona 2013 Please reference this document as: LorenzoSeva, U. (2013). How to report
More informationData Mining and Clustering Techniques
DRTC Workshop on Semantic Web 8 th 10 th December, 2003 DRTC, Bangalore Paper: K Data Mining and Clustering Techniques I. K. Ravichandra Rao Professor and Head Documentation Research and Training Center
More informationA Case Study in Software Enhancements as Six Sigma Process Improvements: Simulating Productivity Savings
A Case Study in Software Enhancements as Six Sigma Process Improvements: Simulating Productivity Savings Dan Houston, Ph.D. Automation and Control Solutions Honeywell, Inc. dxhouston@ieee.org Abstract
More informationHierarchical Cluster Analysis Some Basics and Algorithms
Hierarchical Cluster Analysis Some Basics and Algorithms Nethra Sambamoorthi CRMportals Inc., 11 Bartram Road, Englishtown, NJ 07726 (NOTE: Please use always the latest copy of the document. Click on this
More informationUsing Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
More informationThe right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median
CONDENSED LESSON 2.1 Box Plots In this lesson you will create and interpret box plots for sets of data use the interquartile range (IQR) to identify potential outliers and graph them on a modified box
More informationPackage EstCRM. July 13, 2015
Version 1.4 Date 2015711 Package EstCRM July 13, 2015 Title Calibrating Parameters for the Samejima's Continuous IRT Model Author Cengiz Zopluoglu Maintainer Cengiz Zopluoglu
More informationMethod of Data Center Classifications
Method of Data Center Classifications Krisztián Kósi Óbuda University, Bécsi út 96/B, H1034 Budapest, Hungary kosi.krisztian@phd.uniobuda.hu Abstract: This paper is about the Classification of big data
More informationData Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler
Data Mining: Exploring Data Lecture Notes for Chapter 3 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Topics Exploratory Data Analysis Summary Statistics Visualization What is data exploration?
More informationIntroduction to Multivariate Analysis
Introduction to Multivariate Analysis Lecture 1 August 24, 2005 Multivariate Analysis Lecture #18/24/2005 Slide 1 of 30 Today s Lecture Today s Lecture Syllabus and course overview Chapter 1 (a brief
More informationChapter 14: Analyzing Relationships Between Variables
Chapter Outlines for: Frey, L., Botan, C., & Kreps, G. (1999). Investigating communication: An introduction to research methods. (2nd ed.) Boston: Allyn & Bacon. Chapter 14: Analyzing Relationships Between
More informationMaking the Most of Missing Values: Object Clustering with Partial Data in Astronomy
Astronomical Data Analysis Software and Systems XIV ASP Conference Series, Vol. XXX, 2005 P. L. Shopbell, M. C. Britton, and R. Ebert, eds. P2.1.25 Making the Most of Missing Values: Object Clustering
More informationData, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
More informationSyllabus for MATH 191 MATH 191 Topics in Data Science: Algorithms and Mathematical Foundations Department of Mathematics, UCLA Fall Quarter 2015
Syllabus for MATH 191 MATH 191 Topics in Data Science: Algorithms and Mathematical Foundations Department of Mathematics, UCLA Fall Quarter 2015 Lecture: MWF: 1:001:50pm, GEOLOGY 4645 Instructor: Mihai
More informationKnowledge Discovery in Stock Market Data
Knowledge Discovery in Stock Market Data Alfred Ultsch and Hermann LocarekJunge Abstract This work presents the results of a Data Mining and Knowledge Discovery approach on data from the stock markets
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationMULTIVARIATE DATA ANALYSIS i.*.'.. ' 4
SEVENTH EDITION MULTIVARIATE DATA ANALYSIS i.*.'.. ' 4 A Global Perspective Joseph F. Hair, Jr. Kennesaw State University William C. Black Louisiana State University Barry J. Babin University of Southern
More informationCLUSTERING (Segmentation)
CLUSTERING (Segmentation) Dr. Saed Sayad University of Toronto 2010 saed.sayad@utoronto.ca http://chemeng.utoronto.ca/~datamining/ 1 What is Clustering? Given a set of records, organize the records into
More informationBasic Concepts in Research and Data Analysis
Basic Concepts in Research and Data Analysis Introduction: A Common Language for Researchers...2 Steps to Follow When Conducting Research...3 The Research Question... 3 The Hypothesis... 4 Defining the
More informationVisualizing nonhierarchical and hierarchical cluster analyses with clustergrams
Visualizing nonhierarchical and hierarchical cluster analyses with clustergrams Matthias Schonlau RAND 7 Main Street Santa Monica, CA 947 USA Summary In hierarchical cluster analysis dendrogram graphs
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will
More informationCONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19
PREFACE xi 1 INTRODUCTION 1 1.1 Overview 1 1.2 Definition 1 1.3 Preparation 2 1.3.1 Overview 2 1.3.2 Accessing Tabular Data 3 1.3.3 Accessing Unstructured Data 3 1.3.4 Understanding the Variables and Observations
More informationInvestigation of Latent Semantic Analysis for Clustering of Czech News Articles
Investigation of Latent Semantic Analysis for Clustering of Czech News Articles Michal Rott, Petr Cerva Institute of Information Technology and Electronics Technical University of Liberec Studentska 2,
More information