# STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Size: px
Start display at page:

Download "STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and"

Transcription

1 Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics

2 Table of Contents CLUSTERING TECHNIQUES AND STATISTICA... 1 Summary of Clustering Algorithms... 1 Joining (Tree Clustering)... 2 k-means Clustering... 3 EM (Expectation Maximization) Clustering... 3 CASE STUDY: DEFINING CLUSTERS OF SHOPPING CENTER PATRONS... 5 DATA ANALYSIS WITH STATISTICA... 7 k-means Cluster Analysis Results... 7 EM Cluster Analysis Results... 8 Feature Selection and Variable Screening Results... 9 CONCLUSION... 10

3 Page 1 Clustering Techniques and STATISTICA The term cluster analysis (first used by Tryon, 1939) actually encompasses a number of different classification algorithms. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures, that is, to develop taxonomies. Clustering techniques have been applied to a wide variety of research problems. Hartigan (1975) provides an excellent summary of the many published studies reporting the results of cluster analyses. For example, in the field of medicine, clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful taxonomies. In the field of psychiatry, the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc., is essential for successful therapy. In archeology, researchers have attempted to establish taxonomies of stone tools, funeral objects, etc., by applying cluster analytic techniques. In general, whenever there is a need to classify a mountain of information into manageable meaningful piles, cluster analyses are of great utility. Clustering algorithms work well with all kinds of data including categorical, numerical, and textual data. Usually the only decision the user must make is to ask for a specific number of candidate clusters. The advanced features in STATISTICA will help the user even with this aspect of the analyses (i.e., to determine the right number of clusters). The clustering algorithm will find the best partitioning of all the customer records (in our example) and will provide descriptions of the means or centroids ( average responses ) of each cluster in terms of the user s input data. In many cases, these clusters have an obvious interpretation that provides insight into the natural segmentation of customers that visit the mall. Clustering techniques generally belong to the group of undirected data mining tools; these techniques are also sometimes referred to as unsupervised learning because there is no particular dependent or outcome variable (such as Amount Purchased) to predict (and that would be used to supervise the learning process, i.e., to lead to the best predictive model). Instead, the goal of undirected data mining or unsupervised learning is to discover structure in the data as a whole. There is no target variable to be predicted, so no distinction is being made between independent and dependent variables. Summary of Clustering Algorithms Clustering techniques are used for combining observed cases or observations into clusters (groups) that satisfy two main criteria: 1. Each group or cluster is homogeneous; observations (cases) that belong to the same group are similar to each other. 2. Each group or cluster should be different from other clusters, that is, observations (cases) that belong to one cluster should be different from those contained in (grouped into) other clusters.

4 Page 2 Let s start by understanding the different methods for clustering. Clustering methods have been widely distinguished into three categories. 1. Joining (Tree Clustering) 2. k-means Clustering 3. EM (Expectation Maximization) Clustering Following are brief explanations of these methods. Joining (Tree Clustering) The purpose of this algorithm is to join together objects (e.g., animals) into successively larger clusters, using some measure of similarity or distance. A typical result of this type of clustering is the hierarchical tree. Consider a Horizontal Hierarchical Tree Plot (see graph below). On the left of the plot, we begin with each object in a class by itself. Now imagine that, in very small steps, we relax our criterion as to what is and is not unique. Put another way, we lower our threshold regarding the decision when to declare two or more objects to be members of the same cluster. As a result, we link more and more objects together and aggregate larger and larger clusters of increasingly dissimilar elements. Finally, in the last step, all objects are joined together. In these plots, the horizontal axis denotes the linkage distance (in Vertical Icicle Plots, the vertical axis denotes the linkage distance). Thus, for each node in the graph (where a new cluster is formed) we can read off the criterion distance at which the respective elements were linked together into a new single cluster. When the data contain a clear structure in terms of clusters of objects that are similar to each other, then this structure will often be reflected in the hierarchical tree as distinct branches. As the result of successful analyses with the joining method, one is able to detect clusters (branches) and interpret those branches.

5 k-means Clustering Page 3 k-means Clustering is very different from Joining (Tree Clustering) and is widely used in real-world scenarios. Suppose that you already have hypotheses concerning the number of clusters in your cases or variables. You may want to tell the computer to form, say, exactly 3 clusters that are to be as distinct as possible. This is the type of research question that can be addressed by k-means Clustering. In general, the k-means method will produce exactly k different clusters of greatest possible distinction. Suppose a medical researcher has a hunch from clinical experience that her heart patients fall into three different categories with regard to physical fitness. She might wonder whether this intuition could be quantified, that is, whether a k-means Cluster Analysis of the physical fitness measures would indeed produce the three clusters of patients as expected. If so, the means on the different measures of physical fitness for each cluster would represent a quantitative way of expressing the researcher s hypothesis or intuition (i.e., patients in cluster 1 are high on measure 1, low on measure 2, etc.). Computationally, you may think of this k-means method as analyses of variance (ANOVA) in reverse. The program will start with k random clusters, and then move objects between those clusters with the goal to 1) minimize variability within clusters and(2) maximize variability between clusters. This is analogous to ANOVA in reverse in the sense that the significance test in ANOVA evaluates the between-group variability against the within-group variability when computing the significance test for the hypothesis that the means in the groups are different from each other. In k-means Clustering, the program tries to move objects (e.g., cases) in and out of groups (clusters) to get the most significant ANOVA results. Defining distances between and within clusters, and correct clustering. One important rule to note is that the classification produced with all clustering tools (k-means as well as joining methods) is very dependent upon the particular metric that is used to determine the distance between objects (e.g., observations) and clusters. Remember that all clustering methods attempt to arrange observations so that those that are similar in some respect are assigned to the same cluster, and those that are dissimilar to different clusters. Because there are many ways in which distance can be defined (e.g., as Euclidean distance, squared Euclidean distance, percent disagreement, and so on), there is no such thing as a single correct classification, although there have been attempts to define concepts such as optimal classifications. EM (Expectation Maximization) Clustering The EM (Expectation Maximization) Clustering technique is another tool offered in STATISTICA. The general purpose of this technique is also to detect clusters in observations (or variables) and to assign those observations to the clusters. The EM (expectation maximization) algorithm extends the k-means Clustering approach to clustering in two important ways: 1. Instead of assigning cases or observations to clusters to maximize the differences in means for continuous variables (or disagreement for categorical variables), the EM clustering algorithm computes probabilities of cluster memberships based on one or more probability distributions. The goal of the clustering algorithm then is to maximize the overall probability or likelihood of the data, given the (final) clusters.

6 Page 4 2. Unlike the classic implementation of k-means Clustering, the general EM algorithm can be applied to both continuous and categorical variables (although in STATISTICA the classic k- Means algorithm was modified to accommodate categorical variables as well, by defining appropriate measures of distance ). The EM algorithm for clustering is described in detail in Witten and Frank (2001). The basic approach and logic of this clustering method is as follows. Suppose you measure a single continuous variable in a large sample of observations. Further, suppose that the sample consists of two clusters of observations with different means (and perhaps different standard deviations); within each sample, the distribution of values for the continuous variable follows the normal distribution. The resulting distribution of values (in the population) may look like this: Mixtures of distributions. The illustration shows two normal distributions with different means and different standard deviations, and the sum of the two distributions. Only the mixture (sum) of the two normal distributions (with different means and standard deviations) would be observed. The goal of EM clustering is to estimate the means and standard deviations for each cluster to maximize the likelihood of the observed data (distribution). Put another way, the EM algorithm attempts to approximate the observed distributions of values based on mixtures of different distributions in different clusters. Categorical variables. The EM algorithm can also accommodate categorical variables. The program will first randomly assign different probabilities (weights, to be precise) to each class or category, for each cluster. In successive iterations, these probabilities are refined (adjusted) to maximize the likelihood of the data given the specified number of clusters. Classification probabilities instead of classifications. The results of EM clustering are different from those computed by k-means Clustering. The latter will assign observations to clusters to maximize the distances between clusters. The EM algorithm does not compute actual assignments of observations to clusters, but classification probabilities. In other words, each observation belongs to each cluster with a certain probability. Of course, as a result you can usually review an actual assignment of observations to clusters, based on the (largest) classification probability.

7 Page 5 Case Study: Defining Clusters of Shopping Center Patrons This case study is based on a sub-sample from a survey containing 502 questions completed by patrons of a shopping mall in the San Francisco Bay area (see Hastie, Tibshshirani, Friedman, 2001)). The general purpose of this study is to identify homogeneous typical groups of visitors to the shopping center. Once such prototypical groups of customers have been identified, special marketing campaigns, services, or recommendations for particular types of stores can be developed for each group in order to enhance the overall attractiveness and quality of the shopping experience. This clustering example analyzes a large number of undifferentiated customers to see if they fall into natural groupings. This is a pure example of undirected data mining or unsupervised learning where the analyst has no clear prior knowledge of the types of customers that visit the shopping mall and is hoping that the data-mining tool will reveal some meaningful structure by identifying clusters or groups in these relatively homogeneous shoppers. Specifically, the file used in this example contains 8995 observations with 14 variables regarding customer information. The information contained in this data set is summarized below: 1. ANNUAL INCOME OF HOUSEHOLD (PERSONAL INCOME IF SINGLE) 1 Less than \$10, \$20,000 to \$24, \$40,000 to \$49, \$10,000 to \$14, \$25,000 to \$29, \$50,000 to \$74, \$15,000 to \$19, \$30,000 to \$39, \$75,000 or more 2. SEX 1. Male 2. Female 3. MARITAL STATUS 1. Married 3. Divorced or separated 5. Single, never married 2. Living together not married 4. Widowed 4. AGE thru thru thru thru thru and Over thru 44

8 Page 6 5. EDUCATION 1. Grade 8 or less 3. Graduated high school 5. College graduate 2. Grades 9 to to 3 years of college 6. Grad Study 6. OCCUPATION 1. Professional/Managerial 4. Clerical/Service Worker 7. Military 2. Sales Worker 5. Homemaker 8. Retired 3. Factory Worker/Laborer/Driver 6. Student, HS or College 9. Unemployed 7. LIVED HOW LONG IN THE SAN FRANCISCO /OAKLAND/SAN JOSE AREA 1. Less than one year 3. Four to six years 5. More than ten years 2. One to three years 4. Seven to ten years 8. DUAL INCOMES (IF MARRIED) 1. Not Married 3. No 2. Yes 9. PERSONS IN YOUR HOUSEHOLD 1. One 4. Four 7. Seven 2. Two 5. Five 8. Eight 3. Three 6. Six 9. Nine or more 10. PERSONS IN HOUSEHOLD UNDER One 4. Four 7. Seven 2. Two 5. Five 8. Eight 3. Three 6. Six 9. Nine or more 11. HOUSEHOLDER STATUS 1. Own 3. Live with Parents/Family 2. Rent

9 Page TYPE OF HOME 1. House 3. Apartment 5. Other 2. Condominium 4. Mobile Home 13. ETHNIC CLASSIFICATION 1. American Indian 4. East Indian 7. White 2. Asian 5. Hispanic 8. Other 3. Black 6. Pacific Islander 14 LANGUAGES SPOKEN MOST OFTEN IN YOUR HOME 1. English 2. Spanish 3. Other Data Analysis with STATISTICA Data preparation is a crucial step in any data mining project. In this case, however, the data preparation is complete. Data have been verified to be accurate and reasonable; there are no missing data, etc. The following illustration shows the STATISTICA Data Miner workspace, where the Generalized k- Means Clustering tool and the Generalized EM Clustering tool are used to cluster data in the Marketing data set. Once the clusters are determined, the Feature Selection tool was used to determine the variables that most strongly influence cluster membership. k-means Cluster Analysis Results In the results below, the Final Classification and Distance to Centroid are given as output for the k- Means Cluster Analysis. This output can be used to compare cases in the various predicted clusters and their strength of cluster membership.

10 Page 8 An important variable in determining clusters for the k-means algorithm is Annual Income. In the plot below, the frequencies of each income category for the two clusters are available. Lower income individuals are more often classified in Cluster 2, and higher income individuals are classified in Cluster 1. EM Cluster Analysis Results The output for EM Cluster Analysis will be quite similar to that of k-means Clustering. Along with the final cluster classification output, EM Cluster Analysis provides probabilities of cluster membership. In the output spreadsheet below, these probabilities are shown for the first several cases in the data set.

11 Page 9 Similar to the k-means results, the frequencies of observations classified in the 2 clusters can be seen for Marital Status. Married people are classified in Cluster 1. Single, never married people are most likely classified in Cluster 2. From this plot, it is easy to tell that Marital Status is an important variable to determine cluster membership. Feature Selection and Variable Screening Results Now that we have detected cluster assignments in our data set, it s of great importance to find the predictor variables (typically with data set having numerous variables) that show the strongest relationship to the final cluster assignment variable, thereby identifying the ones that best discriminate between the clusters. The resulting spreadsheet clearly arranges the Best Categorical Predictor variables from top to bottom based on the chi-square test in order of their relative importance. The histogram, showing the same information graphically, clearly indicates that MARITALSTAT was the most influential predictor in the clustering process followed by AGE, HSEHLDSTATUS etc.

12 Conclusion Page 10 Summarizing the analyses of the best predictors into a short table, we get the following results. Note that this table indicates the respective dominant class or category for each of the variables listed in the rows, i.e., the class or category that occurred with the greatest frequency within the respective cluster. Cluster 1 Cluster 2 Marital Status Married Single, Never Married Age 25 through 34, 35 through through 24 Household Status Own Rent Dual Income Yes, Dual Income Not Married Occupation Professional/Managerial Students, HS or College Annual Income \$50,000 to \$74,999 Less than 10,000 We can conclude from this table that Cluster 1 includes Married Individuals from Dual income households, between the Ages of 25 and 44, who Own houses, hold Professional/ Managerial positions, and have Annual Incomes between \$50,000 and \$75,000. Cluster 2 members are Single, between the Ages of 18-24, Rent houses, are in High School or College, and earn less than \$10,000. The above information clearly indicates that the clustering technique has helped us identify two meaningful distinct groups of shoppers from our marketing data set. This information could be exploited to better serve the needs of the customers visiting the mall, which can improve sales, paving the way for higher profitability. For example, special marketing campaigns could be designed based on this better understanding of who visits the mall (e.g., special promotions for college students); this information could also be used to market store locations in the mall to prospective retailers who specifically cater to the groups that we identified. In general, the better we know and understand our customers, the better we are prepared to serve them and, hence, ensure a successful retail enterprise.

### STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

### Clustering Marketing Datasets with Data Mining Techniques

Clustering Marketing Datasets with Data Mining Techniques Özgür Örnek International Burch University, Sarajevo oornek@ibu.edu.ba Abdülhamit Subaşı International Burch University, Sarajevo asubasi@ibu.edu.ba

### Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Cluster Analysis Alison Merikangas Data Analysis Seminar 18 November 2009 Overview What is cluster analysis? Types of cluster Distance functions Clustering methods Agglomerative K-means Density-based Interpretation

### Statistical Databases and Registers with some datamining

Unsupervised learning - Statistical Databases and Registers with some datamining a course in Survey Methodology and O cial Statistics Pages in the book: 501-528 Department of Statistics Stockholm University

### Cluster analysis with SPSS: K-Means Cluster Analysis

analysis with SPSS: K-Means Analysis analysis is a type of data classification carried out by separating the data into groups. The aim of cluster analysis is to categorize n objects in k (k>1) groups,

### An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

### Tutorial Segmentation and Classification

MARKETING ENGINEERING FOR EXCEL TUTORIAL VERSION 1.0.8 Tutorial Segmentation and Classification Marketing Engineering for Excel is a Microsoft Excel add-in. The software runs from within Microsoft Excel

### Local outlier detection in data forensics: data mining approach to flag unusual schools

Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential

### Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

### There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

Statistics: Rosie Cornish. 2007. 3.1 Cluster Analysis 1 Introduction This handout is designed to provide only a brief introduction to cluster analysis and how it is done. Books giving further details are

### Machine Learning using MapReduce

Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

### Visualizing non-hierarchical and hierarchical cluster analyses with clustergrams

Visualizing non-hierarchical and hierarchical cluster analyses with clustergrams Matthias Schonlau RAND 7 Main Street Santa Monica, CA 947 USA Summary In hierarchical cluster analysis dendrogram graphs

### from Larson Text By Susan Miertschin

Decision Tree Data Mining Example from Larson Text By Susan Miertschin 1 Problem The Maximum Miniatures Marketing Department wants to do a targeted mailing gpromoting the Mythic World line of figurines.

### Decision Support Optimization through Predictive Analytics - Leuven Statistical Day 2010

Decision Support Optimization through Predictive Analytics - Leuven Statistical Day 2010 Ernst van Waning Senior Sales Engineer May 28, 2010 Agenda SPSS, an IBM Company SPSS Statistics User-driven product

### Chapter 12 Discovering New Knowledge Data Mining

Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to

### Gerry Hobbs, Department of Statistics, West Virginia University

Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

### Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

### How To Find Out How Different Groups Of People Are Different

Determinants of Alcohol Abuse in a Psychiatric Population: A Two-Dimensionl Model John E. Overall The University of Texas Medical School at Houston A method for multidimensional scaling of group differences

### Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems

Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems Ran M. Bittmann School of Business Administration Ph.D. Thesis Submitted to the Senate of Bar-Ilan University Ramat-Gan,

### Chapter ML:XI (continued)

Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

### Cluster Analysis for Evaluating Trading Strategies 1

CONTRIBUTORS Jeff Bacidore Managing Director, Head of Algorithmic Trading, ITG, Inc. Jeff.Bacidore@itg.com +1.212.588.4327 Kathryn Berkow Quantitative Analyst, Algorithmic Trading, ITG, Inc. Kathryn.Berkow@itg.com

### Clustering Connectionist and Statistical Language Processing

Clustering Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised

### ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical

### Simple Predictive Analytics Curtis Seare

Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

### Clustering UE 141 Spring 2013

Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or

### IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

### Rethinking the Cultural Context of Schooling Decisions in Disadvantaged Neighborhoods: From Deviant Subculture to Cultural Heterogeneity

Rethinking the Cultural Context of Schooling Decisions in Disadvantaged Neighborhoods: From Deviant Subculture to Cultural Heterogeneity Sociology of Education David J. Harding, University of Michigan

### Lecture 10: Regression Trees

Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

### EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

### The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon

The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon ABSTRACT Effective business development strategies often begin with market segmentation,

### Demographics of Atlanta, Georgia:

Demographics of Atlanta, Georgia: A Visual Analysis of the 2000 and 2010 Census Data 36-315 Final Project Rachel Cohen, Kathryn McKeough, Minnar Xie & David Zimmerman Ethnicities of Atlanta Figure 1: From

### How To Cluster

Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

### Unsupervised learning: Clustering

Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What

### COLLEGE OF SCIENCE. John D. Hromi Center for Quality and Applied Statistics

ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM COLLEGE OF SCIENCE John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE: COS-STAT-747 Principles of Statistical Data Mining

### CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

PREFACE xi 1 INTRODUCTION 1 1.1 Overview 1 1.2 Definition 1 1.3 Preparation 2 1.3.1 Overview 2 1.3.2 Accessing Tabular Data 3 1.3.3 Accessing Unstructured Data 3 1.3.4 Understanding the Variables and Observations

### DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

### FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

### Time series clustering and the analysis of film style

Time series clustering and the analysis of film style Nick Redfern Introduction Time series clustering provides a simple solution to the problem of searching a database containing time series data such

### KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of

### CHAPTER 6 DATA MINING

Camm, Cochran, Fry, Ohlmann 1 Chapter 6 CHAPTER 6 DATA MINING CONTENTS 6.1 DATA SAMPLING 6.2 DATA PREPARATION Treatment of Missing Data Identification of Erroneous Data and Outliers Variable Representation

### IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

### The Data Mining Process

Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

### . Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties

### BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

### 4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: "What do the data look like?"

Data Analysis Plan The appropriate methods of data analysis are determined by your data types and variables of interest, the actual distribution of the variables, and the number of cases. Different analyses

Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

### 1 Choosing the right data mining techniques for the job (8 minutes,

CS490D Spring 2004 Final Solutions, May 3, 2004 Prof. Chris Clifton Time will be tight. If you spend more than the recommended time on any question, go on to the next one. If you can t answer it in the

### Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

### Data Mining Applications in Higher Education

Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2

### Building Data Cubes and Mining Them. Jelena Jovanovic Email: jeljov@fon.bg.ac.yu

Building Data Cubes and Mining Them Jelena Jovanovic Email: jeljov@fon.bg.ac.yu KDD Process KDD is an overall process of discovering useful knowledge from data. Data mining is a particular step in the

### !"!!"#\$\$%&'()*+\$(,%!"#\$%\$&'()*""%(+,'-*&./#-\$&'(-&(0*".\$#-\$1"(2&."3\$'45"

!"!!"#\$\$%&'()*+\$(,%!"#\$%\$&'()*""%(+,'-*&./#-\$&'(-&(0*".\$#-\$1"(2&."3\$'45"!"#"\$%&#'()*+',\$\$-.&#',/"-0%.12'32./4'5,5'6/%&)\$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:

### Behavioral Segmentation

Behavioral Segmentation TM Contents 1. The Importance of Segmentation in Contemporary Marketing... 2 2. Traditional Methods of Segmentation and their Limitations... 2 2.1 Lack of Homogeneity... 3 2.2 Determining

### Neural Networks Lesson 5 - Cluster Analysis

Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29

### COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES.

277 CHAPTER VI COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES. This chapter contains a full discussion of customer loyalty comparisons between private and public insurance companies

### CLASSIFYING SERVICES USING A BINARY VECTOR CLUSTERING ALGORITHM: PRELIMINARY RESULTS

CLASSIFYING SERVICES USING A BINARY VECTOR CLUSTERING ALGORITHM: PRELIMINARY RESULTS Venkat Venkateswaran Department of Engineering and Science Rensselaer Polytechnic Institute 275 Windsor Street Hartford,

### Chapter 20: Data Analysis

Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

### In this presentation, you will be introduced to data mining and the relationship with meaningful use.

In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine

### CLUSTER ANALYSIS FOR SEGMENTATION

CLUSTER ANALYSIS FOR SEGMENTATION Introduction We all understand that consumers are not all alike. This provides a challenge for the development and marketing of profitable products and services. Not every

### WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

### Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd Edition

Brochure More information from http://www.researchandmarkets.com/reports/2170926/ Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd

### Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

### Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

### Easily Identify the Right Customers

PASW Direct Marketing 18 Specifications Easily Identify the Right Customers You want your marketing programs to be as profitable as possible, and gaining insight into the information contained in your

### Visualization methods for patent data

Visualization methods for patent data Treparel 2013 Dr. Anton Heijs (CTO & Founder) Delft, The Netherlands Introduction Treparel can provide advanced visualizations for patent data. This document describes

### 5.2 Customers Types for Grocery Shopping Scenario

------------------------------------------------------------------------------------------------------- CHAPTER 5: RESULTS AND ANALYSIS -------------------------------------------------------------------------------------------------------

### Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

### KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it

KNIME TUTORIAL Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it Outline Introduction on KNIME KNIME components Exercise: Market Basket Analysis Exercise: Customer Segmentation Exercise:

### Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

### Americans Current Views on Smoking 2013: An AARP Bulletin Survey

Americans Current Views on Smoking 2013: An AARP Bulletin Survey November 2013 Americans Current Views on Smoking 2013: An AARP Bulletin Survey Report Prepared by Al Hollenbeck, Ph.D. Copyright 2013 AARP

### Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

### Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

### Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Slide 1 Section 14 Simple Linear Regression: Introduction to Least Squares Regression There are several different measures of statistical association used for understanding the quantitative relationship

### 11. Analysis of Case-control Studies Logistic Regression

Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

### Using Library Dependencies for Clustering

Using Library Dependencies for Clustering Jochen Quante Software Engineering Group, FB03 Informatik, Universität Bremen quante@informatik.uni-bremen.de Abstract: Software clustering is an established approach

### Decision Trees What Are They?

Decision Trees What Are They? Introduction...1 Using Decision Trees with Other Modeling Approaches...5 Why Are Decision Trees So Useful?...8 Level of Measurement... 11 Introduction Decision trees are a

### Fairfield Public Schools

Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity

### Cluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico

Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from

### An Overview and Evaluation of Decision Tree Methodology

An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX terri.moore@motorola.com Carole Jesse Cargill, Inc. Wayzata, MN carole_jesse@cargill.com

### PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA Prakash Singh 1, Aarohi Surya 2 1 Department of Finance, IIM Lucknow, Lucknow, India 2 Department of Computer Science, LNMIIT, Jaipur,

### Tutorial for proteome data analysis using the Perseus software platform

Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information

### Performance Metrics for Graph Mining Tasks

Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical

### Health Spring Meeting May 2008 Session # 42: Dental Insurance What's New, What's Important

Health Spring Meeting May 2008 Session # 42: Dental Insurance What's New, What's Important Floyd Ray Martin, FSA, MAAA Thomas A. McInteer, FSA, MAAA Jonathan P. Polon, FSA Dental Insurance Fraud Detection

### ElegantJ BI. White Paper. The Competitive Advantage of Business Intelligence (BI) Forecasting and Predictive Analysis

ElegantJ BI White Paper The Competitive Advantage of Business Intelligence (BI) Forecasting and Predictive Analysis Integrated Business Intelligence and Reporting for Performance Management, Operational

### Basic Concepts in Research and Data Analysis

Basic Concepts in Research and Data Analysis Introduction: A Common Language for Researchers...2 Steps to Follow When Conducting Research...3 The Research Question... 3 The Hypothesis... 4 Defining the

### Easily Identify Your Best Customers

IBM SPSS Statistics Easily Identify Your Best Customers Use IBM SPSS predictive analytics software to gain insight from your customer database Contents: 1 Introduction 2 Exploring customer data Where do

### SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

### 2015 Workshops for Professors

SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market

### White Paper. Redefine Your Analytics Journey With Self-Service Data Discovery and Interactive Predictive Analytics

White Paper Redefine Your Analytics Journey With Self-Service Data Discovery and Interactive Predictive Analytics Contents Self-service data discovery and interactive predictive analytics... 1 What does

### The SPSS TwoStep Cluster Component

White paper technical report The SPSS TwoStep Cluster Component A scalable component enabling more efficient customer segmentation Introduction The SPSS TwoStep Clustering Component is a scalable cluster