CLUSTER ANALYSIS
- Randall Green
- 8 years ago
Introduction

Cluster analysis is a technique for grouping individuals or objects hierarchically into unknown groups suggested by the data.

Cluster analysis can be considered an alternative to Factor Analysis.

Cluster analysis differs from discriminant analysis.
o In cluster analysis the group membership is unknown prior to the analysis.

In the biological sciences, an area where cluster analysis has been widely used is taxonomy.
o In taxonomy individuals are classified into arbitrary groups based on measurements of the individuals.
o The classification moves from the most general to the most specific:
  Kingdom, Phylum, Subphylum, Class, Order, Family, Genus, Species

In economics, cluster analysis can be used for data mining.
o For example, in a market survey you could classify patrons into groups based on their answers to many questions.

Warnings for cluster analysis.
o Groupings from cluster analysis can differ depending on the method of analysis used.
o Since the groups are not known a priori, it can be difficult to determine if the results make sense in the context of the research being conducted.
o Knowledge of the population you are sampling and common sense are two important tools when it comes to interpreting results from cluster analysis.

Basic Concepts of Cluster Analysis

Cluster analysis can be divided into two basic steps:
1. Initial analysis of data.
2. Analytical clustering using one of many methods of amalgamation.

Initial analysis
o It is always a good idea before any statistical analysis to plot a scatter diagram of your data to see if there are any irregularities that need to be addressed using a transformation.
o A common transformation in multivariate analyses is to standardize your data so that each variable has a mean of 0 and a variance of 1.0:
  Standardized $Y_i = (Y_i - \bar{Y}) / S_Y$
  where $\bar{Y}$ is the mean and $S_Y$ the standard deviation of the variable.
o If in visualizing your data you seem to see clusters that are elliptical in shape, you want to use a transformation method that will make the resultant pooled within-cluster covariance matrix spherical.
  v The PROC ACECLUS (Approximate Covariance Estimation for Clustering) procedure in SAS will perform the transformation.
  v Neither cluster membership nor the number of clusters needs to be known.

Analytical clustering

Distance Measures
o Distance measures can be studied in large data sets to determine similarities or clusters.
o The opposite of similarity is distance.
o Distance values can be calculated for each pair of observations.
o Statistical methods to calculate distance are very sensitive to outliers, so you are encouraged to run diagnostics on your data to identify outliers and remove them if necessary.
o The most commonly used distance measurement is the Euclidean distance:
  $\text{Distance}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
o Different methods to determine distance will provide different results.

Cluster Analysis Process
o At the start of the cluster analysis, each individual is in its own cluster.
o In subsequent rounds of analysis, the closest clusters are joined, so the entries are placed into fewer and fewer clusters.
o At the end of the cluster analysis, all individuals are in the same cluster.
o During the various rounds of cluster analysis, the distances between new clusters must be determined, and we need to be able to determine when two clusters are sufficiently close to be linked together.
o Two of the most common methods of cluster analysis are:
  Unweighted Pair-Group Method using Arithmetic averages (UPGMA): the distance between any two clusters is the average distance between all pairs of individuals in the two different clusters.
  Ward's Method: a minimum variance method that uses an ANOVA approach. At each step of the cluster analysis, the method joins the two clusters whose merger gives the smallest increase in the within-cluster sum of squares.

Estimating the Number of Clusters
o Three methods that can be used to estimate the number of clusters are:
  1. Cubic clustering criterion (CCC): the estimated number of clusters occurs at the start of a peak on the graph. There may be more than one peak per plot.
  2. Pseudo F: the estimated number of clusters occurs at the start of a peak on the graph. There may be more than one peak per plot.
  3. Pseudo $t^2$: the graph is read right to left. The estimated number of clusters occurs at the start of a peak. There may be more than one peak per plot.

Precautions When Using Cluster Analysis

Unless there is considerable separation between inherent groups when you view the scatter plots, it is not realistic to expect Cluster Analysis to provide clear results.
Cluster Analysis is very sensitive to outliers.

The different Cluster Analysis methods may give you very different results.

If you have large amounts of data, one simple way to validate the results from Cluster Analysis is to conduct the analysis separately on two halves of your data. It is preferable to assign the individuals to the two halves at random.

Example of Cluster Analysis

In this example, I am using data from the PhD dissertation of one of my students (Sintayehu Daba). Sintayehu is evaluating barley lines from three regions: Ethiopia and Kenya, ICARDA, and North Dakota, USA. Sintayehu collected data on many different plant characters, agronomic traits, and disease resistance. In the analysis, I am trying to determine if cluster analysis will successfully separate the data into distinct clusters based on the data collected.

SAS Commands

    options pageno=1;
    data all;
      input Entry Source Color Hull_cover Row Orrow DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
      datalines;
    ;;
    data two;
      set all;
      if row=2;   /* keep only the observations with row=2 */
    ods graphics on;
    ods rtf file='cluster.rtf';

    /* UPGMA (average linkage) */
    proc cluster data=two method=ave print=15 ccc pseudo;
      var row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
      copy orrow;
      title 'Cluster Analysis Using the UPGMA Method';
    proc tree noprint ncl=3 out=out;
      copy row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD orrow;
    proc freq;
      tables cluster*orrow / nopercent norow nocol plot=none;
    proc candisc noprint out=can;
      class cluster;
      var row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
    proc sgplot data=can;
      scatter y=can2 x=can1 / group=cluster;

    /* Ward's minimum variance method (Row was listed twice in the VAR statement; duplicate dropped) */
    proc cluster data=two method=ward print=15 ccc pseudo;
      var row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
      copy orrow;
      title 'Cluster analysis Using Wards Method';
    proc tree noprint ncl=3 out=out;
      copy row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD orrow;
    proc freq;
      tables cluster*orrow / nopercent norow nocol plot=none;
    proc candisc noprint out=can;
      class cluster;
      var row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
    proc sgplot data=can;
      scatter y=can2 x=can1 / group=cluster;
    run;   /* finish the last step before the RTF destination is closed */
    ods rtf close;
    ods graphics off;
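For readers without access to SAS, the core of the UPGMA run above (optionally standardize, compute Euclidean distances, merge the closest pair by average linkage, stop at three clusters as with ncl=3) can be sketched in plain Python. This is only an illustration under stated assumptions: the toy data and the function names `standardize`, `euclidean`, and `upgma` are hypothetical, it does not reproduce PROC CLUSTER output, and the standardization step is an extra that the SAS commands shown do not request.

```python
import math
from statistics import mean, stdev


def standardize(rows):
    """Rescale each column to mean 0 and sample variance 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def upgma(rows, n_clusters):
    """Average linkage: start with one cluster per observation and merge
    the closest pair of clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(rows))]

    def avg_dist(a, b):
        # Mean distance over all pairs with one member in each cluster.
        return mean(euclidean(rows[i], rows[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Ties in distance are broken by position, i.e. by data order.
        _, p, q = min(
            (avg_dist(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: 6 observations measured on 2 traits.
data = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5],
        [5.0, 30.0], [5.2, 31.0], [9.0, 60.0]]
z = standardize(data)
groups = upgma(z, n_clusters=3)
print(groups)  # observation indices grouped into three clusters
```

The brute-force search over all cluster pairs keeps the sketch short; it is fine for a handful of observations but would be slow on a data set the size of the barley example.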
Cluster Analysis Using the UPGMA Method
The CLUSTER Procedure: Average Linkage Cluster Analysis

[Table: Eigenvalues of the Covariance Matrix. Columns: Eigenvalue, Difference, Proportion, Cumulative. Also reported: Root-Mean-Square Total-Sample Standard Deviation and Root-Mean-Square Distance Between Observations. Values not recoverable.]

[Table: Cluster History, 15 down to 8 clusters. Columns: Number of Clusters, Clusters Joined, Freq, Semipartial R-Square, R-Square, Approximate Expected R-Square, Cubic Clustering Criterion, Pseudo F Statistic, Pseudo t-squared, Norm RMS Distance, Tie. Values not recoverable.]
[Table: Cluster History continued, 7 clusters down to 1, same columns. Values not recoverable.]

The semipartial $R^2$ measures the loss of homogeneity when two clusters are merged: it is the decrease in $R^2$ caused by that join. This value reflects the decreasing homogeneity of members in a cluster as clusters are combined to make new clusters.

$R^2$ reflects the differences between clusters, so you want this value to be high. At the start of the clustering process all entries are their own cluster; thus, the $R^2$ is 1. As more clusters are combined, the $R^2$ value should decrease. At the end of the analysis, when all observations are in the same cluster, the $R^2$ value should theoretically be 0.

The approximate expected $R^2$ value is part of the output presented when the CCC value is requested. It reflects an estimated value under a uniform null hypothesis.

Ties
o At each level of the clustering process, PROC CLUSTER identifies the pair of clusters with the minimum distance between them. Sometimes there can be two or more pairs of clusters with the same minimum distance. This often occurs with discrete data. In such cases the tie must be broken in some arbitrary way. If there are ties, then the results of the cluster analysis depend on the order of the observations in the data set.
o A tie means that at a particular step in the cluster analysis, two pairs of clusters had the same minimum distance, and possibly that some of the clusters formed in later steps are not uniquely determined.
o Ties that occur early in the cluster analysis usually have little effect on the later stages. Ties that occur in the middle parts of the cluster analysis should be investigated. Ties that occur late in the cluster analysis are a sign that a solid or concrete solution may not be possible. There are routines you can run to determine if ties are affecting the outcome of your analyses.
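The two ideas above, $R^2$ falling from 1 to 0 as clusters are joined and ties arising when several pairs share the minimum distance, can be illustrated with a small Python sketch. It assumes $R^2$ is computed as one minus the pooled within-cluster sum of squares over the total sum of squares; the function names `sum_sq`, `r_square`, and `tied_pairs` and the toy points are hypothetical, and this is not the routine SAS uses for checking ties.

```python
import math
from itertools import combinations


def sum_sq(points):
    """Sum of squared deviations of the points from their centroid."""
    centroid = [sum(c) / len(points) for c in zip(*points)]
    return sum((v - m) ** 2 for p in points for v, m in zip(p, centroid))


def r_square(points, partition):
    """R-square = 1 - (pooled within-cluster SS / total SS)."""
    within = sum(sum_sq([points[i] for i in cluster]) for cluster in partition)
    return 1.0 - within / sum_sq(points)


def tied_pairs(points, tol=1e-12):
    """Return every pair of observations lying at the minimum distance."""
    dists = {(i, j): math.dist(points[i], points[j])
             for i, j in combinations(range(len(points)), 2)}
    d_min = min(dists.values())
    return [pair for pair, d in dists.items() if abs(d - d_min) <= tol]


# Hypothetical discrete-valued data: equal spacing produces a tie.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]

r2_start = r_square(pts, [[0], [1], [2], [3]])  # every entry its own cluster
r2_two = r_square(pts, [[0, 1, 2], [3]])        # two clusters
r2_end = r_square(pts, [[0, 1, 2, 3]])          # everything in one cluster

# Semipartial R-square of the final join is the drop in R-square it causes.
semipartial_last = r2_two - r2_end

print(r2_start, round(r2_two, 4), r2_end)
print(tied_pairs(pts))  # more than one pair here means the first join is tied
```

With more than one pair returned by `tied_pairs`, which pair is joined first depends on the order of the observations, which is the order dependence described above.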
[Table: CLUSTER by Orrow frequencies, UPGMA method, using non-standardized data. Frequency Missing = 1. Values not recoverable.]

[Table: CLUSTER by Orrow frequencies, UPGMA method, using standardized data. Frequency Missing = 1. Values not recoverable.]

[Figure: UPGMA method, non-standardized data.]

[Figure: UPGMA method, using standardized data.]
Cluster analysis Using Wards Method
The CLUSTER Procedure: Ward's Minimum Variance Cluster Analysis

[Table: Eigenvalues of the Covariance Matrix. Columns: Eigenvalue, Difference, Proportion, Cumulative. Also reported: Root-Mean-Square Total-Sample Standard Deviation and Root-Mean-Square Distance Between Observations. Values not recoverable.]

[Table: Cluster History, 15 clusters down to 1. Columns: Number of Clusters, Clusters Joined, Freq, Semipartial R-Square, R-Square, Approximate Expected R-Square, Cubic Clustering Criterion, Pseudo F Statistic, Pseudo t-squared, Tie. Values not recoverable.]
[Table: CLUSTER by Orrow frequencies, Ward's method, non-standardized data. Frequency Missing = 1. Values not recoverable.]

[Table: CLUSTER by Orrow frequencies, Ward's method, using standardized data. Frequency Missing = 1. Values not recoverable.]

[Figure: Ward's method, using non-standardized data.]

[Figure: Ward's method, using standardized data.]
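The Ward's method results summarized above come from a different joining rule than UPGMA: at each step the two clusters whose merger adds the least to the within-cluster sum of squares are joined. A minimal Python sketch of that rule follows. The names `centroid`, `ward_cost`, and `ward` and the toy points are hypothetical, and the cost formula used, n_a * n_b / (n_a + n_b) times the squared distance between the two centroids, is the standard expression for the increase in the within-cluster sum of squares rather than anything taken from the SAS output.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(c) / len(points) for c in zip(*points)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward(points, n_clusters):
    """Join the cheapest pair of clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        _, p, q = min(
            (ward_cost(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        merged = clusters[p] + clusters[q]
        del clusters[q]          # q > p, so index p is still valid
        clusters[p] = merged
    return clusters


# Hypothetical toy data: two tight pairs and one distant point.
pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
result = ward(pts, n_clusters=2)
print(result)
```

Because the cost grows with cluster size as well as with centroid separation, Ward's method tends to resist merging large clusters, which is one reason it can give groupings that differ from those of UPGMA on the same data.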
More informationA Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic
A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic Report prepared for Brandon Slama Department of Health Management and Informatics University of Missouri, Columbia
More informationProtein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
More informationNon-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step
More informationChapter 1 Introduction. 1.1 Introduction
Chapter 1 Introduction 1.1 Introduction 1 1.2 What Is a Monte Carlo Study? 2 1.2.1 Simulating the Rolling of Two Dice 2 1.3 Why Is Monte Carlo Simulation Often Necessary? 4 1.4 What Are Some Typical Situations
More informationQuick Start to Data Analysis with SAS Table of Contents. Chapter 1 Introduction 1. Chapter 2 SAS Programming Concepts 7
Chapter 1 Introduction 1 SAS: The Complete Research Tool 1 Objectives 2 A Note About Syntax and Examples 2 Syntax 2 Examples 3 Organization 4 Chapter by Chapter 4 What This Book Is Not 5 Chapter 2 SAS
More informationPrincipal Component Analysis
Principal Component Analysis ERS70D George Fernandez INTRODUCTION Analysis of multivariate data plays a key role in data analysis. Multivariate data consists of many different attributes or variables recorded
More informationCluster Analysis. Aims and Objectives. What is Cluster Analysis? How Does Cluster Analysis Work? Postgraduate Statistics: Cluster Analysis
Aims and Objectives By the end of this seminar you should: Cluster Analysis Have a working knowledge of the ways in which similarity between cases can be quantified (e.g. single linkage, complete linkage
More informationHow To Check For Differences In The One Way Anova
MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way
More informationCluster Analysis using R
Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other
More informationAudit Analytics. --An innovative course at Rutgers. Qi Liu. Roman Chinchila
Audit Analytics --An innovative course at Rutgers Qi Liu Roman Chinchila A new certificate in Analytic Auditing Tentative courses: Audit Analytics Special Topics in Audit Analytics Forensic Accounting
More informationLocal outlier detection in data forensics: data mining approach to flag unusual schools
Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential
More informationPaper 159 2010. Exploring, Analyzing, and Summarizing Your Data: Choosing and Using the Right SAS Tool from a Rich Portfolio
Paper 159 2010 Exploring, Analyzing, and Summarizing Your Data: Choosing and Using the Right SAS Tool from a Rich Portfolio ABSTRACT Douglas Thompson, Assurant Health This is a high level survey of Base
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster
More informationChapter 13 Introduction to Linear Regression and Correlation Analysis
Chapter 3 Student Lecture Notes 3- Chapter 3 Introduction to Linear Regression and Correlation Analsis Fall 2006 Fundamentals of Business Statistics Chapter Goals To understand the methods for displaing
More informationPaper PO 015. Figure 1. PoweReward concept
Paper PO 05 Constructing Baseline of Customer s Hourly Electric Usage in SAS Yuqing Xiao, Bob Bolen, Diane Cunningham, Jiaying Xu, Atlanta, GA ABSTRACT PowerRewards is a pilot program offered by the Georgia
More informationCOM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3
COMP 5318 Data Exploration and Analysis Chapter 3 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping
More informationData analysis process
Data analysis process Data collection and preparation Collect data Prepare codebook Set up structure of data Enter data Screen data for errors Exploration of data Descriptive Statistics Graphs Analysis
More informationData Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler
Data Mining: Exploring Data Lecture Notes for Chapter 3 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Topics Exploratory Data Analysis Summary Statistics Visualization What is data exploration?
More informationChapter 5 Analysis of variance SPSS Analysis of variance
Chapter 5 Analysis of variance SPSS Analysis of variance Data file used: gss.sav How to get there: Analyze Compare Means One-way ANOVA To test the null hypothesis that several population means are equal,
More informationCluster Analysis: Advanced Concepts
Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means
More information10. Analysis of Longitudinal Studies Repeat-measures analysis
Research Methods II 99 10. Analysis of Longitudinal Studies Repeat-measures analysis This chapter builds on the concepts and methods described in Chapters 7 and 8 of Mother and Child Health: Research methods.
More informationOverview of Factor Analysis
Overview of Factor Analysis Jamie DeCoster Department of Psychology University of Alabama 348 Gordon Palmer Hall Box 870348 Tuscaloosa, AL 35487-0348 Phone: (205) 348-4431 Fax: (205) 348-8648 August 1,
More informationData Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA
Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA An Overview of SAS Enterprise Miner The following article is in regards to Enterprise Miner v.4.3 that is available in SAS v9.1.3.
More information