A Demonstration of Hierarchical Clustering




Recitation Supplement: Hierarchical Clustering and Principal Component Analysis in SAS
November 18, 2002

The Methods

In addition to K-means clustering, SAS provides several other types of unsupervised learning procedures. This recitation will focus on two of these procedures: hierarchical clustering and principal component analysis. The data sets myraw.xls and prospect.xls will be used to demonstrate the methods. (As always, there is no guarantee that either will provide substantial insights for these particular data sets.)

Agglomerative hierarchical clustering is described in Section 14.3.12 of the textbook. SAS has algorithms for these methods, but they are not directly available from within Enterprise Miner. Keep in mind that Enterprise Miner is simply a user-friendly interface that invokes basic SAS routines to perform all of its statistical tasks. Although a few nodes invoke SAS's hierarchical clustering algorithms, none of them provides direct access to clustering options or to the resulting dendrograms. Therefore, it will be necessary to introduce some features of basic SAS, and, more specifically, the basic procedures SAS uses for clustering.

Principal component analysis (PCA) is described in Section 14.5.1 of the textbook. Enterprise Miner does have a node that performs PCA, although the same node also performs certain types of supervised learning. This recitation will only cover the use of that node for PCA.

A Demonstration of Hierarchical Clustering

1. A SAS program file on the course website will be used as a model for the analysis. Go to the course website http://www.orie.cornell.edu/~davidr/or474 and follow the links Minitab and SAS Programs and Output > SAS Programs. There will be two links with the name Donations Cluster. Either one will let you download the program, but the top link may be more convenient if your computer is configured properly. Click the top Donations Cluster link and choose to open the file from its current location. This should automatically open SAS and load the program into a program editor window. If this procedure fails, click the other Donations Cluster link to see a plain text version of the SAS program. Start SAS, then copy the program from the browser window into a program editor window in SAS. (There should already be a blank program editor window with a label like Editor - Untitled1. If not, you will need to create one by choosing View > Enhanced Editor from the menu bar.)

2. Download and save myraw.xls from the course website (under Course Data Sets, labeled as Donations data). Import the data into SAS. (Do not start Enterprise Miner; all of this demonstration will be performed in base SAS.)

3. Basic SAS is a command-line-style environment that allows you to apply predefined statistical algorithms to data sets in the SAS libraries. The current version provides a convenient window-based environment: Explorer and Results windows on the left-hand side for browsing data and program results, and Editor, Output, and Log windows to the right for writing SAS program scripts and viewing output and notes on the processing performed.

The SAS program from the website should appear in an Editor window. It is a short script that invokes three predefined SAS procedures: fastclus, cluster, and tree. In general, the syntax for invoking a predefined SAS procedure is proc followed by the procedure name, then a series of options, many in the form of name/value pairs written as parameter name = parameter value, and then (possibly) a series of additional statements pertaining to the procedure.

The proc fastclus statement performs K-means clustering. In the program, it is invoked with the following options:

   data=sasuser.donations: data will be read from the library Sasuser in the data file Donations
   maxc=10: the maximum number of clusters allowed is 10 (This will often be the same as the number of clusters found, as long as the number is small and the data is nondegenerate.)
   mean=mean: the cluster means (centers) will be stored in a temporary data set named mean

The next line is a var statement that modifies the proc fastclus statement. In this case, it specifies which variables will be used for clustering. Only the listed variables will be used. The end of the series of specifications for the proc fastclus statement is marked with the run statement.

You may have imported the data under a different name or into a different library than the one listed in the program. If so, make any necessary changes to the program now.

The proc cluster statement performs agglomerative hierarchical clustering. In the program, it is invoked with the following specifications:

   data=mean: data will be read from the temporary data file mean
   method=average: the measure of intergroup dissimilarity used in the clustering will be group average (Other options include single for single linkage and complete for complete linkage.)
   print=20: the maximum number of clusters at the lowest level of the dendrogram to be displayed will be 20 (This does not actually produce a dendrogram, but the next procedure will.)

Again, the specifications end with a run statement.

The proc tree statement creates a dendrogram graphic based on the output of proc cluster. In the program, it is invoked with the following specifications:

   horizontal: the height axis of the dendrogram will be horizontal and the root will be at the left
   spaces=2: there will be 2 spaces between adjacent objects on the final printed output

This is followed by a final run statement.

In summary, the script will first perform a K-means clustering of the data based on the specified variables, then perform a group average hierarchical clustering on the cluster centers from the K-means clustering, and finally display the results of the hierarchical clustering in a horizontal dendrogram.

4. With the Editor window active, select Run > Submit to run the program. A Graph window will soon appear, displaying a dendrogram of the hierarchical clustering of the K-means centers. Note that there are 10 of these, labeled OB1 through OB10. There will also be new information in the Output and Log windows: summary statistics from both of the clustering procedures in the Output window, and run-time processing notes in the Log window. If there were any run-time errors, information in the Log window could help you diagnose them.

The output statistics can be browsed conveniently with the aid of the Results window. Click the Results tab and note that three folders are listed, one for each of the three procedures run. Double-click on the folders and then on their contents to pull up the corresponding information in the Output or Graph window.

Note: The descriptions of the SAS code given here are necessarily brief and incomplete. For more detailed information, including complete syntax and comprehensive lists of options for the predefined procedures, consult the general SAS system help. For information on a specific procedure, try a search on the name of that procedure.

A Demonstration of Principal Component Analysis

1. Start SAS, if it is not already running. Download and import the dataset prospect.xls from its usual location on the course website.
(Recall that this is the customer demographic database information that was used in the November 4 recitation.) Start Enterprise Miner and create a new project.

2. Drag an Input Data Source node onto the diagram, open it, and input the data set. (Also, change the metadata sample to be the full data set.) Check the Variables tab. In the November 4 recitation, the variable LOC was rejected in favor of the simpler and more pertinent variable CLIMATE. Set the Model Role of LOC to rejected. Close the node (saving changes).

3. Connect a Princomp/Dmneural node (from the Tools menu) after the Input Data Source node. This node has two separate roles: to fit yet another type of predictive model (based on applying neural networks to principal components) to a dataset, and to extract the principal components of a multivariate dataset for use in later nodes. Only the second role will be demonstrated here.

Recall that this data set does have a small percentage of missing values. The Princomp/Dmneural node will automatically perform mean imputation for any numeric variables having missing values and create a new missing category for any class variables having missing values. Since there are only a few missing values, such simple imputation methods will hopefully be acceptable, so a Replacement node will not be used.

Open the Princomp/Dmneural node. The Variables tab will be active and will show the usual information. Click the General tab. By default, the box labeled Only do principal components analysis will be checked, because no target variables have been specified. The box labeled Reject the original input variables may also be checked. This specifies that only the principal components will be passed in a usable form to any subsequent nodes in the flow. These options are suitable, so leave them as they are.

Click the PrinComp tab. The options here allow control over the principal component analysis. You can extract principal components from either the Uncorrected covariance matrix (presumably the second moment matrix, in which the variable means are not subtracted), the Covariance matrix, or the Correlation matrix. Choose Correlation matrix. This choice will make the principal components invariant to the scales of the variables, which is important when variables are of different orders of magnitude, as they are in this data set. The other options on this tab specify how many of the principal components will be extracted (largest eigenvalues first, of course). See the help files for details. There are only a few variables in this data set, so there will only be a few principal components. To extract all of the principal components, leave these at their default settings. Close the node.

Note: Principal component analysis requires numeric variables. Therefore, the node automatically converts class variables into a set of dummy numeric indicator variables, one for each class, and then performs the PCA with these indicator variables in place of the class variables.

4. Run the Principal components node (as it is now labeled) and choose to view the results. The results window will appear with the PrinComp tab active, showing a graphical representation of the eigenvalues, the variances associated with the principal components. The radio buttons allow you to make different types of eigenvalue plots. Such plots are sometimes used to determine whether the data set can be reduced to a smaller number of effective variables, as might be the case if only a few eigenvalues are large.

Click the Details... button. A table of eigenvalue information will be displayed when the Eigenvalues button is active (as it is by default). Select the Eigenvectors radio button. This gives a table of the loadings of the principal components, i.e., the degrees to which each variable contributes to them. Such a table can sometimes be used to assign interpretations to the principal components. Note that all of the dummy variables for each class variable appear in this table, including the dummy variables for the missing classes.

Examine the loadings for the first principal component. Do you see anything interesting in this pattern of loadings? Can you explain it? Examine the loadings for the second principal component. What kind of demographic variation does it appear to capture?

Try It Yourself

1. Perform a principal component analysis on the covariance matrix instead of the correlation matrix. Can you explain the resulting eigenvalues and eigenvectors?

2. Perform a more sophisticated imputation of the missing values using the Replacement node before performing PCA. How do the results change?

Created by Trevor Park on November 17, 2002
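
Supplementary sketch: group average clustering of cluster centers

The hierarchical clustering demonstration applies group average linkage to the centers produced by K-means. For readers who want to see what that linkage rule does step by step, the following is a minimal Python sketch, not the course's SAS program. The five centers are made-up values standing in for the K-means output, the function names are hypothetical, and plain Euclidean distance is an assumption; the distance that proc cluster actually uses may differ.

```python
from itertools import combinations
from math import dist


def group_average(points, a, b):
    """Mean Euclidean distance over all pairs with one index in a and one in b."""
    total = sum(dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def average_linkage(points):
    """Agglomerative clustering with group average linkage.

    Returns the merge history as (members_a, members_b, height) tuples,
    where members are tuples of indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]  # start with singletons
    merges = []
    while len(clusters) > 1:
        # Find the pair of current clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: group_average(points, *pair))
        merges.append((a, b, group_average(points, a, b)))
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical cluster centers (assumption: stand-ins for the K-means means).
centers = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (12.0, 0.0)]
merges = average_linkage(centers)
for a, b, height in merges:
    print(a, "+", b, "at height", round(height, 3))
```

Each printed line corresponds to one join in a dendrogram: the two groups being merged and the height at which the join would be drawn.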
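
Supplementary sketch: covariance versus correlation matrix in PCA

The principal component demonstration recommends the Correlation matrix option because it makes the components invariant to the scales of the variables. The following minimal Python sketch illustrates that point for two variables. It is not what Enterprise Miner runs internally; the income and age values are invented for illustration, and all function names are hypothetical.

```python
from math import sqrt


def covariance_matrix(x, y):
    """Sample covariance matrix of two equal-length variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def correlation_matrix(x, y):
    """Correlation matrix: the covariance matrix of the standardized variables."""
    (sxx, sxy), (_, syy) = covariance_matrix(x, y)
    r = sxy / sqrt(sxx * syy)
    return [[1.0, r], [r, 1.0]]


def eigenvalues_2x2(m):
    """Eigenvalues of a symmetric 2x2 matrix, largest first (closed form)."""
    (a, b), (_, d) = m
    mean = (a + d) / 2
    radius = sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean + radius, mean - radius


def variance_shares(m):
    """Fraction of total variance carried by each principal component."""
    big, small = eigenvalues_2x2(m)
    return big / (big + small), small / (big + small)


# Hypothetical data (assumption): income in dollars, age in years.
income = [32000.0, 45000.0, 51000.0, 60000.0, 78000.0, 90000.0]
age = [25.0, 31.0, 47.0, 38.0, 52.0, 44.0]
income_k = [v / 1000 for v in income]  # same variable, rescaled to thousands

print("covariance, dollars  :", variance_shares(covariance_matrix(income, age)))
print("covariance, thousands:", variance_shares(covariance_matrix(income_k, age)))
print("correlation, dollars  :", variance_shares(correlation_matrix(income, age)))
print("correlation, thousands:", variance_shares(correlation_matrix(income_k, age)))
```

With the covariance matrix, the first component is dominated by whichever variable has the larger numeric spread, so the shares change when income is rescaled. With the correlation matrix, the shares are the same for both scalings. This may be a useful reference point for the first Try It Yourself exercise.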