Recitation Supplement: Hierarchical Clustering and Principal Component Analysis in SAS
November 18, 2002

The Methods

In addition to K-means clustering, SAS provides several other types of unsupervised learning procedures. This recitation will focus on two of these: hierarchical clustering and principal component analysis. The data sets myraw.xls and prospect.xls will be used to demonstrate the methods. (As always, there is no guarantee that either will provide substantial insights for these particular data sets.)

Agglomerative hierarchical clustering is described in Section 14.3.12 of the textbook. SAS has algorithms for these methods, but they are not directly available from within Enterprise Miner. Keep in mind that Enterprise Miner is simply a user-friendly interface that invokes basic SAS routines to perform all of its statistical tasks. Although a few nodes invoke SAS's hierarchical clustering algorithms, none of them provides direct access to clustering options or to the resulting dendrograms. It will therefore be necessary to introduce some features of basic SAS and, more specifically, the basic procedures SAS uses for clustering.

Principal component analysis (PCA) is described in Section 14.5.1 of the textbook. Enterprise Miner does have a node that performs PCA, although the same node also performs certain types of supervised learning. This recitation will cover only the use of that node for PCA.

A Demonstration of Hierarchical Clustering

1. A SAS program file on the course website will be used as a model for the analysis. Go to the course website http://www.orie.cornell.edu/~davidr/or474 and follow the links "Minitab and SAS Programs and Output" and then "SAS Programs". There will be two links named "Donations Cluster". Either one will let you download the program, but the top link may be more convenient if your computer is configured properly. Click the top "Donations Cluster" link and choose to open the file from its current location.
This should automatically open SAS and load the program into a program editor window. If this procedure fails, click the other "Donations Cluster" link to see a plain-text version of the SAS program. Start SAS, then copy the program from the browser window into a program editor window in SAS. (There should already be a blank program editor window with a label like "Editor - Untitled1". If not, you will need to create one by choosing View > Enhanced Editor from the menu bar.)

2. Download and save myraw.xls from the course website (under "Course Data Sets", labeled as "Donations data"). Import the data into SAS. (Do not start Enterprise Miner; all of this demonstration will be performed in base SAS.)
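For reference, the downloaded program has roughly the following shape, as described in the next step. This is only a sketch, not the course program itself: the file path and the variable names in the var statement are hypothetical, and an outtree= data set has been added here as one way to pass the clustering results to proc tree. Compare against the actual program you downloaded.

```sas
/* Sketch of the clustering session. Names marked "hypothetical"
   are assumptions, not necessarily those of the course program. */

/* Import the spreadsheet (hypothetical path) into the Sasuser library. */
proc import datafile="c:\myraw.xls" out=sasuser.donations
            dbms=xls replace;
run;

/* K-means clustering with at most 10 clusters; store the cluster
   means (centers) in a temporary data set named mean. */
proc fastclus data=sasuser.donations maxc=10 mean=mean;
   var amount1 amount2 amount3;   /* hypothetical variable names */
run;

/* Group-average hierarchical clustering of the cluster centers.
   outtree= saves the tree so that proc tree can draw it. */
proc cluster data=mean method=average print=20 outtree=tree;
run;

/* Draw the dendrogram horizontally, with the root at the left. */
proc tree data=tree horizontal spaces=2;
run;
```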
3. Basic SAS is a command-line-style environment that allows you to apply predefined statistical algorithms to data sets in the SAS libraries. The current version provides a convenient window-based environment: Explorer and Results windows on the left-hand side for browsing data and program results, and Editor, Output, and Log windows on the right for writing SAS program scripts and viewing output and notes on the processing performed.

The SAS program from the website should appear in an Editor window. It is a short script that invokes three predefined SAS procedures: fastclus, cluster, and tree. In general, the syntax for invoking a predefined SAS procedure is "proc" followed by the procedure name, then a series of options (many in the form of name/value pairs written as "parameter = value"), and then (possibly) a series of additional statements pertaining to the procedure.

The proc fastclus statement performs K-means clustering. In the program, it is invoked with the following options:

- data=sasuser.donations: data will be read from the library Sasuser, in the data set Donations.
- maxc=10: the maximum number of clusters allowed will be 10. (This will often be the same as the number of clusters found, as long as the number is small and the data are nondegenerate.)
- mean=mean: the cluster means (centers) will be stored in a temporary data set named mean.

The next line is a var statement that modifies the proc fastclus statement. In this case, it specifies which variables will be used for clustering; only the listed variables will be used. The end of the series of specifications for the proc fastclus statement is marked with the run statement.

You may have imported the data under a different name, or into a different library, than the one listed in the program. If so, make any necessary changes to the program now.

The proc cluster statement performs agglomerative hierarchical clustering.
In the program, it is invoked with the following specifications:

- data=mean: data will be read from the temporary data set mean.
- method=average: the measure of intergroup dissimilarity used in the clustering will be the group average. (Other options include single for single linkage and complete for complete linkage.)
- print=20: the maximum number of clusters at the lowest level of the dendrogram to be displayed will be 20. (This does not actually produce a dendrogram, but the next procedure will.)

Again, the specifications end with a run statement.

The proc tree statement creates a dendrogram graphic based on the output of proc cluster. In the program, it is invoked with the following specifications:
- horizontal: the height axis of the dendrogram will be horizontal, and the root will be at the left.
- spaces=2: there will be 2 spaces between adjacent objects on the final printed output.

This is followed by a final run statement.

In summary, the script will first perform a K-means clustering of the data based on the specified variables, then perform a group-average hierarchical clustering on the cluster centers from the K-means clustering, and finally display the results of the hierarchical clustering in a horizontal dendrogram.

4. With the Editor window active, select Run > Submit to run the program. A Graph window will soon appear, displaying a dendrogram of the hierarchical clustering of the K-means centers. Note that there are 10 of these, labeled OB1 through OB10. There will also be new information in the Output and Log windows: summary statistics from both of the clustering procedures in the Output window, and run-time processing notes in the Log window. If there were any run-time errors, information in the Log window could help you diagnose them.

The output statistics can be browsed conveniently with the aid of the Results window. Click the Results tab and note that three folders are listed, one for each of the three procedures run. Double-click on the folders, and then on their contents, to pull up the corresponding information in the Output or Graph window.

Note: The descriptions of the SAS code given here are necessarily brief and incomplete. For more detailed information, including complete syntax and comprehensive lists of options for the predefined procedures, consult the general SAS system help. For information on a specific procedure, try a search on the name of that procedure.

A Demonstration of Principal Component Analysis

1. Start SAS, if it is not already running. Download and import the data set prospect.xls from its usual location on the course website.
(Recall that this is the customer demographic database information that was used in the November 4 recitation.) Start Enterprise Miner and create a new project.

2. Drag an Input Data Source node onto the diagram, open it, and input the data set. (Also, change the metadata sample to be the full data set.) Check the Variables tab. In the November 4 recitation, the variable LOC was rejected in favor of the simpler and more pertinent variable CLIMATE. Set the Model Role of LOC to "rejected". Close the node (saving changes).

3. Connect a Princomp/Dmneural node (from the Tools menu) after the Input Data Source node. This node has two separate roles: to fit yet another type of predictive model
(based on applying neural networks to principal components) to a data set, and to extract the principal components of a multivariate data set for use in later nodes. Only the second role will be demonstrated here.

Recall that this data set does have a small percentage of missing values. The Princomp/Dmneural node will automatically perform mean imputation for any numeric variables having missing values and create a new "missing" category for any class variables having missing values. Since there are only a few missing values, such simple imputation methods will hopefully be acceptable, so a Replacement node will not be used.

Open the Princomp/Dmneural node. The Variables tab will be active and will show the usual information. Click the General tab. By default, the box labeled "Only do principal components analysis" will be checked, because no target variables have been specified. The box labeled "Reject the original input variables" may also be checked; this specifies that only the principal components will be passed in a usable form to any subsequent nodes in the flow. These options are suitable, so leave them as they are.

Click the PrinComp tab. The options here allow control over the principal component analysis. You can extract principal components from the Uncorrected covariance matrix (presumably the second moment matrix, in which the variable means are not subtracted), the Covariance matrix, or the Correlation matrix. Choose Correlation matrix. This choice will make the principal components invariant to the scales of the variables, which is important when variables are of different orders of magnitude, as they are in this data set.

The other options on this tab specify how many of the principal components will be extracted (largest eigenvalues first, of course). See the help files for details. There are only a few variables in this data set, so there will only be a few principal components.
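As an aside, the same kind of extraction can be sketched in base SAS with proc princomp, which uses the correlation matrix by default. The data set and variable names below are hypothetical; this is only an illustration of the correlation-versus-covariance choice, not part of the Enterprise Miner flow.

```sas
/* PCA on the correlation matrix (the default for proc princomp).
   Data set and variable names are hypothetical. */
proc princomp data=sasuser.prospect out=scores;
   var age income;   /* hypothetical numeric variables */
run;

/* To extract components from the covariance matrix instead
   (as in "Try It Yourself" below), add the cov option: */
proc princomp data=sasuser.prospect cov out=scores2;
   var age income;
run;
```

Both runs print the eigenvalues and eigenvectors, which correspond to the tables the Enterprise Miner node displays.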
To extract all of the principal components, leave these at their default settings. Close the node.

Note: Principal component analysis requires numeric variables. Therefore, the node automatically converts class variables into a set of dummy numeric indicator variables, one for each class, and then performs the PCA with these indicator variables in place of the class variables.

4. Run the Principal Components node (as it is now labeled) and choose to view the results. The results window will appear with the PrinComp tab active, showing a graphical representation of the eigenvalues, which are the variances associated with the principal components. The radio buttons allow you to make different types of eigenvalue plots. Such plots are sometimes used to determine whether the data set can be reduced to a smaller number of effective variables, as might be the case if only a few eigenvalues are large.

Click the Details... button. A table of eigenvalue information will be displayed when the Eigenvalues button is active (as it is by default). Select the Eigenvectors radio button. This gives a table of the loadings of the principal components, i.e., the degrees to which each variable contributes to them. Such a table can sometimes be used to assign interpretations to the principal components. Note that all of the dummy variables for each class variable appear in this table, including the dummy variables for the "missing" classes.

Examine the loadings for the first principal component. Do you see anything interesting in this pattern of loadings? Can you explain it? Examine the loadings for the second principal component. What kind of demographic variation does it appear to capture?

Try It Yourself

1. Perform a principal component analysis on the covariance matrix instead of the correlation matrix. Can you explain the resulting eigenvalues and eigenvectors?

2. Perform a more sophisticated imputation of the missing values using the Replacement node before performing PCA. How do the results change?

Created by Trevor Park on November 17, 2002