SELF-ORGANISING MAPPING NETWORKS (SOM) WITH SAS E-MINER

C. Sarada, K. Alivelu and Lakshmi Prayaga
Directorate of Oilseeds Research, Rajendranagar, Hyderabad
saradac@yahoo.com

Self-organising mapping networks (SOM) (Kohonen, 2001) are a family of neural networks that use unsupervised training: no target output is provided, and the network evolves until it stabilises. SOMs can be used for data visualisation, clustering, estimation, vector projection and a variety of other purposes. They are an effective modelling tool for visualising high-dimensional data: non-linear statistical relationships in the high-dimensional data are converted into simple geometric relationships among their image points on a low-dimensional display, usually a two-dimensional grid of nodes. The SOM was inspired by the way various human sensory impressions are neurologically mapped into the brain, such that spatial or other relationships between stimuli correspond to spatial relationships among the neurons.

The general architecture of a SOM consists of a set of input nodes, a set of output nodes and weight parameters. Each input node is fully connected to every output node via a variable connection, and a weight parameter is associated with each of these connections. The weights between the input nodes and the output nodes are changed iteratively during the learning phase until a termination criterion is satisfied. For each input vector, there is one associated winner node on the output map.

A simple SOM Algorithm

The output nodes come to represent the data by competing for representation. SOM mapping starts by initialising the weight vectors. A sample vector is then selected at random, and the map of weight vectors is searched to find the weight that best represents that sample. Each weight vector has neighbouring weights that are close to it on the grid. The winning weight is rewarded by becoming more like the selected sample vector, and its neighbours are rewarded in the same way. The number of neighbours, and how much each weight can learn, decreases over time. This whole process is repeated a large number of times, usually more than 1000 times.
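The competitive learning loop described above can be sketched in a few lines of Python (a minimal NumPy illustration of the general algorithm, not the SAS Enterprise Miner implementation; the grid size and decay schedules are illustrative assumptions):

```python
import numpy as np

def train_som(data, rows=2, cols=4, n_iter=1000, lr0=0.5, sigma0=1.5, seed=0):
    """Minimal SOM training loop: competitive learning with a
    Gaussian neighbourhood that shrinks over time."""
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    # Initialise the weight vectors randomly.
    weights = rng.random((rows, cols, n_features))
    # Grid coordinates of every output node, for neighbourhood distances.
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    for t in range(n_iter):
        # Pick a sample vector at random.
        x = data[rng.integers(len(data))]
        # Find the Best Matching Unit (closest weight vector).
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Learning rate and neighbourhood radius decay over time.
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 1e-3
        # Move the BMU and its neighbours towards the sample; nodes
        # farther from the BMU on the grid learn less.
        grid_dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-grid_dist2 / (2 * sigma ** 2))
        weights += lr * h[..., None] * (x - weights)
    return weights

# Usage: map 30 random 3-dimensional points onto a 2 x 4 grid.
data = np.random.default_rng(1).random((30, 3))
w = train_som(data)
print(w.shape)  # (2, 4, 3)
```

After training, each observation is assigned to the grid node whose weight vector is closest to it, which is how the cluster memberships arise.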
In sum, learning occurs in several steps and over many iterations:
1. Each node's weights are initialised.
2. A vector is chosen at random from the set of training data.
3. Every node is examined to determine which one's weights are most like the input vector. The winning node is commonly known as the Best Matching Unit (BMU).
4. The neighbourhood of the BMU is calculated. The number of neighbours decreases over time.
5. The winning weight is rewarded by becoming more like the sample vector. The neighbours also become more like the sample vector: the closer a node is to the BMU, the more its weights are altered; the farther away a neighbour is from the BMU, the less it learns.
6. Steps 2 to 5 are repeated for N iterations.

SOM vs. Classical Clustering Methods

Many studies have compared the SOM with classical clustering methods (Chen et al., 1995; Mangiameli et al., 1996; Waller et al., 1998). Chen et al. (1995) investigated the performance of SOM and hierarchical clustering methods and found that the hierarchical methods are influenced by the relative dispersion of the data. Mangiameli et al. (1996) tested the performance of the SOM neural network and seven hierarchical clustering methods on 252 data sets with various levels of imperfection, including data dispersion, outliers, irrelevant variables and non-uniform cluster densities. Their study revealed that the SOM is superior in accuracy and robustness to the other clustering methods. SOMs are conceptually easy to understand and are more efficient for grouping large data sets than small ones, for example in microarray experiments for gene-expression studies, where thousands of genes/observations are involved, or in grouping customers in the large business/banking sector. In SAS Enterprise Miner, the profiling portion is very similar to a clustering technique. However, there are limitations:
1. SOM networks can be prone to issues with missing data, as in all other neural network algorithms and regressions.
2. SOMs can produce differing results, because they produce maps from sampled data, so it may take a number of trials to obtain a map that is consistent with the same training data. They are also rather computationally intensive.

Illustration

Data: A lab experiment was conducted at the Directorate of Oilseeds Research, Hyderabad, to study the response of 29 safflower genotypes to water stress induced by PEG and to delineate the tolerant genotypes from the susceptible ones. Observations on germination percentage, days to minimum germination and seedling vigour were recorded at the different stress levels, and the genotypes that germinated under high-stress conditions were also recorded. The main aim of the experiment was to classify the genotypes into different groups based on these parameters. A dataset Stress.xls was created with the variables sno and genotype; the interval variables g3, g4, g5 (germination percentage at three stress levels) and s3, s4, s5 (the corresponding seedling vigour); the ordinal variables sd3, sd4, sd5 (days to maximum germination); and the binary variable
highstress (genotypes that germinated at high-stress conditions). Make a SAS dataset named stress in the SASUSER library.

Analysis of the Data with SOM in Enterprise Miner 6.1 - A Step-wise Procedure

Create the diagram SOM. Create the input file stress, assign the roles and levels for the variables, drag the input file to the diagram area and name it stress. Go to the Explore tab, then click and drag the SOM/Kohonen node to the diagram and connect the input file stress to the SOM/Kohonen node. When the SOM/Kohonen node is highlighted, its property sheet can be observed in the left panel.
The property sheet shows, among other things, the set of tables imported by this node, the set of tables exported by this node, information about the analysis, the variable properties, and the SOM/Kohonen method you want to use. Change the following options in the left panel: set Internal Standardization to the standardization option (if required for the data), and set the row size to 2 and the column size to 4 (a grid of 2 x 4 = 8 clusters). Then right-click the SOM/Kohonen node and select Run, which gives the following window.
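The Internal Standardization option rescales each input so that variables measured on different scales (germination percentages versus vigour scores) contribute equally to the distance calculations. Its effect can be illustrated outside SAS (a Python sketch; the column names follow the stress dataset, but the values are made up):

```python
import numpy as np

# Hypothetical readings for three genotypes
# (columns: g3, g4, g5, s3, s4, s5 -- values are illustrative only).
X = np.array([[90., 80., 40., 12.5, 10.1, 4.2],
              [85., 70., 30., 11.0,  8.7, 3.1],
              [60., 45., 10.,  7.2,  5.0, 0.9]])

# Standardisation: centre each column and divide by its standard
# deviation, so no single variable dominates the Euclidean
# distances the SOM uses to find the Best Matching Unit.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.mean(axis=0).round(6))  # each column now has mean ~0
print(Z.std(axis=0).round(6))   # ... and standard deviation ~1
```

Without this step, the germination percentages (on a 0-100 scale) would dominate the seedling-vigour variables in every distance computation.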
Click the Results tab; only the main result windows are discussed here. The Map window gives a topological mapping of all the input attributes to the clusters. The following figure shows the different attributes available for viewing the topological map; selecting the Nearest Cluster option gives the following map. To view the table, click View > Table.
The SOM segment ID gives the cluster number: for example, SOM ID 1:1 is cluster 1 and 2:1 is cluster 5. From the figure above it can be observed that clusters 1 and 3 are distinct from the others. The Mean Statistics window gives the cluster-wise means of the variables, and the summary statistics of the clusters (min, max, standard deviation) can be seen in the Analysis Statistics window. To study the properties of each cluster in more detail, we can use the Segment Profile node.
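The cluster-wise means and summary statistics reported in these windows are ordinary grouped statistics over the segment assignments; in Python terms (an illustrative sketch with made-up values, not the Enterprise Miner implementation):

```python
import pandas as pd

# Toy results table: SOM segment id plus two of the input variables
# (values are illustrative, not the real stress data).
df = pd.DataFrame({
    "SOM_segment": [1, 1, 2, 2, 3],
    "g5": [40.0, 35.0, 10.0, 15.0, 60.0],
    "s5": [4.0, 3.5, 1.0, 1.2, 5.5],
})

# Cluster-wise means, as in the Mean Statistics window.
print(df.groupby("SOM_segment").mean())

# Per-cluster min, max and standard deviation, as in the
# Analysis Statistics window.
print(df.groupby("SOM_segment").agg(["min", "max", "std"]))
```

Comparing these per-cluster summaries against the overall means is what reveals which clusters, such as 1 and 3 above, stand apart from the rest.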
From the Assess tab, click and drag the Segment Profile icon to the diagram area, connect it to the SOM/Kohonen node, then right-click and run. The Segment Profile node results output is presented below.

The segment profile gives the frequency of each cluster as a pie chart. The Profile window displays a lattice, or grid, of plots comparing the distributions of the identified and report variables for both the segment and the total observations. Each row represents a single cluster. The far-left margin identifies the cluster/segment, its count, and its percentage of the total observations. By default, the rows are sorted in ascending size order from top to bottom. You can also sort rows alphanumerically by segment name by right-clicking to get the edit menu and selecting Sort Segments. Using the edit menu, you can also change the response-variable format to the count or the percentage of the entire data, and expand a graphic.

Class and interval variables are represented as follows. A class variable is displayed as two nested pie charts consisting of two concentric rings: the inner ring represents the distribution of the total observations, and the outer ring represents the distribution for the given segment. An interval variable is displayed as a histogram: the blue shaded region represents the within-segment distribution, and the red outline represents the population distribution. The height of the histogram bars can be scaled by count or by percentage of the segment population. When you use percentage, the view shows the relative difference between the segment and the population; when you use count, it shows the absolute difference between the segment and the total observations.

The Output window contains the variable summary, frequency information for each cluster, and decision-tree importance profiles, which display the logworth or importance statistics for the variables identified as factors that distinguish the segment from the total. If you scroll through the Segment Profile node's output window, the worth statistic and rank of each variable are provided cluster/segment-wise. In the figure above it can be seen that the variable g5 contributed most to the formation of cluster/segment 7. The same information is represented as a bar diagram in the Variable Worth window.

References

Chen, S.K., Mangiameli, P. and West, D. (1995). The comparative ability of self-organizing neural networks to define cluster structure. Omega, International Journal of Management Science, 23, 271-279.

Mangiameli, P., Chen, S.K. and West, D. (1996). A comparison of SOM neural network and hierarchical clustering methods. European Journal of Operational Research, 93, 402-417.

Collica, R.S. (2007). CRM Segmentation and Clustering Using SAS Enterprise Miner. SAS Publishing.

SAS Enterprise Miner 6.1 Help Documentation.