SELF-ORGANISING MAPPING NETWORKS (SOM) WITH SAS E-MINER




C. Sarada, K. Alivelu and Lakshmi Prayaga
Directorate of Oilseeds Research, Rajendranagar, Hyderabad
saradac@yahoo.com

Self-organising mapping networks (SOM) (Kohonen, 2001) are a specific family of neural networks that use unsupervised training. In unsupervised training no target output is provided, and the network evolves until it stabilises. SOM can be used for data visualisation, clustering, estimation, vector projection and a variety of other purposes. It is an effective modelling tool for visualising high-dimensional data: non-linear statistical relationships between high-dimensional data are converted into simple geometric relationships of their image points on a low-dimensional display, usually a two-dimensional grid of nodes. The SOM is inspired by the way in which various human sensory impressions are neurologically mapped into the brain, such that spatial or other relationships between stimuli correspond to spatial relationships among the neurons.

A general architecture of SOM consists of a set of input nodes, output nodes and weight parameters. Each input node is fully connected to every output node via a variable connection, and a weight parameter is associated with each of these connections. The weights between the input nodes and output nodes are changed iteratively during the learning phase until a termination criterion is satisfied. For each input vector, there is one associated winner node on the output map.

A simple SOM Algorithm

The data points in a data set organise themselves by competing for representation on the map. The SOM mapping starts by initializing the weight vectors. A sample vector is then selected at random, and the map of weight vectors is searched to find the weight that best represents that sample. Each weight vector has neighbouring weights that are close to it. The weight that is chosen is rewarded by becoming more like the randomly selected sample vector. The neighbours of that weight are also rewarded by becoming more like the chosen sample vector. As training proceeds, both the number of neighbours and the amount each weight can learn decrease over time. This whole process is repeated a large number of times, usually more than 1000.
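The competitive update just described can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the procedure used inside SAS Enterprise Miner: the function names `train_som` and `best_matching_unit`, the linear decay of the learning rate and radius, the Gaussian neighbourhood weighting and the default parameter values are all choices made here for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(weights, key=lambda node: math.dist(weights[node], x))


def train_som(data, rows, cols, n_iter=1000, lr0=0.5, seed=0):
    """Train a small SOM on a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector per output node, initialised at random in [0, 1).
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                    # learning rate shrinks over time (assumed linear)
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighbourhood shrinks over time (assumed linear)
        x = rng.choice(data)                       # pick a sample vector at random
        bmu = best_matching_unit(weights, x)       # winning node for this sample
        for node, w in weights.items():
            grid_dist = math.dist(node, bmu)
            if grid_dist <= radius:
                # Nodes closer to the winner on the grid are pulled harder.
                influence = math.exp(-grid_dist ** 2 / (2.0 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * influence * (x[i] - w[i])
    return weights


# Hypothetical toy data: two tight groups in two dimensions.
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
som = train_som(data, rows=2, cols=4)
print(best_matching_unit(som, [0.0, 0.05]))
```

The winning node moves the full `lr` fraction of the way towards the sample, while its neighbours move by a smaller fraction that falls off with grid distance, which mirrors the "reward" described in the text.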

In sum, learning occurs in several steps and over many iterations:
1. Each node's weights are initialized.
2. A vector is chosen at random from the set of training data.
3. Every node is examined to calculate which one's weights are most like the input vector. The winning node is commonly known as the Best Matching Unit (BMU).
4. The neighbourhood of the BMU is calculated. The number of neighbours decreases over time.
5. The winning weight is rewarded by becoming more like the sample vector. The neighbours also become more like the sample vector. The closer a node is to the BMU, the more its weights are altered; the farther a neighbour is from the BMU, the less it learns.
6. Repeat from step 2 for N iterations.

SOM vs. Classical Clustering methods

Many studies have compared the SOM with the classical clustering methods (Chen et al., 1995; Mangiameli et al., 1996; Waller et al., 1998). Chen et al. (1995) investigated the performance of SOM and hierarchical clustering methods and found that hierarchical methods are influenced by the relative dispersion of the data. Mangiameli et al. (1996) tested the performance of the SOM neural network and seven hierarchical clustering methods on 252 data sets with various levels of imperfection, including data dispersion, outliers, irrelevant variables and non-uniform cluster densities. Their study revealed that SOM is superior in accuracy and robustness to the other clustering methods. SOMs are conceptually easy to understand and are more efficient for grouping large datasets than small ones, for example microarray experiments for gene expression studies, where thousands of genes or observations are involved, or the grouping of customers in large business and banking sectors. In SAS Enterprise Miner, the profiling portion is very similar to the clustering technique. However, there are limitations:
1. SOM networks can be prone to issues with missing data, as are all other neural network algorithms and regressions.
2. SOMs can produce differing results because they build maps from sampled data, so it may take a number of trials to obtain a map that is consistent for the same training data.
They are also rather computationally intensive.

Illustration

Data: A lab experiment was conducted at the Directorate of Oilseeds Research, Hyderabad, to study the response of 29 safflower genotypes to water stress induced by PEG and to delineate the tolerant genotypes from the susceptible ones. Observations on germination percentage, days to minimum germination and seedling vigour were recorded at different stress levels. Whether each genotype germinated under high stress conditions was also recorded. The main aim of the experiment is thus to classify the genotypes into different groups based on these parameters. A dataset, Stress.xls, has been created with the following variables:
sno, genotype
Interval variables: g3, g4, g5 (germination percentage at 3 different stress levels) and s3, s4, s5 (corresponding seedling vigour)
Ordinal variables: sd3, sd4, sd5 (days to maximum germination)
Binary variable: highstress (genotypes germinated at high stress conditions)
Make a SAS dataset file named stress in the SASUSER library.

Analysis of data with SOM with Enterprise Miner 6.1 - A step-wise Procedure:
1. Create the diagram SOM.
2. Create the input file stress and assign the roles and levels for the variables.
3. Drag the input file to the diagram area and name it stress.
4. Go to the Explore tab, then click and drag the SOM/Kohonen node to the diagram and connect the input file named stress to the SOM/Kohonen node.
5. Highlight the SOM/Kohonen node; its property sheet can be seen in the left panel.

The property sheet shows the set of tables imported by this node, the set of tables exported by this node, information about the analysis and the variable properties. It is also where the SOM/Kohonen method to be used is selected.

Change the options of the SOM/Kohonen node in the left panel as follows: set Internal Standardization to the Standardization option (if required for the data), and set the row size to 2 and the column size to 4 (a grid size of 2 x 4 = 8 clusters). Then right-click the SOM/Kohonen node and select Run, which gives the following window.
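To make the two settings above concrete, the sketch below shows one common meaning of standardisation (rescaling each variable to mean 0 and standard deviation 1) and how the grid dimensions determine the number of clusters. It is an assumption that the Standardization option behaves like this z-score; the exact formula used by Enterprise Miner is not stated in this text. The function name `standardise` and the sample values are hypothetical.

```python
import statistics


def standardise(column):
    """Rescale one variable to mean 0 and standard deviation 1 (z-score)."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample standard deviation (an assumption)
    if sd == 0:
        return [0.0 for _ in column]  # a constant column carries no information
    return [(value - mean) / sd for value in column]


# Hypothetical values: one percentage-scale variable and one small-scale variable.
germination = [90.0, 60.0, 30.0, 75.0]
vigour = [1.2, 0.8, 0.3, 1.0]
z_germination = standardise(germination)
z_vigour = standardise(vigour)

# The map has one node per grid cell, so a 2 x 4 grid allows 8 clusters.
rows, cols = 2, 4
n_clusters = rows * cols
print(n_clusters)  # 8
```

Without some rescaling of this kind, a variable measured in percentages would dominate the distance to each node's weights simply because its values are numerically larger.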

Click on the Results tab to view the results. Only the main result windows are discussed here. The Map window gives a topological mapping of all the input attributes to the clusters. The following figure gives the different attributes for viewing the topological map. Selecting the Nearest Cluster option gives the following map. To view the table, click the View tab and then Table.

The SOM segment ID gives the cluster number; for example, SOM segment ID 1:1 = cluster 1 and 2:1 = cluster 5. From the above figure it can be observed that cluster 1 and cluster 3 are distinct from the others. The Mean Statistics window gives the cluster-wise means of the variables. The summary statistics of the clusters (minimum, maximum, standard deviation) can be seen in the Analysis Statistics window. To study the properties of each cluster in detail, we can use the Segment Profile node.
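The relation between a segment ID and its cluster number, and the idea of cluster-wise means, can be illustrated with a short sketch. It assumes that clusters are numbered row by row across the 2 x 4 grid, which is consistent with the two examples given above (1:1 = 1 and 2:1 = 5) but is not otherwise confirmed here. The function names and the sample records are hypothetical and are not the safflower data.

```python
from collections import defaultdict
from statistics import mean


def cluster_number(segment_id, n_cols=4):
    """Convert a 'row:column' segment ID to a cluster number, counting row by row."""
    row, col = (int(part) for part in segment_id.split(":"))
    return (row - 1) * n_cols + col


def cluster_means(records, variable):
    """Average one variable within each cluster."""
    groups = defaultdict(list)
    for record in records:
        groups[cluster_number(record["segment"])].append(record[variable])
    return {cluster: mean(values) for cluster, values in sorted(groups.items())}


# Hypothetical records for illustration only.
records = [
    {"segment": "1:1", "g5": 80.0},
    {"segment": "1:1", "g5": 70.0},
    {"segment": "2:1", "g5": 20.0},
]
print(cluster_number("1:1"), cluster_number("2:1"))  # 1 5
print(cluster_means(records, "g5"))                  # {1: 75.0, 5: 20.0}
```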

Click the Assess tab, drag the Segment Profile icon to the diagram area and connect it to the SOM/Kohonen node, then right-click and select Run. The results of the Segment Profile node are presented below.

The segment profile gives the frequency of each cluster as a pie chart. The Profile window displays a lattice, or grid, of plots comparing the distribution of the identified and report variables for both the segment and the total number of observations. Each row represents a single cluster. The far left margin identifies the cluster/segment, its count, and its percentage of the total observations. By default, the rows are sorted in ascending size order from top to bottom. You can also sort rows alphanumerically by segment name: right-click to open the edit menu and select Sort Segments. The edit menu can also be used to change the response variable format to the count or the percent of the entire data, and to expand a graphic.

Class and interval variables are represented as follows:
Class variable: displayed as two nested pie charts that consist of two concentric rings. The inner ring represents the distribution of the total observations. The outer ring represents the distribution for the given segment.
Interval variable: displayed as a histogram. The blue shaded region represents the within-segment distribution. The red outline represents the population distribution. The height of the histogram bars can be scaled by count or by percentage of the segment population. When percentage is used, the view shows the relative difference between the segment and the population. When count is used, the view shows the absolute difference between the segment and the total observations.

The Output window contains the variable summary and the frequency information for each cluster. The Decision Tree Importance Profiles display the logworth or importance statistics for the variables that have been identified as factors that distinguish the segment from the total. If you scroll through the Segment Profile node's Output window, each set of variables is listed cluster/segment-wise, with the worth statistic and rank of each variable. In the above figure it can be seen that the variable g5 contributed most to the formation of cluster/segment 7. The same is represented as a bar diagram in the Variable Worth window.

References

Chen, S.K., Mangiameli, P. and West, D. (1995). The comparative ability of self-organizing neural networks to define cluster structure. Omega, Int. J. Manage. Sci., 23, 271-279.
Mangiameli, P., Chen, S.K. and West, D. (1996). A comparison of SOM neural network and hierarchical clustering methods. European Journal of Operational Research, 93, 402-417.
Collica, R.S. (2007). CRM Segmentation and Clustering Using SAS Enterprise Miner. SAS Publishing.
SAS Enterprise Miner 6.1 Help Documentation.