CITY UNIVERSITY OF HONG KONG 香港城市大學. Self-Organizing Map: Visualization and Data Handling 自組織神經網絡：可視化和數據處理




CITY UNIVERSITY OF HONG KONG
香港城市大學

Self-Organizing Map: Visualization and Data Handling
自組織神經網絡：可視化和數據處理

Submitted to Department of Electronic Engineering 電子工程學系
in Partial Fulfillment of the Requirements for the Degree of Master of Philosophy 哲學碩士學位

by Xu Yang 徐楊

October 2010 二零一零年十月

Abstract

Data visualization, the graphical presentation of data, has been widely applied in industrial areas such as signal compression, pattern recognition, and image processing. The Self-Organizing Map (SOM), a widely used visualization method proposed by Kohonen, is an unsupervised learning network that visualizes high-dimensional data on a low-dimensional map. SOM presents the data topology by assigning each datum to the neuron with the highest similarity, so that data with similar features are mapped onto adjacent neurons. The traditional SOM uses a uniform grid map, so some input data are projected onto the same neuron. This is not an effective way to preserve the data relationships between clusters or within one cluster. Having to pre-define the map size is another disadvantage of SOM.

In this thesis, a new algorithm named Polar Self-Organizing Map (PolSOM) is proposed. PolSOM is constructed on a 2-D polar map with two variables, radius and angle, which represent data weight and feature respectively. Compared with the traditional algorithms, which project data onto a Cartesian map using Euclidean distance as the only variable, PolSOM can reveal the precise data topology and measure the intra-cluster and inter-cluster density with a new clustering criterion, the synthetical cluster density (SCD). In PolSOM, not only are similar data grouped together, but data characteristics are also reflected by their positions on the map.

PolSOM adopts the conventional hard assignment in the training process. In this thesis, a new variant of the PolSOM algorithm, named Probabilistic Polar Self-Organizing Map (PPoSOM), is also proposed. Instead of hard assignment, PPoSOM employs soft assignment, in which the assignment of a datum to a neuron is based on a probabilistic function. It is developed to enhance the visualization performance. Notably, the principled weight-updating rule obtained for PPoSOM greatly improves the visualization performance as measured by SCD.

SOM usually considers the whole data set in one go, so the representative data are not well utilized. The learning process is found to be rigid and time-consuming when dealing with large data sets. In this thesis, we propose applying a density-based data reduction method as pre-processing: the method extracts representative data before the SOM training. It is found to be particularly useful for reducing the overall computational time.

Finally, an application to World University Rankings is included in this thesis. A new perspective on studying the nature of different universities is introduced. SOM is employed to provide estimates for the missing scores of some universities. Principal component analysis (PCA) is used to analyze the underlying nature of each university. The results and analyses show that the underlying characteristics of a university can be determined effectively by studying how its ranking varies with the change of weights on different features. Our study shows that the proposed approach is more effective than simply relying on a linear weighted sum ranking system.
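The abstract describes the traditional SOM as assigning each datum to the neuron with the highest similarity, so that similar data land on adjacent neurons. As an illustration only, that hard-assignment training loop might be sketched as follows. This is a minimal sketch of a standard Kohonen SOM on a rectangular grid, not the PolSOM or PPoSOM algorithms proposed in the thesis; the function names, the linear decay schedules, and the Gaussian neighbourhood are all assumptions chosen for brevity.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights, datum):
    """Hard assignment: grid position of the neuron closest to the datum."""
    return min(weights, key=lambda pos: squared_distance(weights[pos], datum))


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols SOM; returns {(row, col): weight_vector}.

    Hypothetical helper: learning rate and neighbourhood width decay
    linearly, which is one common choice among several.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for datum in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            sigma = sigma0 * (1.0 - frac) + 1e-3  # keep the width positive
            br, bc = best_matching_unit(weights, datum)
            for (r, c), w in weights.items():
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                # Pull the neuron towards the datum, scaled by its grid
                # distance from the winning neuron.
                for k in range(dim):
                    w[k] += lr * h * (datum[k] - w[k])
            step += 1
    return weights


# Toy data (made up for illustration): two well-separated groups in 3-D.
group_a = [[0.10, 0.10, 0.10], [0.15, 0.05, 0.10], [0.05, 0.10, 0.20]]
group_b = [[0.90, 0.90, 0.80], [0.85, 0.95, 0.90], [0.90, 0.80, 0.95]]
som = train_som(group_a + group_b, rows=4, cols=4)
for datum in group_a + group_b:
    print(datum, "->", best_matching_unit(som, datum))
```

Because several data can share one best-matching neuron on a fixed grid, the printed positions also illustrate the limitation the abstract attributes to the traditional SOM.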

Contents

Abstract
Acknowledgements
List of Figures
List of Tables

Chapter 1 Introduction
  1.1 Motivations
  1.2 Contributions of This Thesis
  1.3 Organization of This Thesis
  1.4 List of Published/Submitted Papers

Chapter 2 Data Visualization
  2.1 Introduction
  2.2 Polar Self-Organizing Map
    2.2.1 Background of Sammon's Mapping and ViSOM Algorithm
      2.2.1.1 Sammon's Nonlinear Mapping
      2.2.1.2 Visualization-induced SOM (ViSOM)
    2.2.2 Polar SOM (PolSOM)
      2.2.2.1 PolSOM
      2.2.2.2 Synthetical Cluster Density (SCD)
    2.2.3 Property of PolSOM
    2.2.4 Experimental Results
      2.2.4.1 Two Six-Dimensional Synthetic Data Sets
      2.2.4.2 Iris Data Set
      2.2.4.3 Wine Data Set
  2.3 Probabilistic Polar SOM (PPoSOM)
    2.3.1 PPoSOM
    2.3.2 Experimental Results
      2.3.2.1 Two Six-Dimensional Synthetic Data Sets
      2.3.2.2 Iris Data Set
      2.3.2.3 Wine Data Set
      2.3.2.4 Wisconsin Breast Cancer Data Set
      2.3.2.5 2009 World's University Rankings Data Set
  2.4 Conclusion

Chapter 3 Efficient SOM Learning Scheme
  3.1 Introduction
  3.2 Efficient SOM Learning Algorithm
  3.3 Experimental Results
  3.4 Conclusion

Chapter 4 Data Handling on World University Rankings
  4.1 Introduction
  4.2 Ranking Methodologies
  4.3 Data Handling on World University Rankings
    4.3.1 Estimates for the Missing Data
    4.3.2 Analyses on Universities
  4.4 Results and Analyses
  4.5 Conclusion

Chapter 5 General Conclusion and Future Work
  5.1 General Conclusion
  5.2 Future Work

Bibliography

List of Figures

Figure 1.1 Architecture of the SOM network.
Figure 2.1 The topology preservation property of PolSOM.
Figure 2.2 Visualization maps of the first 6-D synthetic data set. (a) PolSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.3 Visualization maps of the second 6-D synthetic data set. (a) PolSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.4 Visualization maps of the Iris data set. (a) PolSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.5 Transpositional visualization map by PolSOM for Iris data set.
Figure 2.6 Visualization maps of the Wine data set. (a) PolSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.7 Transpositional visualization map by PolSOM for Wine data set.
Figure 2.8 Visualization maps of the first 6-D synthetic data set. (a) PPoSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.9 Visualization maps of the second 6-D synthetic data set. (a) PPoSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.10 Visualization maps of the Iris data set. (a) PPoSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.11 Transpositional visualization map by PPoSOM for Iris data set.
Figure 2.12 Visualization maps of the Wine data set. (a) PPoSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.13 Transpositional visualization map by PPoSOM for Wine data set.
Figure 2.14 Visualization maps of the Wisconsin breast cancer data set. (a) PPoSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.15 Visualization maps of the 2009 Times Higher Education World's University Rankings data set. (a) PPoSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.16 Visualization maps of the 2009 Times Higher Education World's University Rankings data set. (a) PPoSOM (Sample 1). (b) PPoSOM (Sample 2). (c) Sammon's mapping (Sample 1). (d) Sammon's mapping (Sample 2). (e) SOM (Sample 1). (f) SOM (Sample 2). (g) ViSOM (Sample 1). (h) ViSOM (Sample 2).
Figure 3.1 The influence of different ratios on the classification accuracy.
Figure 4.1 Prediction of missing values of input data by using SOM.
Figure 4.2 An example of the application of SOM for predicting the missing values.
Figure 4.3 Principal components obtained from 121 universities.
Figure 4.4 The rankings of 10 most popular universities obtained by 121 universities analyses.
Figure 4.5 Estimating missing scores for Delft University of Technology by SOM.
Figure 4.6 Principal components obtained from 285 universities.
Figure 4.7 The rankings of 10 most popular universities obtained by 285 universities analyses.
Figure 4.8 The rankings of universities in China.
Figure 4.9 The rankings of the universities in Asia.
Figure 4.10 The rankings of the universities in Australia.
Figure 4.11 The rankings of the universities in Europe.
Figure 4.12 The rankings of the public universities in USA.
Figure 4.13 The rankings of the private universities in USA.
Figure 5.1 The visualization maps by PolSOM.

List of Tables

Table 2.1 Part of the 2009 Times Higher Education World's University Rankings data set.
Table 2.2 Comparisons of mapping by using SCD.
Table 3.1 Evaluated data sets.
Table 3.2 Comparisons of average training time (seconds), accuracy and QE of three data sets.
Table 4.1 The first two principal components of 121 universities data set.
Table 4.2 The correlation between the two principal components and the original variables in 121 universities data set.
Table 4.3 The five universities in HEEACT rankings.
Table 4.4 Estimated scores of Delft University of Technology in HEEACT rankings.
Table 4.5 The first two principal components of 285 universities data set.
Table 4.6 The correlation between the two principal components and the original variables in 285 universities data set.
Table 4.7 Universities in Figure 4.9.
Table 4.8 Universities in Figure 4.10.
Table 4.9 Universities in Figure 4.11.
Table 4.10 Universities in Figure 4.12 and 4.13.