CITY UNIVERSITY OF HONG KONG 香港城市大學. Self-Organizing Map: Visualization and Data Handling 自組織神經網絡 : 可視化和數據處理

Submitted to
Department of Electronic Engineering 電子工程學系
in Partial Fulfillment of the Requirements
for the Degree of Master of Philosophy 哲學碩士學位

by
Xu Yang 徐楊

October 2010 二零一零年十月

Abstract

Data visualization, the graphical presentation of data, has been widely applied in industrial areas such as signal compression, pattern recognition, and image processing. The Self-Organizing Map (SOM), a widely used visualization method proposed by Kohonen, is an unsupervised learning network that visualizes high-dimensional data on a low-dimensional map. SOM presents the data topology by assigning each datum to the neuron with the highest similarity, so that data with similar features are mapped onto adjacent neurons. The traditional SOM uses a uniform grid map, so some input data are projected onto the same neuron. This is not an effective way of preserving the relationships between clusters or within a single cluster. Having to pre-define the map size is another disadvantage of SOM.

In this thesis, a new algorithm named Polar Self-Organizing Map (PolSOM) is proposed. PolSOM is constructed on a 2-D polar map with two variables, radius and angle, which represent data weight and feature respectively. Compared with traditional algorithms, which project data onto a Cartesian map using Euclidean distance as the only variable, PolSOM can manifest the precise data topology and obtain the intra-cluster density and inter-cluster density through a new clustering criterion, synthetical cluster density (SCD). In PolSOM, not only are similar data grouped together, but data characteristics are also reflected by their positions on the map.

PolSOM adopts the conventional hard assignment in its training process. In this thesis, a new variant of the PolSOM algorithm, named Probabilistic Polar Self-Organizing Map (PPoSOM), is also proposed. Instead of hard assignment, PPoSOM employs soft assignment, in which the assignment of a datum to a neuron is based on a probabilistic function. It is developed to enhance the visualization performance. Notably, the principled weight-updating rule obtained for PPoSOM greatly improves the visualization performance as measured by SCD.

SOM usually considers the whole data set in one go, so representative data are not well utilized, and the learning process is found to be rigid and time-consuming for large data sets. In this thesis, we propose to apply a density-based data reduction method as pre-processing: the proposed method first extracts representative data, which are then used for SOM training. This method is found to be particularly useful in reducing the overall computational time.

Finally, an application to World University Rankings is included in this thesis. A new perspective on studying the nature of different universities is introduced. SOM is employed to estimate the missing scores of some universities, and principal component analysis (PCA) is used to analyze the underlying nature of each university. The results and analyses show that the underlying characteristics of a university can be determined effectively by studying how its ranking varies as the weights on different features change. Our study shows that the proposed approach is more effective than simply relying on a linear weighted-sum ranking system.
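The distinction between hard and soft assignment drawn in the abstract can be illustrated with a minimal sketch. This is not the PPoSOM weight-updating rule: the abstract does not state the probabilistic function, so the softmax over negative squared Euclidean distances, the `temperature` parameter, and all function names below are assumptions chosen for illustration. The sketch also works on plain weight vectors rather than on the polar map.

```python
import math


def squared_distance(x, w):
    """Squared Euclidean distance between a datum and a neuron weight vector."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def hard_assignment(x, weights):
    """Hard assignment: return the index of the single most similar neuron."""
    distances = [squared_distance(x, w) for w in weights]
    return distances.index(min(distances))


def soft_assignment(x, weights, temperature=1.0):
    """Soft assignment: return one probability per neuron.

    Assumed form (not taken from the thesis): a softmax over negative
    squared distances. The minimum distance is subtracted before
    exponentiation for numerical stability; this does not change the result.
    """
    distances = [squared_distance(x, w) for w in weights]
    d_min = min(distances)
    scores = [math.exp(-(d - d_min) / temperature) for d in distances]
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical toy data: three neurons with 2-D weights and one datum.
weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
x = [0.9, 0.8]

winner = hard_assignment(x, weights)
probabilities = soft_assignment(x, weights)
```

Under hard assignment only the `winner` neuron is associated with the datum, whereas under soft assignment every neuron receives a share of it in proportion to `probabilities`. A lower `temperature` makes the soft assignment approach the hard one.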

Contents

Abstract
Acknowledgements
List of Figures
List of Tables

Chapter 1 Introduction
  1.1 Motivations
  1.2 Contributions of This Thesis
  1.3 Organization of This Thesis
  1.4 List of Published/Submitted Papers

Chapter 2 Data Visualization
  2.1 Introduction
  2.2 Polar Self-Organizing Map
    2.2.1 Background of Sammon's Mapping and ViSOM Algorithm
      2.2.1.1 Sammon's Nonlinear Mapping
      2.2.1.2 Visualization-induced SOM (ViSOM)
    2.2.2 Polar SOM (PolSOM)
      2.2.2.1 PolSOM
      2.2.2.2 Synthetical Cluster Density (SCD)
    2.2.3 Property of PolSOM
    2.2.4 Experimental Results
      2.2.4.1 Two Six-Dimensional Synthetic Data Sets
      2.2.4.2 Iris Data Set
      2.2.4.3 Wine Data Set
  2.3 Probabilistic Polar SOM (PPoSOM)
    2.3.1 PPoSOM
    2.3.2 Experimental Results
      2.3.2.1 Two Six-Dimensional Synthetic Data Sets
      2.3.2.2 Iris Data Set
      2.3.2.3 Wine Data Set
      2.3.2.4 Wisconsin Breast Cancer Data Set
      2.3.2.5 2009 World's University Rankings Data Set
  2.4 Conclusion

Chapter 3 Efficient SOM Learning Scheme
  3.1 Introduction
  3.2 Efficient SOM Learning Algorithm
  3.3 Experimental Results
  3.4 Conclusion

Chapter 4 Data Handling on World University Rankings
  4.1 Introduction
  4.2 Ranking Methodologies
  4.3 Data Handling on World University Rankings
    4.3.1 Estimates for the Missing Data
    4.3.2 Analyses on Universities
  4.4 Results and Analyses
  4.5 Conclusion

Chapter 5 General Conclusion and Future Work
  5.1 General Conclusion
  5.2 Future Work

Bibliography

List of Figures

Figure 1.1 Architecture of the SOM network.
Figure 2.1 The topology preservation property of PolSOM.
Figure 2.2 Visualization maps of the first 6-D synthetic data set. (a) PolSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.3 Visualization maps of the second 6-D synthetic data set. (a) PolSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.4 Visualization maps of the Iris data set. (a) PolSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.5 Transpositional visualization map by PolSOM for Iris data set.
Figure 2.6 Visualization maps of the Wine data set. (a) PolSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.7 Transpositional visualization map by PolSOM for Wine data set.
Figure 2.8 Visualization maps of the first 6-D synthetic data set. (a) PPoSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.9 Visualization maps of the second 6-D synthetic data set. (a) PPoSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.10 Visualization maps of the Iris data set. (a) PPoSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.11 Transpositional visualization map by PPoSOM for Iris data set.
Figure 2.12 Visualization maps of the Wine data set. (a) PPoSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.13 Transpositional visualization map by PPoSOM for Wine data set.
Figure 2.14 Visualization maps of the Wisconsin breast cancer data set. (a) PPoSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.15 Visualization maps of the 2009 Times Higher Education World's University Rankings data set. (a) PPoSOM. (b) Sammon's mapping. (c) SOM. (d) ViSOM.
Figure 2.16 Visualization maps of the 2009 Times Higher Education World's University Rankings data set. (a) PPoSOM (Sample 1). (b) PPoSOM (Sample 2). (c) Sammon's mapping (Sample 1). (d) Sammon's mapping (Sample 2). (e) SOM (Sample 1). (f) SOM (Sample 2). (g) ViSOM (Sample 1). (h) ViSOM (Sample 2).
Figure 3.1 The influence of different ratios on the classification accuracy.
Figure 4.1 Prediction of missing values of input data by using SOM.
Figure 4.2 An example of the application of SOM for predicting the missing values.
Figure 4.3 Principal components obtained from 121 universities.
Figure 4.4 The rankings of 10 most popular universities obtained by 121 universities analyses.
Figure 4.5 Estimating missing scores for Delft University of Technology by SOM.
Figure 4.6 Principal components obtained from 285 universities.
Figure 4.7 The rankings of 10 most popular universities obtained by 285 universities analyses.
Figure 4.8 The rankings of universities in China.
Figure 4.9 The rankings of the universities in Asia.
Figure 4.10 The rankings of the universities in Australia.
Figure 4.11 The rankings of the universities in Europe.
Figure 4.12 The rankings of the public universities in USA.
Figure 4.13 The rankings of the private universities in USA.
Figure 5.1 The visualization maps by PolSOM.

List of Tables

Table 2.1 Part of the 2009 Times Higher Education World's University Rankings Data Set.
Table 2.2 Comparisons of mapping by using SCD.
Table 3.1 Evaluated data sets.
Table 3.2 Comparisons of average training time (seconds), accuracy and QE of three data sets.
Table 4.1 The first two principal components of 121 universities data set.
Table 4.2 The correlation between the two principal components and the original variables in 121 universities data set.
Table 4.3 The five universities in HEEACT rankings.
Table 4.4 Estimated scores of Delft University of Technology in HEEACT rankings.
Table 4.5 The first two principal components of 285 universities data set.
Table 4.6 The correlation between the two principal components and the original variables in 285 universities data set.
Table 4.7 Universities in Figure 4.9.
Table 4.8 Universities in Figure 4.10.
Table 4.9 Universities in Figure 4.11.
Table 4.10 Universities in Figure 4.12 and 4.13.
