Improved SOM-Based High-Dimensional Data Visualization Algorithm

Computer and Information Siene; Vol. 5, No. 4; 2012 ISSN 1913-8989 E-ISSN 1913-8997 Published by Canadian Center of Siene and Eduation Improved SOM-Based High-Dimensional Data Visualization Algorithm Wang Zhisheng 1 & Xu Xiaobing 1 1 Management Siene and Engineering, Business Shool, University of Shanghai for Siene and Tehnology, Shanghai, China Correspondene: Wang Zhisheng, Management Siene and Engineering, Business Shool, University of Shanghai for Siene and Tehnology, Shanghai 200093, China. E-mail: greatwnag@163.om Reeived: May 10, 2012 Aepted: May 21, 2012 Online Published: June 15, 2012 doi:10.5539/is.v5n4p110 URL: http://dx.doi.org/10.5539/is.v5n4p110 Supported by Leading Aademi Disipline Projet of Shanghai Muniipal Government, Projet Number: S30504, and the Innovation Fund Projet for Graduate Student of Shanghai Projet Number: JWCXSL1102 Abstrat In this paper, a new high-dimensional data visualization algorithm based on the Self-Organizing Map (SOM) is proposed. It is named TDSOM (three-dimensional self-organizing map) to desribe its speial harateristis. TDSOM trains the high-dimensional data with SOM network and projets it into partiular point sets in the three-dimensional oordinate system. In the three-dimensional oordinate system, the x axis represents attributes of the original data set; the y axis represents the weight of eah attribute; the z axis represents different ategories of the mapping result. The most important is that researhers an wath the three-dimensional model from different viewpoints by rotating it and gain some interesting patterns. Through the experiment, TDSOM is proved to be muh more aurate and more analytial than the traditional methods in displaying the high-dimensional data. The main innovation of the new TDSOM algorithm is the presentation of large data in three-dimensional oordinate system whih provides a muh wider view than the two-dimensional one. What s more, users are able to disover some interesting patterns aording to their own researh areas through the model. The algorithm an be widely applied in areas suh as data mining, pattern reognition and so on. Keywords: self-organizing map, neural networks, data visualization, lustering analysis 1. Introdution Data visualization, with is used to display multi-dimensional data graphially, has been widely applied in pattern reognition, image proessing and so on. There are many algorithms being used in data visualization, suh as satter plot matries(yu, Zhou, & Zhang, 2007) whih displays the different ombination of data items with many hild graphis, parallel oordinates (Wegman, 1990) whih rearrange the order of data items in two-dimensional spae, staked tehniques (Feiner & Beshers, 1900) whih embed one oordinate system in another one, one trees whih divide the n-dimensional data set to subspaes, Chernoff fae (Soete & Corte, 1985) and star hart (Willis, 1992) whih onvert the multi-dimensional data set into speifi graphis and so on. These methods have made great ontribution to visualizing the multi-dimensional data set, but it is hard to find the haraters through them when analyzing the data set with more dimensions. To solve the problem, researhers put forward dimension redution methods whih visualizing data sets after projeting the high-dimensional data to the low-dimensional spae. These methods effetively extrat the distribution harateristis of high-dimensional data set and redue the omplexity of final visualization result. However, this requires a higher performane of the mapping algorithm. Representative ones inlude the prinipal omponent analysis (Kambhatla & Leen, 1997), multidimensional saling (Friedman & Tukey, 1974), self-organizing map (Kohonen, 1990; Vesanto, 1999) and so on. And researhers are keeping improve these methods ontinuously. In 1982, Professor Kohonen from Finland proposed the neural network model named SOM. SOM is an unsupervised neural network projeting high-dimensional data to low dimensional spae with ompetition learning. No supervision means not pointing the output when the network was being trained. And ompetition learning is the seletion rule of survival of the fittest. SOM produe a two-dimensional graph by means of projeting the input spae into output spae. In the graph, input spae reords are projeted to the neural nodes 110

with whih they have the maximum similarity. And this algorithm has high exeution effiieny. However, the traditional SOM algorithms usually visualize data with onsistent grid format, so the size of the network had to be designed in advane. It makes the presentation apability of the output graph largely limited. What s more, SOM does not reserve the distane information of network nodes. Although this fault an be made up by designing olor map based on U-matrix, the struture and distribution of data were usually presented in a distortive format. Dynami self-organizing map (Alahakoon, Halgamuga, & Srinivasan, 2000), an improved algorithm of the SOM, generates a two-dimensional graphi with a grid whih grew dynamially. But the irregular network shape is not able to visualize the final result appropriately. The visualization-indued self-organizing map (Yin, 2002) preserves the distane information of data as well as the topologial struture. But it annot present the harateristis of different attributes. The polar self-organizing map (PolSOM) (L. Xu, Y. Xu, & Chow, 2010) and its improved probabilisti polar self-organizing map (PPoSOM) (L. Xu, Y. Xu, & Chow, 2011) whih are based on SOM are reently put forward. They visualize data with the two-dimensional polar oordinates and projet the radius and angle of the oordinates to the distane and harateristis of the data. The PolSOM and PPoSOM present the distane diversity between the network nodes and preserve the topology. What s more, they present the value weight and feature harateristi with polar oordinates feature effetively. However, with the inrease of reord amounts and the growth of data dimensions, these methods based on polar oordinates may lead to the onfusion of the visualization result. A new algorithm based on SOM and used to visualize high-dimensional data is named as three-dimensional SOM, and we all it TDSOM for short. TDSOM disards the traditional model that projets data to the two-dimensional oordinates, but projets data to the three-dimensional oordinates to enfore the effet of the visualization. TDSOM projets the three oordinate variables of the three-dimensional oordinates to the attributes, values and ategory of the dataset. 2. SOM Algorithm and TDSOM Algorithm Priniple 2.1 SOM Algorithm Early SOM algorithm mainly projets the multi-dimensional to the two-dimensional spae. The two-dimensional plane onsists of many neural nodes. In the graph, eah node has a weight vetor with the same dimension and distribute on the grid made up of retangles or hexagons. In SOM, an input datum x i is represented as a d-dimensional feature vetor x i =(x i1, x i2,, x id ) T. And eah neural node is represented as the orresponding vetor w i =(w j1, w j2,, w jd ) T. At the beginning of the training, the algorithm selet one datum x randomly and ompute the similarity of the datum x and the node j. The similarity is measured by Eulidian distanes, and the node whih has the minimum distane is the winning one: x w min xw, j 1,,N (1) where N is the number of neurons. After getting the best mathed node, the algorithm updates the weights of its neighbor nodes. Usually, the neighborhood funtion should be defined before updating them. And Gaussian funtion is usually hosen: j 2 2 j j h exp p p / 2 t, j N where p j and p are the oordinates of neuron j and, respetively, N is the neighboring set of winning neuron, p j -p is the distane between j and, δ(t) is the neighboring radius that monotonially dereases with time. The weight updating formula is: wj t1 wj t t h j x t wj t, jn (3) Where ζ(t) is the learning rate that monotonially dereases with time. After training, input data with similarity are projeted onto adjaent neural nodes in the grid of the output spae. Thus, SOM is able to reserve the topologial relationship of the input spae. But due to more than one datum are projeted onto the same neural node, the inter-point distane is not preserved. What s more, the format of the output spae usually is onsistent, SOM is not able to display the intern-neuron distane. (2) 111

2.2 TDSOM Algorithm Priniple TDSOM is the newly proposed algorithm in the artile to exhibit the harateristis of high-dimensional input spae. Unlike the traditional algorithms, its visualization model is onstruted in the three-dimensional oordinate system whih is muh wider than the two-dimensional spae. Reords data are projeted to the ertain oordinates of the three-dimensional spae after being proessing by the algorithm. In the diretion of X axis, the three-dimensional model is divided into d bar areas (where d is the dimensions of the input data) horizontally, and different x value represents different attribute of the data set. In the diretion of Y axis, the vertial value of the point represents the weight of the orresponding attribute. In the diretion of Z axis, different value means different lass of the points. Eah node of the neural network is represented with a vetor of the same dimension w=(w 1, w 2,, w d ) T (where d is the dimension of the input data). After seleting the input datum x=(x 1, x 2,, x d ) T from the data set, the algorithm gets the winning neural node aording to following formula: w x w x, j 1, 2,,n (4) j After this, TDSOM update the winning node and its neighbor nodes with Formula (2) following the below: wj t1 wj t 1hj x t w t, if j w t1 w t h x t w t, else if jn j j 2 j where η 1 and η 2 are the learning rates of the t-th step, and η 1 >η 2. η 1 is in the region [0, 0.3], and represents the learning rate of the winning node towards its urrent training sample. While η 1 is in the region [0, 0.1], and η 2 represents the learning rate of the neighborhood nodes learning to the urrent training sample. The learning rates have effets on the onvergene. While η 1 andη 2 inrease, the training proess is speeded up. Set the training period, and start to train the network aording to it. During the training, TDSOM save the times of eah node hosen to be the winning one w j (j=1,, N). And during the last training period, the winning nodes of all input data J are reorded. After the training, the U-matrix that stores the distanes information of the neural nodes in the network is omputed. Then the neural network is visualized aording to gray model generated from the U-matrix. In the result, the distane between similar neural nodes is loser, so their olors are similar. Presume that the distane between nodes is loser, their olor is deeper. Finally, some dark areas ome to being in the plane and eah of them represents one lustering. Aording to the information of input data projeting to lustering areas, the input data subset orresponding to ertain lustering an be reorded. Follow the formula: x C, k 1,, m, if xj and J C (6) k k where J is the winning node of datum x, m is the lustering number divided aording to U-matrix and C k is the k-th lustering. After the above proessing, TDSOM projets one datum onto l points of the oordinate system, and they are represented with Pl(l=1,2,,d) (where d is the dimension of the datum x ): x Pl i, i 1,2,,d (7) y i Pl x, i 1,2,,d (8) z k Pl k, x C (9) where Pl x, Pl y and Pl z are the x, y and z oordinates of the points. The holisti exeuting steps of TDSOM algorithm are as follows: Step(1): Normalize the input data. Initialize the weight of all neural nodes randomly, and initialize their oordinates in the output spae. Set the training period T. Step(2): Randomly selet an input datum and find its orresponding winning neural node by Equation (4). Step(3): Update the weights of winning node and its neighborhood set by Equation (5) Reord the winning times (5) 112

W j. Step(4): If the training period reahes T, stop and go to Step(5). Otherwise, redue the learning rate η 1,η 2 and neighborhood radius δ(t) and go bak to Step(2). Step(5): Draw the ontour map aording to W j whih is the numbers of reords being projeted to the orresponding neural node. And draw the gray graph aording to the U-matrix. Get lustering aording to Equation (6). Step(6): Update the oordinates by Equation (7), Equation (8) and Equation (9) aording to lustering result and winning neuron x w, and draw the three-dimensional model. Through the training above, every input datum have a points set in the three-dimensional oordinate system. Data with the similar attributes gather on the same group, and researhers an ompare the distribution of all lusters attributes and detet their weight from different viewpoints. The final model is made up of points set but not neural nodes, so the distane of points is reserved. 3. Experimental Results Wine reognition data set whih was released in UC Irvine learning repository is one of the most widely applied data sets. These data are the results of a hemial analysis of wines grown in the region in Italy but derived from three different ultivars. The analysis determined the quantities of 13 onstituents found in eah type of wines. It is hard to represent harateristis of the data set with so many attributes learly through traditional data visualization methods. So the effetiveness of TDSOM visualizing high-dimensional data is easy to be defeted by this wine reognition data set. After being trained by the TDSOM, the results are listed as follows: Figure 1. SOM ontour map Figure 2. SOM U-matrix graysale Figure 3. TDSOM projeting map (horizontally) Figure 4. TDSOM projeting map (vertially) 113

In Figure 1, most data are mainly projeted to three areas where their olors are darker than edges and gathered as three lusters. Figure 2 is the SOM U-matrix graysale whih is the traditional method of visualization. Although lusters Edges are not lear enough, the result also ontains three lusters. Figure 3 and Figure 4 are sreenshots of the three-dimensional model whih are taken horizontally and vertially. In the two figures, the numbers of 1 to 13 in x axis represent thirteen attributes of the wine reognition data set; the value of y oordinate represent the weight of attributes; 1, 2 and 3 on the z axis represent three different areas. In Figure 3, researhers an distinguish three different lusters projeted from original data set, and different layers ontain the orresponding points set of its lustering. Rotating the three-dimensional model and observing the Figure 4, researhers an find the distribution of front attributes easily. For example, observe the 3 attribute of three lusters, researhers an easily find out that although the distribution of Cluster 1 is disperse, the values of the set mainly fous on the area whih values are smaller; that Cluster 2 middle; and that Cluster 3 bigger. Aording the above illustration, researhers may get knowledge of the distribution of all the attributes of different lusters by rotating the model and deteting it from different view point. Based on the experiment results, it is noted that obvious improvement has been made in visualizing high-dimensional data set by TDSOM, and that the distribution of eah attribute of different lusters is effetively represented. 4. Conlusion In this paper, a new improved mapping method, TDSOM, is proposed for visualization and projetion of high-dimensional data. The ategories, attributes and weight of the data set are represented learly, and the distribution differenes of lusters are also obvious. What s more, researhers an find the harateristis of eah luster. However, TDOM may be modified for further improvement. During the experiment, one aurate model usually ontains thousands of millions of points. And this requires high-performane omputer system. With the development of omputer tehnology, the problem may be solved in the near further. At last, what s worth mentioning, due to the upstanding visualization result, TDSOM is worth widely applying in relevant areas. Referenes Alahakoon, D., Halgamuga, S., & Srinivasan, B. (2000). Dynami self-organizing maps with ontrolled growth for knowledge disovery. IEEE transations on neural networks, 11(3), 601-614. http://dx.doi.org/10.1109/72.846732 Feiner, S., & Beshers, C. (1900). Visualizing n-dimensional virtual worlds with n-vision. Computer Graphis, 24(2), 37-38. Friedman, J. H., & Tukey, J. W. (1974). A projetion pursuit algorithm for exploratory data analysis. IEEE Transations on Computers, C-23(9), 881-890. http://dx.doi.org/10.1109/t-c.1974.224051 Kambhatla, N., & Leen, T. (1997). Dimension redution by loal prinipal omponent analysis. Neural Computation, 9(7), 1493-151. http://dx.doi.org/10.1162/neo.1997.9.7.1493 Kohonen, T. (1990). The Self-organizing map. Proeedings of the IEEE, 78(9), 1464-1480. Soete, G., & Corte, W. (1985). On the pereptual saliene of features of hernoff faes for representing multivariate data. Applied Psyhologial Measurement, 9(3), 275-280. http://dx.doi.org/10.1177/014662168500900305 Vesanto, J. (1999). SOM-based data visualization methods. Intelligent Data Analysis, 3(2), 111-126. http://dx.doi.org/10.1016/s1088-467x(99)00013-x Wegman, E. J. (1990). Hyperdimensional data-analysis using parallel oordinantes. Amerian Statistial Assoiation, 85, 664-675. Willis, J. P. (1992). Star plots diagrams with strong visual impat for simultaneously displaying variations in 4 to 9 sample harateristis. pp. 75-79. Barren River Resort, KY: World Sientifi Publ Copteltd. Xu, L., Xu, Y., & Chow, S. (2010). PolSOM: A new method for multi-dimensional data visualization. Pattern reognition, 43(4), 1668-1675. http://dx.doi.org/10.1016/j.patog.2009.09.025 Xu, Y., Xu, L., & Chow, S. (2011). PPoSOM: A new variant of PolSOM by using probabilisti assignment for multidimensional data visualization. Neuroomputing, 74(11), 2018-2027. http://dx.doi.org/10.1016/j.neuom.2010.06.028 Yin, H. (2002). ViSOM-A novel method for multivariate data projetion and struture visualization. IEEE transations on neural networks, 13(1), 237-243. http://dx.doi.org/10.1109/72.977314 114

Yu, X. S., Zhou, N., & Zhang, F. F. (2007). Researh on Methods of High-dimensional Data Visualization. Information Siene, 25(1), 117-120. 115