On the use of Three-dimensional Self-Organizing Maps for Visualizing Clusters in Geo-referenced Data

Jorge M. L. Gorricha and Victor J. A. S. Lobo
CINAV-Naval Research Center, Portuguese Naval Academy, and ISEGI-UNL.

Abstract. The Self-Organizing Map (SOM) is an artificial neural network that is very effective for clustering via visualization. Ideally, so as to produce a good model, the output space dimension of the SOM should match the intrinsic dimension of the data. However, because it is very difficult or even impossible to visualize SOMs with more than two dimensions, the vast majority of applications use a SOM with a regular two-dimensional (2D) grid of nodes. For complex problems, this limits the quality of the results obtained. There are no theoretical problems in generating SOMs with higher dimensional output spaces, but 3D SOMs have met with limited success. In this paper we show that the 3D SOM can be used successfully for visualizing clusters in geo-referenced data. To overcome the problem of visualizing the 3D grid of units, we start by assigning one primary color (of the RGB color scheme) to each of the three dimensions of the 3D SOM. We then use those colors when representing, on a geographic map, the geo-referenced elements that are mapped to each SOM unit. Finally, we compare a 2D and a 3D SOM on a concrete problem. The results obtained point to a significant increase in clustering quality due to the use of 3D SOMs.

Keywords: Clustering, geo-referenced data, geospatial clustering, three-dimensional self-organizing maps, visualization.

1 Introduction

There is a wide range of problems that need to be addressed in a geo-spatial perspective. Some of these problems are associated with environmental and socio-economic phenomena where the geographic position is a determinant element for analysis [1].
In this kind of analysis, frequently based on geo-referenced secondary data [1], the focus is on the search for patterns and spatial relationships, without a priori hypotheses [2]. This is achieved through clustering, defined as the unsupervised classification of patterns into groups [3]. It is also widely recognized that visualization is a potentially useful technique for exploratory pattern analysis and may, under certain circumstances, contribute to the discovery of new knowledge. Moreover, when applied to geo-referenced data, visualization may allow the explanation of complex structures and phenomena in a spatial perspective [4]. Visualization is, in this perspective, defined as the use of visual representations of data obtained with interactive computer systems in order to amplify cognition [5].

It is in this context that unsupervised neural networks, such as the SOM [6-8], have been proposed as tools for visualizing geo-referenced data [4]. In fact, the SOM algorithm performs both vector quantization (the process of representing a given data set by a reduced set of reference vectors [9-11]) and vector projection, making this Artificial Neural Network (ANN) a very effective method for clustering via visualization [12]. Among all the strategies for visualizing the SOM, we are particularly interested in those that allow dealing with spatial dependency. One of the methods used to visualize geo-referenced data using the SOM consists of assigning different colors to the units of a SOM network defined in only two dimensions (2D SOM), so that each geo-referenced element can be geographically represented with the color of its Best Matching Unit (BMU) [13].

Because it is very difficult or even impossible to visualize SOMs with more than two dimensions [14, 15], the output space of this ANN is generally defined by a regular two-dimensional grid of nodes (2D SOM). This approach, supported by a non-linear projection of data onto a two-dimensional surface, performs a dimensionality reduction that generally leads to a loss of information, and for this reason there is a strong probability that some of the existing clusters will be undetectable [12]. However, in the particular case of geo-referenced data, it is possible to consider the use of a three-dimensional SOM for this purpose, thus adding one more dimension to the analysis and consequently reducing information loss. As we shall see later, the inclusion of a third dimension in the analysis allows us to identify some of the clusters that are undifferentiated in SOMs whose output space is defined in only two dimensions.
This paper is divided into five sections as follows: Section 2 presents the theoretical framework of the problem under review, especially regarding the use of the SOM as a tool for visualizing clusters in a spatial perspective; in Section 3 we propose a method for visualizing clusters in geo-referenced data that uses the output space of a three-dimensional SOM; Section 4 presents the results and discusses practical applications of the presented method, including experiments with real and artificial data; finally, in Section 5 we present the general conclusions.

2 The Self-Organizing Map

The SOM is an ANN based on an unsupervised learning process that performs a nonlinear mapping of high dimensional input data onto an ordered and structured array of nodes, generally of lower dimension [6]. As a result of this process, and by combining the properties of a vector quantization and a vector projection algorithm, the SOM compresses information and reduces dimensionality [15, 16]. In its most usual form, the SOM algorithm performs a number of successive iterations until the reference vectors associated with the nodes of a two-dimensional network represent, as far as possible, the input patterns that are closer to those nodes (vector quantization). In the end, every input pattern in the data set is mapped to one of the network nodes (vector projection).
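The two roles described above, vector projection (finding the node each pattern maps to) and vector quantization (moving the reference vectors toward the patterns), can be illustrated with a minimal sketch of one batch-style update. This is an illustration under stated assumptions rather than the implementation used in the paper: the function name `batch_som_epoch`, the Gaussian neighborhood, the `sigma` schedule and the toy data are all hypothetical choices made here.

```python
from itertools import product
from math import dist, exp

def batch_som_epoch(data, codebook, grid, sigma):
    """One batch-style update: every reference vector becomes a
    neighborhood-weighted mean of the input patterns (illustrative sketch)."""
    dim = len(data[0])
    # Vector projection: index of the best matching unit (BMU) of each pattern.
    bmus = [min(range(len(codebook)), key=lambda i: dist(s, codebook[i]))
            for s in data]
    new_codebook = []
    for j in range(len(codebook)):
        # Gaussian neighborhood (an assumption), measured in the output space
        # between unit j and the BMU of each pattern.
        weights = [exp(-dist(grid[j], grid[b]) ** 2 / (2 * sigma ** 2))
                   for b in bmus]
        total = sum(weights)
        if total == 0.0:
            # All weights underflowed: keep the previous reference vector.
            new_codebook.append(tuple(codebook[j]))
            continue
        # Vector quantization: weighted mean of the patterns, per component.
        new_codebook.append(tuple(
            sum(wt * s[k] for wt, s in zip(weights, data)) / total
            for k in range(dim)
        ))
    return new_codebook

# Toy usage: a 2 x 2 x 2 output grid and two-dimensional input patterns.
grid = list(product(range(2), range(2), range(2)))
codebook = [(float(x), float(y)) for x, y, _ in grid]
data = [(0.0, 0.1), (0.9, 1.0), (1.0, 0.0), (0.2, 0.8)]
for sigma in (1.5, 1.0, 0.5):  # shrinking neighborhood radius (assumed schedule)
    codebook = batch_som_epoch(data, codebook, grid, sigma)
```

Because every updated reference vector is a weighted mean of the patterns, it always stays inside the range spanned by the data, which is one way to see why the SOM acts as a quantizer of the input space.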

After this optimization process, topological relations amongst input patterns are, whenever possible, preserved through the mapping process, allowing the similarities and dissimilarities in the data to be represented in the output space [7]. Therefore, the SOM algorithm establishes a nonlinear mapping between the input data space and the map grid, which is called the output space. Because the SOM converts nonlinear statistical relationships that exist in data into geometric relationships that can be represented visually [6, 7], it can be considered a visualization method for multidimensional data specially adapted to display the clustering structure [17, 18], or in other words, a diagram of clusters [7]. When compared with other clustering tools, the SOM is distinguished mainly by the fact that, during the learning process, the algorithm tries to guarantee the topological ordering of its units, thus allowing an analysis of proximity between the clusters and the visualization of their structure [13].

In order to transform the SOM into a better tool for exploratory data analysis, several methods have been developed that increase the capabilities of this algorithm for that purpose. These methods explore both perspectives of the SOM: vector projection (an output space perspective) and vector quantization (an input data space perspective). Usually, the visualization of the SOM is based on two-dimensional constructs such as the U-Matrix [19, 20], component planes, hit maps, and other similar variants [20, 21], or on exploring the data topology [22]. However, this paper focuses not on all of those strategies but only on those that allow clusters to be visualized in a geo-spatial perspective. Typically, a clustering tool should ensure the representation of the existing data patterns, the definition of proximity between these patterns, the characterization of clusters and the final evaluation of the output [3].
In the case of geo-referenced data, the clustering tool should also ensure that the groups are formed in line with geographical closeness [13]. Thus, the geo-spatial perspective is, in fact, the crucial point that distinguishes clustering geo-referenced data from clustering other data. Recognizing that fact, and knowing that the SOM can be visualized by means other than the usual methods, we now look at one specific approach that has been proposed to deal with geo-spatial features.

An alternative way to visualize the SOM takes advantage of the very nature of geo-referenced data, coloring the geographic map with label colors obtained from the SOM units [13]. This approach is proposed in the Prototypically Exploratory Geovisualization Environment [23] (PEGE). This software makes it possible to link the SOM to the geographic representation of data by color, allowing its analysis in a geo-spatial perspective. One possible application of PEGE, which constitutes the basis of this paper, consists of assigning colors to the map units of a 2D SOM according to some criterion (similarity, for example) and then coloring the geo-referenced elements with those colors.

Fig. 1 shows an example of clustering geo-referenced data based on the application of this method. A color was assigned to each map unit of a 2D SOM defined with nine units (3 × 3). This map was trained with data related to the main causes of death in several European countries. As this example shows, the geo-spatial perspective seems to be essential to understanding this particular phenomenon.

Fig. 1. The principal causes of death with a 2D SOM. This example was obtained by training a 2D SOM with data related to the main causes of death in several European countries. Each country was painted with the color of its BMU in the SOM. Data Source: EUROSTAT.

3 Clustering Geo-referenced Data With 3D SOM

In this section we propose a clustering method for geo-referenced data based on a visualization of the output space of a 3D SOM. This method is no more than an association of each of the three orthogonal axes (x, y and z) that define the SOM grid with one of the three primary colors: red, green and blue (RGB scheme). As a result, each of the three dimensions of the 3D SOM is expressed by a change in tone of one particular primary color (RGB), and each SOM unit has a distinct color label. After that we can paint each geographic element with the color of its BMU. Fig. 2 represents schematically a SOM with 27 units (3 × 3 × 3) in RGB space, followed by the geographical representation of several geo-referenced elements painted with the color labels of their BMUs.

Formally, let us consider a 3D SOM defined with three dimensions [u × v × w] and a rectangular topology. The SOM grid or output space (N) is a set of (u × v × w) units (nodes) defined in $\mathbb{R}^3$, such that:

$$N = \{\, n_i = [x \;\; y \;\; z]^T \in \mathbb{R}^3 : i = 1, 2, \ldots, (u \times v \times w) \,\} \qquad (1)$$

where x, y and z are the unit coordinates in the output space, such that:

$$x = 0, 1, \ldots, (u-1); \quad y = 0, 1, \ldots, (v-1); \quad z = 0, 1, \ldots, (w-1) \qquad (2)$$

Fig. 2. Linking the SOM's knowledge to the cartographic representation. A color is assigned to each SOM unit (following the topological order). Then the geo-referenced elements are painted with the color of their BMUs in the SOM.

These coordinates must be adjusted to fit the RGB values, which typically vary between 0 and 1. The new coordinates (R, G, B) of the unit $n_i$ in RGB space can be obtained through the range normalization of the initial values:

$$R = \frac{x}{u-1}; \quad G = \frac{y}{v-1}; \quad B = \frac{z}{w-1} \qquad (3)$$

Finally, the interior of the polygon that defines each geo-referenced element mapped to the unit $n_i$ (BMU) can receive the color (R, G, B), as may be seen in Fig. 2. The process is then repeated for all units of the map grid.

4 Experimental Results

To quantify the efficiency of the proposed method we conducted several experiments. In this section we present the experimental results obtained using two geo-referenced data sets: a first one using artificial data, where we know exactly the number and extent of the clusters; and a second one using real data.
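The color assignment of Section 3 (Eqs. 1 to 3) can be sketched in a few lines. The function names `som_unit_colors` and `color_elements`, the element identifiers and the handling of a dimension with a single unit (where Eq. 3 is undefined) are assumptions made for this illustration only.

```python
from itertools import product

def som_unit_colors(u, v, w):
    """Map every unit (x, y, z) of a u x v x w grid to an (R, G, B) triple in [0, 1]."""
    def scale(coord, size):
        # Range normalization of Eq. (3). A dimension with a single unit is
        # mapped to 0.0, which is an assumption (Eq. 3 divides by zero there).
        return coord / (size - 1) if size > 1 else 0.0
    return {
        (x, y, z): (scale(x, u), scale(y, v), scale(z, w))
        for x, y, z in product(range(u), range(v), range(w))
    }

def color_elements(bmu_of_element, colors):
    """Give each geo-referenced element the color of its BMU."""
    return {element: colors[bmu] for element, bmu in bmu_of_element.items()}

# A 3 x 3 x 3 SOM as in Fig. 2; the element names and BMUs are hypothetical.
colors = som_unit_colors(3, 3, 3)
painted = color_elements({"area_A": (0, 0, 0), "area_B": (2, 1, 0)}, colors)
```

Units that are neighbors in the output space differ in only one color channel by one step, so similar elements receive similar colors on the geographic map.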

Experiment with Artificial Data

To illustrate the use of three-dimensional SOMs for clustering geo-referenced data, we designed a dataset for that purpose, inspired by one of the fields of application for this kind of tool: ecological modeling. In this special case, the geo-referenced dataset refers to an area of intensive fishing where there is a particular interest in the spatial analysis of the distribution of five species of great commercial importance. The dataset was constructed in order to characterize 225 sea areas, exclusively based on their biodiversity. We simulated a sampling procedure, assuming that each sample was representative of an area of approximately 50 square miles. All samples are geo-referenced to the centroid of the area, defined with geographical coordinates (x and y), and their attributes are the amount of each of the five species of interest, expressed in tons.

The initial data set was designed so that the variables are on the same scale. However, as the variables have very different variances, a Z-score normalization was carried out to guarantee that all the variances are equal to 1.

As we can see in Fig. 3 and Fig. 4, the map has a total of twelve well defined areas (geo-clusters), including a few small areas of spatial outliers. However, if we analyze only the attributes that refer to the species of interest, there are only eight distinct groups of data. Fig. 3 also represents the distribution of each variable.

Fig. 3. Artificial Dataset. The distribution of each variable is also represented. The dark areas correspond to high values of each variable: (a) Variable 1; (b) Variable 2; (c) Variable 3; (d) Variable 4; (e) Variable 5.
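The Z-score normalization mentioned above can be sketched as follows. The function name `z_score_columns` and the sample values are hypothetical, and the use of the population standard deviation (so that each variance becomes exactly 1) is an assumption, since the text does not say which estimator was used.

```python
from statistics import fmean, pstdev

def z_score_columns(rows):
    """Rescale every variable (column) to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [fmean(column) for column in columns]
    # Population standard deviation (assumption; the text does not specify it).
    stds = [pstdev(column) for column in columns]
    return [
        # A constant column has zero spread and is mapped to 0.0.
        [(value - m) / s if s > 0 else 0.0
         for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical catches (in tons) of two species in three sea areas.
samples = [[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]]
scaled = z_score_columns(samples)
```

After this step no variable dominates the Euclidean distances used to find the BMU merely because it has a larger variance.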

Fig. 4. Artificial Dataset. All twelve clusters are delimited.

The first experiment was conducted in order to compare SOMs with different dimensions (3D SOM versus 2D SOM). Considering the size of the data set (225 geo-referenced elements), we decided to use the following map sizes, with a total of 64 network units for both models:
- 2D SOM: 8 × 8;
- 3D SOM:

In the experiments, we always used the SOM Batch Algorithm implemented in SOMToolbox [24] with the following parameterizations:
- Gaussian neighborhood function (several models with different neighborhood functions were tested, but the results were always better with this function);
- The lattice was defined as rectangular for the 3D SOM (the only option allowed by SOMToolbox for SOMs with more than two dimensions) and hexagonal for the 2D SOM. The hexagonal lattice gives better results for 2D SOMs, and each unit has the same number of neighbors as the units of the 3D SOMs (except, naturally, for the border units). By following this strategy we guarantee that the 3D SOM is compared with the best 2D SOM model;
- The learning rate was 0.5 for the unfolding phase and 0.05 for the fine-tuning phase;
- In both models we used an unfolding phase with 12 epochs and a fine-tuning phase with 48 epochs.

Random and linear initializations were tested. Five hundred models were assessed for both topologies (using random initialization), and although we present statistical numerical results, the figures were obtained with a particular SOM that we chose as the best model. Considering that all the measures available to assess map quality [6] have advantages and disadvantages, and that it is not possible to indicate the best one, we opted for the models of both topologies that presented the minimum quantization error (QE). We also analyzed the topological error (TE), but since it proved to be always very low, this measure was not used to choose the final model. The topological error was calculated as the proportion of all data vectors for which the first and second BMUs are not adjacent units, i.e., where the distance (measured in output space) between the first and second BMU is greater than 2 for the 2D SOM and 3 for the 3D SOM. The results are summarized in Table 1:

Table 1. Quantization error and topological error

                                      2D SOM               3D SOM
Random initialization
  Model with the minimum QE    QE     0,3138               0,3697
                               TE     0,
  Average values               QE     0,3333 (σ=0,0105)    0,4181 (σ=0,0032)
                               TE     0, (σ=0,0214)        ,00326 (σ=0,0095)
Linear initialization
  Linear initialization model  QE     0,3282               0,4057
                               TE     0                    0

The value of the standard deviation is given in brackets.

Using the methodology proposed in Section 3 we obtain the cartographic representation of both models, the 2D SOM and the 3D SOM. In Fig. 5 we present the result of applying color labels that link the output space of a 2D SOM to the cartographic representation.
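The two quality measures reported in Table 1 can be sketched as follows. The quantization error is computed here in its usual form, the mean input-space distance between each data vector and its BMU, which is an assumption because the text does not spell out the formula. The topological error follows the definition given above, with the adjacency threshold left as a parameter. All function names and the toy values are hypothetical.

```python
from math import dist

def two_best_units(sample, codebook):
    """Indices of the closest and second-closest reference vectors."""
    order = sorted(range(len(codebook)), key=lambda i: dist(sample, codebook[i]))
    return order[0], order[1]

def quantization_error(data, codebook):
    """Mean input-space distance between each data vector and its BMU."""
    return sum(
        dist(s, codebook[two_best_units(s, codebook)[0]]) for s in data
    ) / len(data)

def topological_error(data, codebook, grid, max_adjacent_distance):
    """Proportion of data vectors whose first and second BMUs are farther
    apart in the output space than max_adjacent_distance."""
    errors = 0
    for s in data:
        first, second = two_best_units(s, codebook)
        if dist(grid[first], grid[second]) > max_adjacent_distance:
            errors += 1
    return errors / len(data)

# Toy 2 x 2 map whose reference vectors coincide with the grid positions.
grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
codebook = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
data = [(0.1, 0.0), (0.9, 1.0)]
qe = quantization_error(data, codebook)
te = topological_error(data, codebook, grid, max_adjacent_distance=1.0)
```

A low QE indicates that the reference vectors represent the data closely, while a low TE indicates that neighboring patterns in the input space are mapped to neighboring units, which is what makes the color labels meaningful on the geographic map.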

Fig. 5. Cartographic representation with 2D SOM. By inspection of the map we cannot identify more than six well defined clusters, and there is a false continuum linking several zones.

As we can see, the cartographic representation of the 2D SOM does not show all eight clusters. In fact, we can hardly say by inspection of the geographic map that there are more than six clusters. As regards the differentiation of the twelve defined areas, there is a mixed zone composed of zone 5 and zone 7; there is a false continuum linking zone 4 to zone 6; and there is some ambiguity between zone 1 and zone 3 and between zone 2 and zone 4. In Fig. 6 we show the U-matrix of the 2D SOM. The U-matrix exposes all eight clusters.

Fig. 6. U-Matrix 2D SOM.

Despite the results obtained with the cartographic representation of the 2D SOM (Fig. 5), it is important to note that the U-Matrix shows all eight groups very effectively. However, it is difficult to analyze this information in a geo-spatial perspective; in particular, it is difficult to identify the twelve different areas. Fig. 7 shows the geographic map with color labels obtained from the 3D SOM. In this particular case, it seems that the 3D SOM exposes all eight clusters and all twelve different areas. However, some doubts remain regarding certain areas, especially in zones 4 and 5.

Fig. 7. Cartographic representation with 3D SOM. All eight clusters are well defined. However, some doubts remain regarding zones 4 and 5.

Lisbon's metropolitan area

Another experiment was conducted using a real geo-referenced data set to train several SOMs. This data set consists of 61 socio-demographic variables which describe a total of 3978 geo-referenced elements belonging to Lisbon's metropolitan area (see Fig. 8). The data was collected during the 2001 census and the variables describe the region according to five main areas of interest: type of construction, family structure, age structure, education levels and economic activities.

Fig. 8. Lisbon Metropolitan Area. The data set was collected during the 2001 census and consists of 61 socio-demographic variables which describe a total of 3978 geo-referenced elements belonging to Lisbon's metropolitan area.

Because the variables have different scales and ranges, we performed a linear range normalization to guarantee that all the variables take values between 0 and 1. As before, these second tests were conducted in order to compare qualitatively SOMs with different dimensions. Taking into account the size of the data set (3978 geo-referenced elements), we chose the following map sizes, with a total of 512 network units for both the 3D SOM and the 2D SOM:
- 2D SOM: 16 × 32;
- 3D SOM: 8 × 8 × 8;

Once again, we used the SOM Batch Algorithm, parameterized this way:
- Neighborhood function: Gaussian;
- The lattice was defined as rectangular for the 3D SOM and hexagonal for the 2D SOM;
- The learning rate was 0.5 for the unfolding phase and 0.05 for the fine-tuning phase;
- In both models we used an unfolding phase with 8 epochs and a fine-tuning phase with 24 epochs.

Both random initialization and linear initialization were tested. One hundred models were assessed for both topologies (with random initialization). Once more, we opted for the maps of both topologies that present the minimum quantization error among all models with an acceptable topological error. The results are summarized in Table 2:

Table 2. Quantization error and topological error

                                      2D SOM               3D SOM
Random initialization
  Model with the minimum QE    QE     0,6170               0,6459
                               TE     0,0339               0,0261
  Average values               QE     0,6197 (σ=0,0010)    0,6494 (σ=0,0016)
                               TE     0,0371 (σ=0,0031)    0,0343 (σ=0,0069)
Linear initialization
  Linear initialization model  QE     0,6191               0,6458
                               TE     0,0422               0,0206

The analysis of the U-Matrix presented in Fig. 9 indicates that there are several clusters, including some with well-defined borders. The darker blue shades represent dense areas in the input space. In contrast, the red shades indicate sparse areas. In this work the interest lies not in the analysis of the existing clusters but essentially in the comparison between the representations offered by the two types of topology (2D SOM and 3D SOM).

Fig. 9. U-Matrix of a 2D SOM. It seems evident that the data set has a very complex structure with several clusters.

Fig. 10 represents part of Lisbon's city center. The 2D SOM representation in Fig. 10 (a) is much less informative than the representation offered by the 3D SOM in Fig. 10 (b). In the cartographic representation, the results obtained with the 2D SOM are much less detailed than those obtained with the 3D SOM.

Fig. 10. Lisbon centre visualized with both 2D SOM and 3D SOM. (a) Represents the 2D SOM visualization; (b) represents the 3D SOM visualization (only output space).

Naturally, the discrimination provided by the 3D SOM may be artificial and forced. But the analysis of some particular differences between the maps points in the opposite direction: there are real differences, and some of them are visualized better with the inclusion of one more dimension. Let us consider the zone highlighted on both maps in Fig. 10. With the 2D SOM, the zone looks similar to its neighborhood; with the 3D SOM, on the contrary, there is a difference. The zone indicated in the map is, in fact, different from its neighbors and corresponds to the old Lisbon center (Baixa Pombalina). The main difference (among others) is the construction profile. Lisbon's centre and the nearby zones consist essentially of buildings constructed before 1919, very different from the rest of the city. In a global analysis it seems that the 2D SOM does not reflect the main differences in the construction profile.

5 Conclusion

In this paper we have presented a method for clustering geo-referenced data using the three-dimensional SOM. The 3D SOM was compared with the 2D SOM using two datasets: one artificial dataset that consisted of 225 geo-referenced elements with 5 variables; and one real-life data set that consisted of 3978 geo-referenced elements described by 61 variables. The experiments were conducted using several parameterizations of the SOM algorithm in order to optimize the final results of both topologies.

In the first experiment, using an artificial dataset with clusters and geo-clusters known a priori, the 3D SOM proved to be more effective in detecting the predefined homogeneous groups from a spatial perspective. Nevertheless, even with the use of one additional dimension there are still some difficulties in correctly classifying all the geo-referenced elements.

Regarding the effectiveness of the 3D SOM when applied to real data, we can say that the 3D topology was, in the tested data set, much more informative and revealed differences between geo-referenced elements that were not accessible with the 2D SOM. However, the high discrimination of geo-referenced data provided by the 3D SOM creates a complex visualization scheme that makes it difficult to identify the global trends in the data. The 3D SOM therefore seems better suited to a finer and more detailed analysis.

References

1. Openshaw, S., Developing Automated and Smart Spatial Pattern Exploration Tools for Geographical Information Systems Applications.
The Statistician, (1): p
2. Miller, H.J. and J. Han, Overview of geographic data mining and knowledge discovery, in Geographic Data Mining and Knowledge Discovery, H.J. Miller and J. Han, Editors. 2001, Taylor & Francis: London.
3. Jain, A.K., M.N. Murty, and P.J. Flynn, Data Clustering: A Review. ACM Computing Surveys, (3): p
4. Koua, E.L. Using self-organizing maps for information visualization and knowledge discovery in complex geospatial datasets. in Proceedings of 21st International Cartographic Conference (ICC) Durban: International Cartographic Association.
5. Card, S.K., J.D. Mackinlay, and B. Shneiderman, Readings in Information Visualization: Using Vision to Think. 1999, Morgan Kaufmann Publishers: San Francisco.
6. Kohonen, T., Self-organizing Maps. 3rd ed. Springer Series in Information Sciences, ed. T.S. Huang, T. Kohonen, and M.R. Schroeder. 2001, New York: Springer.
7. Kohonen, T., The self-organizing map. Neurocomputing, (1-3): p
8. Kohonen, T., The self-organizing map. Proceedings of the IEEE, (9): p
9. Gersho, A., Principles of quantization. IEEE Transactions on Circuits and Systems, (7): p
10. Gersho, A., Quantization. IEEE Communications Magazine, (5): p
11. Buhmann, J. and H. Kühnel. Complexity optimized vector quantization: a neural network approach. in Proceedings of DCC '92, Data Compression Conference. 1992: IEEE Comput. Soc. Press.
12. Flexer, A., On the use of self-organizing maps for clustering and visualization. Intelligent Data Analysis, (5): p
13. Skupin, A. and P. Agarwal, What is a Self-organizing Map?, in Self-Organising Maps: applications in geographic information science, P. Agarwal and A. Skupin, Editors. 2008, John Wiley & Sons: Chichester, England. p
14. Bação, F., V. Lobo, and M. Painho, The self-organizing map, the Geo-SOM, and relevant variants for geosciences. Computers & Geosciences, (2): p
15. Vesanto, J., SOM Based Data Visualization Methods. Intelligent Data Analysis, (2): p
16. Vesanto, J., et al., SOM Toolbox for Matlab, Helsinki University of Technology: Espoo, Finland.
17. Himberg, J. A SOM based cluster visualization and its application for false coloring. in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks Como, Italy.
18. Kaski, S., J. Venna, and T. Kohonen. Coloring that reveals high-dimensional structures in data. in Proceedings of 6th International Conference on Neural Information Processing Perth, WA: IEEE.
19. Ultsch, A. and H.P. Siemon. Kohonen's self organizing feature maps for exploratory data analysis. in Proceedings of International Neural Network Conference Paris: Kluwer Academic Press.
20. Ultsch, A. Maps for the visualization of high-dimensional data spaces. in Proceedings of the workshop on self-organizing maps Japan: Kyushu.
21. Kraaijveld, M.A., J. Mao, and A.K. Jain, A nonlinear projection method based on Kohonen's topology preserving maps. IEEE Transactions on Neural Networks, (3): p
22. Tasdemir, K. and E. Merenyi, Exploiting Data Topology in Visualization and Clustering of Self-Organizing Maps. Neural Networks, IEEE Transactions on, (4): p
23. Koua, E.L. and M. Kraak, An Integrated Exploratory Geovisualization Environment Based on Self-Organizing Map, in Self-Organising Maps: applications in geographic information science, P. Agarwal and A. Skupin, Editors. 2008, John Wiley & Sons: Chichester, England. p
24. Alhoniemi, E., et al., SOM Toolbox