Comparing large datasets structures through unsupervised learning
Guénaël Cabanes and Younès Bennani
LIPN-CNRS, UMR 7030, Université de Paris 13
99, Avenue J-B. Clément, Villetaneuse, France

Abstract. In data mining, the problem of measuring similarities between different subsets of a dataset is an important issue which has been little investigated up to now. In this paper, a novel method based on unsupervised learning is proposed. Different subsets of a dataset are characterized by means of a model which implicitly corresponds to a set of prototypes, each one capturing a different modality of the data. Structural differences between two subsets are then reflected in their corresponding models, and differences between models are detected using a similarity measure based on data density. Experiments on synthetic and real datasets illustrate the effectiveness, efficiency, and insights provided by our approach.

1 Introduction

In recent years, the size of datasets has grown exponentially: studies show that the amount of stored data doubles every year, yet the ability to analyze these data has not kept pace. The problem of mining these data to measure similarities between different datasets is therefore an important issue which has been little investigated up to now. A major application is the analysis of time-evolving datasets, by computing a model of the data structure over different periods of time and comparing the models to detect changes as they occur. There are, however, many other possible applications, such as large dataset comparison, cluster merging, stability measurement, and so on. As the study of data streams and large databases is a difficult problem, because of the computing costs and the large storage volumes involved, two issues appear to play a key role in such an analysis: (i) a good condensed description of the data properties [1,2] and (ii) a measure capable of detecting changes in the data structure [3,4].
In this paper we propose a new algorithm able to perform these two tasks. The solution consists of an algorithm that first constructs an abstract representation of the datasets to compare, and then evaluates the dissimilarity between them based on this representation. The abstract representation is based on the learning of a variant of the Self-Organizing Map (SOM) [5], enriched with structural information extracted from the data. We then propose a method to estimate, from this abstract representation, the underlying data density function.

(This work was supported in part by the CADI project (ANR-07 TLOG 003) financed by the ANR, the Agence Nationale de la Recherche.)

The dissimilarity is a measure of the divergence
between the two estimated densities. A great advantage of this method is that each enriched SOM is at once a highly informative and a highly condensed description of the data structure, which can easily be stored for future use. Moreover, as the algorithm is very efficient both in computational complexity and in memory requirements, it can be used to compare large datasets or to detect structural changes in data streams.

The remainder of this paper is organized as follows. Section 2 presents the new algorithm. Section 3 describes the validation protocol and some results. Conclusions and future work perspectives are given in Section 4.

2 A new two-level algorithm to compare data structures

The basic assumption in this work is that the data are described as vectors of numeric attributes and that the datasets to compare have the same type. First, each dataset is modeled with an enriched Self-Organizing Map (SOM) model (adapted from [6]), yielding an abstract representation intended to capture the essential data structure. Then, the density function of each dataset is estimated from its abstract representation. Finally, the datasets are compared using a dissimilarity measure based on these density functions. The idea is to combine the dimension-reduction and fast-learning capabilities of the SOM in a first level to construct a new, reduced vector space, and then to apply further analyses in this new space. Such approaches are called two-level methods; they are known to greatly reduce computational time, the effects of noise, and the curse of dimensionality [6]. Furthermore, they allow some visual interpretation of the result through the two-dimensional map generated by the SOM.

2.1 Abstract algorithm schema

The algorithm proceeds in three steps:

1. The first step is the learning of the enriched SOM. During the learning, each SOM prototype is extended with novel information extracted from the data.
This structural information will be used in the second step to infer the density function. More specifically, the attributes added to each prototype are:

- Density mode: a measure of the data density surrounding the prototype (local density), i.e. of the amount of data present in the corresponding area of the input space. We use a Gaussian kernel estimator [7] for this task.
- Local variability: a measure of the variability of the data represented by the prototype, defined as the average distance between the prototype and the data it represents.
- Neighborhood: a measure associated with each pair of prototypes, namely the number of data that are well represented by both of them.

2. The second step is the construction, from each enriched SOM, of a density function used to estimate the density over the input space. This function is induced from the information associated with the prototypes of the SOM, and is represented as a mixture model of spherical normal functions.
3. The last step performs the comparison of two datasets, using a dissimilarity measure between the two density functions constructed in the previous steps.

2.2 Prototypes enrichment

In this step, global information is extracted from the data and stored in the prototypes during the learning of the SOM. The Kohonen SOM can be classified as a competitive unsupervised learning neural network [5]. A SOM consists of a two-dimensional map of M neurons (units), connected to the n inputs by n weight connections w_j = (w_{1j}, ..., w_{nj}) (also called prototypes) and to their neighbors by topological links. The training set is used to organize the map under topological constraints of the input space; an optimal spatial organization of the input data is thus determined by the SOM. In our algorithm, the SOM prototypes are enriched by adding new numerical values extracted from the dataset. The enrichment algorithm proceeds as follows:

Input: the data X = {x_k}, k = 1, ..., N.
Output: the density D_i and the local variability s_i associated with each prototype w_i; the neighborhood value v_{i,j} associated with each pair of prototypes (w_i, w_j).

Algorithm:
1. Initialization: initialize the SOM parameters; for all i, j, set to zero the local densities D_i, the neighborhood values v_{i,j}, the local variabilities s_i and the number N_i of data represented by w_i.
2. Choose randomly a data point x_k in X: compute d(w_i, x_k), the Euclidean distance between x_k and each prototype w_i, and find the two closest prototypes (the Best Matching Units, BMUs) w_u and w_{u'}: u = arg min_i d(w_i, x_k) and u' = arg min_{i≠u} d(w_i, x_k).
3. Update the structural values:
   - Number of data: N_u = N_u + 1.
   - Variability: s_u = s_u + d(w_u, x_k).
   - Density: for all i, D_i = D_i + (1 / (√(2π) h)) exp(−d(w_i, x_k)² / (2h²)).
   - Neighborhood: v_{u,u'} = v_{u,u'} + 1.
4. Update the SOM prototypes w_i as defined in [5].
5. Repeat steps 2 to 4, T times.
6. Compute the final structural values: for all i, s_i = s_i / N_i and D_i = D_i / N.

In this study we used the default parameters of the SOM Toolbox [8] for the learning of the SOM, with T = max(N, 50M) as in [8]. The number M of prototypes must be neither too small (the SOM would not fit the data well) nor too large (time consuming); choosing M close to √N seems to be a good trade-off [8]. The last parameter to choose is the bandwidth h. The choice of h is important for good results, but its optimal value is difficult and time consuming to compute (see [9]). A heuristic that seems relevant and gives good results is to define h as the average distance between a prototype and its closest neighbor [6].

At the end of this process, each prototype is associated with a density value and a variability value, and each pair of prototypes with a neighborhood value. The essential information about the structure of the data is captured by these values, so it is no longer necessary to keep the data in memory.

2.3 Estimation of the density function

The objective of this step is to estimate the density function that associates a density value with each point of the input space. We already have an estimate of the value of this function at the position of each prototype (namely D_i); we must infer from these values an approximation of the whole function. Our hypothesis is that this function can be properly approximated by a mixture of Gaussian kernels, each kernel K_i being a Gaussian function centered on a prototype. The density function can therefore be written as:

f(x) = Σ_{i=1}^{M} α_i K_i(x)   with   K_i(x) = (1 / (N √(2π) h_i)) exp(−d(w_i, x)² / (2 h_i²))

The most popular method to fit mixture models (i.e., to find the h_i and α_i) is the expectation-maximization (EM) algorithm [10]. However, this algorithm needs to work in the data input space.
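The enrichment phase of Section 2.2 (steps 1 to 6), together with the bandwidth heuristic just described, can be sketched as follows. This is a minimal Python/NumPy illustration, not the authors' implementation: the SOM weight update of step 4 is omitted, the pass runs once over the data rather than T sampled steps, and `prototypes`, `X` and `h` are hypothetical inputs.

```python
import numpy as np

def bandwidth_heuristic(prototypes):
    """h: the average distance between a prototype and its closest neighbour."""
    P = np.asarray(prototypes, dtype=float)
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude self-distances
    return d.min(axis=1).mean()

def enrich(prototypes, X, h):
    """One enrichment pass: accumulate for each prototype the local density D_i,
    the local variability s_i, the data count N_i, and the pairwise
    neighborhood values v_{i,j} (the SOM weight update itself is omitted)."""
    P = np.asarray(prototypes, dtype=float)
    M = len(P)
    D = np.zeros(M)                           # local densities
    s = np.zeros(M)                           # summed distances to the BMU
    N = np.zeros(M)                           # data represented per prototype
    v = np.zeros((M, M))                      # neighborhood values
    for x in np.asarray(X, dtype=float):
        dist = np.linalg.norm(P - x, axis=1)
        u, u2 = np.argsort(dist)[:2]          # the two best matching units
        N[u] += 1
        s[u] += dist[u]
        D += np.exp(-dist**2 / (2 * h**2)) / (np.sqrt(2 * np.pi) * h)
        v[u, u2] += 1                         # x is well represented by both BMUs
        v[u2, u] += 1
    s = np.divide(s, N, out=np.zeros_like(s), where=N > 0)   # s_i / N_i
    D /= len(X)                                              # D_i / N
    return D, s, N, v
```

After this pass the data can be discarded: D, s, N and v are the condensed description used in the following steps.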
As we work here on the enriched SOM instead of the dataset, we cannot use the EM algorithm. We therefore propose the following heuristic to choose h_i:

h_i = [ Σ_j (v_{i,j} / (N_i + N_j)) (s_i N_i + d_{i,j} N_j) ] / [ Σ_j v_{i,j} ]

where d_{i,j} is the Euclidean distance between w_i and w_j. The idea is that h_i is the standard deviation of the data represented by K_i. These data are also represented by w_i and its neighbors, so h_i depends on the variability s_i computed for w_i and on the distances d_{i,j} between w_i and its neighbors, weighted by the number of data represented by each prototype and by the neighborhood values between w_i and its neighbors.

Now, since the density D_i at each prototype w_i is known (f(w_i) = D_i), we can use a gradient descent method to determine the weights α_i. The α_i are initialized with the
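The h_i heuristic and the gradient-descent fit of the α_i can be sketched as below. This is an illustrative sketch under stated assumptions: the kernel normalisation follows the mixture formula given above, and the learning rate `lr` and step count `steps` are arbitrary choices not taken from the paper.

```python
import numpy as np

def kernel_bandwidths(prototypes, s, N, v):
    """Per-kernel bandwidth h_i: a neighborhood-weighted combination of the
    local variability s_i and the distances d_{i,j} to connected prototypes."""
    P = np.asarray(prototypes, dtype=float)
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    h = np.zeros(len(P))
    for i in range(len(P)):
        num = den = 0.0
        for j in range(len(P)):
            if v[i, j] > 0:
                num += v[i, j] / (N[i] + N[j]) * (s[i] * N[i] + d[i, j] * N[j])
                den += v[i, j]
        h[i] = num / den if den > 0 else s[i]  # isolated prototype: fall back on s_i
    return h

def fit_alphas(prototypes, D, h, n_total, lr=0.1, steps=500):
    """Fit the mixture weights alpha_i by gradient descent on the squared error
    between the mixture evaluated at the prototypes and the densities D_i."""
    P = np.asarray(prototypes, dtype=float)
    d2 = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=-1)
    # K[i, j] = K_j(w_i), with the kernel normalisation used in the text
    K = np.exp(-d2 / (2 * h[None, :] ** 2)) / (n_total * np.sqrt(2 * np.pi) * h[None, :])
    alpha = np.array(D, dtype=float)          # initialise with the densities D_i
    for _ in range(steps):
        residual = K @ alpha - D
        alpha -= lr * (K.T @ residual) / len(D)
    return alpha
```

The fallback for prototypes without any neighborhood connection is an added safeguard, not part of the original formulation.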
values of D_i, then gradually adjusted to better fit D_i = Σ_{j=1}^{M} α_j K_j(w_i). To do this, we optimize the following criterion:

α* = arg min_α (1/M) Σ_{i=1}^{M} ( Σ_{j=1}^{M} α_j K_j(w_i) − D_i )²

Thus we now have a density function that is a model of the dataset represented by the enriched SOM.

2.4 Algorithm complexity

The complexity of the algorithm scales as O(T M), with T the number of learning steps and M the number of prototypes in the SOM. It is recommended to set at least T > 10M for a good convergence of the SOM [5]; in this study we use T = max(N, 50M) as in [8]. This means that if N > 50M (a large database), the complexity of the algorithm is O(N M), i.e., linear in N for a fixed size of the SOM. The whole process is therefore very fast and well suited to the treatment of large databases. Very large databases can also be handled by fixing T < N (which is similar to working on a random subsample of the database). This is much faster than traditional density estimators such as the kernel estimator [7] (which also needs to keep all data in memory) or a Gaussian mixture model [11] estimated with the EM algorithm (whose convergence can become extraordinarily slow [12,13]).

2.5 The dissimilarity measure

We can now define a measure of dissimilarity between two datasets A and B, represented by two SOMs: SOM_A = ({w_i^A}_{i=1}^{M_A}, f_A) and SOM_B = ({w_j^B}_{j=1}^{M_B}, f_B), where M_A and M_B are the numbers of prototypes in models A and B, and f_A and f_B are the density functions of A and B computed as in Section 2.3. The dissimilarity between A and B is given by:

CBd(A, B) = (1/M_A) Σ_{i=1}^{M_A} f_A(w_i^A) log( f_A(w_i^A) / f_B(w_i^A) ) + (1/M_B) Σ_{j=1}^{M_B} f_B(w_j^B) log( f_B(w_j^B) / f_A(w_j^B) )

The idea is to compare the density functions f_A and f_B at each prototype of A and B: if the distributions are identical, these two values must be very close for every prototype. This measure is an adaptation of the weighted Monte Carlo approximation of the symmetrical Kullback-Leibler divergence (see [14]), using the prototypes of a SOM as a sample of the database.
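Given the two estimated density functions, the dissimilarity can be sketched as follows. This is a minimal illustration, assuming `f_a` and `f_b` are callables evaluating the estimated densities; the `eps` guard against vanishing densities is an addition for numerical safety, not part of the original measure.

```python
import numpy as np

def cbd(protos_a, f_a, protos_b, f_b, eps=1e-12):
    """Density-based dissimilarity between two models: a symmetrised,
    prototype-sampled Kullback-Leibler-style divergence. f_a and f_b are
    callables returning the estimated density at a given point."""
    term_a = np.mean([f_a(w) * np.log((f_a(w) + eps) / (f_b(w) + eps))
                      for w in protos_a])
    term_b = np.mean([f_b(w) * np.log((f_b(w) + eps) / (f_a(w) + eps))
                      for w in protos_b])
    return term_a + term_b
```

When both models describe the same distribution the log-ratios vanish and the measure is close to zero; when the densities diverge at the prototypes, both terms grow.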
3 Validation

3.1 Description of the datasets used

In order to demonstrate the performance of the proposed dissimilarity measure, we used nine artificial dataset generators and one real dataset. Ring 1, Ring 2, Ring 3, Spiral 1 and Spiral 2 are five two-dimensional non-convex distributions with different densities and variances: Ring 1 is a ring of radius 1 (high density), Ring 2 a ring of radius 3 (low density), and Ring 3 a ring of radius 5 (average density); Spiral 1 and Spiral 2 are two parallel spirals whose density decreases with the radius. Datasets from these distributions can be generated randomly. Noise 1 to Noise 4 are four different two-dimensional distributions, each composed of several Gaussian distributions plus heavy homogeneous noise. Finally, the Shuttle real dataset comes from the UCI repository. It is a nine-dimensional dataset whose instances are divided into seven classes; approximately 80% of the data belongs to class 1.

3.2 Validity of the dissimilarity measure

It is impossible to prove that two distributions are exactly the same: a low dissimilarity value is only consistent with a similar distribution, although it does give an indication of the similarity between the two sample distributions. On the other hand, a very high dissimilarity does show, at a given level of significance, that the distributions are different. Thus, if our dissimilarity measure is effective, it should be possible to compare different datasets (with the same attributes) to detect the presence of similar distributions: the dissimilarity between datasets generated from the same distribution law must be much smaller than the dissimilarity between datasets generated from very different distributions. To test this hypothesis we applied the following protocol:

1. We generated 250 different datasets from the Ring 1, Ring 2, Ring 3, Spiral 1 and Spiral 2 distributions (50 datasets from each).
Each set contains between 500 and data points.
2. For each of these datasets, we learned an enriched SOM to obtain a set of representative prototypes of the data. The number of prototypes was randomly chosen from 50 to .
3. We computed a density function for each SOM and compared the SOMs to each other with the proposed dissimilarity measure.
4. Each SOM was labeled according to the distribution it represents (the labels are ring 1, ring 2, ring 3, spiral 1 and spiral 2). We then calculated an index of compactness and separability of the SOMs having the same label, using the generalized Dunn index [15]. The bigger this index, the more the SOMs with the same label are similar to each other and dissimilar to SOMs with other labels, according to the dissimilarity function used.
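Step 4 of the protocol can be reproduced with any dissimilarity matrix. The sketch below shows one simple form of a generalized Dunn index over labelled models (the exact generalization used in [15] may differ; this is the smallest between-label dissimilarity divided by the largest within-label dissimilarity):

```python
import numpy as np

def dunn_index(dissim, labels):
    """Smallest between-label dissimilarity divided by the largest
    within-label dissimilarity (one simple generalisation of Dunn's index)."""
    dissim = np.asarray(dissim, dtype=float)
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = dissim[same & off_diag]           # pairs of models with the same label
    inter = dissim[~same]                     # pairs of models with different labels
    return inter.min() / intra.max()
```

Values well above 1 mean that models of the same distribution are mutually closer than models of different distributions.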
We used the same protocol with Noise 1 to Noise 4 and with the Shuttle database. Two kinds of distributions were extracted from the Shuttle database by random sub-sampling (with replacement): data from class 1 (Shuttle 1) and data from the other classes (Shuttle 2). We compared the results with distance-based measures usually used to compare two sets of data (here, two sets of prototypes from the SOMs): the average distance (Ad: the average distance between all pairs of prototypes in the two SOMs), the minimum distance (Md: the smallest Euclidean distance between prototypes of the two SOMs), and the Ward distance (Wd: the distance between the two centroids, weighted according to the number of prototypes in the two SOMs) [16]. All results are given in Table 1.

Visual inspection of the dissimilarity matrices obtained from the different measures is consistent with the values of the Dunn index (Table 1) and shows that the proposed density-based similarity measure (CBd) is much more effective than the distance-based measures (see Fig. 1 for an example).

Fig. 1. Visualizations of the dissimilarity matrices obtained from the different measures, (a) Ad, (b) Md, (c) Wd, (d) CBd, for the comparisons of the Noise 1 to 4 distributions. Darker cells mean higher similarity between the two distributions. Datasets from the same distribution are sorted, so their comparisons appear as four boxes along the diagonal of the matrix.

As shown in Table 1, our dissimilarity measure using density is much more effective than measures that only use the distances between prototypes. Indeed, the Dunn index for the density-based measure is much higher than for the distance-based ones on the three kinds of datasets tested: non-convex data, noisy data and real data. This means that our dissimilarity measure is much more effective for the detection of similar datasets.

Table 1.
Value of the Dunn index obtained from various dissimilarity measures when comparing the various data distributions.

Distributions to compare        Average   Minimum   Ward   Proposed
Ring 1 to 3 + Spiral 1 and 2
Noise 1 to 4
Shuttle 1 and 2
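The three distance-based baselines (Ad, Md, Wd) compared against CBd can be sketched as below. The Ward form used here (squared distance between centroids weighted by the set sizes) is one common formulation consistent with the description above, assumed rather than taken verbatim from [16]:

```python
import numpy as np

def baseline_distances(A, B):
    """The three distance-based baselines: average distance (Ad), minimum
    distance (Md), and a Ward-style distance (Wd) between the centroids,
    weighted by the number of prototypes in each SOM."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    ad = d.mean()                             # Ad: mean over all prototype pairs
    md = d.min()                              # Md: closest pair of prototypes
    na, nb = len(A), len(B)
    wd = na * nb / (na + nb) * np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)
    return ad, md, wd
```

Because these baselines see only prototype positions, not densities, they cannot separate distributions that overlap spatially but differ in density, which is precisely where CBd gains its advantage.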
4 Conclusion and future works

In this article, we proposed a new algorithm for modeling data structure, based on the learning of a SOM, together with a measure of dissimilarity between cluster structures. The advantages of this algorithm are not only its low computational cost and low memory requirements, but also the high accuracy achieved in fitting the structure of the modeled datasets. These properties make it possible to apply the algorithm to the analysis of large databases, and especially of large data streams, which requires both speed and economy of resources. The results obtained on artificial and real datasets are very encouraging.

Future work will focus on validating the algorithm on more real case studies. More specifically, it will be applied to the analysis of real data streams commonly used as benchmarks for this class of algorithms. We will also try to extend the method to structured or symbolic datasets, via kernel-based SOM algorithms. Finally, we wish to deepen the concept of stability applied to this kind of analysis.

References

1. Gehrke, J., Korn, F., Srivastava, D.: On computing correlated aggregates over continual data streams. In: SIGMOD Conference (2001)
2. Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: VLDB (2002)
3. Cao, F., Ester, M., Qian, W., Zhou, A.: Density-based clustering over an evolving data stream with noise. In: 2006 SIAM Conference on Data Mining (2006)
4. Aggarwal, C., Yu, P.: A survey of synopsis construction methods in data streams. In: C. Aggarwal (ed.): Data Streams: Models and Algorithms. Springer (2007)
5. Kohonen, T.: Self-Organizing Maps. Springer-Verlag, Berlin (2001)
6. Cabanes, G., Bennani, Y.: A local density-based simultaneous two-level algorithm for topographic clustering. In: IJCNN, IEEE (2008)
7. Silverman, B.: Using kernel density estimates to investigate multi-modality.
Journal of the Royal Statistical Society, Series B 43 (1981)
8. Vesanto, J.: Neural network tool for data mining: SOM Toolbox (2000)
9. Sain, S., Baggerly, K., Scott, D.: Cross-validation of multivariate densities. Journal of the American Statistical Association 89 (1994)
10. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39 (1977)
11. Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97(458) (2002)
12. Redner, R.A., Walker, H.F.: Mixture densities, maximum likelihood and the EM algorithm. SIAM Review 26(2) (1984)
13. Park, H., Ozeki, T.: Singularity and slow convergence of the EM algorithm for Gaussian mixtures. Neural Processing Letters 29 (2009)
14. Hershey, J.R., Olsen, P.A.: Approximating the Kullback-Leibler divergence between Gaussian mixture models. In: IEEE International Conference on Acoustics, Speech and Signal Processing, Volume 4 (2007)
15. Dunn, J.C.: Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics 4 (1974)
16. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Upper Saddle River, NJ, USA (1988)
More informationComponent Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
More informationPERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA
PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA Prakash Singh 1, Aarohi Surya 2 1 Department of Finance, IIM Lucknow, Lucknow, India 2 Department of Computer Science, LNMIIT, Jaipur,
More informationModel-Based Cluster Analysis for Web Users Sessions
Model-Based Cluster Analysis for Web Users Sessions George Pallis, Lefteris Angelis, and Athena Vakali Department of Informatics, Aristotle University of Thessaloniki, 54124, Thessaloniki, Greece gpallis@ccf.auth.gr
More informationVisualizing class probability estimators
Visualizing class probability estimators Eibe Frank and Mark Hall Department of Computer Science University of Waikato Hamilton, New Zealand {eibe, mhall}@cs.waikato.ac.nz Abstract. Inducing classifiers
More informationAnalyzing The Role Of Dimension Arrangement For Data Visualization in Radviz
Analyzing The Role Of Dimension Arrangement For Data Visualization in Radviz Luigi Di Caro 1, Vanessa Frias-Martinez 2, and Enrique Frias-Martinez 2 1 Department of Computer Science, Universita di Torino,
More informationUsing Smoothed Data Histograms for Cluster Visualization in Self-Organizing Maps
Technical Report OeFAI-TR-2002-29, extended version published in Proceedings of the International Conference on Artificial Neural Networks, Springer Lecture Notes in Computer Science, Madrid, Spain, 2002.
More informationOn Density Based Transforms for Uncertain Data Mining
On Density Based Transforms for Uncertain Data Mining Charu C. Aggarwal IBM T. J. Watson Research Center 19 Skyline Drive, Hawthorne, NY 10532 charu@us.ibm.com Abstract In spite of the great progress in
More informationHow To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
More informationPrinciples of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
More informationLarge-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
More informationMobile Phone APP Software Browsing Behavior using Clustering Analysis
Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationIntroduction to Machine Learning Using Python. Vikram Kamath
Introduction to Machine Learning Using Python Vikram Kamath Contents: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Introduction/Definition Where and Why ML is used Types of Learning Supervised Learning Linear Regression
More informationPredict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons
More informationData topology visualization for the Self-Organizing Map
Data topology visualization for the Self-Organizing Map Kadim Taşdemir and Erzsébet Merényi Rice University - Electrical & Computer Engineering 6100 Main Street, Houston, TX, 77005 - USA Abstract. The
More informationChapter 7. Cluster Analysis
Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. Density-Based Methods 6. Grid-Based Methods 7. Model-Based
More informationA Survey on Outlier Detection Techniques for Credit Card Fraud Detection
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 2, Ver. VI (Mar-Apr. 2014), PP 44-48 A Survey on Outlier Detection Techniques for Credit Card Fraud
More information6.2.8 Neural networks for data mining
6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural
More informationModelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic
More informationAnalysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet
More informationLocal Anomaly Detection for Network System Log Monitoring
Local Anomaly Detection for Network System Log Monitoring Pekka Kumpulainen Kimmo Hätönen Tampere University of Technology Nokia Siemens Networks pekka.kumpulainen@tut.fi kimmo.hatonen@nsn.com Abstract
More informationLoad balancing in a heterogeneous computer system by self-organizing Kohonen network
Bull. Nov. Comp. Center, Comp. Science, 25 (2006), 69 74 c 2006 NCC Publisher Load balancing in a heterogeneous computer system by self-organizing Kohonen network Mikhail S. Tarkov, Yakov S. Bezrukov Abstract.
More informationThe Role of Visualization in Effective Data Cleaning
The Role of Visualization in Effective Data Cleaning Yu Qian Dept. of Computer Science The University of Texas at Dallas Richardson, TX 75083-0688, USA qianyu@student.utdallas.edu Kang Zhang Dept. of Computer
More informationClustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016
Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with
More informationInternational Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 12, December 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationData Mining and Neural Networks in Stata
Data Mining and Neural Networks in Stata 2 nd Italian Stata Users Group Meeting Milano, 10 October 2005 Mario Lucchini e Maurizo Pisati Università di Milano-Bicocca mario.lucchini@unimib.it maurizio.pisati@unimib.it
More informationData Preprocessing. Week 2
Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.
More informationSupervised Feature Selection & Unsupervised Dimensionality Reduction
Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or
More informationSelf Organizing Maps for Visualization of Categories
Self Organizing Maps for Visualization of Categories Julian Szymański 1 and Włodzisław Duch 2,3 1 Department of Computer Systems Architecture, Gdańsk University of Technology, Poland, julian.szymanski@eti.pg.gda.pl
More informationResource-bounded Fraud Detection
Resource-bounded Fraud Detection Luis Torgo LIAAD-INESC Porto LA / FEP, University of Porto R. de Ceuta, 118, 6., 4050-190 Porto, Portugal ltorgo@liaad.up.pt http://www.liaad.up.pt/~ltorgo Abstract. This
More informationHigh-dimensional labeled data analysis with Gabriel graphs
High-dimensional labeled data analysis with Gabriel graphs Michaël Aupetit CEA - DAM Département Analyse Surveillance Environnement BP 12-91680 - Bruyères-Le-Châtel, France Abstract. We propose the use
More informationClustering Time Series Based on Forecast Distributions Using Kullback-Leibler Divergence
Clustering Time Series Based on Forecast Distributions Using Kullback-Leibler Divergence Taiyeong Lee, Yongqiao Xiao, Xiangxiang Meng, David Duling SAS Institute, Inc 100 SAS Campus Dr. Cary, NC 27513,
More informationClass #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
More informationTRAINING A LIMITED-INTERCONNECT, SYNTHETIC NEURAL IC
777 TRAINING A LIMITED-INTERCONNECT, SYNTHETIC NEURAL IC M.R. Walker. S. Haghighi. A. Afghan. and L.A. Akers Center for Solid State Electronics Research Arizona State University Tempe. AZ 85287-6206 mwalker@enuxha.eas.asu.edu
More informationCluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico
Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from
More informationForschungskolleg Data Analytics Methods and Techniques
Forschungskolleg Data Analytics Methods and Techniques Martin Hahmann, Gunnar Schröder, Phillip Grosse Prof. Dr.-Ing. Wolfgang Lehner Why do we need it? We are drowning in data, but starving for knowledge!
More informationSupervised and unsupervised learning - 1
Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in
More informationCluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009
Cluster Analysis Alison Merikangas Data Analysis Seminar 18 November 2009 Overview What is cluster analysis? Types of cluster Distance functions Clustering methods Agglomerative K-means Density-based Interpretation
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationSelf-Organizing g Maps (SOM) COMP61021 Modelling and Visualization of High Dimensional Data
Self-Organizing g Maps (SOM) Ke Chen Outline Introduction ti Biological Motivation Kohonen SOM Learning Algorithm Visualization Method Examples Relevant Issues Conclusions 2 Introduction Self-organizing
More informationViSOM A Novel Method for Multivariate Data Projection and Structure Visualization
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 1, JANUARY 2002 237 ViSOM A Novel Method for Multivariate Data Projection and Structure Visualization Hujun Yin Abstract When used for visualization of
More informationANALYSIS OF VARIOUS CLUSTERING ALGORITHMS OF DATA MINING ON HEALTH INFORMATICS
ANALYSIS OF VARIOUS CLUSTERING ALGORITHMS OF DATA MINING ON HEALTH INFORMATICS 1 PANKAJ SAXENA & 2 SUSHMA LEHRI 1 Deptt. Of Computer Applications, RBS Management Techanical Campus, Agra 2 Institute of
More informationMedical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu
Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
More informationUnsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning
Unsupervised Learning and Data Mining Unsupervised Learning and Data Mining Clustering Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression...
More informationFUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM
International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT
More informationMachine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut.
Machine Learning and Data Analysis overview Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz psyllabus Lecture Lecturer Content 1. J. Kléma Introduction,
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationCustomer Data Mining and Visualization by Generative Topographic Mapping Methods
Customer Data Mining and Visualization by Generative Topographic Mapping Methods Jinsan Yang and Byoung-Tak Zhang Artificial Intelligence Lab (SCAI) School of Computer Science and Engineering Seoul National
More informationAn Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis]
An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] Stephan Spiegel and Sahin Albayrak DAI-Lab, Technische Universität Berlin, Ernst-Reuter-Platz 7,
More informationEmotion Detection from Speech
Emotion Detection from Speech 1. Introduction Although emotion detection from speech is a relatively new field of research, it has many potential applications. In human-computer or human-human interaction
More informationModels of Cortical Maps II
CN510: Principles and Methods of Cognitive and Neural Modeling Models of Cortical Maps II Lecture 19 Instructor: Anatoli Gorchetchnikov dy dt The Network of Grossberg (1976) Ay B y f (
More informationClustering. Data Mining. Abraham Otero. Data Mining. Agenda
Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in
More informationOn the effects of dimensionality on data analysis with neural networks
On the effects of dimensionality on data analysis with neural networks M. Verleysen 1, D. François 2, G. Simon 1, V. Wertz 2 * Université catholique de Louvain 1 DICE, 3 place du Levant, B-1348 Louvain-la-Neuve,
More information