Visualization, Clustering and Classification of Multidimensional Astronomical Data
Antonino Staiano, Angelo Ciaramella, Lara De Vinco, Ciro Donalek, Giuseppe Longo, Giancarlo Raiconi, Roberto Tagliaferri, Roberto Amato, Carmine Del Mondo, Giuseppe Mangano, Gennaro Miele

Dipartimento di Matematica ed Informatica, Università di Salerno, Fisciano (Sa), Italy
Dipartimento di Scienze Fisiche, Università Federico II di Napoli, Italy
INFOTEL S.r.l., Via Strauss, Battipaglia (Sa), Italy

Abstract. Due to recent technological advances, data mining in massive data sets has evolved into a crucial research field for many, if not all, areas of research: from astronomy to high energy physics to genetics. In this paper we discuss an implementation of Probabilistic Principal Surfaces (PPS) developed within the framework of the AstroNeural collaboration. PPS are a nonlinear latent variable model which may be regarded as a complete mathematical framework for accomplishing some fundamental data mining activities: visualization, clustering and classification of high dimensional data. The effectiveness of the proposed model is exemplified on a complex astronomical data set.

I. INTRODUCTION
The explosive growth in the quantity, quality and accessibility of data currently experienced in all fields of science and human endeavor has triggered the search for a new generation of computational theories and tools capable of assisting humans in extracting useful information (knowledge) from the available and planned massive data sets.
This revolution has two main aspects. On the one hand, in astronomy (as well as in high energy physics, genetics, the social sciences, and many other fields) traditional interactive data analysis and data visualization methods have proved largely inadequate to cope with data sets characterized by huge volumes and/or complexity (tens or hundreds of parameters or features per record, cf. [1] and references therein). On the other hand, the simultaneous analysis of hundreds of parameters can unveil previously unknown patterns which may lead to a deeper understanding of the underlying phenomena and trends. The field of data mining is therefore becoming of paramount importance, not only in its traditional arena but also as an auxiliary tool for almost all fields of research. In this paper we discuss how three common tasks in data analysis (data visualization, clustering and data classification) may be performed using spherical Probabilistic Principal Surfaces (PPS) as a common framework.

Visualization: a crucial step in the process of data analysis, enabling an understanding of the relations that exist within the data by displaying them in such a way that the discovered patterns are emphasized.

Clustering: perhaps the most important and widely used method of unsupervised learning. It may be summarized as the problem of identifying groupings of similar points that are relatively isolated from each other, or in other words of partitioning the data into dissimilar groups of similar items.

Classification: the assignment of a given pattern to one of a number of possible classes, which depend on the problem at hand. Such classes may be the result of a labeling of the groupings produced by a clustering procedure.

PPS [6], [7] (discussed in Section II) are a nonlinear extension of principal components, in that each node on the PPS is the average of all data points that project onto or near it.
From a theoretical standpoint, the PPS is a generalization of the Generative Topographic Mapping (GTM) [2], [3], which can in turn be seen as a parametric alternative to Self Organizing Maps (SOM) [10]. The advantages of PPS include a parametric and flexible formulation for any geometry/topology in any dimension, and guaranteed convergence (the PPS training is accomplished through the Expectation-Maximization algorithm). A PPS is governed by its latent topology and, owing to this flexibility, a variety of PPS topologies can be created, one of which is the 3D sphere. The sphere is finite and unbounded, with all nodes distributed at the edge; it is therefore ideal for emulating the sparseness and peripheral distribution of high-d data. Furthermore, the sphere topology is easily comprehended by humans and can thereby be of great help in visualizing high-d data (Section III-A). Since a PPS generates a probability density function of the input data, in the form of a mixture of Gaussians, it can be used both for clustering (Section III-B) and for classification (Section III-C). To illustrate the power and effectiveness of the model, we discuss a case study in the field of astronomy using a real and complex data set (Section IV). All results discussed here were obtained in the framework of the AstroNeural collaboration: a joint project between the Department of Mathematics and Informatics of the University of Salerno and the Department of Physical Sciences of the University Federico II in Napoli.
The main goal of the collaboration is to implement a user friendly data mining tool capable of dealing with heterogeneous, high dimensionality data sets.

II. PPS: THEORETICAL DESCRIPTION
PPS defines a non-linear, parametric mapping y(x; W) from a Q-dimensional latent space (x ∈ R^Q) to a D-dimensional data space (t ∈ R^D), where normally Q < D. The mapping y(x; W) (continuous and differentiable) maps every point in the latent space to a point in the data space. Since the latent space is Q-dimensional, these points are confined to a Q-dimensional manifold non-linearly embedded in the D-dimensional data space. PPS builds a constrained mixture of Gaussians (where the priors are all fixed to 1/M)

$$p(\mathbf{t} \mid \mathbf{W}, \boldsymbol{\Sigma}) = \frac{1}{M} \sum_{m=1}^{M} p(\mathbf{t} \mid \mathbf{x}_m, \mathbf{W}, \boldsymbol{\Sigma}_m), \qquad (1)$$

in which each component has the form

$$p(\mathbf{t} \mid \mathbf{x}_m, \mathbf{W}, \boldsymbol{\Sigma}_m) = |\boldsymbol{\Sigma}_m|^{-\frac{1}{2}} (2\pi)^{-\frac{D}{2}} \exp\left\{-\tfrac{1}{2}\,(\mathbf{y}(\mathbf{x}_m; \mathbf{W}) - \mathbf{t})\,\boldsymbol{\Sigma}_m^{-1}\,(\mathbf{y}(\mathbf{x}_m; \mathbf{W}) - \mathbf{t})^{T}\right\}, \qquad (2)$$

where t is a point in the data space and Σ_m denotes the noise covariance, defined as

$$\boldsymbol{\Sigma}_m = \frac{\alpha}{\beta} \sum_{q=1}^{Q} \mathbf{e}_q(\mathbf{x})\mathbf{e}_q^{T}(\mathbf{x}) + \frac{D - \alpha Q}{\beta (D - Q)} \sum_{d=Q+1}^{D} \mathbf{e}_d(\mathbf{x})\mathbf{e}_d^{T}(\mathbf{x}), \qquad 0 < \alpha < \frac{D}{Q}, \qquad (3)$$

where α is a clamping factor which determines the orientation of the covariance, {e_q(x)}_{q=1}^{Q} is the set of orthonormal vectors tangential to the manifold at y(x; W), and {e_d(x)}_{d=Q+1}^{D} is the set of orthonormal vectors orthogonal to the manifold at y(x; W). The complete set of orthonormal vectors {e_d(x)}_{d=1}^{D} spans R^D. The EM algorithm [8] can be used to estimate the PPS parameters W and β, while the clamping factor α is fixed by the user and is assumed constant during the EM iterations. The mapping y(x; W) takes the form of a generalized linear regression model

$$\mathbf{y}(\mathbf{x}; \mathbf{W}) = \mathbf{W}\boldsymbol{\phi}(\mathbf{x}), \qquad (4)$$

where the elements of φ(x) consist of L fixed basis functions {φ_l(x)}_{l=1}^{L}, and W is a D × L matrix.
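As a concrete illustration of Eqs. (1), (2) and (4), the following minimal sketch evaluates the PPS mixture density at a single point under the isotropic simplification α = 1 (so that Σ_m reduces to I/β); the array names and toy sizes are our own illustrative choices, not the AstroNeural implementation.

```python
import numpy as np

def pps_mixture_density(t, W, Phi, beta):
    # Centers y(x_m; W) = W phi(x_m), one per latent node (Eq. (4)): shape (M, D).
    Y = Phi @ W.T
    D = t.shape[0]
    # Isotropic components (clamping factor alpha = 1 => Sigma_m = I / beta), Eq. (2).
    sq = np.sum((Y - t) ** 2, axis=1)
    comp = (beta / (2.0 * np.pi)) ** (D / 2.0) * np.exp(-0.5 * beta * sq)
    # Equal mixture priors 1/M, Eq. (1): the density is the plain average.
    return comp.mean()

# toy check: a single node mapped to the origin of a 1-D data space
Phi = np.array([[1.0]])          # phi(x_1), L = 1 basis function
W = np.array([[0.0]])            # D x L = 1 x 1 weight matrix
p = pps_mixture_density(np.array([0.0]), W, Phi, beta=1.0)
```

With a single standard Gaussian component, the density at the center is (2π)^(-1/2) ≈ 0.399, which the toy check reproduces.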
A. Spherical PPS
If Q = 3 is chosen, a spherical manifold [6] can be constructed using a PPS with nodes {x_m}_{m=1}^{M} arranged regularly on the surface of a sphere in the R^3 latent space, with the latent basis functions evenly distributed on the sphere at a lower density.

Fig. 1. (a) The spherical manifold in R^3 latent space. (b) The spherical manifold in R^D data space. (c) Projection of data points t onto the latent spherical manifold.

III. APPLICATION OF PPS TO DATA MINING

A. Visualization
After a PPS model is fitted to the data, several visualization possibilities are available for analyzing the data points.

1) Data point projections onto the latent sphere: The data are projected into the latent space as points on a sphere (Figure 1). The latent manifold coordinates x̂_n of each data point t_n are computed as

$$\hat{\mathbf{x}}_n \equiv \langle \mathbf{x} \mid \mathbf{t}_n \rangle = \int \mathbf{x}\, p(\mathbf{x} \mid \mathbf{t}_n)\, d\mathbf{x} = \sum_{m=1}^{M} r_{mn}\, \mathbf{x}_m,$$

where the r_mn are the latent variable responsibilities, defined as

$$r_{mn} = p(\mathbf{x}_m \mid \mathbf{t}_n) = \frac{p(\mathbf{t}_n \mid \mathbf{x}_m) P(\mathbf{x}_m)}{\sum_{m'=1}^{M} p(\mathbf{t}_n \mid \mathbf{x}_{m'}) P(\mathbf{x}_{m'})} = \frac{p(\mathbf{t}_n \mid \mathbf{x}_m)}{\sum_{m'=1}^{M} p(\mathbf{t}_n \mid \mathbf{x}_{m'})}. \qquad (5)$$

Since ||x_m|| = 1 and Σ_m r_mn = 1 for n = 1, ..., N, these coordinates lie within the unit sphere, i.e. ||x̂_n|| ≤ 1.

2) Interactively selecting points on the sphere: Having projected the data onto the latent sphere, a typical task performed by most data analysts is the localization of the most interesting data points, for instance those lying far away from denser areas (outliers), or those lying in the overlapping regions between clusters, and the investigation of their characteristics by linking the data points on the sphere to their position in the original data set. For instance, in the astronomical application described in Section IV, if the images corresponding to the data were available, the user might want to visualize the object corresponding to the data point selected on the sphere. The user is also allowed to select a latent variable and color all the points for which that specific latent variable is responsible (Figure 2).
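Given the component likelihoods p(t_n | x_m), Eq. (5) and the projection formula above reduce to a few lines of linear algebra. The sketch below assumes the likelihood matrix has already been evaluated; the names are illustrative, not from the authors' code.

```python
import numpy as np

def project_onto_sphere(P, nodes):
    # P[n, m] = p(t_n | x_m); nodes is (M, 3): unit vectors on the latent sphere.
    # With equal priors P(x_m) = 1/M, Eq. (5) is a row-wise normalisation.
    R = P / P.sum(axis=1, keepdims=True)   # responsibilities r_mn
    # Posterior mean <x | t_n> = sum_m r_mn x_m: a convex combination of
    # unit vectors, hence always inside the unit sphere.
    return R @ nodes

# toy check: one point equally likely under two orthogonal nodes
nodes = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
P = np.array([[0.5, 0.5]])
xhat = project_onto_sphere(P, nodes)
```

Points responsible to a single node land on the sphere's surface; ambiguous points sink toward the interior, which is what makes the projection radius itself informative.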
3) Visualizing the latent variable responsibilities on the sphere: Some insight into the number of agglomerates localized on the spherical latent manifold is provided by the mean responsibility of each latent variable. Furthermore, if we build a spherical manifold composed of a set of faces, each delimited by four vertices, then we can color each face with colors varying in intensity according to the value of the responsibility associated with each vertex (and hence with each latent variable). The overall result is that the sphere will contain regions denser than others, and this information is easily visible and understandable (see Figure 3). Obviously, denser areas of the spherical manifold
might contain more than one cluster, and this calls for further investigation.

Fig. 2. Data points selection phase. The bold black circles represent the latent variables; the blue points represent the projected input data points. When a latent variable is selected, each projected point for which the variable is responsible is colored. By selecting a data point the user is provided with information about it: coordinates and index corresponding to the position in the original catalog.

Fig. 3. Probability density function on the latent sphere.

B. Clustering
Once the user has an overall idea of the number of clusters on the sphere, he can exploit this information through the use of agglomerative hierarchical clustering techniques [9] to identify the clusters. This task is accomplished by running the clustering algorithm on the Gaussian centers in the data space. Once the centers have been agglomerated, the points for which the centers falling in the same agglomerate are responsible are assigned to the same cluster. The projections of the points into the latent space are then used to visualize the clusters on the latent sphere [11] (see Fig. 4).

C. Classification
Classification can be accomplished in a twofold way: i) by constructing a reference manifold for each class defined in the classification problem, and then assigning any test point to the class of its nearest manifold (PPSRM); ii) by assigning a test point to the class with the maximum posterior class probability for the given new input (PPSPR).

Fig. 4. Clusters computed in data space by hierarchical clustering.

In [11] it was shown that this second form of classification leads to better performance.
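The PPSPR scheme is, in effect, Bayes' rule applied to per-class PPS densities. A minimal sketch follows, assuming one fitted density per class is available as a callable and class priors are estimated from training frequencies; the Gaussian stand-ins below are hypothetical placeholders, not PPS models.

```python
import numpy as np

def ppspr_classify(t, class_densities, priors):
    # Posterior P(c | t) is proportional to p(t | c) P(c); return the argmax class.
    post = np.array([p(t) for p in class_densities]) * np.asarray(priors)
    return int(np.argmax(post))

# toy stand-ins: two 1-D unit-variance Gaussian densities acting as class models
def gaussian(mu):
    return lambda t: np.exp(-0.5 * (t - mu) ** 2) / np.sqrt(2.0 * np.pi)

label = ppspr_classify(0.1, [gaussian(0.0), gaussian(5.0)], priors=[0.5, 0.5])
```

The same routine works unchanged whether the per-class densities come from a single PPS or from the committee schemes introduced next, since both expose a density value per test point.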
However, since PPS builds a probability density function as a mixture of Gaussian distributions trained through the EM algorithm, its performance may degrade with increasing data dimensionality, due to singularities and local maxima in the log-likelihood function. We therefore propose two schemes for designing a committee of spherical PPS so as to obtain improved probability density functions and hence classification rates. The area of ensembles of learning machines is now a well defined field and has been successfully applied to neural networks, especially in the case of supervised learning algorithms. Fewer applications can be found for unsupervised learning methodologies and for density estimation: among these, the works introduced in [13] and [14] both exploit techniques consolidated in supervised contexts, such as stacking [15] and bagging [5], and represent the basis of our implementations.

1) Stacked PPS: StPPS: The combining scheme described here may be seen as an instantiation of the method proposed in [14]. Let us suppose we are given S probabilistic principal surface models (i.e., S density estimators) {PPS_s(t)}_{s=1}^{S}, where PPS_s(t) is the s-th PPS model. Note that in the original formulation given in [14], the S density estimators could also be of different kinds, for example finite mixtures with a fixed number of component densities, or kernel density estimators with a fixed kernel and a single fixed global bandwidth in each dimension. Each of the S PPS models can be chosen to be sufficiently diverse, e.g. by considering different numbers of latent variables and latent bases. To stack the S PPS models, we follow the procedure described below: i) Let D be the training data set, with size |D| = N. Partition D v times, as in v-fold cross-validation. The v-th fold contains exactly (v-1)N/v training data points and N/v test data points, both drawn from the training set D. For each fold: a) fit each of the S PPS models to the training subset of D.
b) evaluate the likelihood of each data point in the test partition of D, for each of the S fitted models. ii) At the end of these preliminary steps, we obtain S density estimates for each of the N data points, which
are organized in a matrix A of size N × S, where each entry a_is is PPS_s(t_i). iii) Use the matrix A to estimate the combination coefficients {π_s}_{s=1}^{S} that maximize the log-likelihood at the points t_i of a stacked density model of the form

$$\mathrm{StPPS}(\mathbf{t}) = \sum_{s=1}^{S} \pi_s\, PPS_s(\mathbf{t}),$$

which corresponds to maximizing

$$\sum_{i=1}^{N} \ln\left(\sum_{s=1}^{S} \pi_s\, PPS_s(\mathbf{t}_i)\right)$$

as a function of the weight vector (π_1, ..., π_S). Direct maximization of this function is a non-linear optimization problem. We can, however, apply the EM algorithm directly, by observing that the stacked mixture is a finite mixture density with weights (π_1, ..., π_S). Thus, we can use the standard EM algorithm for mixtures, except that the parameters of the component densities PPS_s(t) are fixed and the only parameters allowed to vary are the mixture weights. iv) The concluding phase consists in re-estimating the parameters of each of the S component PPS models using all of the training data D. The stacked density model is then the linear combination of the component PPS models so obtained, with combining coefficients {π_s}_{s=1}^{S}.

2) Bagged PPS: BgPPS: This combining scheme employs bagging as a means of averaging a single PPS, in a way similar to the model proposed in [13]. All we have to do is train a number S of PPS with S bootstrap replicates of the original learning data set. At the end of this training process, we obtain S different density estimates, which are then averaged to form the overall density estimate. Formally, let D be the original training set of size N and {PPS_s}_{s=1}^{S} a set of PPS models: i) create S bootstrap replicates (sampled with replacement) of D, {D_Boot(s)}_{s=1}^{S}, each of size N; ii) train each of the S PPS models with a bootstrap replicate D_Boot(s); iii) at the end of the training we obtain S density estimates {PPS_s}_{s=1}^{S}; iv) average the S density estimates as

$$\mathrm{BgPPS}(\mathbf{t}) = \frac{1}{S} \sum_{s=1}^{S} PPS_s(\mathbf{t}).$$
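Step iii) of the stacking procedure is the standard mixture EM with frozen components, and the bagged estimate is a plain average. The sketch below assumes the held-out likelihood matrix A has already been computed; the iteration count and toy numbers are arbitrary choices of ours.

```python
import numpy as np

def stack_weights(A, iters=200):
    # A[i, s] = PPS_s(t_i): held-out likelihood of point i under model s.
    N, S = A.shape
    pi = np.full(S, 1.0 / S)                    # start from uniform weights
    for _ in range(iters):
        resp = pi * A                           # E-step: unnormalised posteriors
        resp /= resp.sum(axis=1, keepdims=True)
        pi = resp.mean(axis=0)                  # M-step: only the weights move
    return pi

def bagged_density(likelihoods):
    # BgPPS(t) = (1/S) * sum_s PPS_s(t), given the S per-model densities at t.
    return float(np.mean(likelihoods))

# toy check: model 0 explains the held-out points consistently better,
# so EM should shift nearly all the stacking weight onto it
A = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]])
pi = stack_weights(A)
```

Because the component densities are fixed, each EM iteration is just a responsibility-weighted average over points, so convergence is cheap even for large A.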
IV. CASE STUDY
The GOODS (Great Observatories Origins Deep Survey) catalog is composed of 2845 objects (both galaxies and stars). The survey was conducted in 7 optical bands, namely the U, B, V, R, I, J, K bands, and for the experiments described here we considered 3 different parameters (i.e., Kron radius, flux and magnitude) for each band, for a total of 21 parameters. The experiment's catalog therefore contains about 27 galaxies and 4 stars.

Fig. 7. GOODS Catalog: PPSRM, PPSPR, StPPS and BgPPS best model statistics (mean classification error and standard deviation).

From a computational point of view, the main peculiarity of this data set is that the majority of the objects are drop outs, i.e. they are not detected in at least one of the bands (not detected in one band only, in two bands, in three bands and so on). The data set therefore contains four classes of objects, namely stars (S), galaxies (G), stars which are drop outs (SD) and galaxies which are drop outs (GD) (we do not care about the number of bands for which an object is a drop out).

A. GOODS Catalog Visualizations
As can be seen from Figure 5(a), the PCA visualization of the GOODS catalogue provides no interesting information at all and displays only a single condensed group of data. In the PCA projection, the class of galaxies which are drop outs (whose objects are yellow colored), which contains the majority of objects (about 24), is almost totally hidden. The PPS projections (Figure 5(b)), instead, show a large group consisting of the drop out galaxies together with overlapping objects from the remaining classes, and a well bounded group of galaxies. Figures 6(a) and 6(b) also depict the latent variable probability densities for galaxy and star objects, respectively. Note, especially, how different these densities appear for each group of objects.
B. GOODS Catalog Classification
The GOODS catalog classification task is very complex. As was to be expected on the grounds of astronomical expertise, the four classes are heavily overlapping, and even in the best cases there are classes (i.e., S and SD) whose objects are classified with an error rate of about 6%. This is evident from the results obtained by the different PPS classifiers we compared, namely PPSRM, PPSPR, StPPS and BgPPS. In any case, ensembles of PPS perform better than a single PPS, as can be seen in Figure 7. BgPPS, in particular, obtains the best performance, with a best case classification error of 2.5%, as shown in Table I. The BgPPS parameter settings are shown in Table II.
Fig. 5. (a) GOODS 3D PCA projections, (b) PPS projections on the sphere.

Fig. 6. (a) Galaxy density on the sphere, (b) star density on the sphere.

TABLE I
GOODS CATALOG: CONFUSION MATRIX COMPUTED BY THE BgPPS(2.5) BEST MODEL (CLASSES S, G, SD, GD)

TABLE II
GOODS CATALOG: BgPPS PARAMETER SETTINGS

Parameter   Value   Description
M           266     number of latent variables
L           83      number of basis functions
L_fac               basis functions width
iter                maximum number of iterations
ε                   early stopping threshold

V. CONCLUSIONS AND FUTURE WORK
We have described how spherical PPS works as a framework addressing data mining activities such as visualization, clustering and classification, and we have seen its power and effectiveness when dealing with high-d data such as astronomical data. Above all, the spherical PPS, which consists of a spherical latent manifold lying in a three dimensional latent space, is well suited to high-d data, since the sphere is able to capture the sparsity and periphery of data in large input spaces, which are due to the curse of dimensionality. Currently we are pursuing two directions to further enhance our system: i) developing a clustering algorithm able to directly exploit the PPS Gaussian mixture density to compute the clusters. The algorithm is based on the Kullback-Leibler distance to decide whether two Gaussian components of the PPS mixture model must be aggregated. In this way the clustering is able to follow the input data density and to
compute by itself the number of clusters; ii) building a hierarchical PPS for constructing localized nonlinear projection manifolds, as already done for GTM [12] and previously for a linear latent variable model [4]. Following [12], a hierarchy of PPS could be organized in a tree whose root corresponds to the PPS model trained on the entire data set at hand, and whose nodes, built interactively in a top-down fashion, represent PPS models trained on localized regions of the input data chosen interactively by the user in the plot of the ancestor PPS. In all the sub-models one might exploit all the visualization and clustering options discussed in this paper.

ACKNOWLEDGMENT
The authors would like to thank all past and present members of the AstroNeural collaboration. AstroNeural is sponsored by the MIUR (Italian Ministry for University and Research) and by Regione Campania. The authors also wish to thank P. Benvenuti for many discussions and for supporting this work since its beginning.

REFERENCES
[1] J. Abello, P.M. Pardalos, M.G.C. Resende (Eds.), Handbook of Massive Data Sets, Kluwer Academic Publishers, 2002.
[2] C.M. Bishop, M. Svensén, C.K.I. Williams, GTM: The Generative Topographic Mapping, Neural Computation, 10(1), 1998.
[3] C.M. Bishop, M. Svensén, C.K.I. Williams, Developments of the Generative Topographic Mapping, Neurocomputing, 21, 1998.
[4] C.M. Bishop, M.E. Tipping, A hierarchical latent variable model for data visualization, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), 281-293, 1998.
[5] L. Breiman, Bagging Predictors, Machine Learning, 24, 1996.
[6] K. Chang, Nonlinear Dimensionality Reduction Using Probabilistic Principal Surfaces, PhD Thesis, Department of Electrical and Computer Engineering, The University of Texas at Austin, USA, 2000.
[7] K. Chang, J. Ghosh, A Unified Model for Probabilistic Principal Surfaces, IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(1), 2001.
[8] A.P. Dempster, N.M. Laird, D.B.
Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society B, 39(1), 1977.
[9] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley and Sons, 2001.
[10] T. Kohonen, Self-Organizing Maps, Springer-Verlag, Berlin, 1995.
[11] A. Staiano, Unsupervised Neural Networks for the Extraction of Scientific Information from Astronomical Data, PhD Thesis, University of Salerno, Italy, 2003.
[12] P. Tino, I. Nabney, Hierarchical GTM: constructing localized non-linear projection manifolds in a principled way, IEEE Transactions on Pattern Analysis and Machine Intelligence, in print.
[13] D. Ormoneit, V. Tresp, Averaging, Maximum Likelihood and Bayesian Estimation for Improving Gaussian Mixture Probability Density Estimates, IEEE Transactions on Neural Networks, 9(4), 1998.
[14] P. Smyth, D.H. Wolpert, An evaluation of linearly combining density estimators via stacking, Machine Learning, 36, 1999.
[15] D.H. Wolpert, Stacked Generalization, Neural Networks, 5, 241-259, 1992.
More informationInternational Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationChapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
More informationComponent Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
More informationSpecific Usage of Visual Data Analysis Techniques
Specific Usage of Visual Data Analysis Techniques Snezana Savoska 1 and Suzana Loskovska 2 1 Faculty of Administration and Management of Information systems, Partizanska bb, 7000, Bitola, Republic of Macedonia
More informationMixtures of Robust Probabilistic Principal Component Analyzers
Mixtures of Robust Probabilistic Principal Component Analyzers Cédric Archambeau, Nicolas Delannay 2 and Michel Verleysen 2 - University College London, Dept. of Computer Science Gower Street, London WCE
More information270107 - MD - Data Mining
Coordinating unit: Teaching unit: Academic year: Degree: ECTS credits: 015 70 - FIB - Barcelona School of Informatics 715 - EIO - Department of Statistics and Operations Research 73 - CS - Department of
More informationUnsupervised Data Mining (Clustering)
Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in
More informationHow To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
More informationNeural Networks Lesson 5 - Cluster Analysis
Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29
More informationPrinciples of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
More informationLogistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationData Mining and Neural Networks in Stata
Data Mining and Neural Networks in Stata 2 nd Italian Stata Users Group Meeting Milano, 10 October 2005 Mario Lucchini e Maurizo Pisati Università di Milano-Bicocca mario.lucchini@unimib.it maurizio.pisati@unimib.it
More informationBIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376
Course Director: Dr. Kayvan Najarian (DCM&B, kayvan@umich.edu) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.
More informationAdaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement
Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement Toshio Sugihara Abstract In this study, an adaptive
More informationStatistical Models in Data Mining
Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of
More informationLearning Vector Quantization: generalization ability and dynamics of competing prototypes
Learning Vector Quantization: generalization ability and dynamics of competing prototypes Aree Witoelar 1, Michael Biehl 1, and Barbara Hammer 2 1 University of Groningen, Mathematics and Computing Science
More informationAutomated Stellar Classification for Large Surveys with EKF and RBF Neural Networks
Chin. J. Astron. Astrophys. Vol. 5 (2005), No. 2, 203 210 (http:/www.chjaa.org) Chinese Journal of Astronomy and Astrophysics Automated Stellar Classification for Large Surveys with EKF and RBF Neural
More informationCustomer Classification And Prediction Based On Data Mining Technique
Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor
More informationIntroduction to General and Generalized Linear Models
Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby
More informationLecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
More informationExample application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health
Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining
More informationFiltered Gaussian Processes for Learning with Large Data-Sets
Filtered Gaussian Processes for Learning with Large Data-Sets Jian Qing Shi, Roderick Murray-Smith 2,3, D. Mike Titterington 4,and Barak A. Pearlmutter 3 School of Mathematics and Statistics, University
More informationVisualization of textual data: unfolding the Kohonen maps.
Visualization of textual data: unfolding the Kohonen maps. CNRS - GET - ENST 46 rue Barrault, 75013, Paris, France (e-mail: ludovic.lebart@enst.fr) Ludovic Lebart Abstract. The Kohonen self organizing
More informationL25: Ensemble learning
L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna
More information. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns
Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties
More informationChapter ML:XI (continued)
Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained
More informationVisualization methods for patent data
Visualization methods for patent data Treparel 2013 Dr. Anton Heijs (CTO & Founder) Delft, The Netherlands Introduction Treparel can provide advanced visualizations for patent data. This document describes
More informationDAME Astrophysical DAta Mining Mining & & Exploration Exploration GRID
DAME Astrophysical DAta Mining & Exploration on GRID M. Brescia S. G. Djorgovski G. Longo & DAME Working Group Istituto Nazionale di Astrofisica Astronomical Observatory of Capodimonte, Napoli Department
More informationA comparison of various clustering methods and algorithms in data mining
Volume :2, Issue :5, 32-36 May 2015 www.allsubjectjournal.com e-issn: 2349-4182 p-issn: 2349-5979 Impact Factor: 3.762 R.Tamilselvi B.Sivasakthi R.Kavitha Assistant Professor A comparison of various clustering
More informationThe Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
More informationProbabilistic Latent Semantic Analysis (plsa)
Probabilistic Latent Semantic Analysis (plsa) SS 2008 Bayesian Networks Multimedia Computing, Universität Augsburg Rainer.Lienhart@informatik.uni-augsburg.de www.multimedia-computing.{de,org} References
More informationLecture 9: Introduction to Pattern Analysis
Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns
More informationMobile Phone APP Software Browsing Behavior using Clustering Analysis
Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis
More informationSTATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
More information6.2.8 Neural networks for data mining
6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural
More informationUSING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS
USING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS Koua, E.L. International Institute for Geo-Information Science and Earth Observation (ITC).
More informationSelf-Organizing g Maps (SOM) COMP61021 Modelling and Visualization of High Dimensional Data
Self-Organizing g Maps (SOM) Ke Chen Outline Introduction ti Biological Motivation Kohonen SOM Learning Algorithm Visualization Method Examples Relevant Issues Conclusions 2 Introduction Self-organizing
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
More informationSupervised and unsupervised learning - 1
Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in
More informationMachine Learning and Data Mining. Fundamentals, robotics, recognition
Machine Learning and Data Mining Fundamentals, robotics, recognition Machine Learning, Data Mining, Knowledge Discovery in Data Bases Their mutual relations Data Mining, Knowledge Discovery in Databases,
More informationLearning outcomes. Knowledge and understanding. Competence and skills
Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges
More informationData mining and statistical models in marketing campaigns of BT Retail
Data mining and statistical models in marketing campaigns of BT Retail Francesco Vivarelli and Martyn Johnson Database Exploitation, Segmentation and Targeting group BT Retail Pp501 Holborn centre 120
More informationLinear Classification. Volker Tresp Summer 2015
Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong
More informationA Computational Framework for Exploratory Data Analysis
A Computational Framework for Exploratory Data Analysis Axel Wismüller Depts. of Radiology and Biomedical Engineering, University of Rochester, New York 601 Elmwood Avenue, Rochester, NY 14642-8648, U.S.A.
More informationUW CSE Technical Report 03-06-01 Probabilistic Bilinear Models for Appearance-Based Vision
UW CSE Technical Report 03-06-01 Probabilistic Bilinear Models for Appearance-Based Vision D.B. Grimes A.P. Shon R.P.N. Rao Dept. of Computer Science and Engineering University of Washington Seattle, WA
More informationTree based ensemble models regularization by convex optimization
Tree based ensemble models regularization by convex optimization Bertrand Cornélusse, Pierre Geurts and Louis Wehenkel Department of Electrical Engineering and Computer Science University of Liège B-4000
More informationData Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationGalaxy Morphological Classification
Galaxy Morphological Classification Jordan Duprey and James Kolano Abstract To solve the issue of galaxy morphological classification according to a classification scheme modelled off of the Hubble Sequence,
More informationDATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar
More informationClustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012
Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Outline Big Data How to extract information? Data clustering
More informationSupervised Feature Selection & Unsupervised Dimensionality Reduction
Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or
More informationModel-Based Cluster Analysis for Web Users Sessions
Model-Based Cluster Analysis for Web Users Sessions George Pallis, Lefteris Angelis, and Athena Vakali Department of Informatics, Aristotle University of Thessaloniki, 54124, Thessaloniki, Greece gpallis@ccf.auth.gr
More informationLinear Models for Classification
Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci
More informationMethodology for Emulating Self Organizing Maps for Visualization of Large Datasets
Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets Macario O. Cordel II and Arnulfo P. Azcarraga College of Computer Studies *Corresponding Author: macario.cordel@dlsu.edu.ph
More informationEnvironmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
More informationClassification Techniques for Remote Sensing
Classification Techniques for Remote Sensing Selim Aksoy Department of Computer Engineering Bilkent University Bilkent, 06800, Ankara saksoy@cs.bilkent.edu.tr http://www.cs.bilkent.edu.tr/ saksoy/courses/cs551
More informationVisualizing pay-per-view television customers churn using cartograms and flow maps
Visualizing pay-per-view television customers churn using cartograms and flow maps David L. García 1 and Àngela Nebot1 and Alfredo Vellido 1 1- Dept. de Llenguatges i Sistemes Informàtics Universitat Politècnica
More informationStatistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
More informationStatistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique
More information