Visualization, Clustering and Classification of Multidimensional Astronomical Data


Antonino Staiano, Angelo Ciaramella, Lara De Vinco, Ciro Donalek, Giuseppe Longo, Giancarlo Raiconi, Roberto Tagliaferri, Roberto Amato, Carmine Del Mondo, Giuseppe Mangano, Gennaro Miele

Dipartimento di Matematica ed Informatica, Università di Salerno, Fisciano (Sa), Italy
Dipartimento di Scienze Fisiche, Università Federico II di Napoli, Italy
INFOTEL S.r.l., Via Strauss, Battipaglia (Sa), Italy

Abstract — Due to recent technological advances, Data Mining in massive data sets has evolved into a crucial research field for many, if not all, areas of research: from astronomy to high energy physics, to genetics, etc. In this paper we discuss an implementation of Probabilistic Principal Surfaces (PPS) developed within the framework of the AstroNeural collaboration. PPS are a nonlinear latent variable model which may be regarded as a complete mathematical framework for accomplishing some fundamental data mining activities, such as visualization, clustering and classification of high dimensional data. The effectiveness of the proposed model is exemplified on a complex astronomical data set.

I. INTRODUCTION

The explosive growth in the quantity, quality and accessibility of data currently experienced in all fields of science and human endeavor has triggered the search for a new generation of computational theories and tools capable of assisting humans in extracting useful information (knowledge) from the available and planned massive data sets.
This revolution has two main aspects. On the one hand, in astronomy (as well as in high energy physics, genetics, social sciences, and many other fields) traditional interactive data analysis and data visualization methods have proved to be largely inadequate to cope with data sets characterized by huge volumes and/or complexity (tens or hundreds of parameters or features per record, cf. [1] and references therein). On the other hand, the simultaneous analysis of hundreds of parameters unveils previously unknown patterns which may lead to a deeper understanding of the underlying phenomena and trends. The field of Data Mining is therefore becoming of paramount importance not only in its traditional arena but also as an auxiliary tool for almost all fields of research. In this paper we discuss how three common tasks in data analysis (data visualization, clustering and data classification) may be performed using Spherical Probabilistic Principal Surfaces (PPS) as a common framework.

Visualization: a crucial step in the process of data analysis, enabling an understanding of the relations that exist within the data by displaying them in such a way that the discovered patterns are emphasized.

Clustering: perhaps the most important and widely used method of unsupervised learning. It may be summarized as the problem of identifying groupings of similar points that are relatively isolated from each other, or, in other words, of partitioning the data into dissimilar groups of similar items.

Classification: the task of assigning a given pattern to one of a number of possible classes, which depend on the problem at hand. Such classes may be the result of a labeling of the groupings produced by a clustering procedure.

PPS [6], [7] (discussed in Section II) are a nonlinear extension of principal components, in that each node on the PPS is the average of all data points that project near/onto it.
From a theoretical standpoint, the PPS is a generalization of the Generative Topographic Mapping (GTM) [2], [3], which can in turn be seen as a parametric alternative to Self Organizing Maps (SOM) [10]. The advantages of PPS include a parametric and flexible formulation for any geometry/topology in any dimension, and guaranteed convergence (PPS training is accomplished through the Expectation-Maximization algorithm). A PPS is governed by its latent topology and, owing to this flexibility, a variety of PPS topologies can be created, one of which is the 3D sphere. The sphere is finite and unbounded, with all nodes distributed at the edge; it is therefore ideal for emulating the sparseness and peripheral property of high dimensional data. Furthermore, the sphere topology is easily comprehended by humans and can thereby be of great help in visualizing high dimensional data (Section III-A). Since a PPS generates a probability density function of the input data, in the form of a mixture of Gaussians, it can be used both for clustering (Section III-B) and classification (Section III-C) purposes. To illustrate the power and effectiveness of the model, we shall discuss a case study in the field of astronomy using a real and complex data set (Section IV). All results discussed here were obtained in the framework of the AstroNeural collaboration: a joint project between the Department of Mathematics and Informatics of the University of Salerno and the Department of Physical Sciences of the University Federico II in Napoli.

The main goal of the collaboration is to implement a user friendly data mining tool capable of dealing with heterogeneous, high dimensionality data sets.

II. PPS: THEORETICAL DESCRIPTION

PPS defines a non-linear, parametric mapping $y(\mathbf{x}; \mathbf{W})$ from a $Q$-dimensional latent space ($\mathbf{x} \in \mathbb{R}^Q$) to a $D$-dimensional data space ($\mathbf{t} \in \mathbb{R}^D$), where normally $Q < D$. The mapping $y(\mathbf{x}; \mathbf{W})$ (continuous and differentiable) maps every point in the latent space to a point in the data space. Since the latent space is $Q$-dimensional, these points are confined to a $Q$-dimensional manifold non-linearly embedded in the $D$-dimensional data space. PPS builds a constrained mixture of Gaussians (where the priors are all fixed to $1/M$)

$$p(\mathbf{t} \mid \mathbf{W}, \boldsymbol{\Sigma}) = \frac{1}{M} \sum_{m=1}^{M} p(\mathbf{t} \mid \mathbf{x}_m, \mathbf{W}, \boldsymbol{\Sigma}_m), \quad (1)$$

where each component has the form

$$p(\mathbf{t} \mid \mathbf{x}_m, \mathbf{W}, \boldsymbol{\Sigma}_m) = \frac{|\boldsymbol{\Sigma}_m|^{-1/2}}{(2\pi)^{D/2}} \, e^{-\frac{1}{2}\,(y(\mathbf{x}_m; \mathbf{W}) - \mathbf{t})\, \boldsymbol{\Sigma}_m^{-1} (y(\mathbf{x}_m; \mathbf{W}) - \mathbf{t})^{T}}, \quad (2)$$

where $\mathbf{t}$ is a point in the data space and $\boldsymbol{\Sigma}_m$ denotes the noise covariance. The covariance is defined as

$$\boldsymbol{\Sigma}_m = \frac{\alpha}{\beta} \sum_{q=1}^{Q} \mathbf{e}_q(\mathbf{x})\, \mathbf{e}_q^{T}(\mathbf{x}) + \frac{D - \alpha Q}{\beta (D - Q)} \sum_{d=Q+1}^{D} \mathbf{e}_d(\mathbf{x})\, \mathbf{e}_d^{T}(\mathbf{x}), \qquad 0 < \alpha < \frac{D}{Q}, \quad (3)$$

where $\alpha$ is a clamping factor which determines the orientation of the covariance, $\{\mathbf{e}_q(\mathbf{x})\}_{q=1}^{Q}$ is the set of orthonormal vectors tangential to the manifold at $y(\mathbf{x}; \mathbf{W})$, and $\{\mathbf{e}_d(\mathbf{x})\}_{d=Q+1}^{D}$ is the set of orthonormal vectors orthogonal to the manifold at $y(\mathbf{x}; \mathbf{W})$. The complete set of orthonormal vectors $\{\mathbf{e}_d(\mathbf{x})\}_{d=1}^{D}$ spans $\mathbb{R}^D$. The EM algorithm [8] can be used to estimate the PPS parameters $\mathbf{W}$ and $\beta$, while the clamping factor is fixed by the user and is assumed to be constant during the EM iterations. The mapping $y(\mathbf{x}; \mathbf{W})$ takes the form of a generalized linear regression model

$$y(\mathbf{x}; \mathbf{W}) = \mathbf{W} \phi(\mathbf{x}), \quad (4)$$

where the elements of $\phi(\mathbf{x})$ consist of $L$ fixed basis functions $\{\phi_l(\mathbf{x})\}_{l=1}^{L}$, and $\mathbf{W}$ is a $D \times L$ matrix.
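The constrained mixture can be sketched numerically. The following is a minimal illustration of equations (1), (2) and (4) in the simplified isotropic case $\alpha = 1$, where $\boldsymbol{\Sigma}_m$ reduces to $\beta^{-1}\mathbf{I}$; it is not the AstroNeural implementation, and the node layout and basis-center choices are assumptions made only for the sketch.

```python
import numpy as np

def fibonacci_sphere(m):
    """Quasi-uniform nodes on the unit sphere (a stand-in for the
    regular latent node arrangement used by the spherical PPS)."""
    i = np.arange(m) + 0.5
    phi = np.arccos(1 - 2 * i / m)           # polar angle
    theta = np.pi * (1 + 5 ** 0.5) * i       # golden-angle longitude
    return np.stack([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(phi)], axis=1)   # shape (m, 3)

def rbf_design(X, centers, width):
    """Gaussian basis functions phi_l(x) evaluated at the latent nodes X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))    # shape (M, L)

def pps_density(T, W, Phi, beta):
    """p(t) = (1/M) sum_m N(t | W phi(x_m), beta^{-1} I): equations (1),
    (2) and (4) with isotropic noise (clamping factor alpha = 1)."""
    Y = Phi @ W.T                            # (M, D) node images y(x_m; W)
    D = T.shape[1]
    d2 = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(-1)        # (N, M)
    log_comp = 0.5 * D * np.log(beta / (2 * np.pi)) - 0.5 * beta * d2
    # log-sum-exp over the M equally weighted components, for stability
    mx = log_comp.max(axis=1, keepdims=True)
    return np.exp(mx.squeeze()) * np.exp(log_comp - mx).mean(axis=1)
```

In a full implementation $\mathbf{W}$ and $\beta$ would be fitted by EM; here they are taken as given, since only the density form is being illustrated.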
A. Spherical PPS

If $Q = 3$ is chosen, a spherical manifold [6] can be constructed using a PPS with nodes $\{\mathbf{x}_m\}_{m=1}^{M}$ arranged regularly on the surface of a sphere in the $\mathbb{R}^3$ latent space, with the latent basis functions evenly distributed on the sphere at a lower density.

Fig. 1. (a) The spherical manifold in $\mathbb{R}^3$ latent space. (b) The spherical manifold in $\mathbb{R}^D$ data space. (c) Projection of data points $\mathbf{t}$ onto the latent spherical manifold.

III. APPLICATION OF PPS TO DATA MINING

A. Visualization

After a PPS model is fitted to the data, several visualization possibilities are available for analyzing the data points.

1) Data point projections onto the latent sphere: The data are projected into the latent space as points on a sphere (Figure 1). The latent manifold coordinates $\hat{\mathbf{x}}_n$ of each data point $\mathbf{t}_n$ are computed as

$$\hat{\mathbf{x}}_n \equiv \langle \mathbf{x} \mid \mathbf{t}_n \rangle = \int \mathbf{x}\, p(\mathbf{x} \mid \mathbf{t}_n)\, d\mathbf{x} = \sum_{m=1}^{M} r_{mn}\, \mathbf{x}_m,$$

where the $r_{mn}$ are the latent variable responsibilities, defined as

$$r_{mn} = p(\mathbf{x}_m \mid \mathbf{t}_n) = \frac{p(\mathbf{t}_n \mid \mathbf{x}_m) P(\mathbf{x}_m)}{\sum_{m'=1}^{M} p(\mathbf{t}_n \mid \mathbf{x}_{m'}) P(\mathbf{x}_{m'})} = \frac{p(\mathbf{t}_n \mid \mathbf{x}_m)}{\sum_{m'=1}^{M} p(\mathbf{t}_n \mid \mathbf{x}_{m'})}. \quad (5)$$

Since $\|\mathbf{x}_m\| = 1$ and $\sum_m r_{mn} = 1$ for $n = 1, \ldots, N$, these coordinates lie within a unit sphere, i.e. $\|\hat{\mathbf{x}}_n\| \le 1$.

2) Interactively selecting points on the sphere: Having projected the data onto the latent sphere, a typical task performed by most data analysts is to localize the most interesting data points, for instance those lying far away from the denser areas (outliers), or those lying in the overlapping regions between clusters, and to investigate their characteristics by linking the data points on the sphere with their position in the original data set. For instance, in the astronomical application described in Section IV, if the images corresponding to the data were available, the user might want to visualize the object corresponding to the data point selected on the sphere. The user is also allowed to select a latent variable and color all the points for which that specific latent variable is responsible (Figure 2).
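The projection step above is a small computation once the node images are known. A minimal sketch of equation (5) and of the posterior-mean projection, under the simplifying assumptions of equal priors $1/M$ and isotropic noise $\beta^{-1}\mathbf{I}$ (so that the Gaussians' common normalization cancels in the ratio):

```python
import numpy as np

def responsibilities(T, Y, beta):
    """Eq. (5): r_mn = p(x_m | t_n) for equal priors and isotropic
    noise; T holds the data (N, D), Y the node images y(x_m; W) (M, D)."""
    d2 = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # (N, M)
    log_r = -0.5 * beta * d2
    log_r -= log_r.max(axis=1, keepdims=True)             # numerical stability
    R = np.exp(log_r)
    return R / R.sum(axis=1, keepdims=True)               # each row sums to 1

def latent_projection(R, X):
    """x_hat_n = sum_m r_mn x_m: a convex combination of unit-norm
    latent nodes X (M, 3), hence each projection lies inside the sphere."""
    return R @ X
```

Because each row of `R` is a convex weighting of unit vectors, the resulting coordinates automatically satisfy $\|\hat{\mathbf{x}}_n\| \le 1$, as noted above.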
3) Visualizing the latent variable responsibilities on the sphere: Some insight into the number of agglomerates localized on the spherical latent manifold is provided by the responsibility of each latent variable. Furthermore, if we build a spherical manifold composed of a set of faces, each one delimited by four vertices, then we can color each face with colors varying in intensity according to the value of the responsibility associated with each vertex (and hence, with each latent variable). The overall result is that the sphere will contain regions denser than others, and this information is easily visible and understandable (see Figure 3). Obviously, denser areas of the spherical manifold

might contain more than one cluster, and this calls for further investigation.

Fig. 2. Data points selection phase. The bold black circles represent the latent variables; the blue points represent the projected input data points. When a latent variable is selected, each projected point for which the variable is responsible is colored. By selecting a data point, the user is provided with information about it: its coordinates and the index corresponding to its position in the original catalog.

Fig. 3. Probability density function on the latent sphere.

B. Clustering

Once the user has an overall idea of the number of clusters on the sphere, this information can be exploited through the use of agglomerative hierarchical clustering techniques [9] to find the clusters. This task is accomplished by running the clustering algorithm on the Gaussian centers in the data space. Once the centers have been agglomerated, the points for which the centers falling in the same agglomerate are responsible are assigned to the same cluster. The projections of the points into the latent space are then used to visualize the clusters on the latent sphere [11] (see Fig. 4).

Fig. 4. Clusters computed in data space by hierarchical clustering.

C. Classification

Classification can be accomplished in two ways: i) by constructing a reference manifold for each class defined in the classification problem, and then assigning any test point to the class of its nearest manifold (PPSRM); ii) by assigning a test point to the class with the maximum posterior class probability for the given new input (PPSPR). In [11] it was shown that the second form of classification leads to better performance.
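The second rule amounts to a standard Bayes classifier over per-class density estimates. A minimal sketch, assuming one fitted density $p(\mathbf{t} \mid c)$ per class is available as a callable (the function and argument names are illustrative, not taken from the AstroNeural tool):

```python
import numpy as np

def classify_max_posterior(T, class_densities, priors):
    """Assign each point t to argmax_c p(t | c) P(c); each class
    density p(t | c) would come from a PPS fitted to class c alone."""
    likes = np.stack([p(T) for p in class_densities], axis=1)  # (N, C)
    posterior = likes * np.asarray(priors)[None, :]            # unnormalized
    return posterior.argmax(axis=1)
```

With equal class priors the rule reduces to picking the class whose density assigns the point the highest likelihood.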
However, since PPS builds a probability density function as a mixture of Gaussian distributions trained through the EM algorithm, its performance may degrade with increasing data dimensionality, due to singularities and local maxima in the log-likelihood function. We therefore propose two schemes for designing a committee of spherical PPS, in order to obtain improved probability density estimates and hence better classification rates. The area of ensembles of learning machines is now a well defined field and has been successfully applied to neural networks, especially in the case of supervised learning algorithms. Fewer applications can be found for unsupervised learning methodologies and for density estimation: among these, the works introduced in [13] and [14] both exploit techniques consolidated in supervised contexts, such as stacking [15] and bagging [5], and they represent the basis of our implementations.

1) Stacked PPS (StPPS): The combining scheme described here may be seen as an instantiation of the method proposed in [14]. Let us suppose we are given $S$ probabilistic principal surface models (i.e., $S$ density estimators) $\{PPS_s(\mathbf{t})\}_{s=1}^{S}$, where $PPS_s(\mathbf{t})$ is the $s$-th PPS model. Note that in the original formulation given in [14], the $S$ density estimators could also be of different kinds, for example finite mixtures with a fixed number of component densities, or kernel density estimators with a fixed kernel and a single fixed global bandwidth in each dimension. Each of the $S$ PPS models can be made diverse enough, e.g. by considering different numbers of latent variables and latent bases. To stack the $S$ PPS models, we follow the procedure described below:

i) Let $D$ be the training data set, with size $|D| = N$. Partition $D$ $v$ times, as in $v$-fold cross-validation. The $v$-th fold contains exactly $(v-1)N/v$ training data points and $N/v$ test data points, both from the training set $D$. For each fold: a) fit each of the $S$ PPS models to the training subset of $D$;
b) evaluate the likelihood of each data point in the test partition of $D$, for each of the $S$ fitted models.

ii) At the end of these preliminary steps, we obtain $S$ density estimates for each of the $N$ data points, which

are organized in a matrix $A$ of size $N \times S$, where each entry $a_{is}$ is $PPS_s(\mathbf{t}_i)$.

iii) Use the matrix $A$ to estimate the combination coefficients $\{\pi_s\}_{s=1}^{S}$ that maximize the log-likelihood, at the points $\mathbf{t}_i$, of a stacked density model of the form

$$StPPS(\mathbf{t}) = \sum_{s=1}^{S} \pi_s\, PPS_s(\mathbf{t}),$$

which corresponds to maximizing

$$\sum_{i=1}^{N} \ln \left( \sum_{s=1}^{S} \pi_s\, PPS_s(\mathbf{t}_i) \right)$$

as a function of the weight vector $(\pi_1, \ldots, \pi_S)$. Direct maximization of this function is a non-linear optimization problem. We can apply the EM algorithm directly, by observing that the stacked mixture is a finite mixture density with weights $(\pi_1, \ldots, \pi_S)$. Thus, we can use the standard EM algorithm for mixtures, except that the parameters of the component densities $PPS_s(\mathbf{t})$ are fixed and the only parameters allowed to vary are the mixture weights.

iv) The concluding phase consists in re-estimating the parameters of each of the $S$ component PPS models using all of the training data $D$. The stacked density model is then the linear combination of the resulting component PPS models, with combining coefficients $\{\pi_s\}_{s=1}^{S}$.

2) Bagged PPS (BgPPS): This combining scheme employs bagging as a means of averaging a single PPS, in a way similar to the model proposed in [13]. All we have to do is to train a number $S$ of PPS with $S$ bootstrap replicates of the original learning data set. At the end of this training process, we obtain $S$ different density estimates, which are then averaged to form the overall density estimate. Formally, let $D$ be the original training set of size $N$ and $\{PPS_s\}_{s=1}^{S}$ a set of PPS models:

i) create $S$ bootstrap replicates (sampled with replacement) of $D$, $\{D_{Boot}(s)\}_{s=1}^{S}$, each of size $N$;
ii) train each of the $S$ PPS models with a bootstrap replicate $D_{Boot}(s)$;
iii) at the end of the training we obtain $S$ density estimates $\{PPS_s\}_{s=1}^{S}$;
iv) average the $S$ density estimates as

$$BgPPS(\mathbf{t}) = \frac{1}{S} \sum_{s=1}^{S} PPS_s(\mathbf{t}).$$
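The weight-only EM of the stacking step iii) is short enough to write down. A sketch, assuming the cross-validated likelihood matrix $A$ has already been computed as described above:

```python
import numpy as np

def stacking_weights(A, n_iter=200):
    """EM for the mixing coefficients of StPPS(t) = sum_s pi_s PPS_s(t).
    A[i, s] = PPS_s(t_i) is held fixed; only the weights pi move.
    E-step: gamma_is = pi_s A[i, s] / sum_s' pi_s' A[i, s'].
    M-step: pi_s = (1/N) sum_i gamma_is."""
    N, S = A.shape
    pi = np.full(S, 1.0 / S)                # uniform initialization
    for _ in range(n_iter):
        g = A * pi                          # (N, S) E-step numerator
        g /= g.sum(axis=1, keepdims=True)   # normalize responsibilities
        pi = g.mean(axis=0)                 # M-step
    return pi
```

As for any EM on a finite mixture, each iteration cannot decrease the stacked log-likelihood, so the weights converge to a stationary point without any line search.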
IV. CASE STUDY

The GOODS (Great Observatories Origins Deep Survey) catalog is composed of 2845 objects (both galaxies and stars). The survey was conducted in 7 optical bands, namely the U, B, V, R, I, J, K bands, and for the experiments described here we considered 3 different parameters (Kron radius, flux and magnitude) for each band, for a total of 21 parameters. The experiment's catalog therefore contains about 27 galaxies and 4 stars.

Fig. 7. GOODS Catalog: PPSRM, PPSPR, StPPS and BgPPS best model statistics (mean classification error and standard deviation).

From a computational point of view, the main peculiarity of this data set is that the majority of the objects are drop-outs, i.e. they are not detected in at least one of the bands (not detected in only one band, two bands, three bands and so on). The data set therefore contains four classes of objects, namely stars (S), galaxies (G), stars which are drop-outs (SD) and galaxies which are drop-outs (GD) (we do not care about the number of bands for which an object is a drop-out).

A. GOODS Catalog Visualizations

As can be seen from Figure 5(a), the PCA visualization of the GOODS catalog provides no interesting information at all and displays only a single condensed group of data. In the PCA view, the class of galaxies which are drop-outs (whose objects are colored yellow), which contains the majority of objects (about 24), is almost totally hidden. The PPS projections (Figure 5(b)), instead, show a large group consisting of the drop-out galaxies overlapping with the remaining objects, and a well bounded group of galaxies. Figures 6(a) and 6(b) also depict the latent variable probability densities for galaxy and star objects, respectively. Note, especially, how different these densities appear for each group of objects.
B. GOODS Catalog Classification

The GOODS catalog classification task is very complex. As was to be expected on the grounds of astronomical expertise, the four classes are heavily overlapping, and even in the best cases there are classes (i.e., S and SD) whose objects are classified with an error rate of about 6%. This is evident from the results obtained by the different PPS classifiers we compared, namely PPSRM, PPSPR, StPPS and BgPPS. In any case, ensembles of PPS perform better than a single PPS, as can be seen in Figure 7. BgPPS, in particular, obtains the best performance, with a best case classification error of 2.5%, as shown in Table I. The BgPPS parameter settings are shown in Table II.
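Error rates like those reported here are read off a confusion matrix over the four classes. A generic sketch of that bookkeeping (the labels in the test are hypothetical, not the actual GOODS results):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts objects of true class i predicted as class j
    (in this setting the classes would be S=0, G=1, SD=2, GD=3)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def classification_error(cm):
    """Fraction of off-diagonal (misclassified) objects."""
    return 1.0 - np.trace(cm) / cm.sum()
```

The per-class error rates quoted for S and SD correspond to the off-diagonal mass of the matching rows, normalized by each row total.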

Fig. 5. (a) GOODS 3D PCA projections; (b) PPS projections on the sphere.

Fig. 6. (a) Galaxy density on the sphere; (b) star density on the sphere.

TABLE I
GOODS CATALOG: CONFUSION MATRIX COMPUTED BY THE BgPPS BEST MODEL (CLASSES S, G, SD, GD)

TABLE II
GOODS CATALOG: BgPPS PARAMETER SETTINGS

Parameter   Value   Description
M           266     number of latent variables
L           83      number of basis functions
L_fac               basis functions width
iter                maximum number of iterations
ɛ                   early stopping threshold

V. CONCLUSIONS AND FUTURE WORK

We have described how spherical PPS work as a framework for addressing data mining activities such as visualization, clustering and classification, and we have seen their power and effectiveness when dealing with high dimensional data such as astronomical data. Above all, the spherical PPS, which consists of a spherical latent manifold lying in a three dimensional latent space, is well suited to high dimensional data, since the sphere is able to capture the sparsity and periphery of data in large input spaces, which are due to the curse of dimensionality. Currently we are pursuing two directions to further enhance our system: i) developing a clustering algorithm able to directly exploit the PPS Gaussian mixture density to compute the clusters. The algorithm is based on the Kullback-Leibler distance to decide whether two Gaussian components of the PPS mixture model should be aggregated. In this way the clustering is able to follow the input data density and to

compute by itself the number of clusters; ii) building a hierarchical PPS for constructing localized nonlinear projection manifolds, as already done for GTM [12] and previously for a linear latent variable model [4]. Following [12], a hierarchy of PPS could be organized in a tree whose root corresponds to the PPS model trained on the entire data set, and whose nodes, built interactively in a top-down fashion, represent PPS models trained on localized regions of the input data chosen interactively by the user in the parent PPS plot. In all the sub-models one might exploit all the visualization and clustering options discussed in this paper.

ACKNOWLEDGMENT

The authors would like to thank all past and present members of the AstroNeural collaboration. AstroNeural is sponsored by the MIUR (Italian Ministry for University and Research) and by Regione Campania. The authors also wish to thank P. Benvenuti for many discussions and for supporting this work since its beginning.

REFERENCES

[1] J. Abello, P.M. Pardalos, M.G.C. Resende (Eds.), Handbook of Massive Data Sets, Kluwer Academic Publishers, 2002.
[2] C.M. Bishop, M. Svensén, C.K.I. Williams, GTM: The Generative Topographic Mapping, Neural Computation, 10(1), 1998.
[3] C.M. Bishop, M. Svensén, C.K.I. Williams, Developments of the Generative Topographic Mapping, Neurocomputing, 21, 1998.
[4] C.M. Bishop, M.E. Tipping, A hierarchical latent variable model for data visualization, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), 281-293, 1998.
[5] L. Breiman, Bagging Predictors, Machine Learning, 24, 1996.
[6] K. Chang, Nonlinear Dimensionality Reduction Using Probabilistic Principal Surfaces, PhD Thesis, Department of Electrical and Computer Engineering, The University of Texas at Austin, USA, 2000.
[7] K. Chang, J. Ghosh, A Unified Model for Probabilistic Principal Surfaces, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 1, 2001.
[8] A.P. Dempster, N.M. Laird, D.B.
Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B, Vol. 39, No. 1, 1977.
[9] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley and Sons, 2001.
[10] T. Kohonen, Self-Organizing Maps, Springer-Verlag, Berlin, 1995.
[11] A. Staiano, Unsupervised Neural Networks for the Extraction of Scientific Information from Astronomical Data, PhD Thesis, University of Salerno, Italy, 2003.
[12] P. Tino, I. Nabney, Hierarchical GTM: constructing localized non-linear projection manifolds in a principled way, IEEE Transactions on Pattern Analysis and Machine Intelligence, in print.
[13] D. Ormoneit, V. Tresp, Averaging, Maximum Likelihood and Bayesian Estimation for Improving Gaussian Mixture Probability Density Estimates, IEEE Transactions on Neural Networks, Vol. 9, No. 4, 1998.
[14] P. Smyth, D.H. Wolpert, An evaluation of linearly combining density estimators via stacking, Machine Learning, Vol. 36, 1999.
[15] D.H. Wolpert, Stacked Generalization, Neural Networks, 5, 1992.


More information

Cluster Analysis: Advanced Concepts

Cluster Analysis: Advanced Concepts Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

More information

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

HT2015: SC4 Statistical Data Mining and Machine Learning

HT2015: SC4 Statistical Data Mining and Machine Learning HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of

More information

Visualization of large data sets using MDS combined with LVQ.

Visualization of large data sets using MDS combined with LVQ. Visualization of large data sets using MDS combined with LVQ. Antoine Naud and Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland. www.phys.uni.torun.pl/kmk

More information

Data Visualization with Simultaneous Feature Selection

Data Visualization with Simultaneous Feature Selection 1 Data Visualization with Simultaneous Feature Selection Dharmesh M. Maniyar and Ian T. Nabney Neural Computing Research Group Aston University, Birmingham. B4 7ET, United Kingdom Email: {maniyard,nabneyit}@aston.ac.uk

More information

Big Data: Rethinking Text Visualization

Big Data: Rethinking Text Visualization Big Data: Rethinking Text Visualization Dr. Anton Heijs anton.heijs@treparel.com Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

More information

Specific Usage of Visual Data Analysis Techniques

Specific Usage of Visual Data Analysis Techniques Specific Usage of Visual Data Analysis Techniques Snezana Savoska 1 and Suzana Loskovska 2 1 Faculty of Administration and Management of Information systems, Partizanska bb, 7000, Bitola, Republic of Macedonia

More information

Mixtures of Robust Probabilistic Principal Component Analyzers

Mixtures of Robust Probabilistic Principal Component Analyzers Mixtures of Robust Probabilistic Principal Component Analyzers Cédric Archambeau, Nicolas Delannay 2 and Michel Verleysen 2 - University College London, Dept. of Computer Science Gower Street, London WCE

More information

270107 - MD - Data Mining

270107 - MD - Data Mining Coordinating unit: Teaching unit: Academic year: Degree: ECTS credits: 015 70 - FIB - Barcelona School of Informatics 715 - EIO - Department of Statistics and Operations Research 73 - CS - Department of

More information

Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Neural Networks Lesson 5 - Cluster Analysis

Neural Networks Lesson 5 - Cluster Analysis Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29

More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

Data Mining and Neural Networks in Stata

Data Mining and Neural Networks in Stata Data Mining and Neural Networks in Stata 2 nd Italian Stata Users Group Meeting Milano, 10 October 2005 Mario Lucchini e Maurizo Pisati Università di Milano-Bicocca mario.lucchini@unimib.it maurizio.pisati@unimib.it

More information

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376 Course Director: Dr. Kayvan Najarian (DCM&B, kayvan@umich.edu) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.

More information

Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement

Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement Toshio Sugihara Abstract In this study, an adaptive

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

Learning Vector Quantization: generalization ability and dynamics of competing prototypes

Learning Vector Quantization: generalization ability and dynamics of competing prototypes Learning Vector Quantization: generalization ability and dynamics of competing prototypes Aree Witoelar 1, Michael Biehl 1, and Barbara Hammer 2 1 University of Groningen, Mathematics and Computing Science

More information

Automated Stellar Classification for Large Surveys with EKF and RBF Neural Networks

Automated Stellar Classification for Large Surveys with EKF and RBF Neural Networks Chin. J. Astron. Astrophys. Vol. 5 (2005), No. 2, 203 210 (http:/www.chjaa.org) Chinese Journal of Astronomy and Astrophysics Automated Stellar Classification for Large Surveys with EKF and RBF Neural

More information

Customer Classification And Prediction Based On Data Mining Technique

Customer Classification And Prediction Based On Data Mining Technique Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining

More information

Filtered Gaussian Processes for Learning with Large Data-Sets

Filtered Gaussian Processes for Learning with Large Data-Sets Filtered Gaussian Processes for Learning with Large Data-Sets Jian Qing Shi, Roderick Murray-Smith 2,3, D. Mike Titterington 4,and Barak A. Pearlmutter 3 School of Mathematics and Statistics, University

More information

Visualization of textual data: unfolding the Kohonen maps.

Visualization of textual data: unfolding the Kohonen maps. Visualization of textual data: unfolding the Kohonen maps. CNRS - GET - ENST 46 rue Barrault, 75013, Paris, France (e-mail: ludovic.lebart@enst.fr) Ludovic Lebart Abstract. The Kohonen self organizing

More information

L25: Ensemble learning

L25: Ensemble learning L25: Ensemble learning Introduction Methods for constructing ensembles Combination strategies Stacked generalization Mixtures of experts Bagging Boosting CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna

More information

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information

Visualization methods for patent data

Visualization methods for patent data Visualization methods for patent data Treparel 2013 Dr. Anton Heijs (CTO & Founder) Delft, The Netherlands Introduction Treparel can provide advanced visualizations for patent data. This document describes

More information

DAME Astrophysical DAta Mining Mining & & Exploration Exploration GRID

DAME Astrophysical DAta Mining Mining & & Exploration Exploration GRID DAME Astrophysical DAta Mining & Exploration on GRID M. Brescia S. G. Djorgovski G. Longo & DAME Working Group Istituto Nazionale di Astrofisica Astronomical Observatory of Capodimonte, Napoli Department

More information

A comparison of various clustering methods and algorithms in data mining

A comparison of various clustering methods and algorithms in data mining Volume :2, Issue :5, 32-36 May 2015 www.allsubjectjournal.com e-issn: 2349-4182 p-issn: 2349-5979 Impact Factor: 3.762 R.Tamilselvi B.Sivasakthi R.Kavitha Assistant Professor A comparison of various clustering

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

Probabilistic Latent Semantic Analysis (plsa)

Probabilistic Latent Semantic Analysis (plsa) Probabilistic Latent Semantic Analysis (plsa) SS 2008 Bayesian Networks Multimedia Computing, Universität Augsburg Rainer.Lienhart@informatik.uni-augsburg.de www.multimedia-computing.{de,org} References

More information

Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

More information

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Mobile Phone APP Software Browsing Behavior using Clustering Analysis Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis

More information

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

More information

6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining 6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

More information

USING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS

USING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS USING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS Koua, E.L. International Institute for Geo-Information Science and Earth Observation (ITC).

More information

Self-Organizing g Maps (SOM) COMP61021 Modelling and Visualization of High Dimensional Data

Self-Organizing g Maps (SOM) COMP61021 Modelling and Visualization of High Dimensional Data Self-Organizing g Maps (SOM) Ke Chen Outline Introduction ti Biological Motivation Kohonen SOM Learning Algorithm Visualization Method Examples Relevant Issues Conclusions 2 Introduction Self-organizing

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

Supervised and unsupervised learning - 1

Supervised and unsupervised learning - 1 Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in

More information

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Machine Learning and Data Mining. Fundamentals, robotics, recognition Machine Learning and Data Mining Fundamentals, robotics, recognition Machine Learning, Data Mining, Knowledge Discovery in Data Bases Their mutual relations Data Mining, Knowledge Discovery in Databases,

More information

Learning outcomes. Knowledge and understanding. Competence and skills

Learning outcomes. Knowledge and understanding. Competence and skills Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges

More information

Data mining and statistical models in marketing campaigns of BT Retail

Data mining and statistical models in marketing campaigns of BT Retail Data mining and statistical models in marketing campaigns of BT Retail Francesco Vivarelli and Martyn Johnson Database Exploitation, Segmentation and Targeting group BT Retail Pp501 Holborn centre 120

More information

Linear Classification. Volker Tresp Summer 2015

Linear Classification. Volker Tresp Summer 2015 Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

More information

A Computational Framework for Exploratory Data Analysis

A Computational Framework for Exploratory Data Analysis A Computational Framework for Exploratory Data Analysis Axel Wismüller Depts. of Radiology and Biomedical Engineering, University of Rochester, New York 601 Elmwood Avenue, Rochester, NY 14642-8648, U.S.A.

More information

UW CSE Technical Report 03-06-01 Probabilistic Bilinear Models for Appearance-Based Vision

UW CSE Technical Report 03-06-01 Probabilistic Bilinear Models for Appearance-Based Vision UW CSE Technical Report 03-06-01 Probabilistic Bilinear Models for Appearance-Based Vision D.B. Grimes A.P. Shon R.P.N. Rao Dept. of Computer Science and Engineering University of Washington Seattle, WA

More information

Tree based ensemble models regularization by convex optimization

Tree based ensemble models regularization by convex optimization Tree based ensemble models regularization by convex optimization Bertrand Cornélusse, Pierre Geurts and Louis Wehenkel Department of Electrical Engineering and Computer Science University of Liège B-4000

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Galaxy Morphological Classification

Galaxy Morphological Classification Galaxy Morphological Classification Jordan Duprey and James Kolano Abstract To solve the issue of galaxy morphological classification according to a classification scheme modelled off of the Hubble Sequence,

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Outline Big Data How to extract information? Data clustering

More information

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

More information

Model-Based Cluster Analysis for Web Users Sessions

Model-Based Cluster Analysis for Web Users Sessions Model-Based Cluster Analysis for Web Users Sessions George Pallis, Lefteris Angelis, and Athena Vakali Department of Informatics, Aristotle University of Thessaloniki, 54124, Thessaloniki, Greece gpallis@ccf.auth.gr

More information

Linear Models for Classification

Linear Models for Classification Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci

More information

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets Macario O. Cordel II and Arnulfo P. Azcarraga College of Computer Studies *Corresponding Author: macario.cordel@dlsu.edu.ph

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

Classification Techniques for Remote Sensing

Classification Techniques for Remote Sensing Classification Techniques for Remote Sensing Selim Aksoy Department of Computer Engineering Bilkent University Bilkent, 06800, Ankara saksoy@cs.bilkent.edu.tr http://www.cs.bilkent.edu.tr/ saksoy/courses/cs551

More information

Visualizing pay-per-view television customers churn using cartograms and flow maps

Visualizing pay-per-view television customers churn using cartograms and flow maps Visualizing pay-per-view television customers churn using cartograms and flow maps David L. García 1 and Àngela Nebot1 and Alfredo Vellido 1 1- Dept. de Llenguatges i Sistemes Informàtics Universitat Politècnica

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

More information