A Computational Framework for Exploratory Data Analysis
Axel Wismüller
Depts. of Radiology and Biomedical Engineering, University of Rochester, New York
601 Elmwood Avenue, Rochester, NY, U.S.A.

Abstract. We introduce the Exploration Machine (Exploratory Observation Machine, XOM) as a novel versatile method for the analysis of multidimensional data. XOM systematically inverts structural and functional components of so-called topology-preserving mappings. It offers surprising flexibility by simultaneously contributing to complementary domains of unsupervised learning for exploratory pattern analysis, namely both structure-preserving dimensionality reduction and data clustering. We demonstrate XOM's applicability to synthetic and real-world data.

1 Introduction

To extract useful knowledge from high-dimensional observations, such as those provided by multidimensional imaging data, scientists need to concisely visualize hidden regularities in such data by methods of intelligent data reduction. A classical machine learning approach to this problem is the family of so-called topology-preserving mappings, pioneered by Kohonen's discovery of the Self-Organizing Map (SOM) algorithm almost three decades ago. This approach has found widespread use throughout science and technology for both visualization and partitioning of high-dimensional data. However, a significant drawback of the SOM is that it cannot accurately preserve a cluster structure that may be present in the data, as pointed out in [1]. In this contribution, we propose to systematically reverse the data-processing workflow of topology-preserving mappings in order to alleviate this problem. By simply exchanging their functional and structural components, we obtain the Exploration Machine (Exploratory Observation Machine, XOM) as a novel computational framework for both structure-preserving dimensionality reduction and data clustering.
After introducing the XOM algorithm, we analyze its applicability to both synthetic and real-world data.

2 The Exploration Machine Algorithm

The Exploration Machine (XOM) algorithm can be decomposed into three steps. For simplicity, let us first consider $N$ real-valued input vectors $r_i$ in the observation space $O$, each of dimensionality $D$.

Step 1: Define the topology of the input data in the observation space $O$ by computing distances $d(r_i, r_j)$ between the data vectors $r_i$, $i \in \{1, \ldots, N\}$. This step is omitted if the input data is already given as a set of distances between input data items.

Step 2: Define a hypothesis on the structure of the data in the embedding space $E$, represented by sampling vectors $x_k \in E$, $k \in \{1, \ldots, K\}$, $K \in \mathbb{N}$, and randomly initialize an image vector $w_i \in E$, $i \in \{1, \ldots, N\}$, for each input vector $r_i$. Typical choices for sampling distributions are: for structure-preserving visualization, a uniform distribution (e.g. in a 2D square, as used in Figs. 2 and 3A,B); for data clustering, a mixture of several (e.g. Gaussian) distributions with different centers, e.g. located on the vertices of a regular simplex.

Step 3: Reconstruct the topology induced by the input data in $O$ by moving the image vectors in the embedding space $E$ using the computational scheme of a topology-preserving mapping $T$. The final positions of the image vectors $w_i$ represent the output of the algorithm.

In the third step of the XOM algorithm, the topology-preserving mapping $T$ can be considered a free variable. Besides topographic vector quantizers, e.g. [2], a simple choice for $T$ is Kohonen's self-organizing map algorithm [1], e.g. in its basic incremental version. Here, the image vectors $w_i$ are incrementally updated by a sequential learning procedure. For this purpose, the neighborhood couplings between the input data items are represented by a so-called cooperativity function $\psi$. A typical choice for $\psi$ is a Gaussian

$$\psi(r, r'(x(t)), \sigma(t)) := \exp\left(-\frac{(r - r'(x(t)))^2}{2\sigma(t)^2}\right). \quad (1)$$

In the XOM context, $r'(x(t))$ represents the best-match input data vector. For a randomly selected sampling vector $x(t) \in E$, this best-match input data vector is identified by

$$\|x(t) - w_{r'}\| = \min_{r} \|x(t) - w_{r}\|.$$
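The two typical structure hypotheses of Step 2 can be sketched in NumPy as follows; the function names, the sampling-vector counts, and the mixture width are our illustrative choices, not specifications from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_square_hypothesis(n):
    """Visualization hypothesis: sampling vectors x_k drawn uniformly from the unit square."""
    return rng.uniform(0.0, 1.0, size=(n, 2))

def gaussian_mixture_hypothesis(n, centers, sigma=0.05):
    """Clustering hypothesis: sampling vectors drawn from a mixture of Gaussians
    placed at the given centers, e.g. the vertices of a regular simplex."""
    centers = np.asarray(centers, dtype=float)
    choice = rng.integers(0, len(centers), size=n)  # pick a mixture component per sample
    return centers[choice] + rng.normal(0.0, sigma, size=(n, centers.shape[1]))

# Example hypotheses: a uniform square for visualization, and a two-cluster
# hypothesis with Gaussians near opposite corners of the unit square.
x_vis = uniform_square_hypothesis(1000)
x_clu = gaussian_mixture_hypothesis(1000, centers=[[0.1, 0.1], [0.9, 0.9]])
```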
Once the best-match input data vector $r'(x(t))$ has been identified, the image vectors $w_r$ are updated in a sequential adaptation step according to the learning rule

$$w_r(t+1) = w_r(t) + \epsilon(t)\, \psi(r, r'(x(t)), \sigma(t))\, (x(t) - w_r(t)),$$

where $t$ represents the iteration step, $\epsilon(t)$ a learning parameter, and $\sigma(t)$ a measure for the width of the neighborhood taken into account by the cooperativity function $\psi$. In general, $\sigma(t)$ as well as $\epsilon(t)$ are changed in a systematic manner depending on the number of iterations $t$ by some appropriate annealing scheme, e.g. an exponential decay with $t$, as used in the examples of this paper. The algorithm is terminated once a problem-specific cost criterion is satisfied or a maximum number of iterations has been completed.

Although the above computational scheme formally resembles Kohonen's self-organizing map algorithm, the two approaches describe fundamentally different concepts. Specifically, the meaning of the variables $r$, $w$, and $x$ differs completely in the Exploration Machine: whereas in Kohonen's algorithm the sampling vectors $x$ represent the input data, this role is taken by the vectors $r$ in XOM, i.e. XOM completely inverts the roles of input data and structure hypothesis, given the conventions of topology-preserving mappings as known from the literature. For detailed analyses, extensions, and variants of the Exploration Machine, such as those related to computational complexity, convergence properties, free parameter selection, supervised learning, analysis of non-metric data, geodesic coordinates, out-of-sample extension, growing XOM variants, and constrained incremental learning, we refer to [3].
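The three steps and the incremental update above can be combined into a minimal NumPy sketch. The annealing schedules, all parameter values, and the normalization of input-space distances are our assumptions; the paper leaves these as free choices:

```python
import numpy as np

def xom_embed(R, n_iter=5000, sigma0=0.3, sigma_end=0.01,
              eps0=0.9, eps_end=0.05, seed=0):
    """Incremental XOM with a uniform-square structure hypothesis: a SOM-style
    update in which the roles of input data and sampling vectors are inverted."""
    rng = np.random.default_rng(seed)
    R = np.asarray(R, dtype=float)
    N = len(R)
    # Step 1: topology of the input data = pairwise distances d(r_i, r_j),
    # normalized to [0, 1] here so sigma can be chosen on a fixed scale.
    D = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    D /= D.max()
    # Step 2: random image vectors w_i in the 2D embedding space E.
    W = rng.uniform(0.0, 1.0, size=(N, 2))
    # Step 3: reconstruct the input-space topology in E.
    for t in range(n_iter):
        frac = t / max(n_iter - 1, 1)
        sigma = sigma0 * (sigma_end / sigma0) ** frac   # exponential annealing
        eps = eps0 * (eps_end / eps0) ** frac
        x = rng.uniform(0.0, 1.0, size=2)               # sampling vector x(t)
        best = np.argmin(np.linalg.norm(W - x, axis=1)) # best-match item r'(x(t))
        # Cooperativity psi acts on *input-space* distances to the best match.
        psi = np.exp(-D[best] ** 2 / (2.0 * sigma ** 2))
        W += eps * psi[:, None] * (x - W)               # sequential adaptation
    return W
```

Since each update moves $w_r$ toward $x(t)$ with a coefficient $\epsilon(t)\psi \le 1$, the image vectors remain inside the unit square of the hypothesis; swapping the uniform sampling line for a Gaussian-mixture sampler yields the clustering variant.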
Fig. 1: Hepta data set. For explanation, see text.

Fig. 2: XOM embedding of the Hepta data sets. The figure depicts the best (A) and the worst (B) embedding result obtained by XOM for the 40 data sets constructed according to the specifications explained in Fig. 1.

3 Experiments

Hepta Data Sets. In order to quantitatively evaluate the quality of dimensionality reduction by XOM and to relate its results to other methods known from the literature, we investigated the degree of structure preservation achieved by classical and advanced recent nonlinear embedding methods, namely Principal Component Analysis (PCA), Locally Linear Embedding (LLE) [4], and Isomap [5]. For this purpose, we used 40 data sets similar to a synthetic benchmark data set called Hepta, proposed in [6] for the evaluation of structure preservation. A single realization of a Hepta data set is depicted in Fig. 1. It consists of 2300 points randomly sampled from seven Gaussian distributions, thus forming seven clusters in $\mathbb{R}^3$. The centroids of the six non-central Gaussian distributions lie on the coordinate axes of $\mathbb{R}^3$; the respective clusters consist of 300 data points each. The central Gaussian distribution consists of 500 data points. We created 40 Hepta data sets according to these specifications. To quantitatively evaluate structure preservation, we used Sammon's error function [7]. Our results are summarized in Tab. 1.

Table 1: Comparative evaluation of computation times and structure preservation for nonlinear embedding of 40 Hepta data sets as specified in Fig. 1. The table lists average computation times using an ordinary PC (Intel Pentium 4 CPU, 1.6 GHz, 512 MB RAM). Average and minimum values as well as the standard deviation of Sammon's error E were computed as a measure of structure preservation. The free parameters of all the methods examined in the comparison (except PCA) were optimized to obtain the best results, i.e. to minimize E. Note that XOM yields competitive structure preservation at acceptable computation times.

Method     Comp. Time (s)     E     min(E)     σ(E)
Isomap
LLE
PCA
XOM

On average, XOM outperformed LLE and Isomap with regard to structure preservation, although Isomap yielded better results on a few data sets. Interestingly, we frequently obtained poor results for PCA. This is caused by the spatial symmetry of the data set, which makes the projection axes in PCA very sensitive to noise, i.e. to the random choice of data points sampled from the Gaussian distributions specified in the Hepta construction. As a result, different clusters are frequently projected onto each other in the embedding, i.e. cannot be separated, which impairs structure preservation. For illustration, the best and the worst of the 40 embedding results obtained by XOM are shown in Fig. 2.

We emphasize that the results depicted in Tab. 1 depend on the structure of the data set and do not allow final conclusions to be drawn on the overall performance of the nonlinear embedding algorithms with regard to the general degree of structure preservation or their computational expense. In addition, the choice of other measures for structure preservation may result in different rankings. For example, we conjecture that PCA will be superior in situations where the data is approximately located in a linear subspace of the observation space.
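The Hepta construction specified above can be reproduced approximately as follows; the cluster spread and the distance of the centroids from the origin are our guesses, since the paper fixes only the point counts and the overall layout:

```python
import numpy as np

def make_hepta_like(seed=0, spread=0.2, axis_dist=2.0):
    """Hepta-like data: one central Gaussian (500 points) plus six Gaussians
    (300 points each) whose centroids lie on the coordinate axes of R^3."""
    rng = np.random.default_rng(seed)
    centers = np.vstack([np.zeros(3),
                         axis_dist * np.eye(3),    # +x, +y, +z centroids
                         -axis_dist * np.eye(3)])  # -x, -y, -z centroids
    sizes = [500] + [300] * 6
    points, labels = [], []
    for k, (c, n) in enumerate(zip(centers, sizes)):
        points.append(c + rng.normal(0.0, spread, size=(n, 3)))
        labels.append(np.full(n, k))
    return np.vstack(points), np.concatenate(labels)

# 2300 points in seven clusters, matching the paper's specification.
X, y = make_hepta_like()
```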
LLE and Isomap will perform better in situations where the data is not distributed inhomogeneously in the observation space, i.e. does not exhibit a distinct underlying cluster structure of almost isolated data patches but rather forms a single connected cluster. In such data sets, both LLE and Isomap can accurately reconstruct the data with a smaller number of nearest neighbors, which also reduces their computational expense considerably. However, even taking all these limitations into account, our investigation shows at least that there exist classes of data sets on which XOM yields competitive results in comparison to the methods known from the literature.

Microarray Gene Expression Data. Fig. 3A shows the visualization result obtained by XOM for structure-preserving dimensionality reduction of gene expression profiles related to ribosomal metabolism, as a detailed visualization of a gene subset included in the genome-wide expression data of Eisen et al. [8]. The figure illustrates the exploratory analysis of the 147 genes labeled as clusters 5 (22 genes) and 8 (125 genes) according to the cluster assignment by Eisen et al. [8]. Besides several genes involved in respiration, cluster 5 (blue) contains genes related to mitochondrial ribosomal metabolism, whereas cluster 8 (orange) is dominated by genes encoding ribosomal proteins. In the XOM genome map of Fig. 3A, it is clearly visible at first glance that the data consists of two distinct clusters. Comparison with the known functional annotation of these genes reveals that the map clearly separates expression profiles related to mitochondrial from those related to extramitochondrial ribosomal metabolism. Fig. 3B shows an enlarged section of the map indicated by the small frame in Fig. 3A. Fig. 3C shows a data representation obtained by a SOM on the same data, using a regular grid of neurons. As can be clearly seen in the figure, the SOM cannot achieve a structure-preserving mapping result comparable to that provided by the Exploration Machine in Fig. 3A: although the genes related to mitochondrial and to extramitochondrial ribosomal metabolism are collocated on the map, the distinct cluster structure underlying the data remains invisible if the color coding is omitted. In other words, visualization by the Exploration Machine outperforms the SOM w.r.t. structure preservation in this example. For a quantitative comparison, we computed Sammon's error E [7] as a measure of structure preservation for several other embedding algorithms known from the literature. Relative to XOM (1.00), we obtained 1.11 for Sammon's mapping [7], 1.25 for Locally Linear Embedding (LLE) [4], 1.28 for PCA, 1.52 for Isomap [5], and 4.61 for SOM. Computation times in seconds were 0.72, 8.52, 1.36, 0.03, 0.27, for XOM, Sammon, LLE, PCA, Isomap, SOM, respectively.
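The Sammon's error used in these comparisons can be sketched as follows; this is the standard Sammon stress, and the paper's exact normalization may differ:

```python
import numpy as np

def sammon_error(X, Y, eps=1e-12):
    """Sammon's error between pairwise distances in the observation space (X)
    and in the embedding space (Y); smaller means better structure preservation."""
    def upper_distances(A):
        A = np.asarray(A, dtype=float)
        D = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
        return D[np.triu_indices(len(A), k=1)]  # distinct pairs i < j
    d = upper_distances(X)    # original distances d_ij
    dh = upper_distances(Y)   # embedding distances
    return np.sum((d - dh) ** 2 / (d + eps)) / (d.sum() + eps)

# An identity embedding has zero error; any distortion increases it.
```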
These results show that XOM yields competitive structure preservation at acceptable computation time. Fig. 3D shows how XOM can be used for clustering of this data as well. To this end, the structure hypothesis in Step 2 of the XOM algorithm is changed from a uniform distribution in a unit square, as used in Fig. 3A, to a set of two Gaussians centered at different locations of the exploration space, e.g. at the top and bottom of the embedding space in Fig. 3D. As can be seen, the two clusters are separated completely. Note that the decision to use two clusters, rather than some other number, can conveniently be based on the results of structure-preserving XOM visualization in Fig. 3A, which can thus serve as a useful preprocessing step for clustering. The visualization obtained by the SOM in Fig. 3C, in contrast, is not clearly indicative of the presence of exactly two clusters if the color coding is omitted.

4 Conclusion and Outlook

We have introduced the Exploration Machine as a novel versatile learning method for the analysis of multidimensional data. As can be concluded from our work, XOM yields competitive results when compared to established methods known from the literature. In addition, it is evident that XOM can be directly applied to both nonlinear embedding and clustering of non-metric data, as shown
in [3]. As an outlook, it is important that XOM can be used to systematically reverse other concepts related to topology-preserving learning. Obvious examples are batch and growing XOM variants [3] and the Fuzzy-Labeled XOM with Relevance Learning, which is straightforward to derive as a systematic XOM inversion of [9], thus linking the XOM framework to LVQ. Although future systematic scientific investigation has to further elucidate the properties of our method, we conjecture that XOM can constructively contribute to the field of learning with neural maps.

Fig. 3: Visualization of genome expression profiles related to ribosomal metabolism using the Exploration Machine. (A) Genome map obtained by structure-preserving dimensionality reduction using the Exploration Machine. (B) Enlarged section of (A). (C) Data representation obtained by SOM. (D) XOM clustering result. For explanation, see text.

References

[1] Kohonen, T., Self-Organizing Maps, Springer, Berlin, Heidelberg, New York, 3rd ed. (2001).
[2] Graepel, T., Burger, M., and Obermayer, K., "Self-organizing maps: Generalizations and new optimization techniques," Neurocomputing 21 (1998).
[3] Wismüller, A., Exploratory Morphogenesis (XOM): A Novel Computational Framework for Self-Organization, Ph.D. thesis, Technical University of Munich, Department of Electrical and Computer Engineering (2006).
[4] Roweis, S. and Saul, L., "Nonlinear dimensionality reduction by locally linear embedding," Science 290(5500) (2000).
[5] Tenenbaum, J., de Silva, V., and Langford, J. C., "A global geometric framework for nonlinear dimensionality reduction," Science 290(5500) (2000).
[6] Ultsch, A., "Maps for the visualization of high-dimensional data spaces," in Proc. of the Workshop on Self-Organizing Maps 2003 (WSOM03) (2003).
[7] Sammon, J., "A nonlinear mapping for data structure analysis," IEEE Transactions on Computers C-18 (1969).
[8] Eisen, M., "Cluster analysis and display of genome-wide expression patterns," Proc. Natl. Acad. Sci. USA 95 (1998).
[9] Villmann, T., Seiffert, U., Schleif, F., Brüß, C., Geweniger, T., and Hammer, B., "Fuzzy labeled self-organizing map with label-adjusted prototypes," in ANNPR 2006, LNAI 4087, 46-56, Springer-Verlag, Berlin (2006).
Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland
Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data
Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D.
Data Mining on Social Networks Dionysios Sotiropoulos Ph.D. 1 Contents What are Social Media? Mathematical Representation of Social Networks Fundamental Data Mining Concepts Data Mining Tasks on Digital
Hierarchical Clustering Analysis
Hierarchical Clustering Analysis What is Hierarchical Clustering? Hierarchical clustering is used to group similar objects into clusters. In the beginning, each row and/or column is considered a cluster.
Advanced Ensemble Strategies for Polynomial Models
Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer
Unsupervised and supervised dimension reduction: Algorithms and connections
Unsupervised and supervised dimension reduction: Algorithms and connections Jieping Ye Department of Computer Science and Engineering Evolutionary Functional Genomics Center The Biodesign Institute Arizona
Novelty Detection in image recognition using IRF Neural Networks properties
Novelty Detection in image recognition using IRF Neural Networks properties Philippe Smagghe, Jean-Luc Buessler, Jean-Philippe Urban Université de Haute-Alsace MIPS 4, rue des Frères Lumière, 68093 Mulhouse,
Efficient online learning of a non-negative sparse autoencoder
and Machine Learning. Bruges (Belgium), 28-30 April 2010, d-side publi., ISBN 2-93030-10-2. Efficient online learning of a non-negative sparse autoencoder Andre Lemme, R. Felix Reinhart and Jochen J. Steil
Conclusions and Future Directions
Chapter 9 This chapter summarizes the thesis with discussion of (a) the findings and the contributions to the state-of-the-art in the disciplines covered by this work, and (b) future work, those directions
Chapter ML:XI (continued)
Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained
A Discussion on Visual Interactive Data Exploration using Self-Organizing Maps
A Discussion on Visual Interactive Data Exploration using Self-Organizing Maps Julia Moehrmann 1, Andre Burkovski 1, Evgeny Baranovskiy 2, Geoffrey-Alexeij Heinze 2, Andrej Rapoport 2, and Gunther Heidemann
Tracking Groups of Pedestrians in Video Sequences
Tracking Groups of Pedestrians in Video Sequences Jorge S. Marques Pedro M. Jorge Arnaldo J. Abrantes J. M. Lemos IST / ISR ISEL / IST ISEL INESC-ID / IST Lisbon, Portugal Lisbon, Portugal Lisbon, Portugal
Visualization of General Defined Space Data
International Journal of Computer Graphics & Animation (IJCGA) Vol.3, No.4, October 013 Visualization of General Defined Space Data John R Rankin La Trobe University, Australia Abstract A new algorithm
Statistical Models in Data Mining
Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of
A Performance Comparison of Five Algorithms for Graph Isomorphism
A Performance Comparison of Five Algorithms for Graph Isomorphism P. Foggia, C.Sansone, M. Vento Dipartimento di Informatica e Sistemistica Via Claudio, 21 - I 80125 - Napoli, Italy {foggiapa, carlosan,
Bag of Pursuits and Neural Gas for Improved Sparse Coding
Bag of Pursuits and Neural Gas for Improved Sparse Coding Kai Labusch, Erhardt Barth, and Thomas Martinetz University of Lübec Institute for Neuro- and Bioinformatics Ratzeburger Allee 6 23562 Lübec, Germany
Lecture 4 Online and streaming algorithms for clustering
CSE 291: Geometric algorithms Spring 2013 Lecture 4 Online and streaming algorithms for clustering 4.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
A Review of Data Mining Techniques
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,
STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
