Visualization of large data sets using MDS combined with LVQ.



Similar documents
INTERACTIVE DATA EXPLORATION USING MDS MAPPING

A Computational Framework for Exploratory Data Analysis

ViSOM A Novel Method for Multivariate Data Projection and Structure Visualization

Data topology visualization for the Self-Organizing Map

Self Organizing Maps for Visualization of Categories

A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization

High-dimensional labeled data analysis with Gabriel graphs

Visualization of Breast Cancer Data by SOM Component Planes

Categorical Data Visualization and Clustering Using Subjective Factors

Meta-learning. Synonyms. Definition. Characteristics

Meta-learning via Search Combined with Parameter Optimization.

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

Visualization of Topology Representing Networks

Multiscale Object-Based Classification of Satellite Images Merging Multispectral Information with Panchromatic Textural Features

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

Selection of the Suitable Parameter Value for ISOMAP

CITY UNIVERSITY OF HONG KONG 香 港 城 市 大 學. Self-Organizing Map: Visualization and Data Handling 自 組 織 神 經 網 絡 : 可 視 化 和 數 據 處 理

Visualizing class probability estimators

Data Mining and Neural Networks in Stata

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Reconstructing Self Organizing Maps as Spider Graphs for better visual interpretation of large unstructured datasets

Online data visualization using the neural gas network

On the use of Three-dimensional Self-Organizing Maps for Visualizing Clusters in Geo-referenced Data

A SECURE DECISION SUPPORT ESTIMATION USING GAUSSIAN BAYES CLASSIFICATION IN HEALTH CARE SERVICES

Character Image Patterns as Big Data

The Value of Visualization 2

Comparing large datasets structures through unsupervised learning

A Study of Web Log Analysis Using Clustering Techniques

CURRICULUM VITAE. Norbert Jankowski. Department of Computer Methods. ul. Grudziądzka Toruń Poland

Advanced Web Usage Mining Algorithm using Neural Network and Principal Component Analysis

DATA MINING TECHNIQUES AND APPLICATIONS

HDDVis: An Interactive Tool for High Dimensional Data Visualization

CROP CLASSIFICATION WITH HYPERSPECTRAL DATA OF THE HYMAP SENSOR USING DIFFERENT FEATURE EXTRACTION TECHNIQUES

IMPROVISATION OF STUDYING COMPUTER BY CLUSTER STRATEGIES

Load balancing in a heterogeneous computer system by self-organizing Kohonen network

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

Learning Vector Quantization: generalization ability and dynamics of competing prototypes

PLAANN as a Classification Tool for Customer Intelligence in Banking

Network Intrusion Detection Using an Improved Competitive Learning Neural Network

Data Mining: A Preprocessing Engine

Self Organizing Maps: Fundamentals

COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS

USING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS

Ensembles of Similarity-Based Models.

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

2 Visualized IEC. Shiobaru, Minami-ku, Fukuoka, Japan Tel&Fax: ,

How To Make Visual Analytics With Big Data Visual

Using Smoothed Data Histograms for Cluster Visualization in Self-Organizing Maps

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

An Overview of Knowledge Discovery Database and Data mining Techniques

Effective Data Mining Using Neural Networks

Classification of Engineering Consultancy Firms Using Self-Organizing Maps: A Scientific Approach

Monitoring of Complex Industrial Processes based on Self-Organizing Maps and Watershed Transformations

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

Enhancing Quality of Data using Data Mining Method

Novelty Detection in image recognition using IRF Neural Networks properties

Knowledge Discovery from patents using KMX Text Analytics

Data visualization is a graphical presentation

Quality Assessment in Spatial Clustering of Data Mining

E-commerce Transaction Anomaly Classification

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

SVM Ensemble Model for Investment Prediction

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Impact of Boolean factorization as preprocessing methods for classification of Boolean data

A Discussion on Visual Interactive Data Exploration using Self-Organizing Maps

ADVANCED MACHINE LEARNING. Introduction

SURVIVABILITY ANALYSIS OF PEDIATRIC LEUKAEMIC PATIENTS USING NEURAL NETWORK APPROACH

Analyzing The Role Of Dimension Arrangement For Data Visualization in Radviz

Towards better accuracy for Spam predictions

Exploratory Data Analysis with MATLAB

Decision Tree Learning on Very Large Data Sets

Visibility optimization for data visualization: A Survey of Issues and Techniques

How To Identify Noisy Variables In A Cluster

QoS Mapping of VoIP Communication using Self-Organizing Neural Network

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Standardization and Its Effects on K-Means Clustering Algorithm

An Evaluation of Neural Networks Approaches used for Software Effort Estimation

Models of Cortical Maps II

Data Mining - Evaluation of Classifiers

Network Intrusion Detection Systems

Grid Density Clustering Algorithm

Specific Usage of Visual Data Analysis Techniques

Prediction of Stock Performance Using Analytical Techniques

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

Transcription:

Visualization of large data sets using MDS combined with LVQ. Antoine Naud and Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland. www.phys.uni.torun.pl/kmk Abstract. A common task in data mining is the visualization of multivariate objects using various methods, allowing human observers to perceive subtle inter-relations in the dataset. Multidimensional scaling (MDS) is a well known technique used for this purpose, but it due to its computational complexity there are limitations on the number of objects that can be displayed. Combining MDS with a clustering method as Learning Vector Quantization allows to obtain displays of large databases that preserve both high accuracy of clustering methods and good visualization properties. 1 Introduction Visualization helps in data exploration by providing the user an idea about clusters in data and their mutual relations, shows outliers and allows for rough classification of data. A very popular way to visualize data is based on self-organizing topographic feature maps (SOM) [1], known also as the Kohonen networks. SOM is so popular because the method may be applied to very large databases and provides clusterization as well as visualization at the same time. Unfortunately SOM networks are among the worst classifiers, as found in empirical studies [2]. There are many better clustering methods than SOM, for example methods based on dendrograms. We have compared SOM visualization using simplexes (up to dimension 20), hypercubes and points on the hypersphere, with a method that preserves correct topographical relations among multidimensional objects [3,4]. These results clearly show that minimization of a quantitative measures for the distortions of original data topography, i.e. the multidimensional scaling technique [5,6], generates maps that preserve both local and global features in a better way than SOM does. Unfortunately MDS requires difficult minimization of cost function that is based on distances among the represented multidimensional objects. For N objects N(N 1)/2 distances enter the cost function, and if maps are 2-dimensional they have 2N 3 adaptive parameters (position (x, y) of an arbitrary point and the rotation angle, or y coordinate of a second point, may be fixed). The MDS-based interactive software developed by one of us [7] performs a mapping of multivariate data from a high-dimensional space (D-dimensional space, D 3) to data points in a lower d-dimensional space (d D). Usually d = 2in order to allow visualization by scatterplots, but higher d values may be useful for dimensionality reduction. The MDS dimensionality reduction is such that similar

(or close, in the sense of some distance measure in the D-dimensional space) multivariate objects are mapped on representative points close to each other in the d- dimensional representation space. The mapping should preserve topography of data vectors. Linear projections are often not able to preserve topography; such mappings have to be non-linear. Topography preservation may be unreachable in the whole feature space if the analyzed multivariate data is intrinsically high dimensional and cannot be imbedded in a lower dimensional space without much distortion. Combination of this algorithm with clusterization methods have several advantages. First, hierachical clusterization methods or any other clusterization may be used. Second, information about the cluster centers may be displayed on the map and combined with MDS information, giving more details than dendrograms typically used to display clusters. Third, multi-level mapping, starting from large clusters and creating smaller ones, may help to find a better MDS solution since initially smaller number of points are mapped. Fourth, it may be applied to large datasets, and thus be competitive to SOM, providing good clusterization and visualization. The process that combines clusterization with dimensionality reduction is described below. We have used Learning Vector Quantization (LVQ) for clusterization [1] but any other algorithm can be used. Since it is not clear how accurate are the maps created in this way, we have compared them to the original, fully optimized maps in the third section. A short discussion concludes this paper. 2 Combining MDS with LVQ A direct visualization by MDS of large datasets may be unpracticable due to the limitation on the number of simultaneously mapped objects. Various approaches have been proposed to tackle this problem. Our approach is similar to the so-called frame method [8]. It proceeds as follows: First build (by some heuristics) a smaller dataset (called the basis B), designed as a first approximation of the original large dataset. Then map the Basis using standard MDS. Finally add to the mapped Basis all the remaining cases from the original dataset. Let us denote D the dataset to be visualized and N D the number of cases in this dataset. Let B denote the basis and N B be the size of the basis, N t = N m + N f is the total number of points considered during the mapping, F = {P i,i = 1,N f } the set of known data points and by M = {P i,i = 1,N m } the set of new data points. In Chang and Lee [8] the basis B is build by selecting a subset of points from D, as a result the final display highly depends on which points are selected. We prefer to use a clustering technique on the dataset in order to obtain N B cluster centers that more uniformly represent the data. For the final step, we use relative MDS [13], where each new data is individually added to the mapped Basis using an algorithm that is much less time consuming than the full Stress minimization. The general process proposed for mapping large datasets is shown in figure 1. Relative MDS mapping differs from standard MDS in this respect that during minimization of the topography distortion measure (called Stress") only the points from M set are allowed to move while the points from F set are kept fixed. This is

N cases D Dataset Basis 1 LVQ 2 N B cases F features 3 relative MDS MDS Map F features 2-D Fig. 1. Process used for the visualization of large datasets: 1 cluster the data using LVQ to form the so-called Basis, 2 Map the Basis using standard metric MDS, 3 Add the remaining Dataset cases to the mapped Basis using relative MDS. achieved by modifying the Stress function so that it sums only over the distances that change during mapping, i.e. the distances between the added and the basis points, and the added points interpoint distances. The original Stress function: is redefined as: S(x)= S r (x)= N D w ( ) ij dˆ 2 ij d ij (1) ij N D N B i=1 j=1 w ij ( ˆ d ij d ij ) 2 (2) In relative mapping, the order in which the data are added does not influence the final result because each case is added independently from the other cases. Relative mapping can also be seen as an alternative to methods designed for the purpose of giving generalization capability to Sammon mapping, such as the ANN Sammon mapping [9], Neuroscale [10] or incremental scaling [11]. 3 Visualization of the Satimage dataset It is interesting to see where a new data points falls among known cases, discover the class of its neighbors (classified or labeled), and to get an insight on how a classifier would evaluate this new data. Satimage dataset comes from the Landsat MSS imagery, it consists of four digital images of the same scene in different spectral bands. This dataset is often used to perform comparison of classifiers, and it is part

of the UCI repository [12]. The dataset is made of N D = 4435 cases with 36 numerical features in the range 0 to 255. Each case stands for one pixel in the image, the pixels are attributed to 6 classes standing for different soil types photographed. The visualization of the N D cases has been performed by clustering the data into N B cluster centers. As we are interested in determining the optimal size for the basis, we performed a series of mappings with different values for N B and compared the resulting displays visually and by means of the final Stress. The following values were chosen: N B = 100,300,500,700,1000,1200,1300, this last value is the maximum reachable with 256 MB of RAM. The final Stress values obtained are presented in Fig. 2. As can be expected, the final Stress value decreases when the Basis size increases. The curve should converge towards the lowest Stress obtained when mapping all N D cases in one MDS run. Stress 0.0385 0.038 0.0375 0.037 0.0365 0.036 0.0355 0.035 0.0345 0.034 0.0335 0 200 400 600 800 1000 1200 1400 number of code vectors Fig. 2. Final Stress values obtained after relative mapping of the dataset cases added to a Basis of 100, 300, 500, 1000, 1200 and 1300 LVQ code vectors. The whole mapped Satimage dataset is shown in figure 3. It can be seen that a Basis size of N B = 500 cluster centers is enough to get quite good display because the obtained configuration does not change much for higher sizes. This optimal size should also be observed on the curve of Fig. 2, where the speed" of Stress decrease slows down. 4 Discussion The main advantage of MDS dimensionality reduction is that the topographical distortions induced can evaluated by the value of the Stress function reached during

minimization. In this paper we have shown that the main disadvantage of MDS, its computational complexity, may be removed by mapping smaller number of cluster centers and adding the remining points using relative maping. Interactive zooming on interesting areas of the input space [13] allows for exploration of the neighborhood of the case under inspection. Since there may be some uncertainty in the input one may generate a number (for example 100) of vectors drawn from a Gaussian distribution centered at this vector. The class of these additional points is determined using neural or other classification systems and corresponding points added (using relative mapping) to the map. Visual inspection of such map allows estimating the probability of misclassification, proportional to the number of points from alternative classes. Acknowledgments: Support by the Polish Committee for Scientific Research, grant 8 T11C 006 19, is gratefully acknowledged. References 1. Kohonen T. (1995) Self-Organizing Maps. Springer-Verlag, Heidelberg Berlin 2. D. Michie, D. J. Spiegelhalter, and C. C. Taylor, editors. Machine Learning, Neural and Statistical Classification. Ellis Horwood, New York, 1994. 3. Duch W, Naud A. (1996) Multidimensional scaling and Kohonen s self-organizing maps. In Proc. 2nd Conf. eural Networks and their Applications, Szczyrk, Poland, Vol. I, pp. 138 143 4. Duch W, Naud A. (1996) On global self-organizing maps. In Proc. 4th European Symposium on Artificial Neural Networks, Brugges, pp. 91 96 5. Kruskal J. B. (1964) Non metric multidimensional scaling : a numerical method. Psychometrika 29, 115 129 6. Sammon J. W. (1969) A nonlinear mapping for data analysis. IEEE Transactions on Computers 5, 401 409 7. Naud A. (2001) Neural and statistical methods for the visualization of multidimensional data. PhD Thesis, Dept. of Informatics, Nicholas Copernicus University, Poland. http://www.phys.uni.torun.pl/kmk/publications.html 8. Chang C. L, Lee R. C. T. (1973) A heuristic relaxation method for nonlinear mapping in cluster analysis. IEEE Transactions on Systems, Man, and Cybernetics 3, 197 200 9. Kraaijveld M. A, Mao J, Jain A. K. (1995) A nonlinear projection method based on Kohonen s topology preserving maps. IEEE Transactions on Neural Networks, 6, 548 559. 10. Tipping M. E. (1996) Topographic mappings and feed-forward neural networks. PhD thesis, Aston University, Birmingham, UK 11. Basalaj W. (1999) Incremental multidimensional scaling method for database visualization. In Proceedings of Visual data Exploration and Analysis VI, SPIE, Vol. 3643, 149-158 12. Blake, C.L, Merz, C.J. (1998). UCI Repository of machine learning databases http://www.ics.uci.edu/ mlearn/mlrepository.html. Irvine, CA: University of California, Department of Information and Computer Science. 13. Naud A, Duch W. (2000) Interactive data exploration using MDS mapping. In: 5th Conf. on Neural Networks and Soft Computing, Zakopane, Poland, pp. 255-260

grey soil damp grey soil soil with vegetation stubble very damp grey soil cotton crop red soil (a) 100 code vectors, S = 0.0384 (b) 500 code vectors, S = 0.0358 (c) 1000 code vectors, S = 0.0342 (d) 1300 code vectors, S = 0.0339 Fig. 3. Visualization of Satimage dataset using different numbers of code vectors for LVQ clustering, to which the whole dataset has been added by relative mapping.