A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization



Ángela Blanco, Universidad Pontificia de Salamanca, Spain (ablancogo@upsa.es)
Manuel Martín-Merino, Universidad Pontificia de Salamanca, Spain (mmartinmac@upsa.es)

A PARTIALLY SUPERVISED METRIC MULTIDIMENSIONAL SCALING ALGORITHM FOR TEXTUAL DATA VISUALIZATION IDA 07

Contents
1. Introduction
2. The Torgerson Multidimensional Scaling Algorithm
3. A Semi-supervised Multidimensional Scaling Algorithm
4. Experimental results
5. Conclusions and future research trends

Introduction (I)

The Torgerson MDS algorithm is a popular visualization technique that helps to discover the underlying structure of high dimensional data. An interesting application is the visualization of the semantic relations among terms or documents in textual databases.

However, the Torgerson MDS algorithm proposed in the literature suffers from low discriminant power due to:
- The unsupervised nature.
- The curse of dimensionality.

Introduction (II)

Several search engines provide a categorization for a subset of documents.

[Diagram: Problem overview. Labels: semantic classes C_1, ..., C_k over the space of documents (R^n); relation f between terms and documents; space of terms (R^d) with terms t_1, ..., t_n, which are usually not categorized; Torgerson MDS map of t_1, ..., t_n.]

Goal: To generate a visual representation of term relationships taking advantage of the document class labels.

Introduction (III)

Our approach:
- Define a semi-supervised similarity between terms that considers the document class labels.
  - It should reflect whether two terms are related to the same semantic topics.
  - It should reflect the semantic proximities between terms.
- Incorporate the semi-supervised similarity into the Torgerson MDS algorithm. This will preserve the nice properties of the optimization problem.

Torgerson MDS Algorithm (I)

The Torgerson MDS algorithm looks for an object configuration in a low dimensional space such that the interpattern distances are approximately preserved.

Properties for text mining problems:
- It is based on an efficient linear algebraic operation (SVD).
- The optimization problem does not have local minima.
- For certain similarities it is equivalent to LSI.
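To make the description above concrete, here is a minimal sketch of classical (Torgerson) MDS, assuming NumPy is available. It uses the standard textbook formulation: double-center the squared distances and keep the leading eigenvectors. The slides mention SVD; this sketch uses a symmetric eigendecomposition instead, which is equivalent for the centered matrix when it is positive semidefinite. The function name `torgerson_mds` and the toy data are illustrative and do not come from the slides.

```python
import numpy as np

def torgerson_mds(dist, n_components=2):
    """Classical (Torgerson) MDS from a symmetric matrix of pairwise distances."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances: B = -1/2 * J D^2 J, with J = I - 11^T / n.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # B is symmetric, so a symmetric eigendecomposition is sufficient.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Guard against slightly negative eigenvalues caused by rounding.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)

# Toy data: the four corners of a unit square.
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = torgerson_mds(dist, n_components=2)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

Because the solution comes from a single eigendecomposition, there is no iterative optimization and therefore no local minima, which matches the second property listed above. The recovered configuration is unique only up to rotation and reflection.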

Torgerson MDS Algorithm (II)

Drawbacks:
- Low discriminant power: Due to the unsupervised nature, different topics in the textual collection overlap significantly in the word map.
- It is affected by the curse of dimensionality.

Semi-supervised MDS algorithm (I)

Goal: To improve the discriminant power of the Torgerson MDS algorithm, which works in the space of terms, by taking into account a classification in the space of documents.

The association between the terms (t_i) and the document class labels (C_k) is evaluated by the Mutual Information I(t_i; C_k).

A supervised measure is defined that becomes large for terms that are correlated with the same categories:

$$
s_1(t_i, t_j) = \frac{\sum_k I(t_i; C_k)\, I(t_j; C_k)}{\sqrt{\sum_k \left(I(t_i; C_k)\right)^2}\, \sqrt{\sum_k \left(I(t_j; C_k)\right)^2}}. \tag{1}
$$
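The supervised measure in Eq. (1) can be sketched as follows, reading it as a cosine-type normalization of the mutual-information profiles of two terms across the classes. The slides do not say how I(t_i; C_k) is estimated, so the sketch takes the profiles as given inputs. The function name, the zero-norm convention, and the toy values are assumptions made for illustration.

```python
import math

def supervised_similarity(mi_i, mi_j):
    """Eq. (1): normalized product of two mutual-information profiles.

    mi_i[k] and mi_j[k] hold I(t_i; C_k) and I(t_j; C_k) for each class C_k.
    """
    dot = sum(a * b for a, b in zip(mi_i, mi_j))
    norm_i = math.sqrt(sum(a * a for a in mi_i))
    norm_j = math.sqrt(sum(b * b for b in mi_j))
    if norm_i == 0.0 or norm_j == 0.0:
        # Assumption: a term with no class association is similar to nothing.
        return 0.0
    return dot / (norm_i * norm_j)

# Toy profiles over three classes (made-up values, not from the slides).
mi_neural = [0.9, 0.1, 0.0]
mi_perceptron = [0.8, 0.2, 0.0]
mi_laser = [0.0, 0.1, 0.9]

same_topic = supervised_similarity(mi_neural, mi_perceptron)
different_topic = supervised_similarity(mi_neural, mi_laser)
```

In this toy case, the two terms associated with the same class score much higher than the pair associated with different classes, which is the behavior the slide describes.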

Semi-supervised MDS Algorithm (II)

The supervised measure reflects only the semantic categories of the textual collection, not the term relationships, which are of interest for visualization purposes.

Therefore, a semi-supervised similarity should be defined that reflects both the semantic categories and the term relationships inside each class:

$$
s(t_i, t_j) = \lambda\, s_{sup}(t_i, t_j) + (1 - \lambda)\, s_{unsup}(t_i, t_j). \tag{2}
$$

λ controls whether the word map better reflects the semantic categories (λ large) or the semantic relations among terms (λ small).
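The convex combination in Eq. (2) can be sketched in a few lines. The example assumes the unsupervised measure is the cosine similarity between term vectors, which is the measure the slides use for comparison in their histograms. The supervised value, the term vectors, and the function names are hypothetical placeholders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two term vectors (assumed unsupervised measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0

def semi_supervised_similarity(s_sup, s_unsup, lam):
    """Eq. (2): blend a supervised and an unsupervised similarity with weight lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * s_sup + (1.0 - lam) * s_unsup

# Toy term vectors over four documents (made-up counts).
term_a = [2, 0, 1, 0]
term_b = [1, 0, 2, 0]

s_unsup = cosine(term_a, term_b)
s_sup = 0.3  # hypothetical supervised similarity for the same pair of terms
blend = semi_supervised_similarity(s_sup, s_unsup, lam=0.5)
```

Setting `lam` to 0 recovers the purely unsupervised similarity and setting it to 1 recovers the purely supervised one, so the parameter moves the word map between the two extremes described above.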

Properties of the Semi-supervised Similarity

[Fig. 1: Cosine similarity histogram. Frequency versus cos(x,y), with cos(x,y) ranging from 0.0 to 1.0.]
[Fig. 2: Semi-supervised similarity histogram. Frequency versus s(x,y), with s(x,y) ranging from 0.0 to 1.0.]

- The histogram is smoother.
- It is more robust to the curse of dimensionality.
- Word maps will better reflect the term relationships.

Working with partially labeled documents

When only a small fraction of the documents is labeled, we proceed as follows:
- Documents are categorized in a semi-supervised way using a Transductive SVM.
- The semi-supervised measures can now be computed in the usual way.
- The Torgerson MDS algorithm is applied to obtain a word map.

Experimental results (I)

The semi-supervised algorithm has been applied to the visualization of the semantic relations among terms.

Evaluation of the visualization algorithms:
- The mapping algorithm is applied to generate the word map.
- A clustering algorithm is run on the map, grouping the terms into 7 groups.
- Finally, the partition induced by the map is compared with the classes induced by the thesaurus.

Experimental results (II)

The agreement between the partition induced by the mapping algorithm and the thesaurus has been evaluated through several objective functions:
- F measure (F).
- Entropy measure (E): Small values suggest little overlap among different topics in the word map.
- Mutual Information (I): Informs particularly about the position of the more specific terms in the word map.
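The slides do not give formulas for these measures, so the following is only a sketch under commonly used definitions, which may differ from the ones the authors applied. The F measure here takes, for each reference class, the best F score over all clusters and averages with weights proportional to class size. The entropy is the size-weighted class entropy of each cluster, normalized by the logarithm of the number of classes. The function names and toy labels are illustrative.

```python
import math
from collections import Counter

def clustering_f_measure(clusters, classes):
    """Class-size-weighted best F score of each reference class over all clusters."""
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for g, n_g in cluster_sizes.items():
            n_cg = joint[(c, g)]
            if n_cg == 0:
                continue
            precision = n_cg / n_g
            recall = n_cg / n_c
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_c / n) * best
    return total

def clustering_entropy(clusters, classes):
    """Cluster-size-weighted class entropy, normalized to [0, 1]."""
    n = len(classes)
    class_labels = set(classes)
    if len(class_labels) < 2:
        return 0.0  # a single class cannot be mixed
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    total = 0.0
    for g, n_g in cluster_sizes.items():
        h = 0.0
        for c in class_labels:
            n_cg = joint[(c, g)]
            if n_cg:
                p = n_cg / n_g
                h -= p * math.log(p)
        total += (n_g / n) * h / math.log(len(class_labels))
    return total

# Toy labels: thesaurus classes versus clusters found on the map.
classes = ["a", "a", "b", "b"]
perfect = [0, 0, 1, 1]
merged = [0, 0, 0, 0]

f_perfect = clustering_f_measure(perfect, classes)
e_perfect = clustering_entropy(perfect, classes)
f_merged = clustering_f_measure(merged, classes)
e_merged = clustering_entropy(merged, classes)
```

Under these definitions a map whose clusters match the thesaurus classes exactly gets the highest F and the lowest entropy, while a map that merges the topics gets a lower F and a higher entropy.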

Experimental results (III)

                                F       E       I
Torgerson MDS                   0.46    0.55    0.17
Least square MDS                0.53    0.52    0.16
Torgerson MDS (Average)         0.69    0.43    0.27
Torgerson MDS (Maximum)         0.77    0.36    0.31
Least square MDS (Average)      0.70    0.42    0.27
Least square MDS (Maximum)      0.76    0.38    0.31

The primary conclusions are the following:
- The semi-supervised techniques significantly reduce the overlap among the different topics in the word map.
- The widely used F measure is significantly improved.
- The maximum semi-supervised measure particularly increases the discriminant power of the word maps.

Experimental results (IV)

[Fig. 1: Word map generated by the semi-supervised MDS algorithm. Axes: x and y. Labeled term regions: Supervised learning; Unsupervised learning; Semiconductor devices and optical cables.]

Conclusions and future research trends

- We have proposed a semi-supervised version of the Torgerson MDS algorithm.
- The new algorithm has been applied to the analysis of the semantic relations among terms in textual databases.
- The experimental results suggest that the proposed algorithm significantly improves the discriminant power of mapping techniques that rely solely on unsupervised measures.
- Future research will focus on the development of new semi-supervised dimension reduction techniques.