A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization


Ángela Blanco, Universidad Pontificia de Salamanca, Spain (ablancogo@upsa.)
Manuel Martín-Merino, Universidad Pontificia de Salamanca, Spain

A PARTIALLY SUPERVISED METRIC MULTIDIMENSIONAL SCALING ALGORITHM FOR TEXTUAL DATA VISUALIZATION, IDA 07

Contents

1. Introduction
2. The Torgerson Multidimensional Scaling Algorithm
3. A Semi-supervised Multidimensional Scaling Algorithm
4. Experimental results
5. Conclusions and future research trends


Introduction (I)

The Torgerson MDS algorithm is a popular visualization technique that helps to discover the underlying structure of high dimensional data. An interesting application is the visualization of the semantic relations among terms or documents in textual databases.

However, the Torgerson MDS algorithm proposed in the literature suffers from a low discriminant power due to:
- Its unsupervised nature.
- The curse of dimensionality.


Introduction (II)

Several search engines provide a categorization for a subset of documents.

Problem overview (diagram): the documents live in the space of documents (R^n) and are grouped into semantic classes C_1, C_2, ..., C_k. The terms t_1, ..., t_n live in the space of terms (R^d) and are usually not categorized. A mapping f expresses the relation between terms and documents, and the Torgerson MDS map is built from the terms.

Goal: To generate a visual representation of term relationships taking advantage of the document class labels.


Introduction (III)

Our approach:
- Define a semi-supervised similarity between terms that considers the document class labels. It should reflect whether two terms are related to the same semantic topics, and it should reflect the semantic proximities between terms.
- Incorporate the semi-supervised similarity into the Torgerson MDS algorithm. This will preserve the nice properties of the optimization problem.


Torgerson MDS Algorithm (I)

The Torgerson MDS algorithm looks for an object configuration in a low dimensional space such that the interpattern distances are approximately preserved.

Properties for text mining problems:
- It is based on an efficient linear algebraic operation (SVD).
- The optimization problem does not have local minima.
- For certain similarities it is equivalent to LSI.
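
To make the procedure concrete, here is a minimal sketch of classical (Torgerson) MDS in Python with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function name `torgerson_mds` and the toy data are hypothetical, and the sketch uses a symmetric eigendecomposition of the double-centred matrix, which for a symmetric matrix plays the role of the SVD mentioned above.

```python
import numpy as np

def torgerson_mds(dissimilarities, n_components=2):
    """Classical (Torgerson) MDS from a symmetric dissimilarity matrix."""
    d = np.asarray(dissimilarities, dtype=float)
    n = d.shape[0]
    # Double-centre the squared dissimilarities: B = -1/2 * J * D^2 * J.
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # B is symmetric, so eigh returns real eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(b)
    top = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues (non-Euclidean input) are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

# Toy check (hypothetical data): the corners of a unit square are already
# two-dimensional, so a 2-D map should reproduce their distances.
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords = torgerson_mds(dist)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.allclose(recovered, dist))
```

The solution comes from a single eigendecomposition rather than an iterative search, which is why the optimization has no local minima. The recovered coordinates are only defined up to rotation and reflection, so the check compares distances rather than positions.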


Torgerson MDS Algorithm (II)

Drawbacks:
- Low discriminant power: due to the unsupervised nature, different topics in the textual collection overlap significantly in the word map.
- It is affected by the curse of dimensionality.


Semi-supervised MDS algorithm (I)

Goal: To improve the discriminant power of the Torgerson MDS algorithm, which works in the space of terms, by taking into account a classification in the space of documents.

The association between the terms (t_i) and the document class labels (C_k) is evaluated by the mutual information I(t_i; C_k).

A supervised measure is defined that becomes large for terms that are correlated with the same categories:

$$ s_1(t_i, t_j) = \frac{\sum_k I(t_i; C_k)\, I(t_j; C_k)}{\sqrt{\sum_k \left(I(t_i; C_k)\right)^2}\; \sqrt{\sum_k \left(I(t_j; C_k)\right)^2}}. \qquad (1) $$
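
The supervised measure of Eq. (1) can be sketched as follows. The slides do not say how I(t_i; C_k) is estimated, so this sketch assumes a binary term-by-document occurrence matrix and computes the mutual information between the indicators "term occurs in the document" and "document belongs to class C_k". The function names and the toy data are hypothetical.

```python
import numpy as np

def mutual_information(term_present, in_class):
    """MI (in nats) between two binary indicator vectors over documents."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((term_present == a) & (in_class == b))
            p_a = np.mean(term_present == a)
            p_b = np.mean(in_class == b)
            if p_ab > 0:  # empty cells contribute nothing
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def supervised_similarity(occurrence, labels):
    """Cosine between the rows of the term-by-class MI matrix, as in Eq. (1)."""
    occurrence = np.asarray(occurrence)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    mi = np.array([[mutual_information(row, (labels == c).astype(int))
                    for c in classes] for row in occurrence])
    norms = np.linalg.norm(mi, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # terms carrying no class information get similarity 0
    unit = mi / norms
    return unit @ unit.T

# Toy data (hypothetical): 3 terms, 6 documents, 3 classes.
labels = [0, 0, 1, 1, 2, 2]
occurrence = [
    [1, 1, 0, 0, 0, 0],  # term 0: appears only in class 0
    [1, 0, 0, 0, 0, 0],  # term 1: also tied to class 0
    [0, 0, 0, 0, 1, 1],  # term 2: appears only in class 2
]
s1 = supervised_similarity(occurrence, labels)
print(np.round(s1, 3))
```

In this toy example terms 0 and 1 are associated with the same class, so their similarity comes out higher than the similarity between terms 0 and 2.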


Semi-supervised MDS Algorithm (II)

The supervised measure reflects only the semantic categories of the textual collection, not the term relationships, which are also of interest for visualization purposes. Therefore, a semi-supervised similarity should be defined that reflects both the semantic categories and the term relationships inside each class:

$$ s(t_i, t_j) = \lambda\, s_{sup}(t_i, t_j) + (1 - \lambda)\, s_{unsup}(t_i, t_j). \qquad (2) $$

λ controls whether the word map better reflects the semantic categories (λ large) or the semantic relations among terms (λ small).
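
As an illustration of Eq. (2), the sketch below blends a supervised similarity matrix with the cosine similarity as the unsupervised term. The function names, the toy matrices and the final conversion `1 - s` from similarity to dissimilarity are assumptions made for the example. The slides do not state which conversion the authors use.

```python
import numpy as np

def cosine_similarity(term_vectors):
    """Unsupervised cosine similarity between the rows of a term matrix."""
    x = np.asarray(term_vectors, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows get similarity 0 with everything
    unit = x / norms
    return unit @ unit.T

def semi_supervised_similarity(s_sup, s_unsup, lam):
    """Convex combination of Eq. (2); lam must lie in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")
    s_sup = np.asarray(s_sup, dtype=float)
    s_unsup = np.asarray(s_unsup, dtype=float)
    return lam * s_sup + (1.0 - lam) * s_unsup

# Toy inputs (hypothetical values).
s_unsup = cosine_similarity([[1, 1, 0], [1, 0, 0], [0, 0, 1]])
s_sup = np.array([[1.0, 0.9, 0.1],
                  [0.9, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
s = semi_supervised_similarity(s_sup, s_unsup, lam=0.5)

# Assumption: similarities in [0, 1] are turned into dissimilarities as 1 - s
# before being handed to an MDS routine.
d = 1.0 - s
print(np.round(d, 3))
```

Setting `lam=1.0` returns the supervised matrix unchanged and `lam=0.0` returns the unsupervised one, which matches the role of λ described above.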


Properties of the Semi-supervised Similarity

Fig. 1: Cosine similarity histogram (x axis: cos(x,y); y axis: frequency).
Fig. 2: Semi-supervised similarity histogram (x axis: s(x,y); y axis: frequency).

- The histogram is smoother.
- It is more robust to the curse of dimensionality.
- Word maps will reflect better the term relationships.


Working with partially labeled documents

When only a small fraction of documents are labeled, we proceed as follows:
1. Documents are categorized in a semi-supervised way using Transductive SVM.
2. The semi-supervised measures can now be computed in the usual way.
3. The Torgerson MDS algorithm is applied to obtain a word map.


Experimental results (I)

The semi-supervised algorithm has been applied to the visualization of the semantic relations among terms.

Evaluation of the visualization algorithms:
1. The mapping algorithm is applied to generate the word map.
2. A clustering algorithm is run on the map, grouping the terms into 7 groups.
3. Finally, the partition induced by the map is compared with the classes induced by the thesaurus.


Experimental results (II)

The agreement between the partition induced by the mapping algorithm and the thesaurus has been evaluated through several objective functions:
- F measure (F).
- Entropy measure (E): small values suggest little overlapping among different topics in the word map.
- Mutual Information (I): informs particularly about the position of the more specific terms in the word map.
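
The slides do not give formulas for these measures. As a hedged illustration, the sketch below uses definitions that are common in the clustering literature: a class-weighted best-match F measure, and a cluster-size-weighted entropy normalized by the logarithm of the number of classes. Both definitions and the function names are assumptions, and the authors may have used different variants.

```python
from collections import Counter
from math import log

def f_measure(classes, clusters):
    """Class-weighted F measure: each class is matched to its best cluster."""
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = joint[(c, k)]
            if n_ck:
                precision, recall = n_ck / n_k, n_ck / n_c
                best = max(best, 2 * precision * recall / (precision + recall))
        total += n_c / n * best
    return total

def entropy(classes, clusters):
    """Size-weighted cluster entropy, normalized to [0, 1]; lower is better."""
    n = len(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    n_classes = len(set(classes))
    total = 0.0
    for (c, k), n_ck in joint.items():
        p = n_ck / cluster_sizes[k]  # share of class c inside cluster k
        total -= cluster_sizes[k] / n * p * log(p)
    return total / log(n_classes) if n_classes > 1 else 0.0

# Toy partitions (hypothetical): thesaurus classes versus map clusters.
thesaurus = ["a", "a", "b", "b"]
perfect = [0, 0, 1, 1]
mixed = [0, 1, 0, 1]
print(f_measure(thesaurus, perfect), entropy(thesaurus, perfect))
print(f_measure(thesaurus, mixed), entropy(thesaurus, mixed))
```

With these definitions a map whose clusters coincide with the thesaurus classes scores the highest F and zero entropy, while a map that mixes topics within each cluster scores a lower F and a higher entropy.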


Experimental results (III)

Table: F measure (F), entropy (E) and mutual information (I) for Torgerson MDS, Least square MDS, Torgerson MDS (Average), Torgerson MDS (Maximum), Least square MDS (Average) and Least square MDS (Maximum). Surviving row: Least square MDS (Maximum): F = 0.76, E = 0.38, I = 0.

The primary conclusions are the following:
- The semi-supervised techniques significantly reduce the overlapping among the different topics in the word map.
- The widely used F measure is significantly improved.
- The maximum semi-supervised measure particularly increases the discriminant power of the word maps.

Experimental results (IV)

Fig. 1: Word map generated by the semi-supervised MDS algorithm (axes x and y). The map shows three labeled regions: Supervised learning, Unsupervised learning, and Semiconductor devices and optical cables.


Conclusions and future research trends

- We have proposed a semi-supervised version of the Torgerson MDS algorithm.
- The new algorithm has been applied to the analysis of the semantic relations among terms in textual databases.
- The experimental results suggest that the proposed algorithm significantly improves the discriminant power of mapping techniques that rely solely on unsupervised measures.
- Future research will focus on the development of new semi-supervised dimension reduction techniques.
