A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization

Ángela Blanco, Universidad Pontificia de Salamanca, Spain (ablancogo@upsa.es)
Manuel Martín-Merino, Universidad Pontificia de Salamanca, Spain (mmartinmac@upsa.es)

A PARTIALLY SUPERVISED METRIC MULTIDIMENSIONAL SCALING ALGORITHM FOR TEXTUAL DATA VISUALIZATION IDA 07

Contents
1. Introduction
2. The Torgerson Multidimensional Scaling Algorithm
3. A Semi-supervised Multidimensional Scaling Algorithm
4. Experimental results
5. Conclusions and future research trends

Introduction (I)
The Torgerson MDS algorithm is a popular visualization technique that helps to discover the underlying structure of high-dimensional data. An interesting application is the visualization of the semantic relations among terms or documents in textual databases.
However, the Torgerson MDS algorithm proposed in the literature suffers from low discriminant power due to:
- Its unsupervised nature.
- The curse of dimensionality.

Introduction (II)
Several search engines provide a categorization for a subset of documents.
Problem overview (diagram): the space of documents (R^n) is partitioned into semantic classes C_1, C_2, ..., C_k. The terms t_1, ..., t_n live in the space of terms (R^d) and are usually not categorized. The two spaces are linked by the relation f between terms and documents, and the Torgerson MDS map is built from the space of terms.
Goal: To generate a visual representation of term relationships that takes advantage of the document class labels.

Introduction (III)
Our approach:
- Define a semi-supervised similarity between terms that takes the document class labels into account.
  - It should reflect whether two terms are related to the same semantic topics.
  - It should reflect the semantic proximities between terms.
- Incorporate the semi-supervised similarity into the Torgerson MDS algorithm. This preserves the desirable properties of the optimization problem.

Torgerson MDS Algorithm (I)
The Torgerson MDS algorithm looks for an object configuration in a low-dimensional space such that the inter-pattern distances are approximately preserved.
Properties for text mining problems:
- It is based on an efficient linear algebraic operation (SVD).
- The optimization problem does not have local minima.
- For certain similarities it is equivalent to LSI.
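The idea of preserving pairwise distances through a single linear algebraic step can be illustrated with a minimal sketch of classical (Torgerson) MDS. This is not the authors' implementation: the function name `torgerson_mds`, the use of NumPy, and the toy data are assumptions made for illustration, and a symmetric eigendecomposition of the double-centered matrix stands in for the SVD mentioned on the slide.

```python
import numpy as np


def torgerson_mds(dist, n_components=2):
    """Classical (Torgerson) MDS on a symmetric matrix of pairwise distances."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances: B = -1/2 * J * D^2 * J
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip negative eigenvalues, which can appear for non-Euclidean dissimilarities
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Toy data (hypothetical): the four corners of a unit square
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = torgerson_mds(dist)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

Because the toy points are already two-dimensional, the distances in `recovered` match those in `dist` up to rounding. The solution comes from one eigendecomposition rather than an iterative search, which is why no local minima arise.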

Torgerson MDS Algorithm (II)
Drawbacks:
- Low discriminant power: due to the unsupervised nature, different topics in the textual collection overlap significantly in the word map.
- It is affected by the curse of dimensionality.

Semi-supervised MDS algorithm (I)
Goal: To improve the discriminant power of the Torgerson MDS algorithm, which works in the space of terms, by taking into account a classification defined in the space of documents.
- The association between the terms (t_i) and the document class labels (C_k) is evaluated by the Mutual Information I(t_i; C_k).
- A supervised measure is defined that becomes large for terms that are correlated with the same categories:

$$ s_1(t_i, t_j) = \frac{\sum_k I(t_i; C_k)\, I(t_j; C_k)}{\sqrt{\sum_k \left(I(t_i; C_k)\right)^2}\,\sqrt{\sum_k \left(I(t_j; C_k)\right)^2}} \tag{1} $$
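A small sketch may help to show how the supervised measure behaves. It assumes that Eq. (1) is the cosine between the mutual-information profiles of two terms (the square roots in the denominator are a reconstruction), and that the values I(t_i; C_k) have already been estimated. The function name and the example profiles are hypothetical.

```python
import math


def supervised_similarity(mi_i, mi_j):
    """Cosine between two mutual-information profiles (assumed form of Eq. 1).

    mi_i[k] and mi_j[k] hold I(t_i; C_k) and I(t_j; C_k) for each class C_k.
    """
    dot = sum(a * b for a, b in zip(mi_i, mi_j))
    norm_i = math.sqrt(sum(a * a for a in mi_i))
    norm_j = math.sqrt(sum(b * b for b in mi_j))
    if norm_i == 0.0 or norm_j == 0.0:
        # A term carrying no information about any class is treated as unrelated
        return 0.0
    return dot / (norm_i * norm_j)


# Hypothetical profiles over three classes
same_topic = supervised_similarity([0.30, 0.00, 0.00], [0.60, 0.00, 0.00])
different_topic = supervised_similarity([0.30, 0.00, 0.00], [0.00, 0.50, 0.00])
```

In this example `same_topic` is close to 1 because both terms are informative about the same class, while `different_topic` is 0 because their profiles do not overlap.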

Semi-supervised MDS Algorithm (II)
The supervised measure reflects only the semantic categories of the textual collection, not the term relationships, which are what matters for visualization purposes.
Therefore, a semi-supervised similarity should be defined that reflects both the semantic categories and the term relationships inside each class:

$$ s(t_i, t_j) = \lambda\, s_{sup}(t_i, t_j) + (1 - \lambda)\, s_{unsup}(t_i, t_j) \tag{2} $$

λ controls whether the word map better reflects the semantic categories (λ large) or the semantic relations among terms (λ small).
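The role of λ in Eq. (2) can be seen in a short sketch. The function name and the two input similarities are hypothetical, and restricting λ to the interval [0, 1] is an assumption suggested by the convex form of the combination.

```python
def semi_supervised_similarity(s_sup, s_unsup, lam):
    """Blend a supervised and an unsupervised similarity as in Eq. (2)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [0, 1]")
    return lam * s_sup + (1.0 - lam) * s_unsup


# Hypothetical pair of terms: same category (high s_sup) but rarely
# co-occurring in the same documents (low s_unsup)
s_sup, s_unsup = 0.9, 0.1
sweep = [semi_supervised_similarity(s_sup, s_unsup, lam) for lam in (0.0, 0.5, 1.0)]
```

With λ = 0 the blend reduces to the unsupervised similarity, with λ = 1 to the supervised one, and intermediate values move the pair closer together on the map as more weight is given to the class labels.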

Properties of the Semi-supervised Similarity
Fig. 1: Cosine similarity histogram (x-axis: cos(x,y); y-axis: frequency).
Fig. 2: Semi-supervised similarity histogram (x-axis: s(x,y); y-axis: frequency).
- The histogram is smoother.
- It is more robust to the curse of dimensionality.
- Word maps will reflect the term relationships better.

Working with partially labeled documents
When only a small fraction of documents are labeled, we proceed as follows:
- Documents are categorized in a semi-supervised way using a Transductive SVM.
- The semi-supervised measures can now be computed in the usual way.
- The Torgerson MDS algorithm is applied to obtain a word map.

Experimental results (I)
The semi-supervised algorithm has been applied to the visualization of the semantic relations among terms.
Evaluation of the visualization algorithms:
- The mapping algorithm is applied to generate the word map.
- A clustering algorithm is run on the map, grouping the terms into 7 groups.
- Finally, the partition induced by the map is compared with the classes induced by the thesaurus.

Experimental results (II)
The agreement between the partition induced by the mapping algorithm and the thesaurus has been evaluated through several objective functions:
- F measure (F).
- Entropy measure (E): small values suggest little overlap among different topics in the word map.
- Mutual Information (I): particularly informative about the position of the more specific terms in the word map.
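The slides do not give formulas for these measures, so the sketch below uses commonly used definitions as an assumption: a class-weighted best-match F measure, and a cluster-size-weighted entropy normalized by the logarithm of the number of classes. The function names and toy labels are hypothetical, and the exact variants used in the experiments may differ.

```python
import math
from collections import Counter


def clustering_f_measure(classes, clusters):
    """Class-weighted best-match F measure (assumed definition)."""
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = joint.get((c, k), 0)
            if n_ck == 0:
                continue
            precision = n_ck / n_k
            recall = n_ck / n_c
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_c / n) * best
    return total


def clustering_entropy(classes, clusters):
    """Cluster-size-weighted class entropy, normalized to [0, 1] (assumed definition)."""
    n = len(classes)
    n_classes = len(set(classes))
    if n_classes < 2:
        return 0.0
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    total = 0.0
    for (_, k), n_ck in joint.items():
        p = n_ck / cluster_sizes[k]
        total -= (cluster_sizes[k] / n) * p * math.log(p)
    return total / math.log(n_classes)


# Toy labels (hypothetical): thesaurus classes versus clusters found on the map
thesaurus = ["a", "a", "b", "b"]
separated = [0, 0, 1, 1]   # each cluster holds exactly one class
mixed = [0, 1, 0, 1]       # each cluster mixes both classes equally
```

Under these definitions, the `separated` partition gives an F measure of 1 and an entropy of 0, while the `mixed` partition gives an F measure of 0.5 and an entropy of 1, matching the reading that small entropy values indicate little topic overlap.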

Experimental results (III)

Method                         F      E      I
Torgerson MDS                  0.46   0.55   0.17
Least squares MDS              0.53   0.52   0.16
Torgerson MDS (Average)        0.69   0.43   0.27
Torgerson MDS (Maximum)        0.77   0.36   0.31
Least squares MDS (Average)    0.70   0.42   0.27
Least squares MDS (Maximum)    0.76   0.38   0.31

Rows marked (Average) and (Maximum) use the corresponding semi-supervised measure.

The primary conclusions are the following:
- The semi-supervised techniques significantly reduce the overlap among the different topics in the word map (for Torgerson MDS, E falls from 0.55 to 0.43 and 0.36).
- The widely used F measure is significantly improved (for Torgerson MDS, from 0.46 to 0.69 and 0.77).
- The maximum semi-supervised measure in particular increases the discriminant power of the word maps.

Experimental results (IV)
Fig. 1: Word map generated by the semi-supervised MDS algorithm (axes: x, y). Labeled regions: Supervised learning; Unsupervised learning; Semiconductor devices and optical cables.

Conclusions and future research trends
- We have proposed a semi-supervised version of the Torgerson MDS algorithm.
- The new algorithm has been applied to the analysis of the semantic relations among terms in textual databases.
- The experimental results suggest that the proposed algorithm significantly improves on the discriminant power of mapping techniques that rely solely on unsupervised measures.
- Future research will focus on the development of new semi-supervised dimension reduction techniques.