Automatic Text Processing: Cross-Lingual Text Categorization



Similar documents
Bayes and Naïve Bayes. cs534-machine Learning

Exploiting Comparable Corpora and Bilingual Dictionaries. the Cross Language Text Categorization

T : Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari :

Social Media Mining. Data Mining Essentials

Data Mining - Evaluation of Classifiers

Sentiment analysis using emoticons

Logistic Regression for Spam Filtering

Machine Learning Final Project Spam Filtering

How To Use Neural Networks In Data Mining

Content-Based Recommendation

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet

CSE 473: Artificial Intelligence Autumn 2010

DISIT Lab, competence and project idea on bigdata. reasoning

1 Maximum likelihood estimation

Machine Learning using MapReduce

Statistical Feature Selection Techniques for Arabic Text Categorization

Web based English-Chinese OOV term translation using Adaptive rules and Recursive feature selection

Active Learning SVM for Blogs recommendation

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

Classification algorithm in Data mining: An Overview

CENG 734 Advanced Topics in Bioinformatics

Spam Detection A Machine Learning Approach

Segmentation and Classification of Online Chats

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Emoticon Smoothed Language Models for Twitter Sentiment Analysis

Machine Learning CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Spam Filtering using Naïve Bayesian Classification

Clustering Technique in Data Mining for Text Documents

Introduction to Machine Learning Using Python. Vikram Kamath

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR

Categorical Data Visualization and Clustering Using Subjective Factors

Towards better accuracy for Spam predictions

Tracking and Recognition in Sports Videos

An Overview of Knowledge Discovery Database and Data mining Techniques

Mining a Corpus of Job Ads

Final Project Report

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Predict Influencers in the Social Network

Linear Threshold Units

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts

CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation.

A Content based Spam Filtering Using Optical Back Propagation Technique

How To Create A Text Classification System For Spam Filtering

Question 2 Naïve Bayes (16 points)

A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Employer Health Insurance Premium Prediction Elliott Lui

Azure Machine Learning, SQL Data Mining and R

Social Business Intelligence Framework. Copyright 2012 Deloitte Development LLC. All rights reserved.

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Research on Sentiment Classification of Chinese Micro Blog Based on

Approaches of Using a Word-Image Ontology and an Annotated Image Corpus as Intermedia for Cross-Language Image Retrieval

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP

Simple Language Models for Spam Detection

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.

How To Solve The Kd Cup 2010 Challenge

UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee

Using News Articles to Predict Stock Price Movements

diagnosis through Random

A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization

Principles of Data Mining by Hand&Mannila&Smyth

Clustering Connectionist and Statistical Language Processing

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.

W6.B.1. FAQs CS535 BIG DATA W6.B If the distance of the point is additionally less than the tight distance T 2, remove it from the original set

Distributed Computing and Big Data: Hadoop and MapReduce

A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS

A Survey on Product Aspect Ranking

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Semantic Sentiment Analysis of Twitter

Experiments in Web Page Classification for Semantic Web

Spam Filtering based on Naive Bayes Classification. Tianhao Sun

LCs for Binary Classification

The Enron Corpus: A New Dataset for Classification Research

Facilitating Business Process Discovery using Analysis

Support Vector Machines with Clustering for Training with Very Large Datasets

Personalized Hierarchical Clustering

Bridging CAQDAS with text mining: Text analyst s toolbox for Big Data: Science in the Media Project

Automated News Item Categorization

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering

TS3: an Improved Version of the Bilingual Concordancer TransSearch

Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre

Predicting Student Performance by Using Data Mining Methods for Classification

Detecting Spam Using Spam Word Associations

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

Machine Learning. CS 188: Artificial Intelligence Naïve Bayes. Example: Digit Recognition. Other Classification Tasks

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

Transcription:

Automatic Text Processing: Cross-Lingual Text Categorization. Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Siena. Dottorato di Ricerca in Ingegneria dell'Informazione, XVII ciclo. Candidate: Leonardo Rigutini. Advisor: Prof. Marco Maggini

Outline. Introduction to Cross-Lingual Text Categorization: relationships with Cross-Lingual Information Retrieval; possible approaches. Text Categorization: multinomial Naive Bayes models; distance distribution and term filtering; learning with labeled and unlabeled data. The algorithm: the basic solution; the modified algorithm. Experimental results and conclusions.

Cross-Lingual Text Categorization. The problem has arisen in recent years due to the large number of documents available in many different languages. Many companies would like to categorize new documents according to an existing class structure without building a separate text management system for each language. CLTC is closely related to Cross-Lingual Information Retrieval (CLIR): many works in the literature deal with CLIR, but very little work addresses CLTC.

Cross-Lingual Information Retrieval. a) Poly-Lingual: the data is composed of documents in different languages, and the dictionary contains terms from the different vocabularies. A large learning set containing sufficient documents for each language is needed, and a single classifier is trained. b) Cross-Lingual: the language is identified and the documents are translated into a different one; a new classifier is trained for each language.

a) Poly-Lingual. Drawbacks: it requires many learning-set documents for each language; high dimensionality of the dictionary (n vocabularies); many terms are shared between languages; feature selection is difficult due to the coexistence of many different languages. Advantages: conceptually simple method; a single classifier is used; quite good performance.

b) Cross-Lingual. Drawbacks: it relies on a translation step, which performs very poorly (Named Entity Recognition issues) and is time consuming; in some approaches experts for each language are needed. Advantages: in general it does not need experts for each language. Three different approaches: 1. training set translation; 2. test set translation; 3. Esperanto.

1. Training set translation. The classifier is trained with documents in language L2 translated from the L1 learning set, where L2 is the language of the unlabeled data. The translated learning set is highly noisy, so the classifier may show poor performance. The system works on documents in language L2. The number of translations is lower than in the test set translation approach. Not much used in CLIR.

2. Test set translation. The model is trained using documents in language L1 without translation, so training uses data not corrupted by noise. The unlabeled documents in language L2 are translated into language L1: the translation step is highly time consuming, performs very poorly, and introduces much noise, so a filtering phase on the test data after translation is needed. The translated documents are categorized by the classifier trained on language L1, with possible inconsistency between training and unlabeled data.

3. Esperanto. All documents in every language are translated into a new universal language, "Esperanto" (LE). The new language should preserve all the semantic features of each language; it is very difficult to design and requires a large amount of knowledge for each language. The system works in this universal language, but it needs translation of both the training set and the test set, which is very time consuming. Rarely used in CLIR.

From CLIR to CLTC. Following CLIR: a) Poly-Lingual approach: n mono-lingual text categorization problems, one for each language; it requires a labeled set for each language, i.e. experts who label the documents in each language. b) Cross-Lingual: 1. test set translation: it requires translating the test set, which is time consuming; 2. Esperanto: very time consuming and requires a large amount of knowledge for each language; 3. training set translation: no proposals using this technique.

CLTC problem formulation. Given a predefined category organization for documents in language L1, the task is to classify documents in language L2 according to that organization without manually labeling data in L2, since that requires experts in that language and is expensive. The Poly-Lingual approach is not usable in this case, since it requires a learning set in the unknown language L2. Even the Esperanto approach is not possible, since it needs knowledge about all the languages. Only the training set and test set translation approaches can be used for this type of problem.


Naive Bayes classifier. The two most successful techniques for text categorization are Naive Bayes and SVM. Naive Bayes: a document $d_i$ is assigned to the class maximizing the posterior probability, $C = \arg\max_r P(C_r \mid d_i)$. Using Bayes' rule, the posterior can be expressed as $P(C_r \mid d_i) = \frac{P(d_i \mid C_r)\,P(C_r)}{P(d_i)}$.

Multinomial Naive Bayes. Since $P(d_i)$ is a common factor, it can be neglected. $P(C_r)$ can be easily estimated from the class distribution of the training set, or otherwise considered constant. The naive assumption is that the presence of each word in a document is an independent event that does not depend on the others. It allows us to write $P(d_i \mid C_r) = \prod_{w_t \in d_i} P(w_t \mid C_r)^{N(w_t, d_i)}$, where $N(w_t, d_i)$ is the number of occurrences of word $w_t$ in document $d_i$.

Multinomial Naive Bayes. Assuming that each document is drawn from a multinomial distribution of words, the probability of $w_t$ in class $C_r$ can be estimated as $P(w_t \mid C_r) = \frac{\sum_{d_i \in C_r} N(w_t, d_i)}{\sum_{w_s} \sum_{d_i \in C_r} N(w_s, d_i)}$. This method is very simple and is one of the most used in text categorization. Despite the strong naive assumption, it yields good performance in most cases.
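As an illustration, the maximum-likelihood estimate above can be sketched in a few lines of Python (a toy sketch, not the thesis implementation; the function name and data layout are invented for this example):

```python
from collections import Counter

def estimate_word_probs(docs, labels):
    """Maximum-likelihood estimate of P(w_t | C_r): occurrences of w_t in the
    documents of class C_r divided by the total word count of C_r."""
    counts = {}
    for tokens, cls in zip(docs, labels):
        counts.setdefault(cls, Counter()).update(tokens)
    return {cls: {w: n / sum(wc.values()) for w, n in wc.items()}
            for cls, wc in counts.items()}
```

Note that a word never observed in a class gets no entry at all, i.e. an implicit zero probability, which is exactly the problem the smoothing techniques below address.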

Smoothing techniques. A typical problem in probabilistic models is zero values: if a feature was never observed during training, its estimated probability is 0; when it is then observed during classification, the 0 value cannot be used, since it makes the likelihood null. The two main methods to avoid zeros are: additive smoothing (add-one, or Laplace), $\hat P(w_t \mid C_j) = \frac{1 + N(w_t, C_j)}{|V| + \sum_s N(w_s, C_j)}$; and Good-Turing smoothing, which estimates the probability of unseen words from the words observed exactly once, $P(w_{(0)}) = \frac{\#w_{(1)}}{\#w_{C_j}}$.
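The add-one estimate can be written directly from the counts (a minimal sketch; `laplace_prob` is a hypothetical helper, not from the slides):

```python
def laplace_prob(word, class_counts, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(w_t | C_j):
    (1 + N(w_t, C_j)) / (|V| + sum_s N(w_s, C_j))."""
    total = sum(class_counts.values())
    return (1 + class_counts.get(word, 0)) / (vocab_size + total)
```

With counts {"buy": 2, "cheap": 1} and a 3-word vocabulary, a word unseen in the class now gets probability 1/6 instead of 0.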

Distance distribution. The distribution of documents in the feature space is uniform and does not form clouds: the distance between two similar documents and the distance between two different documents are very close. This depends on the high number of dimensions and on the high number of non-discriminative words that overwhelm the others in the evaluation of the distances.

Distance distribution [figure]

Information Gain. Term filtering: stopword list, Luhn reduction, information gain. Information gain: $IG(w_i) = \sum_k IG(w_i, C_k)$, with $IG(w_i, C_k) = \sum_{c \in \{C_k, \bar C_k\}} \sum_{w \in \{w_i, \bar w_i\}} P(w, c) \log_2 \frac{P(w, c)}{P(w)\,P(c)}$.
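The per-class information gain above can be computed from document-level presence counts, for example as follows (a sketch assuming $P(w, c)$ is estimated from document frequencies; `info_gain` is an invented name):

```python
from math import log2

def info_gain(docs, labels, word, cls):
    """IG(w, C_k) over the four joint events (word present/absent,
    class equal/not equal to C_k), estimated from document counts."""
    n = len(docs)
    joint = {(True, True): 0, (True, False): 0, (False, True): 0, (False, False): 0}
    for tokens, c in zip(docs, labels):
        joint[(word in tokens, c == cls)] += 1
    ig = 0.0
    for (w_in, c_is), count in joint.items():
        if count == 0:
            continue  # 0 * log(0) is taken as 0
        p_joint = count / n
        p_w = sum(v for k, v in joint.items() if k[0] == w_in) / n
        p_c = sum(v for k, v in joint.items() if k[1] == c_is) / n
        ig += p_joint * log2(p_joint / (p_w * p_c))
    return ig
```

A perfectly discriminative word yields 1 bit on a balanced two-class set, while a word present in every document yields 0, which is why ranking by IG filters out non-discriminative terms.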

Learning from labeled and unlabeled data. This is a recent research area in Automatic Text Processing: building a large labeled dataset is usually time consuming and very expensive. Learning from labeled and unlabeled examples: use a small initial labeled dataset and extract information from a large unlabeled one. The idea is to use the labeled data to initialize a labeling process on the unlabeled data, and then use the newly labeled data to build the classifier.

Learning from labeled and unlabeled data: the EM algorithm. E step: the data are labeled using the current parameter configuration. M step: the model is updated assuming the labels to be correct. The model is initialized using the small labeled dataset.
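The E/M loop described above can be sketched with a multinomial Naive Bayes base model (a hard-EM / self-training variant; all function names and the toy data are illustrative, not from the thesis):

```python
from collections import Counter
from math import log

def nb_train(docs, labels):
    """M step: class priors and per-class word counts from (pseudo-)labeled docs."""
    priors, counts = Counter(labels), {}
    for tokens, cls in zip(docs, labels):
        counts.setdefault(cls, Counter()).update(tokens)
    return priors, counts

def nb_predict(tokens, priors, counts, vocab):
    """E step helper: most probable class under Laplace-smoothed multinomial NB."""
    n_docs = sum(priors.values())
    def score(cls):
        total = sum(counts[cls].values())
        s = log(priors[cls] / n_docs)
        for w in tokens:
            s += log((1 + counts[cls].get(w, 0)) / (len(vocab) + total))
        return s
    return max(priors, key=score)

def em_self_training(labeled, labels, unlabeled, iters=5):
    """Initialize on the small labeled set, then alternate:
    E step: label the unlabeled docs with the current model;
    M step: retrain on labeled + pseudo-labeled data."""
    vocab = {w for d in labeled + unlabeled for w in d}
    model = nb_train(labeled, labels)
    for _ in range(iters):
        pseudo = [nb_predict(d, *model, vocab) for d in unlabeled]  # E step
        model = nb_train(labeled + unlabeled, labels + pseudo)      # M step
    return model, vocab
```

This hard-assignment version replaces the soft posteriors of full EM with the argmax label, which is enough to show how the unlabeled data reshape the model.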


Cross-Lingual Text Categorization. The problem can be stated as follows: we have a small labeled dataset in language L1, we want to categorize a large unlabeled dataset in language L2, and we do not want to use experts for language L2. The idea: translate the training set into language L2, initialize an EM algorithm with these very noisy data, and reinforce the classifier using the unlabeled data in language L2.

Notation. With L1, L2 and L1→2 we indicate languages 1 and 2 and L1 translated into L2. We use these subscripts for the training set Tr, the test set Ts and the classifier C: C1→2 indicates the classifier trained with Tr1→2, that is, the training set Tr1 translated into language L2.

The basic algorithm [diagram]: Tr1 is translated (1→2) into Tr1→2, which starts the EM iterations; at each iteration the M step updates the classifier and the E step labels Ts2, producing the labeling E(t).

The basic algorithm. Once the classifier is trained, it can be used to label a larger dataset. This algorithm can start from a small initial dataset, which is an advantage since our initial dataset is very noisy. Problems concern the data, the translation, and the algorithm.

Data. Temporal dependency: documents on the same topic at different times deal with different themes. Geographical dependency: documents on the same topic in different places deal with different people, facts, etc. The goal is to find discriminative terms for each topic that are independent of time and place.

Translation. The translator performs very poorly, especially when the text is badly written. Named Entity Recognition (NER): some words should not be translated, and different words may refer to the same entity. Word-sense disambiguation is a fundamental problem in translation.

Algorithm. The EM algorithm has some important limitations. The trivial solution is a "good" solution: all documents in a single cluster and all the other clusters empty. It usually tends to form a few large central clusters and many small peripheral ones, depending on the starting point and on the noise added to the clusters at each EM step.

Improved algorithm using IG [diagram]: Tr1→2 is filtered with IG (parameter k1) to start the EM iterations; at each iteration the M step updates the classifier, Ts2 is filtered with IG (parameter k2), and the E step produces the labeling E(t).

The filter k1. Highly selective, since the data consist of translated text and are very noisy. It initializes the EM process by selecting the most informative words in the data [diagram].

The filter k2. It has a regularizing effect on the EM algorithm: it selects the most discriminative words at each EM iteration, so that non-significant words do not influence the updating of the centroids. The parameter should be higher than the previous one, since it works on the original (untranslated) data [diagram].
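Both IG filters reduce every document to its top-scoring terms. A minimal sketch of that step, assuming the IG scores have already been computed (the helper name is invented):

```python
def filter_top_k(docs, ig_scores, k):
    """Keep only the k terms with the highest information-gain score;
    all other words are dropped before the EM step uses the documents."""
    keep = set(sorted(ig_scores, key=ig_scores.get, reverse=True)[:k])
    return [[w for w in doc if w in keep] for doc in docs]
```

In the improved algorithm the same filter is applied twice with different budgets: a small k1 on the noisy translated training set and a larger k2 on the original test data at each iteration.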


Previous works. Nuria et al. used the ILO corpus and two languages (English, Spanish) to test three different approaches to CLTC: poly-lingual, test set translation, and profile-based translation. They used the Winnow (ANN) and Rocchio algorithms and compared the results with the monolingual test. Performance was low: 70%-75%.

Multi-lingual dataset. Very few multi-lingual datasets are available, and none with Italian. We built the dataset by crawling newsgroups, which offer: availability of the same groups in different languages, a large number of available messages, and different levels for each topic.

Multi-lingual dataset composition. Two languages: Italian (LI) and English (LE). Three groups: auto, hardware and sport.

              Auto   Hw     Sports  Total
Tr I (train)  1,000  1,000  1,000    3,000
Tr E (train)  1,000  1,000  1,000    3,000
Ts I (test)   6,988  6,991  6,984   20,963

Multi-lingual dataset, drawbacks: short messages; informal documents (slang terms, badly written words); often transversal topics (advertising, spam, other current topics such as elections); temporal dependency (the same topic at two different moments deals with different problems); geographical dependency (the same topic in two different places deals with different people, facts, etc.).

Monolingual test. No translation: training set Tr I and test set Ts I are both in Italian, classifier C I. Test set Ts I: Auto 6,988, Hw 6,991, Sports 6,984, total 20,963.

           Auto           Hw             Sports         Total
Recall     94.01 ± 1.03%  96.21 ± 0.93%  92.89 ± 1.12%  94.43 ± 0.90%
Precision  93.76 ± 1.09%  93.01 ± 0.45%  96.74 ± 1.24%  94.43 ± 0.90%

Results are averaged over ten-fold cross-validation.

Baseline multilingual test. Translation from English to Italian: Tr E is translated into Tr E→I, which trains the classifier C E→I. Test set Ts I: Auto 6,988, Hw 6,991, Sports 6,984, total 20,963.

           Auto           Hw             Sports         Total
Recall     69.56 ± 5.34%  87.24 ± 2.02%  50.95 ± 6.28%  69.26 ± 4.22%
Precision  66.56 ± 4.76%  63.35 ± 3.72%  88.22 ± 4.36%  69.26 ± 4.22%

Results are averaged over ten-fold cross-validation.

Simple EM algorithm. Translation from English to Italian: Tr E→I starts the EM iterations (E step / M step) on Ts I. Test set Ts I: Auto 6,988, Hw 6,991, Sports 6,984, total 20,963.

           Auto           Hw             Sports         Total
Recall     71.32 ± 1.05%  98.04 ± 1.01%   0.73 ± 0.41%  56.32 ± 1.10%
Precision  51.40 ± 1.00%  61.55 ± 0.98%  65.41 ± 0.05%  56.32 ± 1.10%

Results are averaged over ten-fold cross-validation.

Filtered EM algorithm (k1 = 300, k2 = 1000). Translation from English to Italian: Tr E→I is filtered with IG (k1) to start the EM iterations, and Ts I is filtered with IG (k2) at each E step. Test set Ts I: Auto 6,988, Hw 6,991, Sports 6,984, total 20,963.

           Auto           Hw             Sports         Total
Recall     92.59 ± 1.05%  87.88 ± 0.98%  91.01 ± 1.03%  90.64 ± 0.96%
Precision  87.07 ± 1.02%  92.78 ± 0.88%  92.28 ± 0.90%  90.64 ± 0.96%

Results are averaged over ten-fold cross-validation.

Conclusions. The filtered EM algorithm performs better than the other algorithms in the literature. It does not need an initial labeled dataset in the target language: no other algorithm with this feature has been proposed. It achieves good results starting from a few translated documents, so it does not require much time for translation.