Automatic Text Processing: Cross-Lingual Text Categorization


1 Automatic Text Processing: Cross-Lingual Text Categorization
Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Siena
Dottorato di Ricerca in Ingegneria dell'Informazione, XVII ciclo
Candidate: Leonardo Rigutini
Advisor: Prof. Marco Maggini

2 Outline
- Introduction to Cross-Lingual Text Categorization: relationships with Cross-Lingual Information Retrieval; possible approaches
- Text Categorization: multinomial Naive Bayes models; distance distribution and term filtering
- Learning with labeled and unlabeled data
- The algorithm: the basic solution; the modified algorithm
- Experimental results and conclusions

3 Cross-Lingual Text Categorization
- The problem has arisen in recent years due to the large number of documents available in many different languages
- Many companies would like to categorize new documents according to an existing class structure without building a separate text management system for each language
- CLTC is closely related to Cross-Lingual Information Retrieval (CLIR): many works in the literature deal with CLIR, but very little work addresses CLTC

4 Cross-Lingual Information Retrieval
a) Poly-Lingual:
- Data consist of documents in different languages; the dictionary contains terms from the different vocabularies
- A large learning set containing sufficient documents for each language is needed
- A single classifier is trained
b) Cross-Lingual:
- The language of a document is identified and the document is translated into a different language
- A new classifier is trained for each language

5 a) Poly-Lingual
Drawbacks:
- Requires many learning-set documents for each language
- High dimensionality of the dictionary (n vocabularies); many terms are shared between two languages
- Feature selection is difficult due to the coexistence of many different languages
Advantages:
- Conceptually simple method
- A single classifier is used
- Quite good performance

6 b) Cross-Lingual
Drawbacks:
- Requires a translation step: very low quality, problems with Named Entity Recognition (NER), time consuming
- In some approaches experts for each language are needed
Advantages:
- It does not need experts for each language
Three different approaches:
1. Training set translation
2. Test set translation
3. Esperanto

7 1. Training set translation
- The classifier is trained with documents in language L2 translated from the L1 learning set; L2 is the language of the unlabeled data
- The learning set is highly noisy and the classifier can show poor performance
- The system works on documents in language L2
- The number of translations is lower than in the test set translation approach
- Not much used in CLIR

8 2. Test set translation
- The model is trained using documents in language L1 without translation: training uses data not corrupted by noise
- The unlabeled documents in language L2 are translated into language L1: the translation step is highly time consuming, has very low quality and introduces much noise; a filtering phase on the test data after translation is needed
- The translated documents are categorized by the classifier trained on language L1: possible inconsistency between training and unlabeled data

9 3. Esperanto
- All documents in every language are translated into a new universal language, Esperanto (LE)
- The new language should preserve all the semantic features of each language: very difficult to design, and a large amount of knowledge about each language is needed
- The system works in this new universal language: it needs the translation of both the training set and the test set, which is very time consuming
- Rarely used in CLIR

10 From CLIR to CLTC
Following the CLIR taxonomy:
a) Poly-Lingual approach: n mono-lingual text categorization problems, one per language; it requires a labeled set for each language, i.e. experts that label the documents in each language
b) Cross-Lingual:
1. Test set translation: it requires the test set translation (time consuming)
2. Esperanto: very time consuming and requires a large amount of knowledge about each language
3. Training set translation: no proposals using this technique

11 CLTC problem formulation
- Given a predefined category organization for documents in language L1, the task is to classify documents in language L2 according to that organization, without manually labeling data in L2, since that requires experts in that language and is expensive
- The Poly-Lingual approach is not usable in this case, since it requires a learning set in the unknown language L2
- Even the Esperanto approach is not possible, since it needs knowledge about all the languages
- Only the training set translation approach can be used in this type of problem

12 Outline
- Introduction to Cross-Lingual Text Categorization: relationships with Cross-Lingual Information Retrieval; possible approaches
- Text Categorization: multinomial Naive Bayes models; distance distribution and term filtering
- Learning with labeled and unlabeled data
- The algorithm: the basic solution; the modified algorithm
- Experimental results and conclusions

13 Naive Bayes classifier
The two most successful techniques for text categorization are Naive Bayes and SVM.
Naive Bayes: a document d_i is assigned to the class C_r such that

    C = argmax_r P(C_r | d_i)

Using the Bayes rule, the probability P(C_r | d_i) can be expressed as:

    P(C_r | d_i) = P(d_i | C_r) P(C_r) / P(d_i)

14 Multinomial Naive Bayes
- Since P(d_i) is a common factor, it can be neglected
- P(C_r) can be easily estimated from the document distribution in the training set, or otherwise it can be considered constant
- The naive assumption is that the presence of each word in a document is an independent event that does not depend on the other words. This allows us to write:

    P(d_i | C_r) = prod_{w_t in d_i} P(w_t | C_r)^{N(w_t, d_i)}

where N(w_t, d_i) is the number of occurrences of word w_t in document d_i.

15 Multinomial Naive Bayes
Assuming that each document is drawn from a multinomial distribution of words, the probability of w_t in class C_j can be estimated as:

    P(w_t | C_j) = sum_{d_i in C_j} N(w_t, d_i) / sum_{w_s} sum_{d_i in C_j} N(w_s, d_i)

- This method is very simple and is one of the most used in text categorization
- Despite the strong naive assumption, it yields good performance in most cases
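The argmax decision rule and the multinomial estimate above can be sketched as follows. This is a minimal hypothetical implementation (with add-one smoothing, introduced on the next slide, to avoid zero counts), not the code used in the thesis:

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels):
    """Estimate P(C_j) and the smoothed P(w_t | C_j) from bag-of-words documents."""
    classes = sorted(set(labels))
    priors = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
    vocab = sorted(set(w for doc in docs for w in doc))
    # P(w_t | C_j) = (1 + N(w_t, C_j)) / (|V| + N(C_j))  -- add-one smoothing
    cond = {c: {w: (1 + counts[c][w]) / (len(vocab) + sum(counts[c].values()))
                for w in vocab} for c in classes}
    return priors, cond

def classify(doc, priors, cond):
    """Return argmax_j of log P(C_j) + sum_t N(w_t, d) log P(w_t | C_j)."""
    def log_posterior(c):
        return math.log(priors[c]) + sum(math.log(cond[c][w])
                                         for w in doc if w in cond[c])
    return max(priors, key=log_posterior)
```

Working in log space avoids numeric underflow when multiplying many small word probabilities.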

16 Smoothing techniques
A typical problem of probabilistic models are zero values: if a feature was never observed during training, its estimated probability is 0. When that feature is observed during classification, the 0 value cannot be used, since it makes the likelihood null.
The two main methods to avoid zeros are:
- Additive smoothing (add-one or Laplace):

    P_hat(w_t | C_j) = (1 + N(w_t, C_j)) / (|V| + N(C_j))

- Good-Turing smoothing:

    P(w(0)) = #w(1, C_j) / #w(C_j)
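A quick numeric check of the add-one formula (the counts below are hypothetical, only for illustration):

```python
def laplace_smoothed(n_wt_cj, n_cj, vocab_size):
    """Add-one estimate: P_hat(w_t | C_j) = (1 + N(w_t, C_j)) / (|V| + N(C_j))."""
    return (1 + n_wt_cj) / (vocab_size + n_cj)

# A word never seen in class C_j gets a small non-zero probability
# instead of 0, so the likelihood of a test document is never null.
p_unseen = laplace_smoothed(0, 100, 50)    # (1+0)/(50+100) = 1/150
p_common = laplace_smoothed(29, 100, 50)   # (1+29)/(50+100) = 0.2
```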

17 Distance distribution
- The distribution of documents in the feature space is nearly uniform and does not form clouds
- The distance between two similar documents and the distance between two different documents are very close
This depends on:
- The high number of dimensions
- The high number of non-discriminative words that overwhelm the others in the evaluation of the distances

18 Distance distribution (figure)

19 Information Gain
Term filtering techniques:
- Stopword list
- Luhn reduction
- Information gain
Information gain:

    IG(w_i) = sum_k IG(w_i, C_k)

    IG(w_i, C_k) = sum_{c in {C_k, not C_k}} sum_{w in {w_i, not w_i}} P(w, c) log2 [ P(w, c) / (P(w) P(c)) ]
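The two-way sum over {w_i, not w_i} x {C_k, not C_k} can be computed directly from boolean occurrence vectors; this is a small sketch assuming simple empirical probabilities:

```python
import math

def information_gain(word_occurs, in_class):
    """IG(w, C): sum over the four cells (w/not-w, C/not-C) of
    P(w, c) * log2( P(w, c) / (P(w) P(c)) ), with empirical probabilities."""
    n = len(word_occurs)
    ig = 0.0
    for wv in (True, False):
        for cv in (True, False):
            p_joint = sum(w == wv and c == cv
                          for w, c in zip(word_occurs, in_class)) / n
            p_w = sum(w == wv for w in word_occurs) / n
            p_c = sum(c == cv for c in in_class) / n
            if p_joint > 0:  # a zero cell contributes nothing to the sum
                ig += p_joint * math.log2(p_joint / (p_w * p_c))
    return ig
```

A term that perfectly predicts the class yields 1 bit, while a term independent of the class yields 0, so ranking terms by IG keeps the most discriminative ones.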

20 Learning from labeled and unlabeled data
- A new research area in Automatic Text Processing: building a large labeled dataset is usually time consuming and very expensive
- Learning from labeled and unlabeled examples: use a small initial labeled dataset and extract information from a large unlabeled dataset
- The idea: use the labeled data to initialize a labeling process on the unlabeled data, then use the newly labeled data to build the classifier

21 Learning from labeled and unlabeled data
EM algorithm:
- E step: the data are labeled using the current parameter configuration
- M step: the model is updated assuming the labels to be correct
- The model is initialized using the small labeled dataset
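The E/M alternation above can be sketched generically. Toy code with a hypothetical `train` procedure; the thesis plugs a Naive Bayes model into this skeleton:

```python
def em_with_unlabeled(train, labeled_x, labeled_y, unlabeled_x, n_iter=10):
    """Semi-supervised EM skeleton:
    - initialize the model on the small labeled dataset
    - E step: label the unlabeled data with the current model
    - M step: re-train the model assuming those labels are correct."""
    model = train(labeled_x, labeled_y)              # init from labeled data
    for _ in range(n_iter):
        pseudo_y = [model(x) for x in unlabeled_x]   # E step
        model = train(unlabeled_x, pseudo_y)         # M step
    return model
```

Here `train` is any procedure returning a callable classifier, e.g. a nearest-mean rule on one-dimensional points.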

22 Outline
- Introduction to Cross-Lingual Text Categorization: relationships with Cross-Lingual Information Retrieval; possible approaches
- Text Categorization: multinomial Naive Bayes models; distance distribution and term filtering
- Learning with labeled and unlabeled data
- The algorithm: the basic solution; the modified algorithm
- Experimental results and conclusions

23 Cross-Lingual Text Categorization
The problem can be stated as:
- We have a small labeled dataset in language L1
- We want to categorize a large unlabeled dataset in language L2
- We do not want to use experts for language L2
The idea:
- Translate the training set into language L2
- Initialize an EM algorithm with these very noisy data
- Reinforce the behavior of the classifier using the unlabeled data in language L2

24 Notation
- With L1, L2 and L1→2 we indicate the languages 1 and 2, and L1 translated into L2
- We use these subscripts for the training set Tr, the test set Ts and the classifier C: C1→2 indicates the classifier trained with Tr1→2, that is, the training set Tr1 translated into language L2

25 The basic algorithm
Flow: the training set Tr1 is translated (1→2) into Tr1→2, which starts the EM iterations; the M step trains the classifier C1→2, and the E step labels the test set Ts2, producing the results E(t)

26 The basic algorithm
- Once the classifier is trained, it can be used to label a larger dataset
- This algorithm can start with a small initial dataset, which is an advantage since our initial dataset is very noisy
Problems:
- Data
- Translation
- Algorithm

27 Data
- Temporal dependency: documents on the same topic at different times deal with different themes
- Geographical dependency: documents on the same topic in different places deal with different persons, facts, etc.
- Goal: find the discriminative terms for each topic, independent of time and place

28 Translation
The translator performs very poorly, especially when the text is badly written:
- Named Entity Recognition (NER): words that should not be translated; different words referring to the same entity
- Word-sense disambiguation: a fundamental problem in translation

29 Algorithm
The EM algorithm has some important limitations:
- The trivial solution is a good solution: all documents in a single cluster, all the other clusters empty
- It usually tends to form a few large central clusters and many small peripheral clusters: this depends on the starting point and on the noise added to the clusters at each EM step

30 Improved algorithm using IG
Flow: the translated training set Tr1→2 is filtered by IG with parameter k1 and starts the EM iterations; at each iteration, the E step labels Ts2 producing the results E(t), the results are filtered by IG with parameter k2, and the M step re-trains the classifier C1→2

31 The filter k1
- Highly selective, since the data are composed of translated text and are very noisy
- It initializes the EM process by selecting the most informative words in the data Tr1→2

32 The filter k2
- It has a regularization effect on the EM algorithm: it selects the most discriminative words at each EM iteration
- The non-discriminative words do not influence the updating of the centroids during the EM iterations
- This parameter should be larger than k1, since it works on the original (untranslated) data
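Putting the two IG filters into the EM loop gives the following skeleton. The helper names (`train`, `classify`, `ig_rank`) are hypothetical; `ig_rank` is assumed to return terms sorted by decreasing information gain:

```python
def filter_terms(docs, keep):
    """Drop every term not in the selected set."""
    return [[w for w in d if w in keep] for d in docs]

def filtered_em(train, classify, ig_rank, tr_docs, tr_y, ts_docs,
                k1, k2, n_iter=10):
    """Improved EM: an aggressive IG filter (k1) on the noisy translated
    training set Tr_{1->2}, then a looser filter (k2 > k1) applied to the
    original-language data Ts_2 at every EM iteration."""
    keep1 = set(ig_rank(tr_docs, tr_y)[:k1])                # k1: init filter
    model = train(filter_terms(tr_docs, keep1), tr_y)
    for _ in range(n_iter):
        pseudo_y = [classify(model, d) for d in ts_docs]    # E step on Ts_2
        keep2 = set(ig_rank(ts_docs, pseudo_y)[:k2])        # k2: regularize M step
        model = train(filter_terms(ts_docs, keep2), pseudo_y)
    return model
```

Because keep2 is recomputed from the current pseudo-labels, the filter adapts as the EM labeling improves, which is the regularization effect described above.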

33 Outline
- Introduction to Cross-Lingual Text Categorization: relationships with Cross-Lingual Information Retrieval; possible approaches
- Text Categorization: multinomial Naive Bayes models; distance distribution and term filtering
- Learning with labeled and unlabeled data
- The algorithm: the basic solution; the modified algorithm
- Experimental results and conclusions

34 Previous works
- Nuria et al. used the ILO corpus and two languages (English, Spanish) to test three different approaches to CLTC: poly-lingual, test set translation, and profile-based translation
- They used the Winnow (ANN) and Rocchio algorithms
- They compared the results with the monolingual test
- Low performance: 70%-75%

35 Multi-lingual dataset
- Very few multi-lingual data sets are available, and none with the Italian language
- We built the data set by crawling newsgroups
- Newsgroups offer: availability of the same groups in different languages; a large number of available messages; different levels of detail for each topic

36 Multi-lingual dataset composition
- Two languages: Italian (L_I) and English (L_E)
- Three groups: auto, hardware and sport
[table: TRAIN/TEST sizes of Tr_I, Tr_E and Ts_I per group (Auto, Hw, Sports, total)]

37 Multi-lingual dataset drawbacks
- Short messages
- Informal documents: slang terms, badly written words
- Often transversal topics: advertising, spam, other current topics (e.g. elections)
- Temporal dependency: the same topic at two different moments deals with different problems
- Geographical dependency: the same topic in two different places deals with different persons, facts, etc.

38 Monolingual test
No translation: training set and test set both in Italian (classifier C_I trained on Tr_I, tested on Ts_I).

            Auto            Hw              Sports          total
Recall      94.01 ± 1.03%   96.21 ± 0.93%   92.89 ± 1.12%   94.43 ± 0.90%
Precision   93.76 ± 1.09%   93.01 ± 0.45%   96.74 ± 1.24%   94.43 ± 0.90%

Results are averaged over a ten-fold cross-validation.

39 Baseline multi-lingual test
Translation from English to Italian: classifier C_E→I trained on Tr_E→I (the training set Tr_E translated E→I), tested on Ts_I.

            Auto            Hw              Sports          total
Recall      69.56 ± 5.34%   87.24 ± 2.02%   50.95 ± 6.28%   69.26 ± 4.22%
Precision   66.56 ± 4.76%   63.35 ± 3.72%   88.22 ± 4.36%   69.26 ± 4.22%

Results are averaged over a ten-fold cross-validation.

40 Simple EM algorithm
Translation from English to Italian: the EM iterations start from Tr_E→I; the classifier C_E→I is tested on Ts_I.

            Auto            Hw              Sports          total
Recall      71.32 ± 1.05%   98.04 ± 1.01%   0.73 ± 0.41%    56.32 ± 1.10%
Precision   51.40 ± 1.00%   61.55 ± 0.98%   65.41 ± 0.05%   56.32 ± 1.10%

Results are averaged over a ten-fold cross-validation.

41 Filtered EM algorithm (k1 = 300, k2 = 1000)
Translation from English to Italian: the EM iterations start from Tr_E→I filtered by IG with k1; at each iteration Ts_I is filtered by IG with k2; the classifier C_E→I is tested on Ts_I.

            Auto            Hw              Sports          total
Recall      92.59 ± 1.05%   87.88 ± 0.98%   91.01 ± 1.03%   90.64 ± 0.96%
Precision   87.07 ± 1.02%   92.78 ± 0.88%   92.28 ± 0.90%   90.64 ± 0.96%

Results are averaged over a ten-fold cross-validation.

42 Conclusions
- The filtered EM algorithm performs better than the other algorithms in the literature
- It does not need an initial labeled dataset in the target language: no other proposed algorithm has this feature
- It achieves good results starting from a few translated documents, so it does not require much translation time


More information

Learning from Labeled and Unlabeled Data

Learning from Labeled and Unlabeled Data Learning from Labeled and Unlabeled Data Machine Learning 10-601 March 31, 2008 Tom M. Mitchell Machine Learning Department Carnegie Mellon University When can Unlabeled Data improve supervised learning?

More information

Prediction of Yelp Review Star Rating using Sentiment Analysis

Prediction of Yelp Review Star Rating using Sentiment Analysis Prediction of Yelp Review Star Rating using Sentiment Analysis Chen Li (Stanford EE) & Jin Zhang (Stanford CEE) 1 Introduction Yelp aims to help people find great local businesses, e.g. restaurants. Automated

More information

BizPro: Extracting and Categorizing Business Intelligence Factors from News

BizPro: Extracting and Categorizing Business Intelligence Factors from News BizPro: Extracting and Categorizing Business Intelligence Factors from News Wingyan Chung, Ph.D. Institute for Simulation and Training wchung@ucf.edu Definitions and Research Highlights BI Factor: qualitative

More information

A Novel Framework for Incorporating Labeled Examples into Anomaly Detection

A Novel Framework for Incorporating Labeled Examples into Anomaly Detection A Novel Framework for Incorporating Labeled Examples into Anomaly Detection Jing Gao Haibin Cheng Pang-Ning Tan Abstract This paper presents a principled approach for incorporating labeled examples into

More information

10-601: Machine Learning Midterm Exam November 3, Solutions

10-601: Machine Learning Midterm Exam November 3, Solutions 10-601: Machine Learning Midterm Exam November 3, 2010 Solutions Instructions: Make sure that your exam has 16 pages (not including this cover sheet) and is not missing any sheets, then write your full

More information

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences

More information

Pose Estimation Based on 3D Models

Pose Estimation Based on 3D Models Pose Estimation Based on 3D Models Chuiwen Ma, Liang Shi Introduction This project aims to estimate the pose of an object in the image. Pose estimation problem is known to be an open problem and also a

More information

Greedy Term Selection for Document Classification with Given Minimal Precision

Greedy Term Selection for Document Classification with Given Minimal Precision Magyar Kutatók 7. Nemzetközi Szimpóziuma 7 th International Symposium of Hungarian Researchers on Computational Intelligence Greedy Term Selection for Document Classification with Given Minimal Precision

More information

Very convenient! Linear Methods: Logistic Regression. Module 5: Classification. STAT/BIOSTAT 527, University of Washington Emily Fox May 22 nd, 2014

Very convenient! Linear Methods: Logistic Regression. Module 5: Classification. STAT/BIOSTAT 527, University of Washington Emily Fox May 22 nd, 2014 Module 5: Classification Linear Methods: Logistic Regression STAT/BIOSTAT 527, University of Washington Emily Fox May 22 nd, 2014 1 Very convenient! implies Examine ratio: implies linear classification

More information

DISIT Lab, competence and project idea on bigdata. reasoning

DISIT Lab, competence and project idea on bigdata. reasoning DISIT Lab, competence and project idea on bigdata knowledge modeling, OD/LD and reasoning Paolo Nesi Dipartimento di Ingegneria dell Informazione, DINFO Università degli Studi di Firenze Via S. Marta 3,

More information

Research in Information Retrieval and Management

Research in Information Retrieval and Management Research in Information Retrieval and Management Susan Dumais Microsoft Research Library of Congress Feb 8, 1999 Research in IR at MS Microsoft Research (http://research.microsoft.com) Decision Theory

More information

Content-Based Recommendation

Content-Based Recommendation Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches

More information

Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers

Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers Sonal Gupta Christopher Manning Natural Language Processing Group Department of Computer Science Stanford University Columbia

More information

MODULE 2 Paradigms for Pattern Recognition LESSON 2

MODULE 2 Paradigms for Pattern Recognition LESSON 2 MODULE 2 Paradigms for Pattern Recognition LESSON 2 Statistical Pattern Recognition Keywords: Statistical, Syntactic, Representation, Vector Space, Classification 1 Different Paradigms for Pattern Recognition

More information

Domain-Specific Keyphrase Extraction

Domain-Specific Keyphrase Extraction Domain-Specific Keyphrase Extraction Eibe Frank and Gordon W. Paynter and Ian H. Witten Department of Computer Science University of Waikato Hamilton, New Zealand Carl Gutwin Department of Computer Science

More information

AN EFFICIENT TEXT CLASSIFICATION USING KNN AND NAIVE BAYESIAN

AN EFFICIENT TEXT CLASSIFICATION USING KNN AND NAIVE BAYESIAN AN EFFICIENT TEXT CLASSIFICATION USING KNN AND NAIVE BAYESIAN J.Sreemathy Research Scholar Karpagam University Coimbatore, India P. S. Balamurugan Research Scholar ANNA UNIVERSITY Coimbatore, India Abstract

More information

Text Classification by Bootstrapping with Keywords, EM and Shrinkage

Text Classification by Bootstrapping with Keywords, EM and Shrinkage Text Classification by Bootstrapping with Keywords, EM and Shrinkage Andrew McCallum mccallum@justresearch.com Just Research 4616 Henry Street Pittsburgh, PA 15213 Kamal Nigam knigam@cs.cmu.edu School

More information

Keyword Extraction and Semantic Tag Prediction

Keyword Extraction and Semantic Tag Prediction Keyword Extraction and Semantic Tag Prediction James Hong Michael Fang Stanford University Stanford University Stanford, CA - 94305 Stanford, CA - 94305 jamesh93@stanford.edu mjfang@stanford.edu Abstract

More information

Visual Codebook. Tae- Kyun Kim Sidney Sussex College

Visual Codebook. Tae- Kyun Kim Sidney Sussex College Visual Codebook Tae- Kyun Kim Sidney Sussex College 1 Visual Words Visual words are base elements to describe an image. Interest points are detected from an image Corners, Blob detector, SIFT detector

More information

Preserving Class Discriminatory Information by. Context-sensitive Intra-class Clustering Algorithm

Preserving Class Discriminatory Information by. Context-sensitive Intra-class Clustering Algorithm Preserving Class Discriminatory Information by Context-sensitive Intra-class Clustering Algorithm Yingwei Yu, Ricardo Gutierrez-Osuna, and Yoonsuck Choe Department of Computer Science Texas A&M University

More information

Machine Learning and Applications Christoph Lampert

Machine Learning and Applications Christoph Lampert Machine Learning and Applications Christoph Lampert Spring Semester 2014/2015 Lecture 2 Decision Theory (for Supervised Learning Problems) Goal: Understand existing algorithms Develop new algorithms with

More information

People Detection with DSIFT Algorithm By Bing Han, Dingyi Li and Jia Ji

People Detection with DSIFT Algorithm By Bing Han, Dingyi Li and Jia Ji 1 Introduction People Detection with Algorithm By Bing Han, Dingyi Li and Jia Ji People detection is an interesting computer vision topic. Locating people in images and videos have many potential applications,

More information

1 Maximum likelihood estimation

1 Maximum likelihood estimation COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

More information

Statistical Feature Selection Techniques for Arabic Text Categorization

Statistical Feature Selection Techniques for Arabic Text Categorization Statistical Feature Selection Techniques for Arabic Text Categorization Rehab M. Duwairi Department of Computer Information Systems Jordan University of Science and Technology Irbid 22110 Jordan Tel. +962-2-7201000

More information

Syntactic Pattern Recognition. By Nicolette Nicolosi Ishwarryah S Ramanathan

Syntactic Pattern Recognition. By Nicolette Nicolosi Ishwarryah S Ramanathan Syntactic Pattern Recognition By Nicolette Nicolosi Ishwarryah S Ramanathan Syntactic Pattern Recognition Statistical pattern recognition is straightforward, but may not be ideal for many realistic problems.

More information

Extraction of Structured Information From Online Automobile Advertisements

Extraction of Structured Information From Online Automobile Advertisements Extraction of Structured Information From Online Automobile Advertisements Nipun Bhatia, Rakshit Kumar, Shashank Senapaty Department of Computer Science, Stanford University. nipunb, rakshit, senapaty@stanford.edu

More information

Evaluation of the Document Classification Approaches

Evaluation of the Document Classification Approaches Evaluation of the Document Classification Approaches Michal Hrala and Pavel Král 1 Abstract This paper deals with one class automatic document classification. Five feature selection methods and three classifiers

More information

Boosting. Can we make dumb learners smart? Aarti Singh. Machine Learning / Oct 11, 2010

Boosting. Can we make dumb learners smart? Aarti Singh. Machine Learning / Oct 11, 2010 Boosting Can we make dumb learners smart? Aarti Singh Machine Learning 10-701/15-781 Oct 11, 2010 Slides Courtesy: Carlos Guestrin, Freund & Schapire 1 Project Proposal Due Today! 2 Why boost weak learners?

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Machine Learning model evaluation. Luigi Cerulo Department of Science and Technology University of Sannio

Machine Learning model evaluation. Luigi Cerulo Department of Science and Technology University of Sannio Machine Learning model evaluation Luigi Cerulo Department of Science and Technology University of Sannio Accuracy To measure classification performance the most intuitive measure of accuracy divides the

More information

The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm

The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm Michael Collins 1 Introduction This note covers the following topics: The Naive Bayes model for classification (with text classification

More information

AN EFFICIENT PREPROCESSING AND POSTPROCESSING TECHNIQUES IN DATA MINING

AN EFFICIENT PREPROCESSING AND POSTPROCESSING TECHNIQUES IN DATA MINING INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 AN EFFICIENT PREPROCESSING AND POSTPROCESSING TECHNIQUES IN DATA MINING R.Tamilselvi 1, B.Sivasakthi 2, R.Kavitha

More information

Machine Learning CS 6830. Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu

Machine Learning CS 6830. Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning CS 6830 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu What is Learning? Merriam-Webster: learn = to acquire knowledge, understanding, or skill

More information

A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks

A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks Firoj Alam 1, Anna Corazza 2, Alberto Lavelli 3, and Roberto Zanoli 3 1 Dept. of Information Eng. and Computer Science, University of Trento,

More information

CS534 Homework Assignment 2 Due Friday in class, April 29th

CS534 Homework Assignment 2 Due Friday in class, April 29th CS534 Homework Assignment 2 Due Friday in class, April 29th Written assignment. Consider the following decision tree: x < 25 x2 < 5 x2 < 5 x < 0 x < 5 E F A B C D (a) Draw the decision boundaries defined

More information

Spam Filtering using Naïve Bayesian Classification

Spam Filtering using Naïve Bayesian Classification Spam Filtering using Naïve Bayesian Classification Presented by: Samer Younes Outline What is spam anyway? Some statistics Why is Spam a Problem Major Techniques for Classifying Spam Transport Level Filtering

More information

Inverted Index based Modified Version of K-Means Algorithm for Text Clustering

Inverted Index based Modified Version of K-Means Algorithm for Text Clustering DOI : 10.3745/JIPS.2008.4.2.067 Journal of Information Processing Systems, Vol.4, No.2, June 2008 67 Inverted Index based Modified Version of K-Means Algorithm for Text Clustering Taeho Jo* Abstract: This

More information

Emoticon Smoothed Language Models for Twitter Sentiment Analysis

Emoticon Smoothed Language Models for Twitter Sentiment Analysis Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence Emoticon Smoothed Language Models for Twitter Sentiment Analysis Kun-Lin Liu, Wu-Jun Li, Minyi Guo Shanghai Key Laboratory of

More information

NEURAL NETWORK IN DATA MINING

NEURAL NETWORK IN DATA MINING NEURAL NETWORK IN DATA MINING Ashutosh Bhatt, Harsh Chawla, Bibhuti Bhusan Panda Computer Science and Engineering Department Dronacharya College of Engineering, Gurgaon, Haryana, India Abstract - Companies

More information

ToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database

ToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database ToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database Dina Vishnyakova 1,2, 4, *, Julien Gobeill 1,3,4, Emilie Pasche 1,2,3,4 and Patrick Ruch

More information