Transductive Inference for Text Classification using Support Vector Machines


Transductive Inference for Text Classification using Support Vector Machines
By: Thorsten Joachims
Speaker: Sameer Apte

Introduction
Text classification is one of the key techniques for organizing online information (organizing document databases, filtering spam email, etc.). Classifiers are learned from examples, since hand-coding them is impractical, so it is crucial for the learner to generalize well using little training data. The paper takes a transductive approach to tackle the problem of learning from small training samples.

Examples of transductive text classification tasks:
- relevance feedback
- netnews filtering
- reorganizing a document collection

The paper uses Transductive Support Vector Machines (TSVMs), which substantially improve the already excellent performance of SVMs for text classification, and presents a new algorithm for efficiently training TSVMs with 10,000 examples and more.

Transductive Inference
The problem of estimating the values of a function at given points of interest (introduced by Vapnik). In inductive inference, one uses the given empirical data to find an approximation of a functional dependency (the inductive step) and then uses this approximation to evaluate the values of the function at the points of interest (the deductive step). In transductive inference, we try to estimate the values of the function at the points of interest in one step.

Consider, for example, the problem of learning from small training samples. In the inductive approach, the learner tries to induce a decision function which has a low error rate on the whole distribution of examples for the particular learning task. In many situations, though, we do not care about the particular decision function; rather, we want to classify a given set of examples (the test set) with as few errors as possible.

Text Classification
A supervised learning problem: the goal is the automatic assignment of documents to a fixed number of semantic categories, learning classifiers from examples so that categories are assigned automatically. Documents (strings of characters) have to be transformed into a representation suitable for the learning algorithm and the classification task. IR research suggests that word stems work well as representations; for example, "computes", "computing", and "computer" are all mapped to the same word stem "comput". This yields an attribute-value representation of text: each word w_i corresponds to a feature with TF(w_i, x), the number of times word w_i occurs in document x, as its value. A minimal sketch of this representation follows.
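A minimal Python sketch of this stem-based TF representation (the toy suffix stripper stands in for a real stemmer such as Porter's; names here are illustrative, not from the paper):

    from collections import Counter
    import re

    def stem(word):
        # Toy stemmer: strip a few common English suffixes so that
        # computes/computing/computer all map to "comput".
        for suffix in ("ing", "es", "er", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 3:
                return word[:-len(suffix)]
        return word

    def tf_vector(document):
        # TF(w_i, x): number of times stem w_i occurs in document x.
        words = re.findall(r"[a-z]+", document.lower())
        return Counter(stem(w) for w in words)

    print(tf_vector("The computer computes while computing"))
    # Counter({'comput': 3, 'the': 1, 'while': 1})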

Text Classification
For improved performance, we refine the basic representation by scaling the dimensions of the feature vector with their inverse document frequency IDF(w_i):

    IDF(w_i) = log( n / DF(w_i) )

where n is the total number of documents and DF(w_i) is the number of documents the word w_i occurs in. The IDF of a word is low if it occurs in many documents and is highest if the word occurs in only one (a sketch of this weighting follows after this slide group).

Transductive Support Vector Machines (TSVMs)
The basic idea of SVMs: choose a separating hyperplane by maximizing the notion of a margin. To pick the best separating hyperplane, define a set of inequalities we want to satisfy and use optimization methods (e.g., quadratic programming) to find a satisfying solution. SVMs also have mechanisms for handling noise and non-linear separating surfaces.

TSVMs take a particular test set into account and try to minimize misclassifications of just those examples. The learner L is given a hypothesis space H of functions h : X -> {-1, +1} and a sample S_train of n training examples (x_1, y_1), ..., (x_n, y_n). It is also given a sample S_test of k test examples x_1*, ..., x_k* from the same distribution. The transductive learner L aims to select a function h_L = L(S_train, S_test) from H, using both S_train and S_test, so that the expected number of erroneous predictions on the test examples,

    R(L) = (1/k) * sum_{j=1..k} Θ( h_L(x_j*), y_j* ),

is minimized, where Θ(a, b) is zero if a = b and one otherwise. What information do we get from studying the test sample, and how can we use it?

The training and test samples split the hypothesis space H into a finite number of equivalence classes H'. Two functions from H belong to the same equivalence class if they classify both the training and the test sample in the same way. This reduces the learning problem from finding a function in the possibly infinite set H to finding one of the finitely many equivalence classes in H'. These equivalence classes can be used to build a structure of increasing VC-dimension for structural risk minimization [Vapnik].
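Returning to the representation for a moment, here is a minimal sketch of the IDF weighting defined above, building on the TF vectors from the earlier sketch (function names are illustrative):

    import math
    from collections import Counter

    def idf_weights(tf_vectors):
        # DF(w): number of documents the stem w occurs in.
        df = Counter()
        for tf in tf_vectors:
            df.update(tf.keys())
        n = len(tf_vectors)
        # IDF(w) = log(n / DF(w)): low for ubiquitous words,
        # highest for words occurring in a single document.
        return {w: math.log(n / df[w]) for w in df}

    def tfidf_vector(tf, idf):
        # Scale each TF dimension by its inverse document frequency.
        return {w: count * idf[w] for w, count in tf.items()}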

TSVMs
Using prior knowledge about the nature of the learning task, we can build a more appropriate structure and learn more quickly. We can build the structure based on the margin of separating hyperplanes on both the training and the test data: with the size of the margin we can control the maximum number of equivalence classes (i.e., the VC-dimension) [theorem by Vapnik].

TSVMs: Optimization Problem
Minimize over (y_1*, ..., y_k*, w, b):

    (1/2) ||w||^2

subject to:

    for all i = 1..n:  y_i [w · x_i + b] >= 1
    for all j = 1..k:  y_j* [w · x_j* + b] >= 1

Solving this problem means finding a labelling y_1*, ..., y_k* of the test data and a hyperplane <w, b> such that this hyperplane separates both training and test data with maximum margin (Figure 2 in the paper illustrates this).

TSVMs for Text Classification
Properties of the text classification task: a high-dimensional input space, sparse document vectors, and few irrelevant features. SVMs already do well in this setting. Why are TSVMs better? Words occur in strong co-occurrence patterns, and many IR approaches try to exploit this cluster structure. TSVMs exploit the co-occurrence information in the test data as prior knowledge about the learning task.

TSVMs for Text Classification: Example
D1 is given class A and D6 is given class B. How should D2, D3, D4, and D5 be classified? D2 and D3: class A; D4 and D5: class B. For the TSVM, the co-occurrence information in the test data is analyzed and two clusters are found, {D1, D2, D3} and {D4, D5, D6}. The TSVM gives exactly this classification: it is the maximum-margin solution. The maximum-margin bias reflects our prior knowledge about text classification well.

Experiments
Three test collections were used:
- Reuters-21578, collected from the Reuters newswire in 1987: 9,603 training documents and 3,299 test documents.
- WebKB, a collection of WWW pages [CMU text-learning group]: only the classes course, faculty, project, and student are used; 4,183 examples; pages from Cornell University are used for training and all other pages for testing.
- Ohsumed corpus: 10,000 training documents and 10,000 test documents; documents are assigned to one or more of the five most frequent disease categories; a document belongs to a category if it is indexed with at least one indexing term from that category.

Performance Measure: Precision/Recall Breakeven Point
Precision: the probability that a document predicted to be in class "+" truly belongs to this class. Recall: the probability that a document belonging to class "+" is classified into this class. The P/R breakeven point is the value for which precision and recall are equal; for the transductive SVM, the breakeven point is taken where the number of false positives equals the number of false negatives. A minimal sketch of this measure follows.
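A minimal sketch of the precision/recall breakeven point, assuming real-valued classifier scores and +1/-1 gold labels (not code from the paper). It uses the characterization on the slide: at the threshold where false positives equal false negatives, precision equals recall, because the two denominators tp+fp and tp+fn coincide:

    import numpy as np

    def pr_breakeven(scores, labels):
        best = None
        for t in np.sort(scores):
            pred = np.where(scores >= t, 1, -1)
            tp = np.sum((pred == 1) & (labels == 1))
            fp = np.sum((pred == 1) & (labels == -1))
            fn = np.sum((pred == -1) & (labels == 1))
            # Track the threshold that best balances FP and FN.
            if best is None or abs(fp - fn) < best[0]:
                best = (abs(fp - fn), tp / max(tp + fp, 1))
        # At FP == FN, precision == recall: the breakeven value.
        return best[1]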

[Result figures: comparison of different classifiers on Reuters (17 training documents, 3,299 test documents); effect of varying the size of the training set; influence of the size of the test set; comparison of different classifiers on WebKB and Ohsumed; effect of varying the size of the training set for the WebKB categories "course" and "project".]

Related Work
A Naïve Bayes classifier [Nigam] using the EM algorithm also exploits unlabeled data, but the independence assumption is violated for text. Co-training [Blum and Mitchell] uses unlabeled data and describes the problem by multiple representations, with a boosting scheme that exploits a conditional independence between these representations.

Conclusions
The margin of separating hyperplanes is a natural way to encode prior knowledge for learning text classifiers. The test set can be used as an additional source of information about margins by taking the transductive approach. The transductive approach shows improvements over the currently best-performing method, most substantially for small training samples and large test sets.

References
Text Classification using Support Vector Machines [Joachims]
Statistical Learning Theory [Vapnik]

Comments by Alex Kosolapov
Transductive Inference for Text Classification using Support Vector Machines (Joachims, 1999)

Performance: the goal of transductive inference is to classify a given set of examples (i.e., a test set) with as few errors as possible. [Result figures: large training sets; smaller training sets.]

Questions: What is the relationship between the time required for classification on the test set and the accuracy of a TSVM? How does it scale up for large test sets? Joachims does not report the training/test times for the TSVM in the paper.

References
Joachims, T. (2002). Learning to Classify Text using Support Vector Machines: Methods, Theory, and Algorithms. Kluwer Academic Publishers. ISBN 0-7923-7679-X.
Joachims, T. (1999). Transductive inference for text classification using support vector machines. ICML-99.

Algorithm TSVM
Transductive Inference for Text Classification using Support Vector Machines, by Thorsten Joachims. Presented by Sameer Apte. Comments by Kiran Vuppla.

Objective: solve the combinatorial optimization problem OP2. The algorithm is designed to handle large datasets for classification. Key idea: label the test data based on the classification of an inductive SVM, then improve the solution by changing labels so as to decrease the objective function.

Input:
- training examples (x_1, y_1), ..., (x_n, y_n)
- test examples x_1*, ..., x_k*
Parameters:
- C, C*: parameters from OP2
- num+: number of test examples to be assigned to class +
Output:
- predicted labels y_1*, ..., y_k* of the test examples

The algorithm trains an inductive SVM on the training data and classifies the test data, then increases the influence of the test examples by incrementing the cost factors C*_- and C*_+, changing the labels of test examples whenever doing so decreases the objective function. The inner quadratic program, solve_svm_qp, is:

Minimize over (w, b, ξ, ξ*):

    (1/2) ||w||^2 + C * sum_{i=1..n} ξ_i + C*_- * sum_{j: y_j* = -1} ξ_j* + C*_+ * sum_{j: y_j* = +1} ξ_j*

subject to:

    for all i = 1..n:  y_i [w · x_i + b] >= 1 - ξ_i
    for all j = 1..k:  y_j* [w · x_j* + b] >= 1 - ξ_j*
    ξ_i >= 0, ξ_j* >= 0

SVMlight [Joachims] is used to solve this problem. A simplified sketch of the outer loop follows.
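A simplified sketch of this outer loop, assuming scikit-learn's SVC as the inner solver in place of SVMlight. The priming values for C*_- and C*_+ and the doubling schedule follow the paper, but the greedy largest-slack pair selection is a simplification; this is an illustration, not a tuned implementation:

    import numpy as np
    from sklearn.svm import SVC

    def tsvm(X_train, y_train, X_test, num_plus, C=1.0, C_star=1.0):
        # Step 1: train an inductive SVM and label the test data,
        # assigning the num_plus highest-scoring examples to class +.
        clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
        y_test = -np.ones(len(X_test))
        y_test[np.argsort(-clf.decision_function(X_test))[:num_plus]] = 1

        # Step 2: start with a tiny influence for the test examples.
        C_minus = 1e-5
        C_plus = 1e-5 * num_plus / max(len(X_test) - num_plus, 1)

        X_all = np.vstack([X_train, X_test])
        n = len(X_train)
        while C_minus < C_star or C_plus < C_star:
            improved = True
            while improved:
                # Retrain on training plus currently-labelled test data,
                # encoding the costs C, C*_-, C*_+ as sample weights.
                y_all = np.concatenate([y_train, y_test])
                w = np.concatenate([np.full(n, C),
                                    np.where(y_test > 0, C_plus, C_minus)])
                clf = SVC(kernel="linear", C=1.0)
                clf.fit(X_all, y_all, sample_weight=w)
                # Slack of each test example under the current hyperplane.
                xi = np.maximum(0.0, 1.0 - y_test * clf.decision_function(X_test))
                # Swap a +/- pair whose combined slack exceeds 2; the paper
                # shows such a swap strictly decreases the objective.
                pos = np.where((y_test > 0) & (xi > 0))[0]
                neg = np.where((y_test < 0) & (xi > 0))[0]
                improved = False
                if len(pos) and len(neg):
                    a = pos[np.argmax(xi[pos])]
                    b = neg[np.argmax(xi[neg])]
                    if xi[a] + xi[b] > 2.0:
                        y_test[a], y_test[b] = -1.0, 1.0
                        improved = True
            # Step 3: double the influence of the test examples.
            C_minus = min(2 * C_minus, C_star)
            C_plus = min(2 * C_plus, C_star)
        return y_test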