Text Classification and Naïve Bayes. The Task of Text Classifica1on

Similar documents

Supervised Learning Evaluation (via Sentiment Analysis)!

Projektgruppe. Categorization of text documents via classification

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007

Data Mining. Supervised Methods. Ciro Donalek Ay/Bi 199ab: Methods of Sciences hcp://esci101.blogspot.

Machine Learning in Spam Filtering

Supervised Learning (Big Data Analytics)

1 Maximum likelihood estimation

Sentiment analysis using emoticons

Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classiﬁca6on

Semi-Supervised Learning for Blog Classification

T : Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari :

Machine Learning Final Project Spam Filtering

CSE 473: Artificial Intelligence Autumn 2010

Author Gender Identification of English Novels

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

Social Media Mining. Data Mining Essentials

Bayesian Spam Filtering

F. Aiolli - Sistemi Informativi 2007/2008

Simple Language Models for Spam Detection

Search and Information Retrieval

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Data Mining Algorithms Part 1. Dejan Sarka

Chapter 6. The stacking ensemble approach

Part III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing

Mining a Corpus of Job Ads

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing Classifier

Data Mining - Evaluation of Classifiers

Feature Subset Selection in Spam Detection

The Data Mining Process

A Content based Spam Filtering Using Optical Back Propagation Technique

Tweaking Naïve Bayes classifier for intelligent spam detection

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

Predicting the Stock Market with News Articles

Experiments in Web Page Classification for Semantic Web

Mining the Software Change Repository of a Legacy Telephony System

The Enron Corpus: A New Dataset for Classification Research

Active Learning SVM for Blogs recommendation

Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum

Spam Detection A Machine Learning Approach

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Towards better accuracy for Spam predictions

ANALYTICAL TECHNIQUES FOR DATA VISUALIZATION

Micro blogs Oriented Word Segmentation System

Machine Learning. CS 188: Artificial Intelligence Naïve Bayes. Example: Digit Recognition. Other Classification Tasks

Data Mining as a tool to Predict the Churn Behaviour among Indian bank customers

Ensemble Methods. Adapted from slides by Todd Holloway h8p://abeau<fulwww.com/2007/11/23/ ensemble- machine- learning- tutorial/

Data Mining Yelp Data - Predicting rating stars from review text

Building a Question Classifier for a TREC-Style Question Answering System

Machine Learning Capacity and Performance Analysis and R

Data Mining Practical Machine Learning Tools and Techniques

A semi-supervised Spam mail detector

MACHINE LEARNING IN HIGH ENERGY PHYSICS

ISSN: (Online) Volume 2, Issue 10, October 2014 International Journal of Advance Research in Computer Science and Management Studies

Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class

Cross-Validation. Synonyms Rotation estimation

Challenges of Cloud Scale Natural Language Processing

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

An Approach to Detect Spam s by Using Majority Voting

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Attribution. Modified from Stuart Russell s slides (Berkeley) Parts of the slides are inspired by Dan Klein s lecture material for CS 188 (Berkeley)

Chapter 7. Language models. Statistical Machine Translation

ECBDL 14: Evolu/onary Computa/on for Big Data and Big Learning Workshop July 13 th, 2014 Big Data Compe//on

Combining Global and Personal Anti-Spam Filtering

Linear Threshold Units

CSCI567 Machine Learning (Fall 2014)

How To Make A Credit Risk Model For A Bank Account

Sentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies

Customer Classification And Prediction Based On Data Mining Technique

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Evaluation & Validation: Credibility: Evaluating what has been learned

Big Data Analytics CSCI 4030

8. Machine Learning Applied Artificial Intelligence

The Scientific Data Mining Process

Predicting Flight Delays

Keywords Phishing Attack, phishing , Fraud, Identity Theft

Learning from Data: Naive Bayes

Chapter 5. Phrase-based models. Statistical Machine Translation

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Applying Machine Learning to Network Security Monitoring. Alex Pinto Chief Data Scien2st

Knowledge Discovery from patents using KMX Text Analytics

Chapter 7. Feature Selection. 7.1 Introduction

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

Natural Language Processing. Today. Logistic Regression Models. Lecture 13 10/6/2015. Jim Martin. Multinomial Logistic Regression

Applying Machine Learning to Stock Market Trading Bryce Taylor

Social Media Analy.cs (SMA)

Spam Filtering based on Naive Bayes Classification. Tianhao Sun

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Forecasting stock markets with Twitter

Sentiment Analysis of Movie Reviews and Twitter Statuses. Introduction

Automatic Text Processing: Cross-Lingual. Text Categorization

Microsoft Azure Machine learning Algorithms

On the Effectiveness of Obfuscation Techniques in Online Social Networks

LCs for Binary Classification

Data Mining Part 5. Prediction

Application of Data Mining based Malicious Code Detection Techniques for Detecting new Spyware

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Transcription:

Text Classification and Naïve Bayes The Task of Text Classifica1on

Is this spam?

Who wrote which Federalist papers? 1787-8: anonymous essays try to convince New York to ra1fy U.S Cons1tu1on: Jay, Madison, Hamilton. Authorship of 12 of the lelers in dispute 1963: solved by Mosteller and Wallace using Bayesian methods James Madison Alexander Hamilton

Male or female author? 1. By 1925 present- day Vietnam was divided into three parts under French colonial rule. The southern region embracing Saigon and the Mekong delta was the colony of Cochin- China; the central area with its imperial capital at Hue was the protectorate of Annam 2. Clara never failed to be astonished by the extraordinary felicity of her own name. She found it hard to trust herself to the mercy of fate, which had managed over the years to convert her greatest shame into one of her greatest assets S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, 2003. Gender, Genre, and Wri1ng Style in Formal WriLen Texts, Text, volume 23, number 3, pp. 321 346

Posi8ve or nega8ve movie review? unbelievably disappoin1ng Full of zany characters and richly applied sa1re, and some great plot twists this is the greatest screwball comedy ever filmed It was pathe1c. The worst part about it was the boxing scenes. 5

What is the subject of this ar8cle? 6 MEDLINE Article? MeSH Subject Category Hierarchy Antogonists and Inhibitors Blood Supply Chemistry Drug Therapy Embryology Epidemiology

Text Classifica8on Assigning subject categories, topics, or genres Spam detec1on Authorship iden1fica1on Age/gender iden1fica1on Language Iden1fica1on Sen1ment analysis

Text Classifica8on: defini8on Input: a document d a fixed set of classes C = {c 1, c 2,, c J } Output: a predicted class c C

Classifica8on Methods: Hand- coded rules Rules based on combina1ons of words or other features spam: black- list- address OR ( dollars AND have been selected ) Accuracy can be high If rules carefully refined by expert But building and maintaining these rules is expensive

Classifica8on Methods: Supervised Machine Learning Input: a document d a fixed set of classes C = {c 1, c 2,, c J } A training set of m hand- labeled documents (d 1,c 1 ),...,(d m,c m ) Output: a learned classifier γ:d! c 10

Classifica8on Methods: Supervised Machine Learning Any kind of classifier Naïve Bayes Logis1c regression Support- vector machines k- Nearest Neighbors

Text Classification and Naïve Bayes The Task of Text Classifica1on

Text Classification and Naïve Bayes Naïve Bayes (I)

Naïve Bayes Intui8on Simple ( naïve ) classifica1on method based on Bayes rule Relies on very simple representa1on of document Bag of words

The bag of words representa8on I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.! γ( )=c

The bag of words representa8on I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.! γ( )=c

The bag of words representa8on: using a subset of words x love xxxxxxxxxxxxxxxx sweet xxxxxxx satirical xxxxxxxxxx xxxxxxxxxxx great xxxxxxx xxxxxxxxxxxxxxxxxxx fun xxxx xxxxxxxxxxxxx whimsical xxxx romantic xxxx laughing xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxx recommend xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xx several xxxxxxxxxxxxxxxxx xxxxx happy xxxxxxxxx again xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxx! γ( )=c

The bag of words representa8on great! 2! love! 2! γ( )=c recommend! 1! laugh! 1! happy! 1!...!...!

Bag of words for document classifica8on Test document parser! language! label! translation!! Machine Learning! learning! training! algorithm! shrinkage! network...! NLP! parser! tag! training! translation! language...!? Garbage! Collection! garbage! collection! memory! optimization! region...! Planning! planning! temporal! reasoning! plan! language...! GUI!...!

Text Classification and Naïve Bayes Naïve Bayes (I)

Text Classification and Naïve Bayes Formalizing the Naïve Bayes Classifier

Bayes Rule Applied to Documents and Classes For a document d and a class c P(c d) = P(d c)p(c) P(d)

Naïve Bayes Classifier (I) c MAP = argmax c C = argmax c C = argmax c C P(c d) P(d c)p(c) P(d) P(d c)p(c) MAP is maximum a posteriori = most likely class Bayes Rule Dropping the denominator

Naïve Bayes Classifier (II) c MAP = argmax c C P(d c)p(c) = argmax c C P(x 1, x 2,, x n c)p(c) Document d represented as features x1..xn

Naïve Bayes Classifier (IV) c MAP = argmax c C P(x 1, x 2,, x n c)p(c) O( X n C ) parameters Could only be es1mated if a very, very large number of training examples was available. How often does this class occur? We can just count the relative frequencies in a corpus

Mul8nomial Naïve Bayes Independence Assump8ons P(x 1, x 2,, x n c) Bag of Words assump8on: Assume posi1on doesn t maler Condi8onal Independence: Assume the feature probabili1es P(x i c j ) are independent given the class c. P(x 1,, x n c) = P(x 1 c) P(x 2 c) P(x 3 c)... P(x n c)

Mul8nomial Naïve Bayes Classifier c MAP = argmax c C c NB = argmax c C P(x 1, x 2,, x n c)p(c) x X P(c j ) P(x c)

Applying Mul8nomial Naive Bayes Classifiers to Text Classifica8on positions all word posi1ons in test document c NB = argmax c j C i positions P(c j ) P(x i c j )

Text Classification and Naïve Bayes Formalizing the Naïve Bayes Classifier

Text Classification and Naïve Bayes Naïve Bayes: Learning

Sec.13.3 Learning the Mul8nomial Naïve Bayes Model First alempt: maximum likelihood es1mates simply use the frequencies in the data ˆP(c j ) = doccount(c = c j ) N doc ˆP(w i c j ) = count(w i,c j ) count(w,c j ) w V

Parameter es8ma8on ˆP(w i c j ) = count(w i,c j ) count(w, c j ) w V frac1on of 1mes word w i appears among all words in documents of topic c j Create mega- document for topic j by concatena1ng all docs in this topic Use frequency of w in mega- document

Sec.13.3 Problem with Maximum Likelihood What if we have seen no training documents with the word fantas&c and classified in the topic posi8ve (thumbs- up)? ˆP("fantastic" positive) = count("fantastic", positive) count(w, positive) Zero probabili1es cannot be condi1oned away, no maler the other evidence! w V c MAP = argmax c ˆP(c) ˆP(xi c) i = 0

Laplace (add- 1) smoothing for Naïve Bayes ˆP(w i c) = = w V # % $ count(w i, c)+1 count(w, c) +1 ( ) ) count(w i, c)+1 & count(w, c) ( + V ' w V

Mul8nomial Naïve Bayes: Learning From training corpus, extract Vocabulary Calculate P(c j ) terms For each c j in C do docs j all docs with class =c j docs P(c j ) j total # documents Calculate P(w k c j ) terms Text j single doc containing all docs j For each word w k in Vocabulary n k # of occurrences of w k in Text j n P(w k c j ) k +α n +α Vocabulary

Text Classification and Naïve Bayes Naïve Bayes: Learning

Text Classification and Naïve Bayes Naïve Bayes: Rela1onship to Language Modeling

Genera8ve Model for Mul8nomial Naïve Bayes c=china X 1 =Shanghai X 2 =and X 3 =Shenzhen X 4 =issue X 5 =bonds 38

Naïve Bayes and Language Modeling Naïve bayes classifiers can use any sort of feature URL, email address, dic1onaries, network features But if, as in the previous slides We use only word features we use all of the words in the text (not a subset) Then 39 Naïve bayes has an important similarity to language modeling.

Sec.13.2.1 Each class = a unigram language model Assigning each word: P(word c) Assigning each sentence: P(s c)=π P(word c) Class pos 0.1 I 0.1 love 0.01 this 0.05 fun 0.1 film I love this fun film 0.1 0.1.05 0.01 0.1 P(s pos) = 0.0000005

Sec.13.2.1 Naïve Bayes as a Language Model Which class assigns the higher probability to s? Model pos Model neg 0.1 I 0.1 love 0.01 this 0.2 I 0.001 love 0.01 this I 0.1 0.2 love this fun 0.1 0.01 0.05 0.001 0.01 0.005 film 0.1 0.1 0.05 fun 0.1 film 0.005 fun 0.1 film P(s pos) > P(s neg)

Text Classification and Naïve Bayes Naïve Bayes: Rela1onship to Language Modeling

Text Classification and Naïve Bayes Mul1nomial Naïve Bayes: A Worked Example

ˆP(w c) = count(w,c)+1 count(c)+ V 44 Priors: P(c)= 3 4 1 P(j)= 4 ˆP(c) = N c N Condi8onal Probabili8es: P(Chinese c) = P(Tokyo c) = (5+1) / (8+6) = 6/14 = 3/7 (0+1) / (8+6) = 1/14 P(Japan c) = (0+1) / (8+6) = 1/14 P(Chinese j) = (1+1) / (3+6) = 2/9 P(Tokyo j) = (1+1) / (3+6) = 2/9 P(Japan j) = (1+1) / (3+6) = 2/9 Doc Words Class Training 1 Chinese Beijing Chinese c 2 Chinese Chinese Shanghai c 3 Chinese Macao c 4 Tokyo Japan Chinese j Test 5 Chinese Chinese Chinese Tokyo Japan? Choosing a class: P(c d5) 3/4 * (3/7) 3 * 1/14 * 1/14 0.0003 P(j d5) 1/4 * (2/9) 3 * 2/9 * 2/9 0.0001

Naïve Bayes in Spam Filtering SpamAssassin Features: Men1ons Generic Viagra Online Pharmacy Men1ons millions of (dollar) ((dollar) NN,NNN,NNN.NN) Phrase: impress... girl From: starts with many numbers Subject is all capitals HTML has a low ra1o of text to image area One hundred percent guaranteed Claims you can be removed from the list 'Pres1gious Non- Accredited Universi1es' hlp://spamassassin.apache.org/tests_3_3_x.html

Summary: Naive Bayes is Not So Naive Very Fast, low storage requirements Robust to Irrelevant Features Irrelevant Features cancel each other without affec1ng results Very good in domains with many equally important features Decision Trees suffer from fragmentagon in such cases especially if lille data Op1mal if the independence assump1ons hold: If assumed independence is correct, then it is the Bayes Op1mal Classifier for problem A good dependable baseline for text classifica1on But we will see other classifiers that give bewer accuracy

Text Classification and Naïve Bayes Mul1nomial Naïve Bayes: A Worked Example

Text Classification and Naïve Bayes Precision, Recall, and the F measure

The 2- by- 2 con8ngency table correct not correct selected tp fp not selected fn tn

Precision and recall Precision: % of selected items that are correct Recall: % of correct items that are selected correct not correct selected tp fp not selected fn tn

A combined measure: F A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean): F 2 1 ( β + 1) PR = = 2 1 1 α + (1 α) β P + R P R The harmonic mean is a very conserva1ve average; see IIR 8.3 People usually use balanced F1 measure i.e., with β = 1 (that is, α = ½): F = 2PR/(P+R)

Text Classification and Naïve Bayes Precision, Recall, and the F measure

Text Classification and Naïve Bayes Text Classifica1on: Evalua1on

More Than Two Classes: Sets of binary classifiers Sec.14.5 Dealing with any- of or mul1value classifica1on A document can belong to 0, 1, or >1 classes. For each class c C Build a classifier γ c to dis1nguish c from all other classes c C Given test doc d, Evaluate it for membership in each class using each γ c d belongs to any class for which γ c returns true 54

More Than Two Classes: Sets of binary classifiers Sec.14.5 One- of or mul1nomial classifica1on Classes are mutually exclusive: each document in exactly one class For each class c C Build a classifier γ c to dis1nguish c from all other classes c C Given test doc d, Evaluate it for membership in each class using each γ c d belongs to the one class with maximum score 55

56 Evalua8on: Classic Reuters- 21578 Data Set Most (over)used data set, 21,578 docs (each 90 types, 200 toknens) 9603 training, 3299 test ar1cles (ModApte/Lewis split) 118 categories An ar1cle can be in more than one category Learn 118 binary category dis1nc1ons Average document (with at least one category) has 1.24 classes Only about 10 out of 118 categories are large Common categories (#train, #test) Earn (2877, 1087) Acquisitions (1650, 179) Money-fx (538, 179) Grain (433, 149) Crude (389, 189) Trade (369,119) Interest (347, 131) Ship (197, 89) Wheat (212, 71) Corn (182, 56) Sec. 15.2.4

Reuters Text Categoriza8on data set (Reuters- 21578) document Sec. 15.2.4 <REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798"> <DATE> 2-MAR-1987 16:51:43.42</DATE> <TOPICS><D>livestock</D><D>hog</D></TOPICS> <TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE> <DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter 57 </BODY></TEXT></REUTERS>

Confusion matrix c For each pair of classes <c 1,c 2 > how many documents from c 1 were incorrectly assigned to c 2? c 3,2 : 90 wheat documents incorrectly assigned to poultry Docs in test set Assigned UK Assigned poultry Assigned wheat Assigned coffee Assigned interest True UK 95 1 13 0 1 0 True poultry 0 1 0 0 0 0 True wheat 10 90 0 1 0 0 True coffee 0 0 0 34 3 7 True interest - 1 2 13 26 5 Assigned trade 58 True trade 0 0 2 14 5 10

Sec. 15.2.4 Per class evalua8on measures Recall: Frac1on of docs in class i classified correctly: j c ii c ij 59 Precision: Frac1on of docs assigned class i that are actually about class i: Accuracy: (1 - error rate) Frac1on of docs classified correctly: j i i c ii j c ii c ji c ij

Sec. 15.2.4 Micro- vs. Macro- Averaging If we have more than one class, how do we combine mul1ple performance measures into one quan1ty? Macroaveraging: Compute performance for each class, then average. Microaveraging: Collect decisions for all classes, compute con1ngency table, evaluate. 60

Sec. 15.2.4 Micro- vs. Macro- Averaging: Example Class 1 Class 2 Micro Ave. Table Truth: yes Classifier: yes 10 10 Truth: no Classifier: no 10 970 Truth: yes Classifier: yes 90 10 Truth: no Classifier: no 10 890 Truth: yes Classifier: yes 100 20 Truth: no Classifier: no 20 1860 Macroaveraged precision: (0.5 + 0.9)/2 = 0.7 Microaveraged precision: 100/120 =.83 Microaveraged score is dominated by score on common classes 61

Development Test Sets and Cross- valida8on Training set Development Test Set Test Set Metric: P/R/F1 or Accuracy Unseen test set avoid overfiƒng ( tuning to the test set ) more conserva1ve es1mate of performance Cross- valida1on over mul1ple splits Handle sampling errors from different datasets Pool results over each split Compute pooled dev set performance Training Set Training Set Dev Test Dev Test Dev Test Training Set Test Set

Text Classification and Naïve Bayes Text Classifica1on: Evalua1on

Text Classification and Naïve Bayes Text Classifica1on: Prac1cal Issues

Sec. 15.3.1 The Real World Gee, I m building a text classifier for real, now! What should I do? 65

No training data? Manually written rules Sec. 15.3.1 If (wheat or grain) and not (whole or bread) then Categorize as grain Need careful cra ing Human tuning on development data Time- consuming: 2 days per class 66

Sec. 15.3.1 Very little data? Use Naïve Bayes Naïve Bayes is a high- bias algorithm (Ng and Jordan 2002 NIPS) Get more labeled data Find clever ways to get humans to label data for you Try semi- supervised training methods: Bootstrapping, EM over unlabeled documents, 67

Sec. 15.3.1 A reasonable amount of data? Perfect for all the clever classifiers SVM Regularized Logis1c Regression You can even use user- interpretable decision trees Users like to hack Management likes quick fixes 68

Sec. 15.3.1 A huge amount of data? Can achieve high accuracy! At a cost: SVMs (train 1me) or knn (test 1me) can be too slow Regularized logis1c regression can be somewhat beler So Naïve Bayes can come back into its own again! 69

Sec. 15.3.1 Accuracy as a function of data size With enough data Classifier may not maler 70 Brill and Banko on spelling correc1on

Real- world systems generally combine: Automa1c classifica1on Manual review of uncertain/difficult/"new cases 71

Underflow Preven8on: log space Mul1plying lots of probabili1es can result in floa1ng- point underflow. Since log(xy) = log(x) + log(y) BeLer to sum logs of probabili1es instead of mul1plying probabili1es. Class with highest un- normalized log probability score is s1ll most probable. c NB = argmax c j C Model is now just max of sum of weights i positions log P(c j )+ log P(x i c j )

Sec. 15.3.2 How to tweak performance Domain- specific features and weights: very important in real performance Some1mes need to collapse terms: 73 Part numbers, chemical formulas, But stemming generally doesn t help Upweigh1ng: Coun1ng a word as if it occurred twice: 1tle words (Cohen & Singer 1996) first sentence of each paragraph (Murata, 1999) In sentences that contain 1tle words (Ko et al, 2002)

Text Classification and Naïve Bayes Text Classifica1on: Prac1cal Issues