Supervised Learning Evaluation (via Sentiment Analysis)!




Why Analyze Sentiment?

Sentiment Analysis (Opinion Mining)
- Automatically label documents with their sentiment
- Toward a topic
- Aggregated over documents
- More fine-grained analysis
- Within specific domains

Sentiment Analysis - Approaches

What are the challenges to Sentiment Analysis?
- Domain specificity
- Thwarted expectations: "This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it can't hold up." (Pang et al., 2002)
- Sarcasm and the subtle nature of sentiment
- Sufficient, high-quality training data

Supervised Classification Example
- Accuracy: 15/20 = 0.75
- Precision: 7/12 = 0.58
- Recall: 7/7 = 1.0
- F1 = 2PR/(P + R) = 0.73
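
These four numbers fall out of simple counts. A minimal sketch in plain Python, with the counts (7 true positives, 5 false positives, 0 false negatives, 15 of 20 correct) read off the worked example above:

```python
# Metrics from the worked example: 20 documents, 15 classified
# correctly, 7 true positives, 5 false positives, 0 false negatives.
def evaluate(tp, fp, fn, correct, total):
    accuracy = correct / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, p, r, f1 = evaluate(tp=7, fp=5, fn=0, correct=15, total=20)
print(f"Accuracy  {acc:.2f}")   # 0.75
print(f"Precision {p:.2f}")     # 0.58
print(f"Recall    {r:.2f}")     # 1.00
print(f"F1        {f1:.2f}")    # 0.74 (the slide truncates to 0.73)
```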

Supervised ML in Practice
- Choice of supervised learning algorithm: Support Vector Machines, Naïve Bayes, Neural Networks, Decision Trees
- Configuration
- Feature selection
- Training data

Unsupervised Clustering Example

Evaluation: Classic Reuters-21578 Data Set
- Most (over)used data set: 21,578 documents
- 9,603 training and 3,299 test articles (ModApte split)
- 118 categories; an article can be in more than one category, so we learn 118 binary category distinctions (118 two-class classifiers)
- Average number of classes assigned: 1.24 for docs with at least one category
- Only about 10 out of 118 categories are large
- Common categories (#train, #test): Earn (2877, 1087), Acquisitions (1650, 179), Money-fx (538, 179), Grain (433, 149), Crude (389, 189), Trade (369, 119), Interest (347, 131), Ship (197, 89), Wheat (212, 71), Corn (182, 56)

Reuters Text Categorization Data Set (Reuters-21578): example document

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TEXT>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE>
<BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nation's pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three-day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter</BODY></TEXT></REUTERS>
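
For illustration, a minimal sketch of extracting the labeled fields from one such document. parse_reuters_doc is a hypothetical helper, and a real pipeline would use a proper SGML parser rather than regular expressions:

```python
import re

def parse_reuters_doc(sgml):
    """Pull topics, title, body, and split out of one <REUTERS>...</REUTERS> block."""
    topics_block = re.search(r"<TOPICS>(.*?)</TOPICS>", sgml, re.S).group(1)
    return {
        "topics": re.findall(r"<D>(.*?)</D>", topics_block),
        "title": re.search(r"<TITLE>(.*?)</TITLE>", sgml, re.S).group(1),
        "body": re.search(r"<BODY>(.*?)</BODY>", sgml, re.S).group(1),
        "split": re.search(r'LEWISSPLIT="(\w+)"', sgml).group(1),
    }

# For the document above: topics -> ['livestock', 'hog'], split -> 'TRAIN'
```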

Per-class evaluation measures, where c_ij is the number of docs actually in class i that the classifier assigned to class j:
- Recall for class i: fraction of docs in class i classified correctly = c_ii / Σ_j c_ij
- Precision for class i: fraction of docs assigned class i that are actually about class i = c_ii / Σ_j c_ji
- F measure: F1 = 2PR / (P + R)
- Accuracy: fraction of docs classified correctly = Σ_i c_ii / Σ_i Σ_j c_ij
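
The same formulas in code. A minimal sketch assuming the confusion matrix is given as a list of rows; the 2×2 matrix in the usage lines is hypothetical:

```python
# Per-class metrics from a confusion matrix C, where C[i][j] is the
# number of docs actually in class i that were assigned class j,
# matching the c_ij formulas above.
def per_class_metrics(C):
    n = len(C)
    accuracy = sum(C[i][i] for i in range(n)) / sum(map(sum, C))
    per_class = []
    for i in range(n):
        recall = C[i][i] / sum(C[i])                          # c_ii / sum_j c_ij
        precision = C[i][i] / sum(C[j][i] for j in range(n))  # c_ii / sum_j c_ji
        f1 = 2 * precision * recall / (precision + recall)
        per_class.append({"precision": precision, "recall": recall, "f1": f1})
    return accuracy, per_class

# Hypothetical 2-class matrix: 53 docs of class 0 classified correctly,
# 7 put in class 1; 12 docs of class 1 put in class 0, 28 correct.
accuracy, stats = per_class_metrics([[53, 7], [12, 28]])
print(accuracy, stats)
```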

Confusion Matrix
- Rows are the actual class; columns are the class assigned by the classifier
- The (i, j) entry counts docs actually in class i that were put in class j by the classifier; e.g., an entry of 53 means 53 docs from class i were assigned to class j

Micro- vs. Macro-Averaging
- If we have more than one class, how do we combine multiple performance measures into one quantity?
- Macroaveraging: compute performance for each class, then average
- Microaveraging: collect decisions for all classes, compute the contingency table, evaluate

Micro- vs. Macro-Averaging: Example

Class 1:
                 Truth: yes   Truth: no
Classifier: yes      10           10
Classifier: no       10          970

Class 2:
                 Truth: yes   Truth: no
Classifier: yes      90           10
Classifier: no       10          890

Micro-average table:
                 Truth: yes   Truth: no
Classifier: yes     100           20
Classifier: no       20         1860

- Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
- Microaveraged precision: 100/120 ≈ 0.83
- The microaveraged score is dominated by the score on common classes
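
A minimal sketch reproducing both averages from the per-class counts above:

```python
# Macro: per-class precision first, then average.
# Micro: pool all decisions into one table, then compute one precision.
tp1, fp1 = 10, 10   # Class 1: docs the classifier said "yes" to
tp2, fp2 = 90, 10   # Class 2: docs the classifier said "yes" to

macro_precision = ((tp1 / (tp1 + fp1)) + (tp2 / (tp2 + fp2))) / 2
micro_precision = (tp1 + tp2) / (tp1 + fp1 + tp2 + fp2)

print(macro_precision)            # 0.7
print(round(micro_precision, 2))  # 0.83
```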

Vector Space Model
- Each document is a vector; the similarity of two documents is the distance between their two vectors
- A query is just a short document: rank all docs by their distance to the query
- What's the right distance metric? Cosine similarity: the dot product (sum of products) of two normalized vectors happens to be the cosine of the angle between them: (d_j · d_k) / (|d_j| |d_k|) = cos(θ)
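
A minimal sketch of that formula; the toy term-count vectors are hypothetical:

```python
import math

# Cosine similarity: (d_j . d_k) / (|d_j| |d_k|) = cos(theta).
def cosine_similarity(d_j, d_k):
    dot = sum(a * b for a, b in zip(d_j, d_k))
    return dot / (math.sqrt(sum(a * a for a in d_j)) *
                  math.sqrt(sum(b * b for b in d_k)))

# Toy term-count vectors over a 4-term vocabulary.
doc = [3, 0, 1, 2]
query = [1, 0, 0, 1]
print(round(cosine_similarity(doc, query), 3))  # 0.945
```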

TF-IDF: What terms are most important?

TF-IDF Example
TF·IDF (Term Frequency × Inverse Document Frequency). For each document d and term t:
- TF = (frequency of t in d) / (total # of words in d)
- IDF = (total # of docs) / (# of docs containing t)
- TF-IDF = TF × IDF

TF-IDF in Practice
- Term frequency and document frequency tables are both calculated during the indexing process
- Used to summarize a document
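
A minimal sketch using the ratio form defined above (no log scaling, which many implementations add); the toy corpus is hypothetical:

```python
# TF-IDF with the ratio form from the slide: TF is the count of t in d
# over the words in d; dividing by the document frequency boosts terms
# that appear in few documents.
def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d) / len(corpus)
    return tf / df

corpus = [
    "the movie was great".split(),
    "the plot was thin".split(),
    "great cast great film".split(),
]
print(tf_idf("movie", corpus[0], corpus))  # (1/4) / (1/3) = 0.75
print(tf_idf("the", corpus[0], corpus))    # (1/4) / (2/3) = 0.375; common term scores lower
```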


Yang & Liu: SVM vs. Other Methods [results table figure not transcribed]

Naïve Bayes Classifier: Naïve Bayes Assumption
- P(c_j): estimated from the frequency of classes in the training examples
- P(x_1, x_2, ..., x_n | c_j): could only be estimated from a very large number of training examples
- Naïve Bayes conditional independence assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(x_i | c_j)

P(pos | features) ∝ P(pos) × ∏_i P(feature_i | pos)
= P(pos) × P("pomona" | pos) × P("college" | pos) × P("is" | pos) × P("great" | pos)
= 0.333 × 6/5000 × 100/5000 × 98/5000 × 40/5000
(Proportional rather than equal: the Bayes-rule denominator P(features) is the same for every class, so it can be dropped when comparing classes.)

Naïve Bayes Classification
- Input: "Pomona College is great" → emotion classifier → Output: Positive
- Training data: 300 reviews of colleges

Words         Positive Doc Count   Negative Doc Count   Neutral Doc Count
pomona                 6                    5                    5
college              100                  100                  100
great                 40                    1                    2
the                  100                  100                  100
bad                    2                   30                    2
is                    98                   99                   98
Total count         5000                 5000                 5000
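
A minimal sketch tying this table to the computation above; equal class priors of 1/3 are assumed (matching the slide's P(pos) = 0.333), and each per-word probability is the class's count over the 5000 total:

```python
# Naive Bayes scoring with the counts from the table above; reproduces
# the positive-class product shown earlier.
COUNTS = {
    "positive": {"pomona": 6, "college": 100, "great": 40,
                 "the": 100, "bad": 2, "is": 98},
    "negative": {"pomona": 5, "college": 100, "great": 1,
                 "the": 100, "bad": 30, "is": 99},
    "neutral":  {"pomona": 5, "college": 100, "great": 2,
                 "the": 100, "bad": 2, "is": 98},
}
TOTAL, PRIOR = 5000, 1 / 3  # per-class total count; equal priors assumed

def score(words, cls):
    p = PRIOR
    for w in words:
        p *= COUNTS[cls][w] / TOTAL  # P(word | class) from the table
    return p

words = "pomona college is great".split()
for cls in COUNTS:
    print(cls, score(words, cls))
print(max(COUNTS, key=lambda c: score(words, c)))  # positive
```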