Supervised Learning Evaluation (via Sentiment Analysis)!

Leo Cox
7 years ago

1 Supervised Learning Evaluation (via Sentiment Analysis)!

2 Why Analyze Sentiment?

3 Sentiment Analysis (Opinion Mining): Automatically label documents with their sentiment; toward a topic; aggregated over documents; more fine-grained analysis; within specific domains.

4 Sentiment Analysis - Approaches

5 What are the challenges to Sentiment Analysis? Domain specificity; thwarted expectations; sarcasm and the subtle nature of sentiment; sufficient, high-quality training data.

6 What are the challenges to Sentiment Analysis? Cold Small

8 What are the challenges to Sentiment Analysis? Domain specificity; thwarted expectations, for example: "This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it can't hold up." (Pang et al., 2002); sarcasm and the subtle nature of sentiment; sufficient, high-quality training data.

11 Supervised Classification Example

23 Supervised Classification Example
Accuracy: 15/20 = 0.75
Precision: 7/12 = 0.58
Recall: 7/7 = 1.0
F1 = 2PR/(P + R) = 0.73
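
The four scores in this example can be reproduced from raw counts. The sketch below is a minimal illustration, not the lecture's own code: the function name `classification_metrics` is hypothetical, and the counts (7 true positives, 5 false positives, 0 false negatives, 8 true negatives) are inferred from the fractions 7/12, 7/7, and 15/20 rather than stated on the slide.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 for one binary classifier."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts inferred from the slide's fractions (an assumption, see above).
metrics = classification_metrics(tp=7, fp=5, fn=0, tn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Computed without intermediate rounding, F1 comes out at 14/19, about 0.737. The 0.73 on the slide is consistent with plugging in the already rounded precision of 0.58.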

24 Supervised ML in practice
Supervised learning algorithm choice: Support Vector Machines, Naïve Bayes, Neural Networks, Decision Trees
Configuration
Feature Selection
Training Data

25 Unsupervised Clustering Example

27 Evaluation: Classic Reuters Data Set (Sec. 15.2.4)
Most (over)used data set: 21,578 documents
9603 training, 3299 test articles (ModApte split)
118 categories; an article can be in more than one category
Learn 118 binary category distinctions (118 two-class classifiers)
Average number of classes assigned: 1.24 for docs with at least one category
Only about 10 out of 118 categories are large
Common categories (#train, #test):
Earn (2877, 1087)
Acquisitions (1650, 179)
Money-fx (538, 179)
Grain (433, 149)
Crude (389, 189)
Trade (369, 119)
Interest (347, 131)
Ship (197, 89)
Wheat (212, 71)
Corn (182, 56)

28 Reuters Text Categorization data set (Reuters-21578) document (Sec. 15.2.4)
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE> 2-MAR :51:4342</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter
</BODY></TEXT></REUTERS>
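
Because each article can carry several topics, the category labels have to be pulled out of the `<TOPICS>` element before any of the binary classifiers can be trained. The following is a rough sketch using only Python's `re` module; the helper name `extract_topics` and the shortened sample string are assumptions made for illustration, and a real pipeline may prefer a proper SGML parser.

```python
import re


def extract_topics(sgml):
    """Return the list of <D> entries inside the first <TOPICS> element."""
    match = re.search(r"<TOPICS>(.*?)</TOPICS>", sgml, flags=re.DOTALL)
    if match is None:
        return []
    return re.findall(r"<D>(.*?)</D>", match.group(1))


# Shortened stand-in for the document shown on the slide.
sample = (
    "<TOPICS><D>livestock</D><D>hog</D></TOPICS>"
    "<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>"
)
topics = extract_topics(sample)

# One yes/no label per category, as in the per-category binary setup.
is_hog = "hog" in topics
is_earn = "earn" in topics
print(topics, is_hog, is_earn)
```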

29 Per class evaluation measures (Sec. 15.2.4)
Recall: fraction of docs in class i classified correctly: $c_{ii} / \sum_j c_{ij}$
Precision: fraction of docs assigned class i that are actually about class i: $c_{ii} / \sum_j c_{ji}$
F Measure (F1) = 2PR/(P + R)
Accuracy: fraction of docs classified correctly: $\sum_i c_{ii} / \sum_i \sum_j c_{ij}$

30 Confusion Matrix (Sec. 15.2.4)
Rows: actual class. Columns: class assigned by classifier.
An (i, j) entry of 53 means that 53 of the docs actually in class i were put in class j by the classifier.
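
The per-class measures follow directly from a confusion matrix laid out this way (rows are actual classes, columns are assigned classes). Below is a minimal sketch; the 3-class matrix is made-up illustrative data, not taken from the slides, and the function names are hypothetical.

```python
def per_class_metrics(c):
    """Compute (precision, recall, f1) per class from confusion matrix c.

    c[i][j] = number of docs actually in class i that were assigned class j.
    """
    n = len(c)
    results = []
    for i in range(n):
        actual_i = sum(c[i])                         # row sum: docs truly in class i
        assigned_i = sum(c[j][i] for j in range(n))  # column sum: docs assigned class i
        recall = c[i][i] / actual_i if actual_i else 0.0
        precision = c[i][i] / assigned_i if assigned_i else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


def accuracy(c):
    """Fraction of all docs that lie on the diagonal of c."""
    total = sum(sum(row) for row in c)
    correct = sum(c[i][i] for i in range(len(c)))
    return correct / total if total else 0.0


# Hypothetical counts for three classes.
confusion = [
    [5, 1, 0],
    [2, 6, 2],
    [0, 1, 3],
]
scores = per_class_metrics(confusion)
overall = accuracy(confusion)
print(scores, overall)
```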

31 Micro- vs Macro-Averaging (Sec. 15.2.4)
If we have more than one class, how do we combine multiple performance measures into one quantity?
Macroaveraging: compute performance for each class, then average.
Microaveraging: collect decisions for all classes, compute one contingency table, evaluate.

32 Micro- vs Macro-Averaging: Example (Sec. 15.2.4)
[Three contingency tables: Class 1, Class 2, and Micro Ave. Table. Each has columns "Truth: yes" and "Truth: no" and rows "Classifier: yes" and "Classifier: no".]
Macroaveraged precision: (precision of class 1 + precision of class 2)/2 = 0.7
Microaveraged precision: 100/120 = 0.83
The microaveraged score is dominated by the score on common classes.
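
The difference between the two averages is easy to see in code. In the sketch below the per-class counts are assumptions: a rare class with 10 true positives and 10 false positives and a common class with 90 true positives and 10 false positives. They were chosen only because they reproduce the slide's stated results (macro 0.7, micro 100/120).

```python
def precision(tp, fp):
    """Precision from true-positive and false-positive counts."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# (TP, FP) per class; illustrative values, not read from the slide's tables.
per_class_counts = [(10, 10), (90, 10)]

# Macroaverage: score each class separately, then take the unweighted mean.
macro_precision = sum(precision(tp, fp) for tp, fp in per_class_counts) / len(per_class_counts)

# Microaverage: pool the counts into one table, then score once.
pooled_tp = sum(tp for tp, _ in per_class_counts)
pooled_fp = sum(fp for _, fp in per_class_counts)
micro_precision = precision(pooled_tp, pooled_fp)

print(f"macro = {macro_precision:.2f}, micro = {micro_precision:.2f}")
```

With these counts the common class contributes 90 of the 100 pooled true positives, which is why the microaverage sits close to that class's own precision.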

33 Vector Space Model
Each document is a vector; the similarity of two documents = the distance between two vectors.
A query is just a short document: rank all docs by their distance to the query.
What's the right distance metric? Cosine similarity: the dot product (sum of products) of two normalized vectors happens to be the cosine of the angle between them.
$(d_j \cdot d_k) / (\lVert d_j \rVert \, \lVert d_k \rVert) = \cos(\theta)$
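
As a small illustration of ranking by cosine similarity, the sketch below scores three toy document vectors against a query vector. The vectors and the names `cosine_similarity` and `ranked` are made up for the example; real systems would use sparse term-weight vectors.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy term-count vectors over a three-word vocabulary.
docs = {"d1": [1, 1, 0], "d2": [0, 1, 1], "d3": [1, 0, 0]}
query = [1, 0, 0]

# Highest similarity first.
ranked = sorted(docs, key=lambda name: cosine_similarity(docs[name], query), reverse=True)
print(ranked)
```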

34 TFIDF: What terms are most important?

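35 TFIDF example
TF-IDF (Term Frequency, Inverse Doc Frequency)
For each doc d and a term t:
TF = (freq of t in d) / (total #words in d)
Document-frequency fraction = (#docs with this term) / (total #docs); dividing by it gives the IDF weighting
$\text{TFIDF}(t, d) = \dfrac{(\text{freq of } t \text{ in } d) / (\text{total \#words in } d)}{(\text{\#docs with } t) / (\text{total \#docs})}$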
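
The formula on this slide can be sketched in a few lines. The version below follows the slide literally (term frequency divided by the fraction of documents containing the term, with no logarithm); many other formulations apply a logarithm to the inverse document frequency. The toy corpus and the function name `tfidf` are assumptions for illustration.

```python
def tfidf(term, doc, docs):
    """TF-IDF as written on the slide: TF divided by the document-frequency fraction.

    doc is a list of tokens; docs is the whole collection (a list of token lists).
    """
    if not doc or not docs:
        return 0.0
    tf = doc.count(term) / len(doc)
    df_fraction = sum(1 for d in docs if term in d) / len(docs)
    return tf / df_fraction if df_fraction else 0.0


# Tiny made-up corpus, already tokenized and lowercased.
corpus = [
    ["pomona", "college", "is", "great"],
    ["the", "college", "is", "bad"],
    ["the", "food", "is", "great"],
]

rare_score = tfidf("pomona", corpus[0], corpus)   # appears in 1 of 3 docs
common_score = tfidf("is", corpus[0], corpus)     # appears in every doc
print(rare_score, common_score)
```

The rare term scores higher than the term that occurs in every document, which is the behaviour the weighting is meant to produce.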

36 TFIDF in practice
Term frequency and document frequency tables are both calculated in the indexing process.
Used to summarize a document.

38 Yang & Liu: SVM vs. Other Methods

39 Naïve Bayes Classifier: Naïve Bayes Assumption
$P(c_j)$: estimated from the frequency of classes in the training examples.
$P(x_1, x_2, \ldots, x_n \mid c_j)$: could only be estimated from a very large number of training examples.
Naïve Bayes Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities $P(x_i \mid c_j)$:
$P(x_1, x_2, \ldots, x_n \mid c_j) = \prod_i P(x_i \mid c_j)$

41 P(pos | features) ∝ P(pos) * product of the probabilities P(feature | pos)
= P(pos) * P("pomona" | pos) * P("college" | pos) * P("is" | pos) * P("great" | pos)
= 0.333 * 6/5000 * 100/5000 * 98/5000 * 40/5000
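
The product on this slide can be evaluated directly. The sketch below uses the slide's own numbers (prior 0.333 and the four word likelihoods) and sums logarithms instead of multiplying, a common way to avoid underflow when many small probabilities are combined. The function name `nb_log_score` is hypothetical, and the result is an unnormalized score for the positive class, not a probability.

```python
import math


def nb_log_score(prior, word_likelihoods):
    """Log of prior times the product of per-word likelihoods."""
    return math.log(prior) + sum(math.log(p) for p in word_likelihoods)


# Values as given on the slide for the positive class.
prior_pos = 0.333
likelihoods_pos = {
    "pomona": 6 / 5000,
    "college": 100 / 5000,
    "is": 98 / 5000,
    "great": 40 / 5000,
}

log_score_pos = nb_log_score(prior_pos, likelihoods_pos.values())
score_pos = math.exp(log_score_pos)
print(f"log score = {log_score_pos:.3f}, score = {score_pos:.3e}")
```

To classify, the same score would be computed for each class (for example negative and neutral) and the class with the largest score chosen.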

42 Naïve Bayes Classification
Training data: 300 reviews of colleges.
Input: "Pomona College is great" → Emotion Classifier → Output: Positive

43 [Table of per-word document counts. Columns: Words, Positive Doc Count, Negative Doc Count, Neutral Doc Count. Rows: pomona, college, great, the, bad, is, Total count.]
