Sentiment Analysis of Social Media Texts: A SemEval Perspective
Preslav Nakov
Qatar Computing Research Institute, HBKU
Dialog conference, May 29, 2015, Moscow, Russia
Using some slides by Pushpak Bhattacharyya, Sam Clark, Oren Etzioni, Dan Jurafsky, Bing Liu, Svetlana Kiritchenko, Zornitsa Kozareva, Saif Mohammad, Chris Manning, Mausam, Hwee Tou Ng, Alan Ritter, Sara Rosenthal, Veselin Stoyanov, Pidong Wang, Theresa Wilson, and Xiaodan Zhu
Sentiment Analysis on Twitter
General sentiment, no specific aspect:
- SemEval-2013 Task 2: 44 teams
- SemEval-2014 Task 9: 46 teams
- SemEval-2015 Task 10: 41 teams
- SemEval-2016 Task ??: coming
SemEval-2013 Task 2: Sentiment Analysis on Twitter
Task Description
Two subtasks, both classifying into positive, negative, or neutral/objective:
A. Phrase-level sentiment: words and phrases identified as subjective
B. Message-level sentiment: whole messages (tweets/SMS)
Subtask A: Phrase-Level
Subtask B: Message-Level
Data Collection
- Extract named entities (Ritter et al., 2011)
- Identify popular topics (Ritter et al., 2012): NEs frequently associated with specific dates
- Extract messages mentioning the topics
- Filter messages for sentiment: keep a message only if it contains a positive/negative term from SentiWordNet with score > 0.3
- The result is the data for annotation
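To make the filtering step concrete, here is a rough Python sketch (not the organizers' code), using NLTK's SentiWordNet interface; it keeps a message if any of its tokens has a sense whose positive or negative score exceeds the 0.3 threshold from the slide.

    # Requires: nltk.download('sentiwordnet'); nltk.download('wordnet')
    from nltk.corpus import sentiwordnet as swn

    def has_sentiment_term(tokens, threshold=0.3):
        """True if some token has a SentiWordNet pos/neg score above the threshold."""
        for tok in tokens:
            for sense in swn.senti_synsets(tok.lower()):
                if sense.pos_score() > threshold or sense.neg_score() > threshold:
                    return True
        return False

    print(has_sentiment_term("what a great combination".split()))  # likely True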
Annotation Task
Mechanical Turk HIT (3-5 workers per tweet). Instructions:
"Subjective words are ones which convey an opinion. Given a sentence, identify whether it is objective, positive, negative, or neutral. Then, identify each subjective word or phrase in the context of the sentence and mark the position of its start and end in the text boxes below. The number above each word indicates its position. The word/phrase will be generated in the adjacent textbox so that you can confirm that you chose the correct range. Choose the polarity of the word or phrase by selecting one of the radio buttons: positive, negative, or neutral. If a sentence is not subjective, please select the checkbox indicating that there are no subjective words/phrases. Please read the examples and invalid responses before beginning, if this is your first time answering this HIT."
Data Annotations
Final annotations were determined using majority vote.
Example tweet (the slide shows the subjective spans marked by Workers 1-5, and their intersection, highlighted in the same text):
I would love to watch Vampire Diaries tonight :) and some Heroes! Great combination
Example Annotations
- friday evening plans were great, but saturday's plans didn't go as expected
- i went dancing & it was an ok club, but terribly crowded :-(
- WHY THE HELL DO YOU GUYS ALL HAVE MRS. KENNEDY! SHES A FUCKING DOUCHE
- AT&T was okay but whenever they do something nice in the name of customer service it seems like a favor, while T-Mobile makes that a normal everyday thin
- obama should be impeached on TREASON charges. Our Nuclear arsenal was TOP Secret. Till HE told our enemies what we had. #Coward #Traitor
- My graduation speech: I'd like to thanks Google, Wikipedia and my computer! :D #ithingteens
Distribution of Classes

Subtask A:
                    Train    Dev    Test-Tweet     Test-SMS
Positive            5,895    648    2,734 (60%)    1,071 (46%)
Negative            3,131    430    1,541 (33%)    1,104 (47%)
Neutral               471     57      160  (3%)      159  (7%)
Total                               4,435          2,334

Subtask B:
                    Train    Dev    Test-Tweet     Test-SMS
Positive            3,662    575    1,573 (41%)      492 (23%)
Negative            1,466    340      601 (16%)      394 (19%)
Neutral/Objective   4,600    739    1,640 (43%)    1,208 (58%)
Total                               3,814          2,094
Options for Participation
1. Subtask A and/or Subtask B
2. Constrained* and/or Unconstrained (refers to the data used for training)
3. Tweets and/or SMS
* Constrained runs were used for ranking.
Participation: 148 submissions
- Subtask A: constrained (21), unconstrained (7)
- Subtask B: constrained (36), unconstrained (15)
Scoring
Recall, precision, and F-measure are calculated for the positive and negative classes of each submitted run.
Score = average(F1-positive, F1-negative)
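As a quick illustration, here is a minimal Python sketch of this score using scikit-learn (an assumption about tooling, not the official scorer): the neutral class is never averaged in directly, but it still affects the positive and negative precision/recall.

    from sklearn.metrics import f1_score

    def semeval_score(gold, predicted):
        """Average of the F1 scores of the positive and negative classes."""
        f1_pos = f1_score(gold, predicted, labels=["positive"], average="macro")
        f1_neg = f1_score(gold, predicted, labels=["negative"], average="macro")
        return (f1_pos + f1_neg) / 2.0

    gold = ["positive", "negative", "neutral", "positive"]
    pred = ["positive", "negative", "positive", "neutral"]
    print(semeval_score(gold, pred))  # 0.75 on this toy example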
Subtask A (words/phrases): Results
[bar charts of per-system scores, constrained vs. unconstrained]
Tweets, top systems: 1. NRC-Canada, 2. AVAYA, 3. Bounce
SMS, top systems: 1. GU-MLT-LT, 2. NRC-Canada, 3. AVAYA
GU-MLT-LT: careful normalization; AVAYA: dependency parse; Bounce: term length (long terms tend to be neutral)
Subtask B (messages): Results
[bar charts of per-system scores, constrained vs. unconstrained]
Tweets, top systems: 1. NRC-Canada, 2. GU-MLT-LT, 3. teragram
SMS, top systems: 1. NRC-Canada, 2. GU-MLT-LT, 3. KLUE
GU-MLT-LT: careful normalization; Bounce: term length (long terms tend to be neutral); teragram: manual rules
The Winning System: NRC-Canada
The Winning System: NRC-Canada - IS THIS THE WINNER?
Subtask A (phrase): Twitter
Subtask A (phrase): SMS
Subtask B (message): Twitter
Subtask B (message): SMS
The Winning System: NRC-Canada - SUBTASK B (MESSAGE)
The System for Subtask B
Pre-processing:
- URL -> http://someurl
- UserID -> @someuser
- Tokenization and part-of-speech (POS) tagging (CMU Twitter NLP tool)
Classifier:
- SVM with a linear kernel
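Below is a minimal, self-contained sketch in the spirit of this pipeline, not the NRC implementation: it normalizes URLs and user mentions as on the slide, then trains a linear SVM on simple n-gram counts with scikit-learn (the real system adds POS, lexicon, and many other features).

    import re
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def normalize(tweet):
        tweet = re.sub(r"https?://\S+", "http://someurl", tweet)  # URL -> http://someurl
        tweet = re.sub(r"@\w+", "@someuser", tweet)               # UserID -> @someuser
        return tweet.lower()

    train_texts = ["I love this!", "this is awful @someone", "ok I guess"]  # toy data
    train_labels = ["positive", "negative", "neutral"]

    clf = make_pipeline(CountVectorizer(preprocessor=normalize, ngram_range=(1, 2)),
                        LinearSVC())
    clf.fit(train_texts, train_labels)
    print(clf.predict(["@user I love http://t.co/xyz"]))  # e.g. ['positive']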
Features (B: message)
http://www.saifmohammad.com/webpages/abstracts/nrc-sentimentanalysis.htm
NRC: Subtask B (message): Twitter
NRC: Subtask B (message): SMS
Subtask B (message): Twitter
Subtask B (message): Twitter
Subtask B (message): SMS
The Winning System: NRC-Canada - SUBTASK A (PHRASE)
Features
Subtask A (words/phrases): Twitter
Subtask A (words/phrases): Twitter
The Dominant Polarity Baseline (subtask A: phrases)
The Winning System: NRC-Canada - THE SECRET? MASSIVE AUTOMATIC LEXICONS!
Sentiment Lexicons
What is in a sentiment lexicon?
Sentiment Lexicons
Manually built lexicons:
- NRC Emotion Lexicon (Mohammad & Turney, 2010): ~14K words
- MPQA Lexicon (Wilson et al., 2005): ~8K words
- Bing Liu Lexicon (Hu and Liu, 2004): ~6.8K words
Automatically generated lexicons:
- NRC Hashtag Sentiment Lexicon: ~650K entries!
- Sentiment140 Lexicon: >1M entries!
NRC Hashtag Sentiment Lexicon
Hashtagged words can label emotions:
That jerk stole my photo on Tumblr #grrrr #anger
NRC Hashtag Sentiment Lexicon
Seeds: synonyms of excellent, good, bad, terrible (30 positive, 47 negative)
Collect tweets with the seeds as hashtags: 775,000 tweets
A tweet is considered:
- positive if it has a positive hashtag
- negative if it has a negative hashtag
NRC Hashtag Sentiment Lexicon
Each w in the tweets is scored:
score(w) = PMI(w, positive) - PMI(w, negative)
- w is a word or a bigram
- PMI = pointwise mutual information
If score(w) > 0, then w is positive; if score(w) < 0, then w is negative.
Pointwise Mutual Information
PMI(word1, word2) = log2 [ P(word1, word2) / ( P(word1) * P(word2) ) ]
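Putting the two slides together, here is a rough Python sketch of the lexicon construction (an illustration under these definitions, not the NRC code, which also handles bigrams, non-contiguous pairs, and frequency cutoffs); it takes tweets already labeled positive/negative by their seed hashtags.

    import math
    from collections import Counter

    def build_lexicon(labeled_tweets):
        """labeled_tweets: iterable of (tokens, label), label in {'positive', 'negative'}."""
        joint, word, label = Counter(), Counter(), Counter()
        for tokens, lab in labeled_tweets:
            for w in tokens:
                joint[(w, lab)] += 1   # count(w, label)
                word[w] += 1           # count(w)
                label[lab] += 1        # tokens seen under this label
        n = sum(label.values())

        def pmi(w, lab):
            # log2( P(w, lab) / (P(w) * P(lab)) ), computed from counts
            if joint[(w, lab)] == 0:
                return float("-inf")   # never co-occurred with this label
            return math.log2(joint[(w, lab)] * n / (word[w] * label[lab]))

        # score(w) = PMI(w, positive) - PMI(w, negative)
        return {w: pmi(w, "positive") - pmi(w, "negative") for w in word}

    lex = build_lexicon([("that jerk stole my photo".split(), "negative"),
                         ("what a great day".split(), "positive")])
    print(lex["jerk"] < 0 < lex["great"])  # True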
NRC Hashtag Sentiment Lexicon
The final lexicon contains:
- 54,129 words
- 316,531 bigrams
- 308,808 non-contiguous pairs
Bigrams incorporate context:
- unpredictable story: 0.4
- unpredictable steering: -0.7
NRC Sentiment140 Lexicon
The lexicon contains:
- 62,648 words
- 677,698 bigrams
- 480,010 non-contiguous pairs
Built from 1.6 million tweets with emoticons:
- tweets with :) are considered positive
- tweets with :( are considered negative
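The same PMI machinery applies; only the distant-supervision labels change. A minimal sketch of the emoticon labeling (an illustration, not the original Sentiment140 preprocessing):

    def emoticon_label(tweet):
        """Label a tweet by its emoticons; None means skip it."""
        if ":)" in tweet and ":(" not in tweet:
            return "positive"
        if ":(" in tweet and ":)" not in tweet:
            return "negative"
        return None  # ambiguous or no emoticon

    print(emoticon_label("off to the beach :)"))  # positive

    # The labeled tweets can then be fed into build_lexicon() from the
    # previous sketch to produce a Sentiment140-style lexicon.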
OTHER LEXICONS
The General Inquirer
Home page: http://www.wjh.harvard.edu/~inquirer
List of categories: http://www.wjh.harvard.edu/~inquirer/homecat.htm
Spreadsheet: http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
Categories:
- Positiv (1,915 words) and Negativ (2,291 words)
- Strong vs. Weak, Active vs. Passive, Overstated vs. Understated
- Pleasure, Pain, Virtue, Vice, Motivation, Cognitive Orientation, etc.
Free for research use.
Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.
Not used by the NRC system!
LIWC (Linguistic Inquiry and Word Count)
Pennebaker, J.W., Booth, R.J., & Francis, M.E. (2007). Linguistic Inquiry and Word Count: LIWC 2007. Austin, TX.
Home page: http://www.liwc.net/
2,300 words, >70 classes:
- Affective processes: negative emotion (bad, weird, hate, problem, tough), positive emotion (love, nice, sweet)
- Cognitive processes: tentative (maybe, perhaps, guess), inhibition (block, constraint)
- Pronouns, negation (no, never), quantifiers (few, many)
$30 or $90 fee.
Not used by the NRC system!
MPQA Subjectivity Cues Lexicon
Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005.
Riloff and Wiebe (2003). Learning Extraction Patterns for Subjective Expressions. EMNLP-2003.
Home page: http://www.cs.pitt.edu/mpqa/subj_lexicon.html
6,885 words from 8,221 lemmas:
- 2,718 positive
- 4,912 negative
Each word is annotated for intensity (strong, weak).
GNU GPL license.
Bing Liu Opinion Lexicon
Minqing Hu and Bing Liu. Mining and Summarizing Customer Reviews. ACM SIGKDD-2004.
Lexicon download (from Bing Liu's page on opinion mining): http://www.cs.uic.edu/~liub/fbs/opinionlexicon-english.rar
6,786 words:
- 2,006 positive
- 4,783 negative
SentiWordNet
Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SENTIWORDNET 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. LREC-2010.
Home page: http://sentiwordnet.isti.cnr.it/
All WordNet synsets are automatically annotated with degrees of positivity, negativity, and neutrality/objectiveness:
- [estimable(J,3)] "may be computed or estimated": Pos 0, Neg 0, Obj 1
- [estimable(J,1)] "deserving of respect or high regard": Pos .75, Neg 0, Obj .25
Not used by the NRC system!
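These sense-level scores are easy to inspect programmatically, for instance via NLTK's SentiWordNet reader (assuming nltk and its sentiwordnet/wordnet data are installed):

    from nltk.corpus import sentiwordnet as swn

    # Print the positivity/negativity/objectivity of every adjective
    # sense of "estimable", as on the slide.
    for sense in swn.senti_synsets("estimable", "a"):
        print(sense.synset.name(),
              "pos:", sense.pos_score(),
              "neg:", sense.neg_score(),
              "obj:", sense.obj_score())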
Polarity Lexicons: Disagreement
Christopher Potts, Sentiment Tutorial, 2011
Disagreements / shared entries between each pair of lexicons:

                   Opinion Lexicon   General Inquirer   SentiWordNet      LIWC
MPQA               33/5402 (0.6%)    49/2867 (2%)       1127/4214 (27%)   12/363 (3%)
Opinion Lexicon                      32/2411 (1%)       1004/3994 (25%)   9/403 (2%)
General Inquirer                                        520/2306 (23%)    1/204 (0.5%)
SentiWordNet                                                              174/694 (25%)
SemEval-2014 Task 9: Sentiment Analysis on Twitter
SemEval-2014 Task 9
Same two subtasks:
A. Phrase-level sentiment
B. Message-level sentiment
Test Datasets: Examples
Datasets: Statistics
GOLD vs. Best/Average/Worst Turker
Annotation
Results: Subtask A (phrase polarity)
- SentiKLUE: used message-level polarity
- CMUQ-Hybrid: RBF kernel
- ThinkPositive: deep convolutional network
Results: Subtask B (message polarity)
- TeamX: fine-tuning towards the tweet dataset
- coooolll: sentiment-specific word embeddings
- RTRGO: random subspace learning
Baselines: Subtask A (phrase)
- All systems beat this
- Most systems beat this
- Very few systems beat this
NRC-Canada: Feature Importance (A)
Baselines: Subtask B (message)
- Almost all systems beat this
- 2/3 of the systems beat this
NRC-Canada: Feature Importance (B)
Progress over the Two Years
Returning teams in 2014: 18 out of 46
Improvements:
- Subtask A: 0-1 points absolute (e.g., NRC-Canada: 88.93 -> 90.14)
- Subtask B: 2-3 points absolute (e.g., NRC-Canada: 69.02 -> 70.75)
Out-of-Domain Data
Only tweets were given as training data.
Some teams were good only on tweets:
- e.g., TeamX, who tuned a weighting scheme specifically for the class imbalances in tweets
Some teams were good across all datasets:
- e.g., NRC-Canada, because they relied on lexicons
Everybody suffers on sarcasm:
- subtask A: 5-10 points
- subtask B: 10-20 points
Impact of Training Data Size
The tweet distribution:
- cannot be shipped directly: that violates the Twitter TOS
- a download script was released instead
Training data actually used:
- min: 5,215 tweets
- max: 10,882 tweets
- avg.: 8,500 tweets
The best teams had fewer than 8,500 tweets (e.g., TeamX, coooolll).
Impact of Lexicons
New Twitter sentiment lexicons (NRC-Canada):
- subtask A: +2 points
- subtask B: +6.5 points
Pre-existing sentiment lexicons:
- in general: +1-2 points
- on SMS: up to +3.5 points
Impact of Word Clusters & Embeddings
Sentiment-specific word embeddings:
- coooolll: +3-4 points
General word clusters and embeddings:
- on tweets: +0.5-1 points
- on SMS: +1-2 points
Negation Handling
1. Invert the polarity in a negated context.
2. Add NOT_ to every word in the negation context:
   didn't like this movie, but I -> didn't NOT_like NOT_this NOT_movie, but I
   (RTRGO: +1.5 points, using both 1 and 2)
3. Use a separate lexicon for negated words.
   (NRC-Canada: +1.5 points for A, +2.5 points for B)
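A minimal Python sketch of strategy 2, the scope-marking trick (an illustration; systems differ in how they detect negation words and where they close the scope, commonly at the next punctuation mark):

    import re

    NEGATIONS = {"not", "no", "never", "didn't", "don't", "can't", "won't"}

    def mark_negation(tokens):
        out, in_scope = [], False
        for tok in tokens:
            if re.fullmatch(r"[.,:;!?]", tok):
                in_scope = False                 # punctuation closes the scope
                out.append(tok)
            elif tok.lower() in NEGATIONS:
                in_scope = True                  # start a negation scope
                out.append(tok)
            else:
                out.append("NOT_" + tok if in_scope else tok)
        return out

    print(mark_negation("didn't like this movie , but I".split()))
    # ["didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'I']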
The Role of Context in Subtask A (phrase-level polarity)
NRC-Canada: +4 points
- unigrams and bigrams from a context window
- features from the entire message
BOUNCE: +6.4 points
- features from neighboring target phrases
AVAYA:
- dependency path features
Why is Subtask A (phrase-level) Easier than Subtask B (message-level)?
- 85-89% of the test phrases were already seen in training
- phrases are skewed to one polarity: 80% of the test phrases have the same polarity as their dominant polarity in training
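This is exactly why the dominant-polarity baseline is so strong for subtask A. A minimal sketch of that baseline (a plain illustration of the idea, not any team's code):

    from collections import Counter, defaultdict

    def dominant_polarity_baseline(train_pairs):
        """train_pairs: list of (phrase, label); returns a predict function."""
        per_phrase = defaultdict(Counter)
        overall = Counter()
        for phrase, label in train_pairs:
            per_phrase[phrase.lower()][label] += 1
            overall[label] += 1
        majority = overall.most_common(1)[0][0]

        def predict(phrase):
            counts = per_phrase.get(phrase.lower())
            # most frequent training label, else the overall majority class
            return counts.most_common(1)[0][0] if counts else majority

        return predict

    predict = dominant_polarity_baseline([("great", "positive"),
                                          ("great", "positive"),
                                          ("terrible", "negative")])
    print(predict("great"), predict("never seen"))  # positive positive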
SemEval-2015 Task 10: Sentiment Analysis on Twitter
New Subtasks at SemEval-2015
- Topic-Based Message Polarity Classification
- Detecting Trends Towards a Topic
- Determining the Strength of Twitter Sentiment Terms
Results: Subtask B (message polarity)
- Webis: ensemble of four approaches from previous editions of the task
- unitn: deep convolutional neural networks
- lsislif: logistic regression with special weighting for positives and negatives
- INESC-ID: word embeddings
Upcoming SemEval-2016: Sentiment Analysis on Twitter
SemEval-2016: Stars & Trends
Message-level polarity:
- pos/neg/neu: classification
- 5 stars: ordinal regression
Trend detection:
- pos/neg/neu: quantification
- 5 stars: ordinal quantification
Other Sentiment Tasks at SemEval-2015
SemEval-2015: Relevant Sentiment Tasks
- Task 9: CLIPEval Implicit Polarity of Events: explicit and implicit, pleasant and unpleasant events
- Task 10: Sentiment Analysis in Twitter: repeat of the 2013 and 2014 task; more subtasks
- Task 11: Sentiment Analysis of Figurative Language in Twitter: metaphoric and ironic tweets; intensity of sentiment
- Task 12: Aspect Based Sentiment Analysis: repeat of the 2014 task; domain adaptation
Task 9: CLIPEval Implicit Polarity of Events
- Explicit pleasant event: Yesterday I met a beautiful woman
- Explicit unpleasant event: I ate a bad McRib this week
- Implicit pleasant event: Last night I finished the sewing project
- Implicit unpleasant event: Today, I lost a bet with my grandma
A dataset of first-person sentences annotated as instantiations of pleasant and unpleasant events (MacPhillamy and Lewinsohn, 1982):
- "After that, I started to color my hair and polish my nails." -> positive, personal_care
- "When Swedish security police Saepo arrested me in 2003 I was asked questions about this man." -> negative, legal_issue
Task 10: Sentiment Analysis in Twitter
- Subtask A: Contextual Polarity Disambiguation. Given a message containing a marked instance of a word or phrase, determine whether that instance is positive, negative, or neutral in that context.
- Subtask B: Message Polarity Classification. Given a message, classify whether it is of positive, negative, or neutral sentiment.
- Subtask C (NEW): Topic-Based Message Polarity Classification. Given a message and a topic, classify whether the message is of positive, negative, or neutral sentiment towards the given topic.
- Subtask D (NEW): Detecting Trends Towards a Topic. Given a set of messages on a given topic from the same period of time, determine whether the dominant sentiment towards the target topic in these messages is (a) strongly positive, (b) weakly positive, (c) neutral, (d) weakly negative, or (e) strongly negative.
- Subtask E (NEW): Determining Degree of Prior Polarity. Given a word or a phrase, provide a score between 0 and 1 that is indicative of its strength of association with positive sentiment.
Task 11: Sentiment Analysis of Figurative Language in Twitter
- Twitter is rife with ironic, sarcastic, and figurative language.
- How does this creativity impact the perceived affect?
- Do conventional sentiment techniques need special augmentations to cope with this non-literal content?
This is not an irony detection task per se, but a sentiment analysis task in the presence of irony. Task 11 tests the capability of sentiment systems on a collection of tweets with a high concentration of sarcasm, irony, and metaphor. Tweets are hand-tagged on a sentiment scale ranging from -5 (very negative meaning) to +5 (very positive).
Task 12: Aspect Based Sentiment Analysis
Subtask 1: a set of quintuples has to be extracted from a collection of opinionated documents:
- opinion target
- target category
- target polarity
- from and to offsets, indicating the opinion target's start and end in the text
Subtask 2: same as subtask 1, but on a new, unseen domain; no training data from the target domain.
Other Sentiment Challenges
Kaggle competition on Sentiment Analysis on Movie Reviews:
- website: http://www.kaggle.com/c/sentiment-analysis-on-moviereviews
- deadline: 11:59 pm, Saturday 28 February 2015 UTC
- number of teams: 395
The sentiment labels are:
- 0: negative
- 1: somewhat negative
- 2: neutral
- 3: somewhat positive
- 4: positive
Other Sentiment Tasks
Computational Work on Other Affective States
- Emotion: detecting annoyed callers to a dialogue system; detecting confused/frustrated vs. confident students
- Mood: finding traumatized or depressed writers
- Interpersonal stances: detecting flirtation or friendliness in conversations
- Personality traits: detecting extroverts
Scherer Typology of Affective States
- Emotion: brief, organically synchronized evaluation of a major event (angry, sad, joyful, fearful, ashamed, proud, elated)
- Mood: diffuse, non-caused, low-intensity, long-duration change in subjective feeling (cheerful, gloomy, irritable, listless, depressed, buoyant)
- Interpersonal stances: affective stance toward another person in a specific interaction (friendly, flirtatious, distant, cold, warm, supportive, contemptuous)
- Attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons (liking, loving, hating, valuing, desiring)
- Personality traits: stable personality dispositions and typical behavior tendencies (nervous, anxious, reckless, morose, hostile, jealous)
CONCLUSION
The SemEval task on Sentiment Analysis:
- a testbed for comparisons
- generated new lexicons
- revealed important features
Future: get closer to what users need (stars, trends, ...)
Thank you!
Why Sentiment Analysis?
What Do People Think?
What others think has always been an important piece of information:
- Which car should I buy?
- Which schools should I apply to?
- Which professor should I work for?
- Whom should I vote for?
Google Product Search
Bing Shopping
Personality Analysis
Personality Analysis: Used by HR
Analyzing German Politicians' Profiles (Tumasjan et al., 2010)
Predicting German Elections (Tumasjan et al., 2010)
Twitter Predicts US Election Results
404 out of 435 races for the US House of Representatives: 92.8% correct!
Twitter vs. Gallup Poll of Consumer Confidence (O'Connor et al., 2010)
Twitter Sentiment (Bollen et al., 2011)
CALM vs. Dow Jones
- CALM predicts the DJIA 3 days later
- At least one current hedge fund uses this algorithm (Bollen et al., 2011)
Fake Reviews on Amazon
Review Manipulation on Yelp
Review Manipulation on Yelp
Political Trolls?
The Bulgarian Twitter Space
Spam, Trolls, Computer-Generated Content
Books:
- an algorithm by Philip Parker, INSEAD
- 1,000,000+ books generated
- 100,000+ sold on Amazon
Robo-journalism:
- tons of articles are generated this way today
- by 2025, it could cover 90% of the news
What is next?
- Robo-trolls writing fake comments/reviews?
- Computers mining text written by other computers?
- From Computational to Computer Linguistics?