Sentiment analysis using emoticons



Royden Kayhan, Lewis Moharreri, Steven Ware
Department of Computer Science, Ohio State University

Problem definition

Our aim was to apply machine learning algorithms to determine the emotion of an author from the contents of his/her tweets. Our assumption was that we can judge whether an author is happy or sad based on his/her choice of words.

Preprocessing and feature extraction

For the purpose of training our classifiers we used the Twitter dataset [1]. It is a large dataset and easy to obtain. Tweets (posts on twitter.com by Twitter users) are short and concise, usually no more than one or two sentences long. Sentences were assumed to have certain emotions associated with them (Happy, Sad, Angry, Neutral, etc.). Ideally, human labeling of such sentences as conveying a particular emotion would have been a good approach, but considering the size of the dataset (an estimated 300 million tweets) this would have been highly impractical. Hence, we decided to exploit emoticons to label our training tweets as Happy or Sad. The assumption was that if a person used a happy emoticon, then that person was probably happy at the time of posting the tweet; the same applies to a sad tweet. A typical tweet in our dataset would look something like the one shown in Figure 1.

Figure 1
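The emoticon-based labeling step can be sketched as follows. This is a minimal illustration, not the project's actual code; the emoticon sets are assumptions, and a tweet containing emoticons of both polarities is discarded (consistent with retaining only clearly happy or clearly sad tweets):

```python
# Hypothetical sketch of emoticon-based labeling: a tweet is kept only if it
# contains a happy or a sad emoticon, and that emoticon becomes its label.
# The emoticon sets below are illustrative assumptions.
HAPPY = {":)", ":-)", ":D", "=)"}
SAD = {":(", ":-(", "=("}

def label_tweet(text):
    """Return 'happy', 'sad', or None (tweet discarded)."""
    has_happy = any(e in text for e in HAPPY)
    has_sad = any(e in text for e in SAD)
    if has_happy and not has_sad:
        return "happy"
    if has_sad and not has_happy:
        return "sad"
    return None  # no emoticon, or conflicting emoticons

print(label_tweet("great day at the park :)"))  # happy
print(label_tweet("missed the bus again :("))   # sad
print(label_tweet("just a plain tweet"))        # None
```

Labeling this way trades labeling accuracy for scale: it is noisy (see the Discussion section), but it makes hundreds of thousands of labeled tweets available with no human effort.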

Please note that the tweet in Figure 1 is fictitious, but its format matches that of the dataset. Information in the tweets that was not required for training our classifiers, such as user names, tweet dates, and URLs, was removed. Stop words such as "a", "are", and "be" were also removed, as were very infrequent words, since these may not have contributed much to the training. Only tweets with happy and sad emoticons were retained. For this project we considered only tweets containing happy and sad emoticons because: 1) they are rarely used together in the same tweet, and 2) other emoticons are rarely used, so they may not contribute much to the training. Non-standard words such as LOL or ROTFL were not removed, because such words sometimes have a high correlation with the emoticon being used and usually signify some emotion.

Unbalanced training data was another problem we came across. The ratio of happy tweets to sad ones was 9 to 1. We believe this was biasing our classifier's prediction towards the happy class, so we added more Sad tweets to the training data set. Nearly 440,000 such tweets were shortlisted.

Tweets were converted into a bag-of-words format: we ignore the ordering of the words for our classification, and we maintain a dictionary of all the words that have appeared at least once.

Description of Machine Learning Algorithms used

Naïve Bayes Classifier

We model our bag of words as unigrams (a single-word dictionary), i.e. we assume that the occurrence of each word given the class is independent of any other word in the sentence for the same class.

Mathematically:

$P(w_1, \ldots, w_n \mid C_j) = \prod_{i=1}^{n} P(w_i \mid C_j)$

Out-of-Dictionary Words (ODWs) are another problem with the Naïve Bayes classifier. Words in a testing sample which have not been seen in the training phase would have a probability of zero, which is not desirable, since that zero will be multiplied by other probabilities, resulting in a zero probability for $P(C_j \mid \text{tweet})$. While implementing our Naïve Bayes classifier we used some of the concepts from a paper by David Ahn & Balder ten Cate [2]. The paper mentions a technique called Laplace's law of smoothing, and we have used it with a slight variation. For dictionary words we used the formula below:

$P(w_i \mid C_j) = \frac{\mathrm{count}(w_i, C_j) + 1}{N_j + |D| + 1}$

where $\mathrm{count}(w_i, C_j)$ is the number of occurrences of $w_i$ in tweets of class $C_j$, $N_j$ is the total number of word occurrences in class $C_j$, and $|D|$ is the dictionary size. For the ODWs we used the following formula:

$P(\mathrm{ODW} \mid C_j) = \frac{1}{N_j + |D| + 1}$

Here is how we came up with this modified method of smoothing. We built a Virtual Tweet, a long tweet containing every word in the dictionary plus one word representing any unseen word, and treated it as part of each class's training data. In this setup, the probabilities come out as in the equations above.

Another interesting problem with the Naïve Bayes classifier that we came across via this paper was the possibility of underflow due to repeated multiplication of small probabilities. To solve this problem we added the logs of the probabilities instead of multiplying the probabilities. Assuming a sample testing tweet $(w_1, \ldots, w_n)$, where $w_i$ is a word in that tweet, and a class $C_j$, the predicted class is

$\hat{C} = \arg\max_{j} \left[ \log P(C_j) + \sum_{i=1}^{n} \log P(w_i \mid C_j) \right]$
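A minimal sketch of this classifier, assuming the virtual-tweet variant of Laplace smoothing described above (each class's denominator gets one pseudo-count per dictionary word plus one slot for unseen words). The toy training tweets are illustrative, not from the dataset:

```python
import math
from collections import Counter

# Sketch of Naive Bayes with "virtual tweet" smoothing: each class is
# credited with one extra occurrence of every dictionary word plus one
# slot for out-of-dictionary words. Log probabilities avoid underflow.
def train(labeled_tweets):
    word_counts = {}          # class -> Counter of word occurrences
    class_counts = Counter()  # class -> number of tweets
    vocab = set()
    for words, label in labeled_tweets:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, class_counts, vocab

def classify(words, word_counts, class_counts, vocab):
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, counts in word_counts.items():
        denom = sum(counts.values()) + len(vocab) + 1  # virtual tweet
        score = math.log(class_counts[c] / total)      # log prior
        for w in words:
            # Smoothed likelihood for dictionary words, ODW slot otherwise.
            score += math.log((counts[w] + 1) / denom if w in vocab
                              else 1 / denom)
        if score > best_score:
            best, best_score = c, score
    return best

data = [(["sunny", "great", "day"], "happy"),
        (["awesome", "fun", "day"], "happy"),
        (["rainy", "awful", "day"], "sad")]
model = train(data)
print(classify(["great", "fun"], *model))    # happy
print(classify(["awful", "rainy"], *model))  # sad
```

Note that summing logs leaves the argmax unchanged, since log is monotonic; only the numerical behavior improves.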

K-Nearest Neighbor classifier

Two flavors of the K-Nearest Neighbor classifier were used.

Centroid-based Nearest Neighbor

Since we already have two clusters containing tweets labeled Happy and Sad, we calculate the centroid of each cluster and check whether a new tweet that needs to be classified is more similar to the centroid of the Happy cluster or to that of the Sad cluster; K in this case is effectively 1. Each tweet is an $N$-dimensional word-count vector over the dictionary, where $N$ is the dictionary size, and the centroid of cluster $i$ is the component-wise mean of its tweet vectors:

$\mathbf{c}_i = \frac{1}{|T_i|} \sum_{\mathbf{t} \in T_i} \mathbf{t}$

Figure 2

Figure 2 describes this approach. It shows two clusters whose elements are red squares and blue rhombuses; the X and the green triangle are the centroids of the respective clusters, and the black dot is the element that needs to be classified. For each class we calculate the Cosine or Jaccard similarity [3:74] between the centroid of that class and the testing tweet; the class whose centroid has the higher similarity is declared the predicted class for the testing tweet. Below is the formula for calculating the similarity using the Cosine measure:

$\mathrm{sim}_{\cos}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}$
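A minimal sketch of the centroid-based classifier using the Cosine measure; the word-count (Counter) representation and the example tweets are assumptions for illustration only:

```python
import math
from collections import Counter

# Sketch of the centroid-based nearest neighbor classifier: compute the
# mean word-count vector of each cluster, then assign a new tweet to the
# class whose centroid is more cosine-similar. Example data is illustrative.
def centroid(tweets):
    """Component-wise mean of a cluster of word-count vectors."""
    total = Counter()
    for t in tweets:
        total.update(t)
    return {w: c / len(tweets) for w, c in total.items()}

def cosine(a, b):
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(tweet, happy_centroid, sad_centroid):
    return ("Happy" if cosine(tweet, happy_centroid)
                       >= cosine(tweet, sad_centroid) else "Sad")

happy = [Counter(["sunny", "fun"]), Counter(["great", "fun", "day"])]
sad = [Counter(["awful", "rain"]), Counter(["sad", "rain", "day"])]
print(classify(Counter(["fun", "sunny"]), centroid(happy), centroid(sad)))  # Happy
```

Since the two centroids are computed once up front, classifying a new tweet costs only two similarity computations, which is the efficiency advantage noted in the Discussion section.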

And below is the formula for calculating the similarity using the Jaccard measure, where $W_a$ and $W_b$ are the sets of words appearing in the two tweets:

$\mathrm{sim}_{J}(\mathbf{a}, \mathbf{b}) = \frac{|W_a \cap W_b|}{|W_a \cup W_b|}$

K-Nearest Neighbors

Using the traditional K-Nearest Neighbor classifier, when a testing tweet came in to be classified as Happy or Sad, we would find the K most similar tweets in the training dataset. If the majority of the K most similar tweets were Happy, the new tweet would be classified as Happy; otherwise it would be classified as Sad. K was always chosen to be an odd number, so that a tweet would be classified as either Happy or Sad and not both. We used the same Cosine and Jaccard similarity measures as the centroid-based nearest neighbor classifier.

Results and Method of training and testing

In all test cases, a testing tweet was said to have been classified accurately if the label (Happy or Sad) predicted by the classifier was the same as the label (the emoticon) that existed for that testing tweet. For testing the K-nearest neighbor classifier, we chose a much smaller data set of 10,000 tweets, because the K-nearest neighbor algorithm is very slow: the larger the training data set, the slower the algorithm. We then did a 10-fold cross validation on the data set. Figure 3 shows a plot of the accuracy vs. the value of K for the Cosine and Jaccard similarity measures. The data set used in this case included randomly chosen tweets that had happy or sad emoticons.
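The traditional K-NN vote described above can be sketched as follows, here with the Jaccard measure over word sets; the training tweets are illustrative stand-ins for the labeled dataset:

```python
from collections import Counter

# Sketch of traditional K-NN: rank all training tweets by Jaccard similarity
# to the testing tweet, keep the top k, and take a majority vote. An odd k
# ensures the two-class vote cannot tie. Example data is illustrative.
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def knn_classify(tweet, training, k=3):
    neighbors = sorted(training, key=lambda item: jaccard(tweet, item[0]),
                       reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [({"sunny", "fun"}, "Happy"),
            ({"great", "fun", "day"}, "Happy"),
            ({"love", "this", "day"}, "Happy"),
            ({"awful", "rain"}, "Sad"),
            ({"sad", "rain", "day"}, "Sad")]
print(knn_classify({"fun", "day"}, training, k=3))  # Happy
```

The full sort over the training set makes the O(T) per-query cost discussed later easy to see: every testing tweet is compared against all T training tweets.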

Figure 3

In another case, we tried varying the size of the training data set. The training set had an (almost) equal number of happy and sad tweets. The same training set was used for the Naïve Bayes classifier as well as both flavors of the nearest neighbor classifier. The testing set comprised 1,000 randomly chosen tweets with happy and sad emoticons, and the same testing data set was used for all three classifiers. Figure 4 shows a plot of how the accuracy varies with the size of the training data set for all three classifiers.

Figure 4

Lastly, we also tested the Naïve Bayes classifier with no smoothing, with smoothing, and with smoothing plus log probabilities. Figure 5 shows a plot of the accuracy vs. the size of the training dataset for all three methods.

Figure 5

Discussion

1) Our accuracy would not improve much beyond a certain point. On further analysis we discovered that people used emoticons in different ways than we expected, which may imply that emoticons are not the best labels for sentiment analysis.

2) Smoothing improved the accuracy of the Naïve Bayes classifier. Words in a testing sample that had not been seen in the training phase would have a probability of zero, which, when multiplied by other probabilities, would result in a zero probability for $P(C_j \mid \text{tweet})$, possibly leading to misclassification.

3) Log probabilities gave us substantially better results for the Naïve Bayes classifier. We assume this is due to the avoidance of underflow caused by multiplying very small probabilities.

4) We did not handle negation. We may have gotten better results if we had: there were 5,625 occurrences of negations in 93,000 tweets.

5) We did not take sentence structure into account. We are not sure this would increase classification accuracy by much, since people on Twitter often do not follow the sentence structures we would normally learn in school.

6) We had initially planned to use the perceptron, but since our training dataset was so large, we were unsure whether it would ever converge, and even if it did, how long it would take; we do not know whether the feature space is linearly separable.

7) In traditional K-NN, each testing tweet must be compared with all the training tweets, so the time complexity per testing tweet is O(T), where T is the size of the training dataset, which is quite large. In the centroid-based nearest neighbor, the centroids are calculated only once, so the time complexity is much lower; however, there is a tradeoff in terms of accuracy.

8) As the value of K increases in the traditional K-NN classifier, the accuracy seems to increase. When K is small, noisy training tweets may cause misclassification.

9) For large training sets, we found that the Jaccard similarity measure performs slightly better than the Cosine measure; for smaller training sets, the two seem to be on par with each other.

Acknowledgements

We would like to thank Dave Fuhry (http://www.cse.ohio-state.edu/~fuhry/) for sharing the Twitter data set with us. We would also like to thank Prof. Eric Fosler-Lussier (http://www.cse.ohio-state.edu/~fosler/) for his guidance.

References:

[1] www.twitter.com, Twitter, Inc. (US). Tweets from 2008 and 2009.
[2] David Ahn & Balder ten Cate. Simple language models and spam filtering with Naive Bayes, 2005. http://ilps.science.uva.nl/teaching/0405/ar/part2/assignment1.pdf
[3] Tan, Steinbach & Kumar. Introduction to Data Mining, 4th ed. Pearson Education, Inc., 2006.