Sentiment Analysis Using Emoticons
Royden Kayhan, Lewis Moharreri, Steven Ware
Department of Computer Science, Ohio State University

Problem definition

Our aim was to apply machine learning algorithms to determine the emotion of an author based on the contents of his/her tweets. Our assumption was that we can judge whether an author is happy or sad based on his/her choice of words.

Preprocessing and feature extraction

For the purpose of training our classifiers we used the Twitter dataset [1]. It is a large dataset and easy to obtain. Tweets (posts on twitter.com by Twitter users) are short and concise, usually no more than one or two sentences long. Sentences were assumed to have certain emotions associated with them (Happy, Sad, Angry, Neutral, etc.). Ideally, human labeling of such sentences as conveying a particular emotion would have been a good approach, but considering the size of the dataset (an estimated 300 million tweets) this would have been highly impractical. Hence, we decided to exploit emoticons to label our training tweets as Happy or Sad. The assumption was that if a person used a happy emoticon, then that person was probably happy at the time of posting the tweet; the same applies to a sad tweet. A typical tweet in our dataset would look something like the one shown in Figure 1.

[Figure 1: a sample tweet from the dataset]
Please note that the tweet in Figure 1 is a fictitious tweet, but the format in the dataset is the same as the one shown. Information in the tweet that was not required for training our classifiers, such as user names, tweet dates, and URLs, was removed. Stop words like "a", "are", "be", etc. were also removed. In addition, very infrequent words were removed, as they may not have contributed much to the training. Only tweets with happy and sad emoticons were retained. For this project we considered only tweets containing happy and sad emoticons because: 1) they are rarely used together in the same tweet, and 2) other emoticons are rarely used, so they may not contribute much to the training. Non-standard words such as LOL or ROTFL were not removed, because they sometimes have a high correlation with the emoticon being used and usually signify some emotion.

Unbalanced training data was another problem that we came across. The ratio of happy tweets to sad ones was 9 to 1. We believe this was biasing our classifier's prediction towards the happy class, so we added more Sad tweets to the training data set. Nearly 440,000 such tweets were shortlisted.

Tweets were converted into a bag-of-words format: we ignore the ordering of the words for our classification, and we maintain a dictionary of all the words that have appeared at least once.

Description of Machine Learning Algorithms Used

Naïve Bayes Classifier

We model our bag of words as unigrams (a single-word dictionary), i.e. we assume that the occurrence of each word given the class is independent of any other word in the sentence for the same class.
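The preprocessing pipeline above can be sketched as follows. This is an illustrative sketch only: the emoticon sets, stop-word list, and filtering rules here are assumptions, not the exact ones we used.

```python
from collections import Counter

# Assumed emoticon sets and stop-word list (illustrative, not exhaustive).
HAPPY = {":)", ":-)", ":D"}
SAD = {":(", ":-("}
STOP_WORDS = {"a", "an", "are", "be", "the", "is", "to"}

def label_and_clean(tweet):
    """Return (bag_of_words, label), or None if the tweet is unusable
    (no emoticon, or both a happy and a sad emoticon)."""
    tokens = tweet.split()
    has_happy = any(t in HAPPY for t in tokens)
    has_sad = any(t in SAD for t in tokens)
    if has_happy == has_sad:          # neither, or both -> discard
        return None
    label = "Happy" if has_happy else "Sad"
    words = [t.lower() for t in tokens
             if t not in HAPPY and t not in SAD
             and t.lower() not in STOP_WORDS
             and not t.startswith(("http://", "@"))]  # drop URLs, user names
    return Counter(words), label

result = label_and_clean("watching the game with @bob :) http://t.co/x")
# result -> (Counter({'watching': 1, 'game': 1, 'with': 1}), 'Happy')
```

In the full pipeline, very infrequent words would additionally be dropped after a first pass over the corpus counts.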
Mathematically, for a tweet with words w_1, …, w_n and a class C_j:

P(C_j | w_1, …, w_n) ∝ P(C_j) · Π_i P(w_i | C_j)

Out-of-Dictionary Words (ODW) are another problem with the Naïve Bayes classifier. Words in a testing sample which have not been seen in the training phase would have a probability of zero, which is not desirable since it will be multiplied by other probabilities, resulting in a zero probability for P(C_j | tweet). While implementing our Naïve Bayes classifier we used some of the concepts from a paper by David Ahn & Balder ten Cate [2]. The paper mentions a technique called Laplace's law of smoothing, and we have used it with a slight variation. For dictionary words we used the formula:

P(w | C_j) = (count(w, C_j) + 1) / (N_j + |V| + 1)

where count(w, C_j) is the number of occurrences of w in tweets of class C_j, N_j is the total number of word occurrences in class C_j, and |V| is the dictionary size. For the ODWs we are using the following formula:

P(ODW | C_j) = 1 / (N_j + |V| + 1)

Here we describe how we came up with this modified method of smoothing. For this purpose we build a Virtual Tweet, which is a long tweet containing all the words in the dictionary, plus one word to represent any unseen word; adding this Virtual Tweet to each class yields the denominators above. Another interesting problem with the Naïve Bayes classifier that we came across via this paper was the possibility of underflow due to repeated multiplication of small probabilities. To solve this problem we added the logs of the probabilities instead of multiplying the probabilities. Assuming a testing tweet (w_1, …, w_n), where w_i is a word in that tweet and C_j is a class, the predicted class is:

C = argmax_j [ log P(C_j) + Σ_i log P(w_i | C_j) ]
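A minimal sketch of this classifier, using the Virtual-Tweet Laplace smoothing and log probabilities described above. The class interface and variable names are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def train(self, labeled_bags):
        """labeled_bags: iterable of (Counter-of-words, label) pairs."""
        self.word_counts = defaultdict(Counter)   # class -> word counts
        self.class_counts = Counter()             # class -> number of tweets
        vocab = set()
        for bag, label in labeled_bags:
            self.word_counts[label].update(bag)
            self.class_counts[label] += 1
            vocab.update(bag)
        self.vocab_size = len(vocab)

    def _log_prob(self, word, label):
        # Laplace smoothing over the dictionary plus one extra slot for
        # any out-of-dictionary word (the "Virtual Tweet"). An unseen
        # word gets count 0, hence probability 1 / (N_j + |V| + 1).
        counts = self.word_counts[label]
        total = sum(counts.values())
        return math.log((counts[word] + 1) / (total + self.vocab_size + 1))

    def classify(self, bag):
        total_tweets = sum(self.class_counts.values())
        def score(label):
            # Sum of logs instead of a product, to avoid underflow.
            prior = math.log(self.class_counts[label] / total_tweets)
            return prior + sum(n * self._log_prob(w, label)
                               for w, n in bag.items())
        return max(self.class_counts, key=score)
```

For example, after training on a few labeled bags, `classify(Counter(["love"]))` would pick whichever class gives "love" the higher smoothed log score.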
K-Nearest Neighbor Classifier

Two flavors of the K-Nearest Neighbor classifier were used.

Centroid-based Nearest Neighbor

Since we already have two clusters containing the tweets labeled Happy and Sad, we calculate the centroid of each cluster and check whether a new tweet to be classified is more similar to the centroid of the Happy cluster or of the Sad cluster. K in this case is effectively 1. The centroid for cluster i is the average of the dictionary-word count vectors of its tweets (DW: Dictionary Word, N: Dictionary Size):

c_i = (1 / |T_i|) · Σ_{t ∈ T_i} t

where T_i is the set of tweets in cluster i, each represented as a vector of N dictionary-word counts.

[Figure 2: two clusters of points, their centroids, and an unclassified point]

Figure 2 describes this approach. It shows two clusters whose elements are either red squares or blue rhombuses. The X and the green triangle are the centroids of the respective clusters, and the black dot is the element that needs to be classified. For each class we calculate the Cosine or Jaccard similarity [3:74] of that class's centroid and the testing tweet; the class whose centroid has the higher similarity is declared the predicted class for the testing tweet. Below is the formula for the similarity of vectors x and y under the Cosine measure:

cos(x, y) = (x · y) / (‖x‖ ‖y‖)
And below is the formula for the similarity under the (extended) Jaccard measure:

J(x, y) = (x · y) / (‖x‖² + ‖y‖² − x · y)

K-Nearest Neighbors

Using the traditional K-Nearest Neighbor classifier, when a testing tweet comes in to be classified, we find the K most similar tweets in the training dataset. If the majority of the K most similar tweets are Happy tweets, then the new tweet is classified as Happy; otherwise, it is classified as Sad. K was always chosen to be an odd number, so that a tweet would be classified as either Happy or Sad, never tied between both. We used the same Cosine and Jaccard similarity measures as the centroid-based nearest neighbor classifier.

Results and Method of Training and Testing

In all test cases, a testing tweet was said to have been classified accurately if the label (Happy or Sad) predicted by the classifier was the same as the label (the emoticon) that existed for that testing tweet. For testing the K-nearest neighbor classifier, we chose a much smaller data set of 10,000 tweets, because the K-nearest neighbor algorithm is very slow: the larger the training data set, the slower the algorithm. We then did a 10-fold cross validation on the data set. Figure 3 shows a plot of the accuracy vs. the value of K for the Cosine and Jaccard similarity measures. The data set used in this case included randomly chosen tweets that had happy or sad emoticons.
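The two similarity measures and both nearest-neighbor flavors can be sketched as below, with tweets represented as bags of words. Function names and the training-set representation are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(x, y):
    """Cosine similarity of two word-count vectors (Counters)."""
    dot = sum(x[w] * y[w] for w in x)
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def jaccard(x, y):
    """Extended Jaccard (Tanimoto) coefficient for count vectors."""
    dot = sum(x[w] * y[w] for w in x)
    sq = sum(v * v for v in x.values()) + sum(v * v for v in y.values())
    return dot / (sq - dot) if sq - dot else 0.0

def centroid(bags):
    """Average word-count vector of a cluster of tweets."""
    total = Counter()
    for bag in bags:
        total.update(bag)
    return Counter({w: c / len(bags) for w, c in total.items()})

def classify_centroid(bag, centroids, sim=cosine):
    """centroids: dict mapping class label -> centroid vector."""
    return max(centroids, key=lambda label: sim(bag, centroids[label]))

def classify_knn(bag, training, k=3, sim=cosine):
    """training: list of (bag, label); majority vote over the k nearest."""
    nearest = sorted(training, key=lambda t: sim(bag, t[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

The centroid variant pays the clustering cost once up front, while `classify_knn` scans the whole training set per query, which is why we limited it to the smaller data set.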
[Figure 3: accuracy vs. the value of K for the Cosine and Jaccard similarity measures]

In another case, we tried varying the size of the training data set. The training set had an (almost) equal number of happy and sad tweets. The same training set was used for the Naïve Bayes classifier as well as both flavors of the nearest neighbor classifiers. The testing set comprised 1000 randomly chosen tweets with happy and sad emoticons. The same testing data set was used for all three
classifiers. Figure 4 shows a plot of how the accuracy varies with the size of the training data set for all three classifiers. Lastly, we also tested the Naïve Bayes classifier with no smoothing, with smoothing, and with smoothing plus log probabilities. Figure 5 shows a plot of the accuracy vs. the size of the training dataset for all three methods.

[Figure 5: accuracy vs. training set size for the three smoothing methods]

Discussion

1) Our accuracy would not improve much beyond a certain point. On further analysis we discovered that people used emoticons in different ways than we expected. This may imply that emoticons are perhaps not the best labels for sentiment analysis.

2) Smoothing improved the accuracy of the Naïve Bayes classifier. Words in a testing sample which had not been seen in the training phase would have a probability of zero, which when multiplied
by other probabilities would result in a zero probability for P(C_j | tweet), possibly leading to misclassification.

3) Log probabilities for the Naïve Bayes classifier gave us substantially better results. We assume that this is due to the avoidance of underflow caused by multiplying very small probabilities.

4) We didn't handle negation. It's possible we may have gotten better results if we had handled it. There were 5625 occurrences of negations in 93,000 tweets.

5) We didn't take sentence structure into account. We're not sure if this would increase the accuracy of classification by much, since people on Twitter often do not follow the sentence structures that we would normally learn in school.

6) We had initially planned to use the perceptron, but since our training dataset was so large, we were unsure whether it would ever converge and, even if it did, how long it would take. We do not know if the feature space is linearly separable.

7) In the case of traditional K-NN, since each testing tweet needs to be compared with all the training tweets, the time complexity for each testing tweet is O(T), where T is the size of the training dataset, which is quite large. In the case of the centroid-based nearest neighbor, since the centroids are calculated only once, the time complexity is much lower. However, there is a tradeoff in terms of accuracy.

8) As the value of K is increased in the traditional K-NN classifier, the accuracy seems to increase. When K is small, it's possible that noisy training tweets may cause misclassification.

9) For large training sets, we discovered that the Jaccard similarity measure performs slightly better than the Cosine similarity measure. For smaller training sets, though, they seem to be on par with each other.
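Point 4 above mentions unhandled negation. One common way it could be handled (not something we implemented) is to prefix every word between a negation term and the next punctuation mark with a marker such as "NOT_", so that "not happy" and "happy" become distinct features; the negation list and marker below are assumptions.

```python
import re

# Assumed (non-exhaustive) set of negation cues.
NEGATIONS = {"not", "no", "never", "cannot"}

def mark_negation(tokens):
    """Prefix words in a negated span with NOT_ until punctuation ends it."""
    marked, negating = [], False
    for tok in tokens:
        if tok in NEGATIONS or tok.endswith("n't"):
            negating = True
            marked.append(tok)
        elif re.fullmatch(r"[.,!?;]", tok):
            negating = False          # punctuation closes the negated span
            marked.append(tok)
        else:
            marked.append("NOT_" + tok if negating else tok)
    return marked

print(mark_negation(["i", "am", "not", "happy", "today", ".", "really"]))
# -> ['i', 'am', 'not', 'NOT_happy', 'NOT_today', '.', 'really']
```

The marked tokens would then feed into the same bag-of-words pipeline, letting the classifiers learn separate weights for negated words.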
Acknowledgements

We would like to thank Dave Fuhry [a] for sharing the Twitter data set with us. We would also like to thank Prof. Eric Fosler-Lussier [b] for his guidance.

References:

[1] www.twitter.com, Twitter, Inc. (US). Tweets from 2008 and 2009.
[2] David Ahn & Balder ten Cate. Simple language models and spam filtering with Naive Bayes, 2005. http://ilps.science.uva.nl/teaching/0405/ar/part2/assignment1.pdf
[3] Tan, Steinbach & Kumar. Introduction to Data Mining, 4th ed., Pearson Education, Inc., 2006.

[a] http://www.cse.ohio-state.edu/~fuhry/
[b] http://www.cse.ohio-state.edu/~fosler/