CSE 598 Project Report: Comparison of Sentiment Aggregation Techniques
Chris MacLellan
cjmaclel@asu.edu
May 3, 2012

Abstract

Different methods for aggregating Twitter sentiment data are proposed and three approaches are compared. The naive approach of weighting each tweet equally, taken by all research on Twitter sentiment to date, was compared against two weighting schemes suggested by past work on influence in the Twitter network: number of followers and PageRank. Evaluation was not within the scope of the course project, but a system was implemented to collect tweets, predict the sentiment of the collected tweets, and visualize the tweet sentiment under the different weighting schemes. Initial visual inspection of the graphs suggests that the results for the different weighting schemes are qualitatively the same.

Introduction

Sentiment analysis is a growing field of research concerned with extracting the opinions and sentiments of users from social media sources such as blogs, micro-blogs, reviews, and newspapers [12]. The opinions and sentiments of all users can be aggregated to express some measure of public sentiment. Studies have evaluated the use of aggregated sentiment to match polling data and predict elections [11, 8, 10, 18, 5], predict box office sales [9], predict the swine flu pandemic [14], and predict stock market movements [2, 3, 22, 7, 16, 20, 19, 21]. Overall, these studies show that public sentiment has some predictive power over various systems (polls, stocks, etc.). In [2] the authors show that the aggregate sentiment of the entire Twitter network can predict the next-day closing price of the Dow with 87.6% accuracy. The authors have since joined a hedge fund that will be investing $40 million in a trading strategy based on Twitter sentiment [17]. In all of these studies, researchers give equal weight to each tweet when aggregating tweets into a composite score.
This method of aggregating public opinion contradicts the work on user influence in social networks [15, 4, 6], which uses indegree (followers), retweets, mentions, and PageRank to measure the influence of a user. Additionally, it has been shown that certain moods and sentiments are assortative between users [1], suggesting that more influential users may pass their sentiments on to their followers and that those with more influence will have a greater effect on the public sentiment.

There are additional problems with assigning equal weight to the sentiment from each tweet. When non-influential users spam tweets related to a topic, they may incorrectly skew the aggregate sentiment score. Also, when someone of high influence and someone of low influence express opinions, the opinions are treated as equal when in fact they are not. Imagine someone influential like President Obama (millions of followers) expresses a positive opinion about the economy that is contradicted by an ordinary user with only 15 followers. With equal weighting, the sentiment from the president and the ordinary user would cancel each other out. Ideally, the opinion of the president would carry more weight because he has greater influence. This project took steps to better understand this situation.

Proposed Work & Progress

The proposed work consisted of five steps:

1. Collect Tweets
2. Analyze Tweet Sentiment
3. Calculate Tweet Weight
4. Aggregate Sentiment and Visualize
5. Evaluate

In this section each of the steps is outlined and the progress made on each step is discussed.

Collect Tweets

There are two cases under which tweets can be collected for this system: historical pre-collected tweets and real-time tweets. When using pre-collected tweets, the data needs to be stored in a database so it can be searched to select only the tweets containing pertinent keywords. When using real-time tweets, the Twitter streaming API needs to be used. This API accepts a set of keywords and returns a sample of the tweets posted each second that contain those keywords. The tweets collected from the streaming API are also stored in a database. Once the tweets are collected, they are processed the same way regardless of the method of collection.

For this project, pre-collected tweets from Egypt in the month of March were provided. This data set consists of 575k tweets that are geotagged as originating in Egypt. The data was provided in XML format, and a script was written to import the XML into the database. Additionally, the code for utilizing the Twitter streaming API was implemented. Even though this functionality was not used in this project, the system is capable of collecting and storing tweets in real time.

Analyze Tweet Sentiment

Three techniques for analyzing tweet sentiment were attempted. The first attempt used a lexicon-based method cited in the literature [3, 2, 22, 11, 12, 8, 10].
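The keyword-selection step over stored tweets can be sketched against a small database. The following is a minimal illustration using SQLite; the table name, columns, and sample rows are all hypothetical, since the report does not describe the actual schema.

```python
import sqlite3

# Hypothetical tweet store; the project's real schema is not specified.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets (id INTEGER PRIMARY KEY, user TEXT, posted_at TEXT, text TEXT)"
)
rows = [
    (1, "alice", "2011-03-01", "Protests continue in Cairo today"),
    (2, "bob", "2011-03-02", "Great weather this afternoon"),
    (3, "carol", "2011-03-02", "Cairo streets are calm this morning"),
]
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", rows)

def tweets_matching(keyword):
    """Return (id, text) for the tweets whose text contains the keyword."""
    cur = conn.execute(
        "SELECT id, text FROM tweets WHERE text LIKE ? ORDER BY id",
        ("%" + keyword + "%",),
    )
    return cur.fetchall()

print(tweets_matching("Cairo"))  # rows 1 and 3 contain the keyword
```

The same parameterized query works whether the rows came from an XML import of pre-collected tweets or from the streaming API.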
This method consisted of taking all of the words in a tweet, looking them up in a lexicon of sentiment values, and averaging the sentiment values of all of the words in the tweet. This method was simple but ineffective, producing poor sentiment predictions. The second technique implemented for sentiment analysis was to use OpinionFinder, a sentiment analysis tool from the University of Pittsburgh. This tool, which was cited in much of the work on sentiment analysis [5, 11, 9, 12, 8], was also quite ineffective and did not even support simple negation detection. The final technique that was implemented was a support vector machine, which has been shown to achieve roughly 80% accuracy on sentiment classification [13]. This technique takes a set of prelabeled tweets and generates a set of feature vectors, which are then used to train the support vector machine. Once trained, the system can extract the feature vector from a tweet and use it to predict the sentiment. The sentiment output by the system is -1 for negative sentiment, 0 for neutral sentiment, and 1 for positive sentiment.

The acquisition of prelabeled tweets was a challenge. To assist with this challenge, a web-based system was implemented to crowdsource the task (see Figure 1). The hope was that users would find it interesting to see the sentiment of tweets for various terms on Twitter. The implemented system could search for tweets related to a user's query using JavaScript, would label the tweets using the support vector machine, and would then accept labels from a user where the machine had incorrectly labeled the sentiment.

Figure 1: Collecting sentiment labels. (a) Search on "ipad". (b) Search on "Obama".

These corrections were collected as training data for the support vector machine. I hand labeled 100 tweets in this fashion, being sure to tag tweets in all three classes (positive, negative, and neutral). The system should improve as more prelabeled tweets are provided. Once the system had sufficient training data, all of the tweets in the database were classified and the sentiment scores were stored for later use.

Calculate Tweet Weight

Once the sentiment score for each tweet was calculated, its weight towards the aggregate was needed. For uniform weighting, no calculation was required. For indegree, the number of followers was used to weight the sentiment score. To get the number of followers for each user, a crawler was implemented using the Ruby Twitter gem. This crawler used the Twitter REST API to look up the follower count of each user in our database, and the collected counts were stored in the database for future use. I could only place 350 requests an hour due to the rate limits on Twitter's REST API. Since there were approximately 60k users in the system, it would take about 8 days to collect all of the follower counts. I did not have time to complete this and was only able to crawl about 35k follower counts.

To weight tweets by PageRank, the PageRank score of each user in the Twitter follower network was needed. There was not sufficient time to crawl the entire Twitter network and perform the PageRank calculation myself. To resolve this problem, I used the API from http://www.trst.me, which has precalculated the PageRank of the majority of users on Twitter. A free account enabled me to make 100,000 calls a month in bursts of up to 2,000 calls per hour. Some users were not indexed by the service; these users were listed as having an influence of 0 and were essentially ignored in the aggregate.
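The classification step described above (bag-of-words feature vectors feeding a support vector machine) can be sketched as follows. The report does not name the SVM implementation used, so this self-contained stand-in trains a linear SVM by subgradient descent on the hinge loss (a Pegasos-style update). For brevity it handles only the binary positive/negative case rather than all three classes, and all example tweets, labels, and hyperparameters are illustrative.

```python
# Tiny hand-labeled training set: (tweet text, label), with 1 = positive
# and -1 = negative. Purely illustrative data.
labeled = [
    ("i love this new phone", 1),
    ("what a great day", 1),
    ("this is wonderful news", 1),
    ("i hate waiting in line", -1),
    ("this is terrible", -1),
    ("worst service ever", -1),
]

vocab = sorted({w for text, _ in labeled for w in text.split()})
index = {w: i for i, w in enumerate(vocab)}

def features(text):
    """Bag-of-words feature vector over the training vocabulary."""
    v = [0.0] * len(vocab)
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v

def train_svm(data, epochs=100, lam=0.01):
    """Linear SVM via Pegasos-style subgradient descent on the hinge loss."""
    w = [0.0] * len(vocab)
    t = 0
    for _ in range(epochs):
        for text, y in data:
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            x = features(text)
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # Shrink weights (regularization), then correct if margin < 1.
            w = [(1.0 - eta * lam) * wi for wi in w]
            if margin < 1.0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, text):
    """Sign of the decision function: 1 for positive, -1 for negative."""
    score = sum(wi * xi for wi, xi in zip(w, features(text)))
    return 1 if score >= 0 else -1

w = train_svm(labeled)
print(predict(w, "i love this great day"))
print(predict(w, "i hate this terrible line"))
```

With 100 labeled tweets instead of six, the same training loop applies unchanged; only the vocabulary and feature vectors grow.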
After collecting all of the PageRank values, only 9k out of the 60k users had non-zero PageRank values. The collection of the PageRank values took approximately 30 hours.

Aggregate Sentiment and Visualize

Once the sentiment had been calculated and the weights (follower counts and PageRank values) had been collected, I was able to generate graphs of the aggregated sentiment. To perform this aggregation I found all of the tweets for a query, binned the tweets by day, and took the weighted sum as the aggregate sentiment value for each day. I then normalized the sums so that the values fall between -1 and 1 (this was necessary for easy comparison between weighting schemes). I performed this aggregation for each weighting scheme (uniform, followers, and PageRank) and plotted all three schemes in the same graph for easy comparison.

Evaluate

In the original proposal I had planned to evaluate the various weighting schemes by comparing the resulting graphs to an objective measure of sentiment (the stock market, poll data, box office sales, etc.). I had planned to calculate the Granger causality between the objective measure and each of the weighting schemes to determine which was more predictive of the objective measure. I did not have the time to collect the necessary tweets for this type of evaluation. Since I was provided tweets from Egypt in the month of March, for which there is no obvious objective measure to compare against, I simply evaluated the graphs by visual inspection (see Figure 2). The visual inspection shows that the information presented by each of the weighting schemes differs, but mostly recognizes the same trends. There are some instances where the three schemes do not agree, March 20th in Figure 2 for example. This may be because I was unable to get all of the follower counts from Twitter, skewing the follower-weighted graph. It is important to note that the PageRank graph considers only 9k users, as opposed to the uniformly weighted graph, which considers all 60k users.
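The binning, weighted-sum, and normalization procedure just described can be sketched as follows. The sample tweets and weights are made up, and normalizing by the maximum absolute daily sum is one plausible reading of the normalization step; the report does not give the exact formula used.

```python
from collections import defaultdict

# Each entry is (day, sentiment in {-1, 0, 1}, weight), where the weight is
# 1 for uniform weighting, a follower count, or a PageRank value.
# Illustrative data only.
scored_tweets = [
    ("2011-03-19", 1, 5.0),
    ("2011-03-19", -1, 1.0),
    ("2011-03-20", -1, 3.0),
    ("2011-03-20", -1, 1.0),
    ("2011-03-21", 1, 2.0),
]

def aggregate(tweets):
    """Weighted daily sentiment sums, rescaled to lie in [-1, 1]."""
    daily = defaultdict(float)
    for day, sentiment, weight in tweets:
        daily[day] += sentiment * weight
    # Divide by the largest absolute daily sum (guarding against all zeros).
    peak = max(abs(v) for v in daily.values()) or 1.0
    return {day: total / peak for day, total in sorted(daily.items())}

print(aggregate(scored_tweets))
# {'2011-03-19': 1.0, '2011-03-20': -1.0, '2011-03-21': 0.5}
```

Running the same function three times with the weight column set to 1, to follower counts, and to PageRank values produces the three comparable series plotted in Figure 2.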
Figure 2: Graphical output of the three weighting schemes for all of the tweets in the database. Green is uniform weighting, yellow is weighted by followers, and blue is weighted by PageRank.

From the opinions of these 9k influential users alone, it is able to generate a nearly identical graph to the one containing all of the data uniformly weighted.

Discussion

This work attempted to evaluate the effectiveness of three different weighting schemes for aggregating tweet sentiment. The three schemes compared were uniform weighting, weighting by indegree (followers), and weighting by PageRank. Visual inspection of the resulting sentiment graphs suggests that the weighting scheme does not make a substantial difference. However, the PageRank weighting scheme requires only a fraction of the opinions that the uniform weighting scheme requires. This suggests that perhaps only a core set of influential users needs to be tracked (as opposed to all users) to recognize the trends that emerge from the uniformly weighted sentiment. It is still not clear whether this holds in all situations; it may be the case that an influential user is opposed by a large number of non-influential users.

My initial hypothesis was that the PageRank- and follower-weighted sentiment would show trends that the uniform weighting might not. Based on visual inspection, this hypothesis appears to have been invalidated for the PageRank-weighted sentiment, as both graphs recognize similar trends. For the follower-weighted sentiment, I cannot draw any meaningful conclusions without all of the relevant follower counts. I will need to finish collecting follower counts, collect more data sets, perform more tests to validate these initial findings, and compare the resulting graphs against objective measures of sentiment.

There were a number of factors that could potentially change these results. I only trained my sentiment predictor (SVM) on 100 hand-labeled tweets; with more training data the sentiment prediction would be more accurate, potentially changing the results. Additionally, I chose to bin the tweets by day, and a different binning granularity (minutes, hours, weeks, months, etc.) could also change the results. I was unable to finish collecting all of the follower counts, invalidating any conclusions drawn from the follower graph. Lastly, if I narrowed the tweets by topic instead of looking at all of the tweets in the database, there might be more variation between the weighting schemes.

Overall, I think that this project has been a good start towards finding some insightful results about the aggregation of sentiment. I am planning to continue this work and to collect enough tweets to perform comparisons against objective measures (stocks, poll data, etc.). Also, the technology used in this project would be commercially useful for marketers and advertisers, and I have plans to commercialize my work.

References

[1] J. Bollen, B. Goncalves, G. Ruan, and H. Mao. Happiness is assortative in online social networks. Artificial Life, (Early Access):1-15, 2011.
[2] J. Bollen, H. Mao, and X. Zeng. Twitter mood predicts the stock market. Journal of Computational Science, 2011.
[3] J. Bollen, A. Pepe, and H. Mao. Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena. In Proc. of WWW 2009 Conference, 2009.
[4] M. Cha, H. Haddadi, F. Benevenuto, and K.P. Gummadi. Measuring user influence in Twitter: The million follower fallacy. In 4th International AAAI Conference on Weblogs and Social Media (ICWSM), 2010.
[5] J. Chung and E. Mustafaraj. Can collective sentiment expressed on Twitter predict
political elections? In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
[6] L.R. Flynn, R.E. Goldsmith, and J.K. Eastman. Opinion leaders and opinion seekers: Two new measurement scales. Journal of the Academy of Marketing Science, 24(2):137-147, 1996.
[7] E. Gilbert and K. Karahalios. Widespread worry and the stock market. In Proceedings of the International Conference on Weblogs and Social Media, 2010.
[8] M. Lidman. Social media as a leading indicator of markets and predictor of voting patterns. 2011.
[9] C. Meador and J. Gluck. Analyzing the relationship between tweets, box-office performance, and stocks.
[10] P.T. Metaxas, E. Mustafaraj, and D. Gayo-Avello. How (not) to predict elections.
[11] B. O'Connor, R. Balasubramanyan, B.R. Routledge, and N.A. Smith. From tweets to polls: Linking text sentiment to public opinion time series. In Proceedings of the International AAAI Conference on Weblogs and Social Media, pages 122-129, 2010.
[12] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135, 2008.
[13] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79-86, 2002.
[14] J. Ritterman, M. Osborne, and E. Klein. Using prediction markets and Twitter to predict a swine flu pandemic. In 1st International Workshop on Mining Social Media, 2009.
[15] D.M. Romero, W. Galuba, S. Asur, and B.A. Huberman. Influence and passivity in social media. In Proceedings of the 20th International Conference Companion on World Wide Web, pages 113-114. ACM, 2011.
[16] T.O. Sprenger. Tweettrader.net: Leveraging crowd wisdom in a stock microblogging forum. 2011.
[17] Maxwell Strachan. Hedge fund bets $40 million that Twitter can predict the stock market. Huffington Post. http://www.huffingtonpost.com/2011/03/21/hedge-fund-twitter-stock-market_n_838497.html.
[18] A. Tumasjan, T.O. Sprenger, P.G. Sandner, and I.M. Welpe. Predicting elections with Twitter: What 140 characters reveal about political sentiment. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, pages 178-185, 2010.
[19] M.S.A. Wolfram. Modeling the stock market using Twitter.
[20] A. Yi. Stock market prediction based on public attentions: a social web mining approach, 2009.
[21] W. Zhang and S. Skiena. Trading strategies to exploit blog and news sentiment. In The 4th International AAAI Conference on Weblogs and Social Media, 2010.
[22] X. Zhang, H. Fuehres, and P. Gloor. Predicting stock market indicators through Twitter "I hope it is not as bad as I fear". In COIN Collaborative Innovations Networks Conference, pages 1-8, 2010.