Text Mining for Sentiment Analysis of Twitter Data

Shruti Wakade, Chandra Shekar, Kathy J. Liszka and Chien-Chung Chan
The University of Akron, Department of Computer Science
liszka@uakron.edu, chan@uakron.edu

Abstract

Text messages express the states of mind of a large population. From the perspective of decision makers, this collection of messages provides a valuable source of information. In this paper, we present the use of Weka data mining tools to extract useful information for classifying the sentiment of tweets collected from Twitter. The results of tweet mining are represented as decision trees that can be used for judging the sentiment of new tweets. We introduce a new method for preprocessing tweets for decision tree learning, and we evaluate the impact of tweets containing emoticons on the classification process. The method is applied to perform sentiment analysis on tweets related to iphone and Microsoft. Experimental results show that decision tree classifiers outperformed the naïve Bayes algorithm.

Keywords: sentiment analysis, text mining, Twitter, microblogging, decision trees

1. Introduction

Billions of dollars are spent worldwide each year on market analysis. Data-driven decisions are a powerful and necessary method of conducting business. Imagine how useful it would be for a company to know how its products are viewed in the market, or for a political candidate to know how to leverage their public image in their campaign, without surveying people directly. One way to accomplish this is by collecting public sentiment on Internet microblogging sites such as Twitter (http://twitter.com), Tumblr (http://www.tumbl.com), Plurk (http://www.plurk.com), Pownce (http://pownce.com), and Jaiku (http://jaiku.com). These are the top five social networking forums that provide a quick and easy means for people to express themselves while creating a valuable pool of data for those who are interested in those opinions.
Messages that users create are saved in their personal profile and forwarded to others in their circle of friends. The information may be kept private among the list, or made public and unrestricted. Opinion mining, sentiment analysis, and subjectivity analysis are related fields sharing the common goal of developing and applying computational techniques to process collections of opinionated texts or reviews. Other research goals are to generate heuristics or tools that can be used to classify, rank, or summarize sentiments toward certain objects, events, or topics. For example, these tools can be used to determine a thumbs-up or thumbs-down vote for specific movies from their reviews, or to predict favorable or unfavorable reception of certain products or events. In this paper, we look specifically at Twitter data, called tweets, to perform clustering and sentiment analysis. Tweets are limited to 140 characters. Figure 1 shows an actual tweet taken from Twitter. This type of cyber-communication is commonly called microblogging. Sentiment analysis is a field of research that determines whether there is a favorable or non-favorable reaction in text.

Figure 1. Example tweet.

Our approach is to use the Weka [1] data mining software with a positive and negative word set and compare it to a second word set provided by Twitrratr [14]. We are interested in the impact of emoticons added to both of these sets. In section two, we discuss previous research in the field of sentiment analysis on text. Section three presents the problem statement and setup. In section four, we describe the preprocessing steps performed on the data and the
feature selection used. Section five presents the experiments. Section six contains a discussion of the results, and we conclude in section seven.

2. Previous Work

There is a small but growing body of research specifically on opinion mining from microblogging data. Kim et al. give a compelling case for using Twitter lists as a corpus in sentiment analysis [2]. In this context, lists are groups of people who share a common interest, such as music. They show that even though tweets are brief, they contain enough information to express identifiable characteristics, interests, and sentiments. The seminal work by Pang et al. shows that machine learning is a viable tool for sentiment analysis, using movie reviews as a corpus [3]. They apply three standard machine learning algorithms: Naïve Bayes, maximum entropy (MaxEnt), and support vector machines (SVMs). Their positive and negative word lists were relatively small, from five to eleven words in different experiments, but nonetheless the results are good. More notably, they bring to light the difficulty of the task compared to topic-based classification. The work of Go et al. is very similar to Pang's in using the same three classifiers, but microblogging data from Twitter is used as opposed to the longer movie-review texts [4]. The results are remarkably similar, showing promise that these tools for sentiment analysis cross the boundary from longer text blocks to 140-character tweets. That research excludes neutral sentiments from the corpora. Only positive and negative tweets are collected, mined through queries in the Twitter search utility using common emoticons. Once collected, the emoticons are removed from the tweets before training the classifiers. Manually collected test data retains emoticons, if present. Pak and Paroubek [5] collect data from Twitter, filter it, and then classify it as positive or negative by the use of popular emoticons (smiley faces, sad faces, and variations).
Neutral tweets are collected from newspaper accounts to round out the corpora. An analysis indicates the distribution of word frequencies in the collection is normal. They apply a Naïve Bayes classifier to test the posts. Their best results are those experiments using bigrams. This is contrary to the findings of Pang, but may easily be explained by the very nature of the differing corpora. Movie reviews may contain more words, and users may take more time to think about their post, whereas tweeters tend to give lightning-quick, brief snapshots of a thought sent from a cell phone or other small device. In fact, one very interesting observation that this paper makes is the amount of slang and frequent misspellings used in tweets. This may have minor effects on any opinion analysis applied to microblogging data. Read performs sentiment analysis on Usenet group data and movie reviews using the Naïve Bayes and SVM classifiers [6]. His corpus is created using emoticons to identify positive and negative texts. No neutral or objective texts are included in either the training or testing data sets. Read also looks at topic, domain, and temporal dependency classifications. To summarize, research parameters tend to be grouped as follows:

- Classifier used: Naïve Bayes, maximum entropy, support vector machine
- Text blocks versus microblogging data
- Positive/negative word list source and size
- Use of neutral/objective data: in the training data set, in the testing data set
- Use of emoticons: in the training data set, in the testing data set
- Use of unigrams, bigrams, or both
- Use of word presence versus word frequency

3. Problem Formulation

Sentiment analysis can be viewed as an application of text categorization, which dates back to the work on probabilistic text classification by Maron [7]. The main task of text classification is how to label texts with a predefined set of categories.
Text categorization has been applied in other areas such as document indexing, document filtering, and word sense disambiguation, as surveyed in Sebastiani [8]. One of the central issues in text classification is how to represent the content of a text in order to facilitate effective classification. From research on information retrieval systems, one of the most popular and successful methods is to represent a text by the collection of terms that appear in it. The similarity between documents is defined using the term frequency-inverse document frequency (tf-idf) measure [9]. In this approach, the terms or features used to represent a text are determined by taking the union of all terms that appear in the collection of texts used to derive the classifier. This usually results in a large number of features. Therefore, dimensionality reduction is a related issue that needs to be addressed. The problem we consider in this paper is as follows. Given a collection of tweets related to a specific subject,
how do we come up with a classifier for labeling the sentiment of new tweets as positive, negative, or neutral? We start by collecting related tweets using a query containing words or a phrase denoting the subject of interest. Since tweets may belong to multiple subjects, the inclusion of a tweet in a specific subject is not necessarily certain. In this work, we do not consider fuzzy membership. In order to apply data mining tools to generate a classifier, we need to determine a list of features to represent tweets and assign a sentiment label to each tweet. Instead of using all terms that appear in the collected tweets, we have adopted a list of positive and negative words, together with two emoticon features indicating whether a positive or a negative emoticon is present, to form the list of features. This is, in general, a much smaller set of features than using the unigram representation. We use three values for sentiment, determined by combining the sentiment values derived from the following two factors: (1) the frequency counts of positive and negative words; (2) the presence of a positive or negative emoticon. If the count of positive words is greater than that of negative words, then factor (1) has value 1; if less, its value is -1; and it has value 0 for a tie. Factor (2) has value 1 when only a positive emoticon is present, value -1 when only a negative emoticon is present, and value 0 otherwise. The final sentiment value for a tweet is determined by summing the values of factors (1) and (2), which is then mapped into one of the three possible values: positive, negative, or neutral. Table 1 contains an example of each sentiment for the iphone.

Table 1. Example iphone-related tweets for each sentiment.

  Sentiment   Tweet
  positive    iphone junkie lots talk i'm :)
  negative    Anyone else frustrated MMS experience iphone? Logging slow buggy ATT website... Seems un-apple like.
  neutral     Ok help here, buy phone, choices are: G1, iphone, BB Storm, BB Bold.
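The two-factor labeling strategy just described can be sketched in a few lines of Python. This is an illustrative sketch only: the tiny word and emoticon sets below are stand-ins for the full lists introduced in Section 4.

```python
# Sketch of the two-factor sentiment labeling strategy described above.
# The small word sets are placeholders for the full Winspiration/EQI lists.
POSITIVE_WORDS = {"beautiful", "easy", "popular"}
NEGATIVE_WORDS = {"fragile", "grumpy", "stressed"}
POSITIVE_EMOTICONS = {":)", ":-)", ": )", ":D", "=)", ";-)", ";)"}
NEGATIVE_EMOTICONS = {":(", ":-(", ": ("}

def label_tweet(tweet: str) -> str:
    tokens = tweet.lower().split()
    # Factor (1): compare counts of positive vs. negative words.
    pos = sum(t in POSITIVE_WORDS for t in tokens)
    neg = sum(t in NEGATIVE_WORDS for t in tokens)
    f1 = 1 if pos > neg else (-1 if neg > pos else 0)
    # Factor (2): presence of only a positive or only a negative emoticon.
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    f2 = 1 if has_pos and not has_neg else (-1 if has_neg and not has_pos else 0)
    # Sum the two factors and map to one of the three labels.
    total = f1 + f2
    return "positive" if total > 0 else ("negative" if total < 0 else "neutral")
```

A tweet with no sentiment words and no emoticons sums to 0 and is labeled neutral, which matches the treatment of such tweets noted in the Discussion section.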
We use the Weka data mining program J48 [10] to generate a decision tree from the labeled training set. A decision tree is a symbolic classifier with two advantages: first, it can further reduce the features to be included in the tree, and second, the tree structure can provide a different form of summary for the sentiments derived from the training set.

4. Methodology for Sentiment Classification

The following steps were applied in text mining Twitter data for our sentiment analysis.

4.1 Data collection

We used a publicly available dataset as our sample space, provided for research purposes under a Creative Commons license by Choudhury et al. [11]. This data set contains more than 10.5 million tweets collected from over 200,000 users in the time period from 2006 through 2009. As subjects of interest, we use iphone and Microsoft as query terms to retrieve tweets from the raw data. The iphone corpus contains 18,548 related tweets. The Microsoft corpus consists of 14,547 related tweets.

4.2 Data preprocessing

We took several steps to preprocess the data and clean the tweets. First was the removal of stop words. These are words commonly filtered out when doing any type of text processing. In our data, we mainly removed prepositions and pronouns, along with words such as been, have, is, being, and so forth. They can easily be removed without affecting the sentiment of the message, as they do not convey any positive or negative meaning. It is common to find URLs in tweets, as people often share interesting links with friends. The next preprocessing task was to identify hyperlinks in the text and replace them with the tag URL. Symbols were also removed, except for those that make up the set of emoticons listed in Table 2. Stemming is the process of reducing a word to its root form. For example, the words read, reader, readers, and reading all reduce to the root word read. We used the Snowball stemmer available as part of the Weka [1] software.
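The preprocessing steps above (URL tagging, symbol stripping with emoticons preserved, stop-word removal, stemming) can be sketched as follows. This is a hypothetical Python illustration; the stop-word set is a small stand-in, and the crude suffix stripper substitutes for the Snowball stemmer the paper actually uses.

```python
import re

# Minimal stand-ins for the full stop-word list and the Snowball stemmer.
STOP_WORDS = {"been", "have", "is", "being", "the", "a", "an", "to", "of", "at", "in"}
EMOTICONS = {":)", ":-)", ": )", ":D", "=)", ";-)", ";)", ":(", ":-(", ": ("}

def preprocess(tweet: str) -> list[str]:
    # Replace hyperlinks with the tag URL.
    tweet = re.sub(r"https?://\S+", "URL", tweet)
    out = []
    for tok in tweet.split():
        if tok == "URL" or tok in EMOTICONS:
            out.append(tok)                       # keep URL tag and emoticons as-is
            continue
        word = re.sub(r"[^\w]", "", tok).lower()  # strip remaining symbols
        if not word or word in STOP_WORDS:
            continue                              # drop stop words
        if len(word) > 4:                         # crude stemming stand-in
            word = re.sub(r"(ing|ers|er|s)$", "", word)
        out.append(word)
    return out
```

For example, `preprocess("Reading reviews at http://example.com :)")` yields `["read", "review", "URL", ":)"]` under the small stop-word set above.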
4.3 Feature Determination

We use the following features to represent tweets in our experiments. A list of 931 positive words was downloaded from Winspiration [12]. Example words in this list are beautiful, easy, and popular. A list of 1838 negative words was downloaded from EQI [13], a web site with resources related to emotional intelligence. Example negative words from this list are fragile, grumpy, and stressed. The set of emoticons we used is listed in Table
2. Positive emoticons are collectively represented as a feature named C+, and negative emoticons are collectively represented as a feature named C-. For comparison purposes, a set of 129 positive and 144 negative words compatible with those provided by Twitrratr [14] was downloaded from that web site. These lists contain emoticons, which were removed.

Table 2. Set of emoticons.

  Positive emoticons: :) :-) : ) :D =) ;-) ;)
  Negative emoticons: :( :-( : (

In addition, we looked at the frequency distribution of sentiment words among the subject-related tweets. Many words have a frequency count of less than two. Therefore, we apply a threshold of two to further reduce the features. As a result, the word list from EQI and Winspiration has been reduced from 2769 to 59 words, and the list from Twitrratr has been reduced from 273 to 30 words.

4.4 Sentiment labeling

We have created four training sets with combinations of two sets of sentiment words (EQI and Winspiration as one set, Twitrratr as the other) and inclusion or exclusion of emoticons. Training tweets are labeled by a Java program that implements the labeling strategy described in Section 3.

5. Experimental Results

We used the Weka data mining tools for our experiments. For each of the four combinations, we create an independent testing set by randomly selecting 20% of the labeled tweets collected. The remaining 80% is used for creating classifiers using Weka's J48 and Naïve Bayes algorithms. The validation is done by 10-fold cross-validation. Default parameters are used for both learning algorithms. The experimental results for iphone-related tweets are shown in Table 3. The first training set, denoted by T1-1, uses 59 out of the 2769 words downloaded from Refs. [12] and [13] as its features. Features used in the second training set, denoted T1-2, consist of those in T1-1 plus the two emoticon categories.
Similarly, features of the third training set T2-1 consist of only the Twitrratr-compatible word list downloaded from Ref. [14], using 30 out of the 273 words. Likewise, the fourth training set T2-2 includes the two emoticon categories. The values of the receiver operating characteristic (ROC) areas are all excellent, and most of the F-measures are excellent as well, as shown in Table 3. In this case, the table shows that the decision tree based algorithm J48 outperforms the Naïve Bayes algorithm. In addition, the use of the emoticon categories as features has a negative effect on J48 learning, while they provide a slight improvement for Naïve Bayes learning. The use of a large feature set has a negative impact on the Bayes algorithm, but it seems to have no impact on J48.

Table 3. Performance measures for iphone-related tweets analysis.

                      Accuracy        F-Measure       ROC Area
                      J48     NB      J48     NB      J48     NB
  T1-1                98.05   84.73   0.98    0.84    0.98    0.996
  T1-2 (Emoticons)    97.87   85.22   0.98    0.84    0.98    0.995
  T2-1                98.03   95.19   0.98    0.95    0.93    0.999
  T2-2 (Emoticons)    98.03   95.41   0.98    0.95    0.94    0.999

The experimental results on Microsoft-related tweets are shown in Table 4. We have similar results as in the case of iphone-related tweets. The J48 algorithm outperformed the Naïve Bayes algorithm in all cases. Again, the use of emoticons as features does not improve performance. Instead, we see a slight negative impact in all cases.

Table 4. Performance measures for Microsoft-related tweets analysis.

                      Accuracy        F-Measure       ROC Area
                      J48     NB      J48     NB      J48     NB
  T1-1                97.56   85.61   0.98    0.84    0.964   0.998
  T1-2 (Emoticons)    97.49   84.96   0.97    0.83    0.964   0.998
  T2-1                97.62   95.94   0.97    0.95    0.859   1
  T2-2 (Emoticons)    97.56   95.87   0.97    0.96    0.87    0.951
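The accuracy and F-measure values reported in Tables 3 and 4 follow standard definitions, which can be computed directly from a classifier's predictions. The sketch below is illustrative; the macro-averaging over the three sentiment labels is our assumption (Weka reports per-class and weighted averages).

```python
# Compute accuracy and F-measure over the three sentiment labels.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f_measure(y_true, y_pred, label):
    # True positives, predicted-positive count, and actual-positive count.
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    pred_pos = sum(p == label for p in y_pred)
    actual_pos = sum(t == label for t in y_true)
    if tp == 0:
        return 0.0
    precision = tp / pred_pos
    recall = tp / actual_pos
    return 2 * precision * recall / (precision + recall)

def macro_f(y_true, y_pred, labels=("positive", "negative", "neutral")):
    # Unweighted average of per-label F-measures (an assumption, not Weka's default).
    return sum(f_measure(y_true, y_pred, l) for l in labels) / len(labels)
```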
6. Discussion

The use of Internet slang must be addressed in any work involving microblogging data. The original motivation for users to create these abbreviations was to reduce keystrokes. Texting on cell phones made this form of writing even more pervasive. In some cases, this has grown into social cultures with distinct dialects (e.g., leet, netspeak, chatspeak) rather than a timesaving utility. In our case, we observe that the words or phrases used in tweets may include many abbreviated words such as abt (about), afaik (as far as I know), alol (actual laugh out loud), and so forth. This may cause missed matches with words or phrases that appear on the positive and negative word lists. To evaluate the impact of irregular expressions in tweets on our strategy of tweet labeling, we compiled our own list of 500 abbreviated words from personal observation and various web sites. We observed that the overlap between this list and the positive and negative word lists used in our experiments is small. Therefore, the impact is minimal, which is confirmed by our experiments on the iphone-related tweets, where the hit rate of positive words versus negative words remains quite similar with and without substitution of abbreviated words or phrases. Thus, it does not affect the result of labeling tweets based on a sentiment word list. However, the excessive amount of abbreviated words in tweets may need to be dealt with in other types of tweet analysis. We also note that some emoticons may be neutral, for example (\_/) indicating bunny ears or 0w0 meaning nondescript. We do not include these or use them as indicators of a neutral tweet. This is a possible addition to future work on tweet sentiment analysis, since microblogging use and strategies are constantly evolving.
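The abbreviation substitution and overlap check described earlier in this section can be sketched directly. This is a hypothetical illustration: the three-entry dictionary below is a stand-in for the 500-word list compiled by the authors.

```python
# Illustrative sketch of abbreviation expansion and the overlap check.
# The tiny dictionary is a placeholder for the full 500-entry slang list.
ABBREVIATIONS = {
    "abt": "about",
    "afaik": "as far as i know",
    "alol": "actual laugh out loud",
}

def expand_abbreviations(tweet: str) -> str:
    """Replace known abbreviations with their full forms."""
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tweet.lower().split())

def overlap(abbreviation_list, sentiment_words):
    """Words in both lists; a small overlap implies minimal labeling impact."""
    return set(abbreviation_list) & set(sentiment_words)
```

An empty or near-empty overlap between the slang list and the sentiment word lists is what justifies skipping substitution before labeling.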
We speculate on the high accuracies obtained by using the decision tree approach for classifying tweets, in contrast to previous results using Naïve Bayes or support vector machine (SVM) classifiers based on different feature representation schemes for tweets. There are three possible factors: (1) we use single subject-related tweets in our experiments for training J48; (2) we use three values for sentiment: positive, neutral, and negative; (3) we use sentiment words as features to represent tweets, thus reducing the impact of the curse of dimensionality. From our experiments, we observe that a large number of tweets do not contain any sentiment words. Therefore, they are classified as neutral in our strategy. This indicates the importance of including a neutral label in sentiment analysis. The high performance obtained by J48 in classifying single subject-related tweets suggests that integrating the document filtering techniques described in Refs. [15], [16], and [17] may be worthwhile. This may lead to the development of even more effective systems for tweet analysis. A collection of tweets can be sorted into different categories or subjects by first applying document filtering algorithms, followed by single subject-related tweet analysis. The use of sentiment words as features for representing tweets seems to be quite effective from our experiments. It is reasonable to think that the list we used happens to contain a large enough number of typical sentiment words. Thus, the availability of an effective list of words is an important factor for the success of our approach. It is possible that our approach can be further enhanced by integrating more sophisticated feature selection functions, such as those taking into account local context [18], using the DIA association factor [19], making use of the distribution of multi-words [20], or considering different similarity measures [21].
In addition to decision tree learning programs, there are other data mining and knowledge discovery tools [22, 23] that may be used to generate and present the results of tweet analysis.

7. Conclusions

In this paper, we have presented the process of applying Weka data mining tools to generate decision trees for classifying the sentiment of tweets. We introduced the idea of using a list of sentiment words plus emoticons as features to represent and to label tweets for training data. We also include a neutral classification of tweets in our corpus. Experiments on iphone- and Microsoft-related tweets show that decision tree classifiers outperform naïve Bayes ones using our approach. In addition, it appears that including emoticons as features has a slightly negative impact on the performance of decision tree based classifiers; the impact on the naïve Bayes classifiers is mixed. Our experiments also show that dimension reduction is critical to the performance of naïve Bayes classifiers. Based on our approach and experimental results, we observe that the integration of document filtering and document indexing techniques with our approach may provide one viable way to develop effective systems for tweet analysis. Our future work includes
application of our approach to tweet analysis based on different data mining tools.

8. References

[1] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. Witten, The WEKA Data Mining Software: An Update, SIGKDD Explorations, Vol. 11, No. 1, 2009.
[2] D. Kim, Y. Jo, I-C. Moon, and A. Oh, Analysis of Twitter Lists as a Potential Source for Discovering Latent Characteristics of Users, Workshop on Microblogging at the ACM Conference on Human Factors in Computer Systems (CHI 2010).
[3] B. Pang, L. Lee, and S. Vaithyanathan, Thumbs up? Sentiment Classification using Machine Learning Techniques, Proc. of the Conf. on Empirical Methods in Natural Language Processing (EMNLP), July 2002, pp. 79-86.
[4] A. Go, R. Bhayani, and L. Huang, Twitter Sentiment Classification using Distant Supervision, Proc. of the 4th International Conf. on Computer and Information Technology (CIT2004), pp. 1147-1152.
[5] A. Pak and P. Paroubek, Twitter as a Corpus for Sentiment Analysis and Opinion Mining, Proc. of the Seventh Conf. on International Language Resources and Evaluation (LREC'10), May 2010.
[6] J. Read, Using Emoticons to Reduce Dependency in Machine Learning Techniques for Sentiment Classification, Proc. of ACL-05, 43rd Meeting of the Association for Computational Linguistics, 2005.
[7] M. Maron, Automatic Indexing: an Experimental Inquiry, J. Assoc. Comput. Mach., Vol. 8, No. 3, 404-417, 1961.
[8] F. Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No. 1, 1-47, March 2002.
[9] G. Salton, A. Wong, and C. Yang, A Vector Space Model for Automatic Indexing, Communications of the ACM, Vol. 18, No. 11, 613-620, 1975.
[10] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, Morgan Kaufmann, 2005. ISBN 0120884070, 9780120884070.
[11] M. D. Choudhury, Y.-R. Lin, H. Sundaram, K. S. Candan, L. Xie, and A. Kelliher, How Does the Sampling Strategy Impact the Discovery of Information Diffusion in Social Media? Proc.
of the 4th Int'l AAAI Conference on Weblogs and Social Media, George Washington University, Washington, DC, May 23-26, 2010.
[12] Positive words download: http://www.winspiration.co.uk/positive.htm
[13] Negative words download: http://eqi.org/fw_neg.htm
[14] Twitter compatible positive and negative word list: http://www.twitrratr.com
[15] N. J. Belkin and W. B. Croft, Information filtering and information retrieval: two sides of the same coin? Communications of the ACM, Vol. 35, No. 12, 29-38, 1992.
[16] D. D. Lewis, The TREC-4 filtering track: description and analysis, Proceedings of TREC-4, the 4th Text Retrieval Conference, Gaithersburg, MD, 165-180, 1995.
[17] Y.-H. Kim, S.-Y. Hahn, and B.-T. Zhang, Text filtering by boosting naïve Bayes classifiers, Proceedings of SIGIR-00, 23rd ACM International Conf. on Research and Development in Information Retrieval, Athens, Greece, 168-175, 2000.
[18] T. J. Siddiqui and U. S. Tiwary, Utilizing local context for effective information retrieval, International Journal of Information Technology and Decision Making, Vol. 7, Issue 1, 5-21, 2008, DOI: 10.1142/S0219622008002788.
[19] N. Fuhr and C. Buckley, A probabilistic learning approach for document indexing, ACM Transactions on Information Systems, Vol. 9, No. 3, 223-248, 1991.
[20] W. Zhang, T. Yoshida, and X. Tang, Distribution of multi-words in Chinese and English documents, International Journal of Information Technology and Decision Making, Vol. 8, Issue 2, 249-265, 2009, DOI: 10.1142/S0219622009003399.
[21] E. Atlam, A new approach for text similarity using articles, International Journal of Information Technology and Decision Making, Vol. 7, Issue 1, 23-34, 2008, DOI: 10.1142/S021962200800279X.
[22] Y. Peng, G. Kou, Y. Shi, and Z. Chen, A descriptive framework for the field of data mining and knowledge discovery, International Journal of Information Technology and Decision Making, Vol. 7, Issue 4, 639-682, 2008, DOI: 10.1142/S0219622008003204.
[23] Q. Zhang and R.
Segall, Web mining: a survey of current research, techniques, and software, International Journal of Information Technology and Decision Making, Vol. 7, Issue 4, 683-720, 2008, DOI: 10.1142/S0219622008003150.