Text Mining for Sentiment Analysis of Twitter Data




Shruti Wakade, Chandra Shekar, Kathy J. Liszka and Chien-Chung Chan
The University of Akron, Department of Computer Science
liszka@uakron.edu, chan@uakron.edu

Abstract

Text messages express the state of mind of a large population. From the perspective of decision makers, this collection of messages provides a precious source of information. In this paper, we present the use of Weka data mining tools to extract useful information for classifying the sentiment of tweets collected from Twitter. The results of tweet mining are represented as decision trees that can be used for judging the sentiment of new tweets. We introduce a new method for preprocessing tweets for decision tree learning. We evaluate the impact of tweets containing emoticons on the classification process. The method is applied to perform sentiment analysis on tweets related to iphone and Microsoft. Experimental results show that decision tree classifiers outperformed the naïve Bayes algorithm.

Keywords: geometric tiling, minimal covering sets, wireless sensor networks

1. Introduction

Billions of dollars are spent worldwide each year on market analysis. Data-driven decisions are a powerful and necessary method of conducting business. Imagine how useful it would be for a company to know how its products are viewed in the market, or for a political candidate to know how to leverage their public image in a campaign, without surveying people directly. One way to accomplish this is by collecting public sentiment on Internet microblogging sites such as Twitter (http://twitter.com), Tumblr (http://www.tumbl.com), Plurk (http://www.plurk.com), Pownce (http://pownce.com), and Jaiku (http://jaiku.com). These are the top five social networking forums that provide a quick and easy means for people to express themselves while creating a valuable pool of data for those who are interested in those opinions.
Messages that users create are saved in their personal profile and forwarded to others in their circle of friends. The information may be kept private among the list, or made public and unrestricted.

Opinion mining, sentiment analysis, and subjectivity analysis are related fields sharing the common goal of developing and applying computational techniques to process collections of opinionated texts or reviews. Other research goals are to generate heuristics or tools that can be used to classify, rank, or summarize sentiments toward certain objects, events, or topics. For example, these tools can be used to determine a thumbs up or thumbs down vote for specific movies from their reviews, or to predict whether opinion is for or against certain products or events.

In this paper, we look specifically at Twitter data, called tweets, to perform clustering and sentiment analysis. Tweets are limited to 140 characters. Figure 1 shows an actual tweet taken from Twitter. This type of cyber-communication is commonly called microblogging. Sentiment analysis is a field of research that determines whether a text expresses a favorable or unfavorable reaction.

Figure 1. Example tweet.

Our approach is to use the Weka [1] data mining software with a positive and negative word set and compare it to a second word set provided by Twitter. We are interested in the impact of emoticons added to both of these sets. In section two, we discuss previous research in the field of sentiment analysis on text. Section three presents the problem statement and setup. In section four, we describe the preprocessing steps performed on the data and the feature selection used. Section five presents the experiments. Section six contains discussion of the results and we conclude in section seven.

2. Previous Work

There is a small but growing body of research specifically on opinion mining from microblogging data. Kim et al. give a compelling case for using Twitter lists as a corpus in sentiment analysis [2]. In this context, lists are groups of people who share a common interest such as music. They show that even though tweets are brief, they contain enough information to express identifiable characteristics, interests and sentiments.

The seminal work by Pang et al. shows that machine learning is a viable tool for sentiment analysis, using movie reviews as a corpus [3]. They apply three standard machine learning algorithms: Naïve Bayes, maximum entropy (MaxEnt), and support vector machines (SVMs). Their positive and negative word lists were relatively small, from five to eleven words in different experiments, but the results are nonetheless good. More notably, they bring to light the difficulty of the task compared to topic-based classification.

The work of Go et al. is very similar to that of Pang et al. in using the same three classifiers, but microblogging data from Twitter is used as opposed to the longer text of movie reviews [4]. The results are remarkably similar, showing promise that these tools for sentiment analysis carry over from longer text blocks to tweets restricted to 140 characters. That work excludes neutral sentiments from the corpora. Only positive and negative tweets are collected, mined through queries in the Twitter search utility using common emoticons. Once collected, the emoticons are removed from the tweets before training the classifiers. Manually collected test data retains emoticons, if present.

Pak and Paroubek [5] collect data from Twitter, filter it, and then classify it as positive or negative by the use of popular emoticons (smiley faces, sad faces, and variations).
Neutral tweets are collected from newspaper accounts to round out the corpora. An analysis indicates that the distribution of word frequencies in the collection is normal. They apply a Naïve Bayes classifier to test the posts. Their best results come from the experiments using bigrams. This is contrary to the findings of Pang et al., but may easily be explained by the very nature of the differing corpora. Movie reviews may contain more words, and users may take more time to think about their post, whereas tweeters tend to give lightning-quick, brief snapshots of a thought sent from a cell phone or other small device. In fact, one very interesting observation that this paper makes is the amount of slang and the frequent misspellings in tweets. This may have minor effects on any opinion analysis applied to microblogging data.

Read performs sentiment analysis on Usenet group data and movie reviews. He uses the Naïve Bayes and SVM classifiers [6]. His corpus is created using emoticons to identify positive and negative texts. No neutral or objective texts are included in either the training or testing data sets. Read also looks at topic, domain, and temporal dependency classifications.

To summarize, research parameters tend to be grouped as follows:

- Classifier used
  - Naïve Bayes
  - Maximum Entropy
  - Support Vector Machine
- Text blocks versus microblogging data
- Positive/negative word list source and size
- Use of neutral/objective data
  - In the training data set
  - In the testing data set
- Use of emoticons
  - In the training data set
  - In the testing data set
- Use of unigrams, bigrams, or both
- Use of word presence versus word frequency

3. Problem Formulation

Sentiment analysis can be viewed as an application of text categorization, which dates back to the work on probabilistic text classification by Maron [7]. The main task of text classification is to label texts with a predefined set of categories.
Text categorization has been applied in other areas such as document indexing, document filtering, and word sense disambiguation, as surveyed by Sebastiani [8]. One of the central issues in text classification is how to represent the content of a text in order to facilitate effective classification. In information retrieval research, one of the most popular and successful methods is to represent a text by the collection of terms that appear in it. The similarity between documents is defined by using the term frequency inverse document frequency (tfidf) measure [9]. In this approach, the terms or features used to represent a text are determined by taking the union of all terms that appear in the collection of texts used to derive the classifier. This usually results in a large number of features. Therefore, dimensionality reduction is a related issue that needs to be addressed.

The problem we consider in this paper is as follows. Given a collection of tweets related to a specific subject, how do we come up with a classifier for labeling the sentiment of new tweets as positive, negative, or neutral? We start by collecting related tweets using a query containing words or a phrase denoting the subject of interest. Since tweets may belong to multiple subjects, the inclusion of a tweet in a specific subject is not necessarily certain. In this work, we do not consider fuzzy membership.

In order to apply data mining tools to generate a classifier, we need to determine a list of features to represent tweets and assign a sentiment label to each tweet. Instead of using all terms that appear in the collected tweets, we have adopted a list of positive and negative words, together with one feature indicating that a positive emoticon is present and one indicating that a negative emoticon is present, to form the list of features. This is, in general, a much smaller set of features than the unigram representation.

We use three values for sentiment, determined by combining the sentiment values derived from the following two factors:

(1) The frequency counts of positive and negative words.
(2) The presence of a positive or negative emoticon.

Factor (1) has value 1 if the count of positive words is greater than the count of negative words, value -1 if it is smaller, and value 0 for a tie. Factor (2) has value 1 when only a positive emoticon is present, value -1 when only a negative emoticon is present, and value 0 otherwise. The final sentiment value for a tweet is determined by summing the values of factors (1) and (2), and the sum is then mapped into one of the three possible values: positive, negative, or neutral. Table 1 contains an example of each sentiment for the iphone.

Table 1. Example iphone-related tweets for each sentiment.

Sentiment   Tweet
positive    iphone junkie lots talk i'm :)
negative    Anyone else frustrated MMS experience iphone? Logging slow buggy ATT website... Seems un-apple like.
neutral     Ok help here, buy phone, choices are: G1, iphone, BB Storm, BB Bold. Chime

We use the Weka data mining program J48 [10] to generate a decision tree from the labeled training set. A decision tree is a symbolic classifier with two advantages: first, it can further reduce the features to be included in the tree, and second, the tree structure can provide a different form of summary for sentiments derived from the training set.

4. Methodology for Sentiment Classification

The following steps were applied for text mining Twitter data for our sentiment analysis.

4.1 Data collection

We used a publicly available dataset for our sample space, provided for research purposes under a Creative Commons license by Choudhury et al. [11]. This data set contains more than 10.5 million tweets collected from over 200,000 users in the time period from 2006 through 2009. As subjects of interest, we use iphone and Microsoft as query terms to retrieve tweets from the raw data. The iphone corpus contains 18,548 related tweets. The Microsoft corpus consists of 14,547 related tweets.

4.2 Data preprocessing

We took several steps to preprocess the data to clean the tweets. First was the removal of stop words. These are words commonly filtered out when doing any type of text processing. In our data, we mainly removed prepositions and pronouns along with words such as been, have, is, being, and so forth. They can easily be removed without affecting the sentiment of the message, as they do not convey any positive or negative meaning.

It is common to find URLs in tweets, as people often share interesting links with friends. The next preprocessing task was to identify hyperlinks in the text and replace them with the tag URL. Symbols were also removed, except for those that make up the set of emoticons listed in Table 2.

Stemming is the process of reducing a word to its root form. For example, the words read, reader, readers, and reading all reduce to the root word read. We used the Snowball stemmer available as part of the Weka [1] software.
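The preprocessing steps of Section 4.2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word set is a small hypothetical excerpt (the full list is not given in the paper), the URL pattern and the token pattern are assumptions, and stemming is omitted because the Snowball stemmer is not part of the Python standard library.

```python
import re

# Hypothetical, abbreviated stop-word list; the paper names only a few examples.
STOP_WORDS = {"been", "have", "is", "being", "in", "on", "at", "of", "to",
              "i", "you", "he", "she", "it", "we", "they"}

# Emoticons from Table 2; these must survive symbol removal.
EMOTICONS = [":)", ":-)", ": )", ":D", "=)", ";-)", ";)", ":(", ":-("]

# Assumed URL pattern: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# A token is either a known emoticon (longest first) or a run of
# letters, digits, and apostrophes. All other symbols are dropped.
TOKEN_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
    + r"|[A-Za-z0-9']+"
)


def preprocess(tweet):
    """Return the cleaned token list for one tweet."""
    # Replace hyperlinks with the tag URL.
    text = URL_PATTERN.sub(" URL ", tweet)
    # Keep words and emoticons, discard remaining symbols.
    tokens = TOKEN_PATTERN.findall(text)
    # Remove stop words (case-insensitive comparison).
    return [t for t in tokens if t.lower() not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("I have been loving the iphone :) http://example.com/x !!!"))
```

Matching emoticons before generic symbol removal is one simple way to honor the exception stated above; a stemming step would follow on the returned tokens.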
4.3 Feature Determination

We use the following features to represent tweets in our experiments. A list of 931 positive words was downloaded from Winspiration [12]. Example words in this list are beautiful, easy, and popular. A list of 1838 negative words was downloaded from EQI [13], a web site with resources related to emotional intelligence. Example negative words from this list are fragile, grumpy, and stressed. The set of emoticons we used is listed in Table 2. Positive emoticons are collectively represented as a feature named C+, and negative emoticons are collectively represented as a feature named C-. For comparison purposes, a set of 129 positive and 144 negative words compatible with those provided by Twitrratr [14] was downloaded from the web site. These lists contained emoticons, which we removed.

Table 2. Set of emoticons.

Positive set of emoticons   :) :-) : ) :D =) ;-) ;)
Negative set of emoticons   :( :-( :(

In addition, we looked at the frequency distribution of sentiment words among the subject-related tweets. Many words have a frequency count that is less than two. Therefore, we apply a threshold of two to further reduce the features. As a result, the word list from EQI and Winspiration has been reduced from 2769 to 59 words, and the list from Twitrratr has been reduced from 273 to 30 words.

4.4 Sentiment labeling

We have created four training sets from combinations of the two sets of sentiment words (EQI and Winspiration as one set, Twitrratr as the other) and the inclusion or exclusion of emoticons. Training tweets are labeled by using a Java program that implements the labeling strategy described in Section 3.

5. Experimental Results

We used the Weka data mining tools for our experiments. For each of the four combinations, we create an independent testing set by randomly selecting 20% of the labeled tweets collected. The remaining 80% is used for creating classifiers using Weka's J48 and Naïve Bayes algorithms. The validation is done by 10-fold cross-validation. Default parameters are used for both learning algorithms.

The experimental results for iphone-related tweets are shown in Table 3. The first training set, denoted by T1-1, uses 59 out of the 2769 words downloaded from Refs. [12, 13] as its features. Features used in the second training set, denoted as T1-2, consist of those in T1-1 plus the two emoticon categories.
Similarly, features of the third training set T2-1 consist of only the Twitter compatible word list downloaded from Ref. [14], using 30 out of 273 words. The fourth training set T2-2 adds the two emoticon categories to those in T2-1.

The values of the receiver operating characteristic (ROC) areas are all excellent, and most of the F-measures are excellent as well, as shown in Table 3. In this case, the table shows that the decision tree based algorithm J48 outperforms the Naïve Bayes algorithm. In addition, the use of the emoticon categories as features has a negative effect on J48 learning, while they provide a slight improvement for Naïve Bayes learning. The use of a large feature set has a negative impact on the Bayes algorithm, but it seems to have no impact on J48.

Table 3. Performance measures for iphone-related tweets analysis.

                    Accuracy (%)      F-Measure       ROC Area
Training set        J48     NB        J48    NB       J48     NB
T1-1                98.05   84.73     0.98   0.84     0.98    0.996
T1-2 (Emoticons)    97.87   85.22     0.98   0.84     0.98    0.995
T2-1                98.03   95.19     0.98   0.95     0.93    0.999
T2-2 (Emoticons)    98.03   95.41     0.98   0.95     0.94    0.999

The experimental results on Microsoft-related tweets are shown in Table 4. The results are similar to those for iphone-related tweets. The J48 algorithm has outperformed the Naïve Bayes algorithm in all cases. Again, the use of emoticons as features does not improve performance. Instead we see a slight negative impact in all cases.

Table 4. Performance measures for Microsoft-related tweets analysis.

                    Accuracy (%)      F-Measure       ROC Area
Training set        J48     NB        J48    NB       J48     NB
T1-1                97.56   85.61     0.98   0.84     0.964   0.998
T1-2 (Emoticons)    97.49   84.96     0.97   0.83     0.964   0.998
T2-1                97.62   95.94     0.97   0.95     0.859   1
T2-2 (Emoticons)    97.56   95.87     0.97   0.96     0.87    0.951
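For reference, the labeling rule of Section 3, which produced the class labels used in these experiments, can be sketched in Python as follows. This is an illustration rather than the authors' Java program. The paper does not state how the summed value is mapped to a label, so the sign-based mapping in `label_tweet` is an assumption, as are the function names and the tiny word sets in the usage example.

```python
# Emoticon sets taken from Table 2.
POSITIVE_EMOTICONS = {":)", ":-)", ": )", ":D", "=)", ";-)", ";)"}
NEGATIVE_EMOTICONS = {":(", ":-("}


def word_factor(tokens, positive_words, negative_words):
    """Factor (1): 1 if positive words outnumber negative, -1 if fewer, 0 for a tie."""
    pos = sum(1 for t in tokens if t in positive_words)
    neg = sum(1 for t in tokens if t in negative_words)
    return (pos > neg) - (pos < neg)


def emoticon_factor(tokens):
    """Factor (2): 1 if only positive emoticons occur, -1 if only negative, else 0."""
    has_pos = any(t in POSITIVE_EMOTICONS for t in tokens)
    has_neg = any(t in NEGATIVE_EMOTICONS for t in tokens)
    if has_pos and not has_neg:
        return 1
    if has_neg and not has_pos:
        return -1
    return 0


def label_tweet(tokens, positive_words, negative_words):
    """Sum both factors and map the result to a label (sign mapping is assumed)."""
    total = word_factor(tokens, positive_words, negative_words) + emoticon_factor(tokens)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    # Hypothetical miniature word lists, for illustration only.
    positive = {"love", "easy"}
    negative = {"slow", "buggy"}
    print(label_tweet(["iphone", "junkie", "love", ":)"], positive, negative))
    print(label_tweet(["slow", "buggy", "website"], positive, negative))
    print(label_tweet(["buy", "phone", "choices"], positive, negative))
```

Under this mapping, a tweet with no sentiment words and no emoticons sums to zero and is labeled neutral.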

6. Discussion

The use of Internet slang must be addressed in any work involving microblogging data. The original motivation for users to create these abbreviations was to reduce keystrokes. Texting on cell phones made this form of writing even more pervasive. In some cases, this has grown into social cultures with different dialects (e.g., leet, netspeak, chatspeak) rather than a timesaving utility. In our case, we observe that the words or phrases used in tweets may include many of these abbreviated words, such as abt (about), afaik (as far as I know), alol (actual laugh out loud), and so forth. This may cause missed matches with words or phrases that appear on the positive and negative word lists.

To evaluate the impact of irregular expressions in tweets on our strategy of tweet labeling, we have compiled our own list of 500 abbreviated words from personal observation and various web sites. We observed that the overlap between this list and the positive and negative word lists used in our experiments is small. Therefore, the impact is minimal, which is confirmed by our experiments on the iphone-related tweets, where the hit rate of positive words versus negative words remains quite similar with and without substitution of abbreviated words or phrases. Thus, it does not affect the result of labeling tweets based on a sentiment word list. However, the excessive amount of abbreviated words in tweets may need to be dealt with in different types of tweet analysis.

We also note that some emoticons may be neutral, for example (\_/) indicating bunny ears or 0w0 meaning nondescript. We do not include these or use them as indicators of a neutral tweet. This is a possible addition to future work on tweet sentiment analysis, since microblogging use and strategies are constantly evolving.
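The abbreviation check described earlier in this section can be sketched in Python as follows. The table is a three-entry excerpt built from the examples quoted above, since the 500-entry list is not published with the paper; the function names and the hit-count comparison are assumptions about how such a check might be run, not the authors' procedure.

```python
# Excerpt only: the three abbreviations named in the text.
ABBREVIATIONS = {
    "abt": "about",
    "afaik": "as far as i know",
    "alol": "actual laugh out loud",
}


def expand_abbreviations(tokens, table=ABBREVIATIONS):
    """Replace each abbreviated token by the words of its expansion."""
    expanded = []
    for token in tokens:
        replacement = table.get(token.lower())
        if replacement is None:
            expanded.append(token)
        else:
            expanded.extend(replacement.split())
    return expanded


def hit_counts(tweets, positive_words, negative_words):
    """Count positive and negative word matches over a list of token lists."""
    pos = neg = 0
    for tokens in tweets:
        pos += sum(1 for t in tokens if t in positive_words)
        neg += sum(1 for t in tokens if t in negative_words)
    return pos, neg


if __name__ == "__main__":
    # Hypothetical miniature corpus and word lists, for illustration only.
    tweets = [["afaik", "iphone", "great"], ["abt", "slow", "website"]]
    positive, negative = {"great"}, {"slow"}
    before = hit_counts(tweets, positive, negative)
    after = hit_counts([expand_abbreviations(t) for t in tweets], positive, negative)
    print(before, after)
```

If the counts before and after expansion are close, as reported above for the iphone-related tweets, substitution has little effect on labels derived from the word lists.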

We speculate on the high accuracies obtained by using the decision tree approach for classifying tweets, in contrast to previous results using Naïve Bayes or Support Vector Machine (SVM) classifiers based on different feature representation schemes for tweets. There are three possible factors:

(1) We use single subject-related tweets in our experiments for training J48.
(2) We use three values for sentiment: positive, neutral, and negative.
(3) We use sentiment words as features to represent tweets, thus reducing the impact of the curse of dimensionality.

From our experiments, we observe that there are a large number of tweets which do not contain any sentiment words. Therefore, they are classified as neutral in our strategy. This indicates the importance of including a neutral label in sentiment analysis.

The high performance obtained by J48 in classifying single subject-related tweets suggests that integrating the document filtering techniques described in Refs. [15], [16], and [17] may lead to the development of even more effective systems for tweet analysis. A collection of tweets can be sorted into different categories or subjects by first applying document filtering algorithms, followed by single subject-related tweet analysis.

The use of sentiment words as features for representing tweets seems to be quite effective in our experiments. It is reasonable to think that the list we used happens to contain a large enough number of typical sentiment words. Thus, the availability of an effective list of words is an important factor for our approach to be successful. It is possible that our approach can be further enhanced by integrating more sophisticated feature selection functions, such as those taking into account local context [18], using the DIA association factor [19], making use of the distribution of multi-words [20], or considering different similarity measures [21].
In addition to decision tree learning programs, there are other data mining and knowledge discovery tools [22, 23] which may be used to generate and present results of tweet analysis.

7. Conclusions

In this paper, we have presented the process of applying Weka data mining tools to generate decision trees for classifying the sentiment of tweets. We introduced the idea of using a list of sentiment words plus emoticons as features to represent and to label tweets for training data. We also include a neutral classification of tweets in our corpus. Experiments on iphone and Microsoft related tweets show that decision tree classifiers outperform naïve Bayes ones using our approach. In addition, it appears that including emoticons as features has a slightly negative impact on the performance of decision tree based classifiers. The impact on the naïve Bayes classifiers is mixed. Our experiments also show that dimension reduction is critical to the performance of naïve Bayes classifiers.

Based on our approach and experimental results, we observe that the integration of document filtering and document indexing techniques with our approach may provide one viable way toward the development of effective systems for tweet analysis. Our future work includes the application of our approach to tweet analysis based on different data mining tools.

8. References

[1] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. Witten, The WEKA Data Mining Software: An Update, SIGKDD Explorations, Vol. 11, No. 1, 2009.
[2] D. Kim, Y. Jo, I-C. Moon, and A. Oh, Analysis of Twitter Lists as a Potential Source for Discovering Latent Characteristics of Users, Workshop on Microblogging at the ACM Conference on Human Factors in Computer Systems (CHI 2010).
[3] B. Pang, L. Lee, and S. Vaithyanathan, Thumbs up? Sentiment Classification using Machine Learning Techniques, Proc. of the Conf. on Empirical Methods in Natural Language Processing (EMNLP), July 2002, pp. 79-86.
[4] A. Go, R. Bhayani, and L. Huang, Twitter Sentiment Classification using Distant Supervision, Proc. of the 4th International Conf. on Computer and Information Technology (CIT2004), pp. 1147-1152.
[5] A. Pak and P. Paroubek, Twitter as a Corpus for Sentiment Analysis and Opinion Mining, Proc. of the Seventh Conf. on International Language Resources and Evaluation (LREC'10), May 2010.
[6] J. Read, Using Emoticons to Reduce Dependency in Machine Learning Techniques for Sentiment Classification, Proc. of ACL-05, 43rd Meeting of the Association for Computational Linguistics, 2005.
[7] M. Maron, Automatic Indexing: an Experimental Inquiry. J. Assoc. Comput. Mach. 8, 3, 404-417, 1961.
[8] F. Sebastiani, Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol. 34, No. 1, 1-47, March 2002.
[9] G. Salton, A. Wong, and C. Yang, A Vector Space Model for Automatic Indexing. Communications of the ACM 18, 11, 613-620, 1975.
[10] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. 2nd edition, Morgan Kaufmann, 2005. ISBN 0120884070, 9780120884070.
[11] M. D. Choudhury, Y.-R. Lin, H. Sundaram, K. S. Candan, L. Xie, and A. Kelliher, How Does the Sampling Strategy Impact the Discovery of Information Diffusion in Social Media? Proc. of the 4th Int'l AAAI Conference on Weblogs and Social Media, George Washington University, Washington, DC, May 23-26, 2010.
[12] Positive words download: http://www.winspiration.co.uk/positive.htm
[13] Negative words download: http://eqi.org/fw_neg.htm
[14] Twitter compatible positive and negative word list: http://www.twitrratr.com
[15] N. J. Belkin and W. B. Croft, Information filtering and information retrieval: two sides of the same coin? Communications of the ACM 35, 12, 29-38, 1992.
[16] D. D. Lewis, The TREC-4 filtering track: description and analysis. Proceedings of TREC-4, the 4th Text Retrieval Conference, Gaithersburg, MD, 165-180, 1995.
[17] Y.-H. Kim, S.-Y. Hahn, and B.-T. Zhang, Text filtering by boosting naïve Bayes classifiers. Proceedings of SIGIR-00, 23rd ACM International Conf. on Research and Development in Information Retrieval, Athens, Greece, 168-175, 2000.
[18] T. J. Siddiqui and U. S. Tiwary, Utilizing local context for effective information retrieval, International Journal of Information Technology and Decision Making, Vol. 7, Issue 1, 5-21, 2008, DOI: 10.1142/S0219622008002788.
[19] N. Fuhr and C. Buckley, A probabilistic learning approach for document indexing, ACM Transactions on Information Systems, 9, 3, 223-248, 1991.
[20] W. Zhang, T. Yoshida, and X. Tang, Distribution of multi-words in Chinese and English documents, International Journal of Information Technology and Decision Making, Vol. 8, Issue 2, 249-265, 2009, DOI: 10.1142/S0219622009003399.
[21] E. Atlam, A new approach for text similarity using articles, International Journal of Information Technology and Decision Making, Vol. 7, Issue 1, 23-34, 2008, DOI: 10.1142/S021962200800279X.
[22] Y. Peng, G. Kou, Y. Shi, and Z. Chen, A descriptive framework for the field of data mining and knowledge discovery, International Journal of Information Technology and Decision Making, Vol. 7, Issue 4, 639-682, 2008, DOI: 10.1142/S0219622008003204.
[23] Q. Zhang and R. Segall, Web mining: a survey of current research, techniques, and software, International Journal of Information Technology and Decision Making, Vol. 7, Issue 4, 683-720, 2008, DOI: 10.1142/S0219622008003150.