Sentiment Analysis on Twitter Data using: Hadoop

Sentiment Analysis on Twitter Data using: Hadoop ABHINANDAN P SHIRAHATTI 1, NEHA PATIL 2, DURGAPPA KUBASAD 3, ARIF MUJAWAR 4 1,2,3,4 Department of CSE, Jain College of Engineering, Belagavi-590014, India Abstract As of now we know present industries and some survey companies are mainly taking decisions by data obtained from web. As we see WWW is a rich collection of data that is mainly in the form of unstructured data from which we can do analysis on those data which is collected on some situation or on a particular thing. In this paper, we are going to talk how effectively sentiment analysis is done on the data which is collected from the Twitter using Flume. Twitter is an online web application which contains rich amount of data that can be a structured, semi-structured and un-structured data. We can collect the data from the twitter by using BIGDATA eco-system using online streaming tool Flume. And doing analysis on Twitter is also difficult due to language that is used for comments. And, coming to analysis there are different types of analysis that can be done on the collected data. So here we are taking sentiment analysis, for this we are using Hive and its queries to give the sentiment data based up on the groups that we have defined in the HQL (Hive Query Language). Here we have categorized this sentiment analysis into 3 groups like tweets that are having positive, moderate and negative comments. Keywords Analysis, BIGDATA, Comment, Flume, Hive, HQL, Sentiment Analysis, Structured, Semi-Structured, Twitter, Tweets, Un-Structured, WWW (Word Wide Web). I. INTRODUCTION From 20th century onwards this WWW has completely changed the way of expressing their views. Present situation is completely they are expressing their thoughts through online blogs, discussion forms and also some online applications like Facebook, Twitter, etc. If we take Twitter as our example nearly 1TB of text data is generating within a week in the form of tweets. So, by this it is understand clearly how this Internet is changing the way of living and style of people. Among these tweets can be categorized by the hash value tags for which they are commenting and posting their tweets. So, now many companies and also the survey companies are using this for doing some analytics such that they can predict the success rate of their product or also they can show the different view from the data that they have collected for analysis. But, to calculate their views is very difficult in a normal way by taking these heavy data that are going to generate day by day. Fig. 1: Describes clearly Apache Hadoop Ecosystem. The above figure shows clearly the different types of ecosystems that are available on Hadoop so, this problem is taking now and can be solved by using BIGDATA Problem as a solution. And if we consider getting the data from Twitter one should use any one programming language to crawl the data from their database or from their web pages. Coming to this problem here we are collecting this data by using BIGDATA online streaming Eco System Tool known as Flume and also the shuffling of data and generating them into structured data in the form of tables can be done by using Apache Hive. II. LITERATURE SURVEY Sentiment analysis has been handled as a Natural Language Processing task at many levels of granularity. Micro blog data like Twitter, on which users post real time reactions to and opinions about everything, poses newer and different challenges. Some of the early and recent results on sentiment analysis of Twitter data are by Go et al. (2009), (Bermingham and Smeaton, 2010) and Pak and Paroubek (2010). Go et al. (2009) use distant learning to acquire sentiment data.they use tweets ending in positive emoticons like :) :-) as positive and negative emoticons like :( :-( as negative. 831

2.1 Existing System III. PROBLEM STATEMENT As we have already discussed about the older way of getting data and also performing the sentiment analysis on those data. Here they are going to use some coding techniques for crawling the data from the twitter where they can extract the data from the Twitter web pages by using some code that may be written either in JAVA, Python etc. For those they are going to download the libraries that are provided by the twitter guys by using this they are crawling the data that we want particularly. After getting raw data they will filter by using some old techniques and also they will find out the positive, negative and moderate words from the list of collected words in a text file. All these words should be collected by us to filter out or do some sentiment analysis on the filtered data. These words can be called as a dictionary set by which they will perform sentiment analysis. Also, after performing all these things and they want to store these in a database and coming to here they can use RDBMS where they are having limitations in creating tables and also accessing the tables effectively. 2.2Proposed System As we have seen existing system drawbacks, here we are going to overcome them by solving this issue using Big Data problem statement. So here we are going to use Hadoop and its Ecosystems, for getting raw data from the Twitter we are using Hadoop online streaming tool using Apache Flume. In this tool only we are going to configure everything that we want to get data from the Twitter. For this we want to set the configuration and also want to define what information that we want to get form Twitter. All these will be saved into our HDFS (Hadoop Distributed File System) in our prescribed format. From this raw data we are going to create the table and filter the information that is needed for us and sort them into the Hive Table. And from that we are going to perform the Sentiment Analysis by using some UDF s (User Defined Functions) by which we can perform sentiment analysis by taking Stanford Core NLP as the data dictionary so that by using that we can decide the list of words that coming under positive, moderate and negative. The following figure shows clearly the architecture view for the proposed system by this we can understand how our project is effective using the Hadoop ecosystems and how the data is going to store form the Flume, also how it is going to create tables using Hive also how the sentiment analysis is going to perform. Fig. 2: Architecture diagram for proposed system. IV. METHODOLOGY As we have seen the procedure how to overcome the problem that we are facing in the existing problem that is shown clearly in the proposed system. So, to achieve this we are going to follow the following methods: Creating Twitter Application. Getting data using Flume. Querying using Hive Query Language (HQL) 4.1 Creating Twitter Application First of all if we want to do sentiment analysis on Twitter data we want to get Twitter data first so to get it we want to create an account in Twitter developer and create an application by clicking on the new application button provided by them.[3] After creating a new application just create the access tokens so that we no need to provide our authentication details there and also after creating application it will be having one consumer keys to access that application for getting Twitter data. The following is the figure that show clearly how the application data looks after creating the application and here it s self we can see the consumer details and also the access token details. We want to take this keys and token details and want to set in the Flume configuration file such that we can get the required data from the Twitter in the form of twits. 832

that we got form the Twitter is also in the JSON format that is shown clearly below: Fig. 3: Creating Twitter application from Twitter Developer. The figure show clearly the application keys that are generated after creating application and in this keys we can see the top two keys are the API key and API secret. And coming to the reaming two keys it is nothing but know as the access tokens that we want to generate it by ourselves by clicking the generate access token. After clicking that we can get the two keys that are our account access token and coming to that one is Access token and the other one is the Access token secret. 4.2 Getting data using Flume After creating an application in the Twitter developer site we want to use the consumer key and secret along with the access token and secret values. By which we can access the Twitter and we can get the information that what we want exactly here we will get everything in JSON format and this is stored in the HDFS that we have given the location where to save all the data that comes from the Twitter. The following is the configuration file that we want to use to get the Twitter data from the Twitter. Fig. 5: Twitter data in HDFS (Hadoop Distributed File System). Fig. 6: Twitter data in JSON format. Fig. 4: Flume configuration files for Twitter data. 4.3 Querying using Hive Query Language (HQL)After running the Flume by setting the above configuration then the Twitter data will automatically will save into HDFS where we have the set the path storage to save the Twitter data that was taken by using Flume. The following is the figure that shows clearly how the data is stored in the HDFS in a documented format and the raw data From these data first we want to create a table where the filtered data want to set into a formatted structured such that by which we can say clearly that we have converted the unstructured data into structured format. For this we want to use some custom serde concepts. These concepts are nothing but how we are going to read the data that is in the form of JSON format for that we are using the custom serde for JSON so that our hive can read the JSONdata and can create a table in our prescribed format. 833

V. RELATED WORK Sentiment analysis of tweets data is considered as a much harder problem than that of conventional text such as review documents. This is partly due to the short length of tweets, the frequent use of informal and irregular words, and the rapid evolution of language in Twitter. A large amount of work has been conducted in Twitter sentiment analysis following the feature-based approaches. Go et al. explored augmenting different n-gram features in conjunction with POS tags into the training of supervised classifiers including Naive Bayes (NB), Maximum Entropy (MaxEnt) and Support Vector Machines (SVMs). They found that MaxEnt trained from a combination of unigrams and bigrams outperforms other models trained from a combination of POS tags and unigrams by almost 3%. However, a contrary finding was reported in that adding POS tag features into n-grams improves the sentiment classification accuracy on tweets. Fig. 7: HQL Query for creating Tweets table. Fig. 8: Inserting data by performing sentiment analysis Also we are using another UDF s (User Defined Functions) for performing the sentiment analysis on the tales that are created by using Hive.[5]From that we can perform the sentiment analysis. And acquire the results where a new table is created by partition concept such that all the comments that are having positive will go into the positive partition and all the comments that are having moderate will go into moderate partition and finally all the comments that are having negative will go into negative partition. The following figure shows clearly how the rating is done and how the data is partitioned into 3 types. Fig. 9: Result after performing the sentiment analysis. Barbosa and Feng argued that using n-grams on tweet data may hinder the classification performance because of the large number of infrequent words in Twitter. Instead, they proposed using micro blogging features such as re-tweets, hash tags, replies, punctuations, and emoticons. They found that using the above features to train the SVMs enhances the sentiment classification accuracy by 2.2% compared to SVMs trained from unigrams only. A similar finding was reported by Kouloumpis et al. It explored the microblogging features including emoticons, abbreviations and the presence of intensifiers such as all-caps and character repetitions for Twitter sentiment classification. Their results show that the best performance comes from using the n-grams together with the microblogging features and the lexicon features where words tagged with their prior polarity. However, including the POS features produced a drop in performance. Agarwal et al. also explored the POS features, the lexicon features and the microblogging features. Apart from simply combining various features, they also designed a tree representation of tweets to combine many categories of features in one succinct representation. A partial tree kernel was used to calculate the similarity between two trees. They found that the most important features are those that combine prior polarity of words with their POS tags. All other features only play a marginal role. Furthermore, they also showed that combining unigrams with the best set of features outperforms the tree kernel-based model and gives about 4% absolute gain over a unigram baseline. Rather than directly incorporating the microblogging features into sentiment classifier training, Speriosu et al. constructed a graph that has some of the microblogging features such as hashtags and emoticons together with users, tweets, word unigrams and bigrams as its nodes which are connected based on the link existence among them (e.g., users are connected to tweets they created; tweets are connected to word unigrams that they contain etc.). They then applied a label propagation method where sentiment labels were propagated from a small set of nodes seeded with 834

some initial label information throughout the graph. They claimed that their label propagation method outperforms MaxEnt trained from noisy labels and obtained an accuracy of 84.7% on the subset of the Twitter sentiment test set from. Existing work mainly concentrates on the use of three types of features; lexicon features, POS features, and microblogging features for sentiment analysis. Mixed findings have been reported. Some argued the importance of POS tags with or without word prior polarity involved, while others emphasised the use of microblogging features. In this paper, we propose a new type of features for sentiment analysis, called semantic features, where for each entity in a tweet (e.g. iphone, ipad, MacBook), the abstract concept that represents it will be added as a new feature (e.g. Apple product). We compare the accuracy of sentiment analysis against other types of features;unigrams, POS features, and the sentiment-topic features. To the best of our knowledge, using such semantic features is novel in the context of sentiment analysis. VI. DATASETS For the work and experiments described in this paper, we used two different Twitter datasets as detailed below. The statistics of the datasets are shown in Table 1. paper focuses on identifying positive and negative tweets, and therefore we exclude neutral tweets from this dataset. The final #smartphone dataset for training consists 839 tweets, and another 839 tweets are used for testing. VII. SEMANTIC FEATURES FOR SENTIMENT ANALYSIS This section describes our semantic features and their incorporation into our sentiment analysis method.the semantic concepts of entities extracted from tweets can be used to measure the overall correlation of a group of entities (e.g. all Samsung products) with a given sentiment duality. Hence adding such features to the analysis could help identifying the sentiment of tweets that contain any of the entities that such concepts represent, even if those entities never appeared in the training set (e.g. a new gadget from Samsung). Semantic features refer to those semantically hidden concepts extracted from tweets. An example for using semantic features for sentiment classifier training is shown in Figure 10 where the left box lists entities appeared in the training set together with their occurrence probabilities in positive and negative tweets. For example, the entities Galaxy I, Galaxy II appeared more often in tweets of positive polarity and they are all mapped to the semantic concept Samsung products. As a result, the tweet from the test set Finally, I got my Galaxy II. What a product! is more likely to have a positive duality because it contains the entity Galaxy II which is also mapped to the concept Samsung Product. 7.1 Extracting Semantic Entities and Concepts Table 1. Statistics of the twotwitter datasets used in this paper. Cricket World Cup(CWC15)This dataset consists of 60,000 tweets randomly selected from the Cricket World Cup(CWC15). Half of the tweets in this dataset contains positive emoticons, such as :), :-), : ), :D, and =), and the other half contains negative emoticons such as :(, :-(, or : (. The original dataset contained 1.6 million general tweets, and its test set manually annotated tweets consisted of 177 negative and 182 positive tweets. The training set which was collected based on specific emoticons, whereas the test set was collected by searching Twitter API with specific queries.to extend the testing set, we added 641 tweets randomly selected from the original dataset, and annotated manually by 12 users (researchers in our lab), where each tweet was annotated by one user. Our final CWC15 dataset consists of 60K general tweets, with a test set of 1,000 tweets of 527 negatively, and 473 positively annotated ones. 6.2 Business(#smartphones) The Business(#smartphones) dataset was built by crawling tweets containing the various hashtags eg. #smartphones (business) in March 2010. A subset of this corpus was manually annotated with three polarity labels (positive, negative, neutral) and split into training and test sets. This There are several open APIs that provide entity extraction services for online textual data. Rizzo and Troncy evaluated the use of five popular entity extraction tools. Assuming of course that the entity extractor successfully identify the new entities as sub-types of concepts already correlated with negative or positive sentiment on a dataset of news articles. Fig. 10. Measuring correlation of semantic concepts with negative/positive sentiment. These semantic concepts are then incorporated in sentiment classification. Their experimental results showed that AlchemyAPI performs best for entity extraction and semantic concept mapping. Our datasets consist of informal tweets. Therefore we conducted our own evaluation, and randomly selected tweets from the CWC15 and asked 3 evaluators to evaluate the semantic concept extraction outputs generated from AlchemyAPI, OpenCalais and Zemanta. 835

capabilities for abbreviated phrases, emoticons and interjections (e.g. lol, omg ). 8.3 Sentiment-Topic Features Table 2. Evaluation results of AlchemyAPI, Zemanta and OpenCalais The assessment of the outputs was based on the correctness of the extracted entities; and the correctness of the entity-concept mappings. The evaluation results presented in Table 2 show that AlchemyAPI extracted the most number of concepts and it also has the highest entity-concept mapping accuracy compared to OpenCalais and Zematna. As such, we chose AlchemyAPI to extract the semantic concepts from our three datasets. Table 3 lists the total number of entities extracted and the number of semantic concepts mapped against them for each dataset. The sentiment-topic features are extracted from tweets using the weakly-supervised joint sentiment-topic (JST) mode that we developed earlier. We trained this model on the training set with tweet sentiment labels discarded. The resulting model assigns each word in tweets with a sentiment label and a topic label. Hence JST essentially groups different words that share similar sentiment and topic. Table 3. Entity/concept extraction statistics of CWC15 and Smartphones using AlchemyAPI. VIII. SEMANTIC FEATURES FOR SENTIMENT ANALYSIS We compare the performance of our semantic sentiment analysis approach against the baselines described below. 8.1 Unigrams Features Word unigrams are the simplest features are being used for sentiment analysis of tweets data. Models trained from word unigrams are shown to outperform random classifiers by a decent margin of 20%. In this work, we are use NB classifiers trained from word unigrams as our first baseline model. Table 4 lists, for each dataset, the total number of the extracted unigram features that are used for the classification training. Table 4. Total number of unigram features extracted from each dataset 8.2 Part-of-Speech Features POS features are common features that have been widely used in the literature for the task of Twitter sentiment analysis. Here, we build various NB classifiers trained using a combination of word unigrams and POS features and use them as baseline models. We extract the POS features using the TweetNLP POS tagger,which is trained specifically from tweets. This differs from the previous work, which relies on POS taggers trained from treebanks in the newswire domain for POS tagging. It was shown that TweetNLP tagger outperforms the Stanford tagger with a relative error reduction of 25% when evaluated on 500 manually annotated tweets. Moreover, the tagger offers additional recognition Table5. Extracted sentiment-topic words by the sentiments We list some of the topic words extracted by this model from the CWC15 and Smartphones corpora in Table 5.Words in each cell are grouped under single topic and the upper half of the table shows topic words bearing positive sentiment while the lower half shows topic words bearing negative sentiment. The rationale behind this model is that grouping words under the same topic and bearing similar sentiment could reduce data sparseness in Twitter sentiment classification and improves accuracy. IX. CONCLUSION There are different ways to get Twitter data or any other online streaming data where they want to code lines of coding to achieve this. And, also they want to perform the sentiment analysis on the stored data where it makes some complex to perform those operations. Coming to this paper we have achieved by this problem statement and solving it in BIGDATA by using Hadoop and its Eco Systems. And finally we have done sentiment analysis on the Twitter data that is stored in HDFS. So, here the processing time taken is also very less compared to the previous methods because HadoopMapReduce and Hive are the best methods to process large amount of data in a small time. REFERENCES [1] Apporv Agarwal, Jasneet Singh Sabarwal, End to End Sentiment Analysis of Twitter Data [2] Real Time Sentiment Analysis of Twitter Data Using Hadoop Sunil B. Mane, Yashwant Sawant, Saif Kazi, Vaibhav Shinde 836

[3] Luciano Barbosa and Junlan Feng. 2010. Robust sentimentdetection on twitter from biased and noisy data. [4] Semantic Sentiment Analysis of Twitter Hassan Saif, Yulan He and Harith Alani Knowledge Media Institute, The Open University, United Kingdom. [5] Kouloumpis, E., Wilson, T., Moore, J.: Twitter sentiment analysis: The good the bad and the omg! In: Proceedings of the ICWSM (2011). [6] Go, A., Bhayani, R., Huang, L.: Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford (2009). 837