SENTIMENT ANALYSIS OF BIG SOCIAL DATA WITH APACHE HADOOP
SENTIMENT ANALYSIS OF BIG SOCIAL DATA WITH APACHE HADOOP

A THESIS SUBMITTED TO THE GRADUATE DIVISION OF THE UNIVERSITY OF HAWAIʻI AT MĀNOA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN ELECTRICAL ENGINEERING

DECEMBER 2014

By Qiuling Kang

Thesis Committee: Xiangrong Zhou, Chairperson; Galen Sasaki; Rui Zhang

Keywords: Twitter, Sentiment Analysis, Hadoop MapReduce, HDFS
ACKNOWLEDGMENTS

I would like to express my appreciation to my advisor, Professor Zhou, for his patience and help throughout my master's program. Thanks to his generous and valuable suggestions, I was able to complete my master's project on time. I would also like to thank Professor Sasaki for his guidance during my study in the EE department and for reviewing this manuscript. In addition, I would like to thank Professor Zhang for reviewing my thesis; his insightful suggestions made it better.
ABSTRACT

Twitter is a microblog service and a very popular communication mechanism. Users of Twitter express their interests, preferences, and sentiments towards various topics and issues they encounter in daily life. Twitter is therefore an important online platform for people to express opinions, which are a key factor influencing their behavior. Thus, sentiment analysis of Twitter data is meaningful for both individuals and organizations when making decisions. Because of the huge amount of data generated by Twitter every day, storing and processing this big data becomes a challenge. In this study, we present a method to collect Twitter data sets and to store and analyze them on the Hadoop platform. The experimental results show that the presented method performs efficiently.
TABLE OF CONTENTS

Acknowledgments ... ii
Abstract ... iii
Table of Contents ... iv
List of Tables ... vi
List of Figures ... vii
Chapter 1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Contribution of the Thesis
  1.4 Thesis Overview
Chapter 2 Twitter Data Collection and Storage
  2.1 Twitter Data Collection
    2.1.1 Twitter API Introduction
    2.1.2 Tweepy Introduction and Installation
    2.1.3 Collection Procedure
  2.2 Storage in HDFS Filesystem
Chapter 3 Sentiment Analysis of Tweets in Hadoop System
  3.1 Algorithm Selection
    3.1.1 Decision Trees
    3.1.2 Naive Bayes Classifiers
    3.1.3 Support Vector Machines
  3.2 Sentiment Analysis of Tweets
    3.2.1 Extract Features from Tweets
    3.2.2 Classifier
  3.3 Run on Hadoop
    3.3.1 Hadoop MapReduce and HDFS
    3.3.2 MapReduce Functions
Chapter 4 Experiments and Results
  4.1 Scottish Independence Vote Analysis
  4.2 Some Hawaiʻi Tourism Sites Analysis
  4.3 Performance Environment
Chapter 5 Conclusion and Open Issues
References
LIST OF TABLES

Table 1.1 Traditional RDBMS compared to Hadoop
Table 3.1 Words with sentiment polarity in tweet
Table 4.1 The environment of the experiment
LIST OF FIGURES

Figure 1.1 The increasing trends of tweets in recent years
Figure 2.1 The process of requesting from REST APIs
Figure 2.2 Streaming APIs working procedure
Figure 2.3 Twitter dataset collection procedure
Figure 2.4 Create a Twitter application
Figure 2.5 Create a Twitter application
Figure 2.6 Obtain token
Figure 2.7 The procedure of data collection
Figure 2.8 Sample of collected files
Figure 2.9 Architecture of HDFS
Figure 2.10 HDFS workflow
Figure 3.1 The process of sentiment analysis
Figure 3.2 An example of a decision tree
Figure 3.3 Naive Bayes classifier represented by graph
Figure 3.4 Example of SVM maximum margin and margin
Figure 3.5 The workflow of opinion mining
Figure 3.6 The Hadoop MapReduce and HDFS architecture
Figure 3.7 A client submits a job to MapReduce
Figure 3.8 The Mapper flowchart
Figure 3.9 The reducer flowchart
Figure 4.1 Data analysis process
Figure 4.2 The curve for tweets based on different keywords
Figure 4.3 Scottish independence vote polarity values
Figure 4.4 The attitude distribution based on different keywords
Figure 4.5 Values of positive vs negative
Figure 4.6 Distribution of attitude polarity based on keyword Hawaii
Figure 4.7 Distribution of attitude polarity based on keyword Waikiki
Figure 4.8 Distribution of attitude polarity based on keyword Diamond Head
Figure 4.9 Distribution of attitude polarity based on keyword Hanauma Bay
CHAPTER 1 INTRODUCTION

1.1 Background

Microblogging websites have become one of the major sources of information. Twitter is one such popular microblog: an online social networking platform that allows people to publish messages expressing their interests, preferences, opinions, and sentiments towards various topics and issues they encounter in daily life. The messages, called tweets, are real-time and at most 140 characters each [1]. About 200 billion tweets are published per year, 500 million per day, 350,000 per minute, and 6,000 per second [2]. Figure 1.1 shows the increasing trend of tweets in recent years. Such a huge amount of data can be used efficiently for social network studies and analysis to gain useful and meaningful results.

Figure 1.1 The increasing trends of tweets in recent years [2]

There is previous research on sentiment analysis of Twitter data. Pak and Paroubek (2010) performed linguistic analysis of collected tweets and showed how to build a sentiment classifier using training data [4]. Sitaram Asur and Bernardo A. Huberman
(2010) showed that social media content can be used to predict real-world performance; they built a linear-regression model for forecasting the box-office revenues of movies [5]. Apoorv Agarwal and Boye Xie (2011) introduced POS-specific prior polarity features and designed a new tree representation for a tree-kernel-based model [6]. Hsiang Hui Lek and Danny C.C. Poo (2013) proposed aspect-based sentiment classification, which improves on existing tweet-level classifiers [7]. Classification techniques are fundamental to analyzing the sentiment of social data. S.B. Kotsiantis (2007) reviewed recent classification techniques, discussing and comparing several supervised learning algorithms: logic-based algorithms such as decision trees and rule-based algorithms [3]; perceptron-based techniques, such as single-layer and multilayer perceptrons; Radial Basis Function (RBF) networks; statistical learning algorithms such as the Naive Bayes classifier and Bayesian networks; instance-based learning; and Support Vector Machines.

1.2 Motivation

Since the number of internet users of social networking platforms and services grows fast, more and more data from these platforms can be used for data mining studies. For example, a government may be interested in people's attitude toward a vote. It may wish to predict the result with questions such as the following [4]:

1. Could the new policy get the support of most people?
2. How positive (or negative) are people about the new policy?
3. What kind of people have the most influence on the result?

Also, a local tourism company may be interested in which places are popular among tourists. It may want answers to questions such as the following [4]:

1. Which tourist attractions are most visited by tourists?
2. How positive (or negative) are people about the tourist attraction?
3. At which time of day do people prefer to travel?

In this thesis, we show how we use datasets from a microblogging platform for data mining. We collected big datasets from the Twitter database and analyzed them. There are several reasons we use the Twitter data set for opinion mining [4]:

1. Valuable data source: Twitter is a social network platform used by various people to post their opinions on various topics and discuss current issues.
2. Sufficient data: the volume of tweets grows at high rates, so sufficient data can be gathered for data mining.
3. Variety of users: internet users come from a variety of groups, for example, researchers, politicians, students, farmers, workers, artists, and so on.

We collected more than one million tweets published on Twitter. They are separated into two sets:

1. Two thirds of the tweets concern the topic of the Scottish independence vote in September 2014.
2. One third of the tweets talk about tourism in Hawaiʻi.

As we are going to process a huge amount of data, the problem of how to store large datasets and improve computing performance becomes significant and cannot be ignored. There are several reasons that we use the Hadoop system instead of a Relational Database Management System (RDBMS) in this study:

Access speed [8][9]: Although the storage capacity of a single disk has grown considerably over the years, the speed of reading data from the disk has not kept up, so it can take a long time to read a large dataset from a hard drive. However,
the MapReduce model built on Hadoop is effective when unstructured data are combined from different nodes for merging and sorting.

Data duplication [8][9]: It is necessary to duplicate data to distinct storage systems to avoid the problems brought by hardware failure. Furthermore, an RDBMS with many disks is poorly suited to large-scale computation: if seek time dominates data access time, reading and writing large portions of the dataset takes longer than streaming through it. That is to say, if we update large portions of a database, the RDBMS works less efficiently than MapReduce built on Hadoop. The Hadoop Distributed File System (HDFS) stores data in a distributed fashion, duplicating data sets across a cluster of hard drives so that a single disk failure does not cause data loss.

Linear scalability [8][9]: When gigabytes of structured data are computed, an RDBMS needs to be highly integrated. However, Hadoop can store very large datasets and process petabytes of structured, semi-structured, or unstructured data with linear scaling and low integration. If we increase the number of nodes in the cluster, the speed of processing data increases proportionally; this does not hold for SQL queries.

Table 1.1 shows the comparison between traditional RDBMS and Hadoop. We show how to collect datasets from the Twitter database via the Twitter API and perform sentiment analysis of the collected datasets on the Hadoop distributed system.
Table 1.1 Traditional RDBMS compared to Hadoop [9]

1.3 Contribution of the Thesis

This thesis presents a method to collect a huge amount of data concerning specific topics from the Twitter database via the Twitter API. The study extracts features from tweets and uses a sentiment classifier to classify the tweets into positive-attitude and negative-attitude classes, in order to analyze people's opinions toward specific topics and issues. The study stores and analyzes the datasets using HDFS and the Hadoop MapReduce model respectively, which are more scalable and efficient than a traditional RDBMS. The experimental results show that the presented method performs efficiently.

1.4 Thesis Overview

The rest of the thesis is organized as follows. Chapter 2 presents Twitter data set collection and storage. Chapter 3 introduces sentiment analysis on the Hadoop system. Chapter 4 describes the experiments and results. Finally, Chapter 5 concludes the thesis and indicates future work.
CHAPTER 2 TWITTER DATA COLLECTION AND STORAGE

Millions of tweets are generated by Twitter users per day [4]. Through the Twitter API (Application Programming Interface), researchers and developers can collect large public data sets from the Twitter database. Twitter provides two types of APIs for accessing Twitter data: REST APIs and Streaming APIs [22]. With the REST APIs, users request information explicitly to retrieve tweets; these APIs give access to some of the core primitives of Twitter, including timelines, status updates, and user information. With the Streaming APIs, users can continuously collect a stream of public information: they request real-time data in large quantities, filtered by tracked keywords, geographic area, user, or a random sample. As long as a long-lived connection is maintained, users get a continuous stream of updates. In this research, we retrieve Twitter data via the Streaming APIs, because our research objective, analyzing user sentiments about given topics, requires collecting the messages users publish. To make calls to Twitter's APIs, we must use the authentication method Twitter supports: OAuth (Open Authentication), an open standard for authenticating access to protected information. After obtaining the data set, we store the data in HDFS.

2.1 Twitter Data Collection

2.1.1 Twitter API Introduction

Twitter has two types of APIs for users to access Twitter data: REST APIs and Streaming APIs. The REST APIs do not require users to keep a persistent HTTP (Hypertext Transfer Protocol) connection open. The user makes one or more requests to a web
application and then receives the results of the initial request. Figure 2.1 shows the process of requesting from the REST APIs [14].

Figure 2.1 The process of requesting from REST APIs [14]

Twitter offers three types of endpoints for the Streaming APIs: public streams carry the flow of public tweets; user streams are single-user streams corresponding to the view of a single user; site streams are multi-user streams intended for servers connecting on behalf of many Twitter users. In this research, we collect real-time tweets via the Twitter Streaming APIs. Figure 2.2 shows how the Streaming APIs work. Before the result is stored into a data store, the incoming tweets are processed as a stream: parsed, filtered, and/or aggregated first. To respond to user requests, HTTP queries read results from the data store [14].
Figure 2.2 Streaming APIs working procedure [14]

2.1.2 Tweepy Introduction and Installation

Tweepy is open-source and provides access to the documented Twitter API from Python. It supports accessing Twitter through OAuth, which is the only authentication method Twitter accepts for securing its information. OAuth offers several benefits: (1) it makes user information more secure; (2) it conceals the user's password; (3) if the user changes the password, the application still works, since the application does not rely on the password; (4) permissions are easily managed [20][21]. In order to get data from the Streaming APIs, our application should first obtain an OAuth access token and then install the Tweepy package, which is a Python library for accessing the Twitter API. The Tweepy API class provides access to the Twitter API methods, which accept various parameters and return response data. Therefore a copy of the Tweepy package is downloaded and installed on the Ubuntu Linux system. After installation, the collection Python script
is run to collect data and store the dataset on the local server. The whole collecting procedure is shown in Figure 2.3.

Figure 2.3 Twitter dataset collection procedure

In this thesis, the Tweepy package, which requires Python 2.5 or later, has been installed on an Ubuntu Linux system. There are three steps to complete the installation:

1. Download the Tweepy package to the local server:
$ git clone git://github.com/tweepy/tweepy.git
2. Enter the tweepy directory:
$ cd tweepy
3. Install Tweepy with administrator or root privileges:
$ python setup.py install

In order to start the collection process, a client application is registered and a new application is created with Twitter. We log in to the portal and then go to My Applications. After filling in the information shown in Figure 2.4, a new application can be created. We use the application information to communicate with the Twitter API to retrieve data sets. As shown in the figure, we enter the application name For Hadoop in the Name field, enter This application is used for test in the Description field, and enter a placeholder in the Website field since we do not have a URL. Finally, we check Yes, I agree and click Create your Twitter application to complete the creation process.
Figure 2.4 Create a Twitter application

After creating the application, we generate the access token. As shown in Figure 2.5, we click Create my access token at the end of the form.
Figure 2.5 Create a Twitter application

Next, we get the access token and access token secret presented in Figure 2.6. All the information we need to communicate with the Twitter API is included in the figure: owner, owner ID, API key, and API secret.
Figure 2.6 Obtain token

2.1.3 Collection Procedure

The collection procedure is presented in Figure 2.7. It includes the following four steps:
Start → Set OAuth authentication with username and password → Set request parameters → Set filter method → Access the Twitter API and collect data → End

Figure 2.7 The procedure of data collection

1. Set OAuth authentication with tokens using Tweepy: Twitter utilizes OAuth to provide authorized access to its API and requires all requests to use OAuth for authentication [11]. The following code shows how to use Tweepy with OAuth to access the Twitter API.

1. consumer_key = "tcytpmwiwlxynbdcs9ipg"
2. consumer_secret = "7BFXcq07s5y4YrjwjP6p3t4cYu0ojeTFG9vq98rE8"
3. access_token = " Nu3991UKfyVIjacGHNnxKmBykHj5W5zX0g89 kn4k"
4. access_token_secret = "KuINYWTDE1fd5QVmRlVsMBmLTDdgMoq2MnyFmo4pG7gv1 "
5. auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
6. auth.set_access_token(access_token, access_token_secret)
7. api = tweepy.API(auth)
We use the consumer_key, consumer_secret, access_token, and access_token_secret shown in Figure 2.6 to create the OAuth access. Lines 5 and 6 show how the OAuth process works. In line 7, we create the actual API interface using the authentication.

2. Set request parameters: in this thesis, a list of keywords and a list of longitude, latitude pairs are used to specify the tweets that will be returned from the Twitter stream. We set the parameters in the following format:

track = [ key words 1, key words 2, key words 3 ]
follow = []
geo_location = [ , 21.31, , 21.71, , 20.59, , , , 21.89, , 22.24, , , , ]

At least one of the three parameters track, follow, and locations should be specified. The parameter track is a comma-separated list of phrases used to determine which tweets will be returned. A phrase can contain one or several words separated by spaces and must be less than 60 bytes [10][12]. The parameter follow is a comma-separated list of user IDs to track; by setting it we can collect the tweets of specific users [10][12]. The parameter locations is a list of longitude, latitude pairs that define a bounding box of the geographic area to track; all tweets within that area are retrieved [10][12]. For example, by setting longitude and latitude pairs that bound the Hawaiian Islands, we can track tweets from the Hawaiʻi area.

3. Use the filter method to collect the tweets matching the request parameters. We call filter() with the parameters in the following form:

stream.filter(track = track, follow = follow, locations = geo_location)
4. The tweets that match one or more filter parameters are returned and stored on the local server. These tweets are encoded in JSON (JavaScript Object Notation), a lightweight data-interchange format that is easy for humans to read and write and for machines to parse and generate [13]. In this thesis, the Tweepy package is used to access the Streaming API and gather the tweets back encoded in JSON. Figure 2.8 presents part of the Twitter data set we collected.

Figure 2.8 Sample of collected files

More than one million tweets have been collected from the Twitter API in this thesis. One of the tweets is shown in the following:

{"created_at":"mon Aug 11 18:50: ",
"id": ,
"id_str":" ",
"text":"@crankydad I need to visit you the next time I'm over there meeting with Jay and Sara.",
"source":"\u003ca href=\" rel=\"nofollow\"\u003etwitter for iphone\u003c\/a\u003e",
"truncated":false,
"in_reply_to_status_id": ,
"in_reply_to_status_id_str":" ",
"in_reply_to_user_id": ,
"in_reply_to_user_id_str":" ",
"in_reply_to_screen_name":"crankydad",
"user":{"id": ,
"id_str":" ",
"name":"julieford",
"screen_name":"julieford808",
"location":"honolulu, Hawaii",
"url":" "description":"local girl after 20 years in Hawaii. Mom of a crazy toddler. Owner of Schweitzer Consulting, a PR consultancy.",
"protected":false,
"verified":false,
"followers_count":1159,
"friends_count":953,
"listed_count":49,
"favourites_count":19,
"statuses_count":4296,
"created_at":"sat Aug 02 22:50: ",
"utc_offset":-36000,
"time_zone":"hawaii",
"geo_enabled":true,
"lang":"en",
"contributors_enabled":false,
"is_translator":false,
"profile_background_color":"edece9",
"profile_background_image_url":" und_images\/ \/twilk_background.jpg",
"profile_background_image_url_https":" background_images\/ \/twilk_background.jpg",
"profile_background_tile":true,
"profile_link_color":"088253",
"profile_sidebar_border_color":"d3d2cf",
"profile_sidebar_fill_color":"e3e2de",
"profile_text_color":"634047",
"profile_use_background_image":true,
"profile_image_url":" \/04050c00d0da6c74b318be1e34f8a38d_normal.jpeg",
"profile_image_url_https":" \/04050c00d0da6c74b318be1e34f8a38d_normal.jpeg",
"profile_banner_url":" 0\/ ",
"default_profile":false,
"default_profile_image":false,
"following":null,
"follow_request_sent":null,
"notifications":null},
"geo":{"type":"point", "coordinates":[ , ]},
"coordinates":{"type":"point","coordinates":[ , ]},
"place":{"id":"c47c0bc571bf5427",
"url":" "place_type":"city",
"name":"honolulu",
"full_name":"honolulu, HI",
"country_code":"us",
"country":"united States",
"bounding_box":{"type":"polygon", "coordinates":[[[ , ],[ , ],[ , ],[ , ]]]},
"attributes":{}},
"contributors":null,
"retweet_count":0,
"favorite_count":0,
"entities":{"hashtags":[], "trends":[], "urls":[],
"user_mentions":[{"screen_name":"crankydad",
"name":"mike Gordon",
"id": ,
"id_str":" ",
"indices":[0,10]}],
"symbols":[]},
"favorited":false,
"retweeted":false,
"possibly_sensitive":false,
"filter_level":"medium",
"lang":"en"}

2.2 Storage in HDFS Filesystem

HDFS is a distributed file system with a master/slave architecture, built into the Hadoop platform. It is designed for storing large files, which could be hundreds of megabytes, gigabytes, or petabytes in size, and for running on clusters of computers that can be inexpensive commodity hardware and need not be highly reliable [9]. Figure 2.9 shows the HDFS architecture. In a cluster, HDFS consists of a single NameNode (the master) and a number of DataNodes (the slaves). The NameNode hosts the filesystem index, in the form of a namespace image and an edit log, and knows and manages the DataNodes from which the filesystem is reconstructed when the system starts. DataNodes store the data of the filesystem and retrieve blocks when the NameNode tells them to; they report their status to the NameNode periodically. There is also a secondary NameNode, which produces snapshots of the primary NameNode's memory structures to protect against filesystem corruption [9][14][15].
Figure 2.9 Architecture of HDFS

The HDFS cluster is set up at the beginning of the process, and then we transfer the collected data sets from the local system to HDFS for the subsequent sentiment analysis. Figure 2.10 shows the process of storing datasets into HDFS.

Figure 2.10 HDFS workflow

1. HDFS setup: we install the Hadoop platform on a cluster of servers and configure the NameNode and DataNode files. We execute the following commands to set up the system. Java 1.6.0_30 was installed on the cluster.
sudo chmod u+x jdk-6u30-linux-x64.bin
sudo ./jdk-6u30-linux-x64.bin
sudo chmod u+x jre-6u30-linux-x64.bin
sudo ./jre-6u30-linux-x64.bin

SSH was installed on the cluster:

sudo apt-get install ssh
sudo apt-get install rsync
sudo /etc/init.d/ssh start
Rather than invoking init scripts through /etc/init.d, use the service(8)
utility, e.g. service ssh start
Since the script you are attempting to invoke has been converted to an
Upstart job, you may also use the start(8) utility, e.g. start ssh
ps -ef | grep sshd
root :07 ? 00:00:00 /usr/sbin/sshd -D
hadoop :18 pts/1 00:00:00 grep --color=auto sshd
hdp@hadoop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
15:49:81:02:71:55:a9:a0:9d:a8:e6:4d:c1:00:ae:65 hadoop@hadoop
The key's randomart image is:
+--[ RSA 2048]----+
(randomart omitted)
+-----------------+
hdp@hadoop:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Install Hadoop on the cluster and configure the Hadoop configuration files:

export JAVA_HOME=/usr/local/java/jdk1.6.0_45 (into hadoop-env.sh)
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin
We configure core-site.xml, hdfs-site.xml, and mapred-site.xml respectively. The network interface is set up by connecting each server via a single hub. We assign an IP address to the master machine and to each of the slave machines respectively.

2. Initialize the system: the HDFS filesystem is formatted via the NameNode. The following command is executed:

hdp@master:/usr/local/hadoop$ bin/hadoop namenode -format

Start the system:

hdp@master:/usr/local/hadoop$ bin/start-all.sh

3. Transfer local data sets into HDFS:

hdp@master:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/inputtxt1 /user/hdp/inputtxt1

4. Run the MapReduce calculation:

hdp@master:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
> -file /home/hdp/MyMapper.py -mapper /home/hdp/MyMapper.py \
> -file /home/hdp/MyReducer.py -reducer /home/hdp/MyReducer.py \
> -input /user/hdp/inputtxt1/* -output /user/hdp/inputtxt1-output
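The streaming job above pipes each input line to MyMapper.py on stdin and the sorted, tab-separated key/value lines to MyReducer.py. The thesis's actual scripts are not reproduced here; the following is a minimal word-polarity sketch of the mapper and reducer logic, with illustrative stand-in word lists:

```python
# Illustrative sentiment word lists (stand-ins, not the thesis's lexicon).
POSITIVE = {"lovely", "beautiful", "fancy", "nice"}
NEGATIVE = {"bad", "awful", "terrible"}

def map_line(line):
    """Mapper logic: emit a ('positive', 1) or ('negative', 1) pair for each
    sentiment word in one input line (one tweet text). The real MyMapper.py
    would read lines from stdin and print one 'key<TAB>1' line per pair."""
    pairs = []
    for word in line.lower().split():
        if word in POSITIVE:
            pairs.append(("positive", 1))
        elif word in NEGATIVE:
            pairs.append(("negative", 1))
    return pairs

def reduce_pairs(pairs):
    """Reducer logic: sum the counts per key. Hadoop Streaming hands the
    reducer the mapper output sorted (hence grouped) by key, so summing
    into a dict is equivalent."""
    totals = {}
    for key, count in pairs:
        totals[key] = totals.get(key, 0) + count
    return totals
```

Wiring these into standalone stdin/stdout scripts, as the hadoop streaming invocation requires, is a few extra lines of boilerplate around each function.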
CHAPTER 3 SENTIMENT ANALYSIS OF TWEETS IN HADOOP SYSTEM

Sentiment analysis is the field of research that identifies and extracts subjective information from written language. It is also called opinion mining, and it aims to analyze people's attitudes, emotions, and opinions and to classify the polarity of a given text. Sentiment analysis usually classifies the given text into two classes, positive and negative [16]. The proposed process of sentiment analysis of tweets is described in Figure 3.1.

Figure 3.1 The process of sentiment analysis

The first step is collecting the dataset from the Twitter database. If there is no expert who can tell us which fields are the most informative, we could use a brute-force
method and gather everything from which relevant features might be isolated. However, a dataset collected by brute force may miss useful information and contain too much noise, so we need to define the keywords of the classifier. The second step is the definition of classifier keywords and data preparation. Keyword selection reduces the data size, removes many irrelevant and redundant features, and thus reduces noise. Tweets filtered by keywords are processed more effectively and faster by the data mining algorithm. In sum, a good selection of classifier keywords contributes to better analysis results.

3.1 Algorithm Selection

It is very important to choose a suitable algorithm for sentiment analysis. For the text classification problem, three methods can be applied: decision trees, Naive Bayes classification, and Support Vector Machines (SVMs) [3]. In the following sections, we introduce and compare them, then propose the method we use.

3.1.1 Decision Trees

Decision trees are tree-like graphs that classify instances by using a specific sorting algorithm. A decision tree uses decision nodes to test the attributes of an instance described by attribute values, each branch corresponds to an attribute value, and each leaf node represents a classification outcome. Classification starts from the root node, proceeds by sorting on attribute values, and ends at a leaf node. Figure 3.2 shows an example of a decision tree [3].
Figure 3.2 An example of a decision tree

The decision tree method is simple to understand and easy to implement. A general pseudocode for building a decision tree for sentiment analysis is shown as follows [3]:

Check for base cases
Create a node r for the tree
For each Tweet in Tweets do:
    If Tweet does not contain keywords, discard the Tweet.
    If Tweet contains keywords, do:
        add a new tree branch below r, corresponding to the test
        if keywords are positive, then:
            label the Tweet Positive attitude
        else add a new tree branch below r, corresponding to the test
            if keywords are negative, then:
                label the Tweet Negative attitude
            else label the Tweet Neutral attitude
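The pseudocode above can be turned into a runnable sketch. The keyword sets below are illustrative stand-ins, and breaking ties to Neutral is our choice, not something the pseudocode specifies:

```python
# Illustrative keyword sets (the thesis's actual keyword lists are not shown here).
POSITIVE_KEYWORDS = {"lovely", "beautiful", "fancy"}
NEGATIVE_KEYWORDS = {"bad", "awful"}

def label_tweet(tweet):
    """Walk the decision branches: discard tweets with no keywords,
    otherwise label by which keyword class dominates."""
    words = set(tweet.lower().split())
    pos = len(words & POSITIVE_KEYWORDS)
    neg = len(words & NEGATIVE_KEYWORDS)
    if pos == 0 and neg == 0:
        return None  # no keywords: discard the tweet
    if pos > neg:
        return "Positive attitude"
    if neg > pos:
        return "Negative attitude"
    return "Neutral attitude"  # equally many positive and negative keywords
```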
3.1.2 Naive Bayes Classifiers

Assume that there are two classes of keywords: w1 = positive, w2 = negative, and that the set of sentiment words in a tweet is represented as T. Define p(wj|T) as the probability of class wj, given that we have observed T. Bayesian classifiers use Bayes' theorem, which is described as follows [3]:

p(wj|T) = p(T|wj) p(wj) / p(T)

where
p(wj|T) is the probability of instance T being in class wj,
p(T|wj) is the probability of generating instance T given class wj,
p(wj) is the probability of occurrence of class wj,
p(T) is the probability of instance T occurring.

In order to classify T's attitude as positive or negative, the probabilities p(w1|T) and p(w2|T) are compared, and the larger one indicates the more likely class sentiment. We input the n sentiment words in a tweet as T = {t1, t2, ..., tn}, where ti = 1 when ti is a positive word and ti = 2 when ti is a negative word. We assume all ti are probabilistically independent, that there are k positive words in T, and that the following hold:

p(w1) = p(w2) = 0.5
p(ti = 1 | w1) >> p(ti = 2 | w1)
p(ti = 2 | w2) >> p(ti = 1 | w2),
p(ti = 1 | w1) = p(ti = 2 | w2) = p >> 0.5

Since the words are independent,

p(w1|T) ∝ p(w1) ∏ p(ti|w1) = 0.5 · p^k · (1 − p)^(n−k)

Similarly,

p(w2|T) ∝ p(w2) ∏ p(ti|w2) = 0.5 · p^(n−k) · (1 − p)^k

Thus,

p(w1|T) / p(w2|T) = (p / (1 − p))^(2k−n) > 1 if and only if k > n − k

In sum, the classifier result depends on the number of positive words and negative words. For example, an input tweet is: "Lovely turtle, beautiful fish, and bad weather, but still a fancy trip." The sentiment polarity of the words in this tweet is shown in the following table.

Table 3.1 Words with sentiment polarity in tweet
In this example there are k = 3 positive words (lovely, beautiful, fancy) and n − k = 1 negative word (bad). Since k > n − k, p(positive|Tweet) is larger than p(negative|Tweet), that is, p(w1|T) is larger than p(w2|T). Therefore, the sampled tweet is labeled as positive. As the Naive Bayes classifier assumes the attributes have strong independence, the estimate is:

p(T|wj) = p(t1|wj) p(t2|wj) ⋯ p(tn|wj)

The Naive Bayes classifier can be represented as a directed acyclic graph which has one unobserved node as the parent and several observed nodes as children, with strong independence assumptions among them [3].
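Under the section's assumptions (equal priors and a single emission probability p for a word matching its class), comparing p(w1|T) and p(w2|T) reduces to comparing word counts. A small sketch, with an illustrative stand-in lexicon and p = 0.9 chosen arbitrarily:

```python
import math

# Illustrative sentiment lexicon (stand-in for the thesis's word lists).
POSITIVE = {"lovely", "beautiful", "fancy"}
NEGATIVE = {"bad", "awful"}

def log_posteriors(words, p=0.9):
    """Return (log p(w1|T), log p(w2|T)) up to the shared constant log p(T),
    assuming p(w1) = p(w2) = 0.5 and p(t=pos|w1) = p(t=neg|w2) = p."""
    k = sum(1 for w in words if w in POSITIVE)   # positive sentiment words
    m = sum(1 for w in words if w in NEGATIVE)   # negative sentiment words
    log_w1 = math.log(0.5) + k * math.log(p) + m * math.log(1 - p)
    log_w2 = math.log(0.5) + m * math.log(p) + k * math.log(1 - p)
    return log_w1, log_w2

def classify(tweet):
    """Label a tweet positive or negative by comparing the two posteriors."""
    words = tweet.lower().replace(",", " ").replace(".", " ").split()
    log_w1, log_w2 = log_posteriors(words)
    return "positive" if log_w1 > log_w2 else "negative"

classify("Lovely turtle, beautiful fish, and bad weather, but still a fancy trip")
# -> "positive": three positive words outweigh one negative word
```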
35 p(t w j ) p(t 1 w j ) p(t 2 w j ) p(t n w j ) Figure 3.3 Naive Bayes classifier represented by graph A general pseudo-code for Naive Bayes classifier for sentiment analysis is showed as follows [3]: For each attribute, For each value of the attribute, make a rule as follows: count how often each class appears find the most frequent class make the rule assign that class to this attribute-value Calculate the error rate of the rules Choose the rules with the smallest error rate Support Vector Machines Support vector machines (SVMs) are supervised machine learning models which can be used for data analysis and classification. A hyperplane is constructed by a SVM can be used for classification. To achieved the best classification performance, we need to find the maximum margin which means either side of the a hyperplane has a largest distance to the corresponding nearest data point, therefore, reduced an upper bound on the expected generalization error [3]. Suppose some given linearly separable data points which can be separated into two classes by hyperplane. There may be many hyperplanes that can classify the data points 27
into two classes. One reasonable choice is the hyperplane that represents the largest separation, or maximum margin, between the two classes. Figure 3.4 shows the maximum margin and margins for an SVM.

Figure 3.4 Example of SVM maximum margin and margin

If the data sets are linearly separable, we select two parallel hyperplanes between which the distance is maximized. The area bounded by them is the margin, in which no data points are located. Therefore, there is a pair (w, b) that satisfies the following inequalities [3]:

w · x_i + b >= 1,  for y_i = +1,
w · x_i + b <= -1, for y_i = -1,

where x_i is an n-dimensional vector,
w is the normal vector to the hyperplane, and b is the offset.

The two constraints can be rewritten as:

y_i (w · x_i + b) >= 1

When we linearly classify the two classes, the best hyperplane can be found by solving the quadratic programming optimization problem:

minimize over (w, b):  (1/2) ||w||^2
subject to:  y_i (w · x_i + b) >= 1, for i = 1, ..., n

The data points lying on the margin, i.e., those satisfying y_i (w · x_i + b) = 1, are the support vectors, whose linear combination represents the solution (see Figure 3.4).

A general pseudo-code for SVMs is illustrated in the following process [3]:

1) Introduce positive Lagrange multipliers α_i, one for each of the inequality constraints. This gives the Lagrangian:

   L_P = (1/2) ||w||^2 - Σ_i α_i [ y_i (w · x_i + b) - 1 ]

2) Minimize L_P with respect to w, b.
3) Compute the quadratic programming solution w, b.
4) In the solution, the points for which α_i > 0 are called support vectors.
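As a quick numeric check of the margin constraints, consider an assumed two-dimensional example (illustrative values, not data from the thesis): take w = (1, 0), b = 0, with points x_1 = (1, 0), y_1 = +1 and x_2 = (-1, 0), y_2 = -1. Then

```latex
% Both constraint values evaluate exactly to 1:
y_1\,(w \cdot x_1 + b) = (+1)(1 + 0) = 1,
\qquad
y_2\,(w \cdot x_2 + b) = (-1)(-1 + 0) = 1.
```

Both points satisfy the constraint with equality, so both are support vectors, and the margin width between the two bounding hyperplanes is 2 / ||w|| = 2.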
3.2 Sentiment Analysis of Tweets

Extract Features from Tweets

We extract features from the tweets collected by a Python program for the subsequent sentiment classification. The process of parsing a Tweet post and obtaining unigrams is as follows:

Decode: the datasets collected from the API come in JSON; they are decoded into Python data structures for further processing (e.g., JSON: [{"text": "tweet", "truncated": false, "test": [6, 14]}]; Python: [{u'text': u'tweet', u'truncated': False, u'test': [6, 14]}]).

Filtering: we extract the text element (the tweet content) from the tweet, which is now a Python data structure (e.g., "Everyone in Hawai i is so nice."), and then convert the text into lower case (e.g., "everyone in Hawai i is so nice.").

Tokenization: we parse the data by splitting it on spaces. We encode the text in UTF-8 to get rid of Unicode errors and strip the punctuation from the text.

Classifier

Since Naive Bayes is fast, space efficient, and not sensitive to irrelevant features, in this research we use the Naive Bayes classifier, which is based on Bayes' theorem (Anthony J, 2007):

p(w | T) = p(T | w) p(w) / p(T)

where w is a sentiment word and T is a Twitter message [3].
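The decode, filter, and tokenize steps above can be sketched as follows (a minimal sketch; the function name and the punctuation handling are illustrative, not the thesis code):

```python
import json
import string

def extract_unigrams(raw_json):
    """Decode a JSON array of tweets and return lower-cased unigrams per tweet."""
    tweets = json.loads(raw_json)       # decode: JSON -> Python data structures
    results = []
    for tweet in tweets:
        text = tweet["text"].lower()    # filtering: keep the text element, lower-case it
        # tokenization: strip punctuation, then split on spaces
        text = text.translate(str.maketrans("", "", string.punctuation))
        results.append(text.split())
    return results

raw = '[{"text": "Everyone in Hawaii is so nice!", "truncated": false}]'
print(extract_unigrams(raw))
# [['everyone', 'in', 'hawaii', 'is', 'so', 'nice']]
```

The resulting unigram lists are what the classifier below consumes.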
The Naive Bayes classifier adds strong independence assumptions to Bayes' theorem. The probabilistic model for the classifier can therefore be described by the posterior probabilities p(positive | T) and p(negative | T). Comparing these two probabilities, the larger one indicates the class label that is more likely to be the actual label. If the ratio R = log( p(positive | T) / p(negative | T) ) is larger than 0, then the predicted positive attitude is more likely to be true; otherwise, the predicted negative attitude is more likely to be true.

During the sentiment analysis, the Naive Bayes classifier classifies a Tweet into a positive class or a negative class by examining the words in the Tweet. Each word is labeled as positive or negative according to the lexicon. In the Naive Bayes classification, the number of sentiment words is counted: if more positive words than negative words appear in a Tweet, the Tweet is labeled as positive; if fewer positive words than negative words appear, the Tweet is labeled as negative. Neutral words are ignored in this study since they contain no valuable information for sentiment analysis. The algorithm judges the polarity of the text by checking the words in the Tweet and finally outputs the individual's view. Figure 3.5 shows the workflow of sentiment analysis [17].
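The word-counting rule above can be sketched as follows (a minimal sketch; the tiny lexicon here is an illustrative stand-in for the full sentiment lexicon used in the study):

```python
# Count-based polarity labeling: compare positive vs. negative word counts.
POSITIVE = {"lovely", "beautiful", "fancy", "nice", "good"}
NEGATIVE = {"bad", "awful", "terrible"}

def label_tweet(tokens):
    pos = sum(1 for w in tokens if w in POSITIVE)
    neg = sum(1 for w in tokens if w in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # equal counts carry no usable sentiment signal

tweet = "lovely turtle beautiful fish and bad weather but still a fancy trip".split()
print(label_tweet(tweet))  # 3 positive words vs. 1 negative word -> positive
```

On the sample tweet from Table 3.1, three positive words against one negative word yield the positive label.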
Figure 3.5 The workflow of opinion mining

3.3 Run on Hadoop

Hadoop MapReduce and HDFS

Hadoop has a master-slave architecture which consists of HDFS and MapReduce [14]. The big Twitter datasets are stored in HDFS, from which the data is read for processing, and the computational layer's job is done by MapReduce [18]. The MapReduce master is responsible for deciding where computational work should be scheduled on the slave nodes. The HDFS master is responsible for partitioning the
storage across the slave nodes and keeping track of where data is located [18]. Figure 3.6 shows the Hadoop MapReduce and HDFS architecture.

Figure 3.6 The Hadoop MapReduce and HDFS architecture
MapReduce breaks the sentiment analysis processing into a map phase and a reduce phase, which are executed by MyMapper.py and MyReducer.py respectively. The map phase outputs key-value pairs. After being sorted by the Unix built-in sort program, the key-value pairs are processed by the reduce phase, which writes out the results to HDFS. The following three steps and Figure 3.7 describe the MapReduce process:

1. Map process: the datasets are split into distinct keys and values.
2. Shuffle and sort process: the key-value pairs are shuffled and sorted by key into some logical order.
3. Reduce process: the data flowing into the reduce process, i.e., the output of the previous step grouped by key, has a function applied to each group.

Figure 3.7 A client submits a job to MapReduce [18]

MapReduce Functions

Hadoop Streaming, provided with the Hadoop distribution, is a utility that allows us to create and run Map/Reduce jobs with Python scripts [23]. It handles passing data between our map and reduce functions. Since it allows us to use standard input and standard output,
we write our map and reduce functions in Python, reading input data via Python's sys.stdin and printing output data via Python's sys.stdout [9]. The script MyMapper.py reads data from STDIN, splits it into words, and passes them line by line to STDOUT. The map script outputs key-value pairs which are not sorted; the intermediate sort work is done by the sort program built into UNIX-based systems. After being sorted by key, the sorted key-value pairs are read line by line by the MyReducer.py script through STDIN, and the final result is written to STDOUT [9].
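This streaming pipeline can be simulated locally in plain Python (a hedged sketch: the thesis does not reproduce MyMapper.py or MyReducer.py, so the tab-separated key-value format and the tiny polarity lexicon here are assumptions, and `sorted()` stands in for the Unix sort step between the two scripts):

```python
def mapper(lines):
    """MyMapper.py-style map phase: read lines, emit 'key\\tvalue' pairs."""
    positive = {"nice", "lovely", "fancy"}
    negative = {"bad", "awful"}
    for line in lines:
        words = line.strip().lower().split()
        pos = sum(w in positive for w in words)
        neg = sum(w in negative for w in words)
        label = "positive" if pos > neg else "negative" if neg > pos else "neutral"
        yield f"{label}\t1"

def reducer(sorted_lines):
    """MyReducer.py-style reduce phase: sum values over each run of equal keys."""
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

# Locally, sorted() plays the role of the intermediate Unix sort.
tweets = ["nice lovely trip", "bad weather", "fancy beach nice day"]
for out in reducer(sorted(mapper(tweets))):
    print(out)
# negative	1
# positive	2
```

The reducer relies only on equal keys being adjacent after the sort, which is exactly the contract Hadoop Streaming provides between the map and reduce stages.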
1. Mapper: Figure 3.8 shows the mapper flowchart.

Figure 3.8 The Mapper flowchart
2. Reducer: Figure 3.9 shows the reducer flowchart.

Figure 3.9 The Reducer flowchart
CHAPTER 4 EXPERIMENTS AND RESULTS

In this chapter, we present experiments and results for two classification tasks: sentiment classification for the Scottish independence vote (positive vs. negative) and sentiment classification for Hawai i tourism spots (positive vs. negative). For each sentiment classification task, we follow the procedure described in Figure 4.1 below, and the Naive Bayes classifier is applied to classify the datasets into positive and negative classes.

Figure 4.1 Data analysis process

4.1 Scottish Independence Vote Analysis

The Scottish independence vote was a referendum on Scottish independence which took place in Scotland on 18 September 2014. The voters answered "Should Scotland be an independent country?" with Yes or No to decide whether Scotland should be independent [19]. We extracted tweets from Twitter for opinion mining to predict the result of the voting.

To make sure all the datasets we collected from Twitter refer to the Scottish independence vote, we used keywords concerning the event as search arguments. We extracted tweets
via the Twitter Stream API over frequent intervals; thus, we had the timestamp, author, and tweet text for opinion evaluation. About one million tweets were gathered over a period of ten days around the Scottish independence vote date. Since the independence polling took place on 18 September, we extracted tweets from 11 September to 20 September for sentiment analysis.

Figure 4.2 The curve for tweets over the collecting period based on different keywords (keywords: Scotland, Scottish, Vote, Independence, Independent; x axis: date, 09/11/14 to 09/20/14; y axis: tweets amount)

Figure 4.2 shows the time-series trend in the amount of tweets about the Scottish polling over the collecting period. We can observe that the busiest time for the voting is September 18, which is reasonable since the event happened on that day. After September 18, fewer and fewer people discussed the event since the polling process had ended; thus, fewer and fewer tweets concerning the topic could be collected and the curves come down.
Figure 4.3 Scottish independence vote polarity values (series: Total, Positive, Negative, Neutral; x axis: date, 09/11/14 to 09/20/14; y axis: tweets amount)

Figure 4.3 displays the amount of tweets reflecting people's attitudes over time. As we can read from the figure, most tweets about the independence vote were published around the polling date, and many Twitter users have a neutral attitude compared to positive and negative attitudes.
Figure 4.4 The attitude distribution based on different keywords (pie charts A-E: positive, negative, and neutral percentages for each keyword)

The pie charts A, B, C, D, and E of Figure 4.4 show the attitude distribution based on the different keywords.
Figure 4.5 Distribution of authors' political standpoint toward the Scottish independence vote: values of positive vs. negative

Figure 4.5 shows the authors' political standpoint toward the Scottish independence vote. The x axis represents the time period over the ten days; thus, there are 240 hours in total. The y axis represents the degree of the authors' attitude. The blue points represent positive results, standing for supporting the independence of Scotland, while negative results are marked by red points, standing for opposing Scottish independence. As we can observe, the peak point appeared on September 18.

4.2 Some Hawai i Tourism Sites Analysis

The Hawai i islands, which are Hawai i, O ahu, Maui, Kaua i, and Lāna i, are located in the Pacific Ocean and have significant tourism [24]. In 2013, according to the Hawai i government's 2013 annual report, there were over 8 million visitors to the Hawaiian Islands, with expenditures of over $15 billion [25]. The most popular times for tourists are the summer months and major holidays; therefore, our tweet collecting period is from August 23 to September 20, 2014. In this study, we mainly collect data concerning these topics: Hawai i, the name of the islands; Waikīkī, well known for Waikīkī beach, which
is the most popular beach on O ahu; Diamond head, the name of a volcanic cone and a major tourist attraction on O ahu; Hanauma bay, famous for snorkeling; and Hawai i airlines, the largest airline in Hawai i [26][27][28][29].

Figure 4.6 shows the distribution of attitude polarity over the August 23 to September 20 period based on the keyword Hawai i. The blue circles in the figure represent the positive attitude degree, the red stars represent the negative attitude, and the pink points represent the average of the positive and negative values. As we can observe, all the average points are above the zero line over the collecting period; thus, the Twitter authors have positive comments on Hawai i.

Figure 4.6 Distribution of attitude polarity over the collecting period based on keyword Hawai i (series: Positive, Negative, Average)
Figure 4.7 shows the distribution of attitude polarity over the August 23 to September 20 period based on the keyword Waikīkī. We can observe that the pink points, which are the averages of the positive and negative values, are above the zero line. Therefore, from the figure, we can conclude that authors have a positive attitude toward Waikīkī.

Figure 4.7 Distribution of attitude polarity over the collecting period based on keyword Waikīkī (series: Positive, Negative, Average)

Figure 4.8 describes the distribution of attitude polarity over the August 23 to September 20 period based on the keyword Diamond head. People's average attitude toward Diamond head is positive, since the average values, represented by the pink points, are above the zero line.
Figure 4.8 Distribution of attitude polarity over the collecting period based on keyword Diamond head (series: Positive, Negative, Average)

Figure 4.9 shows the distribution of attitude polarity over the August 23 to September 20 period based on the keyword Hanauma bay. The average values are above the zero line; thus, authors have positive comments on Hanauma bay.
Figure 4.9 Distribution of attitude polarity over the collecting period based on keyword Hanauma bay (series: Positive, Negative, Average)

4.3 Performance Environment

In this experiment, we use one server as the master and two servers as slaves, on all of which we installed the Ubuntu operating system and Hadoop. The environment of the experiment is described in Table 4.1 below.
Table 4.1 The environment of the experiment
CHAPTER 5 CONCLUSION AND OPEN ISSUES

This study presents a method to collect datasets concerning specific topics from the Twitter database via the Twitter API. We extract features from the Tweets and use a Naive Bayes classifier to separate the data into two classes, positive and negative, for opinion evaluation toward selected topics and issues. In this study, we store the original datasets in the HDFS filesystem and analyze them using the Hadoop MapReduce model. We visualize the analysis results using Matlab. The experimental results show that the presented method performs efficiently.

Although this thesis evaluates the views of Twitter authors, predicting the Scottish independence vote result and analyzing tourists' attitudes toward some popular tourist attractions in Hawai i, there are many open issues that still require further investigation and research work. Some of the open issues that are worth attention in relation to this thesis work are discussed here. This thesis uses the Naive Bayes classifier for classification; in future work, we may modify it to improve its performance or try other classifiers to overcome the independence assumption.
REFERENCES

[1] Matthew A. Russell, Mining the Social Web, O'Reilly, 2011.
[2]
[3] S. B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques," Proceedings of the 2007 Conference on Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies, pp. 3-24, June 2007.
[4] Alexander Pak and Patrick Paroubek, "Twitter as a Corpus for Sentiment Analysis and Opinion Mining," LREC 2010, Seventh International Conference on Language Resources and Evaluation, May 2010.
[5] "Predicting the Future with Social Media," 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 1, 2010.
[6] Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau, "Sentiment Analysis of Twitter Data," LSM '11 Proceedings of the Workshop on Languages in Social Media, pp. 30-38, June 2011.
[7] Hsiang Hui Lek and D. C. C. Poo, "Aspect-based Twitter Sentiment Classification," 2013 IEEE 25th International Conference on Tools with Artificial Intelligence (ICTAI), November 2013.
[8]
[9] Tom White, Hadoop: The Definitive Guide, Third Edition, O'Reilly, 2012.
[10]
[11]
[12]
[13]
[14]
[15]
[16] Bing Liu, Sentiment Analysis and Opinion Mining, Graeme Hirst (series ed.), 2012.
[17] Shamanth Kumar, Fred Morstatter, and Huan Liu, Twitter Data Analytics, Springer, 2013.
[18] Alex Holmes, Hadoop in Practice, Manning Shelter Island, 2012.
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
NAIVE BAYES ALGORITHM FOR TWITTER SENTIMENT ANALYSIS AND ITS IMPLEMENTATION IN MAPREDUCE. A Thesis. Presented to. The Faculty of the Graduate School
NAIVE BAYES ALGORITHM FOR TWITTER SENTIMENT ANALYSIS AND ITS IMPLEMENTATION IN MAPREDUCE A Thesis Presented to The Faculty of the Graduate School At the University of Missouri In Partial Fulfillment Of
Internals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
Log Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]
Chase Wu New Jersey Ins0tute of Technology
CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at
Processing of Hadoop using Highly Available NameNode
Processing of Hadoop using Highly Available NameNode 1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar Department of computer Engineering Smt. Kashibai Navale
HADOOP - MULTI NODE CLUSTER
HADOOP - MULTI NODE CLUSTER http://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm Copyright tutorialspoint.com This chapter explains the setup of the Hadoop Multi-Node cluster on a distributed
Manifest for Big Data Pig, Hive & Jaql
Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,
A Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
map/reduce connected components
1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains
Sentiment analysis on tweets in a financial domain
Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International
International Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
研 發 專 案 原 始 程 式 碼 安 裝 及 操 作 手 冊. Version 0.1
102 年 度 國 科 會 雲 端 計 算 與 資 訊 安 全 技 術 研 發 專 案 原 始 程 式 碼 安 裝 及 操 作 手 冊 Version 0.1 總 計 畫 名 稱 : 行 動 雲 端 環 境 動 態 群 組 服 務 研 究 與 創 新 應 用 子 計 畫 一 : 行 動 雲 端 群 組 服 務 架 構 與 動 態 群 組 管 理 (NSC 102-2218-E-259-003) 計
Distributed Filesystems
Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science [email protected] April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls
Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
Hadoop. History and Introduction. Explained By Vaibhav Agarwal
Hadoop History and Introduction Explained By Vaibhav Agarwal Agenda Architecture HDFS Data Flow Map Reduce Data Flow Hadoop Versions History Hadoop version 2 Hadoop Architecture HADOOP (HDFS) Data Flow
Hadoop Big Data for Processing Data and Performing Workload
Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer
Detection of Distributed Denial of Service Attack with Hadoop on Live Network
Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,
Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model
Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model Condro Wibawa, Irwan Bastian, Metty Mustikasari Department of Information Systems, Faculty of Computer Science and
How To Analyze Sentiment On A Microsoft Microsoft Twitter Account
Sentiment Analysis on Hadoop with Hadoop Streaming Piyush Gupta Research Scholar Pardeep Kumar Assistant Professor Girdhar Gopal Assistant Professor ABSTRACT Ideas and opinions of peoples are influenced
Map Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
Reduction of Data at Namenode in HDFS using harballing Technique
Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu [email protected] [email protected] Abstract HDFS stands for the Hadoop Distributed File System.
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Certified Big Data and Apache Hadoop Developer VS-1221
Certified Big Data and Apache Hadoop Developer VS-1221 Certified Big Data and Apache Hadoop Developer Certification Code VS-1221 Vskills certification for Big Data and Apache Hadoop Developer Certification
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
Installing Hadoop. Hortonworks Hadoop. April 29, 2015. Mogulla, Deepak Reddy VERSION 1.0
April 29, 2015 Installing Hadoop Hortonworks Hadoop VERSION 1.0 Mogulla, Deepak Reddy Table of Contents Get Linux platform ready...2 Update Linux...2 Update/install Java:...2 Setup SSH Certificates...3
Keywords social media, internet, data, sentiment analysis, opinion mining, business
Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Real time Extraction
Testing Big data is one of the biggest
Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing
Running Kmeans Mapreduce code on Amazon AWS
Running Kmeans Mapreduce code on Amazon AWS Pseudo Code Input: Dataset D, Number of clusters k Output: Data points with cluster memberships Step 1: for iteration = 1 to MaxIterations do Step 2: Mapper:
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
