SENTIMENT ANALYSIS OF BIG SOCIAL DATA WITH APACHE HADOOP
SENTIMENT ANALYSIS OF BIG SOCIAL DATA WITH APACHE HADOOP

A THESIS SUBMITTED TO THE GRADUATE DIVISION OF THE UNIVERSITY OF HAWAIʻI AT MĀNOA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN ELECTRICAL ENGINEERING

DECEMBER 2014

By Qiuling Kang

Thesis Committee: Xiangrong Zhou, Chairperson; Galen Sasaki; Rui Zhang

Keywords: Twitter, Sentiment Analysis, Hadoop MapReduce, HDFS
ACKNOWLEDGMENTS

I would like to express my appreciation to my advisor, Professor Zhou, for his patience and help throughout my master's program. Thanks to his generous and valuable suggestions, I was able to complete my master's project on time. I would also like to thank Professor Sasaki for his guidance during my study in the EE department and for reviewing this manuscript. In addition, I would like to thank Professor Zhang for reviewing my thesis; his insightful suggestions made it better.
ABSTRACT

Twitter is a microblog service and a very popular communication mechanism. Users of Twitter express their interests, preferences, and sentiments towards various topics and issues they encounter in daily life. Twitter is therefore an important online platform for people to express opinions, which are a key factor influencing their behavior. Thus, sentiment analysis of Twitter data is meaningful for both individuals and organizations when making decisions. Because of the huge amount of data generated by Twitter every day, storing and processing this big data becomes a challenge. In this study, we present a method to collect Twitter data sets and to store and analyze them on the Hadoop platform. The experimental results show that the presented method performs efficiently.
TABLE OF CONTENTS

Acknowledgments ... ii
Abstract ... iii
Table of Contents ... iv
List of Tables ... vi
List of Figures ... vii
Chapter 1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Contribution of the Thesis
  1.4 Thesis Overview
Chapter 2 Twitter Data Collection and Storage
  2.1 Twitter Data Collection
    2.1.1 Twitter API Introduction
    2.1.2 Tweepy Introduction and Installation
    2.1.3 Collection Procedure
  2.2 Storage in HDFS Filesystem
Chapter 3 Sentiment Analysis of Tweets in Hadoop System
  3.1 Algorithm Selection
    3.1.1 Decision Trees
    3.1.2 Naive Bayes Classifiers
    3.1.3 Support Vector Machines
  3.2 Sentiment Analysis of Tweets
    3.2.1 Extract Features from Tweets
    3.2.2 Classifier
  3.3 Run on Hadoop
    3.3.1 Hadoop MapReduce and HDFS
    3.3.2 MapReduce Functions
Chapter 4 Experiments and Results
  4.1 Scottish Independence Vote Analysis
  4.2 Some Hawaiʻi Tourism Sites Analysis
  4.3 Performance Environment
Chapter 5 Conclusion and Open Issues
References
LIST OF TABLES

Table 1.1 Traditional RDBMS compared to Hadoop
Table 3.1 Words with sentiment polarity in tweet
Table 4.1 The environment of the experiment
LIST OF FIGURES

Figure 1.1 The increasing trends of tweets in recent years
Figure 2.1 The process of requesting from REST APIs
Figure 2.2 Streaming APIs working procedure
Figure 2.3 Twitter dataset collection procedure
Figure 2.4 Create a Twitter application
Figure 2.5 Create a Twitter application
Figure 2.6 Obtain token
Figure 2.7 The procedure of data collection
Figure 2.8 Sample of collected files
Figure 2.9 Architecture of HDFS
Figure 2.10 HDFS workflow
Figure 3.1 The process of sentiment analysis
Figure 3.2 An example of a decision tree
Figure 3.3 Naive Bayes classifier represented by graph
Figure 3.4 Example of SVM maximum margin and margin
Figure 3.5 The workflow of opinion mining
Figure 3.6 The Hadoop MapReduce and HDFS architecture
Figure 3.7 A client submits a job to MapReduce
Figure 3.8 The Mapper flowchart
Figure 3.9 The reducer flowchart
Figure 4.1 Data analysis process
Figure 4.2 The curve for tweets based on different keywords
Figure 4.3 Scottish independence vote polarity values
Figure 4.4 The attitude distribution based on different keywords
Figure 4.5 Values of positive vs negative
Figure 4.6 Distribution of attitude polarity based on keyword Hawaii
Figure 4.7 Distribution of attitude polarity based on keyword Waikiki
Figure 4.8 Distribution of attitude polarity based on keyword Diamond Head
Figure 4.9 Distribution of attitude polarity based on keyword Hanauma Bay
CHAPTER 1 INTRODUCTION

1.1 Background

Microblogging websites have become one of the major sources of information. Twitter is one such popular microblog: an online social networking platform that allows people to publish messages expressing their interests, preferences, opinions, and sentiments towards various topics and issues they encounter in daily life. The messages, called tweets, are real-time and at most 140 characters each [1]. About 200 billion tweets are published per year, 500 million per day, 350,000 per minute, and 6,000 per second [2]. Figure 1.1 shows the increasing trend of tweets in recent years. Such a huge amount of data can be used efficiently for social network studies and analysis to gain useful and meaningful results.

Figure 1.1 The increasing trends of tweets in recent years [2]

There is previous research on sentiment analysis of Twitter data. Pak and Paroubek (2010) performed linguistic analysis of collected tweets and showed how to build a sentiment classifier using training data [4]. Sitaram Asur and Bernardo A. Huberman
(2010) showed that social media content can be used to predict real-world performance; they built a linear-regression model for forecasting the box-office revenues of movies [5]. Apoorv Agarwal and Boye Xie (2011) introduced POS-specific prior polarity features and designed a new tree representation for a tree-kernel-based model [6]. Hsiang Hui Lek and Danny C.C. Poo (2013) proposed aspect-based sentiment classification, which improves on existing tweet-level classifiers [7]. Classification techniques are fundamental to analyzing the sentiment of social data. S.B. Kotsiantis (2007) reviewed recent classification techniques, discussing and comparing several supervised learning algorithms: logic-based algorithms such as decision trees and rule-based algorithms [3]; perceptron-based techniques, such as single-layer and multilayer perceptrons; Radial Basis Function (RBF) networks; statistical learning algorithms such as the Naive Bayes classifier and Bayesian networks; instance-based learning; and Support Vector Machines.

1.2 Motivation

Since the number of internet users of social networking platforms and services grows fast, more and more data from these platforms can be used for data mining studies. For example, a government may be interested in people's attitude toward a vote. It may wish to predict the result with questions such as the following [4]:

1. Could the new policy get the support of most people?
2. How positive (or negative) are people about the new policy?
3. What kind of people have the most influence on the result?

Also, a local tourism company may be interested in which places are popular among tourists. It may want answers to questions such as the following [4]:

1. Which tourist attractions are most visited by tourists?
2. How positive (or negative) are people about the tourist attraction?
3. At which time of day do people prefer to travel?

In this thesis, we show how we use datasets from a microblogging platform for data mining. We collected big datasets from the Twitter database and analyzed them. There are several reasons we use the Twitter data set for opinion mining [4]:

1. Valuable data source: Twitter is a social network platform used by various people to post their opinions on various topics and discuss current issues.
2. Sufficient data: the volume of tweets grows at high rates, so sufficient data can be gathered for data mining.
3. Variety of users: internet users come from a variety of groups, for example, researchers, politicians, students, farmers, workers, artists, and so on.

We collected more than one million tweets published on Twitter. They are separated into two sets:

1. Two thirds of the tweets concern the topic of the Scottish independence vote in September 2014.
2. One third of the tweets talk about tourism in Hawaiʻi.

As we are going to process a huge amount of data, the problem of how to store large datasets and improve computing performance becomes significant and cannot be ignored. There are several reasons that we use the Hadoop system instead of a Relational Database Management System (RDBMS) in this study:

Access speed [8][9]: Although the storage capacity of a single disk has grown considerably over the years, the speed of reading data from the disk has not kept up, so it can take a long time to read a large dataset from a hard drive. However,
the MapReduce model built on Hadoop is effective when unstructured data are combined from different nodes for merging and sorting.

Data duplication [8][9]: It is necessary to duplicate data to distinct storage systems to avoid the problems brought by hardware failure. Furthermore, an RDBMS with many disks is poorly suited to large-scale computation: if seek time dominates data access time, reading and writing large portions of the dataset takes longer than streaming through it. That is to say, if we update large portions of a database, the RDBMS works less efficiently than MapReduce built on Hadoop. The Hadoop Distributed File System (HDFS) stores data in a distributed fashion, duplicating data sets across a cluster of hard drives so that a single disk failure does not cause data loss.

Linear scalability [8][9]: When gigabytes of structured data are computed, an RDBMS needs to be highly integrated. However, Hadoop can store very large datasets and process petabytes of structured, semi-structured, or unstructured data with linear scaling and low integration. If we increase the number of nodes in the cluster, the speed of processing data increases proportionally; this does not hold for SQL queries.

Table 1.1 shows the comparison between traditional RDBMS and Hadoop. We show how to collect datasets from the Twitter database via the Twitter API and perform sentiment analysis of the collected datasets on the Hadoop distributed system.
Table 1.1 Traditional RDBMS compared to Hadoop [9]

1.3 Contribution of the Thesis

This thesis presents a method to collect a huge amount of data concerning specific topics from the Twitter database via the Twitter API. The study extracts features from tweets and uses a sentiment classifier to classify the tweets into positive-attitude and negative-attitude classes, in order to analyze people's opinions toward specific topics and issues. The study stores and analyzes the datasets using HDFS and the Hadoop MapReduce model respectively, which are more scalable and efficient than a traditional RDBMS. The experimental results show that the presented method performs efficiently.

1.4 Thesis Overview

The rest of the thesis is organized as follows. Chapter 2 presents Twitter data set collection and storage. Chapter 3 introduces sentiment analysis on the Hadoop system. Chapter 4 describes the experiments and results. Finally, Chapter 5 concludes the thesis and indicates future work.
CHAPTER 2 TWITTER DATA COLLECTION AND STORAGE

Millions of tweets are generated by Twitter users per day [4]. Through the Twitter API (Application Programming Interface), researchers and developers can collect large public data sets from the Twitter database. Twitter provides two types of APIs for accessing Twitter data: REST APIs and Streaming APIs [22]. With the REST APIs, users request information explicitly to retrieve tweets; these APIs give access to some of the core primitives of Twitter, including timelines, status updates, and user information. With the Streaming APIs, users can continuously collect a stream of public information: they request real-time data in large quantities, filtered by tracked keywords, geographic area, user, or a random sample. As long as a long-lived connection is maintained, users get a continuous stream of updates. In this research, we retrieve Twitter data via the Streaming APIs, because our research objective, analyzing user sentiments about given topics, requires collecting the messages users publish. To make calls to Twitter's APIs, we must use the authentication method Twitter supports: OAuth (Open Authentication), an open standard for authenticating access to protected information. After obtaining the data set, we store the data in HDFS.

2.1 Twitter Data Collection

2.1.1 Twitter API Introduction

Twitter has two types of APIs for users to access Twitter data: REST APIs and Streaming APIs. The REST APIs do not require users to keep a persistent HTTP (Hypertext Transfer Protocol) connection open. The user makes one or more requests to a web
application and then receives the results of the initial request. Figure 2.1 shows the process of requesting from the REST APIs [14].

Figure 2.1 The process of requesting from REST APIs [14]

Twitter offers three types of endpoints for the Streaming APIs: public streams carry the flow of public tweets; user streams are single-user streams corresponding to the view of a single user; site streams are multi-user streams intended for servers connecting on behalf of many Twitter users. In this research, we collect real-time tweets via the Twitter Streaming APIs. Figure 2.2 shows how the Streaming APIs work. Before the result is stored into a data store, the incoming tweets are processed as a stream: parsed, filtered, and/or aggregated first. To respond to user requests, HTTP queries read results from the data store [14].
Figure 2.2 Streaming APIs working procedure [14]

2.1.2 Tweepy Introduction and Installation

Tweepy is open-source and provides access to the documented Twitter API from Python. It supports accessing Twitter through OAuth, which is the only authentication method Twitter accepts for securing its information. OAuth offers several benefits: (1) it makes user information more secure; (2) it conceals the user's password; (3) if the user changes the password, the application still works, since the application does not rely on the password; (4) permissions are easily managed [20][21]. In order to get data from the Streaming APIs, our application should first obtain an OAuth access token and then install the Tweepy package, which is a Python library for accessing the Twitter API. The Tweepy API class provides access to the Twitter API methods, which accept various parameters and return response data. Therefore a copy of the Tweepy package is downloaded and installed on the Ubuntu Linux system. After installation, the collection Python script
is run to collect data and store the dataset on the local server. The whole collecting procedure is shown in Figure 2.3.

Figure 2.3 Twitter dataset collection procedure

In this thesis, the Tweepy package, which requires Python 2.5 or later, has been installed on an Ubuntu Linux system. There are three steps to complete the installation:

1. Download the Tweepy package to the local server:
$ git clone git://github.com/tweepy/tweepy.git
2. Enter the tweepy directory:
$ cd tweepy
3. Install Tweepy with administrator or root privileges:
$ python setup.py install

In order to start the collection process, a client application is registered and a new application is created with Twitter. We log in to the portal and then go to My Applications. After filling in the information shown in Figure 2.4, a new application can be created. We use the application information to communicate with the Twitter API to retrieve data sets. As shown in the figure, we enter the application name For Hadoop in the Name field, enter This application is used for test in the Description field, and enter a placeholder in the Website field since we do not have a URL. Finally, we check Yes, I agree and click Create your Twitter application to complete the creation process.
Figure 2.4 Create a Twitter application

After creating the application, we generate the access token. As shown in Figure 2.5, we click Create my access token at the end of the form.
Figure 2.5 Create a Twitter application

Next, we get the access token and access token secret presented in Figure 2.6. All the information we need to communicate with the Twitter API is included in the figure: owner, owner ID, API key, and API secret.
Figure 2.6 Obtain token

2.1.3 Collection Procedure

The collection procedure is presented in Figure 2.7. It includes the following four steps:
Start → Set OAuth authentication with username and password → Set request parameters → Set filter method → Access the Twitter API and collect data → End

Figure 2.7 The procedure of data collection

1. Set OAuth authentication with tokens using Tweepy: Twitter utilizes OAuth to provide authorized access to its API and requires all requests to use OAuth for authentication [11]. The following code shows how to use Tweepy with OAuth to access the Twitter API.

1. consumer_key = "tcytpmwiwlxynbdcs9ipg"
2. consumer_secret = "7BFXcq07s5y4YrjwjP6p3t4cYu0ojeTFG9vq98rE8"
3. access_token = " Nu3991UKfyVIjacGHNnxKmBykHj5W5zX0g89 kn4k"
4. access_token_secret = "KuINYWTDE1fd5QVmRlVsMBmLTDdgMoq2MnyFmo4pG7gv1 "
5. auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
6. auth.set_access_token(access_token, access_token_secret)
7. api = tweepy.API(auth)
We use the consumer_key, consumer_secret, access_token, and access_token_secret shown in Figure 2.6 to create the OAuth access. Lines 5 and 6 show how the OAuth process works. In line 7, we create the actual API interface using the authentication.

2. Set request parameters: in this thesis, a list of keywords and a list of longitude, latitude pairs are used to specify the tweets that will be returned from the Twitter stream. We set the parameters in the following format:

track = [ key words 1, key words 2, key words 3 ]
follow = []
geo_location = [ , 21.31, , 21.71, , 20.59, , , , 21.89, , 22.24, , , , ]

At least one of the three parameters track, follow, and locations should be specified. The parameter track is a comma-separated list of phrases used to determine which tweets will be returned. A phrase can contain one or several words separated by spaces and must be less than 60 bytes [10][12]. The parameter follow is a comma-separated list of user IDs to track; by setting it we can collect the tweets of specific users [10][12]. The parameter locations is a list of longitude, latitude pairs that define a bounding box of the geographic area to track; all tweets within that area are retrieved [10][12]. For example, by setting longitude and latitude pairs that bound the Hawaiian Islands, we can track tweets from the Hawaiʻi area.

3. Use the filter method to collect the tweets matching the request parameters. We call filter() with the parameters in the following form:

stream.filter(track = track, follow = follow, locations = geo_location)
4. The tweets that match one or more filter parameters are returned and stored on the local server. These tweets are encoded in JSON (JavaScript Object Notation), a lightweight data-interchange format that is easy for humans to read and write and for machines to parse and generate [13]. In this thesis, the Tweepy package is used to access the Streaming API and gather the tweets back encoded in JSON. Figure 2.8 presents part of the Twitter data set we collected.

Figure 2.8 Sample of collected files

More than one million tweets have been collected from the Twitter API in this thesis. One of the tweets is shown in the following:

{"created_at":"mon Aug 11 18:50: ",
"id": ,
"id_str":" ",
"text":"@crankydad I need to visit you the next time I'm over there meeting with Jay and Sara.",
"source":"\u003ca href=\" rel=\"nofollow\"\u003etwitter for iphone\u003c\/a\u003e",
"truncated":false,
"in_reply_to_status_id": ,
"in_reply_to_status_id_str":" ",
"in_reply_to_user_id": ,
"in_reply_to_user_id_str":" ",
"in_reply_to_screen_name":"crankydad",
"user":{"id": ,
"id_str":" ",
"name":"julieford",
"screen_name":"julieford808",
"location":"honolulu, Hawaii",
"url":" "description":"local girl after 20 years in Hawaii. Mom of a crazy toddler. Owner of Schweitzer Consulting, a PR consultancy.",
"protected":false,
"verified":false,
"followers_count":1159,
"friends_count":953,
"listed_count":49,
"favourites_count":19,
"statuses_count":4296,
"created_at":"sat Aug 02 22:50: ",
"utc_offset":-36000,
"time_zone":"hawaii",
"geo_enabled":true,
"lang":"en",
"contributors_enabled":false,
"is_translator":false,
"profile_background_color":"edece9",
"profile_background_image_url":" und_images\/ \/twilk_background.jpg",
"profile_background_image_url_https":" background_images\/ \/twilk_background.jpg",
"profile_background_tile":true,
"profile_link_color":"088253",
"profile_sidebar_border_color":"d3d2cf",
"profile_sidebar_fill_color":"e3e2de",
"profile_text_color":"634047",
"profile_use_background_image":true,
"profile_image_url":" \/04050c00d0da6c74b318be1e34f8a38d_normal.jpeg",
"profile_image_url_https":" \/04050c00d0da6c74b318be1e34f8a38d_normal.jpeg",
"profile_banner_url":" 0\/ ",
"default_profile":false,
"default_profile_image":false,
"following":null,
"follow_request_sent":null,
"notifications":null},
"geo":{"type":"point", "coordinates":[ , ]},
"coordinates":{"type":"point","coordinates":[ , ]},
"place":{"id":"c47c0bc571bf5427",
"url":" "place_type":"city",
"name":"honolulu",
"full_name":"honolulu, HI",
"country_code":"us",
"country":"united States",
"bounding_box":{"type":"polygon", "coordinates":[[[ , ],[ , ],[ , ],[ , ]]]},
"attributes":{}},
"contributors":null,
"retweet_count":0,
"favorite_count":0,
"entities":{"hashtags":[], "trends":[], "urls":[],
"user_mentions":[{"screen_name":"crankydad",
"name":"mike Gordon",
"id": ,
"id_str":" ",
"indices":[0,10]}],
"symbols":[]},
"favorited":false,
"retweeted":false,
"possibly_sensitive":false,
"filter_level":"medium",
"lang":"en"}

2.2 Storage in HDFS Filesystem

HDFS is a distributed file system with a master/slave architecture, built into the Hadoop platform. It is designed for storing large files, which could be hundreds of megabytes, gigabytes, or petabytes in size, and for running on clusters of computers that can be inexpensive commodity hardware and need not be highly reliable [9]. Figure 2.9 shows the HDFS architecture. In a cluster, HDFS consists of a single NameNode (the master) and a number of DataNodes (the slaves). The NameNode hosts the filesystem index, in the form of a namespace image and an edit log, and knows and manages the DataNodes from which the filesystem is reconstructed when the system starts. DataNodes store the data of the filesystem and retrieve blocks when the NameNode tells them to; they report their status to the NameNode periodically. There is also a secondary NameNode, which produces snapshots of the primary NameNode's memory structures to protect against filesystem corruption [9][14][15].
Figure 2.9 Architecture of HDFS

The HDFS cluster is set up at the beginning of the process, and then we transfer the collected data sets from the local system to HDFS for the subsequent sentiment analysis. Figure 2.10 shows the process of storing datasets into HDFS.

Figure 2.10 HDFS workflow

1. HDFS setup: we install the Hadoop platform on a cluster of servers and configure the NameNode and DataNode files. We execute the following commands to set up the system. Java 1.6.0_30 was installed on the cluster.
sudo chmod u+x jdk-6u30-linux-x64.bin
sudo ./jdk-6u30-linux-x64.bin
sudo chmod u+x jre-6u30-linux-x64.bin
sudo ./jre-6u30-linux-x64.bin

SSH was installed on the cluster:

sudo apt-get install ssh
sudo apt-get install rsync
sudo /etc/init.d/ssh start
Rather than invoking init scripts through /etc/init.d, use the service(8)
utility, e.g. service ssh start
Since the script you are attempting to invoke has been converted to an
Upstart job, you may also use the start(8) utility, e.g. start ssh
ps -ef | grep sshd
root :07 ? 00:00:00 /usr/sbin/sshd -D
hadoop :18 pts/1 00:00:00 grep --color=auto sshd
hdp@hadoop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
15:49:81:02:71:55:a9:a0:9d:a8:e6:4d:c1:00:ae:65 hadoop@hadoop
The key's randomart image is:
+--[ RSA 2048]----+
(randomart omitted)
+-----------------+
hdp@hadoop:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Install Hadoop on the cluster and configure the Hadoop configuration files:

export JAVA_HOME=/usr/local/java/jdk1.6.0_45 (into hadoop-env.sh)
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin
We configure core-site.xml, hdfs-site.xml, and mapred-site.xml respectively. The network interface is set up by connecting each server via a single hub. We assign an IP address to the master machine and to each of the slave machines respectively.

2. Initialize the system: the HDFS filesystem is formatted via the NameNode. The following command is executed:

hdp@master:/usr/local/hadoop$ bin/hadoop namenode -format

Start the system:

hdp@master:/usr/local/hadoop$ bin/start-all.sh

3. Transfer local data sets into HDFS:

hdp@master:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/inputtxt1 /user/hdp/inputtxt1

4. Run the MapReduce calculation:

hdp@master:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
> -file /home/hdp/MyMapper.py -mapper /home/hdp/MyMapper.py \
> -file /home/hdp/MyReducer.py -reducer /home/hdp/MyReducer.py \
> -input /user/hdp/inputtxt1/* -output /user/hdp/inputtxt1-output
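The streaming job above pipes each input line to MyMapper.py on stdin and the sorted, tab-separated key/value lines to MyReducer.py. The thesis's actual scripts are not reproduced here; the following is a minimal word-polarity sketch of the mapper and reducer logic, with illustrative stand-in word lists:

```python
# Illustrative sentiment word lists (stand-ins, not the thesis's lexicon).
POSITIVE = {"lovely", "beautiful", "fancy", "nice"}
NEGATIVE = {"bad", "awful", "terrible"}

def map_line(line):
    """Mapper logic: emit a ('positive', 1) or ('negative', 1) pair for each
    sentiment word in one input line (one tweet text). The real MyMapper.py
    would read lines from stdin and print one 'key<TAB>1' line per pair."""
    pairs = []
    for word in line.lower().split():
        if word in POSITIVE:
            pairs.append(("positive", 1))
        elif word in NEGATIVE:
            pairs.append(("negative", 1))
    return pairs

def reduce_pairs(pairs):
    """Reducer logic: sum the counts per key. Hadoop Streaming hands the
    reducer the mapper output sorted (hence grouped) by key, so summing
    into a dict is equivalent."""
    totals = {}
    for key, count in pairs:
        totals[key] = totals.get(key, 0) + count
    return totals
```

Wiring these into standalone stdin/stdout scripts, as the hadoop streaming invocation requires, is a few extra lines of boilerplate around each function.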
CHAPTER 3 SENTIMENT ANALYSIS OF TWEETS IN HADOOP SYSTEM

Sentiment analysis is the field of research that identifies and extracts subjective information from written language. It is also called opinion mining, and it aims to analyze people's attitudes, emotions, and opinions and to classify the polarity of a given text. Sentiment analysis usually classifies the given text into two classes, positive and negative [16]. The proposed process of sentiment analysis of tweets is described in Figure 3.1.

Figure 3.1 The process of sentiment analysis

The first step is collecting the dataset from the Twitter database. If there is no expert who can tell us which fields are the most informative, we could use a brute-force
method and gather everything from which relevant features might be isolated. However, a dataset collected by brute force may miss useful information and contain too much noise, so we need to define the keywords of the classifier. The second step is the definition of classifier keywords and data preparation. Keyword selection reduces the data size, removes many irrelevant and redundant features, and thus reduces noise. Tweets filtered by keywords are processed more effectively and faster by the data mining algorithm. In sum, a good selection of classifier keywords contributes to better analysis results.

3.1 Algorithm Selection

It is very important to choose a suitable algorithm for sentiment analysis. For the text classification problem, three methods can be applied: decision trees, Naive Bayes classification, and Support Vector Machines (SVMs) [3]. In the following sections, we introduce and compare them, then propose the method we use.

3.1.1 Decision Trees

Decision trees are tree-like graphs that classify instances by using a specific sorting algorithm. A decision tree uses decision nodes to test the attributes of an instance described by attribute values, each branch corresponds to an attribute value, and each leaf node represents a classification outcome. Classification starts from the root node, proceeds by sorting on attribute values, and ends at a leaf node. Figure 3.2 shows an example of a decision tree [3].
Figure 3.2 An example of a decision tree

The decision tree method is simple to understand and easy to implement. A general pseudocode for building a decision tree for sentiment analysis is shown as follows [3]:

Check for base cases
Create a node r for the tree
For each Tweet in Tweets do:
    If Tweet does not contain keywords, discard the Tweet.
    If Tweet contains keywords, do:
        add a new tree branch below r, corresponding to the test
        if keywords are positive, then:
            label the Tweet Positive attitude
        else add a new tree branch below r, corresponding to the test
            if keywords are negative, then:
                label the Tweet Negative attitude
            else label the Tweet Neutral attitude
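The pseudocode above can be turned into a runnable sketch. The keyword sets below are illustrative stand-ins, and breaking ties to Neutral is our choice, not something the pseudocode specifies:

```python
# Illustrative keyword sets (the thesis's actual keyword lists are not shown here).
POSITIVE_KEYWORDS = {"lovely", "beautiful", "fancy"}
NEGATIVE_KEYWORDS = {"bad", "awful"}

def label_tweet(tweet):
    """Walk the decision branches: discard tweets with no keywords,
    otherwise label by which keyword class dominates."""
    words = set(tweet.lower().split())
    pos = len(words & POSITIVE_KEYWORDS)
    neg = len(words & NEGATIVE_KEYWORDS)
    if pos == 0 and neg == 0:
        return None  # no keywords: discard the tweet
    if pos > neg:
        return "Positive attitude"
    if neg > pos:
        return "Negative attitude"
    return "Neutral attitude"  # equally many positive and negative keywords
```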
3.1.2 Naive Bayes Classifiers

Assume that there are two classes of keywords: w1 = positive, w2 = negative, and that the set of sentiment words in a tweet is represented as T. Define p(wj|T) as the probability of class wj, given that we have observed T. Bayesian classifiers use Bayes' theorem, which is described as follows [3]:

p(wj|T) = p(T|wj) p(wj) / p(T)

where
p(wj|T) is the probability of instance T being in class wj,
p(T|wj) is the probability of generating instance T given class wj,
p(wj) is the probability of occurrence of class wj,
p(T) is the probability of instance T occurring.

In order to classify T's attitude as positive or negative, the probabilities p(w1|T) and p(w2|T) are compared, and the larger one indicates the more likely class sentiment. We input the n sentiment words in a tweet as T = {t1, t2, ..., tn}, where ti = 1 when ti is a positive word and ti = 2 when ti is a negative word. We assume all ti are probabilistically independent, that there are k positive words in T, and that the following hold:

p(w1) = p(w2) = 0.5
p(ti = 1 | w1) >> p(ti = 2 | w1)
p(ti = 2 | w2) >> p(ti = 1 | w2),
p(ti = 1 | w1) = p(ti = 2 | w2) = p >> 0.5

Since the words are independent,

p(w1|T) ∝ p(w1) ∏ p(ti|w1) = 0.5 · p^k · (1 − p)^(n−k)

Similarly,

p(w2|T) ∝ p(w2) ∏ p(ti|w2) = 0.5 · p^(n−k) · (1 − p)^k

Thus,

p(w1|T) / p(w2|T) = (p / (1 − p))^(2k−n) > 1 if and only if k > n − k

In sum, the classifier result depends on the number of positive words and negative words. For example, an input tweet is: "Lovely turtle, beautiful fish, and bad weather, but still a fancy trip." The sentiment polarity of the words in this tweet is shown in the following table.

Table 3.1 Words with sentiment polarity in tweet
In this example there are k = 3 positive words (lovely, beautiful, fancy) and n − k = 1 negative word (bad). Since k > n − k, p(positive|Tweet) is larger than p(negative|Tweet), that is, p(w1|T) is larger than p(w2|T). Therefore, the sampled tweet is labeled as positive. As the Naive Bayes classifier assumes the attributes have strong independence, the estimate is:

p(T|wj) = p(t1|wj) p(t2|wj) ⋯ p(tn|wj)

The Naive Bayes classifier can be represented as a directed acyclic graph which has one unobserved node as the parent and several observed nodes as children, with strong independence assumptions among them [3].
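Under the section's assumptions (equal priors and a single emission probability p for a word matching its class), comparing p(w1|T) and p(w2|T) reduces to comparing word counts. A small sketch, with an illustrative stand-in lexicon and p = 0.9 chosen arbitrarily:

```python
import math

# Illustrative sentiment lexicon (stand-in for the thesis's word lists).
POSITIVE = {"lovely", "beautiful", "fancy"}
NEGATIVE = {"bad", "awful"}

def log_posteriors(words, p=0.9):
    """Return (log p(w1|T), log p(w2|T)) up to the shared constant log p(T),
    assuming p(w1) = p(w2) = 0.5 and p(t=pos|w1) = p(t=neg|w2) = p."""
    k = sum(1 for w in words if w in POSITIVE)   # positive sentiment words
    m = sum(1 for w in words if w in NEGATIVE)   # negative sentiment words
    log_w1 = math.log(0.5) + k * math.log(p) + m * math.log(1 - p)
    log_w2 = math.log(0.5) + m * math.log(p) + k * math.log(1 - p)
    return log_w1, log_w2

def classify(tweet):
    """Label a tweet positive or negative by comparing the two posteriors."""
    words = tweet.lower().replace(",", " ").replace(".", " ").split()
    log_w1, log_w2 = log_posteriors(words)
    return "positive" if log_w1 > log_w2 else "negative"

classify("Lovely turtle, beautiful fish, and bad weather, but still a fancy trip")
# -> "positive": three positive words outweigh one negative word
```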
35 p(t w j ) p(t 1 w j ) p(t 2 w j ) p(t n w j ) Figure 3.3 Naive Bayes classifier represented by graph A general pseudo-code for Naive Bayes classifier for sentiment analysis is showed as follows [3]: For each attribute, For each value of the attribute, make a rule as follows: count how often each class appears find the most frequent class make the rule assign that class to this attribute-value Calculate the error rate of the rules Choose the rules with the smallest error rate Support Vector Machines Support vector machines (SVMs) are supervised machine learning models which can be used for data analysis and classification. A hyperplane is constructed by a SVM can be used for classification. To achieved the best classification performance, we need to find the maximum margin which means either side of the a hyperplane has a largest distance to the corresponding nearest data point, therefore, reduced an upper bound on the expected generalization error [3]. Suppose some given linearly separable data points which can be separated into two classes by hyperplane. There may be many hyperplanes that can classify the data points 27
into two classes. One reasonable choice is the hyperplane that represents the largest separation, or maximum margin, between the two classes. Figure 3.4 shows the maximum margin and margins for an SVM.

Figure 3.4 Example of SVM maximum margin and margin

If the data sets are linearly separable, we select two parallel hyperplanes between which the distance is maximized. The area bounded by them is the margin, in which no data points are located. Therefore, there is a pair (w, b) that satisfies the following inequalities [3]:

w · x_i + b >= 1,  for y_i = +1,
w · x_i + b <= -1, for y_i = -1,

where x_i is an n-dimensional vector,
w is the normal vector to the hyperplane, and b is the offset.

The two constraints can be rewritten as:

y_i (w · x_i + b) >= 1

When we linearly classify the two classes, the best hyperplane can be found by solving the quadratic programming optimization problem:

minimize over (w, b):  (1/2) ||w||^2
subject to:  y_i (w · x_i + b) >= 1, for i = 1, ..., n

The data points lying on the margin, i.e., those satisfying y_i (w · x_i + b) = 1, are the support vectors, whose linear combination represents the solution (see Figure 3.4).

A general pseudo-code for SVMs is illustrated in the following process [3]:

1) Introduce positive Lagrange multipliers α_i, one for each of the inequality constraints. This gives the Lagrangian:

   L_P = (1/2) ||w||^2 - Σ_i α_i [ y_i (w · x_i + b) - 1 ]

2) Minimize L_P with respect to w, b.
3) Compute the quadratic programming solution w, b.
4) In the solution, the points for which α_i > 0 are called support vectors.
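As a quick numeric check of the margin constraints, consider an assumed two-dimensional example (illustrative values, not data from the thesis): take w = (1, 0), b = 0, with points x_1 = (1, 0), y_1 = +1 and x_2 = (-1, 0), y_2 = -1. Then

```latex
% Both constraint values evaluate exactly to 1:
y_1\,(w \cdot x_1 + b) = (+1)(1 + 0) = 1,
\qquad
y_2\,(w \cdot x_2 + b) = (-1)(-1 + 0) = 1.
```

Both points satisfy the constraint with equality, so both are support vectors, and the margin width between the two bounding hyperplanes is 2 / ||w|| = 2.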
3.2 Sentiment Analysis of Tweets

Extract Features from Tweets

We extract features from the tweets collected by a Python program for the subsequent sentiment classification. The process of parsing a Tweet post and obtaining unigrams is as follows:

Decode: the datasets collected from the API come in JSON; they are decoded into Python data structures for further processing (e.g., JSON: [{"text": "tweet", "truncated": false, "test": [6, 14]}]; Python: [{u'text': u'tweet', u'truncated': False, u'test': [6, 14]}]).

Filtering: we extract the text element (the tweet content) from the tweet, which is now a Python data structure (e.g., "Everyone in Hawai i is so nice."), and then convert the text into lower case (e.g., "everyone in Hawai i is so nice.").

Tokenization: we parse the data by splitting it on spaces. We encode the text in UTF-8 to get rid of Unicode errors and strip the punctuation from the text.

Classifier

Since Naive Bayes is fast, space efficient, and not sensitive to irrelevant features, in this research we use the Naive Bayes classifier, which is based on Bayes' theorem (Anthony J, 2007):

p(w | T) = p(T | w) p(w) / p(T)

where w is a sentiment word and T is a Twitter message [3].
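The decode, filter, and tokenize steps above can be sketched as follows (a minimal sketch; the function name and the punctuation handling are illustrative, not the thesis code):

```python
import json
import string

def extract_unigrams(raw_json):
    """Decode a JSON array of tweets and return lower-cased unigrams per tweet."""
    tweets = json.loads(raw_json)       # decode: JSON -> Python data structures
    results = []
    for tweet in tweets:
        text = tweet["text"].lower()    # filtering: keep the text element, lower-case it
        # tokenization: strip punctuation, then split on spaces
        text = text.translate(str.maketrans("", "", string.punctuation))
        results.append(text.split())
    return results

raw = '[{"text": "Everyone in Hawaii is so nice!", "truncated": false}]'
print(extract_unigrams(raw))
# [['everyone', 'in', 'hawaii', 'is', 'so', 'nice']]
```

The resulting unigram lists are what the classifier below consumes.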
The Naive Bayes classifier adds strong independence assumptions to Bayes' theorem. The probabilistic model for the classifier can therefore be described by the posterior probabilities p(positive | T) and p(negative | T). Comparing these two probabilities, the larger one indicates the class label that is more likely to be the actual label. If the ratio R = log( p(positive | T) / p(negative | T) ) is larger than 0, then the predicted positive attitude is more likely to be true; otherwise, the predicted negative attitude is more likely to be true.

During the sentiment analysis, the Naive Bayes classifier classifies a Tweet into a positive class or a negative class by examining the words in the Tweet. Each word is labeled as positive or negative according to the lexicon. In the Naive Bayes classification, the number of sentiment words is counted: if more positive words than negative words appear in a Tweet, the Tweet is labeled as positive; if fewer positive words than negative words appear, the Tweet is labeled as negative. Neutral words are ignored in this study since they contain no valuable information for sentiment analysis. The algorithm judges the polarity of the text by checking the words in the Tweet and finally outputs the individual's view. Figure 3.5 shows the workflow of sentiment analysis [17].
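The word-counting rule above can be sketched as follows (a minimal sketch; the tiny lexicon here is an illustrative stand-in for the full sentiment lexicon used in the study):

```python
# Count-based polarity labeling: compare positive vs. negative word counts.
POSITIVE = {"lovely", "beautiful", "fancy", "nice", "good"}
NEGATIVE = {"bad", "awful", "terrible"}

def label_tweet(tokens):
    pos = sum(1 for w in tokens if w in POSITIVE)
    neg = sum(1 for w in tokens if w in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # equal counts carry no usable sentiment signal

tweet = "lovely turtle beautiful fish and bad weather but still a fancy trip".split()
print(label_tweet(tweet))  # 3 positive words vs. 1 negative word -> positive
```

On the sample tweet from Table 3.1, three positive words against one negative word yield the positive label.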
Figure 3.5 The workflow of opinion mining

3.3 Run on Hadoop

Hadoop MapReduce and HDFS

Hadoop has a master-slave architecture which consists of HDFS and MapReduce [14]. The big Twitter datasets are stored in HDFS, from which the data is read for processing, and the computational layer's job is done by MapReduce [18]. The MapReduce master is responsible for deciding where computational work should be scheduled on the slave nodes. The HDFS master is responsible for partitioning the
storage across the slave nodes and keeping track of where data is located [18]. Figure 3.6 shows the Hadoop MapReduce and HDFS architecture.

Figure 3.6 The Hadoop MapReduce and HDFS architecture
MapReduce breaks the sentiment analysis processing into a map phase and a reduce phase, which are executed by MyMapper.py and MyReducer.py respectively. The map phase outputs key-value pairs. After being sorted by the Unix built-in sort program, the key-value pairs are processed by the reduce phase, which writes out the results to HDFS. The following three steps and Figure 3.7 describe the MapReduce process:

1. Map process: the datasets are split into distinct keys and values.
2. Shuffle and sort process: the key-value pairs are shuffled and sorted by key into some logical order.
3. Reduce process: the data flowing into the reduce process, i.e., the output of the previous step grouped by key, has a function applied to each group.

Figure 3.7 A client submits a job to MapReduce [18]

MapReduce Functions

Hadoop Streaming, provided with the Hadoop distribution, is a utility that allows us to create and run Map/Reduce jobs with Python scripts [23]. It handles passing data between our map and reduce functions. Since it allows us to use standard input and standard output,
we write our map and reduce functions in Python, reading input data via Python's sys.stdin and printing output data via Python's sys.stdout [9]. The script MyMapper.py reads data from STDIN, splits it into words, and passes them line by line to STDOUT. The map script outputs key-value pairs which are not sorted; the intermediate sort work is done by the sort program built into UNIX-based systems. After being sorted by key, the sorted key-value pairs are read line by line by the MyReducer.py script through STDIN, and the final result is written to STDOUT [9].
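This streaming pipeline can be simulated locally in plain Python (a hedged sketch: the thesis does not reproduce MyMapper.py or MyReducer.py, so the tab-separated key-value format and the tiny polarity lexicon here are assumptions, and `sorted()` stands in for the Unix sort step between the two scripts):

```python
def mapper(lines):
    """MyMapper.py-style map phase: read lines, emit 'key\\tvalue' pairs."""
    positive = {"nice", "lovely", "fancy"}
    negative = {"bad", "awful"}
    for line in lines:
        words = line.strip().lower().split()
        pos = sum(w in positive for w in words)
        neg = sum(w in negative for w in words)
        label = "positive" if pos > neg else "negative" if neg > pos else "neutral"
        yield f"{label}\t1"

def reducer(sorted_lines):
    """MyReducer.py-style reduce phase: sum values over each run of equal keys."""
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

# Locally, sorted() plays the role of the intermediate Unix sort.
tweets = ["nice lovely trip", "bad weather", "fancy beach nice day"]
for out in reducer(sorted(mapper(tweets))):
    print(out)
# negative	1
# positive	2
```

The reducer relies only on equal keys being adjacent after the sort, which is exactly the contract Hadoop Streaming provides between the map and reduce stages.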
1. Mapper: Figure 3.8 shows the mapper flowchart.

Figure 3.8 The Mapper flowchart
2. Reducer: Figure 3.9 shows the reducer flowchart.

Figure 3.9 The Reducer flowchart
CHAPTER 4 EXPERIMENTS AND RESULTS

In this chapter, we present experiments and results for two classification tasks: sentiment classification for the Scottish independence vote (positive vs. negative) and sentiment classification for Hawai i tourism spots (positive vs. negative). For each sentiment classification task, we follow the procedure described in Figure 4.1 below, and the Naive Bayes classifier is applied to classify the datasets into positive and negative classes.

Figure 4.1 Data analysis process

4.1 Scottish Independence Vote Analysis

The Scottish independence vote was a referendum on Scottish independence which took place in Scotland on 18 September 2014. The voters answered "Should Scotland be an independent country?" with Yes or No to decide whether Scotland should be independent [19]. We extracted tweets from Twitter for opinion mining to predict the result of the voting.

To make sure all the datasets we collected from Twitter refer to the Scottish independence vote, we used keywords concerning the event as search arguments. We extracted tweets
via the Twitter Stream API over frequent intervals; thus, we had the timestamp, author, and tweet text for opinion evaluation. About one million tweets were gathered over a period of ten days around the Scottish independence vote date. Since the independence polling took place on 18 September, we extracted tweets from 11 September to 20 September for sentiment analysis.

Figure 4.2 The curve for tweets over the collecting period based on different keywords (keywords: Scotland, Scottish, Vote, Independence, Independent; x axis: date, 09/11/14 to 09/20/14; y axis: tweets amount)

Figure 4.2 shows the time-series trend in the amount of tweets about the Scottish polling over the collecting period. We can observe that the busiest time for the voting is September 18, which is reasonable since the event happened on that day. After September 18, fewer and fewer people discussed the event since the polling process had ended; thus, fewer and fewer tweets concerning the topic could be collected and the curves come down.
Figure 4.3 Scottish independence vote polarity values (series: Total, Positive, Negative, Neutral; x axis: date, 09/11/14 to 09/20/14; y axis: tweets amount)

Figure 4.3 displays the amount of tweets reflecting people's attitudes over time. As we can read from the figure, most tweets about the independence vote were published around the polling date, and many Twitter users have a neutral attitude compared to positive and negative attitudes.
Figure 4.4 The attitude distribution based on different keywords (pie charts A-E: positive, negative, and neutral percentages for each keyword)

The pie charts A, B, C, D, and E of Figure 4.4 show the attitude distribution based on the different keywords.
Figure 4.5 Distribution of authors' political standpoint toward the Scottish independence vote: values of positive vs. negative

Figure 4.5 shows the authors' political standpoint toward the Scottish independence vote. The x axis represents the time period over the ten days; thus, there are 240 hours in total. The y axis represents the degree of the authors' attitude. The blue points represent positive results, standing for supporting the independence of Scotland, while negative results are marked by red points, standing for opposing Scottish independence. As we can observe, the peak point appeared on September 18.

4.2 Some Hawai i Tourism Sites Analysis

The Hawai i islands, which are Hawai i, O ahu, Maui, Kaua i, and Lāna i, are located in the Pacific Ocean and have significant tourism [24]. In 2013, according to the Hawai i government's 2013 annual report, there were over 8 million visitors to the Hawaiian Islands, with expenditures of over $15 billion [25]. The most popular times for tourists are the summer months and major holidays; therefore, our tweet collecting period is from August 23 to September 20, 2014. In this study, we mainly collect data concerning these topics: Hawai i, the name of the islands; Waikīkī, well known for Waikīkī beach, which
is the most popular beach on O ahu; Diamond head, the name of a volcanic cone and a major tourist attraction on O ahu; Hanauma bay, famous for snorkeling; and Hawai i airlines, the largest airline in Hawai i [26][27][28][29].

Figure 4.6 shows the distribution of attitude polarity over the August 23 to September 20 period based on the keyword Hawai i. The blue circles in the figure represent the positive attitude degree, the red stars represent the negative attitude, and the pink points represent the average of the positive and negative values. As we can observe, all the average points are above the zero line over the collecting period; thus, the Twitter authors have positive comments on Hawai i.

Figure 4.6 Distribution of attitude polarity over the collecting period based on keyword Hawai i (series: Positive, Negative, Average)
Figure 4.7 shows the distribution of attitude polarity over the August 23 to September 20 period based on the keyword Waikīkī. We can observe that the pink points, which are the averages of the positive and negative values, are above the zero line. Therefore, from the figure, we can conclude that authors have a positive attitude toward Waikīkī.

Figure 4.7 Distribution of attitude polarity over the collecting period based on keyword Waikīkī (series: Positive, Negative, Average)

Figure 4.8 describes the distribution of attitude polarity over the August 23 to September 20 period based on the keyword Diamond head. People's average attitude toward Diamond head is positive, since the average values, represented by the pink points, are above the zero line.
Figure 4.8 Distribution of attitude polarity over the collecting period based on keyword Diamond head (series: Positive, Negative, Average)

Figure 4.9 shows the distribution of attitude polarity over the August 23 to September 20 period based on the keyword Hanauma bay. The average values are above the zero line; thus, authors have positive comments on Hanauma bay.
Figure 4.9 Distribution of attitude polarity over the collecting period based on keyword Hanauma bay (series: Positive, Negative, Average)

4.3 Performance Environment

In this experiment, we use one server as the master and two servers as slaves, on all of which we installed the Ubuntu operating system and Hadoop. The environment of the experiment is described in Table 4.1 below.
Table 4.1 The environment of the experiment
CHAPTER 5 CONCLUSION AND OPEN ISSUES

This study presents a method to collect datasets concerning specific topics from the Twitter database via the Twitter API. We extract features from the Tweets and use a Naive Bayes classifier to separate the data into two classes, positive and negative, for opinion evaluation toward selected topics and issues. In this study, we store the original datasets in the HDFS filesystem and analyze them using the Hadoop MapReduce model. We visualize the analysis results using Matlab. The experimental results show that the presented method performs efficiently.

Although this thesis evaluates the views of Twitter authors, predicting the Scottish independence vote result and analyzing tourists' attitudes toward some popular tourist attractions in Hawai i, there are many open issues that still require further investigation and research work. Some of the open issues that are worth attention in relation to this thesis work are discussed here. This thesis uses the Naive Bayes classifier for classification; in future work, we may modify it to improve its performance or try other classifiers to overcome the independence assumption.
REFERENCES

[1] Matthew A. Russell, Mining the Social Web, O'Reilly, 2011.
[2]
[3] S. B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques," Proceedings of the 2007 Conference on Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies, pp. 3-24, June 2007.
[4] Alexander Pak and Patrick Paroubek, "Twitter as a Corpus for Sentiment Analysis and Opinion Mining," LREC 2010, Seventh International Conference on Language Resources and Evaluation, May 2010.
[5] "Predicting the Future with Social Media," 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 1, 2010.
[6] Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau, "Sentiment Analysis of Twitter Data," LSM '11 Proceedings of the Workshop on Languages in Social Media, pp. 30-38, June 2011.
[7] Hsiang Hui Lek and D. C. C. Poo, "Aspect-based Twitter Sentiment Classification," 2013 IEEE 25th International Conference on Tools with Artificial Intelligence (ICTAI), November 2013.
[8]
[9] Tom White, Hadoop: The Definitive Guide, Third Edition, O'Reilly, 2012.
[10]
[11]
[12]
[13]
[14]
[15]
[16] Bing Liu, Sentiment Analysis and Opinion Mining, Graeme Hirst (series ed.), 2012.
[17] Shamanth Kumar, Fred Morstatter, and Huan Liu, Twitter Data Analytics, Springer, 2013.
[18] Alex Holmes, Hadoop in Practice, Manning Shelter Island, 2012.
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
NAIVE BAYES ALGORITHM FOR TWITTER SENTIMENT ANALYSIS AND ITS IMPLEMENTATION IN MAPREDUCE. A Thesis. Presented to. The Faculty of the Graduate School
NAIVE BAYES ALGORITHM FOR TWITTER SENTIMENT ANALYSIS AND ITS IMPLEMENTATION IN MAPREDUCE A Thesis Presented to The Faculty of the Graduate School At the University of Missouri In Partial Fulfillment Of
Internals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
Log Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]
Chase Wu New Jersey Ins0tute of Technology
CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at
Processing of Hadoop using Highly Available NameNode
Processing of Hadoop using Highly Available NameNode 1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar Department of computer Engineering Smt. Kashibai Navale
HADOOP - MULTI NODE CLUSTER
HADOOP - MULTI NODE CLUSTER http://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm Copyright tutorialspoint.com This chapter explains the setup of the Hadoop Multi-Node cluster on a distributed
Manifest for Big Data Pig, Hive & Jaql
Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,
A Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
map/reduce connected components
1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains
Sentiment analysis on tweets in a financial domain
Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International
International Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
研 發 專 案 原 始 程 式 碼 安 裝 及 操 作 手 冊. Version 0.1
102 年 度 國 科 會 雲 端 計 算 與 資 訊 安 全 技 術 研 發 專 案 原 始 程 式 碼 安 裝 及 操 作 手 冊 Version 0.1 總 計 畫 名 稱 : 行 動 雲 端 環 境 動 態 群 組 服 務 研 究 與 創 新 應 用 子 計 畫 一 : 行 動 雲 端 群 組 服 務 架 構 與 動 態 群 組 管 理 (NSC 102-2218-E-259-003) 計
Distributed Filesystems
Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science [email protected] April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls
Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
Hadoop. History and Introduction. Explained By Vaibhav Agarwal
Hadoop History and Introduction Explained By Vaibhav Agarwal Agenda Architecture HDFS Data Flow Map Reduce Data Flow Hadoop Versions History Hadoop version 2 Hadoop Architecture HADOOP (HDFS) Data Flow
Hadoop Big Data for Processing Data and Performing Workload
Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer
Detection of Distributed Denial of Service Attack with Hadoop on Live Network
Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,
Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model
Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model Condro Wibawa, Irwan Bastian, Metty Mustikasari Department of Information Systems, Faculty of Computer Science and
How To Analyze Sentiment On A Microsoft Microsoft Twitter Account
Sentiment Analysis on Hadoop with Hadoop Streaming Piyush Gupta Research Scholar Pardeep Kumar Assistant Professor Girdhar Gopal Assistant Professor ABSTRACT Ideas and opinions of peoples are influenced
Map Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
Reduction of Data at Namenode in HDFS using harballing Technique
Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu [email protected] [email protected] Abstract HDFS stands for the Hadoop Distributed File System.
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Certified Big Data and Apache Hadoop Developer VS-1221
Certified Big Data and Apache Hadoop Developer VS-1221 Certified Big Data and Apache Hadoop Developer Certification Code VS-1221 Vskills certification for Big Data and Apache Hadoop Developer Certification
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
Installing Hadoop. Hortonworks Hadoop. April 29, 2015. Mogulla, Deepak Reddy VERSION 1.0
April 29, 2015 Installing Hadoop Hortonworks Hadoop VERSION 1.0 Mogulla, Deepak Reddy Table of Contents Get Linux platform ready...2 Update Linux...2 Update/install Java:...2 Setup SSH Certificates...3
Keywords social media, internet, data, sentiment analysis, opinion mining, business
Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Real time Extraction
Testing Big data is one of the biggest
Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing
Running Kmeans Mapreduce code on Amazon AWS
Running Kmeans Mapreduce code on Amazon AWS Pseudo Code Input: Dataset D, Number of clusters k Output: Data points with cluster memberships Step 1: for iteration = 1 to MaxIterations do Step 2: Mapper:
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
