A Review of Sentiment Analysis in Twitter Data Using Hadoop
|
|
|
- Jack Lyons
- 10 years ago
- Views:
Transcription
1 , pp A Review of Sentiment Analysis in Twitter Data Using Hadoop L.Jaba Sheela Panimalar Engineering College, Chennai, India. [email protected] Abstract Twitter is an online social networking site which contains rich amount of data that can be a structured, semi-structured and un-structured data. In this work, a method which performs classification of tweet sentiment in Twitter is discussed. To improve its scalability and efficiency, it is proposed to implement the work on Hadoop Ecosystem, a widely-adopted distributed processing platform using the Map Reduce parallel processing paradigm. Finally, extensive experiments will be conducted on real-world data sets, with an expectation to achieve comparable or greater accuracy than the proposed techniques in literature. Keywords: Twitter, Sentiment Analysis, Hadoop, Map reduce, HDFS 1. Introduction We live in a society where the textual data on the Internet is growing at a rapid pace and many companies are trying to use this deluge of data to extract people s views towards their products. Online social network platforms, with their large-scale repositories of user-generated content, can provide unique opportunities to gain insights into the emotional pulse of the nation, and indeed the global community. A great source of unstructured text information is included in social networks, where it is unfeasible to manually analyze such amounts of data. There is a large number of social networks websites that enable users to contribute, modify and grade the content, as well as to express their personal opinions about specific topics. Some examples include blogs, forums, product reviews sites, and social networks, like Twitter ( Twitter (San Francisco, CA, USA) is a micro blogging site that offers the opportunity for the analysis of expressed mood, and previous studies have shown that geographical, diurnal, weekly, and seasonal patterns of positive and negative affect can be observed. Micro blogging and more particularly Twitter is used for the following reasons: Micro blogging platforms are used by different people to express their opinion about different topics, thus it is a valuable source of people s opinions. Twitter contains an enormous number of text posts and it grows every day. The collected corpus can be arbitrarily large. Twitter s audience varies from regular users to celebrities, company representatives, politicians, and even country presidents. Therefore, it is possible to collect text posts of users from different social and interests groups. Twitter s audience is represented by users from many countries As the audience of micro blogging platforms and services grows every day, data from these sources can be used in opinion mining and sentiment analysis tasks. For example, manufacturing companies may be interested in the following questions: 2. Problem Definition The project focuses on using Twitter, the most popular micro blogging platform, for the task of sentiment analysis. The tweets are important for analysis because data arrive at ISSN: IJDTA Copyright c 2016 SERSC
2 a high frequency and algorithms that process them must do so under very strict constraints of storage and time. It will be shown how to automatically collect a corpus for sentiment analysis and opinion mining purposes and then perform linguistic analysis of the collected corpus. All public tweets posted on twitter are freely available through a set of APIs provided by Twitter. Using the corpus, a sentiment classifier, is constructed that is able to determine positive, negative and neutral sentiments. 3. Literature Review In the past years, many works has been released in sentiment analysis. Implementation of sentiment analysis has been carried out for a variety of applications over a wide range of classification algorithms and for varying data size. There exist many possible variants; some of them are discussed in following section. 3.1 Lin, Jimmy, and Alek Kolcz. "Large-Scale Machine Learning at Twitter." In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp ACM, [3] This paper presents a case study of Twitter s integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform. The core of this work lies in recent Pig extensions to provide predictive analytics capabilities that incorporate machine learning, focused specifically on supervised classification. In particular, the authors have identified stochastic gradient descent techniques for online learning and ensemble methods as being highly amenable to scaling out to large amounts of data. In contrast to other linguistic approaches the authors adopt a knowledge-poor, datadriven approach. It provides a base-line for classification accuracy from content, given only large amounts of data. The data set involves a test set consisting of one million English tweets with emoticons from Sept. 1, 2015, at least 20 characters in length. The test set was selected to contain an equal number of positive and negative examples. For training, they have prepared three separate datasets containing 1 million, 10 million, and 100 million English training examples from tweets before Sept. 1, 2015 (also containing an equal number of positive and negative examples). In preparing both the training and test sets, emoticons are removed. Their experiments used a simple logistic regression classifier learned using online stochastic gradient descent, using hashed byte 4-grams as features. Their machine learning framework consists of two components: a core Java library and a layer of lightweight wrappers that expose functionalities in Pig. A Pig script was written for training binary sentiment polarity classifiers. The script processes tweets, separately filtering out those containing positive and negative emoticons, which are unioned together to generate the final training set. The learner in the training module is SGD logistic regression which is embedded inside the Pig store function, such that the learned model is written directly to HDFS. For model training, the core Java library is integrated into Pig as follows: feature vectors in Java are exposed as maps in Pig, which are treated as a set of feature id (int) to feature value (oat) mappings. Thus, a training instance in Pig has the following schema: (label: int, features: map[ ]) The authors have developed wrappers that use the classifiers directly in Pig. For each classifier in our core Java library, there is a corresponding Pig UDF. The UDF is initialized with the model, and then can be invoked like any other UDF. Results of the polarity classification experiments showed accuracy in the range 77% to 82% with varying data set size. 78 Copyright c 2016 SERSC
3 3.2 Bian, Jiang, Umit Topaloglu, and Fan Yu. "Towards Large-Scale Twitter Mining for Drug-Related Adverse Events" In Proceedings of the 2012 international workshop on Smart health and wellbeing, pp ACM, [4] In this paper, the authors describe an approach to find drug users and potential adverse events by analyzing the content of twitter messages utilizing Natural Language Processing (NLP) and to build Support Vector Machine (SVM) classifiers. Due to the size nature of the dataset (i.e., 2 billion Tweets), the experiments were conducted on a High Performance Computing (HPC) platform using Map Reduce, which exhibits the trend of big data analytics. The results suggest that daily-life social networking data could help early detection of important patient safety issues The data set used is a collection of over 2 billion Tweets collected from May 2009 to October 2010, from which they try to identify potential adverse events caused by drugs of interest. The collected stream of Tweets was organized by a timeline. The raw Twitter messages were crawled using the Twitter s user timeline API that contains information about the specific Tweet and the user. The work is indexed only with the following four fields for each Tweet: 1) Tweet id that uniquely identifies each Tweet; 2) User identifier associated with each Tweet; 3) Timestamp of the Tweet; and 4) the Tweet text. They utilized the Amazon Elastic Compute Cloud (EC2) to run the Twitter indexers on 15 separate EC2 instances, 34.2 GB of memory, and 13 EC2 Compute Units) in parallel, which were able to parse and index all 2 billion Tweets within two days. The size of the Lucene indexes is 896 GB. To mine Twitter messages for AEs, the process can be separated into two parts: 1) Identifying potential users of the drug; 2) Finding possible side effects mentioned in the users Twitter timeline that might be caused by the use of the drug concerned. Both processes involve building and training classification models based on features extracted from the users Twitter messages. Two-sets of features (i.e., textual and semantic features) are extracted from Twitter users timeline for both classification models. Textual features such as the bag-of-words (BoWs) model are derived based our analysis of the actual Twitter messages. Semantic features are derived from the Unified Medical Language System (UMLS) Metathesaurus concept codes extracted from the Tweets using Metamap developed at the National Library of Medicine (NLM). Two-class Support Vector Machine (SVM) was used for the purpose of classification. Evaluation of the SVM was done using parameters such as, the Area under the Curve (AUC) value, and the Receiver operating characteristic (ROC) curve. The ROC curve using the mean values of the 1000 iterations was drawn. The prediction accuracy on average over the 1000 iterations was evaluated to 0.74 and the mean AUC value is Liu, Bingwei, Erik Blasch, Yu Chen, Dan Shen, and Genshe Chen. "Scalable Sentiment Classification for Big Data Analysis Using Naive Bayes Classifier" In Big Data, 2013 IEEE International Conference on, pp IEEE, [5] Machine learning technologies are widely used in sentiment classification because of their ability to learn from the training dataset to predict or support decision making with relatively high accuracy. However, when the dataset is large, some algorithms might not scale up well. In this paper, the authors evaluate the scalability of Naive Bayes classifier (NBC) in large-scale datasets. They have presented a simple and complete system for sentiment mining on large datasets using a Naive Bayes classifier with the Hadoop framework. Instead of using Mahout Library, they implemented NBC to achieve finegrain control of the analysis procedure for a Hadoop implementation. They have Copyright c 2016 SERSC 79
4 demonstrated that NBC is able to scale up to analyze the sentiment of millions movie reviews with increasing throughput. The raw data comes from large sets of movie reviews collected by research communities. In their experiments, they use two datasets: the Cornell University movie review dataset3 and Stanford SNAP Amazon movie review dataset4. The Cornell dataset has 1000 positive and 1000 negative reviews. The Amazon movie review dataset is organized into eight lines for each review, with additional information such product identification (ID), user ID, profile Name, score, summary etc. They have used only unigrams for the Naive Bayes classifier. The classification task is divided into three sequential jobs as follows. 1) Training job - All training reviews are fed into this job to produce a model for all unique words with their frequency in positive and negative review documents respectively. 2) Combining job - In this job, the model and the test reviews are combined to a intermediate table with all necessary information for the final classification. 3) Classify job - This job classifies all reviews simultaneously and writes the classification results to HDFS. The experimental setup consists of a Virtual Hadoop cluster of seven nodes. It is a fast and easy way to test a Hadoop program in the Cloud, although the performance might be weaker compared to a physical Hadoop cluster. The cloud infrastructure is built on a Dell server with 12 Intel Xeon E GHz cores and 32G memories. They tested their code on Cornell dataset and resulted in a 80.85% average accuracy. Without changing the Hadoop code, the program was able to classify different subsets of Amazon movie review dataset with comparable accuracy. To test the scalability of Naive Bayes classifier, the size of dataset in their experiment varies from one thousand to one million reviews in each class. 3.4 ÁlvaroCuesta, David F., and María D. R-Moreno. "A Framework for Massive Twitter Data Extraction and Analysis", In Malaysian Journal of Computer Science, pp (2014):1. [6] The authors propose an open framework to automatically collect and analyze data from Twitter s public streams. This is a customizable and extensible framework, so researchers can use it to test new techniques. The framework is complemented with a languageagnostic sentiment analysis module, which provides a set of tools to perform sentiment analysis of the collected tweets. The capabilities of this platform are illustrated with two study cases in Spanish, one related to a high impact event (the Boston Terror Attack), and another one related to regular political activity on Twitter. The first case study involves the activity on Twitter around a high impact event, the Boston Terror Attacks. In this case, they tracked a hash tag. The second case study was focused on regular Twitter usage, tracking the activity around well-known Spanish political actors, i.e. politicians, political parties, journalists and activist organizations as well. The authors have selected controversial accounts to have a good foundation for sentiment analysis. There are several layers of processing and these modules need to interchange data among them, using open data formats such as JSON. Most tools in the framework are implemented in Python, but the Classifier and Tester web interfaces run on NodeJS and are programmed in CoffeeScript (a language which can be pre-processed into JavaScript). The chosen backend database is MongoDB, which is a good fit for our purposes since its atomic representation is JSON, just like tweets The implementation was based on the Natural Language Toolkit (NLTK) framework A complete procedure of data extraction and sentiment analysis is divided into three separate steps: data acquisition, training for sentiment analysis and report generation. The first step is, gathering data from Twitter with the Miner. Then the classifier is trained and 80 Copyright c 2016 SERSC
5 the sentiment analysis carried out. Finally, the platform generates a set of reports, including the sentiment analysis if it is enabled. Classification was done according to three classes, positive, negative and neutral. Several Naïve Bayes classifiers using a set of ngrams in order to select the one with the best performance. In particular, they have tried {1}, {2}, {3}, {1, 2}, {1, 3} and {2, 3} ngrams and minimum score of 0, 1, 2, 3, 4, 5, 6 and 10. All these different options were tried using ten-fold cross-validation to avoid biases induced by the partition of the training set. The parameters such as accuracy mean and variance, precision, recall and fmeasure mean and variance were used for evaluation. The conclusion is that the best trainers had 1-grams included and a minimum score between 2 and Skuza, Michal, and Andrzej Romanowski. "Sentiment analysis of Twitter Data within Big Data Distributed Environment for Stock Prediction" In Computer Science and Information Systems (FedCSIS), 2015 Federated Conference on, pp IEEE, [7] This paper discusses a possibility of making prediction of stock market basing on classification of data coming from Twitter micro blogging platform. Twitter messages are retrieved in real time using Twitter Streaming API. Tweets were collected over 3 month s period from 2nd January 2013 to 31st March It was specified in the query that tweets have to contain name of the company or hashtag of that name. Predictions were made for Apple Inc. in order to ensure that sufficiently large datasets would be retrieved Only tweets in English are used in this research work. Reposted messages are redundant for classification and were deleted. After pre-processing each message was saved as bag of words model a standard technique of simplified information representation used in information retrieval. System design consists of four components: Retrieving Twitter data, pre-processing and saving to database (1), stock data retrieval (2), model building (3) and predicting future stock prices (4). Polarity mining is a part of sentiment in which input is classified either as positive or negative. Automatic sentiment detection of messages was achieved by employing SentiWordNet. Prediction of future stock prices is performed in this work by combining results of sentiment classification of tweets and stock prices from a past interval. Taking into consideration large volumes of data to be classified and the fact they are textual, Naïve Bayes method was chosen due to its fast training process even with large volumes of training data and the fact that is it is incremental. Considered large volumes of data resulted also in decision to apply a map reduce version of Naïve Bayes algorithm. 3.6 Tare, Mohit, Indrajit Gohokar, Jayant Sable, Devendra Paratwar, and Rakhi Wajgi. "Multi-Class Tweet Categorization Using Map Reduce Paradigm" In International Journal of Computer Trends and Technology. pp (2014) [8] The authors have proposed strategy that uses Apache Hadoop framework, an open source java framework, which relies on Map Reduce paradigm and a Hadoop Distributed File System (HDFS) to process data. The proposed Map Reduce strategy for classification of tweets using Naïve Bayes classifier relies on two Map-Reduce passes. We have used the Twitter4j library to gather tweets which internally uses twitter REST API. The Twitter4j library requires OAuth support to access the API. Twitter uses OAuth to provide authorized access to its API. The final step after preprocessing of tweets is the labeling of tweets based on categories namely politics, sports and technology. Copyright c 2016 SERSC 81
6 In the first Map-Reduce pass, the mapper takes the labeled tweets from the training data and outputs category and word as key value pair. The Reducer then sums up all instances of the words for each category and outputs category and word-count pair as keyvalue. The Map-Reduce thus deals with formation of model for the classifier. The next Map-Reduce pass does the classification by calculating conditional probability of each word (i.e. feature) and outputs category and conditional probability of each word as keyvalue pair. Then final reducer calculates the final probability of each category to which the tweet may belong to and outputs the predicted category and its probability value as key-value pair 3.7 Comparitive Analysis Table 2.1 presents an analysis of the various approaches studied in the literature survey. Table 2.1 Summary of Literature Survey 82 Copyright c 2016 SERSC
7 4. Development Environment Table 4.1 Development Environment COMPONENTS Operating System Crawler, HDFS Layer MapReduce Layer MongoDB Web Server ROLES Use of Hadoop for distributed storage Supporting Java environment for processing some business logic Crawler: Gathering the source data from various SNSs HDFS: Distribution File system, Data storage Sentence Analysis, Text Mining, Sentiment Analysis Storing analyzed results by MapReduce in MongoDB Supporting Web applications using analyzed results 5. Development Methodology 1. Collect unstructured data from Social Media sources. 2. Real-Time Processing with a sentiment analysis engine based on keyword search. 3. Store processed data (with sentiment) in NoSQL database. 4. Extract sentiments from NoSQL to visualization layer. 5. Visualize with a tool of choice. The proposed system has the following modules; 1. DATA STREAMING 2. PREPROCESSING 3. SENTIMENT POLARITY ANALYSIS 4. VISUALIZATION 5. EVALUATION METRICS The details of the modules are presented below. 5.1 Data Streaming Extracting real time tweets using Twitter Streaming API For classification and training the classifier we need Twitter data. For this purpose we make use of API's twitter provides. Twitter provides two API's; Stream API1 and REST API2. The difference between Streaming API and REST APIs are: Streaming API supports long-lived connection and provides data in almost real -time. The REST APIs support short-lived connections and are rate-limited (one can download a certain amount of data [*150 tweets per hour] but not more per day) Preprocessing In this phase, the tweets are available as text data and each line contains a tweet. Initially we clean up or remove retweets as that will induce a bias in the classification Copyright c 2016 SERSC 83
8 process. We need to remove the punctuations and other symbols that doesn t make any sense as it may result in inefficiencies and may affect the accuracy of the overall process 5.3 Sentiment Polarity Analysis MapReduce is a new parallel programming model, hence the classical Naive Bayes based sentiment analysis algorithm is adjusted to fit into Map Reduce model. we choose to employ a Naive Bayes classifier and empower it with an English lexical dictionary SentiWordNet 5.4 Visualization Tweets are presented using several different visualization techniques. Each technique is designed to highlight different aspects of the tweets and their sentiment Heatmap The heatmap visualizes the number of tweets within different sentiment regions. It highlights "hot" red regions with many tweets, and "cold" blue regions with only a few tweets Tag Cloud The tag cloud visualizes the most frequently occurring terms in four emotional regions: upset in the upper-left, happy in the upper-right, relaxed in the lower-right, and unhappy in the lower-left. A term's size shows how often it occurs over all the tweets in the given emotional region. Larger terms occur more frequently Timeline The timeline visualizes when tweets were posted. Pleasant tweets are shown in green above the horizontal axis, and unpleasant tweets in blue below the axis Map The map shows where tweets were posted. Twitter uses an "opt-in" system for reporting location: users must explicitly choose to allow their location to be posted before their tweets are geotagged Affinity The affinity graph visualizes frequent tweets, people, hashtags, and URLs, together with relationships or affinities between these elements. 6. Evaluation Metrics We will evaluate our experiment results by using following Information Retrieval matrices. Precision = TP/(TP + FP) Recall = TP/(TP + FN) F-measure = 2*Precision*recall/( Precision + recall) Accuracy = TP + TN /(TP + TN + FP + FN ) 84 Copyright c 2016 SERSC
9 7. Conclusion It is proposed to stream real time live tweets from twitter using Twitter API, and the large volume of data makes the application suitable for Big Data Analytics. A method to predict or deduct the location of a tweet based on the tweet s information and the user s information should be found in the future. References [1] T. Wilson, J. Wiebe and P. Hoffmann, Recognizing contextual polarity in phrase-level sentiment analysis, in Proceedings of HLT and EMNLP. ACL, (2005), pp [2] C. C. Tao, S. K. Kim, Y. A. Lin, Y. Y. Yu, G. Bradski, A. Y. Ng and Kunle Olukotun, Map-reduce for machine learning on multicore, In NIPS, vol. 6, (2006), pp [3] L. Jimmy, and A. Kolcz, Large-scale machine learning at twitter, In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, ACM, (2012), pp [4] B. Jiang, U. Topaloglu and F. Yu, Towards large-scale twitter mining for drug-related adverse events, In Proceedings of the 2012 international workshop on Smart health and wellbeing, ACM, (2012), pp [5] L. Bingwei, E. Blasch, Y. Chen, D. Shen and G. Chen, Scalable Sentiment Classification for Big Data Analysis Using Naive Bayes Classifier, In Big Data, 2013 IEEE International Conference on, IEEE, (2013), pp [6] Á. Cuesta, David F. and María D. R-Moreno, A Framework for Massive Twitter Data Extraction and Analysis, In Malaysian Journal of Computer Science, (2014), pp [7] S. Michal and A. Romanowski, Sentiment analysis of Twitter data within big data distributed environment for stock prediction, In Computer Science and Information Systems (FedCSIS), 2015 Federated Conference on, IEEE, (2015), pp [8] T. Mohit, I. Gohokar, J. Sable, D. Paratwar and R. Wajgi, Multi-Class Tweet Categorization Using Map Reduce Paradigm, In International Journal of Computer Trends and Technology. (2014), pp [9] D. Jeffrey and S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM 51.1, (2008), pp [10] B. Yingyi, HaLoop: Efficient iterative data processing on large clusters, Proceedings of the VLDB Endowment 3.1-2, (2010), pp [11] T. Maite, Lexicon-based methods for sentiment analysis, Computational linguistics 37.2, (2011), pp [12] R. Tushar and S. Srivastava, Analyzing stock market movements using twitter sentiment analysis, Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012). IEEE Computer Society, (2012). [13] D. Pessemier and Martens MovieTweetings: A Movie Rating Dataset Collected From Twitter, Ghent University, Ghent, Belgium, (2013). [14] Twitter. Twitter Search API, available at [15] V. D. Katkar, S. V. Kulkarni, A Novel Parallel implementation of Naive Bayesian classifier for Big Data, International Conference on Green Computing, Communication and Conservation of Energy, /2013 IEEE, pp [16] S. Kumar, F. Morstatter and H. Liu, Twitter Data Analytics, Springer Science & Business Media, (2013). [17] B. Vishal, Data Mining in Dynamic Social Networks and Fuzzy Systems, IGI Global, (2013). [18] G. Elmer, G. Langlois and J. Redden, Compromised Data: From Social Media to Big Data, Bloomsbury Publishing USA, (2015). [19] Nalini K. and L. J. Sheela, Classification of Tweets Using Text Classifier to Detect Cyber Bullying, In Emerging ICT for Bridging the Future-Proceedings of the 49th Annual Convention of the Computer Society of India CSI, Springer International Publishing, vol. 2, (2015), pp [20] Jaba S. L. and Dr V. Shanthi, An Approach for Discretization and Feature Selection Of Continuous- Valued Attributes in Medical Images for Classification Learning, International Journal of Computer Theory and Engineering, vol. 1, no. 2, pp [21] T. White, Hadoop: The Definitive Guide, Third Edition, O'Reilley, (2012). [22] L. George, HBase: The Definitive Guide, O'Reilley, (2011). [23] E. Hewitt, Cassandra: The Definitive Guide, O'Reilley, (2010). [24] A. Gates, Programming Pig, O'Reilley, (2011). Copyright c 2016 SERSC 85
10 Author L. Jaba Sheela, She received the PhD degree in She is currently working as Professor & HOD in department of Master of Computer Applications, of Panimalar Engineering College, Chennai. She is also a team member of research projects sponsored by AICTE, New Delhi. She is a member of CSI and ICAST. Her areas of interest are Data Mining, Big Data Analytics, Software Quality Management, Operating Systems, Image Processing, and Networking. 86 Copyright c 2016 SERSC
Scalable Sentiment Classification for Big Data Analysis Using Naïve Bayes Classifier
Scalable Sentiment Classification for Big Analysis Using Naïve Bayes Classifier Bingwei Liu, Erik Blasch, Yu Chen, Dan Shen and Genshe Chen Intelligent Fusion Technology, Inc. Germantown, Maryland, USA
Sentiment analysis on tweets in a financial domain
Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International
Big Data and Natural Language: Extracting Insight From Text
An Oracle White Paper October 2012 Big Data and Natural Language: Extracting Insight From Text Table of Contents Executive Overview... 3 Introduction... 3 Oracle Big Data Appliance... 4 Synthesys... 5
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
Sentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies
Sentiment analysis of Twitter microblogging posts Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies Introduction Popularity of microblogging services Twitter microblogging posts
Big Systems, Big Data
Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,
A Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
International Journal of Engineering Research ISSN: 2348-4039 & Management Technology November-2015 Volume 2, Issue-6
International Journal of Engineering Research ISSN: 2348-4039 & Management Technology Email: [email protected] November-2015 Volume 2, Issue-6 www.ijermt.org Modeling Big Data Characteristics for Discovering
An Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
NAIVE BAYES ALGORITHM FOR TWITTER SENTIMENT ANALYSIS AND ITS IMPLEMENTATION IN MAPREDUCE. A Thesis. Presented to. The Faculty of the Graduate School
NAIVE BAYES ALGORITHM FOR TWITTER SENTIMENT ANALYSIS AND ITS IMPLEMENTATION IN MAPREDUCE A Thesis Presented to The Faculty of the Graduate School At the University of Missouri In Partial Fulfillment Of
Implement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
Data Mining Yelp Data - Predicting rating stars from review text
Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University [email protected] Chetan Naik Stony Brook University [email protected] ABSTRACT The majority
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
Large-Scale Test Mining
Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding
Keywords social media, internet, data, sentiment analysis, opinion mining, business
Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Real time Extraction
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
Introduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
Classification On The Clouds Using MapReduce
Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal [email protected] Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal [email protected]
The basic data mining algorithms introduced may be enhanced in a number of ways.
DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed data is memory resident,
Sentiment Analysis on Twitter with Stock Price and Significant Keyword Correlation. Abstract
Sentiment Analysis on Twitter with Stock Price and Significant Keyword Correlation Linhao Zhang Department of Computer Science, The University of Texas at Austin (Dated: April 16, 2013) Abstract Though
BIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
Sentiment Analysis on Big Data
SPAN White Paper!? Sentiment Analysis on Big Data Machine Learning Approach Several sources on the web provide deep insight about people s opinions on the products and services of various companies. Social
Advanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time
SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first
What is Analytic Infrastructure and Why Should You Care?
What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group [email protected] ABSTRACT We define analytic infrastructure to be the services,
Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies
Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies Somesh S Chavadi 1, Dr. Asha T 2 1 PG Student, 2 Professor, Department of Computer Science and Engineering,
Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
Sentiment Analysis for Movie Reviews
Sentiment Analysis for Movie Reviews Ankit Goyal, [email protected] Amey Parulekar, [email protected] Introduction: Movie reviews are an important way to gauge the performance of a movie. While providing
International Journal of Innovative Research in Computer and Communication Engineering
FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,
A Comparative Study on Sentiment Classification and Ranking on Product Reviews
A Comparative Study on Sentiment Classification and Ranking on Product Reviews C.EMELDA Research Scholar, PG and Research Department of Computer Science, Nehru Memorial College, Putthanampatti, Bharathidasan
The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2
2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of
Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
Research Article MapReduce Functions to Analyze Sentiment Information from Social Big Data
International Journal of Distributed Sensor Networks Volume 2015, Article ID 417502, 11 pages http://dx.doi.org/10.1155/2015/417502 Research Article MapReduce Functions to Analyze Sentiment Information
JOURNAL OF COMPUTER SCIENCE AND ENGINEERING
Exploration on Service Matching Methodology Based On Description Logic using Similarity Performance Parameters K.Jayasri Final Year Student IFET College of engineering [email protected] R.Rajmohan
You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.
What is this course about? This course is an overview of Big Data tools and technologies. It establishes a strong working knowledge of the concepts, techniques, and products associated with Big Data. Attendees
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
Text Opinion Mining to Analyze News for Stock Market Prediction
Int. J. Advance. Soft Comput. Appl., Vol. 6, No. 1, March 2014 ISSN 2074-8523; Copyright SCRG Publication, 2014 Text Opinion Mining to Analyze News for Stock Market Prediction Yoosin Kim 1, Seung Ryul
Using In-Memory Computing to Simplify Big Data Analytics
SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed
Data Mining & Data Stream Mining Open Source Tools
Data Mining & Data Stream Mining Open Source Tools Darshana Parikh, Priyanka Tirkha Student M.Tech, Dept. of CSE, Sri Balaji College Of Engg. & Tech, Jaipur, Rajasthan, India Assistant Professor, Dept.
Full-text Search in Intermediate Data Storage of FCART
Full-text Search in Intermediate Data Storage of FCART Alexey Neznanov, Andrey Parinov National Research University Higher School of Economics, 20 Myasnitskaya Ulitsa, Moscow, 101000, Russia [email protected],
How To Analyze Sentiment On A Microsoft Microsoft Twitter Account
Sentiment Analysis on Hadoop with Hadoop Streaming Piyush Gupta Research Scholar Pardeep Kumar Assistant Professor Girdhar Gopal Assistant Professor ABSTRACT Ideas and opinions of peoples are influenced
BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics
BIG DATA & ANALYTICS Transforming the business and driving revenue through big data and analytics Collection, storage and extraction of business value from data generated from a variety of sources are
Introduction to Hadoop
1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News Sushilkumar Kalmegh Associate Professor, Department of Computer Science, Sant Gadge Baba Amravati
How To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1
A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1 Yannis Stavrakas Vassilis Plachouras IMIS / RC ATHENA Athens, Greece {yannis, vplachouras}@imis.athena-innovation.gr Abstract.
Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.
Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics
Accelerating Hadoop MapReduce Using an In-Memory Data Grid
Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
MapReduce/Bigtable for Distributed Optimization
MapReduce/Bigtable for Distributed Optimization Keith B. Hall Google Inc. [email protected] Scott Gilpin Google Inc. [email protected] Gideon Mann Google Inc. [email protected] Abstract With large data
Open Source Technologies on Microsoft Azure
Open Source Technologies on Microsoft Azure A Survey @DChappellAssoc Copyright 2014 Chappell & Associates The Main Idea i Open source technologies are a fundamental part of Microsoft Azure The Big Questions
IT services for analyses of various data samples
IT services for analyses of various data samples Ján Paralič, František Babič, Martin Sarnovský, Peter Butka, Cecília Havrilová, Miroslava Muchová, Michal Puheim, Martin Mikula, Gabriel Tutoky Technical
American International Journal of Research in Science, Technology, Engineering & Mathematics
American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629
DATA EXPERTS MINE ANALYZE VISUALIZE. We accelerate research and transform data to help you create actionable insights
DATA EXPERTS We accelerate research and transform data to help you create actionable insights WE MINE WE ANALYZE WE VISUALIZE Domains Data Mining Mining longitudinal and linked datasets from web and other
Predicting Flight Delays
Predicting Flight Delays Dieterich Lawson [email protected] William Castillo [email protected] Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing
Big Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 36 Outline
A Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
Robust Sentiment Detection on Twitter from Biased and Noisy Data
Robust Sentiment Detection on Twitter from Biased and Noisy Data Luciano Barbosa AT&T Labs - Research [email protected] Junlan Feng AT&T Labs - Research [email protected] Abstract In this
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
Sentiment Analysis using Hadoop Sponsored By Atlink Communications Inc
Sentiment Analysis using Hadoop Sponsored By Atlink Communications Inc Instructor : Dr.Sadegh Davari Mentors : Dilhar De Silva, Rishita Khalathkar Team Members : Ankur Uprit Kiranmayi Ganti Pinaki Ranjan
Data Mining with Hadoop at TACC
Data Mining with Hadoop at TACC Weijia Xu Data Mining & Statistics Data Mining & Statistics Group Main activities Research and Development Developing new data mining and analysis solutions for practical
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
Open source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Big Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform interactive execution of map reduce jobs Pig is the name of the system Pig Latin is the
Log Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]
Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
Can the Elephants Handle the NoSQL Onslaught?
Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented
Hadoop Technology for Flow Analysis of the Internet Traffic
Hadoop Technology for Flow Analysis of the Internet Traffic Rakshitha Kiran P PG Scholar, Dept. of C.S, Shree Devi Institute of Technology, Mangalore, Karnataka, India ABSTRACT: Flow analysis of the internet
An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,
Mammoth Scale Machine Learning!
Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes
Combining Social Data and Semantic Content Analysis for L Aquila Social Urban Network
I-CiTies 2015 2015 CINI Annual Workshop on ICT for Smart Cities and Communities Palermo (Italy) - October 29-30, 2015 Combining Social Data and Semantic Content Analysis for L Aquila Social Urban Network
Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
Sentiment Analysis of Twitter Data within Big Data Distributed Environment for Stock Prediction
Proceedings of the Federated Conference on Computer Science and Information Systems pp. 1349 1354 DOI: 10.15439/2015F230 ACSIS, Vol. 5 Sentiment Analysis of Twitter Data within Big Data Distributed Environment
The 4 Pillars of Technosoft s Big Data Practice
beyond possible Big Use End-user applications Big Analytics Visualisation tools Big Analytical tools Big management systems The 4 Pillars of Technosoft s Big Practice Overview Businesses have long managed
Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis
Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Yue Dai, Ernest Arendarenko, Tuomo Kakkonen, Ding Liao School of Computing University of Eastern Finland {yvedai,
Manifest for Big Data Pig, Hive & Jaql
Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,
An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov
An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
Large Scale Multi-view Learning on MapReduce
Large Scale Multi-view Learning on MapReduce Cibe Hariharan Ericsson Research India Email: [email protected] Shivashankar Subramanian Ericsson Research India Email: [email protected] Abstract
Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics
In Organizations Mark Vervuurt Cluster Data Science & Analytics AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning
Oracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
Data Refinery with Big Data Aspects
International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data
Big Data and Hadoop with components like Flume, Pig, Hive and Jaql
Abstract- Today data is increasing in volume, variety and velocity. To manage this data, we have to use databases with massively parallel software running on tens, hundreds, or more than thousands of servers.
Semantic Sentiment Analysis of Twitter
Semantic Sentiment Analysis of Twitter Hassan Saif, Yulan He & Harith Alani Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom The 11 th International Semantic Web Conference
