Final Project Report. Twitter Sentiment Analysis

Final Project Report
Twitter Sentiment Analysis
John Dodd
Student number: x
Higher Diploma in Science in Data Analytics
28/05/2014

Declaration

SECTION 1 Student to complete
Name:
Student ID:
Supervisor:

SECTION 2 Confirmation of Authorship
The acceptance of your work is subject to your signature on the following declaration: I confirm that I have read the College statement on plagiarism (summarised overleaf and printed in full in the Student Handbook) and that the work I have submitted for assessment is entirely my own work.
Signature:
Date:

NB. If it is suspected that your assignment contains the work of others falsely represented as your own, it will be referred to the College's Disciplinary Committee. Should the Committee be satisfied that plagiarism has occurred this is likely to lead to your failing the module and possibly to your being suspended or expelled from college.

Abstract

Over the past decade humans have experienced exponential growth in the use of online resources, in particular social media and microblogging websites such as Twitter. Many companies and organisations have identified these resources as a rich mine of marketing knowledge. This project focuses on implementing machine learning algorithms to extract an audience's sentiment relating to a popular television program. A major focus of this study was on comparing different machine learning algorithms for the task of sentiment classification. The major finding was that, of the classification algorithms evaluated, the Random Forest classifier provided the highest classification accuracy for this domain. From the evaluation of this study it can be concluded that the proposed machine learning and natural language processing techniques are effective and practical methods for sentiment analysis.

Table of Contents

Abstract
Table of Figures
Table of Tables
Section 1: Introduction
    Objective
    Motivation
    Contribution to the Knowledge
Section 2: Background
    Data Mining & Sentiment analysis
    Machine Learning Algorithms
    Naïve Bayes
    Decision Tree
    Random Forests
    Support Vector Machine
Section 3: Implementation
    Platform and Software
    Platform
    Python
    SQL
    Weka
    Data Collection
    Training Data
    Data Cleaning
    Implementation of Classifiers in Weka
Section 4: Classifier Evaluation
    Classifying Twitter data
    Manual Verification
Section 5: Results and Conclusion
    Results
    Evaluation
    Conclusion
    Future work
Bibliography
Appendix
    Appendix A: Python Programs
    Appendix B: JSON Format
    Appendix C: Classifier Evaluation Results
    Appendix D: Manual Validation
    Appendix E: Reports & Project Management

Table of Figures

Figure 1: Decision Tree Structure (Donaldson, 2012)
Figure 2: Random Forest Structure
Figure 3: SVM basic operation (Anon., 2011)
Figure 4: Finding the optimum hyperplane (Buch, 2008)
Figure 5: System Architecture
Figure 6: Sentiment

Table of Tables

Table 1: Sample collected data
Table 2: Emoticon usage frequency
Table 3: Sample Training Data
Table 4: Unwanted Content
Table 5: Sample Cleaned Data
Table 6: Model Evaluation
Table 7: Manual Verification Confusion Matrix
Table 8: Overall Sentiment

Section 1: Introduction

This section will cover the overall aim of the project, the motivation behind it, and any contribution to the knowledge base that has been added.

Objective

The overall goal of this project was to perform a sentiment analysis on a particular television program. Viewer opinions on the American television show Supernatural were mined from the popular microblogging website Twitter. The main goal of such a sentiment analysis is to discover how the audience perceives the television show. The Twitter data that is collected will be classified into two categories: positive or negative. An analysis will then be performed on the classified data to investigate what percentage of the audience sample falls into each category. Particular emphasis is placed on evaluating different machine learning algorithms for the task of Twitter sentiment analysis.

Motivation

Over the past decade humans have experienced exponential growth in the use of online resources, in particular social media and microblogging websites such as Twitter, Facebook and YouTube. Many companies and organisations have identified these resources as a rich mine of marketing knowledge. Traditionally companies used interviews, questionnaires and surveys to gain feedback and insight into how customers felt about their products. These traditional methods were often extremely time consuming and expensive, and did not always return the results that the companies were looking for due to environmental factors and poorly designed surveys. Natural language processing and sentiment analysis are playing an increasingly important role in making educated decisions on marketing strategies and giving valuable feedback on products and services. Massive amounts of data containing consumer sentiment are uploaded to the internet every day; this type of data is predominantly unstructured text that is difficult for computers to gain meaning from.

In the past it was not possible to process such large amounts of unstructured data, but now, with computational power following the projections of Moore's law (Moore, 1965) and distributed networks of computers using frameworks such as Hadoop, massive datasets can be processed with relative ease. Major investment is going into this area, such as IBM's research into its Natural Language Processing supercomputer Watson and Google's recent acquisition of DeepMind Technologies. With further research and investment into this area machines will soon be able to gain an understanding from text, which will greatly improve data analytics and search engines.

Contribution to the Knowledge

To many companies and organisations a customer's perception of a product or service is extremely valuable information. From the knowledge gained from an analysis such as this a company can identify issues with their products, spot trends before their competitors, create improved communications with their target audience, and gain valuable insight into how effective their marketing campaigns were. Through this knowledge companies gain valuable feedback which allows them to further develop the next generation of their product.

In the context of the sentiment analysis being carried out for this application, the results will allow the producers of the show to gain insight into how each episode is being perceived by the viewer. This is very valuable information as viewers are uploading their expectations, opinions and views on the television program before, during and after it is aired. This revolutionises the feedback process: an application such as this has the potential to analyse the sentiment in real time, giving the producers immediate feedback on how the program is being held in the eyes of its audience. Such an application could be expanded to use clustering algorithms to give insight into particular scenes or characters.

From an academic perspective it was felt that there were no new findings added to the knowledge base of natural language processing or sentiment analysis. This was to be expected: with this course being a level 8 on the national framework and the short duration of the project, it would have been extremely difficult to make a meaningful contribution to these already highly researched fields. That said, an application such as this does have value for small to medium sized companies that want to gain valuable insights into their data without investing heavily in the area.

This report would also serve well as reference material for other researchers in the field, as there are very few documents available in the Twitter sentiment analysis domain that compare a number of different machine learning classification algorithms and achieve such high accuracy in their finished model.

Section 2: Background

This section aims to give an overview of the background material used for this project. Most notably it will cover sentiment analysis, machine learning and the classification algorithms used in this project.

Data Mining & Sentiment analysis

Data mining is the computational process of finding patterns in large datasets, and its methods are at the intersection between artificial intelligence, machine learning, computer science, database technologies, and statistics. The objective of data mining is to extract information or knowledge from a dataset and transform it into a structure that can be understood.

Data preparation is an important part of any data analysis. To properly prepare data it is necessary to understand the application domain. This is important as the researcher must be able to identify pertinent data and cleanse the dataset by removing any data which is deemed unimportant to the analysis. Some of these pre-processing techniques include data cleaning, noise treatment, sampling, strategies to deal with missing values, normalisation, and feature extraction. Many of these pre-processing techniques will be examined in more detail in the implementation section of this report.

Data mining focuses on discovering patterns in data. Sentiment analysis, which is also known as opinion mining, focuses on discovering patterns in text that can be analysed to classify the sentiment in that text.

"Sentiment analysis is the field of study that analyses people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, and their attributes" (Liu, 2012).

Sentiment analysis is predominantly implemented in software which can autonomously extract emotions and opinions in text. It has many real world applications; for example, it allows companies to analyse how their products or brand are being perceived by their consumers, a usage that is particularly applicable to this project.

It is difficult to classify sentiment analysis as one specific field of study as it incorporates many different areas such as linguistics, Natural Language Processing (NLP), and Machine Learning or Artificial Intelligence. As the majority of the sentiment that is uploaded to the internet is of an unstructured nature, it is a difficult task for computers to process it and extract meaningful information from it. Natural language processing techniques are used to transform this raw data into a form that can be processed efficiently by a computer.

"Natural Language Processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages" (Kumar, 2011).

Natural language processing uses many different methods to process text, such as string processing (tokenizers, sentence tokenizers, stemmers) or part-of-speech tagging (n-gram, backoff, Brill, HMM, TnT) (Bird, Steven, Edward Loper, Ewan Klein, 2009).
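As an illustration of the string processing step mentioned above, the following is a minimal sketch of tokenization and n-gram extraction in plain Python. The function names `tokenize` and `ngrams`, the regular expression, and the example sentence are assumptions made for illustration; they are not taken from the project code, which used NLTK.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, not taken from the collected data.
tokens = tokenize("I love this show, best episode ever!")
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
```

Unigrams treat every word as a separate feature, while bigrams keep some local word order, which is one way features can be prepared for the classifiers discussed later.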

Machine Learning Algorithms

Machine learning is a branch of artificial intelligence which focuses on building models that have the ability to learn from data. As it is such an enormous field that encompasses many areas there is no standardised definition for it, but Arthur Samuel's general definition describes it well:

"Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed" (Samuel, 2000).

In general there are two types of machine learning algorithms, supervised and unsupervised. There are variations on these, such as hybrid types (semi-supervised learning), but for this report algorithms will be classified into one of the two categories.

The method of supervised learning consists of presenting an algorithm with a training dataset; this dataset consists of training examples and the corresponding expected output for each example. The expected output is generally known as the target. A supervised learning algorithm uses this dataset so that it can learn to map the input examples to their expected target. If the training process is implemented correctly the machine learning algorithm should be able to generalise from the training data so that it can correctly map new data that it has never seen before.

Unsupervised machine learning algorithms do not require training data; they operate on data where the output is unknown. The object of this form of learning is usually to discover patterns in the data that may not be known by the researcher. An example of an unsupervised method would be clustering, where the algorithm uses a distance function to group similar data points together.

This project focuses exclusively on supervised learning algorithms for the task of text classification. There are two major applications for supervised methods:

Classification: The target output is a class or label. The simplest case is a choice between zero or one, although there can also be multiple alternative classes. Classification is used for many applications such as text classification, object recognition, and voice recognition software.

Regression: In this case the target is a real number or vector of real numbers. Supervised regression algorithms are used mostly for prediction. Example applications are stock market prediction, the prediction of spikes in a network in power systems analysis, and most recently memory caching, where they have been used to complement the locality of reference method.

This project focuses on supervised classification algorithms; the models that were used are described below. After a review of the literature surrounding machine learning algorithms used for sentiment analysis it was found that some of the most commonly used and highest performing were the Naïve Bayes, Decision Tree, Random Forests, and the Support Vector Machine. Using the major literature in the field it was decided to further investigate these algorithms for their suitability to be used in this analysis.

Naïve Bayes

The Naïve Bayes classifier is a simple probabilistic model which relies on the assumption of feature independence in order to classify input data. Despite its simplicity, the algorithm is commonly used for text classification in many opinion mining applications (Pak Alexander, Paroubek Patrick, 2010) (Alec Go, Lei Huang, Richa Bhayani, 2009) (Prem Melville, Wojciech Gryc, Richard D. Lawrence, 2009). Much of its popularity is a result of its extremely simple implementation, low computational cost and relatively high accuracy.

The algorithm assumes that each feature is independent of the absence or presence of any other feature in the input data; because of this assumption it is known as naïve. In reality words in a sentence are strongly related, and their positions and presence in a sentence have a major impact on the overall meaning and sentiment of that sentence. Despite this naïve assumption the classifier can produce high classification accuracy when used with quality training data and in specific domains. A recent study (Zhang, n.d.) addressed this assumption and presented strong evidence of how the algorithm could be so effective while relying on it.

The algorithm itself is derived from Bayes' theorem:

$$P_{NB}(c \mid d) = \frac{P(c) \prod_{i=1}^{m} P(f_i \mid c)^{n_i(d)}}{P(d)}$$

where $P(c)$ and $P(f \mid c)$ are calculated by estimating the relative frequency of a feature $f$ extracted from the training data corpus, and where $n_i(d)$ is the number of occurrences of feature $f_i$ in the document $d$. In the entire training data corpus there are $m$ features. The document $d$ contains the training data or the input data to be classified.

In basic terms the algorithm will take every feature (word) in the training set and calculate the probability of it being in each class (positive or negative). Once the probabilities of each feature have been calculated the algorithm is ready to classify new data.
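The training and classification steps described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Weka implementation used in the project. The function names, the toy training sentences, and the use of add-one (Laplace) smoothing are all assumptions introduced for the example. Scores are summed in log space, which is equivalent to multiplying the probabilities but avoids numerical underflow, and the constant denominator P(d) is dropped because it does not affect which class scores highest.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """Estimate P(c) and P(f | c) by relative frequency from (tokens, label) pairs."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        feature_counts[label].update(tokens)
        vocabulary.update(tokens)
    total_docs = sum(class_counts.values())
    priors = {c: class_counts[c] / total_docs for c in class_counts}
    likelihoods = {}
    for c in class_counts:
        total = sum(feature_counts[c].values())
        # Add-one smoothing (an assumption, not from the report) avoids zero probabilities.
        likelihoods[c] = {
            f: (feature_counts[c][f] + 1) / (total + len(vocabulary))
            for f in vocabulary
        }
    return priors, likelihoods

def classify_naive_bayes(tokens, priors, likelihoods):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for feature, count in Counter(tokens).items():
            if feature in likelihoods[c]:  # words never seen in training are skipped
                score += count * math.log(likelihoods[c][feature])
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set, invented for illustration only.
training = [
    ("love this episode".split(), "positive"),
    ("great show love it".split(), "positive"),
    ("hate this boring episode".split(), "negative"),
    ("boring awful show".split(), "negative"),
]
priors, likelihoods = train_naive_bayes(training)
prediction = classify_naive_bayes("love this show".split(), priors, likelihoods)
```

Every word in the input contributes to the score, including weakly informative ones, which is how the classifier combines many small pieces of evidence.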
When a new sentence is being classified, the algorithm splits it into single word features and the model uses the probabilities that were computed in the training phase to calculate the conditional probabilities of the combined features in order to predict its class. Many machine learning algorithms will ignore features which have a weak influence on the overall classification; a major advantage of the Naïve Bayes classifier is that it utilizes all the evidence that is available to it in order to make a classification. Using this approach it takes into account that many weak features, which may have a relatively minor effect individually, may have a much larger influence on the overall classification when combined.

Some of the major work in the field of sentiment analysis using the Naïve Bayes was carried out by (Alexander Pak, Patrick Paroubek, 2010). The training data was collected using the assumption that the emoticons contained in a text represent the overall sentiment in that text. Using this assumption a large quantity of training data was automatically collected. This study used an ensemble of two different Naïve Bayes classifiers, one trained using the presence of unigrams while the second used part of speech tagging. When the two classifiers were combined they produced an accuracy of 74%.

Pang et al (Bo Pang, Lillian Lee, Shivakumar Vaithyanathan, 2002) used a single Naïve Bayes classifier on a movie review corpus to achieve similar results to the previous study. Multiple Naïve Bayes models were trained using different features such as part of speech tagging, unigrams, and bigrams.

The final model achieved a classification accuracy of 77.3%, which was considered a high performance for the algorithm on that domain.

Another important study in the field (Prem Melville, Wojciech Gryc, Richard D. Lawrence, 2009) combined a Naïve Bayes model with lexical knowledge to produce a classifier that had improved performance over the individual models. The model was tested on a number of different datasets; the most notable result was an accuracy of 91.21% on a Lotus dataset.

From the literature examined surrounding the Naïve Bayes it can be seen that despite its simplicity the algorithm has the ability to produce high classification accuracy on datasets similar to the one used in this project. It has many advantages over some of the more sophisticated algorithms, the major one being how simple it is to understand and implement; the low processing cost of the algorithm can be attributed to this simplicity.

Decision Tree

Decision trees are one of the most widely used machine learning algorithms; much of their popularity is due to the fact that they can be adapted to almost any type of data. They are a supervised machine learning algorithm that divides its training data into smaller and smaller parts in order to identify patterns that can be used for classification. The knowledge is then presented in the form of a logical structure similar to a flow chart that can be easily understood without any statistical knowledge. The algorithm is particularly well suited to cases where many hierarchical categorical distinctions can be made.

Decision trees are built using a heuristic called recursive partitioning. This is generally known as the divide and conquer approach because it uses feature values to split the data into smaller and smaller subsets of similar classes. The structure of a decision tree consists of a root node which represents the entire dataset, decision nodes which perform the computation, and leaf nodes which produce the classification. In the training phase the algorithm learns what decisions have to be made in order to split the labelled training data into its classes.

Figure 1: Decision Tree Structure (Donaldson, 2012)

In order to classify an unknown instance, the data is passed through the tree. At each decision node a specific feature from the input data is compared with a constant that was identified in the training phase. The decision is based on whether the feature is greater than or less than this constant, creating a two way split in the tree. The data passes through these decision nodes until it reaches a leaf node, which represents its assigned class.

There are many different implementations and variations of the decision tree algorithm. This project implements the J48 method, which is a Java implementation of the C4.5 algorithm, the industry standard up until the C5.0 algorithm was released.

Some of the major work in the field of sentiment analysis using the decision tree algorithm was carried out by Castillo et al (Castillo, Carlos, Marcelo Mendoza, Barbara Poblete, 2011). The study's main focus was on assessing the credibility of tweets posted on Twitter, but there was also a secondary focus on sentiment analysis. A decision tree was implemented using the J48 algorithm to classify sentiment in the Twitter dataset. By training the algorithm with hand annotated examples the algorithm produced an accuracy of 70%.

Another study (Bifet, Albert, Eibe Frank, 2010) implemented a Hoeffding tree for sentiment classification on a Twitter dataset. The authors trained the algorithm using a massive dataset of 1,600,000 tweets split into approximately equal representations of each class. The overall accuracy of the completed model was 69.36%. This would have been quite a disappointing result, as a vast amount of research and modification went into the algorithm and the dataset and the final result was only marginally higher than what the algorithm can achieve with no optimisations.

From the literature examined surrounding the decision tree algorithm it can be seen that the algorithm can be very effective for text-based classification.
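The way an instance is passed through the decision nodes until it reaches a leaf can be sketched as follows. This is a hand-built toy tree for illustration only: the class names, the word-count features, the thresholds and the leaf labels are all hypothetical, and no training step (such as the one performed by J48) is shown.

```python
class Leaf:
    """A leaf node holds the class assigned to instances that reach it."""
    def __init__(self, label):
        self.label = label

class DecisionNode:
    """A decision node compares one feature with a constant, giving a two way split."""
    def __init__(self, feature, threshold, below, above):
        self.feature = feature
        self.threshold = threshold
        self.below = below   # branch taken when value <= threshold
        self.above = above   # branch taken when value > threshold

def classify_with_tree(node, instance):
    """Follow the decision nodes from the root until a leaf is reached."""
    while isinstance(node, DecisionNode):
        value = instance.get(node.feature, 0)  # missing features count as 0
        node = node.above if value > node.threshold else node.below
    return node.label

# Hypothetical tree: the features are word counts in a tweet.
tree = DecisionNode(
    "love", 0,
    below=DecisionNode("hate", 0, below=Leaf("positive"), above=Leaf("negative")),
    above=Leaf("positive"),
)
result = classify_with_tree(tree, {"hate": 2, "show": 1})
```

In a trained tree the features and constants at each node would be chosen by the learning algorithm rather than written by hand.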

Random Forests

Ensemble learning focuses on techniques to combine the results of different trained models in order to produce a more accurate classifier. Ensemble models generally perform considerably better than a single model. The random forest algorithm is an example of an ensemble method which was introduced by (Breiman, 2001). It is quite a simple algorithm, but despite its simplicity it can produce state of the art performance in terms of classification. The basic structure of the random forest can be seen in Figure 2 below.

Figure 2: Random Forest Structure

Random forests are constructed by combining a number of decision tree classifiers; each tree is trained using a bootstrapped subset of the training data. At each decision node a random subset of the features is chosen and the algorithm will only consider splits on those features.

The main problem with using an individual tree is that it has high variance, that is to say that the arrangement of the training data and features may affect its performance. Each individual tree has high variance, but if we average over an ensemble of trees we can reduce the variance of the overall classification. Provided that each tree has better accuracy than pure chance, and that the trees are not highly correlated with one another, the central limit theorem states that their averaged output will approximately follow a Gaussian distribution. The more decisions that are averaged, the lower the variance will become. Reducing the variance will generally increase the overall performance of the model by lowering the overall error.

Looking at the literature surrounding the Random Forest algorithm for text classification, it was found that major works in this area were very sparse. Some of the work found in this field using the Random Forest algorithm was carried out by Aramaki et al (Aramaki, Eiji, Sachiko Maskawa, Mizuki Morita, 2011). The study focused on using machine learning and Twitter to detect influenza epidemics; a number of machine learning algorithms were compared in order to classify tweets containing keywords into two classes. The random forest proved to have an accuracy of 72.9% on the test dataset.
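The benefit of combining many trees that are each only slightly better than chance can be illustrated with a short calculation. The sketch below assumes an idealised ensemble in which every tree is correct with the same probability and the trees make their errors independently. Real trees in a random forest are correlated, so the improvement in practice is smaller. The function name and the chosen values are assumptions for illustration.

```python
from math import comb

def majority_vote_accuracy(p, n_trees):
    """Probability that more than half of n_trees independent voters,
    each correct with probability p, reach the correct class."""
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p ** k * (1 - p) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )

# Compare ensembles of increasing size for trees that are 60% accurate.
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions the accuracy of the vote rises as more trees are added, which matches the intuition that averaging more decisions lowers the variance of the ensemble.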

A study by Tsutsumi et al (Tsutsumi, Kimitaka; Kazutaka Shimada; Tsutomu Endo, 2007) implemented a weighted voting random forest on a movie review database. A scoring criterion was used to assign a weighted vote to each random tree in the forest. Using this method the algorithm produced an accuracy of 83.4% on a dataset of 1400 reviews.

From the literature reviewed it can be seen that the Random Forest algorithm can produce state of the art performance for text-based classification. By combining multiple simple random trees the algorithm can produce significantly higher performance than each tree individually, which is a strong result for such a simple algorithm.

Support Vector Machine

The support vector machine was the most sophisticated algorithm evaluated in this project and it is becoming an increasingly common method for text classification. Its increased popularity is largely due to the high classification accuracy that is associated with its use. The support vector machine is classed as a non-probabilistic binary linear classifier. It works by plotting the training data in multidimensional space; it then tries to separate the classes with a hyperplane. If the classes are not immediately linearly separable in the multidimensional space the algorithm will add a new dimension in an attempt to further separate the classes. It will continue this process until it is able to separate the training data into its two separate classes using a hyperplane. A basic representation of how it splits the data is shown in figure 3 below.

Figure 3: SVM basic operation (Anon., 2011)

One of the main areas where this method differs from other linear classifiers such as the perceptron is in the way it selects the hyperplane. In most cases there may be multiple hyperplanes, or in some cases an infinite number of hyperplanes, that could separate the classes.
The SVM algorithm chooses the hyperplane which provides the maximum separation between the classes, that is, the hyperplane with the greatest margin (the maximal margin hyperplane), which minimises the upper bound of the classification error. A standard method for finding the optimum way of separating the classes is to plot two hyperplanes in such a way that there are no data points between them; using these planes the final hyperplane can then be calculated. This process is shown in figure 4 below. The data points that fall on these planes are known as the support vectors.

Figure 4: Finding the optimum hyperplane (Buch, 2008)

Now that the algorithm has calculated the hyperplane that provides the maximum level of separation between the classes, new data can be classified. New instances are mapped into the feature space and are classified by which side of the hyperplane they fall on.

A major problem with the SVM is that by adding extra dimensions the size of the feature space increases exponentially. From a processing point of view the SVM algorithm counteracts this by using dot products in the original space (the kernel trick). This method hugely reduces processing, as all the calculations are performed in the original space and then mapped to the feature space. From a classification perspective this increase in the size of the feature space has a negative effect on the model's ability to accurately classify data; this is known as the Hughes effect (Hughes, 1968). It has a strong negative effect on classification because as the feature space increases the training data becomes extremely sparse in that space. To counteract this phenomenon the training data would need to be increased exponentially with every dimension added, which is not practical in real world applications.

Some of the major work in the field of sentiment analysis using the SVM was carried out by Pang et al (Bo Pang, 2002). In this study the SVM was used to extract sentiment from a movie review database; multiple SVMs were trained using different features such as part of speech tagging, unigrams, and bigrams. The final model achieved a classification accuracy of 82.9%, which was considered extremely high for that domain. The study's results ran contrary to those of other studies in the area (McCallum, Andrew, and Kamal Nigam, 1998), where the more traditional method of using unigram frequency was preferred over unigram presence.

A later study (Read, 2005) used the same principles as Pang et al but the training data was substantially increased. The increase in the training data was due to the researcher using the assumption that the emoticons contained in a text represent the overall sentiment in that text. Using this assumption large quantities of training data were automatically collected. The final model that was produced had a classification accuracy of 70% on a movie review dataset.

From the literature examined surrounding the support vector machine it can be seen that the algorithm has the ability to produce very high classification accuracy on datasets similar to the one used in this project. The major downside of this algorithm is that its complexity makes it difficult to gain a solid understanding of how exactly it works when compared to some of the simpler algorithms.
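Two ideas from the support vector machine description can be made concrete with a short sketch: classifying a point by which side of a hyperplane it falls on, and computing a dot product in a higher dimensional feature space using only the original space. The example assumes a degree-2 polynomial kernel on two-dimensional points and a hyperplane in canonical form (support vectors satisfy |w·x + b| = 1, so the margin width is 2/||w||). The weights, points and function names are hypothetical, and no training procedure is shown.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def predict(w, b, x):
    """Classify x by the side of the hyperplane w.x + b = 0 it falls on."""
    return 1 if dot(w, x) + b >= 0 else -1

def margin_width(w):
    """Distance between the two bounding planes for a canonical hyperplane."""
    return 2 / math.sqrt(dot(w, w))

def explicit_map(x):
    """Map a 2-D point into the 3-D feature space of the degree-2 kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """Same dot product as in the mapped space, computed in the original space."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)
in_feature_space = dot(explicit_map(x), explicit_map(y))
in_original_space = poly_kernel(x, y)
label = predict((1.0, 1.0), -3.0, (3.0, 2.0))
```

Both routes give the same value, but the kernel never constructs the higher dimensional vectors, which is why the processing cost stays manageable as the feature space grows.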

17 Section 3: Implementation This section aims to give an overview of how the data mining, text processing and machine learning techniques that were implemented in this project. A block diagram of the system can be seen in the figure 5 below. Figure 5: System Architecture Platform and Software This section aim to give a brief overview of the main software languages, environments, principle libraries and a brief description of the platform that was used to implement this project. Platform All processing was carried out on commodity hardware, specifically a 64bit Intel i3 processor (6GB ram) running a windows operating system. Python For the purpose of this project Python was used as it is a mature, versatile and robust high level programming language. It is an interpreted language which makes the testing and debugging phase s extremely quick as there is no compilation step. There are extensive open source libraries available for this version of python and a large community of users. This version of Python was chosen over the latest version Python 3.0 for numerous reasons the main ones being everything in 12

18 Python 3.0 is backwards compatible with the older versions and it is felt that Python 2.7 has better documentation and a wider community of users than the latest version. Other high level programming languages such as R and Matlab were considered because they have many benefits such as ease of use but they do not offer the same flexibility and freedom that Python can deliver. A low level language such a C was considered to write functions and some of algorithms that had very high computational cost. Using this low level language may have increased the efficiency of the algorithms and lowered computational overhead. As the project progressed it was found that there was insufficient time to explore this avenue. Python Key libraries: Python- Twitter, NLTK SQL There were many options on how to store the information such as a in a comma separated values (CSV) file, a text file or in a database. It was decided that the optimum approach was to use a SQL database as this method allowed faster read/write times, easier segmentation of the data to be stored for example the date and tweet could be stored in separate columns, and SQL databases also support multithreaded applications which would be beneficial if this project was ever expanded to a real-time classification application in the future. SQLite was used as the database management system as it is an open source application that is easily integrated with the Python programming language. Weka Weka is an open source software environment written in Java that can be used for many data mining applications. The environment contains a collection of machine learning algorithms that are suitable for the task of text classification and sentiment analysis. This software was chosen because it allows developers to quickly and easily pre-process data and build machine learning models. 
Weka was preferred over Python for this task because its easy-to-use GUI and its standardised classifier output allowed models to be evaluated and compared easily. The major downside of this software was that many of the algorithms were poorly documented and its community of users was much smaller than Python's.

Data Collection

Twitter allows developers access to a range of streaming APIs which offer low-latency access to flows of Twitter data. For the data collection implementation the public streams API was used; this was found to be the most suitable method of gathering information for data mining purposes, as it allowed access to a global stream of Twitter data that could be filtered as required. To take advantage of this stream, a Python interface library had to be installed so that Python could interface with Twitter's API v1.1. A number of libraries were available for this task; python twitter tools v was chosen as it provided the basic filtering and streaming functionality required for this project.

Twitter imposes numerous regulations and rate limits on its API, and for this reason it requires all users to register an account and provide authentication details when they query the API. Registration requires users to provide an address and telephone number for verification; once the account is verified, the user is issued with the authentication details which allow access to the API.

A Python script was then created which provided the API with the authentication details and initialised a streaming process through which data could be pulled from Twitter's RESTful web service to a local machine. A filter function was used to request Twitter content based on keywords related to this study. All the downloaded data was transmitted in JSON format; this standard was found to be less verbose than the alternative format offered, XML. Each JSON-formatted package contained a large amount of information, but it was decided that for this project only the tweet and the time the tweet was written were required (an example of the JSON format can be found in Appendix C).
To remove the unwanted content, each package was parsed using a Python script which located the useful content and held it in RAM until it could be written to disk. An additional check was performed to ensure that all the downloaded tweets were written in English. This check involved parsing the JSON content for the lang tag and then performing an equality check on its content. Once the required content had been extracted from the JSON package and held in RAM, it could be written to persistent storage. As described in the SQL section above, a SQL database was preferred over a comma separated values (CSV) file or a text file because it allowed faster read/write times, easier segmentation of the data (for example, the date and tweet could be stored in separate columns) and support for multithreaded applications, which would be beneficial if this project were ever expanded to a real-time classification application.

A database was created with a simple table structure which had the fields Primary_Key, Date, and Tweet. The primary key was generated automatically by incrementing a counter each time the database was written to. An example of the data collected is shown in Table 1 below, where Primary_Key is an integer and Date and Tweet are strings. It is worth noting that Date was stored as a string rather than the SQL DATE data type because it was intended to convert this data to a UNIX timestamp at a later stage in the project. The Python script that was created for this phase can be found in Appendix B.
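The parsing, language check and storage steps described above can be sketched as follows. This is a minimal illustration and not the Appendix B script: the function names, the table name `tweets` and the sample timestamp are assumptions made for the example, while `created_at`, `text` and `lang` are fields of the tweet objects returned by Twitter's API v1.1. The sketch is written for Python 3 and uses only the standard library.

```python
import json
import sqlite3


def open_database(path=":memory:"):
    """Open (or create) a SQLite database with the three fields used in the report."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "Primary_Key INTEGER PRIMARY KEY AUTOINCREMENT, "  # counter incremented on each write
        "Date TEXT, "                                       # kept as a string, as in the report
        "Tweet TEXT)"
    )
    return conn


def extract_tweet(raw_json):
    """Return (date, text) for an English tweet, or None if the package is unusable."""
    package = json.loads(raw_json)
    if package.get("lang") != "en":          # equality check on the lang tag
        return None
    if "created_at" not in package or "text" not in package:
        return None                          # e.g. delete notices have no tweet text
    return package["created_at"], package["text"]


def store_tweet(conn, date, text):
    """Write one tweet to the database using a parameterised query."""
    conn.execute("INSERT INTO tweets (Date, Tweet) VALUES (?, ?)", (date, text))
    conn.commit()
```

A streaming loop would call `extract_tweet` on each incoming package and pass any non-None result to `store_tweet`; the parameterised query avoids problems with quotes inside tweet text.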

Table 1: Sample collected data

Primary_Key | Date | Tweet
1 | Mon May 19 19:43: | Supernatural Season Finale TOMORROW
2 | Mon May 19 19:44: | Supernatural season finale tomorrow! So excited/nervous/scared!!!!
3 | Mon May 19 19:44: | SUPERNATURAL SEASON FINALE TOMORROW AND APPARENTLY JARED CRIED READING THE SCRIPT. NOOOO
4 | Mon May 19 19:45: | well in case i forgot, Supernatural Season Finale TOMORROW is world wide trending\nthanks guys
5 | Mon May 19 19:45: | Supernatural Season Finale TOMORROW - expect: blood, fight, guilt, anger, tears, pain, love, death, touching, danger, un\u2026

Training Data

In order to train a supervised learning algorithm a training dataset must be collected; this dataset consists of training examples and the corresponding expected output for each example. The expected output is generally known as the target. A supervised learning algorithm uses this dataset to learn a mapping from the input examples to their expected targets. If the training process is implemented correctly, the machine learning algorithm should be able to generalise from the training data so that it can correctly map new data that it has never seen before.

Training data must contain a class label. This can be achieved by manually assigning a class to each tweet, but this is a tedious process, and because Twitter enforces strict rules on the distribution of its data it has proved difficult to source reliable hand-annotated Twitter datasets. For this reason other avenues were examined, and it was found that a number of researchers (Bo Pang, 2002) have successfully used emoticons such as :) and :( as noisy labels in order to automatically classify sentences. This method rests on the assumption that the emoticon in a tweet represents the overall sentiment contained in that tweet. The assumption is quite reasonable: the maximum length of a tweet is 140 characters, so in the majority of cases the emoticon will correctly represent the overall sentiment of the tweet.
An analysis performed by DataGenetics (Berry, 2010) studied over 96 million tweets containing emoticons and documented the usage frequency of each emoticon. The five emoticons with the highest frequency are tabulated below.

Table 2: Emoticon usage frequency

Emoticon | Usage | Percentage
:) | 32,115, | %
:D | 10,595, | %
:( | 7,613, | %
;) | 7,238, | %
:-) | 4,250, | %

For this project the smiley face and the sad face were chosen as the noisy labels for the training data, as they were the most frequent emoticons representing the positive and the negative class respectively. The training data was collected using the process outlined in the data collection phase of this report. It was decided to collect 20,000 examples of each class to make up a training dataset of 40,000 tweets and their associated noisy labels. An example of some of the raw training data can be found in the table below.

Table 3: Sample Training Data

Tweet | Class
Big Thank You To Our Lovely Customers From WeLoveCarlisle For Our Splendid Gift :) | Positive
I really liked it, in my opinion it def is :) | Positive
You and I is amazing, the video is so sweet, be proud guys :) | Positive
:( How awful. Police: Driver kills 2, injures 23 at #SXSW | Negative
So pissed I just cracked my phone screen :( | Negative
I hate you srsly | Negative
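The noisy labelling rule can be sketched in a few lines. This is only an illustration of the idea: the function name is hypothetical, and discarding tweets that contain both emoticons (or neither) is an assumption, since the report does not say how such tweets were handled.

```python
POSITIVE_EMOTICON = ":)"
NEGATIVE_EMOTICON = ":("


def noisy_label(tweet):
    """Return 'positive', 'negative', or None when the emoticons give no clear label."""
    has_positive = POSITIVE_EMOTICON in tweet
    has_negative = NEGATIVE_EMOTICON in tweet
    if has_positive and not has_negative:
        return "positive"
    if has_negative and not has_positive:
        return "negative"
    return None  # assumption: ambiguous or unlabelled tweets are skipped
```

Applied to the rows of Table 3, `noisy_label("So pissed I just cracked my phone screen :(")` returns `"negative"`.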

Data Cleaning

The aim of the data cleaning process is to remove any unwanted content from the training data and the input tweets. The term unwanted content describes any piece of information within a tweet that will not help the machine learning algorithm to assign a class to that tweet. Data cleaning not only simplifies the classification task for the machine learning model but also greatly decreases processing cost in the training phase. The unwanted content is tabulated in Table 4 below.

Table 4: Unwanted Content

Unwanted Content | Action
Punctuation (!?,. : ; ) | Removed
#word | Removed
@username | Replaced with AT_USER
RT | Removed
Emoticons | Removed
Unicode formatting | Removed
Uppercase characters | Lowercase all content
URLs and web links | Replaced with URL

A Python script was created to read in each tweet from the database and perform processing on it in order to clean the undesirable data; this program can be found in Appendix B of the report. The script also removed stop words from the tweets. Stop words are words such as the, which, is, and at; they have little value for machine learning algorithms as they occur in approximately equal measure in the positive and the negative training sets. Removing them allows more specific word features to be passed into the classification models and greatly reduces processing during the training stages. There is no definitive standard for removing stop words as each application has different requirements; for this project the default stop word list was taken from the Natural Language Toolkit (NLTK) for Python. When this pre-processing of the tweets was complete, the cleaned data was stored in a new table in the database. A number of examples of the data before and after cleaning are contained in Table 5 below.
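The cleaning rules in Table 4 can be sketched with regular expressions. This is a hedged illustration rather than the Appendix B script: the function name, the regular expressions and the small `STOP_WORDS` set are assumptions (the report used NLTK's English stop word list), Unicode clean-up is omitted, and hashtags are handled by dropping only the `#` symbol so that the word itself is kept.

```python
import re

# Illustrative stand-in for NLTK's English stop word list (an assumption for this sketch).
STOP_WORDS = {"i", "you", "to", "our", "from", "for", "the", "is", "at",
              "which", "it", "in", "my", "so", "just", "how"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
EMOTICON_RE = re.compile(r"(?<!\S)[:;]-?[()d](?!\S)")  # text is lowercased first, so :D is :d
RT_RE = re.compile(r"\brt\b")
PUNCT_RE = re.compile(r"[!?,.:;#]")


def clean_tweet(text):
    """Apply the Table 4 actions in order and drop stop words."""
    text = text.lower()                      # lowercase all content
    text = URL_RE.sub(" URL ", text)         # URLs and web links -> URL
    text = USER_RE.sub(" AT_USER ", text)    # @username -> AT_USER
    text = EMOTICON_RE.sub(" ", text)        # remove standalone emoticons
    text = RT_RE.sub(" ", text)              # remove the retweet marker
    text = PUNCT_RE.sub(" ", text)           # remove punctuation and the # symbol
    words = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(words)
```

The order matters: URLs and mentions are replaced before punctuation is stripped, otherwise the dots and colons inside a link would be removed first and the link would no longer be recognisable.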

Table 5: Sample Cleaned Data

Status | Tweet
Raw Data | Big Thank You To Our Lovely Customers From WeLoveCarlisle For Our Splendid Gift :)
Cleaned | big thank lovely customers welovecarlisle splendid gift :)
Raw Data | I really liked it, in my opinion it def is :)
Cleaned | AT_USER really liked it, opinion def :)
Raw Data | So pissed I just cracked my phone screen :(
Cleaned | pissed cracked phone screen :(
Raw Data | :( How awful. Police: Driver kills 2, injures 23 at #SXSW
Cleaned | :( AT_USER awful police driver kills 2 injures 23 sxsw URL

Implementation of Classifiers in Weka

There are a number of machine learning libraries available for Python, such as PyBrain, mlpy, and Scikit-learn, but it was decided that Weka offered superior benefits such as ease of use, a larger library of classification algorithms, and more sophisticated report generation and visualisation. Weka is a standalone software environment built in Java that facilitates the use of a large collection of machine learning classifiers. It includes tools for pre-processing data, classification, clustering, regression, and visualisation. These combined tools make Weka a very powerful application, and for this reason it was decided to implement the machine learning classifiers in this environment.

The first stage of the implementation required the training data to be reformatted into ARFF (Attribute-Relation File Format) so that it could be read by Weka. This was done by reading all the tweets from the database and explicitly labelling each tweet with its class, before storing them in a text file with a header which specified the attributes of the data in the file. This file was then saved as an ARFF file which could be loaded into Weka. The next stage of the process was to load this training data into Weka and explicitly define the class information.
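The ARFF conversion can be sketched as below. The relation name, the attribute names `text` and `sentiment` and the function name are assumptions chosen for the example; the header keywords (`@relation`, `@attribute`, `@data`) and the quoting of string values follow the ARFF format that Weka reads.

```python
def to_arff(rows, relation="tweets"):
    """Build ARFF text from (tweet, label) pairs, one string attribute plus a nominal class."""
    lines = [
        "@relation " + relation,
        "",
        "@attribute text string",
        "@attribute sentiment {positive,negative}",
        "",
        "@data",
    ]
    for text, label in rows:
        # Escape backslashes and single quotes so each tweet stays one quoted value.
        escaped = text.replace("\\", "\\\\").replace("'", "\\'")
        lines.append("'%s',%s" % (escaped, label))
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file with an `.arff` extension gives a file that can be opened in the Weka Explorer; placing the class attribute last matches Weka's default assumption about which attribute is the class.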
This was a very important step, because at this stage the training data contained the tweet itself and a class label, and Weka cannot differentiate between the two. The class must be assigned to the data using the inbuilt ClassAssigner function. When this is complete the training data is ready for further processing.

All of the classifiers that were tested required the input data to be in numeric format. This posed a problem, as the tweets were stored as string variables, so the string data had to be converted into numeric form. For this step the StringToWordVector filter was used; this took every distinct word in the training set and then created a vector for each tweet. As some parts of Weka are poorly documented, it is unknown whether these sparse vectors are simplified in any way to reduce processing overhead.

Now that the pre-processing of the training data was complete, a machine learning classifier could be trained using the data. For the classification phase three different machine learning algorithms were chosen, in order to compare the performance of the models and select the most suitable classifier for the data. The three models selected were Naïve Bayes, the Support Vector Machine, and the Random Forest. A description of each of these algorithms can be found in the background section of this report. Once the models were trained they were saved, and an evaluation was carried out which can be found in the next section of the report.
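The idea behind converting tweets to word vectors can be illustrated in plain Python. This is not Weka's StringToWordVector implementation; it is a minimal bag-of-words sketch with hypothetical function names, which assumes whitespace tokenisation and records word presence (1 or 0) rather than counts.

```python
def build_vocabulary(tweets):
    """Map every distinct word in the training tweets to a column index."""
    words = sorted({word for tweet in tweets for word in tweet.split()})
    return {word: index for index, word in enumerate(words)}


def to_vector(tweet, vocabulary):
    """Return a 0/1 vector with one position per vocabulary word."""
    vector = [0] * len(vocabulary)
    for word in tweet.split():
        if word in vocabulary:       # words unseen in training are ignored
            vector[vocabulary[word]] = 1
    return vector
```

Because each tweet contains only a handful of the distinct words in the training set, almost every entry of such a vector is zero, which is why the vectors are described as sparse.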

Section 4: Classifier Evaluation

This section focuses on comparing and evaluating the performance of different machine learning algorithms on the training and test datasets, with the aim of selecting the best model for the task of Twitter sentiment classification. The section covers the processes that were used to evaluate the models and the attributes that were modified in order to select the optimum model for the task. The algorithm with the highest performance will then be applied to the Twitter data collected on the Supernatural television program to produce the overall sentiment in that dataset.

All models were evaluated using stratified threefold cross-validation. Cross-validation was chosen over the standard holdout method to prevent uneven representation of the classes in the training and test sets. The standard holdout method generally splits the data randomly into training and test data; the split varies depending on the researcher, but a commonly used method is to segment the data into 70% training and 30% test data. If safeguards are not put in place, it is not uncommon for the split to leave the classes unevenly represented in the training data. Take, for example, a 70/30 split in which the test data contained only examples of the negative class. This would greatly decrease the number of negative examples in the training set, which could severely impact the classification algorithm's ability to learn the underlying patterns of that class. To counteract this, stratified threefold cross-validation segments the data into three approximately equal partitions, each containing approximately equal representation of each class. Each partition in turn is used as test data, so the first time one third of the data is used for testing and the remaining two thirds are used for training. This process is repeated until every instance has been used as test data.
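The partitioning described above can be sketched as follows. This is an illustration of stratified k-fold splitting in general, not of Weka's internal procedure; the function names and the fixed random seed are assumptions made so that the example is reproducible.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=3, seed=0):
    """Split instance indices into k folds with roughly equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out to the folds in turn, like dealing cards.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_test_splits(labels, k=3, seed=0):
    """Yield (train_indices, test_indices) with each fold used once as test data."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [index for j, fold in enumerate(folds) if j != i for index in fold]
        yield train, test
```

With 20,000 positive and 20,000 negative tweets and k=3, each fold would hold roughly 6,667 tweets of each class, so no training set can end up missing a class.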
Research has shown that using this method can give a more reliable estimate of the performance of the classification model. Tenfold cross-validation is the standard method used in many machine learning applications, but due to the high processing cost associated with it this method was not feasible for this project.

Each model was evaluated under four criteria: accuracy, precision, recall, and F-measure. The accuracy is simply the percentage of correctly classified instances. The precision is calculated for each class and is the number of instances correctly classified as that class out of all the instances classified as that class. The recall is also calculated for each class and is the number of correctly classified instances of a class out of all the instances of that class. The F-score or F-measure gives a good indication of the overall performance of a classifier and is calculated using the following formula:

$$F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

This evaluation process focused on finding the optimum classification model for the dataset, which entailed modifying different attributes of each algorithm to increase its performance. It was discovered early on that the performance of Naïve Bayes and J48 was significantly lower than that of the SVM and the Random Forest. Because of this relatively low performance it was decided that there was no benefit in further optimising these two algorithms, as it was highly unlikely that tuning would close such a large gap. During this evaluation an analysis was performed of the effect word stemming had on the performance of the SVM and the Random Forest. The SVM was evaluated using different kernel methods and the Random Forest was evaluated using different numbers of random trees to see how these methods and attributes affected the models' performance. The results of this evaluation can be found in Table 6 below.

Table 6: Model Evaluation

Model | Stemmer | Kernel | Accuracy | Precision | Recall | F-Measure | Class
SVM | Null | Poly kernel | % | | | | Pos, Neg
SVM | Iterated Lovins | Poly kernel | 79.32% | | | | Pos, Neg
SVM | Lovins | Poly kernel | 79.32% | | | | Pos, Neg
SVM | Snowball | Poly kernel | % | | | | Pos, Neg
SVM | Null | Normalised Poly | % | | | | Pos, Neg
SVM | Null | PUK | | | | | Pos, Neg
SVM | Null | RBF kernel | % | | | | Pos, Neg
Random Forest* | Null | N/A | 81.02% | | | | Pos, Neg
Random Forest** | Null | N/A | 82.17% | | | | Pos, Neg
Random Forest*** | Null | N/A | | | | | Pos, Neg
Naïve Bayes | Null | N/A | % | | | | Pos, Neg
J48 | Null | N/A | | | | | Pos, Neg

*5 random trees, **10 random trees, ***20 random trees

From the evaluation it was found that word stemming had either no effect or a negative effect on the Support Vector Machine's performance. It did however have a marginal positive effect when used with the Random Forest, but it was decided to use no stemmer in the training of the final model as the additional computational cost associated with it did not justify its small performance increase. The kernel methods used in the Support Vector Machine did affect its overall performance, and it was found that the normalised polynomial kernel provided the best results on this dataset out of all the kernel methods tested. It is worth noting that the normalised polynomial kernel incurred a processing cost of almost twice that of the standard polynomial kernel. It was also found that increasing the number of random trees used in the Random Forest increased the overall accuracy of the algorithm. Again, adding more random trees increased computational overhead. The accuracy of Naïve Bayes and J48 was so far below that of the SVM and the Random Forest that it did not seem reasonable to carry out an in-depth evaluation of them, as neither was considered for the final model. From this evaluation it was found that the Random Forest classifier with 20 random trees had the highest performance on the dataset. For this reason it was decided that this model would be applied to the real data in order to find its sentiment.

Classifying Twitter data

This section focuses on the process of classifying the data taken from Twitter's API. The tweets were read from the database and converted into ARFF format so that they could be processed by Weka. They were loaded into the Weka environment, and the Random Forest classifier that was created in the previous section was used to classify the sentiment in each tweet into the positive or negative class. When this classification was complete the results were saved in a text file. As there were a large number of tweets in the dataset, a Python program was created to calculate the percentage of positive and negative tweets in the file and to visualise the results. This Python script can be found in Appendix B.
Manual Verification

In order to be thorough, a manual verification of the classified data was carried out on a sample of 100 tweets from the data. This process entailed a human manually classifying the sentiment in each sampled tweet and comparing it with the computer-generated classification. The full sample of tweets and their classifications can be found in Appendix D. It is worth noting that because the neutral class is ignored in this project, it is unfair to the classifier to judge its performance on any neutral tweets that may be contained in the data. The results of this process are shown as a confusion matrix in Table 7 below.

Table 7: Manual Verification Confusion Matrix

Actual Class | Classified As Positive | Classified As Negative
Positive | 58 | 9
Negative | |

Using the confusion matrix, the overall accuracy, precision, recall and F-measure were calculated.

Table 8: Manual validation performance

Accuracy | Precision | Recall | F-measure | Class
83% | | | | Positive
 | | | | Negative

The results from the manual verification are very similar to the results from the cross-validation. Manually classifying a sample of the input data has served to validate the computer-generated classifications and to reassure the user that the high accuracy on the test data was not due to some computational error.
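The calculation from a two-class confusion matrix to these measures can be sketched as below. The function names are hypothetical, and the counts in the usage note are made-up values chosen for easy arithmetic; they are not the counts from the manual verification. The code assumes Python 3 division and non-zero denominators.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F-measure for one class.

    tp: instances of the class classified as the class
    fp: instances of other classes classified as the class
    fn: instances of the class classified as something else
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def accuracy(correct_positive, correct_negative, total):
    """Fraction of all instances that were classified correctly."""
    return (correct_positive + correct_negative) / total
```

For example, with 40 positive tweets classified correctly, 10 positive tweets classified as negative, 5 negative tweets classified as positive and 45 negative tweets classified correctly, `class_metrics(40, 5, 10)` gives a precision of about 0.889, a recall of 0.8 and an F-measure of about 0.842 for the positive class, and `accuracy(40, 45, 100)` gives 0.85.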

Section 5: Results and Conclusion

This section covers the results and evaluation of the classification of the input data collected from Twitter. It also provides a conclusion and recommendations for further work on the project.

Results

The results from the Random Forest classifier when applied to the dataset of 8634 Supernatural-related tweets collected from Twitter's API are displayed in Table 9 below. The overall sentiment contained in the input data is also displayed as a pie chart in Figure 6.

Table 9: Overall Sentiment

Number of input instances | 8634
Number of instances of class positive | 5017
Number of instances of class negative | 3617
Percentage of instances classified positive | 58.11%
Percentage of instances classified negative | 41.89%

Figure 6: Sentiment (pie chart with Negative and Positive segments)
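The percentages in Table 9 follow directly from the class counts, as the short sketch below shows. The function name is hypothetical and this is not the Appendix B script; only the counts 5017 and 3617 are taken from the report.

```python
def sentiment_shares(counts):
    """Convert per-class counts into percentages rounded to two decimal places."""
    total = sum(counts.values())
    return {label: round(100 * count / total, 2) for label, count in counts.items()}


shares = sentiment_shares({"positive": 5017, "negative": 3617})
print(shares)
```

With the counts from Table 9 the total is 8634, and the positive and negative shares come out at 58.11 and 41.89 percent, matching the reported values.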


More information

Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

More information

Twitter Sentiment Analysis of Movie Reviews using Machine Learning Techniques.

Twitter Sentiment Analysis of Movie Reviews using Machine Learning Techniques. Twitter Sentiment Analysis of Movie Reviews using Machine Learning Techniques. Akshay Amolik, Niketan Jivane, Mahavir Bhandari, Dr.M.Venkatesan School of Computer Science and Engineering, VIT University,

More information

DATA EXPERTS MINE ANALYZE VISUALIZE. We accelerate research and transform data to help you create actionable insights

DATA EXPERTS MINE ANALYZE VISUALIZE. We accelerate research and transform data to help you create actionable insights DATA EXPERTS We accelerate research and transform data to help you create actionable insights WE MINE WE ANALYZE WE VISUALIZE Domains Data Mining Mining longitudinal and linked datasets from web and other

More information

Inference Rating Framework for Sentiment Analysis Using SVM Light

Inference Rating Framework for Sentiment Analysis Using SVM Light Inference Rating Framework for Sentiment Analysis Using SVM Light Tapan Biswas 1, Poonam Singh 2 and Binay Kumar Pandey 3 1 Tapan Biswas, Govindh Ballabh Pant University of Agriculure and Technology,Pantnagar,

More information

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP 1 KALYANKUMAR B WADDAR, 2 K SRINIVASA 1 P G Student, S.I.T Tumkur, 2 Assistant Professor S.I.T Tumkur Abstract- Product Review System

More information

Sentiment Analysis of Twitter Feeds for the Prediction of Stock Market Movement

Sentiment Analysis of Twitter Feeds for the Prediction of Stock Market Movement Sentiment Analysis of Twitter Feeds for the Prediction of Stock Market Movement Ray Chen, Marius Lazer Abstract In this paper, we investigate the relationship between Twitter feed content and stock market

More information

TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP

TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Robust Sentiment Detection on Twitter from Biased and Noisy Data

Robust Sentiment Detection on Twitter from Biased and Noisy Data Robust Sentiment Detection on Twitter from Biased and Noisy Data Luciano Barbosa AT&T Labs - Research lbarbosa@research.att.com Junlan Feng AT&T Labs - Research junlan@research.att.com Abstract In this

More information

Distinguishing Opinion from News Katherine Busch

Distinguishing Opinion from News Katherine Busch Distinguishing Opinion from News Katherine Busch Abstract Newspapers have separate sections for opinion articles and news articles. The goal of this project is to classify articles as opinion versus news

More information

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

White Paper. How Streaming Data Analytics Enables Real-Time Decisions White Paper How Streaming Data Analytics Enables Real-Time Decisions Contents Introduction... 1 What Is Streaming Analytics?... 1 How Does SAS Event Stream Processing Work?... 2 Overview...2 Event Stream

More information

How much does word sense disambiguation help in sentiment analysis of micropost data?

How much does word sense disambiguation help in sentiment analysis of micropost data? How much does word sense disambiguation help in sentiment analysis of micropost data? Chiraag Sumanth PES Institute of Technology India Diana Inkpen University of Ottawa Canada 6th Workshop on Computational

More information

Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance. Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics

More information

Sentiment Analysis for Movie Reviews

Sentiment Analysis for Movie Reviews Sentiment Analysis for Movie Reviews Ankit Goyal, a3goyal@ucsd.edu Amey Parulekar, aparulek@ucsd.edu Introduction: Movie reviews are an important way to gauge the performance of a movie. While providing

More information

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Outline Big Data How to extract information? Data clustering

More information

Keywords social media, internet, data, sentiment analysis, opinion mining, business

Keywords social media, internet, data, sentiment analysis, opinion mining, business Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Real time Extraction

More information

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the

More information

Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

More information

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume

More information

Decision Support Optimization through Predictive Analytics - Leuven Statistical Day 2010

Decision Support Optimization through Predictive Analytics - Leuven Statistical Day 2010 Decision Support Optimization through Predictive Analytics - Leuven Statistical Day 2010 Ernst van Waning Senior Sales Engineer May 28, 2010 Agenda SPSS, an IBM Company SPSS Statistics User-driven product

More information

Role of Social Networking in Marketing using Data Mining

Role of Social Networking in Marketing using Data Mining Role of Social Networking in Marketing using Data Mining Mrs. Saroj Junghare Astt. Professor, Department of Computer Science and Application St. Aloysius College, Jabalpur, Madhya Pradesh, India Abstract:

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School

More information

Classification of Commodity Price Forecast Sentiment With Random Forests and Bayesian Optimization

Classification of Commodity Price Forecast Sentiment With Random Forests and Bayesian Optimization Classification of Commodity Price Forecast Sentiment With Random Forests and Bayesian Optimization 1 2 3 4 5 Anonymous Author(s) Affiliation Address Email 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

More information

Crowdfunding Support Tools: Predicting Success & Failure

Crowdfunding Support Tools: Predicting Success & Failure Crowdfunding Support Tools: Predicting Success & Failure Michael D. Greenberg Bryan Pardo mdgreenb@u.northwestern.edu pardo@northwestern.edu Karthic Hariharan karthichariharan2012@u.northwes tern.edu Elizabeth

More information

Journée Thématique Big Data 13/03/2015

Journée Thématique Big Data 13/03/2015 Journée Thématique Big Data 13/03/2015 1 Agenda About Flaminem What Do We Want To Predict? What Is The Machine Learning Theory Behind It? How Does It Work In Practice? What Is Happening When Data Gets

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

Direct-to-Company Feedback Implementations

Direct-to-Company Feedback Implementations SEM Experience Analytics Direct-to-Company Feedback Implementations SEM Experience Analytics Listening System for Direct-to-Company Feedback Implementations SEM Experience Analytics delivers real sentiment,

More information

2015 Workshops for Professors

2015 Workshops for Professors SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market

More information

2009-04-15. The Representation and Storage of Combinatorial Block Designs Outline. Combinatorial Block Designs. Project Intro. External Representation

2009-04-15. The Representation and Storage of Combinatorial Block Designs Outline. Combinatorial Block Designs. Project Intro. External Representation Combinatorial Block Designs 2009-04-15 Outline Project Intro External Representation Design Database System Deployment System Overview Conclusions 1. Since the project is a specific application in Combinatorial

More information

Semantic Sentiment Analysis of Twitter

Semantic Sentiment Analysis of Twitter Semantic Sentiment Analysis of Twitter Hassan Saif, Yulan He & Harith Alani Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom The 11 th International Semantic Web Conference

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

Beating the NCAA Football Point Spread

Beating the NCAA Football Point Spread Beating the NCAA Football Point Spread Brian Liu Mathematical & Computational Sciences Stanford University Patrick Lai Computer Science Department Stanford University December 10, 2010 1 Introduction Over

More information

Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.

Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques. International Journal of Emerging Research in Management &Technology Research Article October 2015 Comparative Study of Various Decision Tree Classification Algorithm Using WEKA Purva Sewaiwar, Kamal Kant

More information

Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis

Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Yue Dai, Ernest Arendarenko, Tuomo Kakkonen, Ding Liao School of Computing University of Eastern Finland {yvedai,

More information

Learning is a very general term denoting the way in which agents:

Learning is a very general term denoting the way in which agents: What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);

More information

Natural Language Processing for Sentiment Analysis

Natural Language Processing for Sentiment Analysis 2014 4th International Conference on Artificial Intelligence with Applications in Engineering and Technology Natural Language Processing for Sentiment Analysis An Exploratory Analysis on Wei Yen Chong

More information

A Comparative Study on Sentiment Classification and Ranking on Product Reviews

A Comparative Study on Sentiment Classification and Ranking on Product Reviews A Comparative Study on Sentiment Classification and Ranking on Product Reviews C.EMELDA Research Scholar, PG and Research Department of Computer Science, Nehru Memorial College, Putthanampatti, Bharathidasan

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

Twitter sentiment vs. Stock price!

Twitter sentiment vs. Stock price! Twitter sentiment vs. Stock price! Background! On April 24 th 2013, the Twitter account belonging to Associated Press was hacked. Fake posts about the Whitehouse being bombed and the President being injured

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Data Mining & Data Stream Mining Open Source Tools

Data Mining & Data Stream Mining Open Source Tools Data Mining & Data Stream Mining Open Source Tools Darshana Parikh, Priyanka Tirkha Student M.Tech, Dept. of CSE, Sri Balaji College Of Engg. & Tech, Jaipur, Rajasthan, India Assistant Professor, Dept.

More information

International Journal of Electronics and Computer Science Engineering 1449

International Journal of Electronics and Computer Science Engineering 1449 International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and

More information

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Data is Important because it: Helps in Corporate Aims Basis of Business Decisions Engineering Decisions Energy

More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

More information

Why is Internal Audit so Hard?

Why is Internal Audit so Hard? Why is Internal Audit so Hard? 2 2014 Why is Internal Audit so Hard? 3 2014 Why is Internal Audit so Hard? Waste Abuse Fraud 4 2014 Waves of Change 1 st Wave Personal Computers Electronic Spreadsheets

More information

Classification of Titanic Passenger Data and Chances of Surviving the Disaster Data Mining with Weka and Kaggle Competition Data

Classification of Titanic Passenger Data and Chances of Surviving the Disaster Data Mining with Weka and Kaggle Competition Data Proceedings of Student-Faculty Research Day, CSIS, Pace University, May 2 nd, 2014 Classification of Titanic Passenger Data and Chances of Surviving the Disaster Data Mining with Weka and Kaggle Competition

More information

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Prof. Pietro Ducange Students Tutor and Practical Classes Course of Business Intelligence 2014 http://www.iet.unipi.it/p.ducange/esercitazionibi/

Prof. Pietro Ducange Students Tutor and Practical Classes Course of Business Intelligence 2014 http://www.iet.unipi.it/p.ducange/esercitazionibi/ Prof. Pietro Ducange Students Tutor and Practical Classes Course of Business Intelligence 2014 http://www.iet.unipi.it/p.ducange/esercitazionibi/ Email: p.ducange@iet.unipi.it Office: Dipartimento di Ingegneria

More information

lop Building Machine Learning Systems with Python en source

lop Building Machine Learning Systems with Python en source Building Machine Learning Systems with Python Master the art of machine learning with Python and build effective machine learning systems with this intensive handson guide Willi Richert Luis Pedro Coelho

More information

End-to-End Sentiment Analysis of Twitter Data

End-to-End Sentiment Analysis of Twitter Data End-to-End Sentiment Analysis of Twitter Data Apoor v Agarwal 1 Jasneet Singh Sabharwal 2 (1) Columbia University, NY, U.S.A. (2) Guru Gobind Singh Indraprastha University, New Delhi, India apoorv@cs.columbia.edu,

More information

College Tuition: Data mining and analysis

College Tuition: Data mining and analysis CS105 College Tuition: Data mining and analysis By Jeanette Chu & Khiem Tran 4/28/2010 Introduction College tuition issues are steadily increasing every year. According to the college pricing trends report

More information

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India sumit_13@yahoo.com 2 School of Computer

More information

Predicting Flight Delays

Predicting Flight Delays Predicting Flight Delays Dieterich Lawson jdlawson@stanford.edu William Castillo will.castillo@stanford.edu Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing

More information

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.

More information

Text Classification and Clustering with. A guided example by Sergio Jiménez

Text Classification and Clustering with. A guided example by Sergio Jiménez Text Classification and Clustering with WEKA A guided example by Sergio Jiménez The Task Building a model for movies revisions in English for classifying it into positive or negative. Sentiment Polarity

More information