Final Project Report Twitter Sentiment Analysis John Dodd Student number: x13117815 John.Dodd@student.ncirl.ie Higher Diploma in Science in Data Analytics 28/05/2014
Declaration SECTION 1 Student to complete Name: Student ID: Supervisor: SECTION 2 Confirmation of Authorship The acceptance of your work is subject to your signature on the following declaration: I confirm that I have read the College statement on plagiarism (summarised overleaf and printed in full in the Student Handbook) and that the work I have submitted for assessment is entirely my own work. Signature: Date: NB. If it is suspected that your assignment contains the work of others falsely represented as your own, it will be referred to the College's Disciplinary Committee. Should the Committee be satisfied that plagiarism has occurred this is likely to lead to your failing the module and possibly to your being suspended or expelled from college.
Abstract Over the past decade humans have experienced exponential growth in the use of online resources, in particular social media and microblogging websites such as Twitter. Many companies and organisations have identified these resources as a rich mine of marketing knowledge. This project focuses on implementing machine learning algorithms to extract an audience's sentiment relating to a popular television program. A major focus of this study was on comparing different machine learning algorithms for the task of sentiment classification. Of the classification algorithms evaluated, it was found that the Random Forest classifier provided the highest classification accuracy for this domain. From the evaluation of this study it can be concluded that the proposed machine learning and natural language processing techniques are effective and practical methods for sentiment analysis.
Table of Contents Abstract... iii Table of Figures... v Table of Tables... v Section 1: Introduction... 1 Objective... 1 Motivation... 1 Contribution to the Knowledge... 2 Section 2: Background... 3 Data Mining & Sentiment analysis... 3 Machine Learning Algorithms... 4 Naïve Bayes... 5 Decision Tree... 6 Random Forests... 8 Support Vector Machine... 9 Section 3: Implementation... 12 Platform and Software... 12 Platform... 12 Python... 12 SQL... 13 Weka... 13 Data Collection... 13 Training Data... 15 Data Cleaning... 16 Implementation of Classifiers in Weka... 18 Section 4: Classifier Evaluation... 20 Classifying Twitter data... 22 Manual Verification... 22 Section 5: Results and Conclusion... 24 Results... 24 Evaluation... 25 Conclusion... 25
Future work... 25 Bibliography... 27 Appendix... 29 Appendix A: Python Programs... 29 Appendix B: JSON Format... 32 Appendix C: Classifier Evaluation Results... 35 Appendix D: Manual Validation... 39 Appendix E: Reports & Project Management... 45 Table of Figures Figure 1: Decision Tree Structure (Donaldson, 2012)... 6 Figure 2: Random Forest Structure... 8 Figure 3: SVM basic operation (Anon., 2011)... 9 Figure 4: Finding the optimum hyperplane (Buch, 2008)... 10 Figure 5: System Architecture... 12 Figure 6: Sentiment... 24 Table of Tables Table 1: Sample collected data... 15 Table 2: Emoticon usage frequency... 16 Table 3: Sample Training Data... 16 Table 4: Unwanted Content... 17 Table 5: Sample Cleaned Data... 18 Table 6: Model Evaluation... 21 Table 7: Manual Verification Confusion Matrix... 22 Table 8: Overall Sentiment... 24
Section 1: Introduction This section will cover the overall aim of the project, the motivation behind it, and any contribution to the knowledge base that has been added. Objective The overall goal of this project was to perform a sentiment analysis on a particular television program. Viewer opinions on the American television show Supernatural were mined from the popular microblogging website Twitter. The main goal of such a sentiment analysis is to discover how the audience perceives the television show. The Twitter data that is collected will be classified into two categories: positive or negative. An analysis will then be performed on the classified data to investigate what percentage of the audience sample falls into each category. Particular emphasis is placed on evaluating different machine learning algorithms for the task of Twitter sentiment analysis. Motivation Over the past decade humans have experienced exponential growth in the use of online resources, in particular social media and microblogging websites such as Twitter, Facebook and YouTube. Many companies and organisations have identified these resources as a rich mine of marketing knowledge. Traditionally, companies used interviews, questionnaires and surveys to gain feedback and insight into how customers felt about their products. These traditional methods were often extremely time consuming and expensive and did not always return the results that the companies were looking for, due to environmental factors and poorly designed surveys. Natural language processing and sentiment analysis are playing an increasingly important role in making educated decisions on marketing strategies and giving valuable feedback on products and services. There are massive amounts of data containing consumer sentiment uploaded to the internet every day; this type of data is predominantly unstructured text that is difficult for computers to gain meaning from. 
In the past it was not possible to process such large amounts of unstructured data, but now, with computational power following the projections of Moore's law (Moore, 1965) and distributed networks of computers using frameworks such as Hadoop, massive datasets can be processed with relative ease. Major investment is going into this area, such as IBM's tireless research into their Natural Language Processing supercomputer Watson and Google's recent acquisition of DeepMind. With further research and investment into this area, machines will soon be able to gain an understanding from text, which will greatly improve data analytics and search engines.
Contribution to the Knowledge To many companies and organisations a customer's perception of a product or service is extremely valuable information. From the knowledge gained from an analysis such as this a company can identify issues with their products, spot trends before their competitors, create improved communications with their target audience, and gain valuable insight into how effective their marketing campaigns were. Through this knowledge companies gain valuable feedback which allows them to further develop the next generation of their product. In the context of the sentiment analysis being carried out for this application, the results will allow the producers of the show to gain insight into how each episode is being perceived by the viewer. This is very valuable information, as viewers are uploading their expectations, opinions and views on the television program before, during and after it is aired. This really revolutionises the feedback process; an application such as this has the potential to analyse the sentiment in real time, giving the producers immediate feedback on how the program is being perceived in the eyes of its audience. Such an application could be expanded to use clustering algorithms to give insight into particular scenes or characters. From an academic perspective it was felt that there were no new findings added to the knowledge base of natural language processing or sentiment analysis. This was to be expected as, with this course being a level 8 on the national framework and the short duration of the project, it would have been extremely difficult to make a meaningful contribution to these already highly researched fields. That being said, an application such as this does have value for small to medium sized companies seeking valuable insights into their data without investing heavily in the area. 
This report would also serve well as reference material for other researchers in the field, as there are very few documents available in the Twitter sentiment analysis domain that compare a number of different machine learning classification algorithms and achieve such high accuracy in their finished model.
Section 2: Background This section aims to give an overview of the background material used for this project. Most notably it will cover sentiment analysis, machine learning and the classification algorithms used in this project. Data Mining & Sentiment analysis Data mining is the computational process of finding patterns in large datasets; its methods sit at the intersection of artificial intelligence, machine learning, computer science, database technologies, and statistics. The objective of data mining is to extract information or knowledge from a dataset and transform it into a structure that can be understood. Data preparation is an important part of any data analysis. To properly prepare data it is necessary to understand the application domain; this is important as the researcher must be able to identify pertinent data and cleanse the dataset, removing any data which is deemed unimportant to the analysis. Some of these pre-processing techniques include: data cleaning, noise treatment, sampling, strategies to deal with missing values, normalisation, and feature extraction. Many of these preprocessing techniques will be examined in more detail in the implementation section of this report. Data mining focuses on discovering patterns in data. Sentiment analysis, which is also known as opinion mining, focuses on discovering patterns in text that can be analysed to classify the sentiment in that text. Sentiment analysis is the field of study that analyses people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, and their attributes (Liu, 2012) Sentiment analysis is predominantly implemented in software which can autonomously extract emotions and opinions from text. It has many real world applications; for example, it allows companies to analyse how their products or brand are being perceived by their consumers, a usage particularly applicable to this project. 
It is difficult to classify sentiment analysis as one specific field of study as it incorporates many different areas such as linguistics, Natural Language Processing (NLP), and Machine Learning or Artificial Intelligence. As the majority of the sentiment that is uploaded to the internet is of an unstructured nature, it is a difficult task for computers to process it and extract meaningful information from it. Natural language processing techniques are used to transform this raw data into a form that can be processed efficiently by a computer. Natural Language Processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages (Kumar, 2011) Natural language processing uses many different methods to process text, such as string processing (tokenizers, sentence tokenizers, stemmers, tagging) or part-of-speech tagging (n-gram, backoff, Brill, HMM, TnT). (Bird, Steven, Edward Loper, Ewan Klein, 2009)
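As a small illustration of the string-processing steps just listed (tokenisation and stemming), the sketch below uses only the standard library. The sample tweet and the crude_stem helper are invented for illustration; a real pipeline would use NLTK's tokenizers and its Porter stemmer rather than this toy suffix-stripper.

```python
import re

def tokenize(text):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def crude_stem(token):
    """Toy suffix-stripping stemmer, for illustration only; a real
    pipeline would use e.g. the Porter stemmer from NLTK."""
    for suffix in ("ing", "ed", "ly", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenize("Loving the new episode, the acting was amazing!")
print([crude_stem(t) for t in tokens])
# -> ['lov', 'the', 'new', 'episode', 'the', 'act', 'was', 'amaz']
```

Even this crude normalisation maps "loving" and "love"-like forms toward a shared stem, which is what lets a classifier treat them as the same feature.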
Machine Learning Algorithms Machine learning is a branch of artificial intelligence which focuses on building models that have the ability to learn from data. As it is such an enormous field encompassing many areas, there is no standardised definition for it, but Arthur Samuel's general definition describes it well: Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed (Samuel, 2000) In general there are two types of machine learning algorithms: supervised and unsupervised. There are variations on these, such as hybrid types (semi-supervised learning), but for this report they will be classified into one of the two categories. The method of supervised learning consists of presenting an algorithm with a training dataset; this dataset consists of training examples and the corresponding expected output for each example. The expected output is generally known as the target. A supervised learning algorithm uses this dataset so that it can learn to map the input examples to their expected target. If the training process is implemented correctly the machine learning algorithm should be able to generalise from the training data so that it can correctly map new data that it has never seen before. Unsupervised machine learning algorithms do not require training data; they operate on data where the output is unknown. The objective of this form of learning is usually to discover patterns in the data that may not be known by the researcher. An example of an unsupervised method would be clustering, where the algorithm uses a distance function to group similar data points together. This project focuses exclusively on supervised learning algorithms for the task of text classification; however, there are two major applications for supervised methods: Classification: The target output is a class or label; the simplest case is a choice between zero or one, although there can also be multiple alternative classes. 
Classification is used for many applications such as text classification, object recognition, and voice recognition software. Regression: In this case the target is a real number or vector of real numbers. Supervised regression algorithms are used mostly for prediction. Example applications are stock market prediction, predicting spikes in a network in power systems analysis, and most recently memory caching, where they have been used to complement the locality of reference method. This project focuses on supervised classification algorithms; the models that were used are described below. After a review of the literature surrounding machine learning algorithms used for sentiment analysis it was found that some of the most commonly used and highest performing were the Naïve Bayes, Decision Tree, Random Forests, and the Support Vector Machine. Based on the major literature in the field it was decided to further investigate these algorithms for their suitability to be used in this analysis.
Naïve Bayes The Naïve Bayes classifier is a simple probabilistic model which relies on the assumption of feature independence in order to classify input data. Despite its simplicity, the algorithm is commonly used for text classification in many opinion mining applications (Pak Alexander, Paroubek Patrick, 2010) (Alec Go, Lei Huang, Richa Bhayani, 2009) (Prem Melville, Wojciech gryc, Richard D.Lawrence, 2009). Much of its popularity is a result of its extremely simple implementation, low computational cost and its relatively high accuracy. The algorithm assumes that each feature is independent of the absence or presence of any other feature in the input data; because of this assumption it is known as naïve. In reality words in a sentence are strongly related; their positions and presence in a sentence have a major impact on the overall meaning and sentiment in that sentence. Despite this naïve assumption the classifier can produce high classification accuracy when used with quality training data and in specific domains. A recent study (Zhang, n.d.) addressed this assumption and presented strong evidence of how the algorithm could be so effective while relying on this assumption. The algorithm itself is derived from Bayes' theorem:

P(c | d) = P(c) * P(f1 | c) * P(f2 | c) * ... * P(fm | c) / P(d)

Where P(c) and P(f | c) are calculated by estimating the relative frequency of a feature f extracted from the training data corpus. In the entire training data corpus there are m features. The document d contains the training data or the input data to be classified. In basic terms the algorithm will take every feature (word) in the training set and calculate the probability of it being in each class (positive or negative); once the probabilities of each feature are calculated the algorithm is ready to classify new data. 
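As a concrete illustration of these relative-frequency estimates, here is a minimal pure-Python (Python 3) sketch of a Naïve Bayes text classifier with add-one smoothing. The toy corpus is invented, and this is a sketch of the technique only, not the Weka implementation used in the project.

```python
from collections import Counter
import math

# Toy labelled corpus (invented examples, not the project's training data)
train = [
    ("love this show amazing episode", "pos"),
    ("great acting really enjoyed it", "pos"),
    ("boring plot terrible episode", "neg"),
    ("worst show hated the acting", "neg"),
]

# Estimate P(c) and P(f|c) by relative frequency (add-one smoothing
# keeps an unseen word from zeroing out a class score)
class_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in class_counts}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Score each class with log P(c) + sum of log P(f|c); pick the max."""
    scores = {}
    for c in class_counts:
        score = math.log(class_counts[c] / len(train))
        total = sum(word_counts[c].values())
        for w in text.split():
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("really amazing episode"))  # -> pos
```

Working in log space avoids underflow when many small probabilities are multiplied, which matters once the vocabulary grows beyond a toy example.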
When a new sentence is being classified it will be split into single word features, and the model will use the probabilities that were computed in the training phase to calculate the conditional probabilities of the combined features in order to predict its class. Many machine learning algorithms will ignore features which have a weak influence on the overall classification; a major advantage of the Naïve Bayes classifier is that it utilizes all the evidence that is available to it in order to make a classification. Using this approach it takes into account that many weak features, which may have a relatively minor effect individually, may have a much larger influence on the overall classification when combined. Some of the major work in the field of sentiment analysis using the Naïve Bayes was carried out by (Alexander Pak, Patrick Paroubek, 2010). The training data was collected using the assumption that the emoticons contained in text represented the overall sentiment in that text. Using this assumption a large quantity of training data was automatically collected. This study used an ensemble of two different Naïve Bayes classifiers: one trained using the presence of unigrams while the second used part of speech tagging. When the two classifiers were combined they produced an accuracy of 74%. Pang et al (Bo Pang, Lillian Lee, Shivakumar Vaithyanathan, 2002) used a single Naïve Bayes classifier on a movie review corpus to achieve similar results to the previous study. Multiple Naïve Bayes models were trained using different features such as part of speech tagging, unigrams, and bigrams. 
The final model achieved a classification accuracy of 77.3%, which was considered a high performance of the algorithm on that domain. Another important study (Prem Melville, Wojciech Gryc, Richard D. Lawrence, 2009) in the field combined a Naïve Bayes model with lexical knowledge to produce a classifier with improved performance over the individual models. The model was tested on a number of different datasets; the most notable result was an accuracy of 91.21% on a Lotus dataset. From the literature examined surrounding the Naïve Bayes it can be seen that despite its simplicity the algorithm has the ability to produce high classification accuracy on datasets similar to the one used in this project. It has many advantages over some of the more sophisticated algorithms, the major one being how simple it is to understand and implement; the low processing cost of the algorithm can be attributed to this simplicity. Decision Tree Decision trees are one of the most widely used machine learning algorithms; much of their popularity is due to the fact that they can be adapted to almost any type of data. They are a supervised machine learning algorithm that divides its training data into smaller and smaller parts in order to identify patterns that can be used for classification. The knowledge is then presented in the form of a logical structure similar to a flow chart that can be easily understood without any statistical knowledge. The algorithm is particularly well suited to cases where many hierarchical categorical distinctions can be made. Trees are built using a heuristic called recursive partitioning. This is generally known as the divide and conquer approach because it uses feature values to split the data into smaller and smaller subsets of similar classes. The structure of a decision tree consists of a root node which represents the entire dataset, decision nodes which perform the computation, and leaf nodes which produce the classification. 
In the training phase the algorithm learns what decisions have to be made in order to split the labelled training data into its classes. Figure 1: Decision Tree Structure (Donaldson, 2012)
In order to classify an unknown instance, the data is passed through the tree. At each decision node a specific feature from the input data is compared with a constant that was identified in the training phase. The computation which takes place in each decision node usually compares the selected feature with this predetermined constant; the decision will be based on whether the feature is greater than or less than the constant, creating a two way split in the tree. The data will eventually pass through these decision nodes until it reaches a leaf node which represents its assigned class. There are many different implementations and variations of the decision tree algorithm; this project implements the J48 method, a Java implementation of the C4.5 algorithm, which was the industry standard up until the C5.0 algorithm was released. Some of the major work in the field of sentiment analysis using the decision tree algorithm was carried out by Castillo et al (Castillo, Carlos, Marcelo Mendoza, Barbara Poblete, 2011). The study's main focus was on assessing the credibility of tweets posted on Twitter, but there was also a secondary focus on sentiment analysis. A decision tree was implemented using the J48 algorithm to classify sentiment in the Twitter dataset. By training the algorithm with hand annotated examples the algorithm produced an accuracy of 70%. A study by (Bifet, Albert, Eibe Frank, 2010) implemented a Hoeffding tree for sentiment classification on a Twitter dataset. They trained the algorithm using a massive dataset of 1,600,000 tweets split into approximately equal representations of each class. The overall accuracy of the completed model was 69.36%. This was quite a disappointing result, as a vast amount of research and modification went into the algorithm and the dataset, and the final result was only marginally higher than what the algorithm can achieve with no optimisations. 
From the literature examined surrounding the decision tree algorithm it can be seen that the algorithm can be very effective for text-based classification.
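The recursive partitioning heuristic described above can be made concrete with a short sketch over word-presence features. Note the assumptions: the data is invented, and the split criterion here is Gini impurity (a CART-style choice made for brevity), whereas the J48/C4.5 implementation used in the project splits on information gain ratio.

```python
from collections import Counter

# Toy labelled examples (invented): a set of word features and a class
data = [
    ({"love", "episode"}, "pos"),
    ({"great", "acting"}, "pos"),
    ({"boring", "episode"}, "neg"),
    ({"hated", "acting"}, "neg"),
]

def gini(rows):
    """Gini impurity of a set of labelled rows (0 means the set is pure)."""
    counts = Counter(label for _, label in rows)
    return 1 - sum((c / len(rows)) ** 2 for c in counts.values())

def build(rows):
    """Recursive partitioning: split on the word whose presence/absence test
    gives the lowest weighted impurity, until the leaves are pure.
    (Assumes a useful split always exists, which holds for this toy data.)"""
    if gini(rows) == 0:
        return rows[0][1]                      # pure leaf: return the class
    best = None
    for word in set().union(*(features for features, _ in rows)):
        left = [r for r in rows if word in r[0]]
        right = [r for r in rows if word not in r[0]]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[0]:
            best = (score, word, left, right)
    _, word, left, right = best
    return (word, build(left), build(right))   # decision node

def classify(tree, features):
    """Walk the tree from the root to a leaf, following each split."""
    while isinstance(tree, tuple):
        word, if_present, if_absent = tree
        tree = if_present if word in features else if_absent
    return tree

tree = build(data)
print(classify(tree, {"love", "acting"}))  # -> pos
```

The learned tree is just nested tuples, which mirrors the flow-chart structure described earlier: each decision node tests one feature and routes the instance to a subtree or a leaf.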
Random Forests Ensemble learning focuses on techniques to combine the results of different trained models in order to produce a more accurate classifier. Ensemble models generally have considerably better performance than that of a single model. The random forest algorithm is an example of an ensemble method which was introduced by (Breiman, 2001); it is quite a simple algorithm, but despite its simplicity it can produce state of the art performance in terms of classification. The basic structure of the random forest can be seen in Figure 2 below. Figure 2: Random Forest Structure Random forests are constructed by combining a number of decision tree classifiers; each tree is trained using a bootstrapped subset of the training data. At each decision node a random subset of the features is chosen and the algorithm will only consider splits on those features. The main problem with using an individual tree is that it has high variance, that is to say that the arrangement of the training data and features may affect its performance. Each individual tree has high variance, but if we average over an ensemble of trees we can reduce the variance of the overall classification. Provided that each tree has better accuracy than pure chance, and that the trees are not highly correlated with one another, the central limit theorem implies that when they are averaged they will produce a Gaussian distribution. The more decisions that are averaged, the lower the variance will become. Reducing the variance will generally increase the overall performance of the model by lowering the overall error. Looking at the literature surrounding the Random Forest algorithm for text classification it was found that major works in this area were very sparse. Some of the work found in this field using the Random Forest algorithm was carried out by Aramaki et al (Aramaki, Eiji, Sachiko Maskawa, Mizuki Morita, 2011). 
The study focused on using machine learning and Twitter to detect influenza epidemics; a number of machine learning algorithms were compared in order to classify tweets containing keywords into two classes. The random forest proved to have an accuracy of 72.9% on the test dataset.
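The bootstrap-and-vote construction described above can be sketched in a few lines. For brevity each "tree" here is a single-word decision stump trained on a bootstrap sample with a random feature subset; the data and helper names are invented for illustration, and this is not the Weka implementation used in the project.

```python
import random
from collections import Counter

random.seed(1)

# Toy labelled examples (invented): word-feature sets and sentiment classes
data = [
    ({"love", "episode"}, "pos"), ({"great", "acting"}, "pos"),
    ({"amazing", "show"}, "pos"), ({"boring", "episode"}, "neg"),
    ({"hated", "acting"}, "neg"), ({"terrible", "show"}, "neg"),
]

def train_stump(rows):
    """One deliberately shallow 'tree': a single-word split, chosen from a
    random subset of the features (the random-subspace step)."""
    words = list(set().union(*(features for features, _ in rows)))
    candidates = random.sample(words, max(1, len(words) // 2))

    def errors(word):
        left = [label for features, label in rows if word in features]
        right = [label for features, label in rows if word not in features]
        mis = lambda ls: len(ls) - max(Counter(ls).values()) if ls else 0
        return mis(left) + mis(right)

    word = min(candidates, key=errors)
    majority = lambda ls: Counter(ls).most_common(1)[0][0] if ls else "pos"
    left = [label for features, label in rows if word in features]
    right = [label for features, label in rows if word not in features]
    return word, majority(left), majority(right)

def train_forest(rows, n_trees=25):
    """Bagging: each stump sees a bootstrap resample of the training data."""
    return [train_stump([random.choice(rows) for _ in rows])
            for _ in range(n_trees)]

def classify(forest, features):
    """Majority vote over the individual stumps."""
    votes = Counter(if_in if word in features else if_out
                    for word, if_in, if_out in forest)
    return votes.most_common(1)[0][0]

forest = train_forest(data)
print(classify(forest, {"love", "amazing", "great"}))
```

Any single stump here is a weak, high-variance learner, but the vote over 25 of them is what the averaging argument above relies on: individual mistakes tend to cancel out.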
A study by Tsutsumi et al (Tsutsumi, Kimitaka; Kazutaka Shimada; Tsutomu Endo, 2007) implemented a weighted voting random forest on a movie review database. A scoring criterion was used to appoint a weighted vote to each random tree in the forest. Using this method the algorithm produced an accuracy of 83.4% on a dataset of 1400 reviews. From the literature reviewed it can be seen that the Random Forest algorithm can produce state of the art performance for text based classification. By combining multiple simple random trees the algorithm can produce significantly higher performance than each tree individually. For such a simple algorithm the accuracy is remarkable. Support Vector Machine The support vector machine was the most sophisticated algorithm evaluated in this project and it is becoming an increasingly common method for text classification. Its increased popularity is largely due to the high classification accuracy that is associated with its use. The support vector machine is classed as a non-probabilistic binary linear classifier. It works by plotting the training data in multidimensional space; it then tries to separate the classes with a hyperplane. If the classes are not immediately linearly separable in the multidimensional space the algorithm will add a new dimension in an attempt to further separate the classes. It will continue this process until it is able to separate the training data into its two separate classes using a hyperplane. A basic representation of how it splits the data is shown in Figure 3 below. Figure 3: SVM basic operation (Anon., 2011) One of the main areas where this method differs from other linear classifiers such as the perceptron is in the way it selects the hyperplane. In most cases there may be multiple hyperplanes, or in some cases an infinite number of hyperplanes, that could separate the classes. 
The SVM algorithm chooses the hyperplane which provides the maximum separation between the classes, that is, the maximal margin hyperplane, which minimises the upper bound on the classification error. A standard method for finding the optimum way of separating the classes is to plot two hyperplanes in such a way that there are no data points between them; by using these planes the final hyperplane can be calculated. This process is shown in Figure 4 below. The data points that fall on these planes are known as the support vectors. Figure 4: Finding the optimum hyperplane (Buch, 2008) Now that the algorithm has calculated the hyperplane that provides the maximum level of separation between the classes, new data can be classified. New instances are mapped into the feature space and are classified by which side of the hyperplane they fall onto. A major problem with the SVM is that by adding extra dimensions the size of the feature space increases exponentially. From a processing point of view the SVM algorithm counteracts this by using dot products in the original space. This method hugely reduces processing as all the calculations are performed in the original space and then mapped to the feature space. From a classification perspective this increase in the size of the feature space has a negative effect on the model's ability to accurately classify data; this is known as the Hughes effect (Hughes, 1968). It has a strong negative effect on classification because as the feature space increases the training data becomes extremely sparse in that space; to counteract this phenomenon the training data would need to be increased exponentially with every dimension added, which is not really practical in real world applications. Some of the major work in the field of sentiment analysis using the SVM was carried out by Pang et al (Bo Pang, 2002). In this study the SVM was used to extract sentiment from a movie review database; multiple SVMs were trained using different features such as part of speech tagging, unigrams, and bigrams. The final model achieved a classification accuracy of 82.9%, which was considered extremely high for that domain. This result ran contrary to the results of other studies in the area (McCallum, Andrew, and Kamal Nigam, 1998), where the more traditional method of unigram frequency was favoured over unigram presence.
A later study (Read, 2005) used the same principles as Pang et al, but the training data was substantially increased. The increase in the training data was due to the researcher using the assumption that the emoticons contained in text represented the overall sentiment in that text. Using this assumption large quantities of training data were automatically collected. The final model that was produced had a classification accuracy of 70% on a movie review dataset. From the literature examined surrounding the support vector machine it can be seen that the algorithm has the ability to produce very high classification accuracy on datasets similar to the one used in this project. The major downside of this algorithm is that its complexity makes it difficult to gain a solid understanding of how exactly it works when compared to some of the simpler algorithms.
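The maximal margin idea can be made concrete with a small numerical sketch. The following trains a linear SVM on invented 2-D points by sub-gradient descent on the hinge loss; it illustrates the principle only and is not the solver used by the Weka implementation in this project.

```python
# Minimal linear SVM trained by sub-gradient descent on the hinge loss.
# Invented 2-D points: class +1 lies above the line x2 = x1, class -1 below.
points = [((0.0, 2.0), 1), ((1.0, 3.0), 1), ((2.0, 4.0), 1),
          ((2.0, 0.0), -1), ((3.0, 1.0), -1), ((4.0, 2.0), -1)]

w, b = [0.0, 0.0], 0.0
eta, lam = 0.01, 0.001        # learning rate, regularisation strength
for epoch in range(500):
    for (x1, x2), y in points:
        if y * (w[0] * x1 + w[1] * x2 + b) < 1:   # point violates the margin
            w[0] += eta * (y * x1 - lam * w[0])
            w[1] += eta * (y * x2 - lam * w[1])
            b += eta * y
        else:                                     # correct side: only shrink w
            w[0] -= eta * lam * w[0]
            w[1] -= eta * lam * w[1]

def predict(x1, x2):
    """Classify by which side of the learned hyperplane the point falls on."""
    return 1 if w[0] * x1 + w[1] * x2 + b >= 0 else -1

print(predict(0.0, 5.0))   # a point well above the line is classified +1
```

The regularisation term is what pushes the solution toward the maximal margin hyperplane: shrinking w widens the margin, while the hinge term keeps every training point on its correct side with margin at least 1.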
Section 3: Implementation This section aims to give an overview of how the data mining, text processing and machine learning techniques were implemented in this project. A block diagram of the system can be seen in Figure 5 below. Figure 5: System Architecture Platform and Software This section aims to give a brief overview of the main software languages, environments and principal libraries, and a brief description of the platform that was used to implement this project. Platform All processing was carried out on commodity hardware, specifically a 64bit Intel i3 processor (6GB RAM) running a Windows operating system. Python For the purpose of this project Python 2.7.6 was used as it is a mature, versatile and robust high level programming language. It is an interpreted language, which makes the testing and debugging phases extremely quick as there is no compilation step. There are extensive open source libraries available for this version of Python and a large community of users. This version of Python was chosen over the latest version, Python 3.0, for numerous reasons, the main ones being that not all libraries had yet been ported to Python 3.0 and it was felt that Python 2.7 has better documentation and a wider community of users than the latest version. Other high level programming languages such as R and Matlab were considered because they have many benefits such as ease of use, but they do not offer the same flexibility and freedom that Python can deliver. A low level language such as C was considered for writing functions and some of the algorithms that had very high computational cost. Using this low level language may have increased the efficiency of the algorithms and lowered computational overhead, but as the project progressed it was found that there was insufficient time to explore this avenue. Key Python libraries: python-twitter, NLTK SQL There were many options for how to store the information, such as in a comma separated values (CSV) file, a text file or in a database. It was decided that the optimum approach was to use a SQL database, as this method allowed faster read/write times and easier segmentation of the data to be stored (for example, the date and tweet could be stored in separate columns), and SQL databases also support multithreaded applications, which would be beneficial if this project were ever expanded to a real-time classification application in the future. SQLite was used as the database management system as it is an open source application that is easily integrated with the Python programming language. Weka Weka is an open source software environment written in Java that can be used for many data mining applications. The environment contains a collection of machine learning algorithms that are suitable for the task of text classification and sentiment analysis. This software was chosen because it allows developers to quickly and easily pre-process data and build machine learning models. 
Weka was found to be preferable to Python for this task because of its easy-to-use GUI, and because its standardised output for classifiers allowed models to be evaluated and compared easily. The major downside to using this software was that many of the algorithms were poorly documented and there was a much smaller community of users when compared to Python.
Data Collection

Twitter allows developers access to a range of streaming APIs which offer low-latency access to flows of Twitter data. For the data collection implementation the public streams API was used; this was found to be the most suitable method of gathering information for data mining purposes, as it allows access to a global stream of Twitter data that can be filtered as required. In order to take advantage of this stream, a Python interface library had to be installed; this library was necessary for Python to interface with Twitter's API v1.1. There were a number of libraries available for this task; Python Twitter Tools v1.14.3 was chosen as it provided the basic filtering and streaming functionality required for this project.

Twitter has numerous regulations and rate limits imposed on its API; for this reason it requires that all users register an account and provide authentication details when they query the API. This registration requires users to provide an email address and telephone number for verification; once the account is verified, the user is issued with the authentication details which allow access to the API. A Python script was then created which provided the API with the authentication details and initialised a streaming process through which data could be pulled from Twitter's RESTful web service to a local machine. A filter function was used to allow the program to request Twitter content based on specific keywords related to this study.

All the downloaded data was transmitted in JSON format; this standard was found to be less verbose than XML, the alternative format offered. Each JSON-formatted package contained a large amount of information, but it was decided that for this project only the tweet and the time the tweet was written were required (an example of the JSON format can be found in Appendix C).
In order to remove the unwanted content, each package was parsed using a Python script which located the useful content and stored it in RAM until main memory storage became available. An additional check was performed to ensure all the tweets downloaded were written in the English language; this check involved parsing the JSON content for the 'lang' tag and performing an equality check on its content. Once the required content was extracted from the JSON package and stored in RAM, it could be written to main memory storage.

A database was created with a simple table structure containing the fields Primary_Key, Date, and Tweet. The primary key was generated automatically by simply incrementing a counter each time the database was written to. An example of the data collected is shown in Table 1 below, where Primary_Key is an integer and Date and Tweet are strings. It is worth noting that Date was stored as a string rather than the SQL DATE data type because it was intended to convert this data to a UNIX timestamp at a later stage in the project. The Python script that was created for this phase can be found in Appendix B.
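The parsing step described above can be sketched in a few lines. This is an illustrative reconstruction rather than the exact project script (which is in Appendix B); the field names `text`, `created_at` and `lang` are the real keys in Twitter's v1.1 JSON payloads, while the helper name `extract_tweet` and the trimmed sample payload are invented for the example.

```python
import json

def extract_tweet(raw):
    """Parse one JSON package and keep only the fields this project needs.

    Returns a (date, text) pair for English tweets, or None otherwise.
    """
    package = json.loads(raw)
    # discard anything that is not flagged as English
    if package.get("lang") != "en":
        return None
    return (package.get("created_at"), package.get("text"))

# A heavily trimmed example of one streaming payload
sample = ('{"created_at": "Mon May 19 19:43:16 +0000 2014", '
          '"text": "Supernatural Season Finale TOMORROW", "lang": "en"}')

print(extract_tweet(sample))
```

Anything that fails the language check is dropped before it ever reaches the database.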
Table 1: Sample collected data

Primary_Key  Date                            Tweet
1            Mon May 19 19:43:16 +0000 2014  RT @J2_MyLife: Supernatural Season Finale TOMORROW http://t.co/sgo81mwdbh
2            Mon May 19 19:44:02 +0000 2014  RT @SPN_Hunter_67: Supernatural season finale tomorrow! So excited/nervous/scared!!!!
3            Mon May 19 19:44:52 +0000 2014  SUPERNATURAL SEASON FINALE TOMORROW AND APPARENTLY JARED CRIED READING THE SCRIPT. NOOOO
4            Mon May 19 19:45:06 +0000 2014  well in case i forgot, Supernatural Season Finale TOMORROW is world wide trending\nthanks guys
5            Mon May 19 19:45:11 +0000 2014  RT @SPN_Threesome: Supernatural Season Finale TOMORROW - expect: blood, fight, guilt, anger, tears, pain, love, death, touching, danger, un\u2026

Training Data

In order to train a supervised learning algorithm a training dataset must be collected; this dataset consists of training examples and the corresponding expected output for each example. The expected output is generally known as the target. A supervised learning algorithm uses this dataset to learn to map the input examples to their expected targets. If the training process is implemented correctly, the machine learning algorithm should be able to generalise from the training data so that it can correctly map new data that it has never seen before.

Training data must contain a class label. This can be achieved by manually assigning each tweet a class, but this is a tedious process, and as Twitter enforces strict rules on the distribution of its data it has proved difficult to source reliable hand-annotated Twitter datasets. For this reason other avenues were examined, and it was found that a number of researchers (Pang et al., 2002) have successfully used emoticons such as :) and :( as noisy labels to automatically classify sentences. In order to use this method an assumption must be made: that the emoticon in a tweet represents the overall sentiment contained in that tweet.
This assumption is quite reasonable, as the maximum length of a tweet is 140 characters, so in the majority of cases the emoticon will correctly represent the overall sentiment of that tweet. An analysis performed by DataGenetics (Berry, 2010) studied over 96 million tweets containing emoticons and documented the usage frequency of each emoticon. The five emoticons with the highest frequency are tabulated below.
Table 2: Emoticon usage frequency

Emoticon  Usage       Percentage
:)        32,115,778  33.360%
:D        10,595,385  11.006%
:(        7,613,078   7.900%
;)        7,238,729   7.519%
:-)       4,250,408   4.420%

For this project the smiley face and the sad face were chosen as the noisy labels for the training data; this choice was made because they were the two highest-frequency emoticons representing the positive and negative classes respectively. The training data was collected using the process outlined in the data collection phase of this report. It was decided to collect 20,000 examples of each class, making up a training dataset of 40,000 tweets and their associated noisy labels. An example of some of the raw training data can be found in the table below.

Table 3: Sample Training Data

Tweet                                                                                                    Class
Big Thank You To Our Lovely Customers From WeLoveCarlisle For Our Splendid Gift :) http://t.co/trx9wejupq    Positive
@jackstenhouse69 I really liked it, in my opinion it def is :)                                           Positive
@Harry_Styles You and I is amazing, the video is so sweet, be proud guys :)                              Positive
:( \u201c@ew: How awful. Police: Driver kills 2, injures 23 at #SXSW http://t.co/8gmfiouzbs\u201d         Negative
So pissed I just cracked my phone screen :(                                                              Negative
I hate you srsly :( @weirdnextdoor_                                                                      Negative
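The noisy-labelling rule amounts to only a few lines of code. The sketch below is illustrative (the helper name `label_tweet` is invented for the example); tweets containing both emoticons, or neither, yield no reliable label and were simply not kept for training.

```python
def label_tweet(tweet):
    """Assign a noisy class label from the emoticon a tweet contains."""
    has_pos = ":)" in tweet
    has_neg = ":(" in tweet
    if has_pos and not has_neg:
        return "Positive"
    if has_neg and not has_pos:
        return "Negative"
    return None  # no emoticon, or conflicting emoticons

print(label_tweet("So pissed I just cracked my phone screen :("))  # Negative
```

Because ":-)" begins with ":-" rather than ":", a production version would also match the longer variants in Table 2; the two labels chosen here match the two emoticons actually used in this project.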
Data Cleaning

The aim of the data cleaning process is to remove any unwanted content from the training data and the input tweets. The term 'unwanted content' describes any piece of information within a tweet that is not useful to the machine learning algorithm when assigning a class to that tweet. Data cleaning not only simplifies the classification task for the machine learning model, it also greatly decreases processing cost in the training phase. The unwanted content is tabulated in Table 4 below.

Table 4: Unwanted Content

Unwanted Content            Action
Punctuation (! ? , . : ;)   Removed
#word                       '#' removed
@user                       Replaced with AT_USER
RT                          Removed
Emoticons                   Removed
Unicode formatting          Removed
Uppercase characters        Lowercase all content
URLs and web links          Replaced with URL

A Python script was created to read each tweet from the database and perform processing on it in order to clean out the undesirable data; this program can be found in Appendix B of the report. The script also served to remove stop words from the tweets. Stop words are words such as 'the', 'which', 'is', and 'at'; they have little value for machine learning algorithms as they are contained in approximately equal measure in the positive and negative training sets. Removing them allows more specific word features to be passed into the classification models and hugely reduces processing during the training stages. There is no definite standard for removing stop words, as each application has different requirements; for this project the default stop word list was taken from the Natural Language Toolkit (NLTK) for Python. When this pre-processing of the tweets was complete, the cleaned data was stored in a new table in the database. A number of examples of the data before and after cleaning are contained in Table 5 below.
Table 5: Sample Cleaned Data

Status     Tweet
Raw Data   Big Thank You To Our Lovely Customers From WeLoveCarlisle For Our Splendid Gift :) http://t.co/trx9wejupq
Cleaned    big thank lovely customers welovecarlisle splendid gift :) URL
Raw Data   @jackstenhouse69 I really liked it, in my opinion it def is :)
Cleaned    AT_USER really liked it, opinion def :)
Raw Data   So pissed I just cracked my phone screen :(
Cleaned    pissed cracked phone screen :(
Raw Data   :( \u201c@ew: How awful. Police: Driver kills 2, injures 23 at #SXSW http://t.co/8gmfiouzbs\u201d
Cleaned    :( AT_USER awful police driver kills 2 injures 23 sxsw URL

Implementation of Classifiers in Weka

There are a number of machine learning libraries available for Python, such as PyBrain, mlpy, and scikit-learn, but it was decided that Weka offered superior benefits such as ease of use, a larger library of classification algorithms, and more sophisticated report generation and visualisation. Weka is a standalone software environment built in Java that facilitates the use of a large collection of machine learning classifiers. It includes tools for pre-processing data, classification, clustering, regression, and visualisation; these combined tools make Weka a very powerful application, and for this reason it was decided to implement the machine learning classifiers in this environment.

The first stage of the implementation required the training data to be reformatted into ARFF (Attribute Relation File Format) so that it could be read by Weka. This was performed by reading all the tweets from the database and explicitly labelling each tweet with its class, before storing them in a text file with a specific header which specified the relation for the data in the file. This file was then saved as an ARFF file which could be loaded into Weka. The next stage of the process was to load this training data into Weka and explicitly define the class information.
This was a very important step: at this stage the training data contained the tweet itself and a class label, and Weka cannot differentiate between the two on its own. The class must be assigned to the data using the in-built ClassAssigner function. When this is complete the training data is ready for further processing. All three of the classifiers that were tested required the input data to be in numeric format; this posed a problem, as the tweets were currently stored as String variables, so the string data had to be converted into numeric form. For this step the StringToWordVector filter was used; this took every distinct word in the training set and created a vector for each tweet. As some parts of Weka are poorly documented, it is unknown whether these sparse vectors are simplified in any way to reduce processing overhead. Once the pre-processing of the training data was complete, a machine learning classifier could be trained using the data.

For the classification phase three different machine learning algorithms were chosen, the reason being to compare the performance of the models and select the most suitable classifier for the data. The three models selected were the Naïve Bayes, Support Vector Machine, and Random Forest; a description of each of these algorithms can be found in the background section of this report. Once the models were trained they were saved, and an evaluation was carried out which can be found in the next section of the report.
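The effect of the StringToWordVector filter can be approximated in plain Python: build a vocabulary of every distinct word in the training set, then map each tweet onto a vector of word counts over that vocabulary. This is only a sketch of the underlying bag-of-words idea; Weka's actual filter offers many more options (TF-IDF weighting, sparse representations, attribute pruning, and so on), and the helper names below are invented for the example.

```python
def build_vocabulary(tweets):
    """Every distinct word in the training set, in a fixed (sorted) order."""
    return sorted({word for tweet in tweets for word in tweet.split()})

def to_vector(tweet, vocab):
    """Map one tweet onto a word-count vector over the vocabulary."""
    words = tweet.split()
    return [words.count(term) for term in vocab]

training = ["pissed cracked phone screen", "big thank lovely gift"]
vocab = build_vocabulary(training)
print(vocab)
print(to_vector("thank thank gift", vocab))
```

Each tweet thus becomes a numeric vector of the same fixed length, which is the format the classifiers require.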
Section 4: Classifier Evaluation

This section focuses on comparing and evaluating the performance of different machine learning algorithms on the training and test datasets, with the aim of selecting the best model for the task of Twitter sentiment classification. It covers the different processes that were used to evaluate the models and the different attributes that were modified in order to select the optimum model for the task. The algorithm with the highest performance is then applied to the Twitter data collected on the Supernatural television program to produce the overall sentiment in that dataset.

All models were evaluated using stratified threefold cross-validation. The cross-validation method was chosen over the standard holdout method to prevent uneven representation of the classes in the training and test sets. The standard holdout method randomly splits the data into training and test sets; the split varies between researchers, but a commonly used approach is to segment the data into 70% training and 30% test data. If safeguards are not put in place, it is not uncommon for this split to leave an uneven representation of the classes: for example, after a 70/30% split the test data might contain only examples of the negative class, which would hugely decrease the number of negative examples in the training set and could severely impact the classification algorithm's ability to learn the underlying patterns of that class. To counteract this, stratified threefold cross-validation segments the data into three approximately equal partitions, each containing approximately equal representation of each class. Each partition in turn is used as test data, so the first time one third of the data is used for testing and the remaining two thirds are used for training. This process is repeated until every instance has been used as test data.
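The stratification described above can be sketched as follows: instances of each class are dealt round-robin into three folds, so every fold ends up with roughly equal representation of both classes. This is an illustrative reimplementation, not Weka's internal procedure.

```python
def stratified_folds(instances, labels, k=3):
    """Deal instances of each class round-robin into k folds."""
    folds = [[] for _ in range(k)]
    seen = {}  # per-class counter, so each class is spread evenly
    for instance, label in zip(instances, labels):
        n = seen.get(label, 0)
        folds[n % k].append((instance, label))
        seen[label] = n + 1
    return folds

instances = list(range(12))
labels = ["pos"] * 6 + ["neg"] * 6
folds = stratified_folds(instances, labels)
for fold in folds:
    print([lab for _, lab in fold])
```

With 6 positive and 6 negative instances, every fold receives exactly 2 of each class; each fold is then used once as test data while the other two serve as training data.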
Research has shown that using this method can increase the reliability of the classification model's measured performance. Tenfold cross-validation is the standard method used in many machine learning applications, but due to its high processing cost it was not feasible for this project.

Each model was evaluated under four criteria: accuracy, precision, recall, and F-measure. The accuracy is simply the percentage of correctly classified instances. The precision is calculated for each class and is the proportion of instances classified as that class that truly belong to it. The recall is also calculated for each class and is the proportion of instances of that class that were correctly classified. The F-score or F-measure gives a good indication of the overall performance of a classifier and is calculated using the following formula:

F-measure = 2 × (Precision × Recall) / (Precision + Recall)

This evaluation process focused on finding the optimum classification model for the dataset, which entailed modifying different attributes of each algorithm to increase its performance. It was discovered early on that the performance of the Naïve Bayes and J48 was significantly lower
than that of the SVM and the Random Forest. It was decided that, because of this relatively low performance, there was no benefit in further optimising these algorithms, as it was highly unlikely that they would improve by such a high degree. During this evaluation an analysis was performed on the effect word stemming had on the performance of the SVM and the Random Forest. The SVM was evaluated using different kernel methods, and the Random Forest was evaluated using different numbers of random trees, to see how these methods and attributes affected the models' performance. The results of this evaluation can be found in Table 6 below.

Table 6: Model Evaluation

Model             Stemmer          Kernel           Accuracy   Precision  Recall  F-Measure  Class
SVM               Null             Poly Kernel      79.335%    0.798      0.785   0.792      Pos
                                                               0.788      0.802   0.795      Neg
SVM               Iterated Lovins  Poly Kernel      79.32%     0.791      0.798   0.794      Pos
                                                               0.797      0.790   0.792      Neg
SVM               Lovins           Poly Kernel      79.32%     0.793      0.796   0.794      Pos
                                                               0.795      0.793   0.792      Neg
SVM               Snowball         Poly Kernel      79.335%    0.798      0.785   0.792      Pos
                                                               0.788      0.802   0.795      Neg
SVM               Null             Normalised Poly  80.665%    0.811      0.800   0.805      Pos
                                                               0.803      0.813   0.808      Neg
SVM               Null             PUK              79.03%     0.803      0.770   0.786      Pos
                                                               0.779      0.811   0.794      Neg
SVM               Null             RBF Kernel       77.255%    0.786      0.749   0.767      Pos
                                                               0.760      0.797   0.778      Neg
Random Forest*    Null             N/A              81.02%     0.797      0.832   0.814      Pos
                                                               0.825      0.788   0.806      Neg
Random Forest**   Null             N/A              82.17%     0.806      0.847   0.826      Pos
                                                               0.839      0.797   0.817      Neg
Random Forest***  Null             N/A              82.92%     0.818      0.847   0.833      Pos
                                                               0.841      0.822   0.826      Neg
Naïve Bayes       Null             N/A              69.2575%   0.685      0.714   0.699      Pos
                                                               0.701      0.671   0.686      Neg
J48               Null             N/A              (not fully evaluated)                    Pos
                                                                                             Neg

*5 random trees, **10 random trees, ***20 random trees

From the evaluation it was found that word stemming had either no effect or a negative effect on the Support Vector Machine's performance. It did have a marginal positive effect when used with the Random Forest, but it was decided to use no stemmer in the training of the final model, as the additional computational cost associated with stemming did not justify the small performance increase. The kernel method used in the Support Vector Machine did affect its overall performance, and it was found that the normalised polynomial kernel provided the best results on this dataset out of all the kernel methods tested. It is worth noting that the normalised polynomial kernel incurred a processing cost of almost twice that of the standard polynomial kernel. It was found that increasing the number of random trees used in the Random Forest algorithm also increased its overall accuracy, though again adding more random trees increased computational overhead. The accuracy of the Naïve Bayes and the J48 was so far below that of the SVM and the Random Forest that it did not seem reasonable to carry out an in-depth evaluation of them, as neither would be considered for the final model.

From this evaluation it was found that the Random Forest classifier with 20 random trees had the highest performance on the dataset. For this reason it was decided that this model would be applied to the real data in order to find its sentiment.

Classifying Twitter data

This section focuses on the process of classifying the data taken from Twitter's API. The tweets were read from the database and converted into ARFF format so that they could be processed by Weka. They were loaded into the Weka environment, and the Random Forest classifier created in the previous section was used to classify the sentiment of each tweet into the positive or negative class. When this classification was complete, the results were saved in a text file. As there were a large number of tweets in the dataset, a Python program was created to calculate the percentage of positive and negative tweets in the file and to visualise the results. This Python script can be found in Appendix B.
Manual Verification

To be thorough, a manual verification of the classified data was carried out on a sample of 100 tweets. This process entailed a human manually classifying the sentiment of each sampled tweet and comparing it with the computer-generated classification. The results of this process are tabulated in Table 7 below; the full sample of tweets and their classifications can be found in Appendix D. It is worth noting that because the neutral class is ignored in this project, it would be unfair to judge the classifier's performance on any neutral tweets that may be contained in the data. The confusion matrix is displayed below.

Table 7: Manual Verification Confusion Matrix

                   Classified Positive   Classified Negative
Actual Positive    58                    9
Actual Negative    8                     25
Using the confusion matrix, the overall accuracy, precision, recall and F-measure were calculated.

Table 8: Manual validation performance

Accuracy   Precision   Recall   F-measure   Class
83%        0.878       0.865    0.8714      Positive
           0.735       0.757    0.7458      Negative

When examining the results from the manual verification it can be seen that they are very similar to the results from the cross-validation. Taking a sample of the input data and manually classifying it has served to validate the computer-generated classifications and to reassure the user that the high accuracy on the test data was not due to some computational error.
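The figures in Table 8 follow directly from the confusion matrix in Table 7; small differences in the last decimal place are down to rounding.

```python
# Counts from the manual verification confusion matrix (Table 7)
tp, fn = 58, 9   # actual positives classified positive / negative
fp, tn = 8, 25   # actual negatives classified positive / negative

accuracy = (tp + tn) / float(tp + fn + fp + tn)
precision_pos = tp / float(tp + fp)
recall_pos = tp / float(tp + fn)
f_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(accuracy)                  # 0.83
print(round(precision_pos, 3))   # 0.879
print(round(recall_pos, 3))      # 0.866
print(round(f_pos, 3))           # 0.872
```

The negative-class figures are obtained the same way with the roles of the two classes swapped (precision 25/34, recall 25/33).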
Section 5: Results and Conclusion

This section covers the results and evaluation of the actual classification of the input data collected from Twitter. It also provides a conclusion and recommendations for further work on the project.

Results

The results from applying the Random Forest classifier to the dataset of 8,634 Supernatural-related tweets collected from Twitter's API are displayed in Table 9 below. A visual representation of the overall sentiment contained in the input data is displayed graphically as a pie chart in Figure 6.

Table 9: Overall Sentiment

Number of input instances                      8634
Number of instances of class positive          5017
Number of instances of class negative          3617
Percentage of instances classified positive    58.11%
Percentage of instances classified negative    41.89%

Figure 6: Sentiment (pie chart of the positive and negative class proportions)
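The percentage figures in Table 9 are reproduced by the summary script's calculation, which amounts to:

```python
# Class counts produced by the Random Forest classifier (Table 9)
positive, negative = 5017, 3617
total = positive + negative

pct_positive = 100.0 * positive / total
pct_negative = 100.0 * negative / total

print(total)                   # 8634
print(round(pct_positive, 2))  # 58.11
print(round(pct_negative, 2))  # 41.89
```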
Evaluation

From the evaluation of the results it can be seen that the sentiment towards the television show Supernatural is moderately positive. Given the small audience sample taken, it is difficult to say that these findings represent the entire population; in order to gain a truly representative sample, much more data would have to be collected. Another point to consider is that this data was collected over the space of approximately three episodes, so the sentiment may be towards a particular episode rather than towards the show as a whole. As there is only a moderately positive sentiment contained in the input data, the producers of the show would be advised to perform further analysis on the data; such an analysis might involve collecting larger amounts of data to more accurately represent the viewer population.

Conclusion

To conclude, this report has illustrated that an effective sentiment analysis can be performed on a television program by collecting a sample of audience opinions from Twitter. Throughout the duration of this project many different data analysis tools were employed to collect, clean and mine sentiment from the dataset. Such an analysis could provide valuable feedback to producers and help them to spot a negative turn in viewers' perception of their show. Discovering negative trends early on can allow them to make educated decisions on how to target specific aspects of their show in order to increase its audience's satisfaction.

It is apparent from this study that the machine learning classifier used has a major effect on the overall accuracy of the analysis. Commonly used algorithms for text classification were examined, such as Naïve Bayes, Decision Tree, Support Vector Machine, and Random Forests. Through the evaluation of different algorithms it was found that, out of the models examined, the Random Forest algorithm using twenty random trees had the highest performance on this dataset.
With machine learning algorithms constantly being developed and improved, massive amounts of computational power becoming readily available both locally and on the cloud, and unfathomable amounts of data being uploaded to social media sites every day, sentiment analysis will become standard practice for marketing and product feedback.

Future work

If given additional time to expand this project, a number of key weaknesses would be addressed. One major consideration would be to include a neutral class in the classification algorithm; this would entail clearly defining the positive, negative and neutral classes and collecting large amounts of neutral examples to train the algorithm. This could be achieved by connecting a Python script to news agencies' Twitter feeds through the Twitter API, as such users tend to write objective tweets such as news headlines. Adding this additional class would provide a much more accurate representation of the sentiment.

The use of clustering, such as the k-means algorithm, may have provided more valuable insights when combined with this sentiment analysis. A clustering algorithm may discover groups of tweets about a particular character, scene or aspect of the television show, and a sentiment analysis on the clusters may provide truly valuable information and insights.

Another problem that could be addressed is that, because the training data was automatically collected using emoticons, the emoticons had to be removed from the training dataset to avoid the algorithms giving them a high weighting when separating the classes. They do in fact hold valuable information when it comes to classifying sentiment, so they should be included in some way to reinforce the classification. A common way of including them is to create a lexicon of knowledge to complement the machine learning algorithm.

With additional time, a larger selection of machine learning algorithms would have been evaluated on the data, particularly rotation forests, which offer similar classification accuracy to random forests but use far fewer trees, greatly reducing processing costs.

If this application were ever to be considered for commercial purposes, the limitations of Twitter's API v1.1 would have to be addressed. Currently Twitter allows users to collect approximately 1,600 tweets per day and will only provide data that has been uploaded in the last six days. To gain real value from a sentiment analysis, massive amounts of data on the product or service would be required, which is currently not available without premium accounts or the use of third parties.
Bibliography

Anon., 2011. StatSoft Electronic Statistics Textbook. [Online] Available at: https://www.statsoft.com/textbook/support-vector-machines [Accessed 01 05 2014].

Aramaki, E., Maskawa, S. and Morita, M., 2011. Twitter catches the flu: detecting influenza epidemics using Twitter. Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1568-1576.

Berry, N., 2010. DataGenetics. [Online] Available at: http://www.datagenetics.com/blog/october52012/index.html [Accessed 14 04 2014].

Bifet, A. and Frank, E., 2010. Sentiment knowledge discovery in Twitter streaming data. In: Discovery Science. s.l.: Springer Berlin Heidelberg, pp. 1-15.

Bird, S., Loper, E. and Klein, E., 2009. Natural Language Processing with Python. Sebastopol: O'Reilly Media Inc.

Breiman, L., 2001. Random forests. Machine Learning, 45(1), pp. 5-32.

Buch, P., 2008. Wikimedia Commons. [Online] Available at: http://commons.wikimedia.org/wiki/file:svm_max_sep_hyperplane_with_margin.png# [Accessed 04 05 2014].

Castillo, C., Mendoza, M. and Poblete, B., 2011. Information credibility on Twitter. Proceedings of the 20th International Conference on World Wide Web, ACM.

Donaldson, J., 2012. Beautiful Decisions: Inside BigML's Decision Trees. [Online] Available at: http://blog.bigml.com/2012/01/23/beautiful-decisions-inside-bigmls-decision-trees [Accessed 01 05 2014].

Gelbukh, A., 2012. Computational Linguistics and Intelligent Text Processing. New Delhi: Springer.

Go, A., Huang, L. and Bhayani, R., 2009. Twitter Sentiment Analysis, s.l.: The Stanford Natural Language Processing Group.

Hughes, G., 1968. On the mean accuracy of statistical pattern recognizers. IEEE Transactions on Information Theory, 14(1), pp. 55-63.

Kumar, E., 2011. Natural Language Processing. New Delhi: I.K. International Publishing House.

Liu, B., 2012. Sentiment Analysis and Opinion Mining. Toronto: Morgan & Claypool.

McCallum, A. and Nigam, K., 1998. A comparison of event models for naive Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization, Volume 752, pp. 41-48.

Melville, P., Gryc, W. and Lawrence, R. D., 2009. Sentiment analysis of blogs by combining lexical knowledge with text classification. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Moore, G. E., 1965. Cramming more components onto integrated circuits. Electronics Magazine, Issue 536, p. 4.

Pak, A. and Paroubek, P., 2010. Twitter as a corpus for sentiment analysis and opinion mining, s.l.: LREC.

Pang, B., Lee, L. and Vaithyanathan, S., 2002. Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the Conference on Empirical Methods in Natural Language Processing, Philadelphia: Association for Computational Linguistics.

Read, J., 2005. Using emoticons to reduce dependency in machine learning techniques for sentiment classification. Proceedings of the ACL Student Research Workshop, Association for Computational Linguistics, pp. 43-48.

Samuel, A. L., 2000. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 44(1.2), pp. 206-226.

Tsutsumi, K., Shimada, K. and Endo, T., 2007. Movie review classification based on a multiple classifier. The 21st Pacific Asia Conference on Language, Information and Computation.

Zhang, H., n.d. The optimality of naive Bayes. FLAIRS 2004 Conference.
Appendix

Appendix A
Python Programs

Data Collection Program

import json
import time
from twitter import *
import sqlite3

# Function description: create a table
def tablecreate():
    # Uncomment the following to drop an existing table first
    #try:
    #    cursor.execute("DROP TABLE tweets")
    #except BaseException, e:
    #    print 'failed dropping table, ', str(e)
    cursor.execute("CREATE TABLE iphone(id INT, date TEXT, tweet TEXT)")

# Connect to the database and create it if it is not already made
conn = sqlite3.connect("starbucks.db")  # or use :memory: to put it in RAM
cursor = conn.cursor()

# Call the function to create the table
tablecreate()

# Personal authentication details for the Twitter API
oauthtoken = '2353626416-dzvxEiMk4Ut8svHQAun8yPQBP8lL7LbFhgAdN2F'
oauthsecret = '1MuOLrnL5RXLbPtdo8STM8GrVHIpomNA2Ru3C1QwLTEkZ'
consumerkey = '9BeCC6ZQt0LxYjztC7GDH6y2l'
consumersecret = 'Z1fvigapxuJY1vjkfUzi5AtzWTuffqvDHtIdHVezrx3jTwntH1'

# Provide the API with personal credentials
twitter_stream = TwitterStream(auth=OAuth(oauthtoken, oauthsecret,
                                          consumerkey, consumersecret))

# Stream user statuses using the filter ('keyword')
iterator = twitter_stream.statuses.filter(track='iphone 5s')  # or 'Supernatural'

ID = 0

# This loop reads in the tweets and checks whether each tweet meets certain
# requirements; if it does, the program stores the required data in the database.
while ID < 10001:
    for tweet in iterator:
        if ID > 10000:
            break  # stop once the target number of tweets has been stored
        try:
            language = tweet["lang"]
            # Check if the language is English
            if language == "en":
                try:
                    # try-except is used in case the text field does not exist
                    text = tweet["text"]
                    try:
                        # try-except is used in case the date field does not exist
                        date = tweet["created_at"]
                        try:
                            # Store the data in the database
                            cursor.execute("INSERT INTO iphone(id, date, tweet) VALUES(?,?,?)",
                                           (ID, date, text))
                            conn.commit()
                            ID = ID + 1
                        except Exception, e:
                            print 'failed writing to database, ', str(e)
                    except Exception, e:
                        print 'failed on date, ', str(e)
                except Exception, e:
                    print 'failed on text, ', str(e)
        except Exception, e:
            print 'failed on language, ', str(e)

print "Data Collection Complete"

Data Cleaning Program

# Import libraries
import re  # regular expressions
from nltk.corpus import stopwords

def datacleaner(data):
    # Remove any unicode format
    data = str(data)
    # Convert all text to lower case
    data = data.lower()
    # Remove the retweet marker (word-boundary match so words like 'shirt' are untouched)
    data = re.sub(r'\brt\b', '', data)
    # Convert links to 'url'
    data = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'url', data)
    # Convert @user to 'at_user'
    data = re.sub(r'@[^\s]+', 'at_user', data)
    # Collapse runs of white space
    data = re.sub(r'[\s]+', ' ', data)
    # Remove the hashtag symbol but keep the word
    data = re.sub(r'#([^\s]+)', r'\1', data)
    # Remove white space from the beginning and end
    data = data.strip()
    return data

def puncremover(data):
    punclist = ',?:".!'
    puncoutput = ''
    for c in data:
        if c not in punclist:
            puncoutput = puncoutput + c
    return puncoutput

# Load the stop word list from NLTK
stopwordlist = stopwords.words('english')

# Open the input text file
inputfile = open('inputdata.txt', 'r')

# Iterate through each line of the text file
for line in inputfile:
    # Call the cleaner function
    cleaneddata = datacleaner(line)
    # Call the function to remove punctuation
    #cleaneddata = puncremover(cleaneddata)
    outputfile = open('cleaneddata.txt', 'a')
    # Drop stop words before writing the cleaned line out
    x = [i for i in cleaneddata.split() if i not in stopwordlist]
    for row in x:
        outputfile.write(str(row))
        outputfile.write(' ')
    outputfile.write('\n')
    outputfile.close()

inputfile.close()

Database to Text Converter

# Import libraries
import time
import sqlite3
import re  # regular expressions
import unicodedata

# Connect to the database and create it if it is not already made
conn = sqlite3.connect("newtrainingdata.db")  # or use :memory: to put it in RAM
cursor = conn.cursor()

cursor.execute('''SELECT tweet FROM negtweets13''')
all_rows = cursor.fetchall()

i = 0
for row in all_rows:
    i = i + 1
    #tweet = unicodedata.normalize('NFKD', row).encode('ascii', 'ignore')
    tweet = row
    savethis = str(tweet)
    savefile = open('negativetrainingdata.txt', 'a')
    savefile.write(savethis)
    savefile.write('\n')
    savefile.close()

print "All done"

Results Compiler Program

# Program description:
# This program was used to calculate the percentage of positive and
# negative classifications from the output of Weka.

from __future__ import division
import time

# Initialise variables
pos_count = 0
neg_count = 0
total = 0
pos_word = "positive"
neg_word = "negative"

# Read the file containing the results from Weka
resultsfile = open('results.txt', 'r')

# Iterate through each line of the input file
for line in resultsfile:
    # Increment the instance counter
    total = total + 1
    # If the string 'positive' occurs in the line, increment the positive count
    if pos_word in line:
        pos_count = pos_count + 1
    # Otherwise the classification has to be negative
    else:
        neg_count = neg_count + 1

# Close the input file
resultsfile.close()

# Calculate the percentage of instances in the positive and negative classes
pospercentage = (pos_count / total) * 100
negpercentage = (neg_count / total) * 100

# Compile the results and print them to screen
print "Number of instances: " + str(total)
print "Number of instances classified as positive: " + str(pos_count)
print "Number of instances classified as negative: " + str(neg_count)
print "Percentage of instances classified as positive: " + "%.2f" % pospercentage + "%"
print "Percentage of instances classified as negative: " + "%.2f" % negpercentage + "%"

Appendix B
JSON Format

Example JSON File

{
    "favorited": false,
    "contributors": null,
    "truncated": false,
    "text": "@icclewu Great! I can see this in our inbox so we'll be in touch as soon as we can :) ^JH",
    "in_reply_to_status_id": 440775991557128192,
    "user": {
        "follow_request_sent": null,
        "profile_use_background_image": true,
        "default_profile_image": false,
        "id": 1567914936,
        "profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/433641771235422208/llhj5phi.png",
        "verified": true,
        "profile_image_url_https": "https://pbs.twimg.com/profile_images/422701065759244288/o_tu67cn_normal.png",
        "profile_sidebar_fill_color": "DDEEF6",
        "profile_text_color": "333333",
        "followers_count": 8325,
        "profile_sidebar_border_color": "FFFFFF",
        "id_str": "1567914936",
        "profile_background_color": "C0DEED",
        "listed_count": 42,
        "is_translation_enabled": false,
        "utc_offset": 0,
        "statuses_count": 96857,
        "description": "Official home of VodafoneUK\u2019s help team. 8am/8pm 7 days/week. Feel free to drop us a tweet or DM. Latest news: @VodafoneUK & great deals: @VodafoneUKdeals",
        "friends_count": 394,
        "location": "",
        "profile_link_color": "0084B4",
        "profile_image_url": "http://pbs.twimg.com/profile_images/422701065759244288/o_tu67cn_normal.png",
        "following": null,
        "geo_enabled": false,
        "profile_banner_url": "https://pbs.twimg.com/profile_banners/1567914936/1392381491",
        "profile_background_image_url": "http://pbs.twimg.com/profile_background_images/433641771235422208/llhj5phi.png",
        "name": "Vodafone UK Help",
        "lang": "en-gb",
        "profile_background_tile": false,
        "favourites_count": 7,
        "screen_name": "VodafoneUKhelp",
        "notifications": null,
        "url": "http://forum.vodafone.co.uk",
        "created_at": "Thu Jul 04 10:58:51 +0000 2013",
        "contributors_enabled": false,
        "time_zone": "London",
        "protected": false,
        "default_profile": false,
        "is_translator": false
    },
    "filter_level": "medium",
    "geo": null,
    "id": 440788853650374657,
    "favorite_count": 0,
    "lang": "en",
    "entities": {
        "symbols": [],
        "user_mentions": [
            {
                "id": 19183344,
                "indices": [0, 8]
            }
        ]
    },
    "created_at": "Tue Mar 04 10:00:26 +0000 2014",
    "retweeted": false,
    "coordinates": null,
    "in_reply_to_user_id_str": "19183344",
    "source": "<a href=\"http://www.spredfast.com\" rel=\"nofollow\">spredfast app</a>",
    "in_reply_to_status_id_str": "440775991557128192",
    "in_reply_to_screen_name": "icclewu",
    "id_str": "440788853650374657",
    "place": null,
    "retweet_count": 0,
    "in_reply_to_user_id": 19183344
}

Appendix C
Classifier Evaluation Results
Naïve Bayes

Time taken to build model: 27.22 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances       27703               69.2575 %
Incorrectly Classified Instances     12297               30.7425 %
Kappa statistic                          0.3852
Mean absolute error                      0.3722
Root mean squared error                  0.4595
Relative absolute error                 74.4311 %
Root relative squared error             91.9078 %
Total Number of Instances            40000

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.714     0.329     0.685       0.714    0.699       0.752      positive
               0.671     0.286     0.701       0.671    0.686       0.752      negative
Weighted Avg.  0.693     0.307     0.693       0.693    0.692       0.752

=== Confusion Matrix ===

     a     b   <-- classified as
 14286  5714 |   a = positive
  6583 13417 |   b = negative

Random Forest (10 trees)

Classifier Model
Random forest of 10 trees, each constructed while considering 11 random features.
Out of bag error: 0.1824

Time taken to build model: 436.65 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances       32868               82.17   %
Incorrectly Classified Instances      7132               17.83   %
Kappa statistic                          0.6434
Mean absolute error                      0.2418
Root mean squared error                  0.3529
Relative absolute error                 48.3661 %
Root relative squared error             70.5781 %
Total Number of Instances            40000

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.847     0.203     0.806       0.847    0.826       0.905      positive
               0.797     0.153     0.839       0.797    0.817       0.905      negative
Weighted Avg.  0.822     0.178     0.823       0.822    0.822       0.905

=== Confusion Matrix ===

     a     b   <-- classified as
 16937  3063 |   a = positive
  4069 15931 |   b = negative

Random Forest (20 trees)

Classifier Model
Random forest of 20 trees, each constructed while considering 11 random features.
Out of bag error: 0.1589

Time taken to build model: 1183.22 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances       33168               82.92   %
Incorrectly Classified Instances      6832               17.08   %
Kappa statistic                          0.6584
Mean absolute error                      0.2422
Root mean squared error                  0.3466
Relative absolute error                 48.4302 %
Root relative squared error             69.3295 %
Total Number of Instances            40000

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.847     0.189     0.818       0.847    0.832       0.911      positive
               0.811     0.153     0.841       0.811    0.826       0.911      negative
Weighted Avg.  0.829     0.171     0.83        0.829    0.829       0.911

=== Confusion Matrix ===

     a     b   <-- classified as
 16944  3056 |   a = positive
  3776 16224 |   b = negative

Support Vector Machine

Number of kernel evaluations: 316463797 (73.88% cached)

Time taken to build model: 2563.7 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances       31734               79.335  %
Incorrectly Classified Instances      8266               20.665  %
Kappa statistic                          0.5867
Mean absolute error                      0.2067
Root mean squared error                  0.4546
Relative absolute error                 41.33   %
Root relative squared error             90.9175 %
Total Number of Instances            40000

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.785     0.198     0.798       0.785    0.792       0.793      positive
               0.802     0.215     0.788       0.802    0.795       0.793      negative
Weighted Avg.  0.793     0.207     0.793       0.793    0.793       0.793

=== Confusion Matrix ===

     a     b   <-- classified as
 15696  4304 |   a = positive
  3962 16038 |   b = negative

Appendix D
Manual Validation

#   | DB # | Tweet                                                                                    | Actual Class | Classified As | Correct/Incorrect
1   | 4    | So Supernatural season finale tomorrow is trending worldwide? Are we that excited or that terrified? Both? | Positive | Positive | Correct
2   | 8    | #Supernatural Stars Preview 'Chilling' Finale Cliffhanger, 'Extreme' Measures to Save\ | Positive | Positive | Correct
3   | 1504 | can't help but be encouraged by what he's saying DAMNIT JARED | Positive | Negative | Incorrect
4   | 1507 | Cant wait for Supernatural Season Finale Tomorrow | Positive | Negative | Incorrect
5   | 1509 | Jared Padalecki cried after reading the script for the season finale. #Supernatural | Positive | Negative | Incorrect
6   | 1541 | Supernatural is ruining my life I hate this show send help | Negative | Negative | Correct
7   | 1929 | The sims 3 supernatural really? i love this game but I hate they're taking away the realistic parts of life in this game | Positive | Positive | Correct
8   | 4268 | Also, #Supernatural fans - am I the only one who feels just a tiny bit sorry for Metatron? Don't get me wrong, I hate him, but... | Negative | Negative | Correct
9   | 4711 | @MariahCarey Thank you for giving us that #beautiful moment with #dembabies on #Supernatural | Positive | Positive | Correct
10  | 4955 | Metatron is such a bastard #Supernatural #StairwayToHeaven @cw_spn I really hate him | Negative | Negative | Correct
11  | 5249 | DON'T MAKE US HATE THIS SEASON MORE #Supernaturalfinale | Negative | Negative | Correct
12  | 2203 | Also, why is Supernatural trending? Lol #getridofthatshow @giantlovetingle @likeswaffles | Negative | Positive | Incorrect
13  | 2207 | I love it when Sam is a geek. reminds me of the good ol season 1 days #Supernatural | Positive | Positive | Correct
14  | 4491 | I hate that I'm missing Supernatural! | Positive | Negative | Incorrect
15  | 2268 | Think action packed supernatural thriller with steamy romance and a wonderful love story | Positive | Positive | Correct
16  | 2453 | I love it when my husband and I match #jensenackles #tabbieackles #flowerpower #supernatural | Positive | Positive | Correct
17  | 2606 | @FabAndSassy I'm really digging the supernatural love story too | Positive | Positive | Correct
18  | 2662 | omg supernatural is SO CUTE! I wanna thank you for this beautiful music, it was definitely worth the wait! I love you so much | Positive | Positive | Correct
19  | 2146 | Supernatural Season Finale TOMORROW #Love #Supernatural | Positive | Negative | Incorrect
20  | 2675 | RT @ihatescottietoo: I love Cry., YDKWTD, Heavenly, Supernatural, | Positive | Positive | Correct
21  | 2774 | Haha\U0001f602 Love this\u2764\ufe0f #Supernatural | Positive | Positive | Correct
22  | 2267 | i love supernatural bc of dean | Positive | Positive | Correct
23  | 7957 | Thank you for giving us that #beautiful moment with #dembabies on #Supernatural | Positive | Positive | Correct
24  | 3779 | @Jenn_Rush I hate Laurel so much. Which is a shame, because I liked Kate Cassidy as Ruby on Supernatural. | Negative | Negative | Correct
25  | 7340 | I've watched all supernatural episoden and that makes me sad cuz i liked watching them everyday | Positive | Negative | Incorrect
26  | 2193 | Its our responsiblity to be there for one another. Love each other and make our world SUPERNATURAL | Positive | Positive | Correct
27  | 7462 | Supernatural Season Finale TODAY is happening-- and I don't care like I used to | Negative | Positive | Incorrect
28  | 7473 | @TimeLordDevious I guess he is the villain but still. I like Crowley. He grows on you, Moose and Squirrel. lol #Supernatural | Positive | Positive | Correct
29  | 7891 | uh oh it looks like the cunty supernatural forum found out what a mix | Negative | Positive | Incorrect
30  | 7951 | RT @im_mishasminion: The love story of an angel and a human is a reality. You can't deny it. #Supernatural | Positive | Positive | Correct
31  | 7602 | @releasethedoves: I'm researching demons for my art class this is so cool man it's like supernatural | Positive | Positive | Correct
32  | 3329 | Love the freaking out that is already happening. Can't wait to see (and experience) the feels after tonight #supernatural | Positive | Positive | Correct
33  | 7897 | I'm having like hardcore anxiety about tonight's episode omfg. #Supernatural | Positive | Negative | Incorrect
34  | 1894 | eericareyes i think it's worth it though. i still love supernatural a lot and the last episodes were definitely not bad | Positive | Positive | Correct
35  | 3330 | I would love finding someone that loves Glee, PLL and Supernatural | Positive | Positive | Correct
36  | 7432 | Metatron is such a bastard #Supernatural #StairwayToHeaven @cw_spn I really hate him | Negative | Negative | Correct
37  | 1765 | Also, #Supernatural fans - am I the only one who feels just a tiny bit sorry for Metatron? Don't get me wrong, I hate him, but... | Negative | Negative | Correct
38  | 1780 | I love it when Sam is a geek. reminds me of the good ol season 1 days #Supernatural | Positive | Positive | Correct
39  | 2390 | if you hate your life start watching supernatural & become emotionally invested in it | Negative | Negative | Correct
40  | 2606 | @FabAndSassy I'm really digging the supernatural love story too | Positive | Positive | Correct
41  | 6498 | Think action packed supernatural thriller with steamy romance and a wonderful love story | Positive | Positive | Correct
42  | 3020 | OH I GOT A LOVE THAT KEEPS ME WAITING Supernatural Season Finale TODAY | Positive | Positive | Correct
43  | 6342 | RT @ihatescottietoo: I love Cry., YDKWTD, Heavenly, Supernatural, | Positive | Positive | Correct
44  | 6782 | Also, why is Supernatural trending? Lol #getridofthatshow @giantlovetingle @likeswaffles | Negative | Positive | Incorrect
45  | 7832 | Its our responsiblity to be there for one another. Love each other and make our world SUPERNATURAL | Positive | Positive | Correct
46  | 3742 | Love this episode. It teaches you how to never piss off Dean Winchester! | Positive | Positive | Correct
47  | 4583 | DON'T MAKE US HATE THIS SEASON MORE #Supernaturalfinale | Negative | Negative | Correct
48  | 786  | @marshalthepig great review of #supernatural | Positive | Positive | Correct
49  | 2673 | @releasethedoves: I'm researching demons for my art class this is so cool man it's like supernatural | Positive | Positive | Correct
50  | 3753 | RT @calvinball_: How can you not fall in love with this face? #JensenAckles #Supernatural | Positive | Positive | Correct
51  | -    | #supernatural is shit! #GOT is where its at!! | Negative | Negative | Correct
52  | 4684 | Love the freaking out that is already happening. Can't wait to see (and experience) the feels after tonight #supernatural | Positive | Positive | Correct
53  | 3779 | @Jenn_Rush I hate Laurel so much. Which is a shame, because I liked Kate Cassidy as Ruby on Supernatural. | Negative | Negative | Correct
54  | 6412 | Just me or was that episode pretty shit? #supernatural | Negative | Negative | Correct
55  | 2114 | I love watching an episode of #Supernatural for the first time and having to play Dean Or Not Dean | Positive | Positive | Correct
56  | 2051 | RT @MariannePereyr1: Hahaha that made me laugh ^_^ you gotta love them<3 #Supernatural | Positive | Positive | Correct
57  | 2041 | I've put supernatural onto the disk menu, I'm just going to fall asleep to the lovely repetitive theme song | Positive | Positive | Correct
58  | 1971 | There's two shows I really love.supernatural & Sons of Anarchy. So entertaining | Positive | Positive | Correct
59  | 5962 | Whatever, I like Supernatural, ok? | Positive | Positive | Correct
60  | 5770 | The only thing I do while watching the show Supernatural is complain about how much I hate it.. | Negative | Negative | Correct
61  | 7122 | so about me; i like supernatural, a lot of bands, reading, and i hate most people but yeah, that's about it | Positive | Negative | Incorrect
62  | 1541 | Supernatural is ruining my life I hate this show send help | Negative | Negative | Correct
63  | 1933 | I love it #supernatural | Positive | Positive | Correct
64  | 1958 | RT @iamdivergentris: Dean and Sam\nJensen and Jared\nI love them so much\u0001f60d #Supernatural | Positive | Positive | Correct
65  | 2334 | I love watching an episode of #Supernatural for the first time | Positive | Positive | Correct
66  | 209  | Such a shit final episode #SPNFamily | Negative | Negative | Correct
67  | 4423 | @Reeeggirl: I love how the #SPNFamily is coming together at the mutual fact that we will all be dying emotionally tonight #Supernatural | Positive | Positive | Correct
68  | -    | Dean is such a dirty twat @supernatural | Negative | Negative | Correct
69  | 467  | Also, #Supernatural fans - am I the only one who feels just a tiny bit sorry for Metatron? Don't get me wrong, I hate him, but... | Negative | Negative | Correct
70  | -    | Hahaha fuck dogs!! I love it! #supernatural | Positive | Positive | Correct
71  | 3877 | "Supernatural", I love it. | Positive | Positive | Correct
72  | -    | Love this episode. It teaches you how to never piss off Dean Winchester! | Positive | Positive | Correct
73  | 381  | Supernatural season finale TOMORROW" DONT TREND STUFF LIKE THAT DONT REMIND ME | Negative | Positive | Incorrect
74  | 8081 | the whole of season 9 was shit #supernatural | Negative | Negative | Correct
75  | -    | I would rather hang myself then watch #supernatural | Negative | Positive | Incorrect
76  | 674  | I love watching an episode of #Supernatural for the first time and having to play Dean Or Not Dean | Positive | Positive | Correct
77  | 4250 | @B1nkMonstr watching Supernatural again. I love how killing the dog is when the guy goes too far | Positive | Positive | Correct
78  | 276  | Supernatural is ruining my life I hate this show send help | Negative | Negative | Correct
79  | 872  | @Reeeggirl: I love how the #SPNFamily is coming together | Positive | Positive | Correct
80  | 8323 | Absolutely amazing ep #supernatural | Positive | Positive | Correct
81  | 5873 | Also, #Supernatural fans - am I the only one who feels just a tiny bit sorry for Metatron? Don't get me wrong, I hate him, but... | Negative | Negative | Correct
82  | 3856 | I love it #supernatural | Positive | Positive | Correct
83  | 8266 | I just walked right into an argument about destiel and the queer rep in supernatural and then just walked back out bc I realized I dont care | Negative | Positive | Incorrect
84  | 4175 | I just love Supernatural yas bitch yas | Positive | Positive | Correct
85  | 786  | @marshalthepig great review of #supernatural | Positive | Positive | Correct
86  | 4021 | the whole of season 9 was shit but i still dont want it to end, i have such an emotional bond with supernatural it's ridiculous | Negative | Negative | Correct
87  | 7194 | I've watched like180 episodes of supernatural not including repeats which is over 120 hours of spn is that sad | Positive | Negative | Incorrect
88  | 872  | @Reeeggirl: I love how the #SPNFamily is coming together | Positive | Positive | Correct
89  | 4389 | Supernatural Season Finale TODAY \ni LIKE I LOVE | Positive | Positive | Correct
90  | 5234 | Dan is a dirty pig #supernatural | Negative | Negative | Correct
91  | 3456 | Supernatural is ruining my life I hate this show send help | Negative | Negative | Correct
92  | 4533 | If I don't survive the Season 9 finale tonight, someone tell Misha that I love him! | Positive | Positive | Correct
93  | 5982 | Fucking #supernatural blocking up my feed | Negative | Negative | Correct
94  | 4124 | Just watched my first ep of #supernatural and it was good! | Positive | Positive | Correct
95  | 4464 | #Destiel is the best love story ever written. #Supernatural | Positive | Positive | Correct
96  | 6732 | Good one dean you cunt #supernatural | Negative | Positive | Incorrect
97  | 4738 | I love it #supernatural | Positive | Positive | Correct
98  | 679  | Fuck #supernatural shit show | Negative | Negative | Correct
99  | 4506 | @yvetterussell yup it sure does, love everything about psychics and supernatural | Positive | Positive | Correct
100 | 7178 | I forgot how much I loved Supernatural like ow | Positive | Positive | Correct

Appendix E
Reports and Project Management

Project Proposal
Introduction

Objective

This project will focus on performing a sentiment analysis on a specific product or service, which will be chosen as the project progresses. Customer opinions will be mined predominantly from social media sites such as Twitter, although more websites and databases may be investigated as the project advances. The main goal of this sentiment analysis is to discover how customers perceive the chosen product or service. The opinions that are mined will be classified into three categories: positive, neutral and negative. An analysis will then be performed on the classified data to see what percentage of the population sample falls into each category.

Importance of this area

Sentiment analysis and Natural Language Processing are very important areas at the moment, and there is a shortage of people skilled in them. A massive amount of information is uploaded to the internet daily on social media websites and blogs that computers cannot understand. Traditionally it was not possible to process such large amounts of data, but with computer performance following the projections of Moore's Law and the introduction of distributed computing (e.g. Hadoop), large data sets can now be processed with relative ease. Major investment is going into this area, such as IBM's tireless research into its Natural Language Processing supercomputer Watson and Google's recent acquisition of DeepMind. With further research and investment in this area, machines will soon be able to gain an understanding from text, which will greatly improve data analytics and search engines. For example, a few years ago search engines relied on matching the keywords entered by a user; at the moment the Google search engine has only a very basic understanding of what a user is looking for, but in the future it will be able to return results based on an understanding of the user's search and not just the words entered.
Contribution to the Knowledge

A customer's perception of a product is extremely valuable data to some companies. From the knowledge gained from an analysis such as this, a company can identify issues with its products, spot trends before its competitors, create improved communications with its target audience, and gain valuable insight into how effective its marketing campaigns were. Through this knowledge companies gain valuable feedback which allows them to further develop the next generation of their product. Note: from an academic point of view it is not anticipated that anything new will be added to the knowledge base, as natural language processing and sentiment analysis
are highly researched areas. With this project being at level 8 on the national framework, and with the short timeframe provided, it would be incredibly difficult to make a meaningful contribution to the already massive knowledge base. This does not mean that the project is not worthwhile: a project such as this could have great potential for small to medium sized companies to gain value from data analytics without investing heavily in the area.

Background

"Sentiment analysis is the field of study that analyses people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, and their attributes" (Liu, 2012). Sentiment analysis is predominantly implemented in software which can autonomously extract emotions and opinions from text. It has many real-world applications; for example, it allows companies to analyse how their products or brand are being perceived by consumers, a usage particularly applicable to this project. It is difficult to classify sentiment analysis as one specific field of study, as it incorporates many different areas such as linguistics, Natural Language Processing (NLP), and Machine Learning or Artificial Intelligence. "Natural Language Processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages" (Kumar, 2011). The NLP part of sentiment analysis focuses on the actual text content, for example a tweet or the text in a personal blog. It is the job of NLP to transform the text data into a form that can be read and understood by a computer program. There are many methods of achieving this, such as string processing (tokenizers, sentence tokenizers, stemmers, tagging) or part-of-speech tagging (n-gram, backoff, Brill, HMM, TnT) (Bird, Loper and Klein, 2009). When the NLP is complete, artificial intelligence can be used to classify the sentiment.
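The tokenisation step mentioned above can be illustrated with a small, library-free sketch; NLTK's tokenizers do the same job far more robustly, and the sample tweet here is invented for illustration:

```python
import re

def tokenize(text):
    # Lower-case the text and pull out simple word tokens,
    # discarding punctuation and other non-letter characters.
    return re.findall(r"[a-z']+", text.lower())

tweet = "Loving the new episode, absolutely AMAZING!"  # hypothetical input
print(tokenize(tweet))
# -> ['loving', 'the', 'new', 'episode', 'absolutely', 'amazing']
```

Stemming would then map tokens such as 'loving' to a common root before classification.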
Some methods that are commonly used are Naïve Bayes, Maximum Entropy, Support Vector Machines, and Artificial Neural Networks (Gelbukh, 2012). It was found that the Support Vector Machine had the highest level of accuracy out of the methods listed (Bo Pang, 2002). Although Pang's research suggests that the SVM would provide the highest level of accuracy, this is not always the case; results can vary dramatically depending on the samples.
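To make one of these classifiers concrete, the following is a minimal from-scratch multinomial Naïve Bayes with Laplace smoothing. The project itself relies on Weka's implementations, and the toy training data here is invented purely for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (token_list, label) pairs. Returns a simple model tuple."""
    class_counts = Counter()             # how many documents per class
    word_counts = defaultdict(Counter)   # word frequencies per class
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(model, tokens):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label, n_docs in class_counts.items():
        lp = math.log(float(n_docs) / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            # Laplace (add-one) smoothing so unseen words do not zero the probability
            lp += math.log((word_counts[label][t] + 1.0) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Hypothetical toy training data
train = [(["love", "this", "show"], "positive"),
         (["great", "episode"], "positive"),
         (["hate", "this", "show"], "negative"),
         (["worst", "episode"], "negative")]
model = train_nb(train)
print(predict(model, ["love", "this", "episode"]))  # -> positive
```

Weka's NaiveBayes and the other classifiers evaluated later apply the same idea to feature vectors built from the tweet text.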
The results of a sentiment analysis show customer sentiment towards a product or service. Gaining feedback such as this was traditionally very expensive and relied on paying customers to complete surveys or questionnaires, which may not always return accurate information. For example, surveys or questionnaires often do not address all the problems, and sometimes the people taking them change their opinions depending on the surveyor or the environment. Sentiment analysis allows organisations to gain large amounts of this valuable information at a fraction of the cost and time that more traditional methods would incur.

Technical Approach

This section will highlight the technical approach that will be followed for this project and will include the system description:

Mining Data
Data Cleansing
Classifying the Data
Analysis

Mining data

The data (tweets) will be collected using Twitter's API or by parsing (scraping) the website. It is worth noting that Twitter's freely available stream presents only about 1% of all Twitter data. There are companies that sell the data in larger quantities, up to 50%, but using them would incur large costs.

Parsing Twitter

Parsing a webpage involves writing a program to access a webpage in the same way as a standard user would. When the program accesses the page it will see the same content as a normal user, and this can cause many problems: it may be difficult for a program to differentiate between useless content, such as advertisements and logos, and the valuable content that it is looking for. The program will have to look at the source code for the website and take information specified by the programmer, which may present difficulties when writing the program. This method does have some advantages over APIs in some cases. For example, Twitter will only give about 1% of its content over its API, but by parsing Twitter a program may be able to access much larger percentages of the data.
Twitter API

Tweepy is an open source library for Python that allows access to the full range of Twitter's RESTful API (Application Programming Interface) functionality. Tweepy will be used to access tweets, and the results will be compared against the parsing method. The method that performs the best will be used for the project; performance will be measured by the amount of data each method can access.

Data Cleansing

The second phase of the system will be to cleanse the data collected. This will involve removing any punctuation such as (,.!? ) and making everything lower case, which will help in the next stage of the project, especially in the Bag of Words approach. Converting everything to lower case will decrease the redundancy in the database that will be used to store the words. It would be very beneficial if a spell check could be performed on the data collected, but at this stage of the project it is unknown whether this would be possible using Python without adding a large computational overhead.

Classifying the data

Classifying the data is expected to be the most difficult part of the project; it will entail looking at individual words or groups of words in a tweet and attempting to assign a sentiment to them. This is no easy task, as it is very difficult for a computer to understand slang words and sarcasm. Some examples: "This product is sick", "This product is bad ass". These types of tweets could be misclassified because "sick" and "bad" are negative words, but in this context they imply that the product is really good.

Bag of Words Model

The bag of words approach will involve building databases of positive, negative and neutral words. Each tweet will be broken up into individual words and then compared to the words in the databases. When there is a match, a counter will be incremented or decremented by a fixed amount depending on a weighting assigned.
When this process is complete, the counter will be used to classify the sentiment; for example, if the words in the tweet are largely positive the counter should be high.

Artificial Intelligence model

The author has a great interest in Artificial Intelligence and Machine Learning and would be extremely interested in building intelligence models from the ground up, but it is foreseen that this will not be possible within the time frame provided. It is hoped that machine learning algorithms can be introduced into the later stages of this project by training pre-made models which are included in packages such as Weka or RapidMiner. These models would then be used to classify the tweets, or used as a benchmark to test the performance of the Bag of Words model.
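The weighted bag-of-words scoring described above can be sketched as follows; the word lists and weights are hypothetical placeholders, not the lists the project would actually build:

```python
# Hypothetical sentiment word lists with per-word weights
positive_words = {"love": 2, "great": 1, "good": 1}
negative_words = {"hate": 2, "bad": 1, "worst": 2}

def classify(tweet):
    score = 0
    for word in tweet.lower().split():
        score += positive_words.get(word, 0)  # increment on a positive match
        score -= negative_words.get(word, 0)  # decrement on a negative match
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("I love this great show"))  # -> positive
```

The final counter value decides the class; a tweet matching no list words falls through to neutral.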
Analysis of the data

When the data is classified, some kind of analysis will have to be performed on it. This may be as simple as percentages of customer satisfaction, or a more complex analysis could be performed, such as comparing the customer sentiment on two similar products with the aim of finding a correlation between good sentiment and high sales of those products. It may even be possible to look at specific features of a product, such as the screen or battery life of a smartphone, with the aim of finding customer sentiment towards them. This would be extremely valuable information, as it would allow companies to identify perceived weaknesses in their products and improve upon them in future generations/iterations.

Special resources required

For this project it is anticipated that no special resources will be required, except for open source software and libraries that are freely available online.
Project Plan

This project spans 14 weeks. The aim will be to have the project complete by week 12, with the two remaining weeks left to complete the dissertation and prepare the presentation. A Gantt chart of the expected timeframe is included in the appendix of this report.

Technical Details

This section will cover the implementation language and principal libraries that will be used.

Python

For the purpose of this project Python 2.7.x will be used, as it is a mature, versatile and robust programming language. There are extensive open source libraries available for this version of Python and a large community of users. This version was chosen over the latest version, Python 3, for numerous reasons, the main ones being that many existing libraries are not yet compatible with Python 3, and that Python 2.7 is felt to have better documentation and a wider community of users than the latest version. Other high level programming languages such as R and Matlab were considered because they have many benefits, such as ease of use, but they do not offer the same flexibility and freedom that Python can deliver. With this being said, they are not completely ruled out for this project and may be used to visualise data or prototype functions. It is very likely that a low level language such as C may be used for certain functions in the project: if some functions are using substantial CPU resources, they may be converted to C and imported into Python in a bid to increase the efficiency of the program. This choice will be made as the project progresses.
In the unlikely event that the computation required for this project becomes too much for a standard computer, the problem will be distributed over a number of machines using the Hadoop framework on the Amazon Web Services or Windows Azure platform.

Python Libraries

There are many Python libraries that will be used, such as matplotlib, re (regular expressions) and numpy. It is not appropriate to list them all here, but the principal libraries are expected to be Tweepy, which is explained in the Technical Approach section of this proposal, and NLTK (Natural Language Toolkit).

NLTK

NLTK is a collection of resources for Python that can be used for text processing, classification, tagging and tokenization. It is believed that this toolkit will play a key role in transforming the text of the tweets into a format from which sentiment can be extracted.

Database Management

MySQL or SQLite will be used, as both are open source database management systems that are compatible with Python. The decision will be made as the project progresses, as it is not yet known which will offer the best integration with the Python programming language. It is also still unclear whether the tweets taken from Twitter will be stored in a database or in a text file format.
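As an illustration of the kind of tokenization step NLTK would be used for, the sketch below uses only the standard-library re module (in practice NLTK's own tokenizers would replace this; the pattern shown is a simplified assumption, not the project's actual code):

```python
import re

def tokenize(tweet):
    """Split a tweet into lowercase word tokens, keeping #hashtags and @mentions."""
    return re.findall(r"[#@]?\w+", tweet.lower())

print(tokenize("Loving the new episode! #TV @broadcaster"))
```

Tokenized output like this is the usual starting point for building the bag-of-words features that the classifiers will consume.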
Appendix E Project Requirement Specification

Introduction

Purpose

The purpose of this document is to define the requirements for a sentiment analysis that will be performed on a specific product or service. The intended customers of this analysis are the owners of the rights to the product or service in question.

Project Scope

The scope of the project is to perform a sentiment analysis on a product or service. Sentiment will be classified as positive, neutral, or negative; there will be no in-between classes. For example, it does not matter whether the sentiment is somewhat positive or extremely positive, both will be classified under the same label, positive. The output of this analysis will be a report on how the product or service is perceived by the target audience. A number of tools and computer programs will be developed for this project, but it must be noted that these are not the intended output; it is worth restating that the analysis of the target audience is the output of this project. That being said, a system will have to be designed in order to perform this analysis. This system will gather data from Twitter, cleanse the data, and then classify it. An analysis will then be performed on the classified data. The main requirement for this analysis is that it provides a relatively high level of accuracy at a low cost to the client. The results of the analysis must be presented in a way that a client with a low level of technical knowledge can understand.

Definitions, Acronyms, and Abbreviations

API Application Programming Interface
NLP Natural Language Processing
GUI Graphical User Interface
User Requirements Definition

As this is not a software package being developed, there is no third party user; the aim of this project is to perform an analysis on a particular product or service. To perform this analysis, tools and programming languages will be used by the author, but these tools and processes will not be passed on to the customer. The output of this project is a report on how a product is perceived by an audience, not any sort of analysis tool or package. The customer will be presented with this report and it will be their decision how to proceed with the information provided within it. From the customer's perspective it will be important that the information provided represents the sentiment of their target audience accurately, so that they can use it to evaluate their marketing, brand or product. If the analysis returns negative sentiment towards their product, they may wish to perform further analysis to identify the reason for this perceived negativity, in order to identify weaknesses and further develop their product.

Requirements Specification

The analysis produced by this project should be easily understood by a customer with limited technical knowledge. To achieve this requirement the output should be in graphical form, so that the results can be easily visualized by the customer. These visualizations should be provided along with a more detailed technical description of the processes involved in the analysis. It will be required that there is a certain level of accuracy in the system; any classification system with an accuracy of 52% or greater will be deemed acceptable. Even the most advanced NLP algorithms and computers do not achieve 100% accuracy when dealing with datasets such as Twitter, because the amount of slang, abbreviation and sarcasm used on such a site is extremely high.
Functional Requirements

This section of the report lists the functional requirements of the system used to perform the analysis, in ranked order. The functional requirements define what the system must achieve.

o The system will collect user data from Twitter regarding a specified product or service.
o The system will cleanse this data.
o The cleansed data will be stored in a database or text file.
o The system will classify the data into three distinct classes (positive, negative, and neutral).
o An analysis will be performed on the classified data.
Use Case Diagram

The use case diagram below shows an overview of all the functional requirements for this system.

Figure 7: Use Case Diagram

Requirement 1: Data Collection

Description & Priority

This requirement focuses on collecting data regarding a specified product or service. The data will be collected through a streaming program which takes advantage of Twitter's API. The data will then be stored in a database for further processing. This has the highest level of priority, as without this stage no analysis can be performed.

Scope

The scope of this requirement is to collect specified data from Twitter and store it in a database.

Use case Diagram

The use case diagram below shows the data collection phase of the project. The user will enter the prerequisite, which will be the product or service that information is required about. The program will be activated by the user and will proceed to connect to Twitter's API. Once the connection with the API is made, the program will authenticate itself, which will allow it to access Twitter's streaming functions. The program will then store the data from the stream in a SQL database.
Figure 8: Requirement 1 Use Case Diagram

Precondition

As this is the first step in the system, there are no preconditions required for this step to complete successfully.

Activation

This use case is activated when the user runs the Python script.

Main flow

1. User activates the process
2. System establishes connection with database
3. System establishes connection with Twitter API
4. System provides credentials to access API
5. API initialises streaming of data
6. Data is written to database
7. System is terminated by user

Exceptional flow

1. User activates the process
2. System establishes connection with database
3. System establishes connection with Twitter API
4. System provides credentials to access API
5. API initialises streaming of data
6. Database or API becomes unavailable
7. System waits 1 second
8. Loop back to step 2

Termination

The system will terminate either when it reaches the end of the data being streamed from Twitter's API or when the user terminates the program.
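The wait-and-retry behaviour in the exceptional flow can be sketched as below (a minimal illustration only; connect_and_stream is a hypothetical stand-in for the real database and Tweepy API calls, which are not reproduced here):

```python
import time

def run_with_retry(connect_and_stream, max_attempts=5, wait_seconds=1):
    """Run the streaming step, looping back to the start after a short wait on failure."""
    for attempt in range(max_attempts):
        try:
            return connect_and_stream()
        except ConnectionError:
            # Database or API became unavailable: wait, then loop back to step 2
            time.sleep(wait_seconds)
    raise RuntimeError("gave up after %d attempts" % max_attempts)

# Hypothetical usage: a fake stream that fails twice, then succeeds
attempts = []
def fake_stream():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("unavailable")
    return "stream finished"

print(run_with_retry(fake_stream, wait_seconds=0))
```

The real script would wrap the database write and API connection steps in the retried callable rather than a fake stream.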
Post condition

The system goes into a wait state if there is a problem connecting to Twitter's API or if the database it is storing data in is not available for writing. The system will wait 1 second and try the process again.

Requirement 2: Data Cleansing

Description & Priority

This requirement focuses on cleansing the collected data. It will be important to remove any unwanted text or data contained in the collected tweets. Some examples of unwanted content are images or URLs contained in the tweets, along with RT, @user, and any #content. This cleansing has a high priority in the system, as if the data retains this unwanted content it will be very difficult to classify accurately in the next stage of the project.

Scope

The scope of this requirement is to remove any unwanted content from the collected data.

Use case Diagram

Figure 9: Requirement 2 Use Case Diagram

Precondition

The precondition for this task is that the data collection phase of the system has completed successfully.

Activation

This use case is activated when the user runs the Python script.
Main flow

1. User activates process
2. System establishes connection with database
3. System reads data from database
4. System removes unwanted content
5. System writes cleaned data back to database
6. System terminates

Exceptional flow

1. User activates process
2. System establishes connection with database
3. System reads data from database
4. System removes unwanted content
5. Database becomes unavailable
6. System waits 1 second
7. Loop back to step 2
8. System writes cleaned data back to database
9. System terminates

Termination

The system will automatically terminate when all the data has been cleansed.

Post condition

The system will go into a wait state if the database is unavailable for read or write commands. It is unlikely that this post condition will ever be met, as the data cleansing program will be the only application accessing the database at this time and it will not use any form of multithreading. Regardless of how improbable it is that this post condition will ever be required, it will be included in the program, and in the event of the database becoming unavailable the system will wait for 1 second.
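The content-removal step of this requirement can be sketched with the standard-library re module as follows (the exact patterns are an illustrative assumption; the project's real cleaning rules may differ):

```python
import re

def cleanse(tweet):
    """Strip URLs, RT markers, @user mentions and #hashtags from a tweet."""
    tweet = re.sub(r"http\S+", "", tweet)   # URLs and image links
    tweet = re.sub(r"\bRT\b", "", tweet)    # retweet marker
    tweet = re.sub(r"@\w+", "", tweet)      # @user mentions
    tweet = re.sub(r"#\w+", "", tweet)      # #hashtag content
    return " ".join(tweet.split())          # collapse leftover whitespace

print(cleanse("RT @fan Great episode tonight! #TV http://t.co/abc123"))
```

Applying a function like this to every stored tweet before classification removes exactly the unwanted content listed in the requirement description.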
Requirement 3: Classifying Data

Description & Priority

This requirement focuses on classifying each piece of the cleansed data into one of three classes (positive, negative, or neutral). At this stage in the project it is still unclear what method will be used for the classification. This classification has a high level of priority, as without the tweets being classified into sentiments there will be no data to perform the analysis on.

Scope

The scope of this requirement is to classify the data that has been collected and cleansed into one of three classes: positive, negative or neutral sentiment.

Use case Diagram

Figure 10: Requirement 3 Use Case Diagram

Precondition

The preconditions for this stage are that the data collection and data cleansing phases have been completed successfully.

Activation

This use case is activated when the user runs the Python script.

Main flow

1. User activates process
2. System establishes connection with database
3. System reads data
4. System classifies data into classes
5. System stores the classified data in database
6. System terminates
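Although the classification method is left open at this stage, one of the candidates considered later in the project, Naive Bayes, can be sketched over a bag-of-words model in a few lines (a toy illustration with made-up training examples, not the project's trained model):

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Fit per-class word counts from (tokens, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for tokens, label in examples:
        word_counts[label].update(tokens)
        class_counts[label] += 1
    return word_counts, class_counts

def classify(tokens, word_counts, class_counts):
    """Pick the class maximising log P(class) + sum of log P(word|class), with add-one smoothing."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokens:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

examples = [(["great", "show"], "positive"),
            (["loved", "it"], "positive"),
            (["awful", "show"], "negative")]
wc, cc = train(examples)
print(classify(["great", "it"], wc, cc))
```

The project's actual classifiers are trained on the collected 20000-tweet dataset and, as later sections describe, implemented in Weka rather than by hand.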
Termination

The system will terminate automatically when all the cleansed data has been classified.

Post condition

At this stage of the project it is still unknown what method will be used for the classification phase, so it is unwise to speculate on what post condition will be used.

Interface Requirements

This section of the report defines any interfaces that are required between the system and other software products, or between the system and users.

GUI

As the system being developed will not be passed on to any third parties, a graphical user interface will not be included as part of the requirements. It is possible that a basic GUI will be created as the project progresses, with its function being to simplify some of the more repetitive processes for the developer.

Application Programming Interfaces

The system being developed will not offer an API of its own, but it will take advantage of the Twitter API v1.1. This API will allow the system access to a stream of data provided by Twitter users. Without access to this interface, extra complexity would be added to the task of collecting large datasets, and this stage would have to be implemented by parsing thousands of pages from Twitter. At this stage of the project it is still unknown whether any other APIs will be used, but it is quite possible that others may be taken advantage of to improve the overall system.
System Architecture

Figure 11 below shows a block diagram of the system architecture. This diagram defines the structure of the system.

Figure 11: System Architecture

System Evolution

This system could evolve in many ways over time; the part of the project with the most scope for evolution is the classification phase. More classes could be added to include a weighting on how positive or negative the sentiment being analysed is. A larger training dataset could be used to increase the accuracy of the classifier. A reinforcement learning algorithm could be introduced so that the system improves over time.
Appendix E Management Progress Report 1

Period Covered

This project management report covers the period from 01/02/2014 to 16/03/2014.

Report Purpose

The purpose of this report is to provide the project supervisor with the current status of the project. It also serves as a means to monitor the project's progress and to outline any potential problems and issues that have arisen during the period.

Activities during the Period

Research

A large portion of the time during this period was spent researching the key areas of the project. The first 2-3 weeks of the period were spent deciding on a project that would meet the requirements of the module. The key areas researched were Sentiment Analysis, Natural Language Processing, and classification models for text.

Project Proposal

The project proposal report was the first major milestone in the project. It aimed to define the project objectives and goals. It was completed to a high standard and the submission deadline was met.

Requirement Specifications

The second major deadline for the project was the Requirements Specifications report. Its purpose was to define the requirements of the proposed project. It was completed to a high standard and the submission deadline was met.

Data Collection

The Data Collection phase was the first technical task required for the project; a basic overview of this task is listed below:

1. Python script created to utilize Twitter API v1.1
2. Data stream from API was filtered to only collect tweets written in the English language
3. SQL database was created to store the data
4. Training datasets were collected containing 20000 examples of positive and negative tweets
Data Cleansing

The second technical phase of the project was the Data Cleansing phase. This phase was required to remove any unwanted content from the data that was collected. A basic overview of the process is listed below:

1. Read data from SQL database
2. Remove unwanted content
3. Write cleaned data to new database table

Figure 12: Project Mind Map

Products Completed during the Period

Figure 12 above shows the main stages of the project; the phases coloured in yellow have been completed. To date the Data Collection phase and the Data Cleansing phase have been successfully completed.

Variance from Plan

To date there is very little variance from the original project plan; the project is progressing as expected and is on schedule. It is expected that in the next phase of the project there may be delays and issues arising, as the classification algorithms are expected to pose the most difficulty.
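The three cleansing steps above can be sketched end to end with the standard-library sqlite3 module (shown against an in-memory database with made-up table and column names, since the project's actual schema is not specified here):

```python
import re
import sqlite3

def cleanse(tweet):
    """Remove URLs, RT markers, mentions and hashtags, then tidy whitespace."""
    for pattern in (r"http\S+", r"\bRT\b", r"@\w+", r"#\w+"):
        tweet = re.sub(pattern, "", tweet)
    return " ".join(tweet.split())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_tweets (text TEXT)")
conn.execute("CREATE TABLE clean_tweets (text TEXT)")
conn.execute("INSERT INTO raw_tweets VALUES ('RT @user Loving this #show http://t.co/x')")

# Step 1: read from the database; step 2: remove unwanted content;
# step 3: write the cleaned data to the new table
for (text,) in conn.execute("SELECT text FROM raw_tweets"):
    conn.execute("INSERT INTO clean_tweets VALUES (?)", (cleanse(text),))

print(conn.execute("SELECT text FROM clean_tweets").fetchone()[0])
```

Writing to a separate table, as in step 3, keeps the raw tweets intact in case the cleaning rules need to be revised and re-run.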
Planned Work for Next Period (16-03-2014 to 29-03-2014)

The next period covers two weeks, and it is expected that a large amount of work will be achieved in it. During the period it is planned to work on the feature extraction phase and the classification phase of the project. It is planned to fully complete a Naïve Bayes classifier and a Maximum Entropy classifier, along with the feature extraction programs for each of them.

Product Completions next Period

It is planned that the feature extraction phase will be fully completed and the classification phase will be partially completed. It is planned that three different classifiers will be implemented for this project and their results compared. The partial completion of the classification phase will involve implementing a Naïve Bayes classifier and a Maximum Entropy classifier. The final stage of the classification phase will be to implement a Support Vector Machine, but this will be completed in the last period of the project.
Appendix E Management Progress Report 2

Period Covered

This project management report covers the period from 16/03/2014 to 13/04/2014.

Report Purpose

The purpose of this report is to provide the project supervisor with the current status of the project. It also serves as a means to monitor the project's progress and to outline any potential problems and issues that have arisen during the period.

Activities during the Period

Due to the heavy workload in the other modules during this period, the project's progression has been delayed from the original plan.

Research

The majority of the time during this period was spent on research, with the main focus being the current literature in the field. From this research it was found that the Maximum Entropy classifier consistently underperforms when compared to Naive Bayes and the Support Vector Machine. With its accuracy generally lower than both of the other classifiers, and its computational overhead many times in excess of theirs, it was decided that this model would be removed from the project. Removing this model will help to regain some of the time lost during this period and to bring the overall project back on schedule. A large amount of time was also spent investigating Weka during this period, and it was found that using this machine learning environment could significantly reduce the time taken to train and test models. It has not yet been decided whether Weka will be used, but it is noted that it is available as a backup if the project runs into any other obstacles during the next period.
Figure 13: Project Mind Map

Products Completed during the Period

Due to the heavy workload in the other modules, no products have been completed during this period. Figure 13 above shows the main stages of the project; the phases coloured in yellow have been completed. To date the Data Collection phase and the Data Cleansing phase have been successfully completed.

Variance from Plan

To date the project is behind schedule; it was planned to complete two of the classification modules last period, but they remain unfinished. To counteract these delays it is planned to remove the Maximum Entropy classifier from the project, unless time allows for its inclusion towards the end of the next period. The use of Weka has also been considered, which would allow data models to be trained and tested very quickly; this may allow the project to get back on schedule.
Planned Work for Next Period (13-04-2014 to 27-04-2014)

During the next project period, which is two weeks, a lot of work will be done to make sure that the project gets back on schedule. During the period it is planned to fully complete the Naïve Bayes classifier and the Support Vector Machine classifier. While working on these classifiers, the final report will be updated with the key information regarding their use. At the end of this period the results of each classifier will be analysed and compared to find the most suitable model for classifying consumer sentiment.

Product Completions next Period

At the end of the next period it is planned to fully complete the Naïve Bayes and Support Vector Machine classifiers. Once they are complete, it is planned to implement an evaluation of their performance, with the aim of deciding which is most suitable for the task of consumer sentiment analysis. It is also planned to complete the classification and model evaluation sections of the final report during this period.