Predicting Short Term Company Performance by Applying Sentiment Analysis and Machine Learning Algorithms on Social Media

Niels ten Boom
University of Twente
P.O. Box 217, 7500AE Enschede
The Netherlands
n.d.j.tenboom@student.utwente.nl

ABSTRACT

This paper reports on research into the use of sentiment analysis on social media to predict a company's short-term performance. The stock price of a company is used as the measure of its short-term performance. The sentiment is extracted from a large corpus of tweets mentioning twenty large companies, and a few techniques for extracting sentiment are reviewed. We find that for sentiment analysis a Naive Bayes classifier trained with data very similar to the corpus performs best. We use this Naive Bayes classifier to extract the sentiment from the tweets. Together with the stock prices of the twenty companies, this sentiment is used to train various supervised machine learning models. We find that there is one combination of data for which the accuracy of a classifier is 65.5%, but in most other cases the classifiers appear to do no better than an algorithm that classifies randomly.

Keywords

Sentiment analysis, Twitter, Naive Bayes classification, Machine Learning, Stock price.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
23rd Twente Student Conference on IT, June 22nd, 2015, Enschede, The Netherlands.
Copyright 2015, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science.

1. INTRODUCTION

Social media plays a big role in society nowadays. Many people use it to share photos, stories and their activities. Some use it to express opinions or feelings about various topics, and these are posted on the Internet for everyone to see. This research is interested in the opinions and sentiment expressed towards companies, and in whether these opinions are correlated with the stock price of a company. Twitter provides the social media data on which the sentiment analysis is performed, because it is a platform from which large amounts of posts can easily be filtered and extracted. Take for example the following three tweets mentioning Apple:

#AppleWatch launched by @tim_cook and #Apple team. Looks cool as expected

#appstore down for all users :( #Apple #wtf

Apple Store app updated with support for AppleWatch. #Apple #iphone #ipad

The first tweet clearly has a positive sentiment whereas the second tweet has a negative sentiment. The third tweet does not seem to carry any sentiment and can thus be flagged as neutral. Performing sentiment analysis means that a computer algorithm is used to extract the sentiment of a text.

The stock price of a company can go up, go down or stay roughly the same. This can only happen within the window in which the stock is traded (between 9AM and 5PM). An example of the price direction of Apple's stock during one day can be seen in Figure 1. It is clear that the stock price went down that day. It may have been possible to predict this by examining the sentiment of the public, because research suggests that public sentiment has an influence on the financial market [7].

Figure 1. Sample of daily stock price.

We will experiment with predicting whether the stock price will have gone up or down at the end of the day. This will be done by performing sentiment analysis on tweets posted in the same period and by feeding the results of this analysis to several machine learning algorithms. There exists a lot of controversy regarding the prediction of stock price directions (up or down).
Some theories suggest that it cannot be done [14] [13], whereas other research reports very positive results [2] [5]. This will be further discussed in Section 2. In this paper we hope to reach a clear conclusion regarding this controversy. If the results of this research are positive, a system could be created that monitors Twitter and predicts whether the stock price of a company is likely to go up or down. Such a system could be of use in the financial domain as an analysis tool for making investment decisions.

In this paper we first discuss the related work in Section 2. The research questions are then formulated in Section 3. The methodology is described in Section 4 and the results of the experiments described there are presented in Section 5. From these results a conclusion is drawn in Section 6.

2. RELATED WORK

This research uses different methods for sentiment analysis, an area in which much research has already been done [9]. Sentiment analysis of tweets has also been executed successfully [8]. The research that accurately predicted box office numbers for movies using social media sentiment is related [1], because the success of a movie also contributes to the performance of the company that published it.

The main inspiration for this research comes from the study by J. Bollen et al., who used sentiment analysis on a large number of tweets to predict the direction of a major American market index using machine learning [2]. The results of that research were very positive: they reported an accuracy of 86.7%. But because they tried to predict one single market index, it could have been a matter of favorable circumstances. Their test set consisted of 9 days and thus 9 instances were tested. The present research does not try to predict the direction of a market index but the direction of the stock prices of multiple companies, which results in a larger training and test set.

There is more research on predicting the market using large amounts of data. Some research had less optimistic results [16]. In contrast, other research claims to outperform the market as a whole using large amounts of social media data [5].

The controversy mentioned earlier is that these papers report having predicted the directions of stock prices with high accuracy, which contradicts the widely accepted random walk hypothesis [14]. The random walk hypothesis states that one cannot predict whether a stock price goes up or down with an accuracy greater than 50%.
This is in line with the Efficient-market Hypothesis (EMH) [13], which states that if there were a way to predict the stock market, everybody would be using it, which would influence the market in such a way that it would no longer be predictable.

3. RESEARCH QUESTIONS

The main focus of this research is on the correlation between the sentiment towards companies and the short-term stock price directions of these companies. The main research question can therefore be formulated as:

Is it possible to predict the daily stock price direction by performing sentiment analysis on a large amount of social media messages mentioning a company?

This main question is answered with the help of the following subquestions:

1. Which method of sentiment analysis is most effective for analyzing short text messages? There are several methods and tools for sentiment analysis. A few different tools will be used to perform sentiment analysis on the tweets, and these tools will be evaluated.

2. Which combination of machine learning algorithm and data features yields the best result? After all of the tweets have been converted by the sentiment analysis, machine learning algorithms will be implemented and evaluated. Experiments will be done with different combinations of sentiment, for instance by processing only the amount of positivity, only the amount of negativity, or a combination of both.

3. Does taking the sentiment of earlier days into account improve the accuracy of the prediction? It could be that the sentiment of a specific day is only reflected in the stock price later. Therefore, we will experiment with processing the sentiment of up to four days earlier.

4. METHODOLOGY

This section describes the methodology used in this research. The methodology can be split up into three parts. The first part is the collection and preprocessing of the data, which is discussed in Section 4.1. Section 4.2 describes the sentiment analysis, the tools used for it and the evaluation of these tools.
Section 4.3 describes how all of the data was processed with the machine learning algorithms.

4.1 Dataset

The dataset for the analysis was acquired through the Application Programming Interface (API) of the short message platform Twitter. The API was used to scrape tweets related to the twenty companies in America with the largest market capitalization [12]. The reasoning behind this was that larger companies get mentioned more on social media, which should result in enough data to work with. The API was queried for English tweets containing the hashtag name of each company and its stock market ticker. For instance, Apple tweets were saved when they contained one of the strings #Apple, #AAPL or AAPL. This process ran on a server from March 17, 2015 until May 1, 2015. In total about 1.5 million tweets were saved. The tweets came with UTC timestamps; these were converted to CST timestamps to match the American stock trading window.

The stock price data of the twenty companies was extracted from Yahoo Finance [15]. This data was then simplified by labeling each day with the class UP or DOWN. Table 1 shows an example in which the Apple stock is assigned the DOWN class because the stock went down that day (Open > Close). The situation where a stock value stays the same was disregarded, because this did not occur in the dataset.

Table 1. Example of the stock price data preprocessing.

Company   Date         Open     Close    Class
Apple     2015-04-30   128.64   125.15   DOWN

This classification of the stock price data was done to reduce the complexity of the experiments. Exact statistics of all the data can be found in Table 7. From this table it becomes clear that the tweets are not evenly distributed: some companies are discussed on Twitter more often than others. The company Berkshire Hathaway received too few tweets and was discarded.

4.2 Implementing Sentiment Analysis

After the collection of data was finished, sentiment had to be extracted from the tweets.
As described before, open-source sentiment analysis tools were used to perform the sentiment analysis. The tools had to accept text as input and output the sentiment as a string. Before these tools were used, an evaluation was performed to make sure that their classification of the tweets was reliable enough. Using more than three tools was originally planned, but some turned out to be too complex to adapt to tweets and were therefore left out of the evaluation.

The first tool was a sentiment analysis tool that was trained with a dataset of IMDB movie reviews [3]. The second tool was the sentiment analysis module of the Stanford NLP toolkit [6]. This tool was also trained with a dataset based on sentences from movie reviews [11]. The last tool that was evaluated was a self-programmed Multinomial Naive Bayes classifier that was trained with a corpus of 4597 hand-classified tweets [10] from which punctuation and uppercase characters had been removed. The tweets were tokenized on spaces. The classifier was programmed with the use of the Weka [4] API. Subsection 4.2.1 elaborates on this classifier.

The tools were programmed to determine the polarity of a tweet by tagging it as Positive, Negative or Neutral. This was tested on a sample of 201 hand-labeled tweets randomly selected from the total set of tweets. The distributions of the training set for the Naive Bayes classifier and of the test set can be viewed in Table 2.

Table 2. Distributions of the percentages of positive, negative and neutral tweets in the training and test set.

                             % pos   % neg   % neu
Training set (4597 tweets)   10.0    10.7    79.3
Test set (201 tweets)        10.0    11.4    78.6

The results of this evaluation are presented in Section 5. This evaluation led to the decision to exclusively use the self-programmed Naive Bayes classifier.

4.2.1 Classification with Naive Bayes

This subsection elaborates on the Naive Bayes classifier that was programmed. The training set had to be modified before the classifier could be trained with it. This was done by converting the tweets to word vectors, with a feature created for each word, by applying the StringToWordVector filter in Weka. Naive Bayes makes use of Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

This equation is the foundation for the classifier, because ultimately we would like to compute, given a tweet t, which of P(c_pos | t), P(c_neg | t) and P(c_neu | t) has the highest probability, where c represents the sentiment class positive, negative or neutral. This results in the following equation:

$$P(c_{class} \mid t) = \frac{P(t \mid c_{class})\,P(c_{class})}{P(t)}$$

P(c_class) can be computed by dividing the number of tweets of that class by the total number of tweets in the training set. P(t | c_class) can in turn be computed by splitting the tweet into words, computing P(w | c_class) for each word w in t, and multiplying these probabilities. P(w | c_class) is computed as the number of times word w occurs in c_class divided by the total number of times that word occurs in the training set. Training the classifier means computing P(c_class), and P(w | c_class) for each word w in the vocabulary, and storing these values so they can be used to classify by finding the maximum value of {P(c_pos | t), P(c_neg | t), P(c_neu | t)} for a tweet t. The class for which the value is the highest is the class assigned to the tweet.

This algorithm was implemented using the Weka API in Java. The program was designed to convert an input arff file¹ containing tweets to the counts of positive, negative and neutral tweets. Such a file was converted for each day in the dataset per company. This data was saved so that it could later be passed on to the machine learning algorithms.

Figures 2, 3 and 4 visualize some data of the positive sentiment and the stock price. The correlation coefficient ρ is given with each figure. As can be seen, the correlation coefficient is decent for the two single companies. However, when the data of all the companies is plotted together, the correlation coefficient is lower. But this concerns only the positive sentiment; it could be that the machine learning algorithms are able to discover a pattern together with the rest of the sentiment data.

Figure 2. Apple's positive sentiment (x-axis) plotted together with its stock price (y-axis). (ρ = 0.672)

Figure 3. Disney's positive sentiment (x-axis) plotted together with its stock price (y-axis). (ρ = 0.592)

Figure 4. Positive sentiment towards all companies (x-axis) plotted together with their stock prices (y-axis). (ρ = 0.2899)

¹ The file format that Weka uses; it is comparable to a Comma Separated Values (CSV) file.
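The classification rule of Section 4.2.1 can be illustrated with a minimal Python sketch. It is not the Weka/Java program used in this research. It assumes the common multinomial estimate for P(w | c) (occurrences of w in class c divided by all word occurrences in c) with add-one smoothing, works in log space to avoid underflow, and drops P(t) because it is the same for every class. The function names and the toy training tweets are hypothetical and only serve the illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data, loosely based on the example tweets in the
# introduction; the real classifier was trained on 4597 hand-labeled tweets.
TOY_TWEETS = [
    ("looks cool as expected", "pos"),
    ("love the new watch", "pos"),
    ("appstore down for all users", "neg"),
    ("wtf this is broken", "neg"),
    ("app updated with support for watch", "neu"),
    ("store opens today", "neu"),
]


def tokenize(tweet):
    """Lowercase the tweet and split it on whitespace."""
    return tweet.lower().split()


def train(labeled_tweets):
    """Estimate log P(c) and smoothed log P(w | c) from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_tweets:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)

    total_tweets = sum(class_counts.values())
    log_prior = {c: math.log(n / total_tweets) for c, n in class_counts.items()}

    log_likelihood = {}
    for c in class_counts:
        # Assumption: add-one (Laplace) smoothing over the vocabulary.
        denominator = sum(word_counts[c].values()) + len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / denominator) for w in vocabulary
        }
    return log_prior, log_likelihood


def classify(tweet, log_prior, log_likelihood):
    """Return the class with the highest log P(c) + sum of log P(w | c)."""
    scores = {}
    for c, prior in log_prior.items():
        score = prior
        for word in tokenize(tweet):
            # Assumption: words not seen during training are ignored.
            if word in log_likelihood[c]:
                score += log_likelihood[c][word]
        scores[c] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    model = train(TOY_TWEETS)
    for text in ("looks cool", "appstore down", "app updated"):
        print(text, "->", classify(text, *model))
```

Counting the predicted labels per company and per day would then give the positive, negative and neutral counts that serve as input for the experiments in Section 4.3.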

4.3 Predicting Stock Price Directions

Four machine learning algorithms were evaluated for predicting the stock price direction: random forests, neural networks, support vector machines and logistic regression. Weka was used for the implementations of these algorithms, and all of them were applied with their default settings in Weka.

For each company the sentiment data was available in the form of the counts of positive, negative and neutral tweets per day. These counts were normalized separately for each company to the range [0, 1] with the feature scaling formula:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

where x stands for the number of tweets. When, for instance, the counts of positive tweets for a company were normalized, max(x) would be the highest count of positive tweets for that company in the dataset and min(x) the lowest. This was done because some companies received significantly more tweets than others. With the sentiment data for each company on the same scale, better results are expected. An example of a single data instance containing all sentiment features can be viewed in Table 3.

The algorithms were also evaluated using the ratios of the sentiment as features instead of the normalized counts. The ratios of the positive, negative and neutral sentiment were computed by dividing the number of tweets of a specific class on a day by the total number of tweets on that day. This also resulted in values in the range [0, 1].

Table 3. Example of a single data instance containing all of the sentiment features with normalized data.

positive   negative   neutral   class
0.864406   0.538464   0.23360   UP

The machine learning algorithms were then evaluated on different combinations of features related to the sentiment. The main features were the positive, negative and neutral sentiment. The algorithms were tested with several combinations of these features.
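The two feature representations described above can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the code used in this research: the function names are hypothetical, and the handling of the edge cases (a company whose counts are all equal, or a day without tweets) is an assumption, since the text does not describe them.

```python
def min_max_normalize(counts):
    """Scale one company's daily tweet counts to [0, 1]: (x - min) / (max - min)."""
    lowest, highest = min(counts), max(counts)
    if highest == lowest:
        # Assumption: a constant series is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in counts]
    return [(x - lowest) / (highest - lowest) for x in counts]


def sentiment_ratios(positive, negative, neutral):
    """Divide each class count of one day by the total number of tweets that day."""
    total = positive + negative + neutral
    if total == 0:
        # Assumption: a day without tweets gets all-zero ratios.
        return (0.0, 0.0, 0.0)
    return (positive / total, negative / total, neutral / total)


if __name__ == "__main__":
    # Hypothetical daily counts of positive tweets for a single company.
    print(min_max_normalize([10, 20, 30]))
    # Hypothetical counts of positive, negative and neutral tweets on one day.
    print(sentiment_ratios(2, 1, 1))
```

Normalization is applied per company over all days, whereas the ratios are computed per day, so the ratios do not depend on the other days in the dataset.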
Experiments using the sentiment of earlier days were also conducted. These experiments were extended by adding extra features: the sentiment of the previous day, the sentiment of two days before, the number of tweets and the stock direction of the previous day. Two combinations of these features were experimented with. The results were documented as percentages of correctly classified instances and can be viewed in Section 5.

The algorithms were trained with the data gathered until April 20, 2015 and evaluated with the data from April 21, 2015 until April 30, 2015, so the split was roughly 75% training data and 25% test data. Because the data covered 33 trading days and 19 companies, the total number of data instances was 19 × 33 = 627, of which roughly 75% was used to train the algorithms. The distributions of the UP and DOWN classes can be viewed in Table 4.

Table 4. Distributions of the directions UP and DOWN in the training and test set.

                               % UP   % DOWN
Training set (456 instances)   49.1   50.9
Test set (171 instances)       50.9   49.1

5. RESULTS

This section presents the results of the experiments described in Section 4. Section 5.1 presents the results of the evaluation of the sentiment analysis tools. Section 5.2 presents the results of the experiments in which machine learning algorithms were used to try to predict stock price movements.

5.1 Sentiment Analysis

This section presents the results of the evaluation of the three proposed ways to classify tweets by their sentiment. This was done by letting the tools classify 201 hand-labeled tweets. The results of this evaluation are presented in Table 5.

Table 5. Percentage of correctly classified instances per analysis method.

Method                   Correctly classified
Python tool              51.2%
StanfordNLP              19.9%
Naive Bayes classifier   77.1%

From Table 5 it becomes clear that the tools that were not exclusively trained with sentiment-labeled tweets do not perform very well in classifying them.
An explanation for this could be that the use of language in tweets differs from the use of language in the data these tools were trained with. That is why the Naive Bayes classifier was chosen for the sentiment analysis in this research.

Table 6. Confusion matrix of the evaluation of the Naive Bayes classifier.

          classified as
class     pos   neg   neu
pos         3     1    16
neg         0     5    18
neu         3     8   147

Table 6 shows the confusion matrix of the Naive Bayes classifier. What stands out is that many of the tweets are incorrectly classified as neutral. It is nevertheless safe to state that the algorithm does recognize the difference between positivity and negativity: only one positive tweet was incorrectly classified as negative. However, a portion of the neutral tweets is classified as either positive or negative. Only 15% of the positive tweets and 21.7% of the negative tweets were classified correctly. An explanation for this is that the tweets in the training dataset were not labeled by the same person who labeled the test set, which means that sentiment could have been interpreted differently in the two sets. Better results could probably be expected if we had labeled our own training set for this research, but due to the limited time frame this was not done.

5.2 Prediction of Stock Price Directions

The results of applying the machine learning algorithms to the data can be viewed in Table 8, Table 9 and Table 10.

Table 7. Statistics of the gathered data of the twenty companies.

Company              # of tweets   % of total   # of DOWN   # of UP
Apple                     396624        25.63          18        15
Google                    217319        14.04          19        14
Exxon Mobil                 7532         0.49          17        16
Microsoft                  96286         6.22          13        20
Berkshire Hathaway            91         0.01           -         -
Wal-Mart                    6282         0.41          21        12
Johnson&Johnson             6889         0.45          16        17
Wells Fargo                 5517         0.36          13        20
General Electric          119085         7.7           17        16
Procter&Gamble            126997         8.21          20        13
Coca-cola                  10182         0.66          21        12
JPMorgan                    5834         0.38          14        19
Chevron                     6492         0.42          16        17
Verizon                    16386         1.06          11        22
Facebook                  230418        14.89          18        15
Pfizer                      2747         0.18          15        18
AT&T                      139173         8.99          16        17
Oracle                     16646         1.08          18        15
Bank of America            23081         1.49          18        15
Disney                    113768         7.35          15        18
Total                    1547349       100.0          316       311

Table 8. Percentages of correctly classified instances per machine learning algorithm with normalized data, for combinations of sentiment and a shift in days. The columns marked p stand for the cases in which only the positive sentiment was used as a feature, pn stands for the positivity and negativity, and pnn stands for the cases in which the positivity, negativity and neutrality were used.

                          same day           -1 day             -2 days            -3 days            -4 days
                          p     pn    pnn    p     pn    pnn    p     pn    pnn    p     pn    pnn    p     pn    pnn
Logistic Regression       54.6  54.6  51.5   42.1  43.4  46.7   65.5  64.9  62.6   53.3  53.3  53.3   54.9  54.9  50.4
Support Vector Machine    52.6  52.6  53.5   49.3  48.7  48.7   52.6  60.2  59     57.2  52.6  51.3   54.9  54.9  55.6
Random Forest             53.2  44.7  45.8   46.7  55.2  56.6   50.0  50.3  56.1   46.7  47.4  50.7   48.9  39.1  43.6
Neural Network            52.6  53.2  53.5   51.3  52    54.6   52.6  52.6  50.9   42.8  42.8  46.1   45.1  42.1  46.6

Table 9. Percentages of correctly classified instances per machine learning algorithm where the sentiment data were ratios, computed by dividing the number of tweets of a specific sentiment by the total number of tweets on that day. The meaning of p, pn and pnn is the same as in Table 8.

                          same day           -1 day             -2 days            -3 days            -4 days
                          p     pn    pnn    p     pn    pnn    p     pn    pnn    p     pn    pnn    p     pn    pnn
Logistic Regression       47.8  47.8  47.8   40.7  46.7  44.7   53.3  56.6  59.9   57.2  53.3  49.3   48.0  51.3  43.4
Support Vector Machine    47.4  47.4  50.7   53.3  42.8  40.8   49.3  51.3  49.3   57.2  57.2  57.2   52.6  53.3  42.1
Random Forest             47.4  48.8  49.3   45.4  55.9  44.1   51.3  52.0  50.7   49.3  50.7  48.0   44.1  38.8  40.8
Neural Network            55.0  56.0  51.2   53.3  54.6  54.6   57.9  57.9  59.2   44.1  48.7  48.0   40.8  39.5  39.5

Table 10. Results of extending the previous experiments by adding the sentiment (positive, negative and neutral) of one day earlier (S1) or of one and two days earlier (S2) as extra features. The stock direction (UP or DOWN) of the previous day and the number of tweets were also added as features in these experiments.

                          same day      -1 day        -2 days
                          S2    S1      S2    S1      S2    S1
Logistic Regression       63.2  48.1    57.9  61.4    62.3  59.6
Support Vector Machine    62.4  48.1    59.6  57      55.3  53.5
Random Forest             54.9  42.1    50    50.9    54.4  54.4
Neural Network            49.6  38.4    54.4  45.6    51.8  56.1

Table 8 contains the results where the sentiment data was normalized. Table 9 contains the results where the ratios of the sentiment were used, obtained by dividing the number of tweets of a sentiment by the number of all tweets on that day. Table 10 contains the results of the experiments in which, first, the number of tweets and the stock direction of the previous day were added as extra features. Second, these experiments were split into a case where the sentiment of the previous day was added as features (results in column S1) and a case where the sentiment of the previous day and of the day before that were added as features (results in column S2). For each algorithm the combination of data that resulted in the highest accuracy was highlighted.

Considering that a random algorithm should have 50%² accuracy in predicting the classes UP or DOWN, the results are not much better than that. Quite a few results are worse than random and most of them are around 50% accuracy. However, when predicting the classes with the Logistic Regression algorithm using only the positive sentiment of two days earlier, 65.5% of the test instances were predicted correctly, which is higher than all of the other results.

6. CONCLUSIONS AND FUTURE WORK

Looking at the results, research question 1 has a clear answer: the most effective way to classify tweets is to use a classifier that was trained with similar data, tweets in this case. A Naive Bayes classifier proved to be most effective. However, when we take the confusion matrix in Table 6 into account, its results were not that good either. Future research should use a classifier that is better at distinguishing positive or negative tweets from neutral tweets.

Research question 2 asks whether a specific combination of features gives better results for the machine learning algorithms.
From the results presented in this paper the conclusion can be drawn that adding the negative and neutral sentiment on top of the positive sentiment only provides better results for some machine learning algorithms. Adding even more features did seem to improve the overall accuracies, as can be seen in Table 10.

Research question 3 can only be answered with a maybe: some results seem considerably higher than 50%. However, it is not certain whether this was due to luck, as the random walk hypothesis [14] suggests, or whether there really is some predictive power in the results. Future research should identify which of these explanations is more likely.

With all these research questions combined, the answer to the main research question is that the stock price direction cannot simply be predicted from sentiment analysis on a large amount of social media posts. There were a few cases that yielded favorable results, but because these were not verified further, a clear conclusion cannot be drawn yet. Future research should gather more data on more companies over a longer period and analyze it with a more reliable sentiment analysis technique. If that leads to similar or better results, firmer conclusions can be drawn.

Another interesting approach would be to focus on a single company. Figures 2 and 3 in Section 4.2.1 show more structure than Figure 4 in Section 4.2.1. It could be that repeating the experiments described in this research leads to better results when only a single company is taken into account.

² Not exact, as can be seen in Table 4, but we assume 50% for the sake of simplicity.

7. REFERENCES

[1] S. Asur and B. A. Huberman. Predicting the future with social media. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, volume 1, pages 492-499. IEEE, 2010.
[2] J. Bollen, H. Mao, and X. Zeng. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1-8, 2011.
[3] Github. Sentiment analysis in python. https://github.com/vivekn/sentiment. Accessed: 2015-05-02.
[4] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data mining software: an update. ACM SIGKDD explorations newsletter, 11(1):10-18, 2009.
[5] M. Makrehchi, S. Shah, and W. Liao. Stock prediction using event-based sentiment analysis. In Proceedings - 2013 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2013, volume 1, pages 337-342, 2013.
[6] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55-60, 2014.
[7] J. R. Nofsinger. Social mood and financial economics. The Journal of Behavioral Finance, 6(3):144-160, 2005.
[8] A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In LREC, volume 10, pages 1320-1326, 2010.
[9] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and trends in information retrieval, 2(1-2):1-135, 2008.
[10] Sananalytics. Twitter sentiment corpus. http://www.sananalytics.com/lab/twitter-sentiment/. Accessed: 2015-05-06.
[11] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), volume 1631, page 1642. Citeseer, 2013.
[12] Theonlineinvestor. 20 largest u.s. companies by market capitalization. https://www.theonlineinvestor.com/large_caps. Accessed: 2015-03-17.
[13] Wikipedia. Efficient-market hypothesis. http://en.wikipedia.org/wiki/Efficient-market_hypothesis. Accessed: 2015-06-07.
[14] Wikipedia. Random walk hypothesis. http://en.wikipedia.org/wiki/random_walk_hypothesis. Accessed: 2015-06-04.
[15] Yahoo. Yahoo finance. http://finance.yahoo.com/. Accessed: 2015-05-0.
[16] Y. Yu, W. Duan, and Q. Cao. The impact of social and conventional media on firm equity value: A sentiment analysis approach. Decision Support Systems, 55(4):919-926, 2013.