The process of gathering and analyzing Twitter data to predict stock returns EC115. Economics




Purpose

Many Americans save for retirement through plans such as 401(k)s and IRAs, and these plans typically invest in mutual funds, which are groups of stocks. Sometimes these mutual funds lose money for prospective retirees. In 2008, on average, employees lost $10,000 of their retirement savings (Brandon, 2009). Investors seeking information about public entities traditionally gather the majority of their data from financial publications and from documents that a company files with the Securities and Exchange Commission. These sources typically contain financial data including revenues, earnings per share, price-earnings ratios, cash flows, dividend yields, product launches, and company management strategies (Rader, 2006). However, this purely financial approach to market prediction neglects the bullish or bearish sentiments of the public, which play a significant role in the movement of the market. Because investors depend so heavily on financial data, stock market predictions are generally inaccurate (Ferri, 2013). By 2008, CXO had collected and graded more than 5,000 predictions, and the accuracy of these predictions has stabilized at about 48 percent ever since (Figure 1).

Figure 1. Data collected on the accuracy of stock market predictions by the CXO Advisory Group.

Hypothesis

The experiment sought a correlation between the calculated social media sentiment value and the stock returns of the company. The hypothesis was that a positive social media sentiment would result in an increase in stock returns and a negative social media sentiment would result in a decrease in stock returns. The null hypothesis was that social media sentiment would have no impact on stock returns.

Background Literature Review

Social media is an outlet for instant news, which is valuable to investors since accurate news about a company is a reliable predictor of that company's stock value. In one study, principal components analysis was used to reduce the approximately 400 features extracted from social media to about 30 features, capturing about 25% of the variance. An ordinary least squares regression was then used to forecast stock price movements from the approximately 30 variables (Sisk, 2013). By conducting these statistical processes on data collected from social media, news metadata can inform short-term future price movements and volatility.

In a study by Fisher and Statman (2000), there was no significant relationship between the change in sentiment in one month and stock returns in the following month. For large market capitalization stocks, the relationship was negative for all sentiment groups but never statistically significant. The relationship between the change in sentiment during a month and the following month's stock returns was positive for the small market capitalization stocks of the Center for Research in Security Prices (CRSP) 9-10, but that relationship also was not statistically significant.

Economists at the University of Rochester, Cornell University, and the University of Texas concluded that individual investor sentiment predicts future stock returns, and that investor sentiment is independent of data involving past returns or past volume (Kaniel et al., 2004). Furthermore, the trading of these individual investors predicts weekly returns for stocks of all market capitalizations, while data involving past returns are only predictive for stocks with a small market capitalization over the same time period. These findings show that the sentiment of the individual investor is highly indicative of the movement of stock price.

When the propensity of investors to speculate is high and when the stock is volatile, sentiment is a significant predictor of the fluctuations in the stock price (Baker and Wurgler, 2007). However, when a stock is stable and investors are calm, sentiment has less bearing on the movement of stock price.

Using Facebook's Gross National Happiness Index (FGNHI), which determined the average optimism and pessimism of its users in a particular region, researchers found a significant positive relation between sentiment and contemporaneous stock market returns, showing that optimistic sentiment is related to gains in the market index and pessimistic sentiment is related to losses in the market index (Siganos et al., 2014). However, the procedure of using public sentiment as an indicator of the market may be flawed, since the relation between sentiment and stock returns is subject to reverse causality. For example, when investors profit or lose from the market, they may express those sentiments through social media. These specific sentiments are not indicative of what the movement of the market will be tomorrow.

Bollen, Mao, and Zeng (2011) investigated whether measurements of collective mood states derived from large-scale Twitter feeds are correlated with the value of the Dow Jones Industrial Average (DJIA) over time. They used OpinionFinder to measure positive versus negative mood and Google Profile of Mood States (GPOMS), which measures mood in terms of 6 dimensions (Calm, Alert, Sure, Vital, Kind, and Happy). Through this process, an accuracy of 86.7% was achieved in predicting the daily up and down changes in the closing values of the DJIA, and the Mean Average Percentage Error (MAPE) was reduced by more than 6%.

Research Methodology

Six companies were chosen for the calculations: Kirkland (KIRK), Zynga (ZNGA), Aramark (ARMK), Foot Locker (FL), Verizon (VZ), and Disney (DIS). These companies were chosen from the Russell Index Member Lists. Kirkland and Zynga were chosen from the Russell Microcap List (Russell Investments, 2011), which lists the stocks of highly successful small companies; Aramark and Foot Locker were chosen from the Russell Midcap List (Russell Investments, 2011), which lists the stocks of highly successful mid-size companies; Verizon and Disney were chosen from the Russell 1000 Index List, which lists the stocks of highly successful large companies. These companies were chosen for their varying product fields, their importance to consumer markets, and their distinctive names.

A Python program, entitled StockPile, was developed. The program asked for the name of the company whose stock price was to be predicted. Using the Twitter REST API, all the Tweets from the last eight days mentioning the name of the specified company were collected. As the Twitter REST API limits the number of queries possible per fifteen-minute interval, multiple API accounts were created and cycled through each time one hit the limit. Python's Natural Language Toolkit (NLTK) associated a sentiment and polarity value with each Tweet, and the Twitter API associated the number of favorites, the number of retweets, and the number of followers of the author with each Tweet. However, the number-of-followers metric was removed after its statistical insignificance was noted. The Tweet, with the corresponding data, was stored as a TweetObj object as shown in Figure 2. The metrics of the TweetObjs were then related to the relevant stock prices of that company with a two-day time shift into the future. The TweetObjs were saved in a Pickle file. This process was repeated for each of the tested companies: Kirkland (KIRK), Zynga (ZNGA), Aramark (ARMK), Foot Locker (FL), Verizon (VZ), and Disney (DIS).

Figure 2. Class Diagram for the TweetObj class.

A least squares multiple regression was conducted on the metrics of the TweetObjs and the stock prices with a certain time shift. The most accurate time shift was determined by conducting a least squares multiple regression for each time shift from zero days to one week. The regression model with the lowest p value and the greatest correlation coefficient determined the delay with which sentiment about a company on Twitter affects the stock price of that company. The predictive model with the most accurate time shift provides an equation to predict the stock price in the following form:

$$\left[\sum_{i=1}^{n} C_i F_i\right] + K = P_t$$

where $n$ is the number of features, $C_i$ is the coefficient of the $i$-th feature, $F_i$ is the value of that feature, $K$ is a constant, and $P_t$ is the stock price after a certain time delay.

Figure 3. Data Flow Diagram for the StockPile algorithm.

After the algorithms were developed, a testing algorithm, encapsulated in testing.py, was run on each of the following three days after the markets closed at 3:00 PM EST. This program collected Tweets via the Twitter REST API for a period of one day and ran the values through the predictive equation to determine a predicted stock price associated with each Tweet. These predicted prices were then compared to the actual stock price at the corresponding time. An R² score was then calculated comparing the predicted values to the actual values.
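The fitting and time-shift search described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the StockPile code: the function names, the daily (rather than per-Tweet) feature rows, and the synthetic data are all hypothetical, and the sketch ranks time shifts by in-sample R² as a simplification, whereas the paper ranks them by p value and correlation coefficient.

```python
import random

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit_least_squares(rows, targets):
    """Fit targets ~ sum(C_i * F_i) + K via the normal equations; return (C, K)."""
    X = [list(row) + [1.0] for row in rows]  # extra column of ones for K
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * y for x, y in zip(X, targets)) for i in range(k)]
    solution = solve_linear_system(XtX, Xty)
    return solution[:-1], solution[-1]

def predict(row, coefs, intercept):
    """Evaluate sum(C_i * F_i) + K for one feature row."""
    return sum(c * f for c, f in zip(coefs, row)) + intercept

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def best_time_shift(features, prices, max_shift=7):
    """Pair day-t features with day-(t + shift) prices; keep the best-fitting shift."""
    best_shift, best_score = None, float("-inf")
    for shift in range(max_shift + 1):
        X = features[: len(features) - shift]
        y = prices[shift:]
        coefs, k = fit_least_squares(X, y)
        score = r_squared(y, [predict(row, coefs, k) for row in X])
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift, best_score

# Synthetic data (assumption): prices follow the features from two days earlier.
rng = random.Random(0)
features = [[rng.uniform(-1, 1), rng.uniform(0, 50), rng.uniform(0, 20)]
            for _ in range(60)]  # sentiment, favorites, retweets
true_shift = 2
prices = [30.0] * true_shift + [
    30.0 + 2.0 * s + 0.01 * f - 0.05 * r for s, f, r in features[:-true_shift]
]
shift, score = best_time_shift(features, prices)
print(shift, round(score, 6))
```

On this synthetic series the search should recover the two-day shift that was built into the data. A high in-sample score says nothing about out-of-sample accuracy, which is why the experiment scores the model separately on Tweets collected after fitting.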

Figure 4. Tweets referencing a certain company are collected using the Twitter Streaming API. Each terminal window is collecting Tweets about a certain company.

Figure 5. The output of the computer program that collects the Tweets and associates them with a stock price. The output shows each Tweet's ID, the properties of that Tweet, and the stock price associated with that Tweet. The last line of the output shows the coefficients and the constant for the predictive model. The coefficients are in the brackets. The companies being run in the terminals, clockwise from the top left terminal to the bottom left terminal, are Kirkland, Zynga, Aramark, Foot Locker, Verizon, and Disney.

Figure 6. Using the predictive model provided by the previous program, this program tests the accuracy of the predictive model. The R² value is shown at the bottom of the terminal window. The companies being run in the terminals, clockwise from the top left terminal to the bottom left terminal, are Kirkland, Zynga, Aramark, Foot Locker, Verizon, and Disney.

Results

Kirkland (KIRK)
Average volume of Tweets per day: 3442
Equation: 5.16412696e-01a + 2.07975484e-05b + 9.28207328e-02c + 24.4283273762
Where a is the sentiment, b is the number of favorites, and c is the number of retweets

Day      R² Value
Day 1    1.70142808328
Day 2    0.958742568133
Day 3    36.3152009966

Figure 7. Data for Kirkland

Zynga (ZNGA)
Average volume of Tweets per day: 938
Equation: 0.05174558a - 0.00014439b - 0.00452934c + 2.56291473118
Where a is the sentiment, b is the number of favorites, and c is the number of retweets

Day      R² Value
Day 1    0.0866621636987
Day 2    0.924072874293
Day 3    0.413653843975

Figure 8. Data for Zynga

Aramark (ARMK)
Average volume of Tweets per day: 179
Equation: 0.20719973a + 0.16352239b - 0.01903519c + 31.920903806
Where a is the sentiment, b is the number of favorites, and c is the number of retweets

Day      R² Value
Day 1    0.0500613156024
Day 2    19.2256210256
Day 3    7.89317668631

Figure 9. Data for Aramark

Foot Locker (FL)
Average volume of Tweets per day: 1103
Equation: 0.07723532a - 0.08410707b + 0.0054169c + 56.7938719015
Where a is the sentiment, b is the number of favorites, and c is the number of retweets

Day      R² Value
Day 1    331.704823394
Day 2    300.694360902
Day 3    178.346760997

Figure 10. Data for Foot Locker

Verizon (VZ)
Average volume of Tweets per day: 35251
Equation: 1.37212215e-01a - 8.71009324e-05b - 2.69172043e-03c + 47.203529803
Where a is the sentiment, b is the number of favorites, and c is the number of retweets

Day      R² Value
Day 1    0.981403546164
Day 2    14.3259053938
Day 3    2.06868027039

Figure 11. Data for Verizon

Disney (DIS)
Average volume of Tweets per day: 358214
Equation: 2.90521402e-01a + 3.97562468e-06b + 4.45964933e-02c + 94.8674834768
Where a is the sentiment, b is the number of favorites, and c is the number of retweets

Day      R² Value
Day 1    1242.50846904
Day 2    7.7987392641
Day 3    5.52669624237

Figure 12. Data for Disney

Conclusions

The R² value is the coefficient of determination: the proportion of the variance in the dependent variable that is predictable from the independent variables. The higher the R² value, the more closely the datasets are correlated, with an upper bound of 1. The lower the R² value, the less correlated the datasets. The datasets compared in this experiment are the predicted stock prices, produced by the model from Tweets and their related features, and the actual stock prices. All but one of the trials in the experiment had a negative coefficient of determination. A negative value means the model predicted the stock price less accurately than simply predicting the mean of the actual prices would have.

The Efficient Market Hypothesis states that information relevant to a stock's price will immediately affect that price. Since social media was highly uncorrelated with the stock price according to the values of the coefficient of determination, the Efficient Market Hypothesis suggests that information on social media is not relevant to the fluctuation in stock price. The results of this experiment are also consistent with stock movement following a Markov chain. In a Markov chain, the next state depends only on the current state and not on the sequence of states that preceded it. An extension of this type of movement is a random walk, which is a succession of random steps. In the case of stocks, each step can go either up or down, and these steps represent the randomness of fluctuations in the stock market.

Nobel Prize-winning economist Gene Fama (1965) found that empirical evidence provides strong support for the random walk model. He notes that economists must show that their models can predict prices better than simply choosing randomly between buy and sell. As Figure 1 showed, the average accuracy of predictive models is only 48 percent, less than the fifty-fifty chance of choosing randomly. The results of this experiment corroborate his findings. The major finding of this experiment was that social media is irrelevant to determining stock returns.
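Because R² is often read as a squared correlation, a negative value can look impossible. The sketch below shows how it arises when predictions are scored against actual values the model was not fitted to. The prices are made-up numbers chosen for easy arithmetic, not data from the experiment, and `r_squared` is an illustrative helper name.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # model error
    ss_tot = sum((a - mean) ** 2 for a in actual)                  # error of the mean
    return 1.0 - ss_res / ss_tot

actual = [10.0, 11.0, 12.0, 13.0]      # hypothetical closing prices
perfect = list(actual)                 # predictions that match exactly
mean_only = [11.5, 11.5, 11.5, 11.5]   # always predict the mean of the actual prices
off_target = [20.0, 20.0, 20.0, 20.0]  # predictions far from every actual price

print(r_squared(actual, perfect))      # 1.0
print(r_squared(actual, mean_only))    # 0.0
print(r_squared(actual, off_target))   # far below zero
```

The score is 1 for a perfect prediction, 0 for a model no better than the mean, and has no lower bound. Large negative magnitudes therefore indicate predicted prices that sit far from the actual prices relative to how much the actual prices themselves vary.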
Further research seeking a correlation between social media and stock returns may include gathering and testing data over longer periods of time, using more sensitive natural language analysis tools, and calculating based on a time shift, on the assumption that sentiment data from previous days would affect future days.

Bibliography

Baker, M. and Wurgler, J. (2007). Investor Sentiment in the Stock Market.

Bollen, J., Mao, H. and Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), pp. 1-8.

Fama, E. (1965). Random Walks in Stock Market Prices. Financial Analysts Journal, 21(5), pp. 55-59.

Brandon, E. (2009, February 12). How Did Your 401(k) Really Stack Up in 2008? [online] US News. Available at: http://money.usnews.com/money/retirement/articles/2009/02/12/how did your 401k reallystack up in 2008

Ferri, R. (2013). It's Official! Gurus Can't Accurately Predict Markets. [online] Forbes. Available at: http://www.forbes.com/sites/rickferri/2013/01/10/ts official gurus cant accurately predictmarkets/ [Accessed 28 Dec. 2014].

Fisher, K. and Statman, M. (2000). Investor Sentiment and Stock Returns. Financial Analysts Journal, 56(2), pp. 16-23.

Kaniel, R., Saar, G. and Titman, S. (2004). Individual Investor Sentiment and Stock Returns.

Li, X., Xie, H., Chen, L., Wang, J. and Deng, X. (2014). News impact on stock price return via sentiment analysis. Knowledge-Based Systems, 69, pp. 14-23.

Minh, D. (2013). Sentiment and Influence Analysis of Twitter Tweets. US20130103667 A1.

Paniagua, J. and Sapena, J. (2014). Business performance and social media: Love or hate? Business Horizons, 57(6), pp. 719-728.

Rader, J. (2006). Method and system for conducting sentiment analysis for securities research. US20060242040 A1.

Russell Investments (2011). Russell 1000 Index Member List. Washington: Russell Investments.

Russell Investments (2011). Russell Microcap Index Membership List. Washington: Russell Investments.

Russell Investments (2011). Russell Midcap Index Member List. Washington: Russell Investments.

Siganos, A., Vagenas-Nanos, E. and Verwijmeren, P. (2014). Facebook's daily sentiment and international stock markets. Journal of Economic Behavior & Organization, 107, pp. 730-743.

Sisk, J. (2013). Methods and systems for predicting market behavior based on news and sentiment analysis. US20130138577 A1.

Acknowledgements

We would like to acknowledge and thank several people for providing us with an abundance of support and assistance. We would like to express our gratitude to Dr. Alex Tabarrok of George Mason University, Professor Francis DiTraglia of the University of Pennsylvania, Dr. Andrew Lo of the Massachusetts Institute of Technology, and William Li of the Massachusetts Institute of Technology for their highly supportive feedback on the economics and programming sides of the project. We would also like to thank Dr. Csaba Gabor and Mr. Phillip Ero of the Thomas Jefferson High School for Science and Technology for helping us understand our statistical analyses. We are grateful to our lab director, Dr. Dan Burden of the Thomas Jefferson High School for Science and Technology, for patiently and proactively ensuring that we had everything we needed to conduct our experiment. Finally, and most importantly, we want to thank our families for their unparalleled support.